Sometimes you may need to download some publicly available data in
fastq format in order to re-analyze it. There are multiple ways to do
it: a) use the ENA website; b) use fastq-dump in the terminal.
Before starting, I will introduce the structure of a GEO repository.
Most non-patient data are deposited in the Gene Expression Omnibus (GEO) of the NCBI. In this portal you can find a specific work by article title, author names, or GEO ID (accession number).
Let’s use the example of Gregoricchio et al. (Nuc. Acids Res., 2022). This work is
available under the accession number GSE172088.
The page that opens (snapshot hereafter) contains several pieces of
information. This specific project contains multiple “sub-folders”,
called “SuperSeries”, to organize the data depending on the data type
(ATAC-seq, ChIP-seq, RNA-seq, 4C-seq) and/or the cell line used. At the
end of the page you can find the “BioProject number” or “project
number”; in this case: PRJNA721971.
Using the project number you can inspect the deposited data in more
detail: search for PRJNA721971 on the ENA browser website.
On the corresponding page you will find all the
SuperSeries available for the specific project. To make this tutorial
faster, let's imagine that we want to download 763’s 4C-seq data (very
small fastq), corresponding to the project number PRJNA841832
(see snapshot).
A page will open showing all the fastq files available for that specific series:
From this page you can select the fastq files of interest and then click on Download selected files (or Download all if you need all of them); a .zip file containing the fastq files will be downloaded:
In this specific case the 4C-seq data are single-end (a single file per sample). Paired-end data may also be available, in which case the table will look like this:
fastq-dump
As an alternative to the manual download from the ENA browser, the
fastq-dump tool may be quite handy when you need a large number of
fastq files. It allows you to download the fastq files directly on the
server, and the only input required is the SRR number (the same name as
the fastq files).
The easiest way to collect these numbers is to download the metadata table from the ENA browser by clicking on Download report: TSV:
In this table you can find several pieces of information, among them the SRR numbers (run_accession column):
Now you can use the SRR numbers to download the fastq files.
fastq-dump installation
There are two ways to get access to the tools: (A) direct download; (B) using a conda environment.
(A) Direct download: download the Ubuntu Linux 64 bit architecture .tar.gz file from the SRA toolkit GitHub.
Place the file in your favorite folder and extract it:
tar -xzf /path/to/your/folder/sratoolkit.3.0.2-ubuntu64.tar.gz
fastq-dump is now ready to go; you just need to indicate
the path to the tool when you use it. For instance:
/path/to/your/folder/sratoolkit.3.0.2-ubuntu64/bin/fastq-dump --help
##
## Usage:
## /home/s.gregoricchio/sratoolkit.3.0.2-ubuntu64/bin/fastq-dump [options] <path> [<path>...]
## /home/s.gregoricchio/sratoolkit.3.0.2-ubuntu64/bin/fastq-dump [options] <accession>
##
## INPUT
## -A|--accession <accession> Replaces accession derived from <path> in
## filename(s) and deflines (only for single
## table dump)
## --table <table-name> Table name within cSRA object, default is
## "SEQUENCE"
##
## PROCESSING
##
## Read Splitting Sequence data may be used in raw form or
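If you do not want to type the full path every time, the bin folder of the extracted toolkit can be added to the PATH of the current terminal session. The following is only a minimal sketch: it assumes the same placeholder folder used above, and SRA_BIN is just an illustrative variable name.

```shell
## folder containing the SRA toolkit executables (placeholder path: adapt it to your system)
SRA_BIN="/path/to/your/folder/sratoolkit.3.0.2-ubuntu64/bin"

## prepend the folder to PATH (valid for the current terminal session only)
export PATH="${SRA_BIN}:${PATH}"

## check whether the shell can now find fastq-dump
if command -v fastq-dump > /dev/null
then
  echo "fastq-dump found at: $(command -v fastq-dump)"
else
  echo "fastq-dump not found: check the path stored in SRA_BIN"
fi
```

If you use bash, appending the export line to your ~/.bashrc should make the change persistent across sessions.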
(B) Alternatively, you can build a conda environment dedicated to the SRA toolkit. Unfortunately, for reasons unknown to human beings, our (server) conda does not install the latest version of the SRA-tools (even in an empty environment), and with older versions the connection to the SRA server will fail.
Therefore, to install a recent version we will use mamba, which works similarly to conda.
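## create an empty environment that will host the SRA toolkit
conda create -n SRA
## create a second environment for mamba, and install mamba in it
conda create -n mamba
conda activate mamba
conda install -c conda-forge mamba
## use mamba to install the SRA toolkit into the SRA environment
mamba install -c conda-forge -c anaconda -c bioconda -c defaults -n SRA 'sra-tools==3.0.0'
## switch to the SRA environment
conda activate SRA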
fastq-dump should now be available within the SRA environment:
fastq-dump --help
fastq-dump for downloads
To download files, just type the following commands followed by the SRR numbers:
## stand-alone version
/path/to/sratoolkit.3.0.2-ubuntu64/bin/fastq-dump --split-3 --gzip -O /output/download/folder/ SRR19371735 SRR19371736 SRR19371737 SRR19371738
## conda version
conda activate SRA
fastq-dump --split-3 --gzip -O /output/download/folder/ SRR19371735 SRR19371736 SRR19371737 SRR19371738
## 'close' the SRA conda environment
conda deactivate
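When the list of SRR numbers is long, it can be convenient to process one run at a time and to skip the runs that are already in the output folder, for instance when a download session was interrupted. The following is only a sketch of this idea: the variable names (SRR_LIST, OUT_DIR, DRY_RUN), the temporary folder and the empty mock file are assumptions made to keep the example self-contained. With DRY_RUN=1 the commands are printed but not executed.

```shell
## SRR numbers to download (two of the runs used in the example above)
SRR_LIST="SRR19371735 SRR19371736"
## output folder (a temporary folder here, only for illustration)
OUT_DIR=$(mktemp -d)
## DRY_RUN=1 only prints the commands; set DRY_RUN=0 to really run fastq-dump
DRY_RUN=1

## pretend that one run was already downloaded, to show the skip
touch "${OUT_DIR}/SRR19371735.fastq.gz"

SKIPPED=""
REQUESTED=""
for SRR in $SRR_LIST
do
  ## --split-3 writes SRR.fastq.gz (single-end) or SRR_1/SRR_2.fastq.gz (paired-end)
  if [ -e "${OUT_DIR}/${SRR}.fastq.gz" ] || [ -e "${OUT_DIR}/${SRR}_1.fastq.gz" ]
  then
    SKIPPED="${SKIPPED} ${SRR}"
  else
    REQUESTED="${REQUESTED} ${SRR}"
    if [ "$DRY_RUN" -eq 1 ]
    then
      echo "fastq-dump --split-3 --gzip -O ${OUT_DIR} ${SRR}"
    else
      fastq-dump --split-3 --gzip -O "${OUT_DIR}" "${SRR}"
    fi
  fi
done

echo "already present:${SKIPPED}"
echo "requested:${REQUESTED}"
```

Note that this check only looks at file names: a truncated file left behind by an interrupted download would also be skipped, so it is safer to delete incomplete files before re-running the loop.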
Alternatively, you can collect the SRR numbers directly from the .tsv/.txt file that you previously downloaded:
## save SRR numbers in a variable (skip the header line, keep the 4th column)
FASTQ=$(tail -n +2 filereport_read_run_PRJNAXXXXXXX_tsv.txt | cut -f 4)
## if you want to check the SRR numbers collected type:
echo $FASTQ
## run fastq-dump
fastq-dump --split-3 --gzip -O /output/folder/ $FASTQ
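The cut -f 4 command above assumes that run_accession is the fourth column of the report, which may depend on the columns selected in the ENA browser. As a possible alternative, the column can be located by its header name with awk. The sketch below builds a small mock report only to be self-contained: its column layout and sample names are invented for illustration, and in practice REPORT would point to your downloaded file.

```shell
## build a small mock report (hypothetical columns), only to make the example self-contained
REPORT=$(mktemp)
printf 'study_accession\tsample_accession\trun_accession\tfastq_ftp\n' > "$REPORT"
printf 'PRJNAXXXXXXX\tSAMPLE_A\tSRR19371735\tplaceholder\n' >> "$REPORT"
printf 'PRJNAXXXXXXX\tSAMPLE_B\tSRR19371736\tplaceholder\n' >> "$REPORT"

## find the position of the run_accession column in the header line,
## then print that column for all the following lines
FASTQ=$(awk -F '\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "run_accession") col = i; next }
  col     { print $col }
' "$REPORT")

## check the SRR numbers collected
echo $FASTQ
```

If the header does not contain a run_accession column, FASTQ stays empty, which is easier to notice than a wrong column silently passed to fastq-dump.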
If you are downloading paired-end data, you may have noticed that in the
output folder there are two files for each SRR number, such as
SRR19371736_1.fastq.gz and SRR19371736_2.fastq.gz. These correspond to
the read1 and read2 of the paired-end sequencing.
However, if you want to align the data with the DNA-mapping pipeline, the
fastq.gz file names need to carry the suffixes '_R1' and '_R2' right
before .fastq.gz. Therefore, we need to rename
*_1.fastq.gz/*_2.fastq.gz to *_R1.fastq.gz/*_R2.fastq.gz. To do that you
can run the following loop (it will affect only paired-end data):
## move to the directory where you downloaded the fastq files
cd /output/download/folder/
## get the SRR numbers of paired data
PAIRED_NUMBERS=$(ls *_1.fastq.gz | sed 's/_1\.fastq\.gz$//')
## run the renaming loop
for SRR in $PAIRED_NUMBERS
do
  mv "${SRR}_1.fastq.gz" "${SRR}_R1.fastq.gz"
  mv "${SRR}_2.fastq.gz" "${SRR}_R2.fastq.gz"
done
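The loop above assumes that every *_1.fastq.gz file has its *_2.fastq.gz mate. A slightly more defensive variant could check for the mate before renaming, so that an incomplete pair is reported instead of being half-renamed. The following is only a sketch: DEMO_DIR and the empty mock files are assumptions used to make the example runnable on its own, and in practice you would loop over your real download folder.

```shell
## temporary folder with empty mock files, only to make the example self-contained
DEMO_DIR=$(mktemp -d)
## single-end run: must be left untouched
touch "${DEMO_DIR}/SRR19371735.fastq.gz"
## complete paired-end run
touch "${DEMO_DIR}/SRR19371736_1.fastq.gz" "${DEMO_DIR}/SRR19371736_2.fastq.gz"
## read1 without its read2 (e.g. an interrupted download)
touch "${DEMO_DIR}/SRR19371738_1.fastq.gz"

for READ1 in "${DEMO_DIR}"/*_1.fastq.gz
do
  ## skip the literal pattern when no file matches
  [ -e "$READ1" ] || continue
  ## strip the suffix to get the path prefix of the run
  PREFIX="${READ1%_1.fastq.gz}"
  if [ -e "${PREFIX}_2.fastq.gz" ]
  then
    mv "${PREFIX}_1.fastq.gz" "${PREFIX}_R1.fastq.gz"
    mv "${PREFIX}_2.fastq.gz" "${PREFIX}_R2.fastq.gz"
  else
    echo "warning: no read2 found for $(basename "$PREFIX"), files not renamed"
  fi
done

## show the result
ls "$DEMO_DIR"
```

Here the pattern is expanded directly by the shell instead of going through ls, which also keeps the loop safe for folder names containing spaces.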