1 Introduction

Sometimes you may need to download some publicly available data in fastq format in order to re-analyze it. There are multiple ways to do it: a) use the ENA website; b) use fastq-dump on the terminal.

Before starting, I will introduce the structure of a GEO repository.

1.1 GEO structure

Most non-patient data are deposited in the Gene Expression Omnibus (GEO) of the NCBI. In this portal you can access a specific work by article title, author names, or the GEO ID (accession number).

Let’s use the example of Gregoricchio et al. (Nucleic Acids Res., 2022). This work is available under the accession number GSE172088.

The page that opens (snapshot hereafter) contains several pieces of information. This specific project contains multiple “sub-folders”, called “SuperSeries”, that organize the data by data type (ATAC-seq, ChIP-seq, RNA-seq, 4C-seq) and/or by the cell line used. At the bottom of the page you can find the “BioProject number” or “project number”; in this case: PRJNA721971.

1.2 ENA browser

Using the project number you can inspect the deposited data in more detail. To do so, search for PRJNA721971 on the ENA browser website.

On the corresponding page you will find all the SuperSeries available for the specific project. To make this tutorial faster, let's imagine that we want to download 763’s 4C-seq data (very small fastq files) corresponding to the project number PRJNA841832 (see snapshot).

A page will open showing all the fastq files available for that specific series:

2 Fastq files download

2.1 From ENA browser

From the previous page it is possible to select the fastq files of interest and then click on Download selected files (or Download all if you need all of them); a .zip file containing the fastq files will be downloaded:

In this specific case the 4C-seq data are single-end (only a single file per sample). Paired-end data may also be available, in which case the table will look like this:

2.2 Using `fastq-dump`

2.2.1 Obtain SRR numbers

As an alternative to the manual download from the ENA browser, the fastq-dump tool can be quite handy when a large number of fastq files is needed. It downloads the fastq files directly onto the server, and the only input required is the SRR number (the same name as the fastq files).

The easiest way to collect these numbers is to download the metadata table from the ENA browser by clicking on Download report: TSV:

This table contains several pieces of information, among them the SRR numbers (run_accession column):

Now you can use the SRR numbers to download the fastq files.

2.2.2 `fastq-dump` installation

There are two ways to get access to the tools: (A) direct download; (B) using a conda environment.

2.2.2.1 Method A >>> Stand-alone version

Download the .tar.gz file for the Ubuntu Linux 64 bit architecture from the SRA toolkit GitHub page.

Place the file in your favorite folder and extract it:

tar -xzf /path/to/your/folder/sratoolkit.3.0.2-ubuntu64.tar.gz

fastq-dump is now ready to go; you just need to indicate the path to the tool whenever you use it. For instance:

/path/to/your/folder/sratoolkit.3.0.2-ubuntu64/bin/fastq-dump --help

## 
## Usage:
##   /home/s.gregoricchio/sratoolkit.3.0.2-ubuntu64/bin/fastq-dump [options] <path> [<path>...]
##   /home/s.gregoricchio/sratoolkit.3.0.2-ubuntu64/bin/fastq-dump [options] <accession>
## 
## INPUT
##   -A|--accession <accession>       Replaces accession derived from <path> in 
##                                    filename(s) and deflines (only for single 
##                                    table dump) 
##   --table <table-name>             Table name within cSRA object, default is 
##                                    "SEQUENCE" 
## 
## PROCESSING
## 
## Read Splitting                     Sequence data may be used in raw form or
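
To avoid typing the full path every time, the bin folder can be added to the PATH of the current shell session. This is only a minimal sketch: it assumes the archive was extracted to the same placeholder folder used above, and the variable name SRA_BIN is purely illustrative.

```shell
## folder containing the SRA toolkit executables (placeholder path, adapt it)
SRA_BIN="/path/to/your/folder/sratoolkit.3.0.2-ubuntu64/bin"

## prepend the folder to PATH for the current session only
export PATH="${SRA_BIN}:${PATH}"

## check which fastq-dump would be used (prints a note if none is found)
command -v fastq-dump || echo "fastq-dump not found in PATH"
```

The change lasts only until the terminal is closed; adding the export line to your ~/.bashrc would make it permanent.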

2.2.2.2 Method B >>> conda environment

Alternatively, you can build a conda environment dedicated to the SRA toolkit. Unfortunately, for reasons unknown to human beings, our (server) conda does not install the latest version of the SRA-tools (even in an empty environment), and with older versions the connection to the SRA server will fail.

Therefore, to install the latest version we will use mamba, which works similarly to conda.

Create an environment for SRA-tools: conda create -n SRA
Create an environment for mamba: conda create -n mamba
Activate the mamba environment: conda activate mamba
Now install mamba (more info here): conda install -c conda-forge mamba
Within the mamba environment, install the latest version of SRA-tools (to check the latest version available you can consult this page): mamba install -c conda-forge -c anaconda -c bioconda -c defaults -n SRA 'sra-tools==3.0.0'
Activate the SRA environment: conda activate SRA
Now fastq-dump should be available within the SRA environment: fastq-dump --help

2.2.3 Run `fastq-dump` for downloads

To download files just type the following commands followed by the SRR numbers:

## stand-alone version
/path/to/sratoolkit.3.0.2-ubuntu64/bin/fastq-dump --split-3 --gzip -O /output/download/folder/ SRR19371735 SRR19371736 SRR19371737 SRR19371738

## conda version
conda activate SRA
fastq-dump --split-3 --gzip -O /output/download/folder/ SRR19371735 SRR19371736 SRR19371737 SRR19371738

## 'close' the SRA conda environment
conda deactivate
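
When a long list of SRR numbers is downloaded in several sessions, it may help to skip the runs whose files are already in the output folder. The following is a hedged sketch, not part of the SRA toolkit: the function name download_missing and the FASTQ_DUMP variable are hypothetical, and the check relies on the file names that --split-3 --gzip produces (SRR.fastq.gz for single-end, SRR_1.fastq.gz for paired-end).

```shell
## command used for the download; set FASTQ_DUMP to the full path
## of the stand-alone binary if fastq-dump is not in your PATH
FASTQ_DUMP="${FASTQ_DUMP:-fastq-dump}"

## download_missing <output_folder> <SRR> [<SRR> ...]
## downloads one run at a time, skipping runs that already have a fastq.gz
download_missing() {
  outdir="$1"
  shift
  for srr in "$@"
  do
    if [ -e "${outdir}/${srr}.fastq.gz" ] || [ -e "${outdir}/${srr}_1.fastq.gz" ]
    then
      echo "skipping ${srr}: already downloaded"
    else
      "${FASTQ_DUMP}" --split-3 --gzip -O "${outdir}" "${srr}"
    fi
  done
}
```

It could then be called as `download_missing /output/download/folder/ SRR19371735 SRR19371736`. Note that a partially written file left by an interrupted download would also be skipped, so such files should be deleted before re-running.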

Alternatively, you can collect the SRR numbers directly from the .tsv/.txt file that you previously downloaded:

## save SRR numbers in a variable (skip the header line, keep the 4th column)
FASTQ=$(tail -n +2 filereport_read_run_PRJNAXXXXXXX_tsv.txt | cut -f 4)

## if you want to check the SRR numbers collected type:
echo $FASTQ

## run fastq-dump
fastq-dump --split-3 --gzip -O /output/folder/ $FASTQ
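
The `cut -f 4` above assumes that run_accession is the fourth column of the report, which may not hold if a different set of columns was selected in the ENA browser. A possible alternative, sketched here under the assumption that the report is tab-separated with a header line, is to look the column up by name (the function name get_run_accessions is hypothetical):

```shell
## get_run_accessions <report.tsv>
## prints the values of the 'run_accession' column, wherever it is located
get_run_accessions() {
  awk -F '\t' '
    # header line: find the index of the run_accession column
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "run_accession") col = i; next }
    # data lines: print the accession if the column was found and is not empty
    col && $col != "" { print $col }
  ' "$1"
}
```

The variable could then be filled with `FASTQ=$(get_run_accessions filereport_read_run_PRJNAXXXXXXX_tsv.txt)` and passed to fastq-dump exactly as before.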

2.2.3.1 Paired-end data exception

If you are downloading paired-end data, you may have noticed that the output folder contains two files for each SRR number, such as SRR19371736_1.fastq.gz and SRR19371736_2.fastq.gz. These correspond to read1 and read2 of the paired-end sequencing.

However, if you want to align the data with the DNA-mapping pipeline, the fastq.gz file names need to carry the suffixes ['_R1', '_R2'] immediately before .fastq.gz. Therefore, we need to rename *_1.fastq.gz/*_2.fastq.gz to *_R1.fastq.gz/*_R2.fastq.gz. To do that you can run the following loop (it will affect only paired-end data):

## move to the directory where you downloaded the fastq files
cd /output/download/folder/

## get the SRR numbers of paired data
PAIRED_NUMBERS=$(ls *_1.fastq.gz | sed 's/_1\.fastq\.gz$//')

## run the renaming loop
for SRR in $PAIRED_NUMBERS
do
  mv "${SRR}_1.fastq.gz" "${SRR}_R1.fastq.gz"
  mv "${SRR}_2.fastq.gz" "${SRR}_R2.fastq.gz"
done
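
The loop above assumes that every *_1.fastq.gz file has its *_2.fastq.gz mate. As a slightly more defensive variant, the sketch below renames a pair only when both files are present and reports incomplete pairs instead; the function name rename_pairs is hypothetical and not part of any pipeline.

```shell
## rename_pairs <folder>
## renames *_1.fastq.gz/*_2.fastq.gz to *_R1.fastq.gz/*_R2.fastq.gz,
## only when both mates of a pair are present
rename_pairs() {
  for read1 in "$1"/*_1.fastq.gz
  do
    ## if nothing matches, the pattern is left as-is and does not exist
    [ -e "${read1}" ] || continue
    prefix="${read1%_1.fastq.gz}"
    read2="${prefix}_2.fastq.gz"
    if [ -e "${read2}" ]
    then
      mv "${read1}" "${prefix}_R1.fastq.gz"
      mv "${read2}" "${prefix}_R2.fastq.gz"
    else
      echo "skipping ${read1}: ${read2} not found" >&2
    fi
  done
}
```

It would be called as `rename_pairs /output/download/folder/`. Single-end files (SRR.fastq.gz) and files already carrying the _R1/_R2 suffixes do not match the pattern, so running it twice should leave them untouched.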

Download publicly available fastq

Sebastian Gregoricchio & Tesa M. Severson

23 January, 2023
