multiDUMP
- Introduction
- Installation and dependencies
- How to run the pipeline
- Collect several SRA accession numbers
- Package history and releases
- Contact
- License
Introduction
multiDUMP is a simple Snakemake-based pipeline that parallelizes the download of SRA data through fastq-dump.
Citation
If you use this package, please acknowledge Sebastian Gregoricchio in your paper.
Installation and dependencies
To install the pipeline, download this repository. Installing the provided conda environment is strongly recommended. Follow the steps below:
- move to the directory where the repository should be downloaded:
cd </target/folder>
- download the GitHub repository with the command below (alternatively, click on Code > Download ZIP on the GitHub page):
git clone https://github.com/sebastian-gregoricchio/multiDUMP
- install the conda environment from the .yaml environment file contained in the repository:
conda env create -f </target/folder>/multiDUMP/workflow/envs/multiDUMP_env.yaml
- activate the conda environment:
conda activate multiDUMP
(if the environment is not activated, the pipeline will not work properly)
If you encounter problems installing via conda, try using mamba instead.
How to run the pipeline
To download a list of SRA runs, prepare a sample configuration table that lists each SRA number together with the name to assign to the corresponding fastq files:
SRA_ID | sample_name |
---|---|
SRR125346 | sampleA |
SRR578951 | sampleB |
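Before launching a long download, it can help to sanity-check the table. The following is a minimal sketch, not part of multiDUMP: the function name `check_sample_table` is hypothetical, and it assumes a tab-delimited file whose header is `SRA_ID` and `sample_name` (the delimiter the pipeline actually expects is not stated here, so adjust `delimiter` as needed). It also assumes the IDs are run accessions, which start with SRR, ERR, or DRR.

```python
import csv
import io
import re

# Run accessions start with SRR, ERR, or DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def check_sample_table(text, delimiter="\t"):
    """Return a list of problems found in a sample table (empty list means OK).

    Hypothetical helper: the tab delimiter is an assumption, not a
    documented requirement of multiDUMP.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames
    if header is None or list(header)[:2] != ["SRA_ID", "sample_name"]:
        return ["header must start with SRA_ID and sample_name"]

    problems = []
    seen_names = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sra_id = (row["SRA_ID"] or "").strip()
        name = (row["sample_name"] or "").strip()
        if not RUN_ACCESSION.match(sra_id):
            problems.append(f"line {line_no}: '{sra_id}' does not look like a run accession")
        if not name:
            problems.append(f"line {line_no}: missing sample_name")
        elif name in seen_names:
            problems.append(f"line {line_no}: duplicated sample_name '{name}'")
        seen_names.add(name)
    return problems


example = "SRA_ID\tsample_name\nSRR125346\tsampleA\nSRR578951\tsampleB\n"
print(check_sample_table(example))
```

Duplicated sample names are flagged because two runs sharing one name would presumably map to the same output files.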
Then, with the conda environment activated, run the following command (add the -n flag for a dry run):
snakemake \
-s </target/folder>/multiDUMP/workflow/multiDUMP.snakefile \
--cores 5 \
--config \
TABLE="/path/to/sample_config_table.txt" \
OUTDIR="/full/path/to/output/directory" \
SUFFIX="['_R1', '_R2']" \
EXTENSION=".fastq.gz"
Where the config flags correspond to:
- TABLE: the full path to the sample configuration table
- OUTDIR: full path to the output directory
- SUFFIX: a python-formatted list with the suffix to use for the read1 (R1) and read2 (R2) files, respectively
- EXTENSION: the extension to use for the fastq files (the fastq-dump default is fastq.gz)
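To make the role of SUFFIX and EXTENSION concrete, here is a small sketch of how a "python-formatted list" string can be parsed and combined with a sample name. The naming scheme shown (`<OUTDIR>/<sample_name><suffix><extension>`) is an assumption inferred from the flag descriptions above, not taken from the snakefile, and `expected_outputs` is a hypothetical helper.

```python
import ast
import posixpath


def expected_outputs(sample_name, outdir, suffix, extension):
    """List the fastq paths one sample would presumably produce.

    `suffix` is the Python-formatted list passed as SUFFIX, e.g. "['_R1', '_R2']".
    The naming scheme is an assumption made for illustration.
    """
    # literal_eval safely turns the string into a real Python list.
    suffixes = ast.literal_eval(suffix)
    return [posixpath.join(outdir, f"{sample_name}{s}{extension}") for s in suffixes]


paths = expected_outputs(
    sample_name="sampleA",
    outdir="/full/path/to/output/directory",
    suffix="['_R1', '_R2']",
    extension=".fastq.gz",
)
for path in paths:
    print(path)
```

Under these assumptions, sampleA would yield one file ending in `_R1.fastq.gz` and one ending in `_R2.fastq.gz` inside the output directory.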
As an alternative to passing --config flags manually, one can provide a .yaml file as follows:
TABLE: "/path/to/sample_config_table.txt"
OUTDIR: "/full/path/to/output/directory"
SUFFIX: ['_R1', '_R2']
EXTENSION: ".fastq.gz"
Then run the following command:
snakemake \
-s </target/folder>/multiDUMP/workflow/multiDUMP.snakefile \
--cores 5 \
--configfile /path/to/config.yaml
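When the pipeline is launched from a script, the same command can be assembled programmatically. The sketch below only builds and prints the command; it uses just the flags shown above (-s, --cores, --configfile, -n). The helper name `build_command` and the example paths are hypothetical placeholders.

```python
import shlex


def build_command(snakefile, configfile, cores=5, dry_run=False):
    """Assemble the snakemake call as an argument list (hypothetical helper)."""
    cmd = [
        "snakemake",
        "-s", snakefile,
        "--cores", str(cores),
        "--configfile", configfile,
    ]
    if dry_run:
        # -n asks snakemake for a dry run, as mentioned above.
        cmd.append("-n")
    return cmd


cmd = build_command(
    snakefile="/target/folder/multiDUMP/workflow/multiDUMP.snakefile",  # placeholder path
    configfile="/path/to/config.yaml",  # placeholder path
    dry_run=True,
)
# Print a copy-pastable shell line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems if a path contains spaces.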
Collect several SRA accession numbers
To inspect and collect the samples belonging to a specific project, you can follow the fastq downloading tutorial. Tab-delimited tables can be downloaded from the ENA browser as described in paragraph 2.2.1 of the tutorial.
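A downloaded ENA table can then be reduced to the two columns the sample configuration table needs. This is a rough sketch under stated assumptions: the report is assumed to contain the columns `run_accession` and `sample_alias` (select them when exporting, or change `name_column`), the output is written tab-delimited, and the function name and the example rows are invented for illustration.

```python
import csv
import io


def ena_report_to_sample_table(report_text, name_column="sample_alias"):
    """Convert a tab-delimited ENA report into an SRA_ID/sample_name table.

    Hypothetical helper: the column names are assumptions about the export.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["SRA_ID", "sample_name"])
    for row in reader:
        # Replace spaces so the name is safe to use in file names.
        name = row[name_column].strip().replace(" ", "_")
        writer.writerow([row["run_accession"].strip(), name])
    return out.getvalue()


# Invented report content, for illustration only.
report = (
    "run_accession\tsample_alias\tlibrary_layout\n"
    "SRR125346\tsample A\tPAIRED\n"
    "SRR578951\tsample B\tPAIRED\n"
)
table = ena_report_to_sample_table(report)
print(table, end="")
```

The resulting text can be saved to a file and passed to the pipeline through the TABLE option.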
Package history and releases
A list of all releases, with a description of the changes applied in each, can be found here.
Contact
For suggestions, bug reports, or comments, please use the issues/request tab of this repository.
License
This repository is under a GNU General Public License (version 3).