View on GitHub

snakeATAC

Snakemake pipeline for analysis and normalization of ATAC-seq data starting from fastq.gz files.

Snakemake release license forks

snakeATAC





Introduction

SnakeATAC is a snakemake based end-to-end pipeline to analyze ATAC-seq data. The input files required to run the pipeline are Paired-End fastq files. The pipeline include data quality check and normalization. It is included also a step of data reads shifting in order to take into account the Tn5 transposome insertion bias. Indeed, reads should be shifted +4bp and −5bp for positive and negative strand respectively, to account for the 9-bp duplication created by DNA repair of the nick by Tn5 transposase and achieve base-pair resolution of TF footprint and motif-related analyses (Yan F., et al., Genome Biol. 2020).

Citation

If you use this package, please cite:

*--- No publication associated yet ---*



Installation an dependencies

To install the pipeline it is required to download this repository and the installation of a conda environment is strongly suggested. Follow the following steps for the installation:



How to run the pipeline

The snakemake pipeline requires only two files: a) the .snakefile, containing all the rules that will be run; b) the configuration.yaml file, in which the user can define and customize all the parameters for the different pipeline steps.
To partially avoid unexpected errors during the execution of the pipeline, a so called ‘dry-run’ is strongly recommended. Indeed, adding a -n at the end of the snakemake running command will allow snakemake to check that all links and file/parameters dependencies are satisfied before to run the “real” processes. This command will therefore help the debugging process.

snakemake -s </target/folder>/snakeATAC/workflow/snakeATAC.snakefile --configfile </target/folder>/snakeATAC/config/snakeATAC_config.yaml --cores 12 -n

If no errors occur, the pipeline can be run with the same command line without the terminal -n:

snakemake -s </target/folder>/snakeATAC/workflow/snakeATAC.snakefile --configfile </target/folder>/snakeATAC/config/snakeATAC_config.yaml --cores 12

Notice that the absence of errors does not mean that the pipeline will run without any issues; the “dry-run” is only checking whether all the resources are available.
Furthermore, there is the possibility to run the pipeline only partailly. An example of usage could be if someone wants to have a look to the fast quality controls (fastQC and multiQC reports) before to perform the alignment. To do that, it is sufficient to run a dry-run (-n mode); pick the name of the rule at which you want the pipeline to stop; type the following command:

snakemake -s </target/folder>/snakeATAC/workflow/snakeATAC.snakefile --configfile </target/folder>/snakeATAC/config/snakeATAC_config.yaml --cores 12 --until stop_rule_name



Snakefile

The snakefile are contained all the condition checks and rules (processing steps) that will be performed by the pipeline. In the following schematic mapping the main steps performed by the pipeline are depicted.
Briefly, first of a quality control (fastQC) of the raw fastq data is performed. Then, bwa-mem is used to align the paired-end sequences onto the genome of reference generating temporary SAM files that are further filter for mapping quality score and deprived of mitochondrial reads before to be converted to BAM. From this alignment files duplicated reads are removed (or marked) by PICARD and then shifted to correct the Tn5 nick repair bias (Yan F., et al., Genome Biol. 2020) by deeptools alignmentSieve tool. A quality control for each alignment and a multiQC report are generated, as well samtools flagstat metrics and deeptools fragment size distribution plots (see 03_BAM_dedup / 03_BAM_mdup paragraph for more details).
Then two possible path can be taken in order to normalize the data:

snakeATAC workflow

Configuration file

The configuration file is a yaml-formatted file containing all the parameters that are passed to different steps of the pipelines such as the directory with the input files, reference genome, threads of the tools, etc.
The snakeATAC configuration file is divided in two sections: a) ‘experiment-specific’, with al the parameters that most likely are changing from experiment to experiment; b) ‘common’, with parameters that are quite stable independently of the experiments design. The latter should be changed only for very specific needs and is in turn compose by two sections depending on whether the copy number variation is performed or not.
Hereafter, the meaning of the different parameters is described.



Experiment-specific section

Parameter Description
runs_directory The full path to the directory were the input fastq files are contained, e.g. /home/user/ATAC/00_runs/. Importantly, the name of the files, deprived of the read suffix (e.g., _R1/_R2) and file extension (e.g., .fastq.gz) will be used as sample name.
output_directory The full path to the folder in which the results should be stored, e.g. "/home/user/ATAC/".
fastq_extension The extension of the input fastq files. This string will be removed from the input file names to obtain the sample names. Examples: ".fastq.gz", ".fq.gz", ".fasta", ".fq".
runs_suffix A list (python format) with the two reads suffixes corresponding to the read1 and read2 for the paired-end sequencing. Example: ["_R1", "_R2"].
blacklist_file The full path to a BED-formatted file containing the regions corresponding to blacklisted regions (regions masked for normalization and peak calling). Blacklisted regions for the most common genome assemblies can be downloaded from the Boyle lab’s git-page.
genome_fasta The full path to a fasta-formatted file containing the sequence of there reference genome into which the reads will be aligned. If the index (.fai) file is not present in the same folder, it will be build by the pipeline during its execution. The reference genomes can be downloaded, for instance, from the UCSC golden path.
perform_HMCan_correction Default "False". If set to "True" the signal will be corrected for the presence of CNVs (Copy Number Variations) and the peak calling will be performed by HMCan. This correction could be useful in the case of heterogeneous cancer samples. Instead, if set to "False", the CNVs correction is skipped and the peak calling performed by MACS3. The CNV correction implementation is still in beta-testing phase.
effective_genomeSize Effective genome size used to normalize the ATAC-seq signal; e.g. for Hg38: 2900338458. A table for the most common genome assemblies is available in this repository at snakeATAC/resources/chromosome_sizes_for_normalization.yaml.
ignore_for_normalization A string of space separated chromosome names that should be excluded for ATAC-seq signal normalization, e.g. "X Y MT M KI270728.1". A table for the most common genome assemblies is available in this repository at snakeATAC/resources/chromosome_sizes_for_normalization.yaml.
call_variants If true, variant calling by GATK4 will be performed.
call_SNPs If true, Single Nucleotide Variation (SNP) calling by GATK4 will be performed.
call_indels If true, Insertion/Deletion (indel) calling by GATK4 will be performed.
dbsnp_file SNP database file (.dbsnp) for base recalibration required by GATK4. It could happen that your .bam files contain the ‘chr’ prefix in the chromosome names while your dbSNP file does not (or vice versa). This can be fixed in the .dbsnp file with the bcftools annotate --rename-chrs command. For Hg38, for instance, the dbSNP file can be downloaded from the broad institute cloud storage. Do not forget to download the INDEX as well!



Common section

Parameter Description
fastQC_threads Default: 2. Number of CPUs to use for fastQC (fastq quality control)
bwa_threads Default: 8. Number of CPUs to use for the mapping performed by bwa-mem.
mapQ_cutoff Default: 20. All reads with a mapping quality (MAPQ) score lower than this value will be filtered out from the bam files.
SAMtools_threads Default: 8. Number of CPUs used by samtools for bam indexing and filtering.
remove_duplicates Default: "true". Logical value to indicate whether the optical duplicates should be removed from the bams by PICARD. If set as "true" the suffix of output bam files will be ‘_dedup’ and the duplicates will be removed, otherwise the suffix will be ‘_mdup’ and the duplicates only marked when set to "false".
PICARD_max_records_in_ram Default: 250000 This will specify the number of records stored in RAM before spilling to disk. The higher the number the higher the amount of RAM needed.
PICARD_max_file_handles_for_read_ends_map Default: 4000 Maximum number of file handles to keep open when spilling read ends to disk. This number should be slightly lower than the per-process maximum number of file that may be open in your system (ulimit -n).
minFragmentLength Default: 0. Minimum fragment length needed for pair inclusion during Tn5 tagmentation bias correction by deeptools alignmentSieve tool.
maxFragmentLength Default: 0. Maximum fragment length needed for pair inclusion during Tn5 tagmentation bias correction by deeptools alignmentSieve tool. A value of 0 indicates no limit.
plot_format Default: "pdf". File format of the images of the fragment size distribution generated by deeptools bamPEFragmentSize.
bamPEFragmentSize_threads Default: 0. Number of CPUs deeptools bamPEFragmentSize should use to compute the fragment size distribution.
max_fragment_length Default: 2000. Maximum fragment size length to be plotted by deeptools bamPEFragmentSize. A value of 0 indicates to use twice the mean fragment length.
window_length Default: 1000. Size, in bp, of the sliding window used to compute the fragment size distribution by deeptools bamPEFragmentSize.
plotFingerprint_threads Default: 8. Number of CPUs to be used by deeptools plotFingerprint to compute the Lorenz curves.
plotFingerprint_binSize Default: 500. Size of the bins, in bp, by which the genome should be subdivided in order to compute the Lorenz curves by deeptools plotFingerprint.
plotFingerprint_sampledRegions Default: 500000. Number of regions to be samples in order to compute the Lorenz curves by deeptools plotFingerprint.
plotFingerprint_extra_parameters Default: "" (empty). A string with any additional parameter to add for the Lorenz curve computation performed by deeptools plotFingerprint.
binning_window_size Default: 10000. Size of the bins, in bp, by which the genome should be subdivided in order to compute the average score matrix by deeptools multiBigwigSummary. This matrix is used to compute sample correlation and Principal Component Analyses (PCA).
multiBigwigSummary_threads Default: 4. Number of CPUs to be used by deeptools multiBigwigSummary in order to compute an average signal over all the genome for each sample. The resulting matrix is used to compute sample correlation and Principal Component Analyses (PCA).
heatmap_color Default: "Blues". A string indicating the color gradient pattern to use for the correlation heatmaps. This value is passed to matplotlib/seaborn. Therefore, available options (see matplotlib page for examples) are the following: ‘Accent’, ‘Blues’, ‘BrBG’, ‘BuGn’, ‘BuPu’, ‘CMRmap’, ‘Dark2’, ‘GnBu’, ‘Greens’, ‘Greys’, ‘OrRd’, ‘Oranges’, ‘PRGn’, ‘Paired’, ‘Pastel1’, ‘Pastel2’, ‘PiYG’, ‘PuBu’, ‘PuBuGn’, ‘PuOr’, ‘PuRd’, ‘Purples’, ‘RdBu’, ‘RdGy’, ‘RdPu’, ‘RdYlBu’, ‘RdYlGn’, ‘Reds’, ‘Set1’, ‘Set2’, ‘Set3’, ‘Spectral’, ‘Wistia’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, ‘YlOrRd’, ‘afmhot’, ‘autumn’, ‘binary’, ‘bone’, ‘brg’, ‘bwr’, ‘cividis’, ‘cool’, ‘coolwarm’, ‘copper’, ‘cubehelix’, ‘flag’, ‘gist_earth’, ‘gist_gray’, ‘gist_heat’, ‘gist_ncar’, ‘gist_rainbow’, ‘gist_stern’, ‘gist_yarg’, ‘gnuplot’, ‘gnuplot2’, ‘gray’, ‘hot’, ‘hsv’, ‘icefire’, ‘inferno’, ‘jet’, ‘magma’, ‘mako’, ‘nipy_spectral’, ‘ocean’, ‘pink’, ‘plasma’, ‘prism’, ‘rainbow’, ‘rocket’, ‘seismic’, ‘spring’, ‘summer’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘terrain’, ‘twilight’, ‘twilight_shifted’, ‘viridis’, ‘vlag’, ‘winter’.
zScore_heatmap_color Default: "seismic". A string indicating the color gradient pattern to use for the peak score heatmaps. This value is passed to matplotlib/seaborn. Therefore, available options (see matplotlib page for examples) are the following: ‘Accent’, ‘Blues’, ‘BrBG’, ‘BuGn’, ‘BuPu’, ‘CMRmap’, ‘Dark2’, ‘GnBu’, ‘Greens’, ‘Greys’, ‘OrRd’, ‘Oranges’, ‘PRGn’, ‘Paired’, ‘Pastel1’, ‘Pastel2’, ‘PiYG’, ‘PuBu’, ‘PuBuGn’, ‘PuOr’, ‘PuRd’, ‘Purples’, ‘RdBu’, ‘RdGy’, ‘RdPu’, ‘RdYlBu’, ‘RdYlGn’, ‘Reds’, ‘Set1’, ‘Set2’, ‘Set3’, ‘Spectral’, ‘Wistia’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, ‘YlOrRd’, ‘afmhot’, ‘autumn’, ‘binary’, ‘bone’, ‘brg’, ‘bwr’, ‘cividis’, ‘cool’, ‘coolwarm’, ‘copper’, ‘cubehelix’, ‘flag’, ‘gist_earth’, ‘gist_gray’, ‘gist_heat’, ‘gist_ncar’, ‘gist_rainbow’, ‘gist_stern’, ‘gist_yarg’, ‘gnuplot’, ‘gnuplot2’, ‘gray’, ‘hot’, ‘hsv’, ‘icefire’, ‘inferno’, ‘jet’, ‘magma’, ‘mako’, ‘nipy_spectral’, ‘ocean’, ‘pink’, ‘plasma’, ‘prism’, ‘rainbow’, ‘rocket’, ‘seismic’, ‘spring’, ‘summer’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘terrain’, ‘twilight’, ‘twilight_shifted’, ‘viridis’, ‘vlag’, ‘winter’.



Copy Number Variation signal correction (for details see the HMCan page)

Parameter Description
HMCan_path Full path to the folder containing the HMCan scripts. Instruction for the download and build of HMCan can be found at the the HMCan page.
HMCan_threads Default: 12. Number of CPUs to be used by HMCan.
chromosome_sizes_file Full path to the chromosome sizes file used to convert the CNV corrected signal into a bigWig. These files can be downloaded from the UCSC golden path.
reference_sample Default: "NA". A string to define the reference sample (sample ID) to which all the samples should be normalized. "NA" indicates that the first samples in alphabetic order will be used as reference.
format Default: "BAM" (used to build an HMCan configFile).
GCIndex (used to build an HMCan configFile) Full path to the GC index file provided at the HMCan page. Example: “/home/user/HMCan-master/data/GC_profile_25KbWindow_Mapp100bp_hg38.cnp”.
smallBinLength Default: 5 (used to build an HMCan configFile). For details see the HMCan page.
largeBinLength Default: 25000 (used to build an HMCan configFile). For details see the HMCan page.
*pvalueThreshold Default: 0.001 (used to build an HMCan configFile). For details see the HMCan page.
mergeDistance Default: 0 (used to build an HMCan configFile). For details see the HMCan page.
iterationThreshold Default: 5 (used to build an HMCan configFile). For details see the HMCan page.
finalThreshold Default: 0 (used to build an HMCan configFile). For details see the HMCan page.
maxIter Default: 20 (used to build an HMCan configFile). For details see the HMCan page.
PrintWig Default: "False" (used to build an HMCan configFile). For details see the HMCan page.
PrintPosterior Default: "True" (used to build an HMCan configFile). For details see the HMCan page.
PrintBedGraph Default: "True" (used to build an HMCan configFile). For details see the HMCan page.
CallPeaks Default: "True" (used to build an HMCan configFile). For details see the HMCan page.
pairedEnds Default: "True" (used to build an HMCan configFile). For details see the HMCan page.
Type Default: "ATAC-seq" (used to build an HMCan configFile). For details see the HMCan page.
GCmergeDistance Default: 1000 (used to build an HMCan configFile). For details see the HMCan page.
RemoveDuplicates Default: "False" (used to build an HMCan configFile). For details see the HMCan page.
CNAnormalization Default: "True" (used to build an HMCan configFile). For details see the HMCan page.
multiBamSummary_threads Default: 6. Number of CPUs to be used by deeptools multiBamSummary in order to compute an average signal over all the genome for each sample for the calculation of the scaling factors. This factors will be used to normalize the CNV corrected signal by sequencing depth.



Standard normalization and peak calling (without CNV correction)

Parameter Description
bigWig_binSize Default: 5. Size, in bp, of the bins used to compute the normalized bigWig files.
normalization_method Default: "RPGC", reads per genomic content (1x normalization). Type of normalization to be used to generated the normalized bigWig files by deeptools bamCoverage.
bamCoverage_threads Default: 8. Number of CPUs to be used by deeptools bamCoverage in order to compute the signal normalization and the bigWig generation.
genome_size_MACS A string or a number indicating the genome size to use by MACS3 for the peak calling. Example: "hs". Some values: hs = 2.7e9, mm = 1.87e9, ce = 9e7, dm = 1.2e8.
FDR_cutoff Deafult: 0.01. False Discovery Ratio (FDR) cutoff used by MACS3 to filter the significant peaks.
call_summits Default: "True". Logical value to define whether MACS3 should also call the peak summits (position with the highest value).
FRiP_threshold Default 20. This value will be used to label the FRiP (Fraction of Reads in a Peak) score as “good” or “bad in the summary table for each single sample. A FRiP above the 20% (FRiP = 0.02) is considered a good score for ATAC-seq peaks by the ENCODE guidelines.



Samples comparison for overall variance estimation

Parameter Description
binning_window_size Default: 10000 (base-pairs). Size of the window used to scan the genome and calculate the bins average score for the sample correlation.
multiBigwigSummary_threads Default: 4. Number of threads to be used by multiBigwigSummary (deeptools) to compute the matrix for the sample peak score correlation.
heatmap_color Default: "Blues". A string indicating the color gradient pattern to use for the peaks score correlation heatmaps. This value is passed to matplotlib/seaborn. Therefore, available options (see matplotlib page for examples) are the following: ‘Accent’, ‘Blues’, ‘BrBG’, ‘BuGn’, ‘BuPu’, ‘CMRmap’, ‘Dark2’, ‘GnBu’, ‘Greens’, ‘Greys’, ‘OrRd’, ‘Oranges’, ‘PRGn’, ‘Paired’, ‘Pastel1’, ‘Pastel2’, ‘PiYG’, ‘PuBu’, ‘PuBuGn’, ‘PuOr’, ‘PuRd’, ‘Purples’, ‘RdBu’, ‘RdGy’, ‘RdPu’, ‘RdYlBu’, ‘RdYlGn’, ‘Reds’, ‘Set1’, ‘Set2’, ‘Set3’, ‘Spectral’, ‘Wistia’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, ‘YlOrRd’, ‘afmhot’, ‘autumn’, ‘binary’, ‘bone’, ‘brg’, ‘bwr’, ‘cividis’, ‘cool’, ‘coolwarm’, ‘copper’, ‘cubehelix’, ‘flag’, ‘gist_earth’, ‘gist_gray’, ‘gist_heat’, ‘gist_ncar’, ‘gist_rainbow’, ‘gist_stern’, ‘gist_yarg’, ‘gnuplot’, ‘gnuplot2’, ‘gray’, ‘hot’, ‘hsv’, ‘icefire’, ‘inferno’, ‘jet’, ‘magma’, ‘mako’, ‘nipy_spectral’, ‘ocean’, ‘pink’, ‘plasma’, ‘prism’, ‘rainbow’, ‘rocket’, ‘seismic’, ‘spring’, ‘summer’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘terrain’, ‘twilight’, ‘twilight_shifted’, ‘viridis’, ‘vlag’, ‘winter’.
zScore_heatmap_color Default: "seismic". A string indicating the color gradient pattern to use for the peak score heatmaps. This value is passed to matplotlib/seaborn. Therefore, available options (see matplotlib page for examples) are the following: ‘Accent’, ‘Blues’, ‘BrBG’, ‘BuGn’, ‘BuPu’, ‘CMRmap’, ‘Dark2’, ‘GnBu’, ‘Greens’, ‘Greys’, ‘OrRd’, ‘Oranges’, ‘PRGn’, ‘Paired’, ‘Pastel1’, ‘Pastel2’, ‘PiYG’, ‘PuBu’, ‘PuBuGn’, ‘PuOr’, ‘PuRd’, ‘Purples’, ‘RdBu’, ‘RdGy’, ‘RdPu’, ‘RdYlBu’, ‘RdYlGn’, ‘Reds’, ‘Set1’, ‘Set2’, ‘Set3’, ‘Spectral’, ‘Wistia’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, ‘YlOrRd’, ‘afmhot’, ‘autumn’, ‘binary’, ‘bone’, ‘brg’, ‘bwr’, ‘cividis’, ‘cool’, ‘coolwarm’, ‘copper’, ‘cubehelix’, ‘flag’, ‘gist_earth’, ‘gist_gray’, ‘gist_heat’, ‘gist_ncar’, ‘gist_rainbow’, ‘gist_stern’, ‘gist_yarg’, ‘gnuplot’, ‘gnuplot2’, ‘gray’, ‘hot’, ‘hsv’, ‘icefire’, ‘inferno’, ‘jet’, ‘magma’, ‘mako’, ‘nipy_spectral’, ‘ocean’, ‘pink’, ‘plasma’, ‘prism’, ‘rainbow’, ‘rocket’, ‘seismic’, ‘spring’, ‘summer’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘terrain’, ‘twilight’, ‘twilight_shifted’, ‘viridis’, ‘vlag’, ‘winter’.
zScore_heatmap_color Default: "seismic". A string indicating the color gradient pattern to use for the peak score heatmaps. This value is passed to matplotlib/seaborn. Therefore, available options (see matplotlib page for examples) are the following: ‘Accent’, ‘Blues’, ‘BrBG’, ‘BuGn’, ‘BuPu’, ‘CMRmap’, ‘Dark2’, ‘GnBu’, ‘Greens’, ‘Greys’, ‘OrRd’, ‘Oranges’, ‘PRGn’, ‘Paired’, ‘Pastel1’, ‘Pastel2’, ‘PiYG’, ‘PuBu’, ‘PuBuGn’, ‘PuOr’, ‘PuRd’, ‘Purples’, ‘RdBu’, ‘RdGy’, ‘RdPu’, ‘RdYlBu’, ‘RdYlGn’, ‘Reds’, ‘Set1’, ‘Set2’, ‘Set3’, ‘Spectral’, ‘Wistia’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, ‘YlOrRd’, ‘afmhot’, ‘autumn’, ‘binary’, ‘bone’, ‘brg’, ‘bwr’, ‘cividis’, ‘cool’, ‘coolwarm’, ‘copper’, ‘cubehelix’, ‘flag’, ‘gist_earth’, ‘gist_gray’, ‘gist_heat’, ‘gist_ncar’, ‘gist_rainbow’, ‘gist_stern’, ‘gist_yarg’, ‘gnuplot’, ‘gnuplot2’, ‘gray’, ‘hot’, ‘hsv’, ‘icefire’, ‘inferno’, ‘jet’, ‘magma’, ‘mako’, ‘nipy_spectral’, ‘ocean’, ‘pink’, ‘plasma’, ‘prism’, ‘rainbow’, ‘rocket’, ‘seismic’, ‘spring’, ‘summer’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘terrain’, ‘twilight’, ‘twilight_shifted’, ‘viridis’, ‘vlag’, ‘winter’.
zScore_heatmap_color Default: "seismic". A string indicating the color gradient pattern to use for the peak score heatmaps. This value is passed to matplotlib/seaborn. Therefore, available options (see matplotlib page for examples) are the following: ‘Accent’, ‘Blues’, ‘BrBG’, ‘BuGn’, ‘BuPu’, ‘CMRmap’, ‘Dark2’, ‘GnBu’, ‘Greens’, ‘Greys’, ‘OrRd’, ‘Oranges’, ‘PRGn’, ‘Paired’, ‘Pastel1’, ‘Pastel2’, ‘PiYG’, ‘PuBu’, ‘PuBuGn’, ‘PuOr’, ‘PuRd’, ‘Purples’, ‘RdBu’, ‘RdGy’, ‘RdPu’, ‘RdYlBu’, ‘RdYlGn’, ‘Reds’, ‘Set1’, ‘Set2’, ‘Set3’, ‘Spectral’, ‘Wistia’, ‘YlGn’, ‘YlGnBu’, ‘YlOrBr’, ‘YlOrRd’, ‘afmhot’, ‘autumn’, ‘binary’, ‘bone’, ‘brg’, ‘bwr’, ‘cividis’, ‘cool’, ‘coolwarm’, ‘copper’, ‘cubehelix’, ‘flag’, ‘gist_earth’, ‘gist_gray’, ‘gist_heat’, ‘gist_ncar’, ‘gist_rainbow’, ‘gist_stern’, ‘gist_yarg’, ‘gnuplot’, ‘gnuplot2’, ‘gray’, ‘hot’, ‘hsv’, ‘icefire’, ‘inferno’, ‘jet’, ‘magma’, ‘mako’, ‘nipy_spectral’, ‘ocean’, ‘pink’, ‘plasma’, ‘prism’, ‘rainbow’, ‘rocket’, ‘seismic’, ‘spring’, ‘summer’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘terrain’, ‘twilight’, ‘twilight_shifted’, ‘viridis’, ‘vlag’, ‘winter’.



GATK variant calling

Parameter Description
bgzip_threads Default: 4. Number of threads to be used by samtools bgzip to compress VCFs (Variant Call Format).
DP_snp_threshold Default: 20. Threshold for the hard-filtering of the SNP (single-nucleotide polymorphism) VCF table for sequencing depth (DP) by SnpSift Filter.
QUAL_snp_threshold Default: 0. Threshold for the hard-filtering of the SNP (single-nucleotide polymorphism) VCF table for quality (QUAL) by SnpSift Filter.
DP_indel_threshold Default: 20. Threshold for the hard-filtering of the InDel (insertion/delition) VCF table for sequencing depth (DP) by SnpSift Filter.
QUAL_indel_threshold Default: 0. Threshold for the hard-filtering of the InDel (insertion/delition) VCF table for quality (QUAL) by SnpSift Filter.
SnpSift_vcf_fields_to_extract Default: [ "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "DP", "FILTER", "FORMAT", "GEN[*].GT", "GEN[*].AD" ]. A python list to indicate the fields to be extracted from the VCFs (both SNP and InDel) during the conversion to .txt file by SnpSift Extract Fields.



Results

The structure of the output_folder is the following:

output_folder
├── 01_fastQC_raw
│   ├── sample_fastqc.html
│   ├── sample_fastqc.zip
│   └── multiQC_raw
│       ├── multiQC_report_fastqRaw_data
│       │   ├── multiqc_citations.txt
│       │   ├── multiqc_data.json
│       │   ├── multiqc_fastqc.txt
│       │   ├── multiqc_general_stats.txt
│       │   ├── multiqc.log
│       │   └── multiqc_sources.txt
│       └── multiQC_report_fastqRaw.html
|
├── 02_BAM
│   ├── sample_mapQ20_sorted_woMT.bam
│   ├── sample_mapQ20_sorted_woMT.bam.bai
│   └── flagstat
│       ├── sample_flagstat_filtered_bam_woMT.txt
│       └── sample_flagstat_UNfiltered_bam.txt
|
├── 03_BAM_dedup (or 03_BAM_mdup, if duplicates are not removed)
│   ├── sample_mapQ20_woMT_dedup_shifted_sorted.bam
│   ├── sample_mapQ20_woMT_dedup_shifted_sorted.bam.bai
│   ├── fastQC
│   │   ├── sample_sorted_woMT_dedup_fastqc.html
│   │   ├── sample_sorted_woMT_dedup_fastqc.zip
│   │   └── multiQC_dedup_bams
│   │       ├── multiQC_report_BAMs_dedup_data
│   │       │   ├── multiqc_citations.txt
│   │       │   ├── multiqc_data.json
│   │       │   ├── multiqc_fastqc.txt
│   │       │   ├── multiqc_general_stats.txt
│   │       │   ├── multiqc.log
│   │       │   └── multiqc_sources.txt
│   │       └── multiQC_report_BAMs_dedup.html
│   ├── flagstat
│   │   ├── sample_flagstat_filtered_bam_woMT_dedup.txt
│   │   └── sample_flagstat_woMT_dedup_shifted_sorted.txt
│   ├── fragmentSizeDistribution_plots
|   |   ├── ALL.samples_fragment_size_distribution.pdf
│   │   └── sample_fragment_size_distribution.pdf
│   ├── metrics
│   │   └── sample_metrics_woMT_dedup_bam.txt
│   └── unshifted_bams
│       ├── sample_mapQ20_sorted_woMT_dedup.bam
│       └── sample_mapQ20_sorted_woMT_dedup.bai
|
├── 04_Normalization
│   ├── HMCan_output ### --> only if HMCan correction is performed ###
│   └── normalized_bigWigs
│       └── sample_mapQ20_woMT_dedup_shifted_normalized_bs5.bw
│
├── 05_Peaks_MACS3 ### --> if HMCan correction is not performed ###
│   ├── sample_mapQ20_woMT_dedup_shifted_FDR0.01_peaks.narrowPeak
│   ├── sample_mapQ20_woMT_dedup_shifted_FDR0.01_peaks.xls
│   ├── sample_mapQ20_woMT_dedup_shifted_FDR0.01_summits.bed
│   └── log
│       └── sample_mapQ20_woMT_dedup_shifted_FDR0.01.log
|
├── 06_Overall_quality_and_info
|   ├── Lorenz_curve_deeptools.plotFingreprint_allSamples.pdf
|   ├── Counts
|   │   ├── counts_summary.txt
|   │   └── subread_featureCounts_output
|   │       └── sample
|   │           ├── sample.readCountInPeaks
|   │           ├── sample.readCountInPeaks.log
|   │           └── sample.readCountInPeaks.summary
|   └── Sample_comparisons
|       ├── multiBigWigSummary_matrix_allSamples.npz
|       ├── PCA_on_BigWigs_wholeGenome.pdf
|       ├── Peak_comparison
|       │   ├── all_samples_peaks_concatenation_collapsed_sorted.bed
|       │   ├── peaks_score_matrix_all_samples_MACS3.npz
|       │   └── peaks_score_matrix_all_samples_table_MACS3.tsv
|       |   └── Heatmaps
|       |       ├── Heatmap_on_log1p.rawScores_for_MACS3.peaks_union_population.pdf
|       │       └── Heatmap_on_zScores_for_MACS3.peaks_union_population.pdf
|       └── Sample_correlation
|           ├── Correlation_heatmap_on_BigWigs_wholeGenome_pearsonMethod.pdf
|           ├── Correlation_heatmap_on_BigWigs_wholeGenome_spearmanMethod.pdf
|           ├── Correlation_scatterplot_on_BigWigs_wholeGenome_pearsonMethod.pdf
|           └── Correlation_scatterplot_on_BigWigs_wholeGenome_spearmanMethod.pdf
|
└── 07_Variant_calling
    ├── all_samples_peaks_concatenation_collapsed_sorted.bed
    ├── all.samples_dedup_gatk-indel_filtered.DP20.QUAL20
    ├── all.samples_dedup_gatk-snp_filtered.DP20.QUAL20
    ├── all.samples_INDEL_counts_plot.pdf
    ├── all.samples_SNP_counts_plot.pdf
    └── sample
        ├── sample_dedup_gatk-indel_filtered.DP20.QUAL20.txt
        ├── sample_dedup_gatk-indel_filtered.DP20.QUAL20.vcf.gz
        ├── sample_dedup_gatk-indel_filtered.DP20.QUAL20.vcf.gz.tbi
        ├── sample_dedup_gatk-snp_filtered.DP20.QUAL20.txt
        ├── sample_dedup_gatk-snp_filtered.DP20.QUAL20.vcf.gz
        ├── sample_dedup_gatk-snp_filtered.DP20.QUAL20.vcf.gz.tbi
        ├── sample_dedup_gatk.vcf.gz
        ├── sample_dedup_gatk.vcf.gz.tbi
        ├── sample_mapQ20_sorted_woMT_dedup_bsqr.bai
        ├── sample_mapQ20_sorted_woMT_dedup_bsqr.bam
        ├── sample_mapQ20_sorted_woMT_dedup_bsqr.table
        └── sample_plotCoverage.pdf



01_fastQC_raw

This folder contains a the fastq quality control (fastQC) reports for each fastq file and a summary report of multiQC.



02_BAM

When the reads are aligned onto the reference genome by bwa, the resulting SAM files are filtered for mapping quality (MAPQ) and the mithocondrial (suffix: woMT) reads are removed before sorting. Flagstat metrics is generated for each file and stored in the homonym folder.



03_BAM_dedup / 03_BAM_mdup

PICARD is used to remove (suffix: dedup) or mark (suffix: mdup) duplicates in the BAM files. The resulting BAMs are stored in the subfolder “unshifted_bams”, while the PICARD metrics is stored in the “metrics” folder. A fastq quality control (fastQC) and relative multiQC report is performed on the unshifted bams.
Then, the Tn5 nick reparation bias is corrected by shifting of the reads using deeptools alignmentSieve (suffix: shifted). Notice that, after shifting, the read sequence information is lost in the shifted BAMs.
Flagstat metrics is generated for each unshifted and shifted bam file and stored in the “falgstat” folder.

Furthermore, in the “fragmentSizeDistribution_plots” folder the distribution of the fragment sizes for each sample (shifted BAMs) and a file collecting all the plots in a single file. Here after an example of a good (left) and a bad (right) fragment size distribution.

fragment size distribution examples

An optimal fragment size distribution should be included within a range of 50-800bp, with a periodicity of ~150bp (corrsponding to mono-, di-, tri-, … nucleosomes) with a lower intensity for larger fragments.



04_Normalization

Shifted signal is normalized on sequencing depth library upon copy number variation correction by HMCan (if requested by the user). The bin size used is indicated in the resulting bigWig file name (suffix: bs#).
However, these bigWig files can be normalized more precisely normalized in the case that you dispone of a corresponding RNA-seq data set using CHIPIN (L. Polit et.al, BMC Bioinformatics, 2021). Examples of CHIPIN usage can be found at S. Gregoricchio et al., Nucleic Acids Research (2022).



05_Peaks_MACS3 (when HMCan correction is not performed)

Peaks and summits (if required by the user) are called by MACS3 on shifted BAMs. The FDR (False Discovery Ratio) threshold used is indicated in the file name (suffix: FDR#). When HMCan correction is active, the peaks are called by HMCan itself.



06_Overall_quality_and_info

This folder contains multiple quality controls, feature counts and sample correlation plots:

lorenz curve examples



Column Description
Sample Sample name
Reads_R1 Number of reads in read.1 fastq file.
Reads_R2 Number of reads in read.2 fastq file.
Reads_total Total number of reads (read.1 + read.2).
unfiltered_BAM Total number of reads in the bam file after filtering by map quality (MAPQ).
Percentage_MT Approximative percentage of reads represented by the mithocondrial DNA. Ideally lower than 10-20%.
dedup_BAM Total number of reads left after BAM reads deduplication.
duplicated_reads Number of duplicated reads. If the duplicates are not remove the value will be 0.
shifted_BAM Number of reads in the shifted BAMs.
loss_post_shifting Number of reads lost upon BAM shifting. Consider that reads falling in blacklisted regions are removed.
n.peaks Total number of peaks called.
FRiP.perc Frequency Reads in Peaks percentage, corresponds to the number of reads falling in peak regions divide by the total number of reads and multiplied by 100.
FRiP.quality A label (“good” or “bad”) to indicate whether the FRiP score is good or not for a given sample. The threshold can be changed in the config file by the user, by the default 20 (as suggested by the ENCODE guidelines).





07_Variant_calling

If required by the user, the pipeline can call altered single-nucleotide polymorphism (SNP) and insertion/deletions (InDel). PICARD CreateSequenceDictionary is used to create a genome dictionary in order to perform a Base Quality Score Recalibration (BQSR) of unshifted BAM files (after filling of the ReadGroups by PICARD AddOrReplaceReadGroups) by GATK BaseRecalibrator and GATK ApplyBQSR.
The recalibrated BAMs are used by GATK HaplotypeCaller to generate a GVCF (Genomic Variant Call Format) file containing the genomic variants individuated at the regions included in the file resulting by the merge of all the called peaks. This GVCF file is then recalibrated by GATK GenotypeGVCFs resulting in a VCF file.
The VCF recalibrated file is selected by GATK SelectVariants to obtain two separate VCF files corresponding to SNPs and InDels.
Ultimately, these VCF are hard-filtered for sequencing depth (DP) and quality (QUAL) by SnpSift Filter and then exported in a txt table by SnpSift Extract Fields. Notably, the genotype ‘0|0’ (both alleles not mutated) is filtered out from the .txt table.
The SNP, or InDel, .txt tables from all samples are merged in a unique one with the addition of a column corresponding to the sample name. These tables are used to generate a plot of the counts of variants found in each sample. Further, a coverage plot at the merged peaks is generated by deeptools plotCoverage.




Package history and releases

A list of all releases and respective description of changes applied could be found here.

Contact

For any suggestion, bug fixing, commentary please report it in the issues/request tab of this repository.

License

This repository is under a GNU General Public License (version 3).


Contributors

contributors