release license

1 Introduction

The concept behind the Rseb (R-package for Simplified End-to-end data Build-up) is to provide a toolkit that allows the automation of different type of tasks avoiding retyping of code and loss of time. Furthermore, the advantage is that, in most of the cases, the functions are built in R-language making it suitable for all the operating systems

From a more biological point of view, this package simplifies many downstream analyses of high-throughput data that otherwise would require many hours of code typing and case-to-case adaptation. Moreover, most of the functions aimed to visualize these kind of data are thought to provide a high level of possible customization with a large number of graphical parameters compared to the commonly used already available tools. Another advantage of this package is that it offers multiple methods, with a corresponding visualization, to quantify the difference of signal between samples, in a qualitatively and/or quantitatively way, without any additional coding.

The guide is divided in four parts: 1) the first one will explore some analyses and visualization of RNA-seq data; 2) the second one the representation and quantification of targeted sequencing data (ChIP-seq, ATAC-seq, etc.); 3) the third part is focused on the analyses of RNA expression by RT-qPCR; 4) while the last part is focused on some of the “general” tools available in this package.

1.1 Citation

If you use this package, please cite:

Citation

“HDAC1 and PRC2 mediate combinatorial control in SPI1/PU.1-dependent gene repression in murine erythroleukaemia.”.
Gregoricchio S. et al., Nucleic Acids Research (2022).
doi: 10.1093/nar/gkac613

citation("Rseb")
> 
> To cite 'Rseb' in publications use:
> 
>   S. Gregoricchio et al., HDAC1 and PRC2 mediate combinatorial control
>   in SPI1/PU.1-dependent gene repression in murine erythroleukaemia.
>   Nucleic Acids Research, Volume 50, Issue 14 (2022), Pages 7938-7958;
>   doi: 10.1093/nar/gkac613
> 
> A BibTeX entry for LaTeX users is
> 
>   @Article{,
>     title = {HDAC1 and PRC2 mediate combinatorial control in SPI1/PU.1-dependent gene repression in murine erythroleukaemia},
>     author = {Sebastian Gregoricchio and Lelia Polit and Michela Esposito and Jeremy Berthelet and Laure Delestre and Emilie Evanno and M'Boyba Diop and Isabelle Gallais and Hanna Aleth and Mathilde Poplineau and Wilbert Zwart and Frank Rosenbauer and Fernando Rodrigues-Lima and Estelle Duprez and Valentina Boeva and Christel Guillouf},
>     journal = {Nucleic Acids Research},
>     year = {2022},
>     volume = {50},
>     issue = {14},
>     pages = {7938--7958},
>     url = {https://doi.org/10.1093/nar/gkac613},
>   }

2 RNA-seq data

A common analysis performed on RNA-seq data is the evaluation of the differentially expressed genes between two different conditions (e.g., untreated vs treated cells).

For this analysis it is common to use the R-package DESeq2 ¹ which returns a table as the following one:

data("RNAseq", package = "Rseb")
RNAseq

**DESeq2 results table example**
geneName	baseMean	log2FC	lfcSE	stat	pvalue	padj
Slc5a4b	1.013385	0.0017493	0.1547084	0.0113069	0.9909786	0.9996304
Cfap65	1.013385	0.0017493	0.1547084	0.0113069	0.9909786	0.9996304
Gm12193	1.180052	0.0349659	0.1636364	0.2136801	0.8307965	0.9996304
Hepacam2	2.824978	-0.0167307	0.2216010	-0.0754990	0.9398177	0.9996304
Orai2	2926.479691	-0.9358489	0.0797280	-11.7380155	0.0000000	0.0000000
Sync	6.885291	-0.1496662	0.2764683	-0.5413504	0.5882661	0.9996304
Btnl7-ps	3.564508	0.1962625	0.2380590	0.8244279	0.4096965	0.9996304
Gm12902	10.819662	0.1299938	0.2931625	0.4434190	0.6574627	0.9996304
Slc10a4	2.725130	-0.1148920	0.2176023	-0.5279906	0.5975059	0.9996304
Pgf	18.953588	-1.2437818	0.2983811	-4.1684332	0.0000307	0.0003180

2.1 Differentially expressed genes

The differential genes are defined depending on the Fold Change (FC) of expression and the adjusted p-value (P_adj). The function DEstatus helps in this definition. It takes as input the fold change expression and the p-value adjusted, then a threshold for these two parameters can be set to define four status:

UP-regulated (FC greater than threshold, significant p-value);
DOWN-regulated (FC lower than threshold, significant p-value);
UNRESPONSIVE/NoResp (FC within the range defined by the unresponsive FC threshold, not significant p-value);
NULL (all the other genes).

All the labels and thresholds can be custom. We can proceed to add a column with the differential expression status to the original table.

require(dplyr)

RNAseq <-
  RNAseq %>%
  mutate(DE.status = Rseb::DE.status(log2FC = RNAseq$log2FC,
                                     p.value.adjusted = RNAseq$padj,
                                     FC_threshold = 2, # Linear value
                                     FC_NoResp_left = 0.9, # Automatically 0.9 <= FC <= 1/0.9)
                                     p.value_threshold = 0.05,
                                     low.FC.status.label = "DOWN",
                                     high.FC.status.label = "UP",
                                     unresponsive.label = "UNRESPONSIVE",
                                     null.label = "NULL"))

**RNA-seq table with differential expression status**
geneName	baseMean	log2FC	lfcSE	stat	pvalue	padj	DE.status
Slc5a4b	1.013385	0.0017493	0.1547084	0.0113069	0.9909786	0.9996304	UNRESPONSIVE
Cfap65	1.013385	0.0017493	0.1547084	0.0113069	0.9909786	0.9996304	UNRESPONSIVE
Gm12193	1.180052	0.0349659	0.1636364	0.2136801	0.8307965	0.9996304	UNRESPONSIVE
Hepacam2	2.824978	-0.0167307	0.2216010	-0.0754990	0.9398177	0.9996304	UNRESPONSIVE
Orai2	2926.479691	-0.9358489	0.0797280	-11.7380155	0.0000000	0.0000000	NULL
Sync	6.885291	-0.1496662	0.2764683	-0.5413504	0.5882661	0.9996304	UNRESPONSIVE
Btnl7-ps	3.564508	0.1962625	0.2380590	0.8244279	0.4096965	0.9996304	NULL
Gm12902	10.819662	0.1299938	0.2931625	0.4434190	0.6574627	0.9996304	UNRESPONSIVE
Slc10a4	2.725130	-0.1148920	0.2176023	-0.5279906	0.5975059	0.9996304	UNRESPONSIVE
Pgf	18.953588	-1.2437818	0.2983811	-4.1684332	0.0000307	0.0003180	DOWN

It is possible now to use the DE.status column to count the number of genes per each group:

RNAseq.summary.table <-
  RNAseq %>%
  group_by(DE.status) %>%
  summarise(N = n()) %>%
  rbind(c("Total", nrow(RNAseq)))

**RNA-seq differential expression summary**
DE.status	N
DOWN	38
NULL	650
UNRESPONSIVE	1292
UP	20
Total	2000

2.2 Representation of the RNA-seq data

Two simple representations for RNA-seq data are the MA-plot and the volcano-plot. The first allows to visualize the Fold Change expression as function of basal expression of each gene, while the second always the FC but depending on the significance of the FC computation.

2.2.1 MA-plot

The MA-plot helps to estimate the difference between two samples plotting the Fold Change expression of a gene as function of its expression among all the samples (all the replicates of both conditions compared, defined by the “baseMean” in the DESeq2 output table). Different colors could be used depending on the “DE.status” column that we just added to the RNA-seq table.

require(ggplot2)

MA.plot <-
  ggplot(data = RNAseq,
         aes(x = log2(baseMean),
             y = log2FC,
             col = DE.status)) +
  geom_point(size = 2) +
  scale_color_manual(values = c("#F8766D", "gray30", "#00A5CF", "#00BA38")) +
  ggtitle("MA-plot") +
  theme_classic()

MA.plot

<div style="text-align: justify"> <font size="-0.5"> **MA-plot of RNA-seq data** <br> Correlation between log~2~(mean normalized expression in all samples and replicates) and log~2~(Fold Change expression) among two conditions are reprresented as dots for each single gene. Up-regulated (FC ≥ 2, P~adj~ < 0.05), down-regulated (FC ≤ 0.5, P~adj~ < 0.05), unresponsive (0.9 ≤ FC ≤ 1.1, P~adj~ ≥ 0.05) or null genes are indicated as pink, green, blue or gray points respectively. </font> </div> <br>

MA-plot of RNA-seq data
Correlation between log₂(mean normalized expression in all samples and replicates) and log₂(Fold Change expression) among two conditions are reprresented as dots for each single gene. Up-regulated (FC ≥ 2, P_adj < 0.05), down-regulated (FC ≤ 0.5, P_adj < 0.05), unresponsive (0.9 ≤ FC ≤ 1.1, P_adj ≥ 0.05) or null genes are indicated as pink, green, blue or gray points respectively.

2.2.2 Volcano-plot

To plot the Fold Change as function of the p-value it is possible to use the function volcano.

volcano.plot <-
  Rseb::volcano(log2FC_data = RNAseq$log2FC,
                padj_data = RNAseq$padj,
                FC_t = 2,
                p_t = 0.05,
                FC_unresponsive_left = 0.9,
                left_label = "DOWN",
                unresponsive_label = "UNRESPONSIVE",
                right_label = "UP",
                null_label = "NULL",
                title = "Volcano",
                point_size = 2)

volcano.plot

<div style="text-align: justify"> <font size="-0.5"> **Volcano plot of RNA-seq data** <br> Correlation between log~2~(Foldchange expression) among two conditions and -log~10~(p-value~adjusted~) of the comparisons are reprresented as dots for each single gene. Up-regulated (FC ≥ 2, P~adj~ < 0.05), down-regulated (FC ≤ 0.5, P~adj~ < 0.05), unresponsive (0.9 ≤ FC ≤ 1.1, P~adj~ ≥ 0.05) or null genes are indicated as pink, green, blue or gray points respectively. </font> </div> <br>

Volcano plot of RNA-seq data
Correlation between log₂(Foldchange expression) among two conditions and -log₁₀(p-value_adjusted) of the comparisons are reprresented as dots for each single gene. Up-regulated (FC ≥ 2, P_adj < 0.05), down-regulated (FC ≤ 0.5, P_adj < 0.05), unresponsive (0.9 ≤ FC ≤ 1.1, P_adj ≥ 0.05) or null genes are indicated as pink, green, blue or gray points respectively.

This function allows to label the points corresponding to differentially expressed genes with the corresponding gene name if required.

volcano.plot.with.names <-
  Rseb::volcano(log2FC_data = RNAseq$log2FC,
                padj_data = RNAseq$padj,
                names = RNAseq$geneName,
                FC_t = 2,
                p_t = 0.05,
                FC_unresponsive_left = 0.9,
                left_label = "DOWN",
                unresponsive_label = "UNRESPONSIVE",
                right_label = "UP",
                null_label = "NULL",
                title = "Volcano",
                point_size = 2,
                right_names = T,
                names_size = 3)

volcano.plot.with.names

3 Sequencing signal representation

Transcription factors and histone marks occupancy (eg. ChIP/Cut&Tag-seq), chromatin condensation (eg. ATAC/MNase-seq), DNA-methylation sites (e.g. MeDIP) are some of the possible high-throughput sequencing signals that could be represented as tracks along the genome by software such as Integrative Genome Browser (IGV) as discussed further. This kind of visualization allows the representation of a single locus at the time, but it could be necessary to summarize the signal of multiple loci simultaneously. To reach this goal, density profiles are the most used method to represent the mean of signal among many loci and compare it for different samples/signals. On the other end, it could be useful to define whether two or more sets of regions show different levels of a signal comparing it region by region.

The general workflow to generate this type of graphs includes the generation of a score matrix for each position of each region and each signal (A). This matrix than can be reshaped in order to calculate the mean/median/sum of the signal among regions in a set (B, density profile) or for each single region (C, density summary. Furthermore, it is possible to compute and visualize the difference of signal between to samples (D, density difference area and/or their correlation). This process is schematized in the following figure and will be discussed in detail in the further paragraphs.

Sequencing signal visualization (Full size image)
(A) The function computeMatrix from deeptools used by computeMatrix.deeptools, in the “reference-point mode”, extends the center of all the regions in one or more datasets (.bed files) and divides each one in n bins of a desired size (eg. 20bp). Then the signal of one or more signals tracks (.bw files) is calculated at each bin and the scores values are stored in a matrix. (B) The function plot.density.profile calculates the mean/median/sum and the variance for each bin of all the regions in a dataset for each track signal provided (left). Then can be generated a plot (right) showing the profile density around the center of the regions. The plot could be generated by sample (a plot per sample with a different curve for each dataset of regions) or by region (a plot per region with a different curve for each sample). (C) The function plot.density.summary computes the mean/median/sum of the signal over each single region for each sample. Then violins plot can be generated to reppresent the distribution of all the single density values of the regions by dataset or sample (simalary to plot.density.profile). The function computes as well the means statistical comparison whose corresponding significance level/p-value can be directly added to the plots. (D) The function plot.density.differences computes the difference of mean/median/sum signal for each single region for each sample in each group. Two graphical rappresentation are available. On top, an area/line plot showing the values of the signal difference for each region which are ranked by these values. On the X-axis it is indicated the number of regions with negative and positive difference. On bottom, a scatter/correlation plot showing the correlation between the mean/median/sum signal of a region in one sample vs an other one. Positive and negative values of the difference in the signal are indicated by different colors. The function infers the significance for the means difference in the groups for all the pair comparisons and returns it as a list of tables for each group.

3.1 Score matrix computation

For the generation of the score matrix I refer to the function computeMatrix from deepTools ². This package provides a function called computeMatrix.deeptools that allows to automatically build the command line required to run the tool on the command line.

Alternatively, but based on a more time consuming algorithm, a fully R-based function, density.matrix, is available with this package (see box below). This function generates a list that can be directly used as input for others functions as described further. The parameters are more or less the same found in the computeMatrix.deeptools function and for this reason only the latter will be described in this guide.

The density.matrix function

This function is completely built in R-code but requires much more time for the execution. This drawback is to attribute to the way by which R handles .bigWig and .bed files. Indeed, a first step required to build a score matrix is to get the bigWig score for each position in each region. This process requires to import a bed file (either from a data.frame or a text file) as a GRanges object by the rtracklayer package (Bioconductor) and get the score position-by-position from a given bigWig. Unfortunately, GRanges objects are quite heavy to handle in R and therefore much more CPU performance and RAM memory are required.

Then, if the chosen modality of matrix computation is “reference-point” all the final regions will have the same length (defined by the upstream and downstream region) and it will be sufficient to split each region score vector (with length n equal to the number of bases of a given region) in chunks of length equal to the required bin size to then calculate the mean/median/sum for each chuck. This process will result in a final score vector of length n/binSize.

In the case the mode selected is instead “scale-regions” the process is more complicated. Indeed, to scale the regions in a perfect way it would require bringing every single region to the body size indicated by the users. This would require calculating the least common multiple (lcm) between the lengths of all the regions and the indicated body length and repeat each base score as much times as lcm/lengthRegion (“base-repetition”). The issue is the size order of the lcm number that makes quite hard the computation in a reasonable time. Hence, the escamotage used in the function is to bring, with the “base-repetition” method, each region to a length ≥ body length and then split in an almost-equal number of chunks. The number of chunks is equivalent to body length. This definition ends often in up to ~5 bases excluded by this splitting in chunks which will be integrated in the last chuck generated be-fore to compute the mean of the scores inside each chunk. At the end of this process to each region corresponds a list of scores of length equivalent to the user-defined body length. Now that all the regions have the same length it is possible to proceed to define the bins scores as described for the “reference-point” mode.

To generate the score matrix, the required arguments are the score/signal files (usually .bigWig/.bw files) and regions files (.bed files). The function build.bed guides the user in the creation of bed files with the option to include header (track line) useful for visualization on IGV or UCSC genome browser (eg. color shadow by region score, track color, track name, etc.). This function allows to export directly the bed file as well save it into a variable as data.frame. Furthermore, many internal controls are automatically performed such as check that END position comes after START position or that all the columns are filled if not provided by the user. Beside the quality controls, the bed could be automatically sorted depending on the genomic position and score³ (by the function sort.bed of this package) and duplicated rows are removed. </div”>

bed <- Rseb::build.bed(chr = paste("chr", c(round(runif(8,1,23)),"X","Y"), sep=""),
                       start = round(runif(10,1,100000)),
                       end = round(runif(10,100001,1000000)),
                       name = paste("peak_", round(runif(10,1,1000)), sep=""),
                       strand = "+",
                       bed.file.name = "/path/to/ouput/file.bed",
                       return.data.frame = T)

**Bed file example generated by `build.bed`**
chr	start	end	name	strand
chr4	27958	855821	peak_328	+
chr7	13607	776836	peak_553	+
chr8	58609	749896	peak_826	+
chr8	60231	542629	peak_927	+
chr10	20591	279132	peak_312	+
chr14	56935	740606	peak_89	+
chr18	20663	777167	peak_417	+
chr23	76796	562839	peak_605	+
chrX	27760	810442	peak_877	+
chrY	44401	632740	peak_424	+

Sometimes it happens that in a bed file are present overlapping or adjacent regions that we would like to merge in order to obtain an unique one composed by the fusion of the original regions (e.g., adjacent ChIP-seq peaks). To do that, in this package is available the collapse.bed function. This function allows to specify whether the merge should take into account the region strandness (the merge will be applied only if the regions are in the same strand), or to merge the regions in only one strand. Furthermore, the output file consists in a .bed⁴ in which the region names will correspond to the concatenation of the names of the original regions that served to produced the overlapped region, and the scores (if available) will correspond to the median/mean/sum of the parental scores.

Once built and/or available the region and signal files it is possible to generate the score matrix. The key parameters are⁵:

mode:
- “reference-point” to center all the regions on a specific position
- “scale-regions” to re-size all the regions in order to make correspond the extremities of the regions;
scoreFileName: a vector with the full path to the signal file(s);
regionsFileName: a vector with the full path to the regions file(s);
outFileName: the full path for the output matrix archive (.gz);
upstream|downstream: to indicate the number of bases to consider before (upstream)/after (downstream) the reference point/region extremities depending on the mode used;
binSize: the window size by which each region will be subdivided and for each of which the scores will be calculate;
computeMatrix.deeptools.command: the path to the computeMatrix script (here an example for conda installed deepTools).

computeMatrix.deeptools(
  mode = "reference-point",
  scoreFileName = c("path/to/signal_1.bw", "path/to/signal_3.bw"),
  regionsFileName = c("path/to/regions_1.bed", "path/to/regions_2.bed", "path/to/regions_3.bed"),
  outFileName = "/path/to/matrix_file.gz",
  upstream = 2000, #bp
  downstream = 2000, #bp
  binSize = 50, #bp
  referencePoint = "center",
  smartLabels = T,
  missingDataAsZero = T,
  computeMatrix.deeptools.command = "/home/user/anaconda3/bin/computeMatrix",
  quiet = T)

Score matrix from deepTools can be used directly (full path) to generate density plots or can be read by the function read.computeMatrix.file which returns a list composed of a data.frame containing the matrix metadata (bin size, regions size, sample_names, region_names, etc.) and the scores table. To plot the signals it can be used this list as well as the path of the source matrix.gz archive.

matrix <- read.computeMatrix.file(matrix.file = "/path/to/matrix_file.gz")

An example of deepTools matrix is available with the package:

data("deeptools.matrix", package = "Rseb")
deeptools.matrix$metadata

**Example of `read.computeMatrix.file` result: metadata**
parameters	values
upstream	1000,1000
downstream	1000,1000
body	0,0
bin_size	50,50
ref_point	center,center
verbose	false
bin_avg_type	mean
missing_data_as_zero	false
min_threshold	null
max_threshold	null
scale	1.0
skip_zeros	false
nan_after_end	false
proc_number	8
sort_regions	keep
sort_using	mean
unscaled_5_prime	0,0
unscaled_3_prime	0,0
group_labels	Set1,Set2,Set3,Set4
group_boundaries	0,75,143,417,606
sample_labels	Sample1,Sample2
sample_boundaries	0,40,80

deeptools.matrix$data.table

**Example of `read.computeMatrix.file` result: matrix.data**
V1	V2	V3	V4	V6	V7	V8	V9	V10
chr7	27447368	27451368	Blvrb	+	82.929848	69.573706	56.438229	60.928720
chr5	34962773	34966773	34964773	+	0.000015	1.960157	4.962074	6.935999
chr10	40126720	40130720	Slc16a10_r1	-	4.870655	4.261194	7.425745	11.794807
chr11	97410467	97414467	Arhgap23	+	6.212588	6.615168	7.060051	8.687984
chr4	116992413	116996413	116994413	-	70.210929	54.316755	23.938043	17.708667
chr13	91459050	91463050	91461050	+	36.472908	52.331799	59.001591	73.745193
chr10	62229251	62233251	62231251	-	2.673625	0.050686	0.000015	0.000015
chr13	80884021	80888021	Arrdc3_r1	+	102.346816	115.157184	109.947098	92.705473
chr2	150654410	150658410	Acss1	-	5.784847	1.156981	0.000015	2.026873
chr11	117972876	117976876	Socs3	-	7.654683	2.713264	3.315623	5.784847

deeptools.matrix$original.file.path
> [1] "/user/path/to/matrix.deepTools.gz"

3.2 Density profile

High-throughput sequencing data allow the detection of different types of signals at genome-wide scale as well at a single locus as described further. A common analysis involves the comparison of the average signal over different categories of regions (e.g., Transcription Staring Site (TSS) vs Enhancers, Activated vs Repressed genes, etc.). For this aim the average signal over regions of the same group can be plotted and compared sample-by-sample or group-by-group. To generate this kind of signal profile it can be used the plot.density.profile/plot.density.profile.smooth function which accepts as input a score matrix as described hereafter. Noteworthy, it is possible to narrow the final plot to a specific range by the option “subset.range” without the need to re-compute the matrix.

The function returns a list with the scores table, a list of each single plot and a single multiplot with the combination of all the single plots. The latter can be directly exported by passing the full output path to the option “multiplot.export.file”, while the option “print.multiplot” allows to show the multiplot output without call/print the variable (useful to save time during the parameters test phase).

Here an example of the output list structure:

> List of 4
>  $ data.table:'data.frame':   320 obs. of  8 variables:
>  $ metadata  :'data.frame':   22 obs. of  2 variables:
>  $ plot.list :List of 4
>  $ multiplot :List of 9

# Load the example matrix list
data("deeptools.matrix", package = "Rseb")

# Generate the density profile by group
density.profile.by.group <-
  Rseb::plot.density.profile(
    matrix.file = deeptools.matrix,
    signal.type = "mean",
    error.type = "sem",
    plot.by.group = T,
    missing.data.as.zero = T,
    y.identical.auto = T,
    text.size = 10,
    axis.line.width = 0.25,
    line.width = 0.5,
    plot.error = T,
    write.reference.points = T,
    plot.vertical.lines = F,
    colors = c("steelblue", "mediumseagreen"),
    n.row.multiplot = 2,
    by.row = T,
    y.lab = "Mean ChIP-seq signal \u00b1 SEM")

# Generate the density profile by sample
density.profile.by.sample <-
  Rseb::plot.density.profile(
    matrix.file = deeptools.matrix,
    signal.type = "mean",
    error.type = "sem",
    plot.by.group = FALSE,
    missing.data.as.zero = TRUE,
    y.identical.auto = TRUE,
    text.size = 10,
    axis.line.width = 0.25,
    line.width = 0.5,
    plot.error = TRUE,
    write.reference.points = TRUE,
    plot.vertical.lines = FALSE,
    colors = rep(c("indianred", "darkgoldenrod2"), 2),
    line.type = rep(c("solid", "dotted"), each = 2),
    n.row.multiplot = 1,
    legend.position = c(0.2, 0.75),
    y.lab = "Mean ChIP-seq signal \u00b1 SEM")


cowplot::plot_grid(density.profile.by.group$multiplot,
                   density.profile.by.sample$multiplot,
                   nrow = 2, labels = "AUTO", rel_heights = c(2, 1))

Density profile
Example of density plot profiles of ChIP-seq normalized signal around the center of four different dataset regions for each sample (A) or group (B).

3.3 Quantification of the signal differences

With the density profiles it is possible to qualitatively estimate the difference of signal between two or more groups of regions, or samples, over the same region type. Albeit sometimes the difference between two datasets is obvious, other cases might require a more precises quantification and statistical evaluation of the difference.

Therefore, below in this section, three different methods to infer and visualize the difference between two samples will be explored: the density summary, the area difference and the correlation plot.
All these methods import a score matrix generated, for instance, by deepTools or by the density.matrix function.

Noteworthy, in all cases it is possible to carryout the analyses narrowing down the range on which compute the scores by specifying the range to use in the “subset.range” parameter without the need to re-compute the matrix.

3.3.1 Statistics of the signal

Both the two analyses that will be addressed in the next paragraphs, beside the graphics, provide a table with the p-values for the comparisons. Indeed, two types of comparisons could be carried out:

region-by-region: in this case each score of one signal over a single region is compared with the score of an other signal on the same region in a paired-test.
group-by-group/sample-by-sample: in this case all the scores on all the regions for one group/sample can be compared with those of an other group/sample in an un-paired test.

For both this two methods a t-test (parametric) or Wilcoxon-test (non parametric) could be automatically performed, whereas a test on the distributions cold be manually performed by Kruskal-Wallis test using the values stored in the data.table in the output.

3.3.2 Density summary

The first type of representation is generated by the function plot.density.summary which shows, through a violin plot, the distribution of the scores over each region in a given group/sample. An option allows to add a mean symbol with the corresponding SD/SEM. Furthermore, the statistical comparison could be added directly above each violin plot either as “stars” or numeric format. As well as for a boxplot, on each violin plot three lines indicate respectively the 25^th, 50^th and 75^th percentile of the data distribution, respectively.

Since the original size of the regions could not be the same it, is preferable to use as score on which compute the analyses the sum of the score over each region (equivalent to the area under the curve of a density profile).

The output of this function is composed of a list of single plots, as well as a multiple plot (multiplot) with the distribution of the scores of the different samples per each group or alternatively the scores on different regions per each sample or group. The output list provides a summary plot with all the region groups for each sample, or alternatively all the samples for each region group. Finally, also a table with all the comparisons with the respective significance is available.

Here an example of the output list structure:

> List of 8
>  $ data.table          :'data.frame': 1212 obs. of  13 variables:
>  $ metadata            :'data.frame': 22 obs. of  2 variables:
>  $ plot.list           :List of 4
>  $ density.profile     :List of 9
>  $ multiplot           :List of 9
>  $ summary.plot.samples:List of 9
>  $ summary.plot.regions:List of 9
>  $ means.comparisons   :'data.frame': 4 obs. of  10 variables:

# Load the example matrix list
data("deeptools.matrix", package = "Rseb")

# Generate the density profile by group
summary.plot.by.group <-
    Rseb::plot.density.summary(
      matrix.file = deeptools.matrix,
      plot.by.group = TRUE,
      missing.data.as.zero = TRUE,
      signal.type = "sum",
      stat.paired = FALSE,
      stat.hide.ns = FALSE,
      axis.line.width = 0.5,
      mean.symbol.size = 0.2,
      y.identical.auto = TRUE,
      text.size = 10,
      border.width = 0.25,
      subset.range = c(-1000, 1000), #bp from 0 point
      colors = c("steelblue", "mediumseagreen"),
      legend.position = "right",
      n.row.multiplot = 2,
      by.row = TRUE)

# Generate the density profile by sample
summary.plot.by.sample <-
    Rseb::plot.density.summary(
      matrix.file = deeptools.matrix,
      plot.by.group = FALSE,
      missing.data.as.zero = TRUE,
      signal.type = "sum",
      stat.paired = FALSE,
      stat.hide.ns = FALSE,
      axis.line.width = 0.5,
      mean.symbol.size = 0.2,
      y.identical.auto = TRUE,
      text.size = 10,
      border.width = 0.25,
      subset.range = c(-1000, 1000), #bp from 0 point
      colors = c("steelblue", "mediumseagreen"),
      legend.position = "right",
      n.row.multiplot = 2,
      by.row = TRUE)

cowplot::plot_grid(summary.plot.by.group$multiplot,
                   summary.plot.by.sample$multiplot,
                   nrow = 1, labels = "AUTO", rel_widths = c(3, 2))

<div style="text-align: justify"> <font size="-0.5"> **Density summary plot** <br> Violin plots indicate the distribution of signal **(A)** for each dataset comparing all the samples in each dataset, or **(B)** for each sample comparing all the datasets. Horizontal bars with stars indicate the means comparison perfomerd by paired Wilcoxon test. <i>P</i>* < 0.05, <i>P</i>** < 0.01, <i>P</i>*** < 0.001, <i>P</i>**** < 0.0001, *ns* = not significant. Blue dots with bars indicate the value of the mean ± SEM. </font> </div> <br>

Density summary plot
Violin plots indicate the distribution of signal (A) for each dataset comparing all the samples in each dataset, or (B) for each sample comparing all the datasets. Horizontal bars with stars indicate the means comparison perfomerd by paired Wilcoxon test. P* < 0.05, P** < 0.01, P*** < 0.001, P**** < 0.0001, ns = not significant. Blue dots with bars indicate the value of the mean ± SEM.

cowplot::plot_grid(summary.plot.by.group$summary.plot.samples,
                   summary.plot.by.group$summary.plot.regions,
                   nrow = 2, labels = "AUTO")

<div style="text-align: justify"> <font size="-0.5"> **Combined density summary plot** <br> Violin plots indicate the distribution of signal for all datasetes for all the samples comparing **(A)** the datasetes for each sample or **(B)** the samples for each dataset. Blue dots with bars indicate the value of the mean ± SEM. </font> </div> <br>

Combined density summary plot
Violin plots indicate the distribution of signal for all datasetes for all the samples comparing (A) the datasetes for each sample or (B) the samples for each dataset. Blue dots with bars indicate the value of the mean ± SEM.

summary.plot.by.sample$means.comparisons

**Example of comparisons table for the plots by sample**
sample	signal.type	region.1	region.2	p	p.adj	p.format	p.signif	method	paired
Sample1	sum	Set1	Set2	0.0000000	0.0e+00	1.5e-14	****	Wilcoxon	FALSE
Sample1	sum	Set1	Set3	0.0000000	0.0e+00	< 2e-16	****	Wilcoxon	FALSE
Sample1	sum	Set1	Set4	0.0000000	0.0e+00	< 2e-16	****	Wilcoxon	FALSE
Sample1	sum	Set2	Set3	0.5878520	5.9e-01	0.5879	ns	Wilcoxon	FALSE
Sample1	sum	Set2	Set4	0.0954291	1.9e-01	0.0954	ns	Wilcoxon	FALSE
Sample1	sum	Set3	Set4	0.0002012	6.0e-04	0.0002	***	Wilcoxon	FALSE
Sample2	sum	Set1	Set2	0.0000000	0.0e+00	1.8e-15	****	Wilcoxon	FALSE
Sample2	sum	Set1	Set3	0.0000000	0.0e+00	1.4e-14	****	Wilcoxon	FALSE
Sample2	sum	Set1	Set4	0.0000000	0.0e+00	< 2e-16	****	Wilcoxon	FALSE
Sample2	sum	Set2	Set3	0.0056998	1.1e-02	0.0057	**	Wilcoxon	FALSE
Sample2	sum	Set2	Set4	0.1623257	1.6e-01	0.1623	ns	Wilcoxon	FALSE
Sample2	sum	Set3	Set4	0.0000000	1.0e-07	2.8e-08	****	Wilcoxon	FALSE

3.3.3 Density differences

In this section two methods to represent the difference of signal between two conditions region-by-region will be described. Both methods, that involve the generation of a correlation/scatter plot and an area/line plot, can be computed by the function plot.density.differences. Also in this case the input of the function is a score matrix generated by computeMatrix.deeptools or density.matrix and it is possible to scale-down to a specific range the final plot by the option “subset.range” without the need to re-compute the matrix..

3.3.3.1 Signal correlation

On method by which it is possible to visualize the difference between two samples is to correlate the signal for each given region in the two conditions, generating in this way a correlation/scatter plot. Dots corresponding to regions with no difference between two conditions will lie on the diagonal line (defined by the equation $y = x$), while regions with a signal difference will be placed in the half-quadrant of one or the other axis (the two conditions) depending on which condition has a stronger signal on the given region. To generate this kind of plots it is possible to use the function as described below.

# Load the example matrix list
data("deeptools.matrix", package = "Rseb")

# Generate the correlation.plot
correlation.plot <-
  Rseb::plot.density.differences(matrix.file = deeptools.matrix,
                                 inverted.comparisons = T,
                                 missing.data.as.zero = T,
                                 signal.type = "sum",
                                 stat.paired = T,
                                 points.size = 0.25,
                                 axis.line.width = 0.25,
                                 area.y.identical.auto = T,
                                 text.size = 10,
                                 correlation.show.equation = T,
                                 correlation.correlation.line.width = 0.5,
                                 correlation.correlation.line.color = "#ED8141",
                                 subset.range = c(-1000, 1000), #bp from 0 point
                                 colors = c("steelblue", "mediumseagreen", "purple"),
                                 legend.position = c(0.8, 0.25),
                                 error.type = "sem")

cowplot::plot_grid(plotlist = Rseb::uniform.y.axis(Rseb::uniform.x.axis(correlation.plot$correlation.plot.byGroup.list)),
                   nrow = 2, byrow = T)

<div style="text-align: justify"> <font size="-0.5"> **Correlation plot** <br> Correlation of the signal over each region in each datasetes for one sample is correlated with the signal in the same region in the second sample. Blue dots correspond to regions with a signal higher in Sample1 while green dots indicate regions with a higher signal in Sample2, while regions whose signal is unchanged among the two samples are indicated in violet. The diagonal dashed line corresponds to the $y = x$ equation. Scores of each dot are projectied perpendicularly to each axis and indicated by a green/blue line. Orange line indicates the correlation function that exists between the two samples with the relative SEM. Correlation equation and R-squared for the regression are indicated on the plot. </font> </div> <br>

Correlation plot
Correlation of the signal over each region in each datasetes for one sample is correlated with the signal in the same region in the second sample. Blue dots correspond to regions with a signal higher in Sample1 while green dots indicate regions with a higher signal in Sample2, while regions whose signal is unchanged among the two samples are indicated in violet. The diagonal dashed line corresponds to the $y = x$ equation. Scores of each dot are projectied perpendicularly to each axis and indicated by a green/blue line. Orange line indicates the correlation function that exists between the two samples with the relative SEM. Correlation equation and R-squared for the regression are indicated on the plot.

3.3.3.2 Signal difference

As already mentioned above, another way to quantify the difference between two signals is literally the “difference”, intended as subtraction, between the signal over a region (equivalent to the area under the curve in a density profile but for only one region) in one condition and another. This could be computed region-by-region by the function plot.density.differences, but in this case instead of the population shift the regions are ranked by the value of the area/signal subtraction. Then the closest to 0 (“no difference”) rank position is calculated and its value subtracted to all the rank positions. In this way on the plot it is possible to know the number of regions with a negative (lower signal) or positive (higher signal) score. This representation allows firstly to know how many regions have an enriched signal (Sample 1) compared to those that have an enrichment of the other compared signal (Sample 2), secondly setting the axis to the same ranges it is possible to observe whether de differences between two samples for a region set compared to another are wide or rather narrowed around the 0 (no difference between the samples signal), beside eventual outliers that could distort the mean in the density profile.

# Load the example matrix list
data("deeptools.matrix", package = "Rseb")

# Generate the area.plot
areaDifference.plot <-
  Rseb::plot.density.differences(matrix.file = deeptools.matrix,
                                 inverted.comparisons = T,
                                 missing.data.as.zero = T,
                                 signal.type = "sum",
                                 stat.paired = T,
                                 points.size = 0.25,
                                 axis.line.width = 0.25,
                                 area.y.identical.auto = T,
                                 text.size = 10,
                                 subset.range = c(-1000, 1000), #bp from 0 point
                                 colors = c("steelblue", "mediumseagreen", "purple"),
                                 legend.position = c(0.2, 0.80),
                                 error.type = "sem")

cowplot::plot_grid(plotlist = Rseb::uniform.y.axis(areaDifference.plot$area.plot.byGroup.list),
                   nrow = 2, byrow = T)

<div style="text-align: justify"> <font size="-0.5"> **Area plot** <br> Distribution of the difference between Sample1 (blue) and Sample2 (green) of the signals within each region for each dataset. The dots represent the signal difference value over each region. Blue and green dots signal intensity higher in Sample1 and Sample2 respectively. Regions with unchanged signal among the two samples are indicated in purple. </font> </div> <br>

Area plot
Distribution of the difference between Sample1 (blue) and Sample2 (green) of the signals within each region for each dataset. The dots represent the signal difference value over each region. Blue and green dots signal intensity higher in Sample1 and Sample2 respectively. Regions with unchanged signal among the two samples are indicated in purple.

4 IGV script

The previous paragraphs described how to visualize the signal, or the difference of signal, over multiple regions. In this section instead I will focus on the generation of a script that can be run in the Integrative Genome Browser (IGV ⁶) software in order to generate a large number of snapshots of multiple loci for the desired signal tracks (.bw files) and/or regions (.bed files) automatically.

The function IGVsnap accepts as input both a list of gene symbols (the genomic localization will be retrieved using biomaRt) or a list of loci in any format accepted by IGV. As output will be generated a text file with a script that can be run on IGV: IGV > Tools > Run Batch Script… > chose your batch script file generated by IGVsnap. To chose the tracks that will be used there are two possibilities:

specify an IGV session file (.xml) to be opened before the generation of the snapshots automatically when the script is run;
run the batch script only after have opened all the required tracks on IGV.

# Define the gene list
gene_list <- c("Spi1", "Idh1", "Bcl2l11", "Mcl1", "Polr2a", "Hdac1")

# Generate the script
Rseb::IGVsnap(loci_vector = gene_list,
              input_type = "genes",
              biomart = "ensembl",
              dataset = "mmusculus_gene_ensembl",
              reference_genome = "mm10",
              fivePrime = 1000,
              threePrime = 1000,
              snap_names = gene_list,
              IGV_batch_file = "path/to/output/script.txt",
              snap_directory = "path/to/directory/in/wich/the/images/will/be/generated",
              snap_image_format = "png",
              maxPanelHeight = 1500)

5 Real-Time qPCR analyses for RNA expression

This tool is designed to analyse RNA expression RT-qPCR data. In particular it is based on the output of RT-qPCR machines of the Applied Biosystems. Indeed it is possible to use the function qPCR.rna.exp directly on results in excel file format generated by the qPCR devices (it could be required to indicate the position number of the excel sheet to use, using the parameter results.sheet.position). However, it is also possible to provide a data.frame with at least two columns: Target Name (e.g., primers), Sample Name (e.g. samples), CT [all must include space and capital letters as indicated].

A mandatory parameter is the list of “housekeeping” genes that will be used for the computation of the normalized expression. On the other hand, also fold changes (FC) over a reference sample are computed. The reference sample if it is not provided by the user will be automatically defined as the first one appearing in the list of all the samples. Notably, specific samples (e.g., blank the negative controls) could be excluded using the parameter exclude.samples.

In the list resulting form the analyses there is a plot showing the technical replicates detected in the data. Technical replicates that as a difference greater than a certain threshold could be excluded setting the parameter max.delta.reps (default value 0.5)

5.1 How to run the analyses

# Load the data
data_to_analyse <- Rseb::qPCR.results.rep1

print(data_to_analyse)

**Example of RT-qPCR data.frame for RNA expression analyses**
Well	Well Position	Omit	Sample Name	Target Name	Task	Reporter	Quencher	Quantity	Quantity Mean	Quantity SD	RQ	RQ Min	RQ Max	CT	Ct Mean	Ct SD	Delta Ct	Delta Ct Mean	Delta Ct SE	Delta Delta Ct	Tm1	Automatic Ct Threshold	Tm2	Ct Threshold	Tm3	Automatic Baseline	Baseline Start	Baseline End	Comments	Amp Score	Cq Conf	HIGHSD	OUTLIERRG	MTP
1	A1	FALSE	Ctrl	geneA	UNKNOWN	SYBR	None	NA	NA	NA	1.000000	0.8043882	1.243181	26.44833	26.36080	0.1195739	NA	2.033886	0.1131073	0.0000000	82.29413	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.186632	0.9885684	N	N	N
2	A2	FALSE	Ctrl	geneA	UNKNOWN	SYBR	None	NA	NA	NA	1.000000	0.8043882	1.243181	26.40951	26.36080	0.1195739	NA	2.033886	0.1131073	0.0000000	82.43059	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.159700	0.9946904	N	N	N
3	A3	FALSE	Ctrl	geneA	UNKNOWN	SYBR	None	NA	NA	NA	1.000000	0.8043882	1.243181	26.22456	26.36080	0.1195739	NA	2.033886	0.1131073	0.0000000	82.43059	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.179218	0.9952576	N	N	N
25	B1	FALSE	Ctrl	geneB	UNKNOWN	SYBR	None	NA	NA	NA	NA	NA	NA	24.34007	24.32691	0.1551834	NA	NA	NA	NA	80.24725	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.207761	0.9925219	N	N	N
26	B2	FALSE	Ctrl	geneB	UNKNOWN	SYBR	None	NA	NA	NA	NA	NA	NA	24.47509	24.32691	0.1551834	NA	NA	NA	NA	80.38371	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.187331	0.9947301	N	N	N
27	B3	FALSE	Ctrl	geneB	UNKNOWN	SYBR	None	NA	NA	NA	NA	NA	NA	24.16557	24.32691	0.1551834	NA	NA	NA	NA	80.38371	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.187497	0.9944257	N	N	N
49	C1	FALSE	Ctrl	housekeeping	UNKNOWN	SYBR	None	NA	NA	NA	1.000000	0.8302051	1.204522	28.44176	28.51381	0.0629779	NA	4.186895	0.0966921	0.0000000	79.29204	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.174745	0.9953029	N	N	N
50	C2	FALSE	Ctrl	housekeeping	UNKNOWN	SYBR	None	NA	NA	NA	1.000000	0.8302051	1.204522	28.55836	28.51381	0.0629779	NA	4.186895	0.0966921	0.0000000	79.15558	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.183335	0.9955710	N	N	N
51	C3	FALSE	Ctrl	housekeeping	UNKNOWN	SYBR	None	NA	NA	NA	1.000000	0.8302051	1.204522	28.54130	28.51381	0.0629779	NA	4.186895	0.0966921	0.0000000	79.29204	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.152569	0.9954020	N	N	N
4	A4	FALSE	sample_1	geneA	UNKNOWN	SYBR	None	NA	NA	NA	1.598307	1.3542683	1.886322	24.91835	24.92879	0.0304175	NA	1.357341	0.0860925	-0.6765448	82.56705	FALSE	NA	2e+05	NA	TRUE	1	45	NA	3.171771	0.9965763	N	N	N

# Run the analyses
analyses <- Rseb::qPCR.rna.exp(results.file = data_to_analyse,
                               housekeeping.genes = c("geneB", "housekeeping"),
                               reference.sample = "Ctrl",
                               max.delta.reps = 0.5)

5.2 Interpretation of the output

The structure of the analyzed data is the following:

>  analyses
>    ¦--original.table
>    ¦--reshaped.table
>    ¦--reshaped.table.cleaned
>    ¦--reps.validation.plot
>    ¦--analyzed.data
>    ¦   ¦--geneB
>    ¦   ¦--housekeeping
>    ¦   °--mean_FC_housekeeping
>    ¦--expression.plots
>    ¦   ¦--geneB
>    ¦   °--housekeeping
>    °--foldChange.plots
>        ¦--geneB
>        ¦--housekeeping
>        °--mean_FC_housekeeping

The original.table is the input table, while the reshaped.table contains the replicates and their deltas. The reshaped.table.cleaned table instead is a filtered version without the discarded replicates.

**RT-qPCR analyses results: `reshaped.table.cleaned`**
Sample Name	Target Name	rep1	rep2	rep3	rep1-rep2	rep1-rep3	rep2-rep3
Ctrl	geneA	26.4483280181885	26.4095058441162	26.2245559692383	0.0388222	0.2237720	0.1849499
Ctrl	geneB	24.3400726318359	24.4750938415527	24.1655654907227	0.1350212	0.1745071	0.3095284
Ctrl	housekeeping	28.4417552947998	28.5583629608154	28.5412979125977	0.1166077	0.0995426	0.0170650
sample_1	geneA	24.9183483123779	24.9630489349365	24.9049625396729	0.0447006	0.0133858	0.0580864
sample_1	geneB	23.6467819213867	23.6643676757812	23.4031867980957	0.0175858	0.2435951	0.2611809
sample_1	housekeeping	25.4094772338867	25.3954696655273	25.6484184265137	0.0140076	0.2389412	0.2529488
sample_2	geneA	25.5913791656494	25.4916076660156	25.5847873687744	0.0997715	0.0065918	0.0931797
sample_2	geneB	24.4471492767334	24.5785312652588	24.5931491851807	0.1313820	0.1459999	0.0146179
sample_2	housekeeping	27.7569007873535	27.7141723632812	27.7601623535156	0.0427284	0.0032616	0.0459900
sample_3	geneA	25.4794883728027	25.1694927215576	25.0211524963379	0.3099957	0.4583359	0.1483402
sample_3	geneB	23.2305488586426	23.0962467193604	22.9340419769287	0.1343021	0.2965069	0.1622047
sample_3	housekeeping	26.6894512176514	26.9485759735107	26.8660678863525	0.2591248	0.1766167	0.0825081
sample_4	geneA	NA	NA	NA	0.1775093	0.4670792	0.6445885
sample_4	geneB	24.0955085754395	23.9845962524414	23.8393287658691	0.1109123	0.2561798	0.1452675
sample_4	housekeeping	28.0180683135986	27.8829364776611	27.6275424957275	0.1351318	0.3905258	0.2553940

The replicate deltas can be inspected in the reps.validation.plot. The red background indicates that the corresponding replicate comparison did not pass the threshold test defined by max.delta.reps.

<div style="text-align: justify"> <font size="-0.5"> **Replicates validation plot** <br> The difference of all the replicate, two-by-two, is computed for each Sample/Target combination. Red boxes indicate that the corresponding replicate comparison did not pass the threshold test defined by `max.delta.reps`. </font> </div> <br>

Replicates validation plot
The difference of all the replicate, two-by-two, is computed for each Sample/Target combination. Red boxes indicate that the corresponding replicate comparison did not pass the threshold test defined by max.delta.reps.

The analyzed.data branch is a list of tables, one for each housekeeping and an extra one for the mean of all housekeeping genes. Each table shows the replicate CT values, the replicate deltas, the mean of the (filtered) CTs, the calculated expression normalized to the corresponding houskeeping gene and the Fold Change (FC) of expression over the reference.sample.

**RT-qPCR analyses results: `analyzed.data$housekeeping`**
Sample Name	Target Name	rep1	rep2	rep3	rep1-rep2	rep1-rep3	rep2-rep3	CT_reps_mean	exp_norm_to_housekeeping	FC_over_Ctrl
Ctrl	geneA	26.44833	26.40951	26.22456	0.0388222	0.2237720	0.1849499	26.36080	4.447544	1.0000000
sample_1	geneA	24.91835	24.96305	24.90496	0.0447006	0.0133858	0.0580864	24.92879	1.469850	0.3304857
sample_2	geneA	25.59138	25.49161	25.58479	0.0997715	0.0065918	0.0931797	25.55592	4.556166	1.0244231
sample_3	geneA	25.47949	25.16949	25.02115	0.3099957	0.4583359	0.1483402	25.22338	3.055314	0.6869665
sample_4	geneA	NA	NA	NA	0.1775093	0.4670792	0.6445885	NaN	NaN	NaN
sample_5	geneA	27.55612	27.26831	27.37729	0.2878075	0.1788254	0.1089821	27.40057	10.228913	2.2999015
sample_6	geneA	25.51902	25.50203	25.30318	0.0169945	0.2158375	0.1988430	25.44141	14.340681	3.2244048
sample_7	geneA	26.62583	26.64904	26.93015	0.0232048	0.3043194	0.2811146	26.73501	NaN	NaN
sample_8	geneA	28.93781	28.65683	28.59197	0.2809811	0.3458405	0.0648594	28.72887	15.897693	3.5744884
sample_9	geneA	26.72434	26.69491	26.48433	0.0294304	0.2400131	0.2105827	26.63453	8.464328	1.9031466
sample_10	geneA	29.87918	30.31601	30.02470	0.4368229	0.1455116	0.2913113	30.07330	1.966096	0.4420633
sample_11	geneA	29.26953	29.26442	NA	0.0051041	0.7235126	0.7184086	29.26697	NaN	NaN
sample_12	geneA	27.87499	27.71922	27.84903	0.1557713	0.0259647	0.1298065	27.81442	6.892375	1.5497036
Ctrl	geneB	24.34007	24.47509	24.16557	0.1350212	0.1745071	0.3095284	24.32691	18.212975	1.0000000
sample_1	geneB	23.64678	23.66437	23.40319	0.0175858	0.2435951	0.2611809	23.57145	3.765939	0.2067723

The results then are plotted either as normalized expression to each housekeeping gene (A) or the Fold Change expression of the reference.sample (B). Furthermore, also a mean of the fold changes obtained from the data normalized to each housekeeping genes is plotted as well (C).

<div style="text-align: justify"> <font size="-0.5"> **Expression and Fold Change plots** <br> **(A)** Expression normalized to one of the housekeeping genes. The numbers above the histograms indicate the values of the graph. The housekeeping gene is removed by default but it can be included setting `exclude.housekeeping.FC=FALSE`. **(B)** Similar plot than in B is showed for the Fold Change expression over the `reference.sample` for each normalization. **(C)** The mean of the fold changes obtained from the data normalized to each housekeeping genes is plotted. The mean value is idicated above each histogram and the bars indicate the SEM. </font> </div> <br>

Expression and Fold Change plots
(A) Expression normalized to one of the housekeeping genes. The numbers above the histograms indicate the values of the graph. The housekeeping gene is removed by default but it can be included setting exclude.housekeeping.FC=FALSE. (B) Similar plot than in B is showed for the Fold Change expression over the reference.sample for each normalization. (C) The mean of the fold changes obtained from the data normalized to each housekeeping genes is plotted. The mean value is idicated above each histogram and the bars indicate the SEM.

5.3 Merge multiple experiments

In most of the cases, it is required to compute the average among multiple experiments. To achieve that, a list of qPCR.ran.exp objects can be submitted to the function qPCR.rna.mean.reps.

The function will automatically define which are the housekeeping genes and the reference sample using the list names or the column names of the tables in the qPCR.ran.exp objects. However, if the reference sample used are multiple and/or not shared among replicates, the first sample in the order is used as reference. The reference sample can be also redefined manually during the averaging: all the Fold Changes will be recomputed. In all cases a message indicating which reference sample has been used is printed at each time (it will be always the first sample in the fold change plots).

Also in this case, specific samples can be excluded from the analyses (e.g., blank negative control).

In the following example two different analyses are run starting from the same data (rep1 and rep2). Then the two replicates are merged by qPCR.rna.mean.reps.

# Run the analyses for the two replicates
rep1 <- Rseb::qPCR.rna.exp(results.file = Rseb::qPCR.results.rep1,
                           housekeeping.genes = c("geneB", "housekeeping"),
                           reference.sample = "Ctrl",
                           max.delta.reps = 0.5)

rep2 <- Rseb::qPCR.rna.exp(results.file = Rseb::qPCR.results.rep2,
                           housekeeping.genes = "housekeeping",
                           reference.sample = "sample_1",
                           max.delta.reps = 0.3)


# Compute the average of the two replicates
mean_reps <- Rseb::qPCR.rna.mean.reps(reps.list = list(rep1, rep2))
> The 'reference.sample' has been automatically detected and set to: 'Ctrl'.

The structure of the output is similar to that of qPCR.ran.exp:

>  mean_reps
>    ¦--mean.reps.data.table
>    ¦   ¦--geneB
>    ¦   ¦--housekeeping
>    ¦   °--mean_FC.means
>    ¦--mean.exp.plots
>    ¦   ¦--geneB
>    ¦   °--housekeeping
>    °--mean.reps.FC.plots
>        ¦--geneB
>        ¦--housekeeping
>        °--mean_FC.means

However the analyzed data tables (mean.reps.data.table) are simplified showing only the means (exp = normalized expression, FC = Fold Change) and statistics without the single values of each CT in each replicate:

**RT-qPCR merged replicates: `mean.reps.data.table$housekeeping`**
Sample Name	Target Name	n	mean_exp	SD_exp	SEM_exp	mean_FC	SD_FC	SEM_FC
Ctrl	geneA	2	4.805451	0.5061572	0.3579072	1.0000000	0.0000000	0.0000000
sample_1	geneA	2	1.498708	0.0408122	0.0288586	0.3131666	0.0244929	0.0173191
sample_2	geneA	2	4.929204	0.5275550	0.3730377	1.0256605	0.0017500	0.0012374
sample_3	geneA	2	3.411023	0.5030486	0.3557091	0.7082393	0.0300842	0.0212727
sample_5	geneA	1	10.228913	NA	NA	2.2999015	NA	NA
sample_6	geneA	1	14.340681	NA	NA	3.2244048	NA	NA
sample_7	geneA	1	7.916338	NA	NA	1.5331763	NA	NA
sample_8	geneA	2	19.299767	4.8112588	3.4020737	3.9856043	0.5814057	0.4111159
sample_9	geneA	2	9.472048	1.4251323	1.0077207	1.9663943	0.0894458	0.0632478
sample_10	geneA	2	2.136820	0.2414404	0.1707242	0.4444854	0.0034255	0.0024222
sample_11	geneA	1	3.130685	NA	NA	0.6063273	NA	NA
sample_12	geneA	1	6.892375	NA	NA	1.5497036	NA	NA
Ctrl	geneB	1	18.212975	NA	NA	1.0000000	NA	NA
sample_1	geneB	2	4.032931	0.3775832	0.2669917	0.2067723	NA	NA
sample_2	geneB	2	10.361917	1.6206207	1.1459519	0.5060110	NA	NA

The plots are similar as well, but now also the number of values in each bar of gra[h are indicated since the number of samples and/or housekeeping genes could be different among the replicates:

<div style="text-align: justify"> <font size="-0.5"> **Expression and Fold Change plots of merged replicates** <br> **(A)** Expression normalized to one of the housekeeping genes. **(B)** Similar plot than in B is showed for the Fold Change expression over the `reference.sample` for each normalization. <br> The numbers above the histograms indicate the mean values of the graph, while the numbers on the bottom the sample size. The error bars indicate the SEM. </font> </div> <br>

Expression and Fold Change plots of merged replicates
(A) Expression normalized to one of the housekeeping genes. (B) Similar plot than in B is showed for the Fold Change expression over the reference.sample for each normalization.
The numbers above the histograms indicate the mean values of the graph, while the numbers on the bottom the sample size. The error bars indicate the SEM.

6 Generic tools

This package provides functions that can simplify routine task such as axis equal re-scale among a list of plots or filter lines of a data.frame in a shell-like method, and many others.

6.1 Uniform plot axes

In the previous paragraphs the functions uniform.x.axis and uniform.y.axis have been used to re-format and uniform the axis among a list of plots. These functions allow to re-set the maximum, minimum, ticks interval and number formatting of the plot axes.

uniformed.list.of.plots <- Rseb::uniform.x.axis(plot.list = list.of.plots,
                                                x.min = 0,
                                                x.max = NA,
                                                ticks.each = 0.5,
                                                digits = 1)

For this function could be useful to combine multiple lists of plots in a unique one using the function combine.lists of this package. The latter, from an input list of list, generates an output list result of the combination of all the lists contained in the provided input.

6.2 Filter data.frame rows by a specific pattern

The function grep in the shell allows to filter all the rows/lines of a table/file containing a specific pattern in any position. The function grepl.data.frame from this package mimics the behavior of the shell function in R and applies it to a data.frame. It returns a logic vector of length equal to the rows number of the data.frame in which TRUE indicates that the pattern have been found in the corresponding line. This vector than could be used to filter the data.frame rows such as using dplyr::filter() function.

In the following example is shown a dummy table of copy number variation (CNV) individuated in a cohort of patients for a list of genes. For a given gene in a certain patient if a CNV has been found it will be indicated the type of the CNV (“gain”, “loss”, “gain/loss”) otherwise an NA is used.

# Loading the table
data("CNV.data", package = "Rseb")
print(CNV.data)

**Example of CNV annotation table**
geneName	patient_1	patient_2	patient_3	patient_4	patient_5	patient_6	patient_7	patient_8
gene_Y	loss	NA	gain	gain	gain	gain	gain	NA
gene_W	loss	NA	gain	NA	loss	loss	loss	loss
gene_N	gain	loss	loss	gain	loss	loss	gain	loss
gene_D	NA	loss	NA	gain	loss	gain	NA	loss
gene_V	loss	gain	gain	gain	loss	gain	gain	loss
gene_L	gain/loss	gain	loss	loss	loss	loss	NA	gain
gene_P	loss	gain	loss	loss	loss	gain	loss	NA
gene_U	gain	loss	NA	gain	loss	NA	gain/loss	gain
gene_X	loss	gain	loss	loss	gain	loss	loss	gain
gene_U	gain	loss	NA	gain	loss	NA	gain/loss	gain

Now it is possible to filter only the lines containing, in any position, the pattern “gain/loss” using the function grepl.data.frame.

CNV.gain.loss <- dplyr::filter(.data = CNV.data,
                               Rseb::grepl.data.frame(data.frame = CNV.data,
                                                      pattern = "gain/loss"))
print(CNV.gain.loss)

**CNV annotation table filtered for the pattern “gain/loss”**
geneName	patient_1	patient_2	patient_3	patient_4	patient_5	patient_6	patient_7	patient_8
gene_B	NA	loss	gain	loss	gain	gain/loss	loss	gain
gene_C	loss	NA	gain	loss	gain/loss	NA	loss	gain
gene_E	gain	gain	NA	loss	loss	gain/loss	loss	loss
gene_G	gain/loss	gain	loss	NA	NA	gain	NA	NA
gene_J	gain	gain/loss	loss	loss	loss	gain/loss	NA	gain
gene_K	loss	gain/loss	loss	loss	gain	NA	gain	loss
gene_L	gain/loss	gain	loss	loss	loss	loss	NA	gain
gene_R	gain/loss	gain	gain	gain	gain	gain	NA	loss
gene_U	gain	loss	NA	gain	loss	NA	gain/loss	gain

7 Package information

7.1 Documentation

With the package a detailed PDF manual with details for each function and respective parameters is available.

The R-package have been published on GitHub and a git-pages website is available as well. At both these sites it is possible to find the installation procedure, required dependencies, and the links for changeLog, manual and vignette.

7.2 Package history and releases

A list of all releases and respective description of changes applied could be found here.

7.3 Contact

For any suggestion, bug fixing, commentary please contact:

Sebastian Gregoricchio sebastian.gregoricchio@gmail.com

7.4 License

This package is under a GNU General Public License (version 3).

Love, M.I., Huber, W., Anders, S., “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2”, Genome Biology 15(12):550 (2014)↩︎
Ramírez, Fidel, Devon P. Ryan, Björn Grüning, Vivek Bhardwaj, Fabian Kilpert, Andreas S. Richter, Steffen Heyne, Friederike Dündar, and Thomas Manke. deepTools2: A next Generation Web Server for Deep-Sequencing Data Analysis. Nucleic Acids Research (2016). doi:10.1093/nar/gkw257.↩︎
The “classic” way to sort a bed file is to sort it by chromosome > start > end, but sometimes it is required a more extended sorting by score. Missed score sorting could lead to encounter errors in certain software.↩︎
Only up to 6 columns (BED6) will be kept: chr, start, end, name, score, strand↩︎
For more details on parameters visit computeMatrix manual page.↩︎
For more details on parameters visit the IGV batch parameters page.↩︎

Rseb: an R-package for NGS data management and visualization

Sebastian Gregoricchio, Ph.D.

Last update: 10 October, 2022