release license


1 Introduction

The concept behind the DEprot (Differential Expression proteomics) is to provide a toolkit that allows for the normalization, imputation and analyses of the differential protein expression in proteomics data. The data are assumed to be LFQ (label-free quantitation) values.


1.1 Citation

If you use this package, please cite:

Citation

“No publication associated yet et al., XYZ (2123).
doi: XYZ/XYZ


citation("DEprot")
> To cite package 'DEprot' in publications use:
> 
>   Gregoricchio S (????). _DEprot: An R-package for proteomics
>   differential analyses_. R package version 0.1.0,
>   <https://sebastian-gregoricchio.github.io/DEprot/https://github.com/sebastian-gregoricchio/DEprot/https://sebastian-gregoricchio.github.io/>.
> 
> A BibTeX entry for LaTeX users is
> 
>   @Manual{,
>     title = {DEprot: An R-package for proteomics differential analyses},
>     author = {Sebastian Gregoricchio},
>     note = {R package version 0.1.0},
>     url = {https://sebastian-gregoricchio.github.io/DEprot/
> https://github.com/sebastian-gregoricchio/DEprot/
> https://sebastian-gregoricchio.github.io/},
>   }
> 
> ATTENTION: This citation information has been auto-generated from the
> package DESCRIPTION file and may need manual editing, see
> 'help("citation")'.


1.2 General workflow

Data can be loaded as raw MS-counts, pre-normalized unimputed, LFQ values or, imputed LFQ values. Further, if multiple batches are available, it is possible to perform batch corrections by combining LFQ uimputed tables and load this data as “raw” in order to perform a batch harmonization.

Using normalized and imputed data, DEprot will help you in the differential analyses as well in performing quality controls on your data (sample correlation, principal components analyses (PCA)).




2 Loading the data

The package starting point is the building of a DEprot object. This class of objects is specific to this package and requires at least two elements:

  • counts table: this must be a matrix in which the column names represent the samples while the rown ames identify the proteins. The values in the matrix are assumed to be LFQ (label-free quantitation) values, either in linear or log-transformed format.
  • metadata: this must be a data.frame containing at least one column names column.id which values correspond to the column names of the counts_table. Any other additional column can be added (cell lines, treatment, condition, timing, etc) and used to define groups for differential and quality control (PCA and correlation) analyses.

In the next paragraph we will refer to a dataset (pre-loaded in DEprot), in which a breast cancer (BCa) cell line was cultured in either hormone-deprived media or in full media (FBS). If cultered in hormone-deprived condition, it was then treated for 6 hrs with either \(\beta\)-estradiol (E2) or vehicle (DMSO). 4 biological replicates have been analyzed. Hence, the datasets consists of 1 cell lines x 3 conditions x 4 replicates, for a total of 12 sample. Proteins and samples have been “anonymized”.


2.1 Collect pre-loaded data

The counts represent LFQ values in the log2 format and are not-imputed.

# Metadata
data("sample.config", package = "DEprot")
sample.config
Sample metadata table
column.id sample.id cell condition combined.id replicate
Sample_A BCa_FBS_rep1 BCa FBS BCa_FBS rep1
Sample_B BCa_6h.DMSO_rep1 BCa 6h.DMSO BCa_6h.DMSO rep1
Sample_C BCa_6h.10nM.E2_rep1 BCa 6h.10nM.E2 BCa_6h.10nM.E2 rep1
Sample_D BCa_FBS_rep2 BCa FBS BCa_FBS rep2
Sample_E BCa_6h.DMSO_rep2 BCa 6h.DMSO BCa_6h.DMSO rep2
Sample_F BCa_6h.10nM.E2_rep2 BCa 6h.10nM.E2 BCa_6h.10nM.E2 rep2
Sample_G BCa_FBS_rep3 BCa FBS BCa_FBS rep3
Sample_H BCa_6h.DMSO_rep3 BCa 6h.DMSO BCa_6h.DMSO rep3
Sample_I BCa_6h.10nM.E2_rep3 BCa 6h.10nM.E2 BCa_6h.10nM.E2 rep3
Sample_J BCa_FBS_rep4 BCa FBS BCa_FBS rep4
Sample_K BCa_6h.DMSO_rep4 BCa 6h.DMSO BCa_6h.DMSO rep4
Sample_L BCa_6h.10nM.E2_rep4 BCa 6h.10nM.E2 BCa_6h.10nM.E2 rep4


# log2(LFQ) values (not imputed)
data("unimputed.counts", package = "DEprot")
head(unimputed.counts[,1:6])
Unimputed log2(LFQ) values
Sample_A Sample_B Sample_C Sample_D Sample_E Sample_F
protein.1 17.8663 18.4581 18.7380 18.1438 18.7811 19.3043
protein.2 22.5238 22.9057 23.0183 22.3577 23.0094 23.1491
protein.3 16.1388 15.3720 15.2374 17.0161 15.5452 15.7899
protein.4 21.0189 21.0513 21.0460 21.2229 21.3352 21.0178
protein.5 19.1038 19.4950 19.5529 18.5697 19.3690 19.5156
protein.6 15.8377 15.4202 15.4765 15.9765 15.3439 15.2335


2.2 Build DEprot object

Now we will combine the the counts and metadata to create a DEprot object (hereafter defined as dpo).
Notice that another import information is whether the data are log transformed, and if yes, which is the log base. Recommended transformation is the log2(score + 1).

If data are pre-normalized and/or pre-imputed, it can be indicated with the corresponding parameters.
If this is not the case, leave the parameters as NA (not NULL!).

Ultimately, if the metadata table does not contain a column.id column corresponding to the row names of the counts_table, it is possible to indicate the name of another column that should be assumed as to be the column.id.

dpo <- load.counts(counts = unimputed.counts,
                   metadata = sample.config,
                   log.base = 2,
                   imputation = NA,
                   normalization.method = NA,
                   column.id = "column.id")

dpo
> DEprot object:
>            Samples:  12
>           Proteins:  13239
>   Counts available:  raw
> Log transformation:  log2
>   Metadata columns:  column.id, sample.id, cell, condition, combined.id, replicate


This object is an S4-vector of class DEprot. The S4-vectors are containers of slots that can be accessed using the symbol @ (e.g., object@slot.id.
The structure of an object of class DEprot (and DEprot.analyses) is the following:

Slot Description
raw.counts table containing the raw counts, if not available it will be NULL
norm.counts table containing the normalized counts, if not available it will be NULL
imputed.counts table containing the imputed counts, if not available it will be NULL
log.base a number indicating the base of the log used to transform the table, if not available it will be NA
log.transformed logical value indicating whether the data are log-transformed or not
imputed logical value indicating whether the data are imputed or not
imputation discussed further in the Imputation paragraph
normalized logical value indicating whether the data are normalized or not
normalization.method a string indicating the type of normalization applied, if not available it will be NA
boxplot.raw box+violin plot of the distribution of the LFQ intensities per sample obtained from the raw counts
boxplot.norm box+violin plot of the distribution of the LFQ intensities per sample obtained from the normalized counts
boxplot.imputed box+violin plot of the distribution of the LFQ intensities per sample obtained from the imputed counts
analyses.result.list discussed further in the Differential Expression analyses paragraph
contrasts discussed further in the Differential Expression analyses paragraph
differential.analyses.params discussed further in the Differential Expression analyses paragraph


2.3 Rename sample columns

As in our example, sometimes the columns of the counts are not the actual ID of the samples, but rather an identifier. However, it is possible to rename the counts column names indicating any column of the metadata table (having unique values). The original identifiers are stored in a new column (old.column.id) of the metadata.
Notice that the renaming will applied to all counts table available.

dpo <- rename.samples(DEprot.object = dpo,
                      metadata.column = "sample.id")

get.metadata(dpo)
>              column.id           sample.id cell  condition    combined.id
> 1         BCa_FBS_rep1        BCa_FBS_rep1  BCa        FBS        BCa_FBS
> 2     BCa_6h.DMSO_rep1    BCa_6h.DMSO_rep1  BCa    6h.DMSO    BCa_6h.DMSO
> 3  BCa_6h.10nM.E2_rep1 BCa_6h.10nM.E2_rep1  BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 4         BCa_FBS_rep2        BCa_FBS_rep2  BCa        FBS        BCa_FBS
> 5     BCa_6h.DMSO_rep2    BCa_6h.DMSO_rep2  BCa    6h.DMSO    BCa_6h.DMSO
> 6  BCa_6h.10nM.E2_rep2 BCa_6h.10nM.E2_rep2  BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 7         BCa_FBS_rep3        BCa_FBS_rep3  BCa        FBS        BCa_FBS
> 8     BCa_6h.DMSO_rep3    BCa_6h.DMSO_rep3  BCa    6h.DMSO    BCa_6h.DMSO
> 9  BCa_6h.10nM.E2_rep3 BCa_6h.10nM.E2_rep3  BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 10        BCa_FBS_rep4        BCa_FBS_rep4  BCa        FBS        BCa_FBS
> 11    BCa_6h.DMSO_rep4    BCa_6h.DMSO_rep4  BCa    6h.DMSO    BCa_6h.DMSO
> 12 BCa_6h.10nM.E2_rep4 BCa_6h.10nM.E2_rep4  BCa 6h.10nM.E2 BCa_6h.10nM.E2
>    replicate old.column.id
> 1       rep1      Sample_A
> 2       rep1      Sample_B
> 3       rep1      Sample_C
> 4       rep2      Sample_D
> 5       rep2      Sample_E
> 6       rep2      Sample_F
> 7       rep3      Sample_G
> 8       rep3      Sample_H
> 9       rep3      Sample_I
> 10      rep4      Sample_J
> 11      rep4      Sample_K
> 12      rep4      Sample_L
head(dpo@raw.counts[,1:6])
>           BCa_FBS_rep1 BCa_6h.DMSO_rep1 BCa_6h.10nM.E2_rep1 BCa_FBS_rep2
> protein.1      17.8663          18.4581             18.7380      18.1438
> protein.2      22.5238          22.9057             23.0183      22.3577
> protein.3      16.1388          15.3720             15.2374      17.0161
> protein.4      21.0189          21.0513             21.0460      21.2229
> protein.5      19.1038          19.4950             19.5529      18.5697
> protein.6      15.8377          15.4202             15.4765      15.9765
>           BCa_6h.DMSO_rep2 BCa_6h.10nM.E2_rep2
> protein.1          18.7811             19.3043
> protein.2          23.0094             23.1491
> protein.3          15.5452             15.7899
> protein.4          21.3352             21.0178
> protein.5          19.3690             19.5156
> protein.6          15.3439             15.2335



3 Data normalization

When a DEprot object is loaded, automatically a box/violin plot showing the distribution of the LFQ values per samples is generated.
This representation is useful to estimate whether the data are normalized or not.

dpo@boxplot.raw

Violin/boxplot LFQ intensities of unnormalized data
Boxplots display the quantiles of the LFQ intensities, while red and blue dahsed lines correspond to maximum and minimum LFQ value for each sample.

In this package we apply the Modified Balanced Quantile Normalization (MBQN) from the MBQN package and developed by E.Brombacher et al. (Proteomics, 2020). The modification balances the median (or mean) intensity of features (rows) which are rank invariant (RI) or nearly rank invariant (NRI) across samples (columns) before quantile normalization. This prevents an over-correction of the intensity profiles of RI and NRI features by classical quantile normalization and therefore supports the reduction of systematics in downstream analyses.

dpo <- normalize.counts(DEprot.object = dpo,
                        NRI.RI.ratio.threshold = 0.5,
                        balancing.function = "median")

dpo
> DEprot object:
>            Samples:  12
>           Proteins:  13239
>   Counts available:  raw, normalized
> Log transformation:  log2
>   Metadata columns:  column.id, sample.id, cell, condition, combined.id, replicate, old.column.id
dpo@normalization.method
>                    param                  value
> 1                package                   MBQN
> 2                 method Quantile normalization
> 3               balanced                   TRUE
> 4               function                 median
> 5 NRI/RI ratio threshold                    0.5
head(dpo@raw.counts[,1:6])
>           BCa_FBS_rep1 BCa_6h.DMSO_rep1 BCa_6h.10nM.E2_rep1 BCa_FBS_rep2
> protein.1      17.8663          18.4581             18.7380      18.1438
> protein.2      22.5238          22.9057             23.0183      22.3577
> protein.3      16.1388          15.3720             15.2374      17.0161
> protein.4      21.0189          21.0513             21.0460      21.2229
> protein.5      19.1038          19.4950             19.5529      18.5697
> protein.6      15.8377          15.4202             15.4765      15.9765
>           BCa_6h.DMSO_rep2 BCa_6h.10nM.E2_rep2
> protein.1          18.7811             19.3043
> protein.2          23.0094             23.1491
> protein.3          15.5452             15.7899
> protein.4          21.3352             21.0178
> protein.5          19.3690             19.5156
> protein.6          15.3439             15.2335


Also in this case a box/violin plot with the corresponding normalized LFQ values per each sample is generated and stored in a vector slot.

patchwork::wrap_plots(dpo@boxplot.raw, dpo@boxplot.norm, nrow = 1)


3.1 Batch effect correction

Proteomics is highly susceptible to batch effects. Hereafter we use the package HarmonizR package, developed by H.Voß & S.Schlumbohm (Nat.Commun., 2022). This tool besides handling multiple experiments, allows for the batch correction of data derived by the combination of both DIA (Data-Independent Acquisition) and DDA (Data-Dependent Acquisition).

HarmonizR is not a required dependency, therefore if this fucntion is used and HarmonizR is not already installed, a warning will indicate the requred installation. The package can be installed by: BiocManager::install("HarmonizR"), or alternatively devtools::install_github("https://github.com/SimonSchlumbohm/HarmonizR/", subdir = "HarmonizR").

To use the harmonize.batches is sufficient to provide a DEprot object containing a combined table of multiple experiments and indicate the identifier of a column in the metadata table which corresponds to the batch annotation. The result is a DEprot object with the same structure described in the previous paragraph.
Here, for simplicity, we will add manually a column to the metadata table with “dummy” batches.


## Adding batch column to the metadata table
dpo@metadata$batch = rep(c("batch_1","batch_2"), each = 6)

get.metadata(dpo)
>              column.id           sample.id cell  condition    combined.id
> 1         BCa_FBS_rep1        BCa_FBS_rep1  BCa        FBS        BCa_FBS
> 2     BCa_6h.DMSO_rep1    BCa_6h.DMSO_rep1  BCa    6h.DMSO    BCa_6h.DMSO
> 3  BCa_6h.10nM.E2_rep1 BCa_6h.10nM.E2_rep1  BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 4         BCa_FBS_rep2        BCa_FBS_rep2  BCa        FBS        BCa_FBS
> 5     BCa_6h.DMSO_rep2    BCa_6h.DMSO_rep2  BCa    6h.DMSO    BCa_6h.DMSO
> 6  BCa_6h.10nM.E2_rep2 BCa_6h.10nM.E2_rep2  BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 7         BCa_FBS_rep3        BCa_FBS_rep3  BCa        FBS        BCa_FBS
> 8     BCa_6h.DMSO_rep3    BCa_6h.DMSO_rep3  BCa    6h.DMSO    BCa_6h.DMSO
> 9  BCa_6h.10nM.E2_rep3 BCa_6h.10nM.E2_rep3  BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 10        BCa_FBS_rep4        BCa_FBS_rep4  BCa        FBS        BCa_FBS
> 11    BCa_6h.DMSO_rep4    BCa_6h.DMSO_rep4  BCa    6h.DMSO    BCa_6h.DMSO
> 12 BCa_6h.10nM.E2_rep4 BCa_6h.10nM.E2_rep4  BCa 6h.10nM.E2 BCa_6h.10nM.E2
>    replicate old.column.id   batch
> 1       rep1      Sample_A batch_1
> 2       rep1      Sample_B batch_1
> 3       rep1      Sample_C batch_1
> 4       rep2      Sample_D batch_1
> 5       rep2      Sample_E batch_1
> 6       rep2      Sample_F batch_1
> 7       rep3      Sample_G batch_2
> 8       rep3      Sample_H batch_2
> 9       rep3      Sample_I batch_2
> 10      rep4      Sample_J batch_2
> 11      rep4      Sample_K batch_2
> 12      rep4      Sample_L batch_2


## batch correction
dpo <- harmonize.batches(DEprot.object = dpo,
                         batch.column = "batch",
                         cores = 1)



4 Data imputation

Often many NA/NaN values are present in the LFQ tables due to the techanical limitaions of the protein detection by Mass Spectrometry (MS) experiments.

Here we use the package missForest package, developed by DJ.Stekhoven & P.Buehlmann (Bioinformatics, 2012). This tool will impute the NaN and assign and estimated value. It also yields an out-of-bag (OOB) imputation error estimate (general, or per each sample). Moreover, it can be run parallel to save computation time (both examples reported here after).

## Without parallelization
dpo <- impute.counts(DEprot.object = dpo,
                     max.iterations = 100,
                     variable.wise.OOBerror = T,
                     use.normalized.data = T)


## With parallelization
dpo <- impute.counts(DEprot.object = dpo,
                     max.iterations = 100,
                     variable.wise.OOBerror = T,
                     use.normalized.data = T,
                     cores = 10,
                     parallel.mode = "variables")

dpo

dpo@imputation$OOBerror

data.frame(dpo@imputation[-3])
> DEprot object:
>            Samples:  12
>           Proteins:  13239
>   Counts available:  raw, normalized, imputed
> Log transformation:  log2
>   Metadata columns:  column.id, sample.id, cell, condition, combined.id, replicate, old.column.id
>        BCa_FBS_rep1    BCa_6h.DMSO_rep1 BCa_6h.10nM.E2_rep1        BCa_FBS_rep2 
>           0.1443367           0.1218259           0.1429739           0.1272787 
>    BCa_6h.DMSO_rep2 BCa_6h.10nM.E2_rep2        BCa_FBS_rep3    BCa_6h.DMSO_rep3 
>           0.1249806           0.1260666           0.1691520           0.1310784 
> BCa_6h.10nM.E2_rep3        BCa_FBS_rep4    BCa_6h.DMSO_rep4 BCa_6h.10nM.E2_rep4 
>           0.1338480           0.1352660           0.1462996           0.1331815
method max.iterations parallelization.mode cores processing.time
missForest 100 variables 10 14.61 mins


Also in this case a box/violin plot with the corresponding imputed LFQ values per each sample is generated and stored in a vector slot.

patchwork::wrap_plots(dpo@boxplot.raw, dpo@boxplot.norm, dpo@boxplot.imputed, nrow = 1)



5 Sample similarities

5.1 Principal Component Analyses (PCA)

PCA can be performed in order to perform a dimensional reduction and determine which factor explains the variability of the samples.
DEprot includes function dedicated to this aim and specifically build to work with DEprot objects.

Notice that, even if the data are not log-transformed, perform.PCA will do it before performing the analyses.

5.1.1 Compute PCs

## Perform the analyses (DEprot.PCA object)
PCA <- perform.PCA(DEprot.object = dpo,
                   which.data = "imputed") # possible: raw, normalized, imputed

The DEprot.PCA object contains the following slots:

Slot Description
PCA.metadata metadata of the samples used in the PCA (subset of the original DEprot@metadata)
sample.subset vector containing the list of samples analyzed
data.used vector indicating the type of counts used (imputed, normalized, raw)
prcomp object of class prcomp corresponding to the full PCA output
PCs data.frame combining the PC scores and the metadata table, useful for replotting
importance statistical summary table for the PCA analyses per each PC
cumulative.PC.plot ggplot object corresponding to out put of plot.PC.cumulative for this object


5.1.2 Visualize PCAs

## Plot cumulative variance of all PCs
#### equivalent to `PCA@cumulative.PC.plot`
plot.PC.cumulative(DEprot.PCA.object = PCA,
                   bar.color = "steelblue",
                   line.color = "navyblue")

## Plot PC scatters
PC_1.2 <-
  plot.PC.scatter(DEprot.PCA.object = PCA,
                  PC.x = 1,
                  PC.y = 2,
                  color.column = "condition",
                  shape.column = "replicate",
                  label.column = NULL,
                  plot.zero.lines = F) +
  geom_hline(yintercept = 0, color = "gray", linetype = "dashed") +
  theme(legend.position = "none")


PC_2.3 <-
  plot.PC.scatter(DEprot.PCA.object = PCA,
                  PC.x = 2,
                  PC.y = 3,
                  color.column = "condition",
                  shape.column = "replicate",
                  label.column = NULL,
                  plot.zero.lines = TRUE)

patchwork::wrap_plots(PC_1.2, PC_2.3, nrow = 1)


Alternatively, the combination of PC1.2 and PC2.3 can be automatically generated using the function plot.PC.scatter.123:

plot.PC.scatter.123(DEprot.PCA.object = PCA,
                    color.column = "condition",
                    shape.column = "replicate",
                    label.column = "replicate",
                    dot.colors = c("6h.10nM.E2" = "indianred",
                                   "6h.DMSO" = "steelblue",
                                   "FBS" =  "forestgreen"),
                    plot.zero.line.y.12 = TRUE,
                    plot.zero.line.x.12 = FALSE,
                    plot.zero.line.y.23 = TRUE,
                    plot.zero.line.x.23 = TRUE)

5.1.3 Analyze PCs on a sample subset

These analyses can also be performed for a subset of samples by indicated the sample names of interest.
In the example below we will use only the sample in which the estrogen receptor is active (E2 and FBS conditions).

## Perform the analyses (DEprot.PCA object)
PCA.fbs.e2 <-
  perform.PCA(DEprot.object = dpo,
              sample.subset = dpo@metadata$column.id[grepl("E2|FBS",
                                                           dpo@metadata$column.id)],
              which.data = "imputed")


## Plot cumulative variance of all PCs
plot.PC.cumulative(DEprot.PCA.object = PCA.fbs.e2,
                   bar.color = "indianred",
                   line.color = "firebrick4",
                   title = "**Only ERa active**")

## Plot PC scatters
PC.fbs.e2_1.2 <-
  plot.PC.scatter(DEprot.PCA.object = PCA.fbs.e2,
                  PC.x = 1,
                  PC.y = 2,
                  color.column = "condition",
                  shape.column = "replicate",
                  label.column = NULL,
                  plot.zero.lines = F) +
  geom_hline(yintercept = 0, color = "gray", linetype = "dashed") +
  theme(legend.position = "none")
  
  
PC.fbs.e2_2.3 <-
  plot.PC.scatter(DEprot.PCA.object = PCA.fbs.e2,
                  PC.x = 2,
                  PC.y = 3,
                  color.column = "condition",
                  shape.column = "replicate",
                  label.column = NULL,
                  plot.zero.lines = T)

patchwork::wrap_plots(PC.fbs.e2_1.2, PC.fbs.e2_2.3, nrow = 1)


5.2 Correlations

Another method to define the sample clustering/groups, is the overall correlation between the samples.
Hierarchical clustering is performed using the 1 - correlation values, since the hierarchical clustering algorithm is based on dissimilarities while the correlations are an index of similarity.

corr.all.samples <-
  plot.correlation.heatmap(DEprot.object = dpo,
                           which.data = "imputed",
                           palette = viridis::mako(n = 10, direction = -1, begin = 0.25),
                           correlation.scale.limits = c(0.9,1),
                           correlation.method = "pearson",
                           plot.subtitle = "All samples",
                           display.values = TRUE)
corr.all.samples

Also in this case the sample correlation can be computed for a subset of samples as shown before for the PCAs.

corr.ERa.active <-
  plot.correlation.heatmap(DEprot.object = dpo,
                           which.data = "imputed",
                           sample.subset = dpo@metadata$column.id[grepl("E2|FBS",
                                                                        dpo@metadata$column.id)],
                           palette = viridis::magma(n = 10, direction = -1, begin = 0.25),
                           correlation.scale.limits = c(0.9,1),
                           correlation.method = "pearson",
                           plot.subtitle = "Only ERa active",
                           clustering.method = "complete",
                           display.values = TRUE)

corr.ERa.active

The DEprot.correlation correlation object contains the following slots:

Slot Description
heatmap ggplot object corresponding to the correlation heatmap
corr.metadata metadata of the samples used in the correlation (subset of the original DEprot@metadata)
sample.subset vector containing the list of samples analyzed
data.used vector indicating the type of counts used (imputed, normalized, raw)
corr.matrix the correlation matrix on which the heatmap is base on
distance object of class dist corresponding to the output of as.dist(1 - correlation.matrix)
cluster hclust object generated by hclust(d = as.dist(1 - correlation.matrix), method = clustering.method)



6 Differential Expression (DE) analyses

Differential expression analyses between two conditions can be performed employing two different methods using the functions diff.analyses and diff.analyses.limma. With the first (diff.analyses) the conditions will be compared two-by-two (individual t-/Wilcox tests), while with the second (diff.analyses.limma) the limma package is used to fit the data on a linear model.

In both cases, it is sufficient to provide a list of 3-elements vectors. The latter, should indicate any column of the metadata table (grouping factor) and two values (groups) to compare within this column. The first variable will be numerator and the second the denominator of the fold change: c("group.column", "condition.A", "condition.B"), FoldChange = group A/B.

When a replicate column is available, it is possible to run the analyses in paired mode. When using diff.analyses, it is sufficient to indicate the column from the metadata table that corresponds to the replicate identifiers (in our example replicate.column = "replicate") and set paired.test = TRUE. For each contrast it will be verified that replicate identifiers are not repeated within a group and, that replicate number and identifiers coincide between the two groups. By the default analyses ignore the replicates.
If diff.analyses.limma is used instead, the correlation between replicates will be estimated and incorporated in the fitting model. In this case indicate the replicate column identifier (in our example replicate.column = "replicate") and set include.rep.model = TRUE.

In the following example we will compare 6h.10nM.E2 vs 6h.DMSO, and 6h.10nM.E2 vs FBS. Both paired and unpaired examples are shown, but further analyses are based on the paired results only (multi t-test mode).

LIMMA mode

## Unpaired test
dpo_analyses <- diff.analyses.limma(DEprot.object = dpo,
                                    contrast.list = list(c("condition", "6h.10nM.E2", "6h.DMSO"),
                                                         c("condition", "6h.10nM.E2", "FBS")),
                                    linear.FC.th = 2,
                                    padj.th = 0.05,
                                    padj.method = "BH",
                                    fitting.method = "ls",
                                    which.data = "imputed")

## Paired test
dpo_analyses <- diff.analyses.limma(DEprot.object = dpo,
                                    contrast.list = list(c("condition", "6h.10nM.E2", "6h.DMSO"),
                                                         c("condition", "6h.10nM.E2", "FBS")),
                                    replicate.column = "replicate",
                                    include.rep.model = TRUE,
                                    linear.FC.th = 2,
                                    padj.th = 0.05,
                                    padj.method = "BH",
                                    fitting.method = "ls",
                                    which.data = "imputed")


Multiple t-test mode (next paragraphs use these results, in paired mode)

## Unpaired test
dpo_analyses <- diff.analyses(DEprot.object = dpo,
                              contrast.list = list(c("condition", "6h.10nM.E2", "6h.DMSO"),
                                                   c("condition", "6h.10nM.E2", "FBS")),
                              linear.FC.th = 2,
                              padj.th = 0.05,
                              padj.method = "bonferroni",
                              stat.test = "t.test",
                              which.data = "imputed")

## Paired test
dpo_analyses <- diff.analyses(DEprot.object = dpo,
                              contrast.list = list(c("condition", "6h.10nM.E2", "6h.DMSO"),
                                                   c("condition", "6h.10nM.E2", "FBS")),
                              replicate.column = "replicate",
                              paired.test = TRUE,
                              linear.FC.th = 2,
                              padj.th = 0.05,
                              padj.method = "bonferroni",
                              stat.test = "t.test",
                              which.data = "imputed")

dpo_analyses
> DEprot.analyses object:
>           Counts used:  imputed
> Fold Change threshold:  2 (linear)
> FC unresponsive range:  [0.9090909,1.1] (linear)
>        padj threshold:  0.05 (linear)
>           padj method:  bonferroni
> 
> 
> Differential results summary:
>                        contrast.id group.factor     group1  group2 paired.test
> 1 condition: 6h.10nM.E2 vs 6h.DMSO    condition 6h.10nM.E2 6h.DMSO        TRUE
> 2 condition: 6h.10nM.E2 vs 6h.DMSO    condition 6h.10nM.E2 6h.DMSO        TRUE
> 3 condition: 6h.10nM.E2 vs 6h.DMSO    condition 6h.10nM.E2 6h.DMSO        TRUE
> 4 condition: 6h.10nM.E2 vs 6h.DMSO    condition 6h.10nM.E2 6h.DMSO        TRUE
> 5     condition: 6h.10nM.E2 vs FBS    condition 6h.10nM.E2     FBS        TRUE
> 6     condition: 6h.10nM.E2 vs FBS    condition 6h.10nM.E2     FBS        TRUE
> 7     condition: 6h.10nM.E2 vs FBS    condition 6h.10nM.E2     FBS        TRUE
> 8     condition: 6h.10nM.E2 vs FBS    condition 6h.10nM.E2     FBS        TRUE
>    diff.status     n median.FoldChange
> 1      6h.DMSO     0                NA
> 2   6h.10nM.E2     0                NA
> 3 unresponsive  9009       0.008525557
> 4         null  4230       0.152922231
> 5          FBS     0                NA
> 6   6h.10nM.E2     0                NA
> 7 unresponsive  2306      -0.013947464
> 8         null 10933      -0.205561346

The summary can be collected by using the generic function summary:

diff.analyses_summary = summary(dpo)


6.1 DE results

The output will be a DEprot.analyses object. This class is similar to the base DEprot one, however 3 slots are now available:

  • contrasts: corresponds to the list used to define the contrasts, but includes also the IDs of the counts matrix belonging to each subgroup and whether the statistical test was performed in paired mode.
  • differential.analyses.params: a list containing the core parameters used for the differential expression analyses.
  • analyses.result.list: a list with an element for each contrast including all the results of the differential analyses (see below for details).

The analyses.result.list, for each contrast, stores a list with the following elements:

Element Description
results a data.frame containing the results of the analyses; includes average expression of each group, basemean, foldchange, pvalue and p.adj, differential.status
n.diff a summary table showing the number of proteins in each differential expression status (up/down/unresponsive, null)
PCA.data output of perform.PCA for the subset of samples analyzed in a specific contrast
PCA.plots combination of 3 plots: scatter PC1-vs-PC2, scatter PC2-vs-PC3, and cumulative bar plot
correlations combination of Pearson and Spearman correlation heatmaps (obtained by plot.correlation.heatmap) for the subset of samples analyzed in a specific contrast
volcano volcano plot showing the log2(FoldChange) x -log10(p.adjusted) of differential expression results; it can be regenerated using plot.volcano
MA.plot MA-plot showing the log2(basemean) x log2(FoldChange) of differential expression results; it can be regenerated using plot.MA

If the function diff.analyses.limma was used, an additional slot called limma.fit will be available and will contain the results of limma::contrasts.fit.


6.1.1 DE table

The table with the results of the differential analyses can be retrieved directly from the list in the DEprot.analyses object (dpo_analyses@analyses.result.list$contrast.id$results) or using the get.results function.

## Direct access
results = dpo_analyses@analyses.result.list$condition_6h.10nM.E2.vs.6h.DMSO$results

## Function
results = get.results(dpo_analyses, contrast = 1)

head(results)
prot.id basemean.log2 log2.mean.6h.10nM.E2 log2.mean.6h.DMSO log2.Fold_6h.10nM.E2.vs.6h.DMSO p.value padj diff.status
protein.1 18.85442 18.91504 18.79380 0.1212396 0.6342736 1 unresponsive
protein.2 23.07816 23.08739 23.06894 0.0184512 0.8673988 1 unresponsive
protein.3 15.68530 15.65794 15.71266 -0.0547179 0.7188334 1 unresponsive
protein.4 21.14521 21.09930 21.19112 -0.0918296 0.2081734 1 unresponsive
protein.5 19.56671 19.62718 19.50624 0.1209374 0.1169272 1 unresponsive
protein.6 15.31226 15.32551 15.29901 0.0264997 0.6433742 1 unresponsive


6.1.2 PCA and correlation within the comparison

The DEprot.analyses object includes PCA and correlation analyses of the samples involved in the contrast.

dpo_analyses@analyses.result.list$condition_6h.10nM.E2.vs.6h.DMSO$PCA.plots

dpo_analyses@analyses.result.list$condition_6h.10nM.E2.vs.6h.DMSO$correlations


6.1.3 Visualize DE analyses

6.1.3.1 MA and volcano

Differential expressed proteins can be visualized as either a volcano plot or an MA-plot.
Both these plots are available in the dpo_analyses@analyses.result.list$contrast.id list, but can also be generated using the functions plot.volcano and plot.MA.

Of note, if use.uncorrected.pvalue = TRUE, the normal p-value will be used instead of the p.adjusted. In this case the FoldChange and p-value thresholds are collected from the DEprot.analyses object and reapplied to compute the new differential status of the proteins.

volcano = plot.volcano(dpo_analyses, contrast = 1, use.uncorrected.pvalue = TRUE)
MAplot = plot.MA(dpo_analyses, contrast = 1, use.uncorrected.pvalue = TRUE)

patchwork::wrap_plots(volcano, MAplot)


6.1.3.2 Heatmaps

To plot results heatmaps there are two functions available:

  • heatmap.counts: allows for the plotting of raw, normalized or imputed counts.
  • heatmap.contrasts: allows for the plotting of log2(FoldChange) values.


6.1.3.2.1 Counts

With the heatmap.counts function it is possible to generate an heatmap for the raw/normalized/imputed counts. It is possible to select only specific proteins and/or subset samples (from column.id column in metadata table). If the object provided is a DEprot.analyses object, it is also possible to indicate to plot only differential proteins (or top.n differential proteins) from a specific contrast.
The protein ranking definition is based on the differential score, computed as: abs(log2(fold change)) * -log10(Padj).

The resulting object is of class DEprot.counts.heatmap, and contains the ggplot-heatmap, the row cluster and the column cluster.

## Plotting from a DEprot object
imputed_counts_heatmap <- 
  heatmap.counts(DEprot.object = dpo,
                 which.data = "imputed",
                 sample.subset = dpo@metadata$column.id[grep("6h", dpo@metadata$column.id)],
                 show.protein.names = TRUE,
                 protein.subset = c("protein.2295", "protein.304", "protein.657",
                                    "protein.2819", "protein.2168", "protein.10594"),
                 title = "Imputed counts | protein and sample selection") 



## Plotting from a DEprot.analyses object
## top 15 differential proteins from contrast 1
imputed_counts_heatmap_diffProteins <- 
  heatmap.counts(DEprot.object = dpo_analyses,
                 which.data = "imputed",
                 contrast = 1,
                 top.n = 15,
                 palette = viridis::mako(n = 100, direction = -1),
                 cell.border.color = "white",
                 show.protein.names = TRUE,
                 sample.subset = dpo@metadata$column.id[grep("6h", dpo@metadata$column.id)],
                 use.uncorrected.pvalue = TRUE,
                 title = "condition: **6h.10nM.E2** *vs* **6h.DMSO** (top 15) | Imputed counts")


## Combine heatmaps
patchwork::wrap_plots(imputed_counts_heatmap@heatmap,
                      imputed_counts_heatmap_diffProteins@heatmap)


Instead of “pure” counts, it is possibile to compute Z-scores of the values and plot a divergent heatmap.
In this case instead of a palette, it will be required to indicate the color for the positive, negative and 0 values.

We will apply this to the imputed_counts_heatmap_diffProteins heatmap from the previous example.

## Z-score by row
imputed_counts_heatmap_diffProteins_rowScaled <- 
  heatmap.counts(DEprot.object = dpo_analyses,
                 which.data = "imputed",
                 contrast = 1,
                 top.n = 15,
                 high.color = "purple4",
                 low.color = "darkorange",
                 mid.color = "white",
                 cell.border.color = "white",
                 show.protein.names = TRUE,
                 sample.subset = dpo@metadata$column.id[grep("6h", dpo@metadata$column.id)],
                 use.uncorrected.pvalue = TRUE,
                 scale = "rows",
                 title = "condition: **6h.10nM.E2** *vs* **6h.DMSO** (top 15)<br>Imputed counts Z-score")


## Z-score by column
imputed_counts_heatmap_diffProteins_columnScaled <- 
  heatmap.counts(DEprot.object = dpo_analyses,
                 which.data = "imputed",
                 contrast = 1,
                 top.n = 15,
                 high.color = "firebrick",
                 low.color = "steelblue",
                 mid.color = "white",
                 cell.border.color = "white",
                 show.protein.names = TRUE,
                 sample.subset = dpo@metadata$column.id[grep("6h", dpo@metadata$column.id)],
                 use.uncorrected.pvalue = TRUE,
                 scale = "columns",
                 title = "condition: **6h.10nM.E2** *vs* **6h.DMSO** (top 15)<br>Imputed counts Z-score")


## Combine heatmaps
patchwork::wrap_plots(imputed_counts_heatmap_diffProteins_rowScaled@heatmap,
                      imputed_counts_heatmap_diffProteins_columnScaled@heatmap)


Ultimately, With this function is also possible to group data based on the values of a metadata’s column. In this way it is possible to average the values by the values of the defined column (e.g., the replicates of a condition).

In the next example we will average the replicates of each condition using the combined.id column form the metadata of the dpo_analyses object.

imputed_counts_heatmap_diffProteins_rowScaled_grouped.by.condition <- 
  heatmap.counts(DEprot.object = dpo_analyses,
                 group.by.metadata.column = "combined.id",
                 which.data = "imputed",
                 contrast = 1,
                 high.color = "firebrick",
                 low.color = "steelblue",
                 mid.color = "white",
                 cell.border.color = "white",
                 show.protein.names = TRUE,
                 use.uncorrected.pvalue = TRUE,
                 scale = "rows",
                 title = "condition: **6h.10nM.E2** *vs* **6h.DMSO** (all)<br>Imputed counts Z-score")

imputed_counts_heatmap_diffProteins_rowScaled_grouped.by.condition@heatmap


6.1.3.2.2 Fold changes

Instead of counts, it is possible to plot the fold changes of one or multiple contrasts using the function heatmap.contrasts. Only differential proteins are shown.
Also in this case it is possible to show only the top N differential proteins. However the top N of each contrast will be shown. This means than N proteins might be shown.

FC_heatmap <-
  heatmap.contrasts(DEprot.analyses.object = dpo_analyses,
                    contrasts = c(1:2),
                    top.n = 20,
                    high.color = "#35978F",
                    low.color = "#BF812D",
                    mid.color = "white",
                    show.protein.names = TRUE,
                    use.uncorrected.pvalue = TRUE)

FC_heatmap@heatmap


6.1.3.3 Upset-plot

To identify which differential proteins are common among multiple comparisons, it is possible to generate an upset plot using the function plot.upset. With this function all the intersections between all the contrasts including in the provided object can be visualized by co-occurrence. It is also possible to subset only specific contrasts indicating a numeric vector in the flag sort.intersections.

upset.plot <- plot.upset(DEprot.analyses.object = dpo_analyses,
                         contrast.subset = c(1,2),
                         title = "**My upset plot**",
                         sort.intersections = "cardinality",
                         sort.sets = "descending",
                         intersection.bar.color = "navy",
                         setsize.bar.color = "black",
                         show.counts = T,
                         height.ratio = 0.5,
                         width.ratio = 0.4,
                         use.uncorrected.pvalue = TRUE)

upset.plot  # or upset.plot@upset


Besides the plot, also a TRUE/FALSE table is returned. The latter shows for each condition if a protein has been found to be differential. Only proteins that are differential in at least one condition are shown.

upset.plot@obs.matrix
Upset observations matrix
prot.id condition: 6h.10nM.E2 vs 6h.DMSO | 6h.10nM.E2 condition: 6h.10nM.E2 vs 6h.DMSO | 6h.DMSO condition: 6h.10nM.E2 vs FBS | FBS condition: 6h.10nM.E2 vs FBS | 6h.10nM.E2
protein.17 FALSE FALSE TRUE FALSE
protein.25 FALSE FALSE FALSE TRUE
protein.44 FALSE FALSE FALSE TRUE
protein.48 FALSE FALSE TRUE FALSE
protein.53 FALSE FALSE FALSE TRUE



7 Other tools

7.1 Count detected proteins

In order to know how many proteins have been detected in each sample, it is possible to use the function protein.summary. A stacked bar plot will be generated per each sample. The input object must belong to either DEprot or DEprot.analyses classes.

NOTE: this plot is meaningfult only on raw and/or unimputed normalized data. Indeed, estimation of missing values is not possibile on the imputed tables.

protein.counts <-
  protein.summary(DEprot.object = dpo_analyses,
                  n.lables = "counts",
                  show.frequency = F,
                  colors = c("gray", "steelblue4"),
                  title = "**# Proteins identified in each sample**")

protein.counts


Another possibility is to study the protein co-detection by sample groups. A group can be defined using any column of the metadata table. In the output stacked bar plot, each color will show how many proteins in how many samples within a group were detected.

protein.counts.byCondition <- 
  protein.summary(DEprot.object = dpo_analyses,
                  group.column = "condition",
                  n.lables = "percentage",
                  show.frequency = T,
                  x.label.angle = 0,
                  title = "**# Proteins identified per _Condition_**")

protein.counts.byCondition


7.2 Filter proteins

For certain analyses it might be necessary to filter-out/keep only certain specific protein (e.g., keep only nuclear proteins).
With the function filter.proteins is possible to indicate a vector containing the protein IDs (it must correspond to the row.names of the counts table) and chose whether to "keep" or "remove" these proteins.

This function works for both DEprot and DEprot.analyses classes. The output will look exactly as the original, however counts tables (raw, normalized, imputed), count boxplots, PCAs, correlations, differential tables, differential protein counts, violin and MA-plot will be recomputed.

NOTE: we discourage the filtering before normalization/bacth correction and imputation, as well as before limma-base differential analyses. In all these methods are based on the assumption that the proteins ‘available’ are ‘all’ the proteins available, which is not the case.


7.2.1 Filtering examples

Hereafter are shown two examples for keeping nuclear proteins or removing cytoplasmic proteins.

Notice that this is only an example and, since the protein IDs are randomized, it would not applicable for the example dataset availble with DEprot. Furthermore DEprot does not include the packages AnnotationDbi and org.Hs.eg.db, which can be installed throug the Bioconductor portal.


KEEP NUCLEAR PROTEINS

nucleus <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db,
                                 keytype = "GOALL",
                                 keys = "GO:0005634", #nucleus
                                 columns = c("SYMBOL", "UNIPROT"))

dpo_analyses_nuclear <- filter.proteins(DEprot.object = dpo_analyses,
                                        proteins = nucleus$SYMBOL,
                                        mode = "keep")


REMOVE CYTOPLASMIC PROTEINS

cytoplasm <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db,
                                   keytype = "GOALL",
                                   keys = "GO:0005737", #cytoplasm
                                   columns = c("SYMBOL", "UNIPROT"))

dpo_analyses_nuclear <- filter.proteins(DEprot.object = dpo_analyses,
                                        proteins = cytoplasm$SYMBOL,
                                        mode = "remove")


7.3 GSEA and OverRepresentation Analyses (ORA)

7.3.1 Define GenSet

Enrichment analyses can be performed on a full contrast of a DEprot.analyses object (GSEA) or on the subset of proteins that belong to a specific category of diff.status in the results table of the given contrast (ORA).

For this analyses a geneSet table will be required as “reference”. The latter must be a data.frame with two columns: “gs_name” (geneSet name) and “gene_symbol” (corresponding to the protein id). In DEprot we provide a ready-to-use list of protein complexes obtained by the CORUM database.

Notice that the “gene_symbol” values must match the “prot.id” in the result tables. To this goal, the geneset.enrichment function contains the parameter gsub.pattern.prot.id that will be used to remove a pattern from the “prot.id”s.


In the following example we will show how to create a genset table from the CORUM data and use it for GSEA and ORA analyses.

7.3.2 Define geneSet

## GeneSet Enrichment Analyses
data("corum_v5.0", package = "DEprot")

corum_geneSet =
  corum_v5.0 %>%
  dplyr::filter(organism == "Human") %>%
  dplyr::rename(gs_name = complex.name,
                gene_symbol = protein.members) %>%
  dplyr::select(gs_name, gene_symbol)

corum_geneSet
CORUM protein complexes (v5.0)
gs_name gene_symbol
BCL6-HDAC4 complex BCL6
BCL6-HDAC4 complex HDAC4
BCL6-HDAC5 complex BCL6
BCL6-HDAC5 complex HDAC5
BCL6-HDAC7 complex BCL6
BCL6-HDAC7 complex HDAC7
Multisubunit ACTR coactivator complex CREBBP
Multisubunit ACTR coactivator complex EP300
Multisubunit ACTR coactivator complex KAT2B
Multisubunit ACTR coactivator complex NCOA3


7.3.3 Perform enrichment analyses

### GSEA
GSEA.results =
  geneset.enrichment(DEprot.analyses.object = dpo_analyses,
                     contrast = 1,
                     TERM2GENE = corum_geneSet,
                     enrichment.type = "GSEA",
                     gsub.pattern.prot.id = "_HUMAN|;.*",
                     pvalueCutoff = 0.05,
                     pAdjustMethod = "BH",
                     dotplot.n = 10)



### OverRepresentation Analyses (ORA)
ORA.results =
  geneset.enrichment(DEprot.analyses.object = dpo_analyses,
                     contrast = 1,
                     TERM2GENE = corum_geneSet,
                     enrichment.type = "GSEA",
                     gsub.pattern.prot.id = "_HUMAN|;.*",
                     pvalueCutoff = 0.05,
                     qvalueCutoff = 0.05,
                     pAdjustMethod = "BH",
                     diff.status.category = "6h.10nM.E2",
                     dotplot.n = 10)


The output is an object of class DEprot.enrichResult:

  • enrichment.discovery: the direct output from clusterProfiler::GSEA/enricher (GSEA and ORA, respectively);
  • protein.network: a string plot showing protein networks (clusterProfiler::cnetplot);
  • pathway.network: a list with clusters and string plot showing pathway/set networks (aPEAR::enrichmentNetwork);
  • NES.plot (GSEA only): a bar plot showing the NES scores for each significantly enriched geneSet;
  • dotplot_gene.ratio: a dotplot showing the geneRatios for each significantly enriched geneSet;
  • dotplot_fold.enrichment (ORA only): a dotplot showing the foldEnrichment for each significantly enriched geneSet,
  • parameters: a list containing the parameters used to run the analyses.



8 Package information

8.1 Documentation

With the package a detailed PDF manual with details for each function and respective parameters is available.

The R-package have been published on GitHub and a git-pages website is available as well. At both these sites it is possible to find the installation procedure, required dependencies, and the links for changeLog, manual and vignette.


8.2 Package history and releases

A list of all releases and respective description of changes applied could be found here.


8.3 Contact

For any suggestion, bug fixing, commentary please contact fill an Issues/Pull requests form on the DEprot github page.

contributors
contributors


8.4 License

This package is under a GNU General Public License (version 3).