1 Introduction
The concept behind the DEprot (Differential Expression proteomics) is to provide a toolkit that allows for the normalization, imputation and analyses of the differential protein expression in proteomics data. The data are assumed to be LFQ (label-free quantitation) values.
1.1 Citation
If you use this package, please cite:
Citation
“No publication associated yet et al., XYZ (2123).
doi: XYZ/XYZ
citation("DEprot")
> To cite package 'DEprot' in publications use:
>
> Gregoricchio S (????). _DEprot: An R-package for proteomics
> differential analyses_. R package version 0.1.0,
> <https://sebastian-gregoricchio.github.io/DEprot/https://github.com/sebastian-gregoricchio/DEprot/https://sebastian-gregoricchio.github.io/>.
>
> A BibTeX entry for LaTeX users is
>
> @Manual{,
> title = {DEprot: An R-package for proteomics differential analyses},
> author = {Sebastian Gregoricchio},
> note = {R package version 0.1.0},
> url = {https://sebastian-gregoricchio.github.io/DEprot/
> https://github.com/sebastian-gregoricchio/DEprot/
> https://sebastian-gregoricchio.github.io/},
> }
>
> ATTENTION: This citation information has been auto-generated from the
> package DESCRIPTION file and may need manual editing, see
> 'help("citation")'.
1.2 General workflow
Data can be loaded as raw values, pre-normalized unimputed LFQ values or, imputed LFQ values. Further, if multiple batches are available, it is possible to perform batch corrections by combining LFQ uimputed tables and load this data as “raw” in order to perform a batch harmonization.
Using normalized and imputed data, DEprot
will help you
in the differential analyses as well in performing quality controls on
your data (sample correlation, principal components analyses (PCA)).
2 Loading the data
The package starting point is the building of a DEprot
object. This class of objects is specific to this package and requires
at least two elements:
- counts table: this must be a matrix in which the column names represent the samples while the rown ames identify the proteins. The values in the matrix are assumed to be LFQ (label-free quantitation) values, either in linear or log-transformed format.
- metadata: this must be a data.frame containing at least one
column names
column.id
which values correspond to the column names of the counts_table. Any other additional column can be added (cell lines, treatment, condition, timing, etc) and used to define groups for differential and quality control (PCA and correlation) analyses.
In the next paragraph we will refer to a dataset (pre-loaded in
DEprot
), in which a breast cancer (BCa) cell line was
cultured in either hormone-deprived media or in full media (FBS). If
cultered in hormone-deprived condition, it was then treated for 6 hrs
with either \(\beta\)-estradiol (E2) or
vehicle (DMSO). 4 biological replicates have been analyzed. Hence, the
datasets consists of 1 cell lines x 3 conditions x 4 replicates, for a
total of 12 sample. Proteins and samples have been “anonymized”.
2.1 Collect pre-loaded data
The counts represent LFQ values in the log2 format and are not-imputed.
column.id | sample.id | cell | condition | combined.id | replicate |
---|---|---|---|---|---|
Sample_A | BCa_FBS_rep1 | BCa | FBS | BCa_FBS | rep1 |
Sample_B | BCa_6h.DMSO_rep1 | BCa | 6h.DMSO | BCa_6h.DMSO | rep1 |
Sample_C | BCa_6h.10nM.E2_rep1 | BCa | 6h.10nM.E2 | BCa_6h.10nM.E2 | rep1 |
Sample_D | BCa_FBS_rep2 | BCa | FBS | BCa_FBS | rep2 |
Sample_E | BCa_6h.DMSO_rep2 | BCa | 6h.DMSO | BCa_6h.DMSO | rep2 |
Sample_F | BCa_6h.10nM.E2_rep2 | BCa | 6h.10nM.E2 | BCa_6h.10nM.E2 | rep2 |
Sample_G | BCa_FBS_rep3 | BCa | FBS | BCa_FBS | rep3 |
Sample_H | BCa_6h.DMSO_rep3 | BCa | 6h.DMSO | BCa_6h.DMSO | rep3 |
Sample_I | BCa_6h.10nM.E2_rep3 | BCa | 6h.10nM.E2 | BCa_6h.10nM.E2 | rep3 |
Sample_J | BCa_FBS_rep4 | BCa | FBS | BCa_FBS | rep4 |
Sample_K | BCa_6h.DMSO_rep4 | BCa | 6h.DMSO | BCa_6h.DMSO | rep4 |
Sample_L | BCa_6h.10nM.E2_rep4 | BCa | 6h.10nM.E2 | BCa_6h.10nM.E2 | rep4 |
# log2(LFQ) values (not imputed)
data("unimputed.counts", package = "DEprot")
head(unimputed.counts[,1:6])
Sample_A | Sample_B | Sample_C | Sample_D | Sample_E | Sample_F | |
---|---|---|---|---|---|---|
protein.1 | 17.8663 | 18.4581 | 18.7380 | 18.1438 | 18.7811 | 19.3043 |
protein.2 | 22.5238 | 22.9057 | 23.0183 | 22.3577 | 23.0094 | 23.1491 |
protein.3 | 16.1388 | 15.3720 | 15.2374 | 17.0161 | 15.5452 | 15.7899 |
protein.4 | 21.0189 | 21.0513 | 21.0460 | 21.2229 | 21.3352 | 21.0178 |
protein.5 | 19.1038 | 19.4950 | 19.5529 | 18.5697 | 19.3690 | 19.5156 |
protein.6 | 15.8377 | 15.4202 | 15.4765 | 15.9765 | 15.3439 | 15.2335 |
2.2 Build
DEprot
object
Now we will combine the the counts and metadata to create a
DEprot
object (hereafter defined as dpo).
Notice that another import information is whether the data are log
transformed, and if yes, which is the log base. Recommended
transformation is the log2(score + 1).
If data are pre-normalized and/or pre-imputed, it can be indicated
with the corresponding parameters.
If this is not the case, leave
the parameters as NA
(not NULL
!).
Ultimately, if the metadata table does not contain a
column.id
column corresponding to the row names of the
counts_table
, it is possible to indicate the name of
another column that should be assumed as to be the
column.id
.
dpo <- load.counts(counts = unimputed.counts,
metadata = sample.config,
log.base = 2,
imputation = NA,
normalization.method = NA,
column.id = "column.id")
dpo
> DEprot object:
> Samples: 12
> Proteins: 13239
> Counts available: raw
> Log transformation: log2
> Metadata columns: column.id, sample.id, cell, condition, combined.id, replicate
This object is an S4-vector of class DEprot
. The
S4-vectors are containers of slots that can be accessed using the symbol
@
(e.g., object@slot.id
.
The structure of
an object of class DEprot
(and
DEprot.analyses
) is the following:
Slot | Description |
---|---|
raw.counts | table containing the raw counts, if not available it
will be NULL |
norm.counts | table containing the normalized counts, if not
available it will be NULL |
imputed.counts | table containing the imputed counts, if not available
it will be NULL |
log.base | a number indicating the base of the log used to
transform the table, if not available it will be NA |
log.transformed | logical value indicating whether the data are log-transformed or not |
imputed | logical value indicating whether the data are imputed or not |
imputation | discussed further in the Imputation paragraph |
normalized | logical value indicating whether the data are normalized or not |
normalization.method | a string indicating the type of normalization applied,
if not available it will be NA |
boxplot.raw | box+violin plot of the distribution of the LFQ intensities per sample obtained from the raw counts |
boxplot.norm | box+violin plot of the distribution of the LFQ intensities per sample obtained from the normalized counts |
boxplot.imputed | box+violin plot of the distribution of the LFQ intensities per sample obtained from the imputed counts |
analyses.result.list | discussed further in the Differential Expression analyses paragraph |
contrasts | discussed further in the Differential Expression analyses paragraph |
differential.analyses.params | discussed further in the Differential Expression analyses paragraph |
2.3 Rename sample columns
As in our example, sometimes the columns of the counts are not the
actual ID of the samples, but rather an identifier. However, it is
possible to rename the counts column names indicating any column of the
metadata table (having unique values). The original identifiers are
stored in a new column (old.column.id
) of the metadata.
Notice that the renaming will applied to all counts table available.
dpo <- rename.samples(DEprot.object = dpo,
metadata.column = "sample.id")
get.metadata(dpo)
> column.id sample.id cell condition combined.id
> 1 BCa_FBS_rep1 BCa_FBS_rep1 BCa FBS BCa_FBS
> 2 BCa_6h.DMSO_rep1 BCa_6h.DMSO_rep1 BCa 6h.DMSO BCa_6h.DMSO
> 3 BCa_6h.10nM.E2_rep1 BCa_6h.10nM.E2_rep1 BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 4 BCa_FBS_rep2 BCa_FBS_rep2 BCa FBS BCa_FBS
> 5 BCa_6h.DMSO_rep2 BCa_6h.DMSO_rep2 BCa 6h.DMSO BCa_6h.DMSO
> 6 BCa_6h.10nM.E2_rep2 BCa_6h.10nM.E2_rep2 BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 7 BCa_FBS_rep3 BCa_FBS_rep3 BCa FBS BCa_FBS
> 8 BCa_6h.DMSO_rep3 BCa_6h.DMSO_rep3 BCa 6h.DMSO BCa_6h.DMSO
> 9 BCa_6h.10nM.E2_rep3 BCa_6h.10nM.E2_rep3 BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 10 BCa_FBS_rep4 BCa_FBS_rep4 BCa FBS BCa_FBS
> 11 BCa_6h.DMSO_rep4 BCa_6h.DMSO_rep4 BCa 6h.DMSO BCa_6h.DMSO
> 12 BCa_6h.10nM.E2_rep4 BCa_6h.10nM.E2_rep4 BCa 6h.10nM.E2 BCa_6h.10nM.E2
> replicate old.column.id
> 1 rep1 Sample_A
> 2 rep1 Sample_B
> 3 rep1 Sample_C
> 4 rep2 Sample_D
> 5 rep2 Sample_E
> 6 rep2 Sample_F
> 7 rep3 Sample_G
> 8 rep3 Sample_H
> 9 rep3 Sample_I
> 10 rep4 Sample_J
> 11 rep4 Sample_K
> 12 rep4 Sample_L
head(dpo@raw.counts[,1:6])
> BCa_FBS_rep1 BCa_6h.DMSO_rep1 BCa_6h.10nM.E2_rep1 BCa_FBS_rep2
> protein.1 17.8663 18.4581 18.7380 18.1438
> protein.2 22.5238 22.9057 23.0183 22.3577
> protein.3 16.1388 15.3720 15.2374 17.0161
> protein.4 21.0189 21.0513 21.0460 21.2229
> protein.5 19.1038 19.4950 19.5529 18.5697
> protein.6 15.8377 15.4202 15.4765 15.9765
> BCa_6h.DMSO_rep2 BCa_6h.10nM.E2_rep2
> protein.1 18.7811 19.3043
> protein.2 23.0094 23.1491
> protein.3 15.5452 15.7899
> protein.4 21.3352 21.0178
> protein.5 19.3690 19.5156
> protein.6 15.3439 15.2335
3 Data normalization
When a DEprot
object is loaded, automatically a
box/violin plot showing the distribution of the LFQ values per samples
is generated.
This representation is useful to estimate whether the
data are normalized or not.
Boxplots display the quantiles of the LFQ intensities, while red and blue dahsed lines correspond to maximum and minimum LFQ value for each sample.
In this package we apply the Modified Balanced Quantile Normalization
(MBQN) from the MBQN
package and developed by E.Brombacher et
al. (Proteomics, 2020). The modification balances the
median (or mean) intensity of features (rows) which are rank invariant
(RI) or nearly rank invariant (NRI) across samples (columns) before
quantile normalization. This prevents an over-correction of the
intensity profiles of RI and NRI features by classical quantile
normalization and therefore supports the reduction of systematics in
downstream analyses.
dpo <- normalize.counts(DEprot.object = dpo,
NRI.RI.ratio.threshold = 0.5,
balancing.function = "median")
dpo
> DEprot object:
> Samples: 12
> Proteins: 13239
> Counts available: raw, normalized
> Log transformation: log2
> Metadata columns: column.id, sample.id, cell, condition, combined.id, replicate, old.column.id
dpo@normalization.method
> param value
> 1 package MBQN
> 2 method Quantile normalization
> 3 balanced TRUE
> 4 function median
> 5 NRI/RI ratio threshold 0.5
head(dpo@raw.counts[,1:6])
> BCa_FBS_rep1 BCa_6h.DMSO_rep1 BCa_6h.10nM.E2_rep1 BCa_FBS_rep2
> protein.1 17.8663 18.4581 18.7380 18.1438
> protein.2 22.5238 22.9057 23.0183 22.3577
> protein.3 16.1388 15.3720 15.2374 17.0161
> protein.4 21.0189 21.0513 21.0460 21.2229
> protein.5 19.1038 19.4950 19.5529 18.5697
> protein.6 15.8377 15.4202 15.4765 15.9765
> BCa_6h.DMSO_rep2 BCa_6h.10nM.E2_rep2
> protein.1 18.7811 19.3043
> protein.2 23.0094 23.1491
> protein.3 15.5452 15.7899
> protein.4 21.3352 21.0178
> protein.5 19.3690 19.5156
> protein.6 15.3439 15.2335
Also in this case a box/violin plot with the corresponding normalized LFQ values per each sample is generated and stored in a vector slot.
3.1 Batch effect correction
Proteomics is highly susceptible to batch effects. Hereafter we use
the package HarmonizR
package, developed by H.Voß &
S.Schlumbohm (Nat.Commun., 2022). This tool besides
handling multiple experiments, allows for the batch correction of data
derived by the combination of both DIA (Data-Independent Acquisition)
and DDA (Data-Dependent Acquisition).
HarmonizR
is not a required dependency, therefore if this fucntion is used andHarmonizR
is not already installed, a warning will indicate the requred installation. The package can be installed by:BiocManager::install("HarmonizR")
, or alternativelydevtools::install_github("https://github.com/SimonSchlumbohm/HarmonizR/", subdir = "HarmonizR")
.
To use the harmonize.batches
is sufficient to provide a
DEprot
object containing a combined table of multiple
experiments and indicate the identifier of a column in the metadata
table which corresponds to the batch annotation. The result is a DEprot
object with the same structure described in the previous paragraph.
Here, for simplicity, we will add manually a column to the metadata
table with “dummy” batches.
## Adding batch column to the metadata table
dpo@metadata$batch = rep(c("batch_1","batch_2"), each = 6)
get.metadata(dpo)
> column.id sample.id cell condition combined.id
> 1 BCa_FBS_rep1 BCa_FBS_rep1 BCa FBS BCa_FBS
> 2 BCa_6h.DMSO_rep1 BCa_6h.DMSO_rep1 BCa 6h.DMSO BCa_6h.DMSO
> 3 BCa_6h.10nM.E2_rep1 BCa_6h.10nM.E2_rep1 BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 4 BCa_FBS_rep2 BCa_FBS_rep2 BCa FBS BCa_FBS
> 5 BCa_6h.DMSO_rep2 BCa_6h.DMSO_rep2 BCa 6h.DMSO BCa_6h.DMSO
> 6 BCa_6h.10nM.E2_rep2 BCa_6h.10nM.E2_rep2 BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 7 BCa_FBS_rep3 BCa_FBS_rep3 BCa FBS BCa_FBS
> 8 BCa_6h.DMSO_rep3 BCa_6h.DMSO_rep3 BCa 6h.DMSO BCa_6h.DMSO
> 9 BCa_6h.10nM.E2_rep3 BCa_6h.10nM.E2_rep3 BCa 6h.10nM.E2 BCa_6h.10nM.E2
> 10 BCa_FBS_rep4 BCa_FBS_rep4 BCa FBS BCa_FBS
> 11 BCa_6h.DMSO_rep4 BCa_6h.DMSO_rep4 BCa 6h.DMSO BCa_6h.DMSO
> 12 BCa_6h.10nM.E2_rep4 BCa_6h.10nM.E2_rep4 BCa 6h.10nM.E2 BCa_6h.10nM.E2
> replicate old.column.id batch
> 1 rep1 Sample_A batch_1
> 2 rep1 Sample_B batch_1
> 3 rep1 Sample_C batch_1
> 4 rep2 Sample_D batch_1
> 5 rep2 Sample_E batch_1
> 6 rep2 Sample_F batch_1
> 7 rep3 Sample_G batch_2
> 8 rep3 Sample_H batch_2
> 9 rep3 Sample_I batch_2
> 10 rep4 Sample_J batch_2
> 11 rep4 Sample_K batch_2
> 12 rep4 Sample_L batch_2
## batch correction
dpo <- harmonize.batches(DEprot.object = dpo,
batch.column = "batch",
cores = 1)
4 Data imputation
Often many NA
/NaN
values are present in the
LFQ tables due to the technical limitations of the protein detection by
Mass Spectrometry (MS) experiments.
Here we provide the possibility to use 4 different methods for the imputation the data:
missForest
: developed by DJ.Stekhoven & P.Buehlmann (Bioinformatics, 2012), this tool will impute the NaN and assign and estimated value. It also yields an out-of-bag (OOB) imputation error estimate (general, or per each sample). Moreover, it can be run parallel to save computation time (both examples reported here after).- k-Nearest Neighbors (kNN) algorithm
- Local Least Squares (LLS) imputation, using the pearson correlation coefficient
- Singular Value Decomposition (SVD)
Of note, there are cases in which the missing data are not due to the
random but rather consequent to real biological effects (i.e.,
knock-down, knock-out, degron systems and PROTACs, etc.). This may lead
to the issue that a protein completely missing in all the replicates of
an individual condition will be imputed with relatively high numbers.
This would ultimately lead to the incorrect definition of the
differential status of this protein. To overcome this possibility, it is
possible to use the function randomize.missing.values
to
assign - only to completely missing proteins in a certain condition -
random values from the bottom distribution of the full dataset. It will
be sufficient to indicate the ID of a column belonging to the metadata
which indicates the groups/conditions and the percentage of distribution
from which select random numbers (“tail.percentage”).
dpo <- randomize.missing.values(DEprot.object = dpo,
group.column = "combined.id",
tail.percentage = 5)
Hereafeter, an example in how to use the missForest
method to impute the data:
## Without parallelization
dpo <- impute.counts(DEprot.object = dpo,
method = "missForest",
use.normalized.data = TRUE,
missForest.max.iterations = 100,
missForest.variable.wise.OOBerror = T)
## With parallelization
dpo <- impute.counts(DEprot.object = dpo,
method = "missForest",
use.normalized.data = TRUE,
missForest.max.iterations = 100,
missForest.variable.wise.OOBerror = TRUE,
missForest.cores = 10,
missForest.parallel.mode = "variables")
dpo
head(dpo@imputation$OOBerror)
data.frame(dpo@imputation[-3])
> DEprot object:
> Samples: 12
> Proteins: 13239
> Counts available: raw, normalized, imputed
> Log transformation: log2
> Metadata columns: column.id, sample.id, cell, condition, combined.id, replicate, old.column.id
> BCa_FBS_rep1 BCa_6h.DMSO_rep1 BCa_6h.10nM.E2_rep1 BCa_FBS_rep2
> 0 0 0 0
> BCa_6h.DMSO_rep2 BCa_6h.10nM.E2_rep2
> 0 0
method | max.iterations | parallelization.mode | cores | processing.time |
---|---|---|---|---|
missForest | 100 | variables | 12 | 3.02 mins |
Also in this case a box/violin plot with the corresponding imputed LFQ values per each sample is generated and stored in a vector slot.
5 Sample similarities
5.1 Principal Component Analyses (PCA)
PCA can be performed in order to perform a dimensional reduction and
determine which factor explains the variability of the samples.
DEprot
includes function dedicated to this aim and
specifically build to work with DEprot
objects.
Notice that, even if the data are not log-transformed,
perform.PCA
will do it before performing the analyses.
5.1.1 Compute PCs
## Perform the analyses (DEprot.PCA object)
PCA <- perform.PCA(DEprot.object = dpo,
which.data = "imputed") # possible: raw, normalized, imputed
The DEprot.PCA
object contains the following slots:
Slot | Description |
---|---|
PCA.metadata | metadata of the samples used in the PCA (subset of the
original DEprot@metadata ) |
sample.subset | vector containing the list of samples analyzed |
data.used | vector indicating the type of counts used (imputed, normalized, raw) |
prcomp | object of class prcomp corresponding to
the full PCA output |
PCs | data.frame combining the PC scores and the
metadata table, useful for replotting |
importance | statistical summary table for the PCA analyses per each PC |
cumulative.PC.plot | ggplot object corresponding to out put of
plot.PC.cumulative for this object |
5.1.2 Visualize PCAs
## Plot cumulative variance of all PCs
#### equivalent to `PCA@cumulative.PC.plot`
plot.PC.cumulative(DEprot.PCA.object = PCA,
bar.color = "steelblue",
line.color = "navyblue")
## Plot PC scatters
PC_1.2 <-
plot.PC.scatter(DEprot.PCA.object = PCA,
PC.x = 1,
PC.y = 2,
color.column = "condition",
shape.column = "replicate",
label.column = NULL,
plot.zero.lines = F) +
geom_hline(yintercept = 0, color = "gray", linetype = "dashed") +
theme(legend.position = "none")
PC_2.3 <-
plot.PC.scatter(DEprot.PCA.object = PCA,
PC.x = 2,
PC.y = 3,
color.column = "condition",
shape.column = "replicate",
label.column = NULL,
plot.zero.lines = TRUE)
patchwork::wrap_plots(PC_1.2, PC_2.3, nrow = 1)
Alternatively, the combination of PC1.2 and PC2.3 can be
automatically generated using the function
plot.PC.scatter.123
:
plot.PC.scatter.123(DEprot.PCA.object = PCA,
color.column = "condition",
shape.column = "replicate",
label.column = "replicate",
dot.colors = c("6h.10nM.E2" = "indianred",
"6h.DMSO" = "steelblue",
"FBS" = "forestgreen"),
plot.zero.line.y.12 = TRUE,
plot.zero.line.x.12 = FALSE,
plot.zero.line.y.23 = TRUE,
plot.zero.line.x.23 = TRUE)
5.1.3 Analyze PCs on a sample subset
These analyses can also be performed for a subset of samples by
indicated the sample names of interest.
In the example below we will
use only the sample in which the estrogen receptor is active (E2 and FBS
conditions).
## Perform the analyses (DEprot.PCA object)
PCA.fbs.e2 <-
perform.PCA(DEprot.object = dpo,
sample.subset = dpo@metadata$column.id[grepl("E2|FBS",
dpo@metadata$column.id)],
which.data = "imputed")
## Plot cumulative variance of all PCs
plot.PC.cumulative(DEprot.PCA.object = PCA.fbs.e2,
bar.color = "indianred",
line.color = "firebrick4",
title = "**Only ERa active**")
## Plot PC scatters
PC.fbs.e2_1.2 <-
plot.PC.scatter(DEprot.PCA.object = PCA.fbs.e2,
PC.x = 1,
PC.y = 2,
color.column = "condition",
shape.column = "replicate",
label.column = NULL,
plot.zero.lines = F) +
geom_hline(yintercept = 0, color = "gray", linetype = "dashed") +
theme(legend.position = "none")
PC.fbs.e2_2.3 <-
plot.PC.scatter(DEprot.PCA.object = PCA.fbs.e2,
PC.x = 2,
PC.y = 3,
color.column = "condition",
shape.column = "replicate",
label.column = NULL,
plot.zero.lines = T)
patchwork::wrap_plots(PC.fbs.e2_1.2, PC.fbs.e2_2.3, nrow = 1)
5.2 Correlations
Another method to define the sample clustering/groups, is the overall
correlation between the samples.
Hierarchical clustering is
performed using the 1 - correlation
values, since the
hierarchical clustering algorithm is based on dissimilarities while the
correlations are an index of similarity.
corr.all.samples <-
plot.correlation.heatmap(DEprot.object = dpo,
which.data = "imputed",
palette = viridis::mako(n = 10, direction = -1, begin = 0.25),
correlation.scale.limits = c(0.9,1),
correlation.method = "pearson",
plot.subtitle = "All samples",
display.values = TRUE)
corr.all.samples
Also in this case the sample correlation can be computed for a subset of samples as shown before for the PCAs.
corr.ERa.active <-
plot.correlation.heatmap(DEprot.object = dpo,
which.data = "imputed",
sample.subset = dpo@metadata$column.id[grepl("E2|FBS",
dpo@metadata$column.id)],
palette = viridis::magma(n = 10, direction = -1, begin = 0.25),
correlation.scale.limits = c(0.9,1),
correlation.method = "pearson",
plot.subtitle = "Only ERa active",
clustering.method = "complete",
display.values = TRUE)
corr.ERa.active
The DEprot.correlation
correlation object contains the
following slots:
Slot | Description |
---|---|
heatmap | ggplot object corresponding to the
correlation heatmap |
corr.metadata | metadata of the samples used in the correlation (subset
of the original DEprot@metadata ) |
sample.subset | vector containing the list of samples analyzed |
data.used | vector indicating the type of counts used (imputed, normalized, raw) |
corr.matrix | the correlation matrix on which the heatmap is base on |
distance | object of class dist corresponding to the
output of as.dist(1 - correlation.matrix) |
cluster | hclust object generated by
hclust(d = as.dist(1 - correlation.matrix), method = clustering.method) |
6 Differential Expression (DE) analyses
Differential expression analyses between two conditions can be
performed employing two different methods using the functions
diff.analyses
and diff.analyses.limma
. With
the first (diff.analyses
) the conditions will be compared
two-by-two (individual t-/Wilcox tests), while with the second
(diff.analyses.limma
) the limma
package is
used to fit the data on a linear model.
In both cases, it is sufficient to provide a list of 3-elements
vectors. The latter, should indicate any column of the metadata table
(grouping factor) and two values (groups) to compare within this column.
The first variable will be numerator and the second the denominator of
the fold change:
c("group.column", "condition.A", "condition.B")
, FoldChange
= group A/B.
When a replicate column is available, it is possible to run the
analyses in paired mode. When using diff.analyses
,
it is sufficient to indicate the column from the metadata table that
corresponds to the replicate identifiers (in our example
replicate.column = "replicate"
) and set
paired.test = TRUE
. For each contrast it will be verified
that replicate identifiers are not repeated within a group and, that
replicate number and identifiers coincide between the two groups. By the
default analyses ignore the replicates.
If
diff.analyses.limma
is used instead, the correlation
between replicates will be estimated and incorporated in the fitting
model. In this case indicate the replicate column identifier (in our
example replicate.column = "replicate"
) and set
include.rep.model = TRUE
.
In the following example we will compare 6h.10nM.E2 vs 6h.DMSO, and 6h.10nM.E2 vs FBS. Both paired and unpaired examples are shown, but further analyses are based on the paired results only (multi t-test mode).
LIMMA mode
## Unpaired test
dpo_analyses <- diff.analyses.limma(DEprot.object = dpo,
contrast.list = list(c("condition", "6h.10nM.E2", "6h.DMSO"),
c("condition", "6h.10nM.E2", "FBS")),
linear.FC.th = 2,
padj.th = 0.05,
padj.method = "BH",
fitting.method = "ls",
which.data = "imputed")
## Paired test
dpo_analyses <- diff.analyses.limma(DEprot.object = dpo,
contrast.list = list(c("condition", "6h.10nM.E2", "6h.DMSO"),
c("condition", "6h.10nM.E2", "FBS")),
replicate.column = "replicate",
include.rep.model = TRUE,
linear.FC.th = 2,
padj.th = 0.05,
padj.method = "BH",
fitting.method = "ls",
which.data = "imputed")
Multiple t-test mode (next paragraphs use these results, in paired mode)
## Unpaired test
dpo_analyses <- diff.analyses(DEprot.object = dpo,
contrast.list = list(c("condition", "6h.10nM.E2", "6h.DMSO"),
c("condition", "6h.10nM.E2", "FBS")),
linear.FC.th = 2,
padj.th = 0.05,
padj.method = "bonferroni",
stat.test = "t.test",
which.data = "imputed")
## Paired test
dpo_analyses <- diff.analyses(DEprot.object = dpo,
contrast.list = list(c("condition", "6h.10nM.E2", "6h.DMSO"),
c("condition", "6h.10nM.E2", "FBS")),
replicate.column = "replicate",
paired.test = TRUE,
linear.FC.th = 2,
padj.th = 0.05,
padj.method = "bonferroni",
stat.test = "t.test",
which.data = "imputed")
dpo_analyses
> DEprot.analyses object:
> Counts used: imputed
> Fold Change threshold: 2 (linear)
> FC unresponsive range: [0.9090909,1.1] (linear)
> padj threshold: 0.05 (linear)
> padj method: bonferroni
>
>
> Differential results summary:
> contrast.id group.factor group1 group2 paired.test
> 1 condition: 6h.10nM.E2 vs 6h.DMSO condition 6h.10nM.E2 6h.DMSO TRUE
> 2 condition: 6h.10nM.E2 vs 6h.DMSO condition 6h.10nM.E2 6h.DMSO TRUE
> 3 condition: 6h.10nM.E2 vs 6h.DMSO condition 6h.10nM.E2 6h.DMSO TRUE
> 4 condition: 6h.10nM.E2 vs 6h.DMSO condition 6h.10nM.E2 6h.DMSO TRUE
> 5 condition: 6h.10nM.E2 vs FBS condition 6h.10nM.E2 FBS TRUE
> 6 condition: 6h.10nM.E2 vs FBS condition 6h.10nM.E2 FBS TRUE
> 7 condition: 6h.10nM.E2 vs FBS condition 6h.10nM.E2 FBS TRUE
> 8 condition: 6h.10nM.E2 vs FBS condition 6h.10nM.E2 FBS TRUE
> diff.status n median.FoldChange
> 1 6h.DMSO 0 NA
> 2 6h.10nM.E2 0 NA
> 3 unresponsive 8557 0.004011163
> 4 null 4682 0.150392374
> 5 FBS 0 NA
> 6 6h.10nM.E2 0 NA
> 7 unresponsive 2368 -0.010256032
> 8 null 10871 -0.187923718
The summary can be collected by using the generic function
summary
:
6.1 DE results
The output will be a DEprot.analyses
object. This class
is similar to the base DEprot
one, however 3 slots are now
available:
- contrasts: corresponds to the list used to define the contrasts, but includes also the IDs of the counts matrix belonging to each subgroup and whether the statistical test was performed in paired mode.
- differential.analyses.params: a list containing the core parameters used for the differential expression analyses.
- analyses.result.list: a list with an element for each contrast including all the results of the differential analyses (see below for details).
The analyses.result.list
, for each contrast, stores a
list with the following elements:
Element | Description |
---|---|
results | a data.frame containing the results of the analyses; includes average expression of each group, basemean, foldchange, pvalue and p.adj, differential.status |
n.diff | a summary table showing the number of proteins in each differential expression status (up/down/unresponsive, null) |
PCA.data | output of perform.PCA for the subset of
samples analyzed in a specific contrast |
PCA.plots | combination of 3 plots: scatter PC1-vs-PC2, scatter PC2-vs-PC3, and cumulative bar plot |
correlations | combination of Pearson and Spearman correlation
heatmaps (obtained by plot.correlation.heatmap ) for the
subset of samples analyzed in a specific contrast |
volcano | volcano plot showing the log2(FoldChange) x
-log10(p.adjusted) of differential expression results; it can be
regenerated using plot.volcano |
MA.plot | MA-plot showing the log2(basemean) x log2(FoldChange)
of differential expression results; it can be regenerated using
plot.MA |
If the function diff.analyses.limma
was used, an
additional slot called limma.fit
will be available and will
contain the results of limma::contrasts.fit
.
6.1.1 DE table
The table with the results of the differential analyses can be
retrieved directly from the list in the DEprot.analyses
object
(dpo_analyses@analyses.result.list$contrast.id$results
) or
using the get.results
function.
## Direct access
results = dpo_analyses@analyses.result.list$condition_6h.10nM.E2.vs.6h.DMSO$results
## Function
results = get.results(dpo_analyses, contrast = 1)
head(results)
prot.id | basemean.log2 | log2.mean.6h.10nM.E2 | log2.mean.6h.DMSO | log2.Fold_6h.10nM.E2.vs.6h.DMSO | p.value | padj | diff.status |
---|---|---|---|---|---|---|---|
protein.1 | 18.85442 | 18.91504 | 18.79380 | 0.1212396 | 0.6342736 | 1 | unresponsive |
protein.2 | 23.07816 | 23.08739 | 23.06894 | 0.0184512 | 0.8673988 | 1 | unresponsive |
protein.3 | 15.68530 | 15.65794 | 15.71266 | -0.0547179 | 0.7188334 | 1 | unresponsive |
protein.4 | 21.14521 | 21.09930 | 21.19112 | -0.0918296 | 0.2081734 | 1 | unresponsive |
protein.5 | 19.56671 | 19.62718 | 19.50624 | 0.1209374 | 0.1169272 | 1 | unresponsive |
protein.6 | 15.31226 | 15.32551 | 15.29901 | 0.0264997 | 0.6433742 | 1 | unresponsive |
6.1.2 PCA and correlation within the comparison
The DEprot.analyses
object includes PCA and correlation
analyses of the samples involved in the contrast.
6.1.3 Visualize DE analyses
6.1.3.1 MA and volcano
Differential expressed proteins can be visualized as either a volcano
plot or an MA-plot.
Both these plots are available in the
dpo_analyses@analyses.result.list$contrast.id
list, but can
also be generated using the functions plot.volcano
and
plot.MA
.
Of note, if
use.uncorrected.pvalue = TRUE
, the normal p-value will be used instead of the p.adjusted. In this case the FoldChange and p-value thresholds are collected from theDEprot.analyses
object and reapplied to compute the new differential status of the proteins.
volcano = plot.volcano(dpo_analyses, contrast = 1, use.uncorrected.pvalue = TRUE)
MAplot = plot.MA(dpo_analyses, contrast = 1, use.uncorrected.pvalue = TRUE)
patchwork::wrap_plots(volcano, MAplot)
6.1.3.2 Heatmaps
To plot results heatmaps there are two functions available:
heatmap.counts
: allows for the plotting of raw, normalized or imputed counts.heatmap.contrasts
: allows for the plotting of log2(FoldChange) values.
6.1.3.2.1 Counts
With the heatmap.counts
function it is possible to
generate an heatmap for the raw/normalized/imputed counts. It is
possible to select only specific proteins and/or subset samples (from
column.id
column in metadata table). If the object provided
is a DEprot.analyses
object, it is also possible to
indicate to plot only differential proteins (or top.n differential
proteins) from a specific contrast.
The protein ranking definition
is based on the differential score, computed as:
abs(log2(fold change)) * -log10(Padj)
.
The resulting object is of class DEprot.counts.heatmap
,
and contains the ggplot-heatmap, the row cluster and the column
cluster.
## Plotting from a DEprot object
imputed_counts_heatmap <-
heatmap.counts(DEprot.object = dpo,
which.data = "imputed",
sample.subset = dpo@metadata$column.id[grep("6h", dpo@metadata$column.id)],
show.protein.names = TRUE,
protein.subset = c("protein.2295", "protein.304", "protein.657",
"protein.2819", "protein.2168", "protein.10594"),
title = "Imputed counts | protein and sample selection")
## Plotting from a DEprot.analyses object
## top 15 differential proteins from contrast 1
imputed_counts_heatmap_diffProteins <-
heatmap.counts(DEprot.object = dpo_analyses,
which.data = "imputed",
contrast = 1,
top.n = 15,
palette = viridis::mako(n = 100, direction = -1),
cell.border.color = "white",
show.protein.names = TRUE,
sample.subset = dpo@metadata$column.id[grep("6h", dpo@metadata$column.id)],
use.uncorrected.pvalue = TRUE,
title = "condition: **6h.10nM.E2** *vs* **6h.DMSO** (top 15) | Imputed counts")
## Combine heatmaps
patchwork::wrap_plots(imputed_counts_heatmap@heatmap,
imputed_counts_heatmap_diffProteins@heatmap)
Instead of “pure” counts, it is possibile to compute Z-scores of the
values and plot a divergent heatmap.
In this case instead of a
palette, it will be required to indicate the color for the positive,
negative and 0 values.
We will apply this to the
imputed_counts_heatmap_diffProteins
heatmap from the
previous example.
## Z-score by row
imputed_counts_heatmap_diffProteins_rowScaled <-
heatmap.counts(DEprot.object = dpo_analyses,
which.data = "imputed",
contrast = 1,
top.n = 15,
high.color = "purple4",
low.color = "darkorange",
mid.color = "white",
cell.border.color = "white",
show.protein.names = TRUE,
sample.subset = dpo@metadata$column.id[grep("6h", dpo@metadata$column.id)],
use.uncorrected.pvalue = TRUE,
scale = "rows",
title = "condition: **6h.10nM.E2** *vs* **6h.DMSO** (top 15)<br>Imputed counts Z-score")
## Z-score by column
imputed_counts_heatmap_diffProteins_columnScaled <-
heatmap.counts(DEprot.object = dpo_analyses,
which.data = "imputed",
contrast = 1,
top.n = 15,
high.color = "firebrick",
low.color = "steelblue",
mid.color = "white",
cell.border.color = "white",
show.protein.names = TRUE,
sample.subset = dpo@metadata$column.id[grep("6h", dpo@metadata$column.id)],
use.uncorrected.pvalue = TRUE,
scale = "columns",
title = "condition: **6h.10nM.E2** *vs* **6h.DMSO** (top 15)<br>Imputed counts Z-score")
## Combine heatmaps
patchwork::wrap_plots(imputed_counts_heatmap_diffProteins_rowScaled@heatmap,
imputed_counts_heatmap_diffProteins_columnScaled@heatmap)
Ultimately, With this function is also possible to group data based on the values of a metadata’s column. In this way it is possible to average the values by the values of the defined column (e.g., the replicates of a condition).
In the next example we will average the replicates of each condition
using the combined.id
column form the metadata of the
dpo_analyses
object.
imputed_counts_heatmap_diffProteins_rowScaled_grouped.by.condition <-
heatmap.counts(DEprot.object = dpo_analyses,
group.by.metadata.column = "combined.id",
which.data = "imputed",
contrast = 1,
high.color = "firebrick",
low.color = "steelblue",
mid.color = "white",
cell.border.color = "white",
show.protein.names = TRUE,
use.uncorrected.pvalue = TRUE,
scale = "rows",
title = "condition: **6h.10nM.E2** *vs* **6h.DMSO** (all)<br>Imputed counts Z-score")
imputed_counts_heatmap_diffProteins_rowScaled_grouped.by.condition@heatmap
6.1.3.2.2 Fold changes
Instead of counts, it is possible to plot the fold changes of one or
multiple contrasts using the function heatmap.contrasts
.
Only differential proteins are shown.
Also in this case it is
possible to show only the top N differential proteins. However
the top N of each contrast will be shown. This means than
N proteins might be shown.
FC_heatmap <-
heatmap.contrasts(DEprot.analyses.object = dpo_analyses,
contrasts = c(1:2),
top.n = 20,
high.color = "#35978F",
low.color = "#BF812D",
mid.color = "white",
show.protein.names = TRUE,
use.uncorrected.pvalue = TRUE)
FC_heatmap@heatmap
6.1.3.3 Upset-plot
To identify which differential proteins are common among multiple
comparisons, it is possible to generate an upset plot using the function
plot.upset
. With this function all the intersections
between all the contrasts including in the provided object can be
visualized by co-occurrence. It is also possible to subset only specific
contrasts indicating a numeric vector in the flag
sort.intersections
.
upset.plot <- plot.upset(DEprot.analyses.object = dpo_analyses,
contrast.subset = c(1,2),
title = "**My upset plot**",
sort.intersections = "cardinality",
sort.sets = "descending",
intersection.bar.color = "navy",
setsize.bar.color = "black",
show.counts = T,
height.ratio = 0.5,
width.ratio = 0.4,
use.uncorrected.pvalue = TRUE)
upset.plot # or upset.plot@upset
Besides the plot, also a TRUE
/FALSE
table
is returned. The latter shows for each condition if a protein has been
found to be differential. Only proteins that are differential in at
least one condition are shown.
prot.id | condition: 6h.10nM.E2 vs 6h.DMSO | 6h.10nM.E2 | condition: 6h.10nM.E2 vs 6h.DMSO | 6h.DMSO | condition: 6h.10nM.E2 vs FBS | FBS | condition: 6h.10nM.E2 vs FBS | 6h.10nM.E2 |
---|---|---|---|---|
protein.17 | FALSE | FALSE | TRUE | FALSE |
protein.25 | FALSE | FALSE | FALSE | TRUE |
protein.44 | FALSE | FALSE | FALSE | TRUE |
protein.48 | FALSE | FALSE | TRUE | FALSE |
protein.53 | FALSE | FALSE | FALSE | TRUE |
7 Other tools
7.1 Count detected proteins
In order to know how many proteins have been detected in each sample,
it is possible to use the function protein.summary
. A
stacked bar plot will be generated per each sample. The input object
must belong to either DEprot
or
DEprot.analyses
classes.
NOTE: this plot is meaningfult only on raw and/or unimputed normalized data. Indeed, estimation of missing values is not possibile on the imputed tables.
protein.counts <-
protein.summary(DEprot.object = dpo_analyses,
n.lables = "counts",
show.frequency = F,
colors = c("gray", "steelblue4"),
title = "**# Proteins identified in each sample**")
protein.counts
Another possibility is to study the protein co-detection by sample groups. A group can be defined using any column of the metadata table. In the output stacked bar plot, each color will show how many proteins in how many samples within a group were detected.
protein.counts.byCondition <-
protein.summary(DEprot.object = dpo_analyses,
group.column = "condition",
n.lables = "percentage",
show.frequency = T,
x.label.angle = 0,
title = "**# Proteins identified per _Condition_**")
protein.counts.byCondition
7.2 Filter proteins
For certain analyses it might be necessary to filter-out/keep only
certain specific protein (e.g., keep only nuclear proteins).
With
the function filter.proteins
is possible to indicate a
vector containing the protein IDs (it must correspond to the row.names
of the counts table) and chose whether to "keep"
or
"remove"
these proteins.
This function works for both DEprot
and
DEprot.analyses
classes. The output will look exactly as
the original, however counts tables (raw, normalized, imputed), count
boxplots, PCAs, correlations, differential tables, differential protein
counts, violin and MA-plot will be recomputed.
NOTE: we discourage the filtering before normalization/bacth correction and imputation, as well as before limma-base differential analyses. In all these methods are based on the assumption that the proteins ‘available’ are ‘all’ the proteins available, which is not the case.
7.2.1 Filtering examples
Hereafter are shown two examples for keeping nuclear proteins or
removing cytoplasmic proteins.
Notice that this is only an example and, since the protein IDs are randomized, it would not applicable for the example dataset availble with
DEprot
. FurthermoreDEprot
does not include the packagesAnnotationDbi
andorg.Hs.eg.db
, which can be installed throug theBioconductor
portal.
KEEP NUCLEAR PROTEINS
nucleus <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db,
keytype = "GOALL",
keys = "GO:0005634", #nucleus
columns = c("SYMBOL", "UNIPROT"))
dpo_analyses_nuclear <- filter.proteins(DEprot.object = dpo_analyses,
proteins = nucleus$SYMBOL,
mode = "keep")
REMOVE CYTOPLASMIC PROTEINS
cytoplasm <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db,
keytype = "GOALL",
keys = "GO:0005737", #cytoplasm
columns = c("SYMBOL", "UNIPROT"))
dpo_analyses_nuclear <- filter.proteins(DEprot.object = dpo_analyses,
proteins = cytoplasm$SYMBOL,
mode = "remove")
7.3 GSEA and OverRepresentation Analyses (ORA)
7.3.1 Define GenSet
Enrichment analyses can be performed on a full contrast of a
DEprot.analyses
object (GSEA) or on the subset of proteins
that belong to a specific category of diff.status
in the
results table of the given contrast (ORA).
For this analyses a geneSet table will be required as “reference”.
The latter must be a data.frame with two columns: “gs_name” (geneSet
name) and “gene_symbol” (corresponding to the protein id). In
DEprot
we provide a ready-to-use list of protein complexes
obtained by the CORUM database.
Notice that the “gene_symbol” values must match the “prot.id” in the result tables. To this goal, the
geneset.enrichment
function contains the parametergsub.pattern.prot.id
that will be used to remove a pattern from the “prot.id”s.
In the following example we will show how to create a genset table from the CORUM data and use it for GSEA and ORA analyses.
7.3.2 Define geneSet
## GeneSet Enrichment Analyses
data("corum_v5.0", package = "DEprot")
corum_geneSet =
corum_v5.0 %>%
dplyr::filter(organism == "Human") %>%
dplyr::rename(gs_name = complex.name,
gene_symbol = protein.members) %>%
dplyr::select(gs_name, gene_symbol)
corum_geneSet
gs_name | gene_symbol |
---|---|
BCL6-HDAC4 complex | BCL6 |
BCL6-HDAC4 complex | HDAC4 |
BCL6-HDAC5 complex | BCL6 |
BCL6-HDAC5 complex | HDAC5 |
BCL6-HDAC7 complex | BCL6 |
BCL6-HDAC7 complex | HDAC7 |
Multisubunit ACTR coactivator complex | CREBBP |
Multisubunit ACTR coactivator complex | EP300 |
Multisubunit ACTR coactivator complex | KAT2B |
Multisubunit ACTR coactivator complex | NCOA3 |
7.3.3 Perform enrichment analyses
### GSEA
GSEA.results =
geneset.enrichment(DEprot.analyses.object = dpo_analyses,
contrast = 1,
TERM2GENE = corum_geneSet,
enrichment.type = "GSEA",
gsub.pattern.prot.id = "_HUMAN|;.*",
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
dotplot.n = 10)
### OverRepresentation Analyses (ORA)
ORA.results =
geneset.enrichment(DEprot.analyses.object = dpo_analyses,
contrast = 1,
TERM2GENE = corum_geneSet,
enrichment.type = "GSEA",
gsub.pattern.prot.id = "_HUMAN|;.*",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
pAdjustMethod = "BH",
diff.status.category = "6h.10nM.E2",
dotplot.n = 10)
The output is an object of class
DEprot.enrichResult
:
- enrichment.discovery: the direct output from
clusterProfiler::GSEA/enricher
(GSEA and ORA, respectively); - protein.network: a string plot showing protein networks
(
clusterProfiler::cnetplot
); - pathway.network: a list with clusters and string plot
showing pathway/set networks
(
aPEAR::enrichmentNetwork
); - NES.plot (GSEA only): a bar plot showing the NES scores for each significantly enriched geneSet;
- dotplot_gene.ratio: a dotplot showing the geneRatios for each significantly enriched geneSet;
- dotplot_fold.enrichment (ORA only): a dotplot showing the foldEnrichment for each significantly enriched geneSet,
- parameters: a list containing the parameters used to run the analyses.
8 Package information
8.1 Documentation
With the package a detailed PDF manual with details for each function and respective parameters is available.
The R-package have been published on GitHub and a git-pages website is available as well. At both these sites it is possible to find the installation procedure, required dependencies, and the links for changeLog, manual and vignette.
8.2 Package history and releases
A list of all releases and respective description of changes applied could be found here.
8.3 Contact
For any suggestion, bug fixing, commentary please contact fill an Issues/Pull
requests form on the DEprot
github page.