QuickStart.Rmd
The following guide presents the effect of the MLExpResso package on the analysis of changes in expression and methylation of the human genome.
Scientists believe that the result of increased methylation is decreased gene expression. Recent studies suggest that the relationship between methylation and expression is more complex than we previously thought.
MLExpResso
is an R package for integrative analyses and visualization of gene expression and DNA methylation data.
Key functions of this package are:
Identification of genes with affected expression,
Identification of DMR - differentially methylated regions,
Identification of regions with changes in expression and methylation,
Visualization of identified regions.
For both, methylation and expression data, we conduct some statistical tests and present graphically received results. The joint modeling and visualization of genes expression and methylation improve interpretability of identified signals.
The methodology is supplemented with example applications to The Cancer Genome Atlas data.
In this vignette we will work with the data sets containing information about gene expression and methylation for patients with breast cancer. We will analyze differences in methylation and expression for patients with different subtypes of BRCA cancer. To run the examples below you should install MLExpRessoData
package (https://github.com/geneticsMiNIng/MLGenSigdata). Data sets in this R package are based on the Bioconductor package RTCGA
(https://bioconductor.org/packages/release/bioc/html/RTCGA.html).
library(MLExpResso)
library(MLExpRessoData)
BRCA_exp
Package MLExpRessoData
contains BRCA_exp
dataset. This set contains information about gene expression: read counts per-gene, computed for genes for 736 patients with breast cancer. Rows of this data set correspond to samples taken from patients. First column SUBTYPE
corresponds to a subtype of BRCA cancer, next columns correspond to genes.
BRCA_exp[1:5, 1:5]
## SUBTYPE AANAT AARSD1 AATF AATK
## TCGA-A1-A0SB-01A-11R-A144-07 Normal 9 2354 2870 317
## TCGA-A1-A0SD-01A-11R-A115-07 LumA 2 1846 5656 312
## TCGA-A1-A0SE-01A-11R-A084-07 LumA 11 3391 9522 736
## TCGA-A1-A0SF-01A-11R-A144-07 LumA 0 2169 4625 169
## TCGA-A1-A0SG-01A-11R-A144-07 LumA 1 2273 3473 92
Before we go to the testing, we need to define condition values for each sample. We would like to test for differences between LumA
subtype and other
subtypes of breast cancer, so we create a vector, which each element corresponds to a sample. Our division into this two groups relies on numbers of occurences of each subtype. The LumA
subtype is the most common, in case of breast cancer.
condition_exp <- ifelse(BRCA_exp$SUBTYPE == "LumA", "LumA", "other")
head(condition_exp, 8)
## [1] "other" "LumA" "LumA" "LumA" "LumA" "LumA" "other" "LumA"
Now we can visualize mean expression of each gene in a division of groups Luma and other. To do this we use the plot_means()
function.
This function requires parameters such as:
data
- data set with information from methylation or expression. In the example below, we use the BRCA_exp
data without the SUBTYPE
column.
condition
- a factor of levels corresponding to order of samples in data. In our example, we use the condition_exp
vector defined earlier.
names
- number of genes to be labeled. On a plot are marked genes with the biggest difference between the means.
plot_means(
data = BRCA_exp[,!(colnames(BRCA_exp) == "SUBTYPE")],
condition = condition_exp,
names = 5
)
In the MLExpResso
package we carry out the tests for identification of genes with affected expression. To do this we use the calculate_test()
function.
This function requires parameters such as:
data
- object of the class appropriate for the given test. In the example below, we use the BRCA_exp
data without the SUBTYPE
column.
condition
- the factor of levels corresponding to order of samples in data. In our example, we use the condition_exp
vector defined earlier.
test
- character defining test. Possible values of parameter test
are described in the documentation of this function.
Tests are based on the methods implemented in Bioconductor packages Deseq
, Deseq2
and edgeR
.
res_exp <- calculate_test(
data = BRCA_exp[,!(colnames(BRCA_exp) == "SUBTYPE")],
condition = condition_exp,
test = "lrt"
)
head(res_exp)
## id log2.fold pval mean_LumA mean_other mean
## 1 AURKB 2.339920 3.191000e-32 539.0426 2323.8868 1485.01
## 2 CBX2 2.895062 2.834335e-26 632.5106 4296.6038 2574.48
## 3 KPNA2 1.447288 8.551812e-24 11547.36 26427.38 19433.77
## 4 PRR11 3.822148 2.286874e-22 396.383 3479.981 2030.69
## 5 BIRC5 1.988998 1.953941e-21 1957.085 6658.358 4448.76
## 6 GSG2 1.405039 3.527773e-21 278.2128 629.3396 464.31
As a result we get a data frame with columns describing the log2 of the fold change, p-value and the mean for each gene.
We can visualize the result of calculate_test()
function by plot_volcano()
function. It draws a plot with p-values and fold logarithm from methylation or expression data.
This function requires parameters such as:
data
- data frame containing result of chosen test from calculate_test()
function. In the example below, we use the res_exp
data.frame calculated earlier.
line
- p-value on which we draw a horizontal line.
names
- p-value below which we want to draw genes names.
fold_line
- value on which we want to draw a vertical line on both sides of zero.
More parameters were described in documentation of a plot_volcano()
function.
plot_volcano(res_exp, line = 0.05, names = 0.000000001, fold_line = 2)
BRCA_met
data setIn this section, we will work with the methylation level data from TCGA database. Package MLExpRessoData
contains BRCA_met
dataset. This data set contains information about methylation of CpG probes for patients with breast cancer. Rows of this data set correspond to patients, more precisely, to samples taken from patients. First column SUBTYPE
corresponds to a subtype of BRCA cancer, next columns correspond to CpG probes. Values inside the table indicate the percentage methylation level of CpG probe for specified sample.
head(BRCA_met)[1:5, 1:4]
## SUBTYPE cg00021527 cg00031162 cg00032227
## TCGA-A1-A0SD-01A-11D-A112-05 LumA 0.03781858 0.7910348 0.006391233
## TCGA-A2-A04N-01A-11D-A112-05 LumA 0.01437552 0.7359370 0.008752293
## TCGA-A2-A04P-01A-31D-A032-05 Basal 0.01360124 0.6967802 0.009442039
## TCGA-A2-A04Q-01A-21D-A032-05 Basal 0.01525656 0.5341244 0.014674247
## TCGA-A2-A04T-01A-21D-A032-05 Basal 0.01167384 0.7378100 0.012251559
In this analysis we would like to find genes with different methylation. At first we need to use function aggregate_probes()
, which generates new data frame with CpG probes aggregated to genes. To this aggregation we use, by default, the Illumina Human Methylation data set from the TxDb.Hsapiens.UCSC.hg18.knownGene
Bioconductor package.
Function aggregate_probes()
requires a parameter data
- data frame containing methylation values for CpG probes.
BRCA_met_gen <- aggregate_probes(data = BRCA_met)
head(BRCA_met_gen)[1:5, 1:4]
## AANAT AARSD1 AATF AATK
## TCGA-A1-A0SD-01A-11D-A112-05 0.7148533 0.8625816 0.24294092 0.7835302
## TCGA-A2-A04N-01A-11D-A112-05 0.5850106 0.8355825 0.21367129 0.8466190
## TCGA-A2-A04P-01A-31D-A032-05 0.4495537 0.8786166 0.03277413 0.3417919
## TCGA-A2-A04Q-01A-21D-A032-05 0.7120650 0.8819490 0.03460160 0.7264985
## TCGA-A2-A04T-01A-21D-A032-05 0.6010397 0.7739978 0.02501599 0.6276399
In our example we will test for differential methylation between groups with LumA
breast cancer subtype and other subtypes of that cancer. Again we will use condition vector, which consist of two values corresponds to a subtype of breast cancer: LumA
and other
.
condition_met <- ifelse(BRCA_met$SUBTYPE == "LumA", "LumA", "other")
head(condition_met, 8)
## [1] "LumA" "LumA" "other" "other" "other" "other" "LumA" "other"
Before we go to the testing, we can visualize a chosen gene with marked CpG probes. Function plot_methylation_path
requires parameters such as:
data
- data frame containing values from methylation: columns corespond to CpG probes, rows to samples. In the example below, we use the BRCA_met
.
condition
- a vector of levels corresponding to order of samples in data.
gene
- name of chosen gene.
show.gene
- logical. If TRUE arrows corresopnding to gene will be drawn.
observ
- logical. If TRUE dots corresponding to CpG probes will be drawn.
plot_methylation_path(BRCA_met, condition = condition_met, gene = 'CACNA1G', show.gene = TRUE, observ = TRUE, islands = F)
## 'select()' returned 1:1 mapping between keys and columns
## 'select()' returned 1:many mapping between keys and columns
In the MLExpResso
package we carry out the tests for identification of differentially methylated regions. To do this we use the calculate_test()
function.
This function requires parameters such as:
data
- object of the class appropriate for the given test. In the example below, we use the BRCA_met_gen
methylation data aggregated to probes.
condition
- a factor of levels corresponding to order of samples in data. In our example, we use the condition_met
vector defined earlier.
test
- character defining test. Possible values of parameter test
are described in the documentation of this function.
Methylation tests are based on the methods implemented in packages limma
and MethyAnalysis
.
res_met <- calculate_test(
data = BRCA_met_gen,
condition = condition_met,
test = "ttest"
)
head(res_met)
## id log2.fold pval mean_LumA mean_other mean
## 1 ICAM2 -0.15151320 3.754116e-17 0.2547275 0.4062407 0.3330801
## 2 RILP -0.05073691 2.575168e-13 0.3079069 0.3586438 0.3341447
## 3 PIPOX 0.11505558 5.360053e-12 0.4242804 0.3092248 0.3647812
## 4 TNFSF12 -0.13412855 5.867083e-12 0.1791401 0.3132686 0.2485025
## 5 CD7 0.09822690 1.641919e-11 0.8635077 0.7652808 0.8127112
## 6 KSR1 0.19973400 2.054467e-11 0.658270 0.458536 0.5549808
As a result we get a data frame with columns describing the log2 of the fold change, p-value and the mean for each gene.
We can also create a comparison table with results of calculate_test()
function for methylation and expression data.
Function calculate_comparison_table()
requires parameters such as:
data1
, data2
- objects of the class appropriate for the given tests.
condition1
, condition2
- factors of levels coresponding to order of samples in data1
and data2
respectively.
test1
, test2
- characters defining tests coresponding to order of samples in data1
and data2
respectively. Possible values of parameter test
are described in calculate_test()
function documentation.
comparison <- calculate_comparison_table(
data1 = BRCA_exp[,!(colnames(BRCA_exp)=="SUBTYPE")],
data2 = BRCA_met_gen,
condition1 = condition_exp,
condition2 = condition_met,
test1 = "nbinom2",
test2 = "ttest"
)
head(comparison)
## id nbinom2.log2.fold nbinom2.pval ttest.log2.fold ttest.pval
## 59 AURKB 2.360714 1.704243e-37 0.0017389592 2.077252e-01
## 102 CBX2 2.905397 5.402147e-31 0.0584687549 1.214043e-06
## 327 KPNA2 1.466181 3.396674e-26 0.0012105971 7.505750e-01
## 277 GSG2 1.426569 3.325659e-25 -0.0018566938 2.411495e-01
## 66 BIRC5 2.004989 9.482155e-24 -0.0005444811 5.330216e-01
## 334 KRT16 4.333332 4.102956e-19 0.0486814033 1.606151e-05
## geom.mean.rank no.probes
## 59 1.881527e-19 2
## 102 8.098418e-19 2
## 327 1.596702e-13 1
## 277 2.831926e-13 2
## 66 2.248153e-12 1
## 334 2.567093e-12 2
As a result, we get a data frame with columns describing the log2 of the fold change, p-value and the mean for each gene for two tests. With this two test results, we compute the ranking of the most significant changed genes in terms of both methylation and expression. The created column contains the geometric mean of p-values for expression and methylation.
We can visualize results of calculate_comparison_table()
by plot_pvalues()
function. This function requires parameters such as:
data
- data.frame - result of calculate_comparison_table()
function.
names
- number of genes to be labeled. Gens are selected based on the ranking of the most significant changed genes in terms of both methylation and expression - geom.mean.rank column.
plot_pvalues(comparison, names = 10)
The great advantage of MLExpResso
package is the ability to perform a variety of visualizations for expression and methylation. All plots in our package are based on the ggplot2
package. We use also the scales
and ggrepel
packages for mathematical axes and repel overlapping text labels.
test_exp <- comparison[ , c("id", "nbinom2.log2.fold", "nbinom2.pval")]
test_met <- comparison[ , c("id", "ttest.log2.fold", "ttest.pval")]
For both, methylation and expression data, we can visualise the volcano plots for results of chosen tests and simple statistics for chosen gene.
Function plot_volcanoes()
requiers parameters such that:
data.m
, data.e
- data sets with information from methylation and expression respectively. In the example below, we use the BRCA_met
and BRCA_exp
data without the SUBTYPE
columns.
condition.m
, condition.e
- factors of levels coresponding to the order of samples in data.m
and data.e
respectively. In our example, we use the condition_met
and condition_exp
vectors defined earlier.
gene
- character defining the gene name. In the example, we visualize results for CACNA1G
gene
test.m
, test.e
- results from calculate_test()
function for methylation and expression data.
values
- if TRUE
p-values and log fold for the chosen gene will be added to the plot.
plot_volcanoes(
data.m = BRCA_met[,!(colnames(BRCA_met) == "SUBTYPE")],
data.e = BRCA_exp[,!(colnames(BRCA_exp) == "SUBTYPE")],
condition.m = condition_met,
condition.e = condition_exp,
gene = "CACNA1G",
test.m = test_met,
test.e = test_exp,
values=TRUE
)
Other function plot_gene()
allow us to visualize the methylation path
- placement of probes near the gene with a marked percentage of methylation for each probe in division into groups.
Function plot_gene()
requiers parameters such that:
data.m
, data.e
- data sets with informations from methylation and expression respectively. Note that plot_gene()
methylation requires data frame with CpG probes, not genes. In the example below, we use the BRCA_met
and BRCA_exp
.
condition.m
, condition.e
- factors of levels coresponding to order of samples in data.m
and data.e
respectively. In our example, we use the condition_met
and condition_exp
vectors defined earlier.
gene
- character defining the gene name. In the example, we visualize results for CACNA1G
gene.
… additional parameters. Below we use parameters show.gene
- If TRUE
line corresponding to the gene will be drawn, observ
- If TRUE
dots corresponding to CpG probes will be drawn and islands
-If TRUE
line corresponding to islands should be drawn.
plot_gene(
data.m = BRCA_met,
data.e = BRCA_exp,
condition.m = condition_met,
condition.e = condition_exp,
gene = "CACNA1G",
show.gene = TRUE,
observ = TRUE,
islands = TRUE
)
Using this function we also get boxplots containing values of expression in division from condition_exp
vector for chosen gene.