Znalezienie regionów o zróżniocowanym stopniu metylacji (ang. Differentially methylated regions).
Współpraca z prof. Płoskim
(Zakład Genetyki Medycznej WUM).
W badaniu opublikowanym w 2013 roku dokonano porównania stopnia metylacji u kobiet z osteoporozą i kobiet z chorobą zwyrodnieniową stawów. Odkryto około 250 regionów o zróżnicowanym stopniu metylacji. Były to głównie regiony kodujące czynniki transkrypcyjne, które biorą udział w różnicowaniu komórek i tworzenie szkieletu, co sugeruje, że stopień metylacji może istotny przy tworzeniu tych zaburzeń.
Dwie próbki pochodzące od jednej pacjentki
EK - komórka nie poddana chorobie - ponad 6mln rekordów,
EU - komórka poddana chorobie - 11 mln rekordów.
CHROM | POS | REF | ALT | DP | AF | meth | unmeth |
---|---|---|---|---|---|---|---|
chrM | 106 | G | C | 9 | 1 | 0 | 9 |
chr1 | 2198933 | C | G | 23 | 0.3478 | 15 | 8 |
chrY | 9269317 | C | G | 1 | 1 | 0 | 1 |
chr | start | end | coverage.EK | met.EK | coverage.EU | met.EU |
---|---|---|---|---|---|---|
chr1 | 15513 | 15513 | 23 | 1.00 | 22 | 1.00 |
chrY | 59033091 | 59033091 | 99 | 0.67 | 166 | 0.77 |
chr20 | 3139367 | 3139367 | 28 | 0.75 | 26 | 0.76 |
Wczytywanie danych: (muszą być specjalniej postaci! )
file.list <- list("Sample_40092EK.tab",
"Sample_40092EU.tab")
myobj <- methRead(file.list, sample.id = list("EK", "EU"),
assembly = "hg18", treatment = c(1, 0),
context = "CpG")
Łączenie próbek:
meth <- unite(myobj)
Podział danych na regiony. Każda grupa \((chr, tiles)\) spełnia warunki: \[chromosom = chr, start >= tiles * 1000, end <= (tiles + 1)*1000\]
Dla każdej grupy wykonuję test Wilcoxona dla par obserwacji:
Przykładowe wyniki:
How can I get the chromosomal location of a list of genes? -> bioMart
library(biomaRt)
ensembl <- useMart("ensembl",
dataset = "hsapiens_gene_ensembl")
# head(listAttributes(ensembl),50)
attributes <- c('chromosome_name', 'start_position', 'end_position',
'gene_biotype', 'hgnc_id', 'hgnc_symbol')
# head(listFilters(ensembl),20)
genes <- getBM(attributes = attributes, mart = ensembl,
filter = c('chromosome_name'), value = c(1:22,"X", "Y", "M"))
genes_hgnc <- dplyr::filter(genes, hgnc_symbol != "")
chromosome_name | start_position | end_position | gene_biotype | hgnc_id | hgnc_symbol |
---|---|---|---|---|---|
5 | 85762559 | 85763138 | processed_pseudogene | HGNC:36410 | RPS2P25 |
2 | 222503883 | 222504843 | processed_pseudogene | HGNC:38555 | GAPDHP49 |
8 | 81807772 | 81808726 | processed_pseudogene | HGNC:48765 | HNRNPA1P36 |
9 | 96928310 | 96940253 | protein_coding | HGNC:23449 | NUTM2G |
13 | 22696023 | 22696574 | processed_pseudogene | HGNC:3994 | FTH1P7 |
8 | 122769274 | 122770150 | processed_pseudogene | HGNC:21261 | CDK5P1 |
how to find interval overlaps in r -> Finding overlap in ranges with R
overlaps_find_1 <- findOverlaps(
as(analysis1[[1]], "GRanges"), # myobj[[1]]
as(genes_hgnc, "GRanges"),
type = "within")
cbind(
genes_hgnc[subjectHits(overlaps_find_1), ],
as.data.frame(
as(analysis1[[1]], "GRanges")
)[queryHits(overlaps_find_1), c(2,6:8)] %>%
dplyr::rename(base = start) %>%
mutate(group = "EK")
) -> EK
An Introduction to the GenomicRanges Package
seqnames | start | end | gene_biotype | hgnc_id | hgnc_symbol | base | coverage | numCs | numTs | group |
---|---|---|---|---|---|---|---|---|---|---|
chr1 | 825138 | 859446 | processed_transcript | HGNC:49377 | LINC01128 | 825448 | 36 | 20 | 16 | EK |
chr1 | 825138 | 859446 | processed_transcript | HGNC:49377 | LINC01128 | 825449 | 13 | 0 | 13 | EK |
chr1 | 825138 | 859446 | processed_transcript | HGNC:49377 | LINC01128 | 825626 | 35 | 31 | 4 | EK |
chr1 | 825138 | 859446 | processed_transcript | HGNC:49377 | LINC01128 | 825627 | 13 | 9 | 4 | EK |
chr1 | 825138 | 859446 | processed_transcript | HGNC:49377 | LINC01128 | 825684 | 36 | 28 | 8 | EK |
chr1 | 825138 | 859446 | processed_transcript | HGNC:49377 | LINC01128 | 825685 | 13 | 11 | 2 | EK |
EK_EU %>% # dplyr::bind_rows(EK, EU)
filter(hgnc_symbol %in% genes_names) %>% # > 100 odczytow w EK i EU
group_by(hgnc_symbol) %>%
do(mod = glm(cbind(numCs, numTs) ~ group+as.factor(base), data =.,
family = "binomial")) -> models
models %>% do(data.frame(
var = names(coef(.$mod)),
coef(summary(.$mod)))) %>%
filter(var == "groupEU") -> res_do
# + troche brudnego przerabiania
Estimate | pval | hgnc_symbol | chromosome_name | start_position | end_position | gene_biotype | hgnc_id | pval.adj | EK_reads_in_gene | EU_reads_in_gene |
---|---|---|---|---|---|---|---|---|---|---|
0.8717109 | 0 | CYSTM1 | 5 | 140174642 | 140282052 | protein_coding | HGNC:30239 | 0 | 752 | 952 |
1.7999432 | 0 | PIK3R5 | 17 | 8878911 | 8965712 | protein_coding | HGNC:30035 | 0 | 253 | 281 |
1.2661603 | 0 | LINC01121 | 2 | 45164870 | 45323295 | processed_transcript | HGNC:49266 | 0 | 513 | 637 |
2.0360239 | 0 | ZNF385C | 17 | 42025576 | 42098479 | protein_coding | HGNC:33722 | 0 | 260 | 293 |
-0.4512390 | 0 | PRDM16 | 1 | 3069168 | 3438621 | protein_coding | HGNC:14000 | 0 | 2766 | 3162 |
2.1870519 | 0 | CYP2F1 | 19 | 41114432 | 41128366 | protein_coding | HGNC:2632 | 0 | 118 | 143 |