From sequencing data to differential methylation analysis
Bioinformatics Unit, Core Facility, Faculty of Medicine and Health Sciences, Linköping University
Clinical Genomics Linköping, Linköping University
A bioinformatics pipeline typically depends on the availability of several resources, including adequate storage, computer units, network connectivity, and appropriate software execution environment.
Automation helps manage bioinformatics resources and workflows and streamlines day-to-day bioinformatics operations.
Containers are a standard unit of software that enables the packaging of software and its dependencies to be run on different computers and operating systems with virtually no configuration changes.
Version control of the pipeline should include semantic versioning of the deployed instance of a pipeline as a whole.
The most critical requirement for implementing a bioinformatics pipeline is a proper, systematic clinical validation in the context of the entire next-generation sequencing (NGS) assay.
Curated from Gong T el al (2022)
Optional steps:
- Downsampling reads (seqtk)
- Collect methylation stats
Important information
1. same pipeline can be used for Twist custom methylation panel
2. genome reference file should match the panel’s genome build
3. replace the Methylome V1 probe and target BED file.
A. MethylKit
%%{init: {"flowchart": {"htmlLabels": false}} }%% flowchart TB id1["Reading methylation calls from Bismark alignments"] id2["filtering on read coverage"] id1 --> id2 id2 --> id3[merging samples] id3 --> id4[[calculate DiffMeth]] id4 --> id5(Annotation)
B. EdgeR
%%{init: {"flowchart": {"htmlLabels": false}} }%% flowchart TB id1["Reading coverage files from Bismark results"] id2["counts, samples, chromosome location"] id3["filtering chromosomes, finding nearest TSS, filter by coverage"] id1 --> id2 id2 --> id3 id3 --> id6[[Calculate M-value]] id6 --> id7[[Calculate diff meth]]
adopted from Gosdschan, Bioconductor Europe Meeting, 2018 (Akalin et al. 2012)
methylation statistics per base
summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 20.00 82.79 63.17 94.74 100.00
percentiles:
0% 10% 20% 30% 40% 50% 60% 70%
0.00000 0.00000 0.00000 48.38710 70.00000 82.78556 90.00000 93.33333
80% 90% 95% 99% 99.5% 99.9% 100%
96.42857 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000
read coverage statistics per base
summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.00 16.00 26.00 34.45 39.00 630.00
percentiles:
0% 10% 20% 30% 40% 50% 60% 70%
10.000 11.000 14.000 17.000 20.000 26.000 30.000 36.000
80% 90% 95% 99% 99.5% 99.9% 100%
42.000 60.000 78.750 195.800 328.300 441.945 630.000
adopted from Gosdschan, Bioconductor Europe Meeting, 2018 (Akalin et al. 2012)
get the bases covered in all samples: merge all samples to one object for base-pair locations that are covered in all samples:
adopted from Gosdschan, Bioconductor Europe Meeting, 2018 (Akalin et al. 2012)
Testing for differential methylation using either Fisher’s exact test or Chisq test for logistic regression model (depending on the sample size per set) with p-value adjustment using SLIM 1 method (Wang, Tuominen, and Tsai 2010) :
Optional correction for overdispersion if more variability present in the data than assumed by binomial distribution:
Covariates can be included in the analysis to separate the influence of the covariates from the treatment effect via the logistic regression model.
adopted from Gosdschan, Bioconductor Europe Meeting, 2018 (Akalin et al. 2012)
Use genomation package to annotate differentially methylated regions/bases based on gene annotation:
first read the gene BED file:
then get all differentially methylated bases:
now annotate differentially methylated CpGs with promoter/exon/intron using annotation data
finally visualize the annotation:
adopted from Gosdschan, Bioconductor Europe Meeting, 2018 (Akalin et al. 2012)
The EdgeR function readBismark2DGE reads all the files and collates the counts for all the sample into one data object:
> Sample <- row.names(targets)
> files <- paste0(Sample,".bismark.cov.gz")
> yall <- readBismark2DGE(files, sample.names=Sample)
Reading P6_1.bismark.cov.gz
Reading P6_4.bismark.cov.gz
Reading P7_2.bismark.cov.gz
Reading P7_5.bismark.cov.gz
Reading P8_3.bismark.cov.gz
Reading P8_6.bismark.cov.gz
Hashing ...
Collating counts ...
> dim(yall)
[1] 3538117 12
> head(yall$counts)
P6_1-Me P6_1-Un P6_4-Me P6_4-Un P7_2-Me P7_2-Un P7_5-Me P7_5-Un
chr15-3051454 14 2 18 0 6 0 15 1
chr15-3051455 1 0 1 0 0 0 0 0
chr15-3051486 16 0 17 1 6 0 16 0
chr15-3051555 13 3 18 0 5 1 16 0
chr15-3051556 1 0 1 0 0 0 0 0
chr15-3051559 14 2 15 3 5 1 15 1
P8_3-Me P8_3-Un P8_6-Me P8_6-Un
chr15-3051454 8 0 3 0
chr15-3051455 0 0 1 0
chr15-3051486 8 0 3 0
chr15-3051555 8 0 3 0
chr15-3051556 0 0 1 0
chr15-3051559 8 0 3 0
“DGEList”
.data.frame
.yall$samples
Filtering unassembled chromosomes
Sort into genomic order
Gene Annotation
Filtering to remove low counts
> TSS <- nearestTSS(yall$genes$Chr, yall$genes$Locus, species="Mm")
> yall$genes$EntrezID <- TSS$gene_id
> yall$genes$Symbol <- TSS$symbol
> yall$genes$Strand <- TSS$strand
> yall$genes$Distance <- TSS$distance
> yall$genes$Width <- TSS$width
> head(yall$genes)
Chr Locus EntrezID Symbol Strand Distance Width
chr1-3020689 chr1 3020689 497097 Xkr4 - -650809 457017
chr1-3020690 chr1 3020690 497097 Xkr4 - -650808 457017
chr1-3020708 chr1 3020708 497097 Xkr4 - -650790 457017
chr1-3020724 chr1 3020724 497097 Xkr4 - -650774 457017
chr1-3020725 chr1 3020725 497097 Xkr4 - -650773 457017
chr1-3020814 chr1 3020814 497097 Xkr4 - -650684 457017
> Coverage <- yall$counts[, Methylation=="Me"] + yall$counts[, Methylation=="Un"]
> head(Coverage)
P6_1-Me P6_4-Me P7_2-Me P7_5-Me P8_3-Me P8_6-Me
chr1-3020689 16 21 4 13 6 2
chr1-3020690 37 47 18 46 15 22
chr1-3020708 0 2 0 1 0 0
chr1-3020724 16 21 4 14 6 2
chr1-3020725 37 47 18 46 15 22
chr1-3020814 0 0 0 1 0 0
> TotalLibSize <- y$samples$lib.size[Methylation=="Me"] +
+ y$samples$lib.size[Methylation=="Un"]
> y$samples$lib.size <- rep(TotalLibSize, each=2)
> y$samples
group lib.size norm.factors
P6_1-Me P6 27754786 1
P6_1-Un P6 27754786 1
P6_4-Me P7 41007143 1
P6_4-Un P7 41007143 1
P7_2-Me P8 21424225 1
P7_2-Un P8 21424225 1
P7_5-Me P6 26197143 1
P7_5-Un P6 26197143 1
P8_3-Me P7 18782105 1
P8_3-Un P7 18782105 1
P8_6-Me P8 15539660 1
P8_6-Un P8 15539660 1
Exploring differences between samples
In microarray methylation studies, a common measure of methylation level is the M-value, which is defined as \[
M = log_{2} \frac{(Me + α)}{(Un + α)}
\] where Me and Un are the methylated and unmethylated intensities and
α is some suitable offset to avoid taking logarithms of zero.
M contains the empirical logit methylation level for each CpG site in each sample.
Design Matrix
In EdgeR, this can be done by fitting linear models under a specified design matrix and testing for corresponding coefficients or contrasts.
Dispersion estimation
Similar to the RNA-seq data, the variability between biological replicates has also been observed in bisulfite sequencing data. This variability can be captured by the NB dispersion parameter under the generalized linear model (GLM) framework in EdgeR.
differentially methylated CpG loci Test is performed using likelihood ratio tests (LRT) in EdgeR.
> topTags(lrt)
Coefficient: 1*P6 -0.5*P7 -0.5*P8
Chr Locus EntrezID Symbol Strand Distance Width logFC
chr16-76326604 chr16 76326604 268903 Nrip1 - -46445 85647 -8.87
chr13-45709467 chr13 45709467 66355 Gmpr + -202023 38943 7.60
chr10-40387375 chr10 40387375 78334 Cdk19 + -38067 134511 -8.13
chr11-100144651 chr11 100144651 16669 Krt19 - 952 2889 -8.19
chr17-46572098 chr17 46572098 20807 Srf - 15936 9324 9.10
chr13-45709489 chr13 45709489 66355 Gmpr + -202045 38943 7.47
chr3-54724012 chr3 54724012 69639 Exosc8 - -11352 6686 -8.40
chr13-45709480 chr13 45709480 66355 Gmpr + -202036 38943 7.70
chr8-120068504 chr8 120068504 102193 Zdhhc7 - -32968 20378 7.42
chr2-69631013 chr2 69631013 72569 Bbs5 + 16242 20315 -7.30
logCPM LR PValue FDR
chr16-76326604 1.52 321 8.35e-72 4.27e-66
chr13-45709467 1.98 319 2.62e-71 6.70e-66
chr10-40387375 1.77 312 6.46e-70 1.10e-64
chr11-100144651 1.58 306 1.40e-68 1.79e-63
chr17-46572098 1.61 302 1.44e-67 1.48e-62
chr13-45709489 1.97 294 5.46e-66 4.66e-61
chr3-54724012 1.31 292 1.56e-65 1.14e-60
chr13-45709480 1.97 292 2.01e-65 1.28e-60
chr8-120068504 1.73 285 7.12e-64 4.05e-59
chr2-69631013 2.03 277 3.66e-62 1.88e-57
email me: jyotirmoy.das@liu.se
jyotirmoy@z6g4:r-nf$ nextflow run main.nf -profile docker \
--formatted_designfile ./data/Sample_sheet.csv \
--compare_str data/comparefile.txt \
--cov_dir data/ \
--bam_dir /mnt/SD3/CFFMHS-KW-/pipeline_output/output/bismark/deduplicated/sorted_bam/ \
-resume
N E X T F L O W ~ version 22.04.5
Launching `main.nf` [wise_koch] DSL2 - revision: 405acae70a
DNAm-NF PIPELINE
===================================
reads : ./data/Sample_sheet.csv
comparefile : data/comparefile.txt
coverage_dir : data/
bam_dir : /mnt/SD3/CFFMHS-KW-/pipeline_output/output/bismark/deduplicated/sorted_bam/
reads_count1 : false
reads_count2 : false
outdir : /mnt/SD2/Jyotirmoys/JD/Scripts/MyScripts/JDCo/DNAm/r-nf/results
executor > local (1)
[a4/26a463] process > EdgeR (comparefile.txt) [100%] 1 of 1, cached: 1 ✔
[16/f9be4c] process > MethylKit (comparefile.txt) [100%] 1 of 1 ✔
Completed at: 02-Feb-2024 18:17:32
Duration : 8h 32m 2s
CPU hours : 184.1 (7.3% cached)
Succeeded : 1
Cached : 1
jyotirmoy@z6g4:r-nf$
jyotirmoy@z6g4:test-nf$ nextflow run main.nf \
-profile singularity \
--illumina \
-resume
N E X T F L O W ~ version 22.04.5
Launching `main.nf` [maniac_swirles] DSL2 - revision: cbb43dc325
DNAm-NF PIPELINE
===================================
reads : /mnt/SD2/Jyotirmoys/JD/Scripts/MyScripts/JDCo/DNAm/test-nf/data/*{1,2}.fastq.gz
outdir : ./2024-01-29_results
WARN: Access to undefined parameter `nanopore` -- Initialise it to a default value eg. `params.nanopore = some_value`
[- ] process > illuminaAnalysis:illuminaQC:fastqc -
[- ] process > illuminaAnalysis:illuminaQC:multiqc -
[9e/741818] process > illuminaAnalysis:illuminaQC:fastqc (fastqqc on KN1_S9) [100%] 1 of 1, cached: 1
[- ] process > illuminaAnalysis:illuminaQC:multiqc -
[9e/741818] process > illuminaAnalysis:illuminaQC:fastqc (fastqqc on KN1_S9) [100%] 1 of 1, cached: 1 ✔
[97/31b39e] process > illuminaAnalysis:illuminaQC:multiqc [100%] 1 of 1, cached: 1 ✔
[9e/741818] process > illuminaAnalysis:illuminaQC:fastqc (fastqqc on KN1_S9) [100%] 1 of 1, cached: 1 ✔
[97/31b39e] process > illuminaAnalysis:illuminaQC:multiqc [100%] 1 of 1, cached: 1 ✔
[84/b16f89] process > illuminaAnalysis:illuminaMethylation:TRIMGALORE (trimgalore on KN1_S9) [100%] 1 of 1, cached: 1 ✔
[d4/044f58] process > illuminaAnalysis:illuminaMethylation:BISMARK_GENOMEPREPARATION (bismark_gp on null) [100%] 1 of 1, cached: 1 ✔
[4f/ee231d] process > illuminaAnalysis:illuminaMethylation:BISMARK_ALIGNMENT (bismark_alignment on null) [100%] 1 of 1, cached: 1 ✔
[65/01972a] process > illuminaAnalysis:illuminaMethylation:BISMARK_METHYLATIONEXTRACTOR (bismark_methextract on null) [100%] 1 of 1, cached: 1 ✔
[- ] process > illuminaAnalysis:illuminaMethylation:BISMARK_REPORT -
Completed at: 29-Jan-2024 12:33:30
Duration : 42m 26s
CPU hours : 161.1 (100% cached)
Succeeded : 0
Cached : 6
jyotirmoy@z6g4:test-nf$
Copyright © Jyotirmoy Das