Skip to content

TwistMethNext integrates various tools and custom scripts to provide a comprehensive analysis workflow for Twist NGS Methylation data.

Step Tool/Software Description
Generate Reference Genome Bismark Generate reference genome index files for bismark.
Raw Data QC FastQC Performs quality control checks on raw sequence data.
Adapter Trimming Trim Galore Trims adapters and low-quality bases from the reads.
Align Reads Bismark (bowtie2) Aligns bisulfite-treated reads to a reference genome.
Deduplicate removal Bismark Removes PCR duplicates from the aligned reads.
Sort and Indexing Samtools Sorts the aligned and deduplicated BAM files.
Extract Methylation Calls Bismark Extracts methylation calls from the aligned reads.
Summary Report Bismark Generates a summary report of the Bismark alignment and
methylation extraction.
Alignment QC Qualimap Generates quality control metrics for the aligned reads.
QC Reporting MultiQC Aggregates quality control reports from various steps
into a single report.
Differential Methylation EdgeR,
MethylKit
Performs differential methylation analysis using EdgeR
and MethylKit package.
Post Processing ggplot2 Generates summary statistics and visualizations of the
differential methylation results.
GO Analysis Gene Ontology Generates GOChord diagram from gene ontology analysis results.

Feature Details

  1. READ_PROCESSING: Checks the quality of raw sequencing data and trims low-quality bases and adapters to improve downstream analysis.
  2. BISMARK_ANALYSIS: Aligns bisulfite-converted reads to a reference genome, identifies and removes PCR duplicates, sorts and indexes the aligned reads, performs quality control on alignments, and extracts methylation information from the aligned reads.
  3. QC_REPORTING: Compiles quality control metrics from various steps into a comprehensive report for easy interpretation.
  4. DIFFERENTIAL_METHYLATION:
  5. EDGER_ANALYSIS:
    • Takes coverage files, a design file, and comparison information as input.
    • Performs differential methylation analysis using the EdgeR Bioconductor package.
    • Outputs CSV files with differential methylation results for each group comparison. METHYLKIT_ANALYSIS:
    • Takes coverage files, a design file, and comparison information as input.
    • Performs differential methylation analysis using the EdgeR Bioconductor package.
    • Outputs CSV files with differential methylation results for each group comparison.
  6. POST_PROCESSING:
  7. Reads the EdgeR results and generates:
    • Summary statistics (total DMRs, hyper/hypomethylated regions, significant DMRs)
    • Volcano plot (visualizing fold change vs. significance)
    • MA plot (visualizing intensity vs. fold change)
  8. Functional Analsysis

  9. Reads the EdgeR/MethylKit results and generates -

  10. Top n corresponding genes from the EdgeR/MethylKit results are picked up to generate the gene ontology results.
  11. generates a CSV file with the GO classification results (only Biological Processes).
  12. generates a Chord diagram for top 10 results from the GO analysis.

Read processing

Read processing subworkflow includes -

  • FASTQC - for Quality check of samples
  • TRIM Galore - adapter trimming

FastQC

FASTQC is a widely used tool for assessing the quality of raw and processed sequencing data. It provides a comprehensive quality check, including metrics like per-base quality scores, GC content, and adapter contamination.

fastqc $args --threads $task.cpus $reads

General Options

  • <file{R1,R2}.fastq>: Input FASTQ files (gzip-compressed files, e.g., file1.fastq.gz, are also supported).
  • -o <output_directory>: Specify the directory where reports will be saved. Defaults to the current directory if omitted.
  • -t <number_of_threads>: Specify the number of threads for parallel processing.
FASTQC Results
  1. HTML Report: Visual summary of the quality metrics.
  2. ZIP File: Contains the raw data used to generate the report.

Key metrics in the HTML report:

  • Per Base Sequence Quality:
  • Boxplots showing quality scores across all positions in reads.
  • Green indicates high-quality bases (>Q30).
  • Per Sequence Quality Scores:
  • Overall quality of reads in the file.
  • Per Base GC Content:
  • GC content distribution across the length of reads.
  • Adapter Content:
  • Detects overrepresented adapter sequences.
  • Overrepresented Sequences:
  • Identifies frequently occurring sequences (e.g., adapters or contaminants).

Trim Galore

Trim Galore is a versatile tool for trimming sequencing reads and removing adapter sequences. It’s particularly useful for preparing raw sequencing data for downstream applications like alignment or differential expression/methylation analysis. Trim Galore combines the functionalities of Cutadapt and FastQC for quality control and trimming.

trim_galore --paired --cores $task.cpus $args $reads

General Options

  • -q <quality>: Trim low-quality bases from the ends of reads. Default is 20.
  • --length <min_length>: Discard reads shorter than the specified length after trimming.
  • --adapter <sequence>: Specify a custom adapter sequence. By default, Trim Galore auto-detects adapters.
  • --gzip: Compress the output files into .gz format.
  • --fastqc: Run FastQC before and after trimming.
  • --cores <number>: Use multiple cores for faster processing.

Bismark Analysis

Reference Genome Preparation

Bismark needs to prepare the bisulfite index for the genome.

  • In the current pipeline, user can provide the genome.fasta and the pipeline can index it.
  • Optinally, user can provide the index files directly, and the pipeline will use it without indexing the genome again.

General Options

  • --verbose: Prints detailed output during the indexing process.
  • --parallel <threads>: Uses multiple threads to speed up genome preparation.
  • --bowtie2: Specifies that Bowtie2 will be used for alignment (default option in most versions).
  • --path_to_bowtie <path>: Specifies the path to the Bowtie installation if not in your PATH.

Example with Options:

bismark_genome_preparation --bowtie2 --parallel 4 <genome.fasta>
Reference Genome Preparation

After successful completion, Bismark generates a bisulfite-converted genome in two orientations (C->T and G->A) along with the Bowtie/Bowtie2 indices.

Output directory structure:

your_path_to_reference/
    Bisulfite_Genome/
        CT_conversion/
        GA_conversion/
        genome.1.bt2
        genome.2.bt2
        ...

Bismark Alignment

This step aligns bisulfite-treated sequencing reads to a reference genome.

bismark --genome <path_to_reference_genome> -1 <reads_R1.fq> -2 <reads_R2.fq> -o <output_directory>

General Options * --genome: Path to the reference genome directory preprocessed with bismark_genome_preparation. * -1 and -2: Specify paired-end reads. Use -U for single-end reads. * -o: Output directory for alignment files.

Bismark Alignment
  • Produces .bam alignment files.
  • Produces .report.txt and
  • unmapped_reads.fq.gz file.

Bismark Deduplication

This step removes duplicate reads to avoid overestimating methylation levels.

deduplicate_bismark ${paired_end} $args --bam $bam

General options: * args options use all arguments from bismark deduplicate command. * --paired: Use this for paired-end data. Remove for single-end reads. * --bam: Specifies the input BAM file, generated from bismark alignment

Bismark Deduplicate Removal
  • Generates a deduplicated .bam file.
  • Produces deduplicated_report.txt file.

Bismark Methylation Extractor

Extract methylation data from deduplicated BAM files.

bismark_methylation_extractor \
    --bedGraph --gzip \
    -o <output_directory> <deduplicated.bam>

General options: * --bedGraph: Generates bedGraph file * --gzip: Compresses the output files.

Bismark Methylation Extractor
  • Generates .bismark.cov.gz files and methylation call data in CpG, CHG, and CHH contexts.
  • Generates bedGraph.gz file.
  • Also generates splitting_report.txt file.

Bismark Report

Generate a summary report of alignment and methylation statistics.

Command:

bismark2report
Bismark Report

Produces an HTML file summarizing: * Alignment efficiency. * Duplicate rates. * Methylation levels (CpG, CHG, CHH contexts).

Alignment Quality Mapping

The main module for assessing alignment quality is qualimap bamqc.

qualimap bamqc \
    -bam <input.bam> \
    -outdir <output_directory> \
    -outformat <html> \
    --java-mem-size

General Options: * -bam <input.bam>: Path to the aligned BAM file (e.g., deduplicated BAM file). * -outdir <output_directory>: Directory for output reports. * -outformat <pdf|html>: Choose the output format for the report.

Alignment Quality Check

The output includes:

  1. General Alignment Statistics:
  2. Total number of reads.
  3. Percentage of mapped reads.
  4. Percentage of properly paired reads (for paired-end data).
  5. Coverage Statistics:
  6. Mean coverage depth.
  7. Percentage of the genome covered at varying depths (e.g., 1x, 5x, 10x).
  8. Insert Size Distribution (for paired-end reads):
  9. Provides insights into library preparation and sequencing quality.
  10. GC Content Distribution:
  11. Checks for bias in GC content distribution compared to expected values.
  12. Read Quality Metrics:
  13. Distribution of mapping quality scores.

QC Reporting

MultiQC is used for the QC reporting combining all results from the FastQC, Trim galore, Bismark Alignment, Bismark Deduplication, Bismark summary report, and Qualimap results.

Output:

  • Generates an interactive HTML report (multiqc_report.html) and a data file (multiqc_data.json).
  • Output includes summary statistics, plots, and tool-specific metrics.

Differential Methylation Analysis

To calculate the differential methylation from the input samples, two different methods can be used -

EdgeR

edgeR is a Bioconductor package primarily used for RNA-seq differential expression analysis but can also handle differential methylation analysis when paired with bisulfite sequencing data. This requires pre-processed methylation data, such as counts of methylated (M) and unmethylated (U) reads at each cytosine position or region of interest.

Rscript $baseDir/bin/edgeR_analysis.R \
        --coverage_files '${coverage_files}' \
        --design "${design_file}" \
        --compare "${compare_str}" \
        --output . \
        --threshold ${coverage_threshold}

General options

  • --coverage_files: selected from the bismark_methylation_extractor files.
  • --design: selected from the Sample_sheet.csv
  • --compare: selected from the Sample_sheet.csv .
EdgeR Results
  • Generates EdgeR_group_<compare_str>.csv.

MethylKit

MethylKit is an R package designed for analyzing bisulfite sequencing data, particularly for differential methylation analysis. It supports genome-wide methylation data and is ideal for CpG, CHH, and CHG methylation studies.

Rscript $baseDir/bin/run_methylkit.R \
        --coverage_files '${coverage_files}' \
        --design "${design_file}" \
        --compare "${compare_str}" \
        --output . \
        --threshold ${coverage_threshold}

General options

  • --coverage_files: selected from the bismark_methylation_extractor files.
  • --design: selected from the Sample_sheet.csv
  • --compare: selected from the Sample_sheet.csv .
MethylKit Results
  • Generates Methylkit_group_<compare_str>.csv

Post-processing

Generates A) Volcano Plot, B) MA Plot and C) Summary Statistics from the Diffrential Methylation results.

Figures

Volcano Plot Volcano Plot

MA plot MA plot

Gene Ontology Analysis

The pipeline has also a module to perform the Gene Ontology analysis from the top n corresponding genes from the differential methylation results (EdgeR/MethylKit) using the clusterProfiler package.

The results generates a full table with all Biological Processes and a Chord diagram with top 10 functions identified in the analysis.

GOchord Diagram

GOchord Diagram

© Jyotirmoy Das 2025
3D1F C87F 8D52 BFD4 5A1D
3C86 63F2 2E14 6A0A 6B98
Made with MkDocs