BaeLab/ABE-conversion-analysis

ABE Conversion Analysis

This program analyzes Adenine Base Editor (ABE) conversion efficiency from high-throughput sequencing data. It processes FASTQ files to identify and quantify A-to-G conversions at specific genomic target sites.

Overview

The ABE Conversion Analysis pipeline:

  • Reads FASTQ sequencing files and reference sequences
  • Automatically detects sense/antisense orientation
  • Aligns reads to reference sequences
  • Identifies A-to-G conversions within CRISPR spacer regions
  • Quantifies editing efficiency at each adenine position
  • Outputs results to Excel format with detailed conversion statistics

Requirements

Python Dependencies

  • Python 3.7+
  • BioPython (1.79+)
  • openpyxl
  • multiprocessing (built-in)

Install dependencies:

pip install biopython openpyxl

Input Files

1. Input Configuration File

Default file: ABE_off_target_input_V106W.txt

The input file contains target information in a 3-line repeating format:

[Region name]
[Wild-type sequence]
[CRISPR spacer sequence]

Example:

chr1_55040035.dna
cttgcgttccgaggaggacggcctggccgaagcacccgagcacggaaccacagccaccttccaccgctgcgccaaggtgcgggtgtagggatgggaggccggggcgaacccgcagccgggacggtgcggtgctgtttcctctcgggcctcagtttccccccatgtaagagaggaagtggagtgcaggtcgccgagggc
cccgcaccttggcgcagcgg
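
As a minimal sketch of how the 3-line repeating format could be read, the snippet below groups non-empty lines into records. The names `Target` and `parse_targets` are hypothetical and are not taken from the script; skipping blank lines and upper-casing the sequences are assumptions made for this illustration.

```python
from typing import List, NamedTuple


class Target(NamedTuple):
    name: str      # region name, e.g. "chr1_55040035.dna"
    wildtype: str  # wild-type sequence
    spacer: str    # CRISPR spacer sequence


def parse_targets(text: str) -> List[Target]:
    """Group the non-empty lines of the input file into 3-line records."""
    # Assumption: blank lines carry no meaning and can be dropped.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError(f"expected a multiple of 3 lines, got {len(lines)}")
    targets = []
    for i in range(0, len(lines), 3):
        name, wildtype, spacer = lines[i:i + 3]
        # Upper-case both sequences so later comparisons ignore case.
        targets.append(Target(name, wildtype.upper(), spacer.upper()))
    return targets
```

Raising an error when the line count is not a multiple of three catches the most common formatting mistake (a missing spacer or name line) before any FASTQ file is read.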

2. FASTQ Files

  • Place FASTQ files in any directory of your choice (the bundled example data lives in fastqjoin/demo_fastq/)
  • Supported formats: *.fastq or *.fastqjoin
  • The program will automatically detect and process all FASTQ files in the directory
  • The bundled .fastqjoin file was generated by merging the paired-end R1 and R2 reads with the fastq-join program from ea-utils
  • The bundled fastqjoin/demo_fastq/ directory contains a single demo dataset with only a downsampled subset of its reads: the kept reads are on-target for the regions in ABE_off_target_input_V106W.txt and are capped per region. This lets the program run quickly out of the box, but it is not the full sequencing dataset
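
File discovery of this kind can be sketched with the standard library alone. The function name `find_fastq_files` is hypothetical, and matching on the exact `.fastq` and `.fastqjoin` suffixes (non-recursively) is an assumption about how the script selects files.

```python
from pathlib import Path
from typing import List


def find_fastq_files(directory: str) -> List[Path]:
    """Return all *.fastq and *.fastqjoin files in a directory, sorted by name."""
    suffixes = {".fastq", ".fastqjoin"}
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in suffixes
    )
```

Under this sketch, compressed files such as `sample.fastq.gz` would not be picked up, because their final suffix is `.gz`.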

Quick Start

The repository ships with example data, so it runs out of the box:

python ABE_conversion_analysis.py

This uses the default input file (ABE_off_target_input_V106W.txt) and the default FASTQ directory (fastqjoin/demo_fastq/), writing results to output/ABE_analysis_result.xlsx (the output/ folder is created automatically).

Usage

Command-line Options

python ABE_conversion_analysis.py [--input INPUT] [--fastq-dir DIR] [--output OUTPUT] [--processes N]

| Argument | Short | Default | Description |
|---|---|---|---|
| --input | -i | ABE_off_target_input_V106W.txt | Input configuration file |
| --fastq-dir | -f | fastqjoin/demo_fastq/ | Directory containing FASTQ files |
| --output | -o | output/ABE_analysis_result.xlsx | Output Excel file (parent folder auto-created) |
| --processes | -p | 8 | Max parallel processes |
| --indicator-length | (none) | 10 | Length (bp) of each flanking indicator used to crop reads |
| --wildtype-range | (none) | 20 | Half-width (bp) of the indel-free window required to call a non-indel outcome |
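
For readers who want to see how such an interface maps onto `argparse`, here is a sketch that mirrors the options and defaults listed above. It is an illustration only: the function name `build_parser` and the `int` types for the numeric options are assumptions, and the script's real argument definitions may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options and defaults."""
    parser = argparse.ArgumentParser(description="ABE conversion analysis")
    parser.add_argument("-i", "--input", default="ABE_off_target_input_V106W.txt",
                        help="Input configuration file")
    parser.add_argument("-f", "--fastq-dir", default="fastqjoin/demo_fastq/",
                        help="Directory containing FASTQ files")
    parser.add_argument("-o", "--output", default="output/ABE_analysis_result.xlsx",
                        help="Output Excel file")
    parser.add_argument("-p", "--processes", type=int, default=8,
                        help="Max parallel processes")
    # The last two options have no short form in the documented interface.
    parser.add_argument("--indicator-length", type=int, default=10,
                        help="Length (bp) of each flanking indicator")
    parser.add_argument("--wildtype-range", type=int, default=20,
                        help="Half-width (bp) of the indel-free window")
    return parser
```

`argparse` converts dashes in option names to underscores, so the values are read as `args.fastq_dir`, `args.indicator_length`, and `args.wildtype_range`.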

Examples

# Run with the bundled example data (uses all defaults)
python ABE_conversion_analysis.py

# Specify all paths explicitly
python ABE_conversion_analysis.py \
    --input ABE_off_target_input_V106W.txt \
    --fastq-dir fastqjoin/my_sample/ \
    --output results.xlsx

# Limit CPU usage
python ABE_conversion_analysis.py --processes 4

The program will:

  • Load target regions from the input file
  • Detect all FASTQ files (.fastq and .fastqjoin)
  • Process samples in parallel using multiprocessing
  • Generate an Excel file with detailed conversion statistics

Excel File Format

The output Excel file contains the following columns:

| Column | Description |
|---|---|
| Sample | FASTQ filename |
| Region | Genomic region/target name |
| Position in Spacer (PAM-proximal) | Position of adenine in spacer (numbered from PAM-proximal end) |
| Total Sequence | Total number of sequences analyzed |
| Indel | Number of sequences with insertions/deletions |
| A | Number of sequences maintaining adenine (no conversion) |
| T | Number of A-to-T conversions |
| G | Number of A-to-G conversions (desired ABE editing) |
| C | Number of A-to-C conversions |
| N | Number of sequences with ambiguous base calls |
| is Sense | Boolean indicating if sequence is sense (True) or antisense (False) |
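
The output reports raw counts, so a percentage has to be derived from them. The sketch below shows one way to do that for a single row; the function name `a_to_g_percent` is hypothetical, and the choice of denominator (indel-free reads by default, or all reads) is an assumption for illustration, not a statement of how efficiency is defined by the authors.

```python
from typing import Mapping


def a_to_g_percent(row: Mapping[str, int], exclude_indels: bool = True) -> float:
    """Return A-to-G conversion as a percentage for one output row.

    Assumption: the denominator is either all reads or reads without indels.
    """
    denominator = row["Total Sequence"]
    if exclude_indels:
        denominator -= row["Indel"]
    if denominator <= 0:
        return 0.0  # avoid division by zero for empty or all-indel rows
    return 100.0 * row["G"] / denominator
```

Whichever denominator is used, it should be stated alongside the reported percentages so results from different runs remain comparable.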

Algorithm Details

1. Orientation Detection

The program automatically determines if the target sequence is in sense or antisense orientation by searching for the spacer sequence in both the original and reverse complement of the wild-type sequence.
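
The orientation check described above can be sketched with plain string search. The names `reverse_complement` and `detect_orientation` are hypothetical, and raising an exception when the spacer is found in neither strand is an assumption about error handling.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def detect_orientation(wildtype: str, spacer: str) -> bool:
    """Return True if the spacer is sense, False if it is antisense."""
    wt = wildtype.upper()
    sp = spacer.upper()
    if sp in wt:
        return True
    if sp in reverse_complement(wt):
        return False
    raise ValueError("spacer not found in either orientation")
```

In the example entry shown under Input Files, the spacer matches the reverse complement of the wild-type sequence, so a check of this kind would classify that target as antisense.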

2. Sequence Processing

  • Extracts sequences between flanking indicators
  • Discards sequences that appear only once (likely sequencing errors)
  • Performs global pairwise alignment with BioPython
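
The first two steps can be sketched as follows; the alignment step is left out. The function names are hypothetical, and two details are assumptions: that the cropped sequence includes both indicators, and that the indicators are short sequences taken from the wild-type flanks (their length would correspond to `--indicator-length`).

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def crop_between_indicators(read: str, left: str, right: str) -> Optional[str]:
    """Return the part of the read spanning both indicators, or None."""
    start = read.find(left)
    if start == -1:
        return None
    # Search for the right indicator only downstream of the left one.
    end = read.find(right, start + len(left))
    if end == -1:
        return None
    return read[start:end + len(right)]


def count_cropped_reads(reads: Iterable[str], left: str, right: str,
                        min_count: int = 2) -> Dict[str, int]:
    """Count identical cropped sequences and drop those seen fewer than min_count times."""
    counts = Counter()
    for read in reads:
        cropped = crop_between_indicators(read, left, right)
        if cropped is not None:
            counts[cropped] += 1
    return {seq: n for seq, n in counts.items() if n >= min_count}
```

Collapsing identical sequences before alignment means each distinct sequence only needs to be aligned once, with its count carried along as a weight.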

3. Conversion Tracking

  • Identifies all adenine positions within the spacer
  • Uses PAM-proximal numbering (position 1 = closest to PAM)
  • Calibrates positions to account for insertions
  • Detects wildtype, indel, and base conversion outcomes
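
PAM-proximal numbering can be illustrated with a short sketch. It assumes the spacer is written 5'→3' with the PAM immediately 3' of it, so the last spacer base is position 1; it also assumes reads are already indel-free and trimmed to the spacer, so it does not cover the calibration for insertions. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def adenine_positions(spacer: str) -> List[Tuple[int, int]]:
    """Return (PAM-proximal position, 0-based index) for each A in the spacer."""
    sp = spacer.upper()
    # Assumption: PAM lies 3' of the spacer, so index len-1 is position 1.
    pairs = [(len(sp) - i, i) for i, base in enumerate(sp) if base == "A"]
    return sorted(pairs)


def tally_position(spacer_reads: Iterable[str], index: int) -> Dict[str, int]:
    """Count the base observed at one spacer index across indel-free reads."""
    counts = Counter({base: 0 for base in "ATGCN"})
    for read in spacer_reads:
        base = read[index].upper()
        # Anything other than A/T/G/C is counted as an ambiguous call.
        counts[base if base in "ATGC" else "N"] += 1
    return dict(counts)
```

For the example spacer `cccgcaccttggcgcagcgg`, the two adenines sit at 0-based indices 5 and 15, which this numbering reports as PAM-proximal positions 15 and 5 respectively.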

4. Parallel Processing

  • Utilizes multiprocessing to analyze multiple samples simultaneously
  • Default: Uses up to 8 CPU cores
  • Significantly reduces processing time for large datasets
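
A per-sample worker pool along these lines can be sketched with `multiprocessing.Pool`. The worker `process_sample` is a hypothetical placeholder that only reports how many regions it was given, and capping the pool size by the task count and the CPU count is an assumption about how the process limit is applied.

```python
import multiprocessing as mp
from typing import List, Sequence, Tuple


def choose_process_count(n_tasks: int, max_processes: int = 8) -> int:
    """Never start more workers than tasks, CPUs, or the configured limit."""
    return max(1, min(max_processes, mp.cpu_count(), n_tasks))


def process_sample(task: Tuple[str, Sequence[str]]) -> Tuple[str, int]:
    """Placeholder worker: a real worker would analyze every region of one sample."""
    sample, regions = task
    return sample, len(regions)


def run_all(samples: Sequence[str], regions: Sequence[str],
            max_processes: int = 8) -> List[Tuple[str, int]]:
    """Process each sample in its own worker and collect results in input order."""
    tasks = [(sample, regions) for sample in samples]
    with mp.Pool(processes=choose_process_count(len(tasks), max_processes)) as pool:
        return pool.map(process_sample, tasks)
```

When used in a script, `run_all` should be called under an `if __name__ == "__main__":` guard, since start methods that re-import the main module would otherwise launch the pool again in every worker.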

Tested Environment

The program was developed and tested on Linux running under WSL2 (Windows Subsystem for Linux 2) on Windows:

| Component | Specification |
|---|---|
| Operating system | Ubuntu 22.04.1 LTS (WSL2) |
| Kernel | 6.18.33.1-microsoft-standard-WSL2 |
| CPU | 12th Gen Intel Core i5-12400 (8 logical processors allocated to WSL2) |
| RAM | 23.5 GB allocated to WSL2 |
| Python | 3.14.3 |

The code uses only the standard library plus BioPython and openpyxl, so it is not WSL2-specific and should run on any platform with a compatible Python; the WSL2 configuration above is simply the environment in which it was validated.

Performance

  • Processing speed depends on:
    • Number of FASTQ files
    • Number of target regions
    • Read depth per sample
    • CPU cores available

In the tested environment above, the bundled demo (1 sample × 28 regions, ~45,000 on-target reads) completes in roughly 5-7 seconds. Full sequencing datasets with multiple samples and millions of reads take correspondingly longer, on the order of minutes.

Troubleshooting

Common Issues

  1. "SPACER not found" error

    • Check that spacer sequences exactly match regions in wild-type sequences
    • Verify input file format (3 lines per region)
  2. "Zero length sequence" warning

    • Usually indicates poor sequencing quality
    • Affected sequences are automatically skipped
  3. Low conversion efficiency

    • Verify correct target sequences
    • Check sequencing quality and depth
    • Ensure proper PCR amplification of target regions

Code Structure

The Python script follows standard conventions:

  • Configuration Parameters: All settings at the top of the file
  • Utility Functions: Helper functions for sequence manipulation
  • Sequence Analysis Functions: Functions for finding A positions and cropping sequences
  • Alignment Functions: BioPython alignment and calibration functions
  • Main Processing Function: Core logic for processing each task
  • Main Execution: Entry point with user-friendly progress output

Citation

License

This program is released under the MIT License. See the LICENSE file for details.

Contact

For questions or issues, please contact:

bbakgosu@snu.ac.kr
