This program analyzes Adenine Base Editor (ABE) conversion efficiency from high-throughput sequencing data. It processes FASTQ files to identify and quantify A-to-G conversions at specific genomic target sites.
The ABE Conversion Analysis pipeline:
- Reads FASTQ sequencing files and reference sequences
- Automatically detects sense/antisense orientation
- Aligns reads to reference sequences
- Identifies A-to-G conversions within CRISPR spacer regions
- Quantifies editing efficiency at each adenine position
- Outputs results to Excel format with detailed conversion statistics
- Python 3.7+
- BioPython (1.79+)
- openpyxl
- multiprocessing (built-in)
Install dependencies:
pip install biopython openpyxl
Format: ABE_off_target_input_V106W.txt
The input file contains target information in a 3-line repeating format:
[Region name]
[Wild-type sequence]
[CRISPR spacer sequence]
Example:
chr1_55040035.dna
cttgcgttccgaggaggacggcctggccgaagcacccgagcacggaaccacagccaccttccaccgctgcgccaaggtgcgggtgtagggatgggaggccggggcgaacccgcagccgggacggtgcggtgctgtttcctctcgggcctcagtttccccccatgtaagagaggaagtggagtgcaggtcgccgagggc
cccgcaccttggcgcagcgg
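To make the 3-line layout concrete, here is a minimal parsing sketch. It is not taken from the script itself: the `TargetRegion` record and the `parse_targets` function are hypothetical names, and skipping blank lines and upper-casing the sequences are assumptions made for illustration. The sequences in the demo string are shortened placeholders, not real targets.

```python
from typing import List, NamedTuple


class TargetRegion(NamedTuple):
    """One target: region name, wild-type sequence, CRISPR spacer (hypothetical type)."""
    name: str
    wild_type: str
    spacer: str


def parse_targets(text: str) -> List[TargetRegion]:
    """Parse the 3-line repeating format into TargetRegion records."""
    # Assumption: blank lines carry no meaning and can be dropped.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError(f"expected a multiple of 3 lines, got {len(lines)}")
    return [
        # Assumption: sequences are case-insensitive, so normalize to upper case.
        TargetRegion(lines[i], lines[i + 1].upper(), lines[i + 2].upper())
        for i in range(0, len(lines), 3)
    ]


# Shortened placeholder sequences, for illustration only.
example = """chr1_55040035.dna
cttgcgttcc
cccgcacctt
"""
regions = parse_targets(example)
print(regions[0].name, len(regions[0].wild_type), regions[0].spacer)
```

A file whose line count is not a multiple of three raises an error here, which mirrors the "3 lines per region" requirement.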
- Place FASTQ files in any directory of your choice (the bundled example data lives in `fastqjoin/demo_fastq/`)
- Supported formats: `*.fastq` or `*.fastqjoin`
- The program will automatically detect and process all FASTQ files in the directory
- The `.fastqjoin` file was generated by merging the paired-end R1 and R2 reads with the `fastq-join` program from ea-utils
- The bundled `fastqjoin/demo_fastq/` contains a single demo dataset, and only a downsampled subset of its reads is included (kept reads are on-target for the regions in `ABE_off_target_input_V106W.txt`, capped per region) so the program can be run quickly out of the box; it is not the full sequencing dataset
The repository ships with example data, so it runs out of the box:
python ABE_conversion_analysis.py
This uses the default input file (`ABE_off_target_input_V106W.txt`) and the default FASTQ directory (`fastqjoin/demo_fastq/`), writing results to `output/ABE_analysis_result.xlsx` (the `output/` folder is created automatically).
python ABE_conversion_analysis.py [--input INPUT] [--fastq-dir DIR] [--output OUTPUT] [--processes N]
| Argument | Short | Default | Description |
|---|---|---|---|
| `--input` | `-i` | `ABE_off_target_input_V106W.txt` | Input configuration file |
| `--fastq-dir` | `-f` | `fastqjoin/demo_fastq/` | Directory containing FASTQ files |
| `--output` | `-o` | `output/ABE_analysis_result.xlsx` | Output Excel file (parent folder auto-created) |
| `--processes` | `-p` | `8` | Max parallel processes |
| `--indicator-length` | | `10` | Length (bp) of each flanking indicator used to crop reads |
| `--wildtype-range` | | `20` | Half-width (bp) of the indel-free window required to call a non-indel outcome |
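For readers who want to see how such an interface can be declared, the following is a sketch of an `argparse` parser with the same flags and defaults as listed above. The `build_parser` function name and the help strings are illustrative assumptions; the script's actual parser may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="ABE conversion analysis")
    parser.add_argument("-i", "--input", default="ABE_off_target_input_V106W.txt",
                        help="Input configuration file")
    parser.add_argument("-f", "--fastq-dir", default="fastqjoin/demo_fastq/",
                        help="Directory containing FASTQ files")
    parser.add_argument("-o", "--output", default="output/ABE_analysis_result.xlsx",
                        help="Output Excel file")
    parser.add_argument("-p", "--processes", type=int, default=8,
                        help="Max parallel processes")
    # These two options have no short form in the documented interface.
    parser.add_argument("--indicator-length", type=int, default=10,
                        help="Length (bp) of each flanking indicator")
    parser.add_argument("--wildtype-range", type=int, default=20,
                        help="Half-width (bp) of the indel-free window")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is reproducible.
args = build_parser().parse_args(["--processes", "4"])
print(args.processes, args.fastq_dir, args.indicator_length, args.wildtype_range)
```

Note that `argparse` converts dashes to underscores, so `--fastq-dir` is read back as `args.fastq_dir`.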
# Run with the bundled example data (uses all defaults)
python ABE_conversion_analysis.py
# Specify all paths explicitly
python ABE_conversion_analysis.py \
--input ABE_off_target_input_V106W.txt \
--fastq-dir fastqjoin/my_sample/ \
--output results.xlsx
# Limit CPU usage
python ABE_conversion_analysis.py --processes 4
The program will:
- Load target regions from the input file
- Detect all FASTQ files (`.fastq` and `.fastqjoin`)
- Process samples in parallel using multiprocessing
- Generate an Excel file with detailed conversion statistics
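The file-detection and parallel-processing steps above can be sketched roughly as follows. This is an assumption-laden illustration, not the script's code: `find_fastq_files`, `count_reads`, and `run_parallel` are hypothetical names, and the per-file task is a placeholder that only counts reads (assuming the usual 4 lines per FASTQ record) where the real program would perform the conversion analysis.

```python
from multiprocessing import Pool
from pathlib import Path
from typing import List, Tuple


def find_fastq_files(fastq_dir: str) -> List[Path]:
    """Return the *.fastq and *.fastqjoin files in fastq_dir, sorted by path."""
    directory = Path(fastq_dir)
    if not directory.is_dir():
        return []
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix in {".fastq", ".fastqjoin"}
    )


def count_reads(path: Path) -> Tuple[str, int]:
    """Placeholder task: count reads, assuming 4 lines per FASTQ record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    return path.name, n_lines // 4


def run_parallel(fastq_dir: str, max_processes: int = 8) -> List[Tuple[str, int]]:
    """Run the placeholder task on every detected file with a process pool."""
    files = find_fastq_files(fastq_dir)
    if not files:
        return []
    # Never start more workers than there are files to process.
    with Pool(processes=min(max_processes, len(files))) as pool:
        return pool.map(count_reads, files)


if __name__ == "__main__":
    print(run_parallel("fastqjoin/demo_fastq/"))
```

The `if __name__ == "__main__":` guard matters on platforms where `multiprocessing` starts workers by re-importing the main module.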
The output Excel file contains the following columns:
| Column | Description |
|---|---|
| Sample | FASTQ filename |
| Region | Genomic region/target name |
| Position in Spacer (PAM-proximal) | Position of adenine in spacer (numbered from PAM-proximal end) |
| Total Sequence | Total number of sequences analyzed |
| Indel | Number of sequences with insertions/deletions |
| A | Number of sequences maintaining adenine (no conversion) |
| T | Number of A-to-T conversions |
| G | Number of A-to-G conversions (desired ABE editing) |
| C | Number of A-to-C conversions |
| N | Number of sequences with ambiguous base calls |
| is Sense | Boolean indicating if sequence is sense (True) or antisense (False) |
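As a hedged illustration of how a row of this table might be turned into a percentage, the sketch below computes the share of A-to-G reads. The formula is an assumption, not something this README defines: it divides `G` by `Total Sequence`, with an option to subtract `Indel` first in case the total includes indel reads. The `a_to_g_percent` function is a hypothetical helper and the counts in `row` are invented for the example.

```python
from typing import Dict


def a_to_g_percent(row: Dict[str, int], exclude_indels: bool = False) -> float:
    """Percentage of reads carrying G at this adenine position (assumed formula)."""
    total = row["Total Sequence"]
    if exclude_indels:
        # Assumption: "Total Sequence" includes the reads counted under "Indel".
        total -= row["Indel"]
    if total <= 0:
        return 0.0
    return 100.0 * row["G"] / total


# Invented counts for illustration only; not real results.
row = {"Total Sequence": 1000, "Indel": 50, "A": 700, "T": 2, "G": 240, "C": 3, "N": 5}
print(a_to_g_percent(row))
print(a_to_g_percent(row, exclude_indels=True))
```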
The program automatically determines if the target sequence is in sense or antisense orientation by searching for the spacer sequence in both the original and reverse complement of the wild-type sequence.
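A minimal sketch of this orientation check, using only the standard library, might look like the following. The `reverse_complement` and `detect_orientation` functions are hypothetical names, and raising `ValueError` when the spacer is absent from both strands is an assumption about how the "SPACER not found" case could be signalled. The wild-type string is a short excerpt of the example sequence shown earlier.

```python
from typing import Tuple

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def detect_orientation(wild_type: str, spacer: str) -> Tuple[bool, int]:
    """Return (is_sense, start): the strand holding the spacer and its 0-based start on that strand."""
    wt, sp = wild_type.upper(), spacer.upper()
    pos = wt.find(sp)
    if pos != -1:
        return True, pos
    # Not found as given, so try the reverse complement of the wild-type.
    pos = reverse_complement(wt).find(sp)
    if pos != -1:
        return False, pos
    raise ValueError("SPACER not found")


# Excerpt of the example wild-type sequence and its spacer.
wt_excerpt = "ccaccgctgcgccaaggtgcgggtgtaggg"
spacer = "cccgcaccttggcgcagcgg"
print(detect_orientation(wt_excerpt, spacer))
```

For this excerpt the spacer is found only on the reverse complement, so the target is reported as antisense.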
- Extracts sequences between flanking indicators
- Discards sequences that appear only once (likely sequencing errors)
- Performs global pairwise alignment with BioPython
- Identifies all adenine positions within the spacer
- Uses PAM-proximal numbering (position 1 = closest to PAM)
- Calibrates positions to account for insertions
- Detects wildtype, indel, and base conversion outcomes
- Utilizes multiprocessing to analyze multiple samples simultaneously
- Default: Uses up to 8 CPU cores
- Significantly reduces processing time for large datasets
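Some of the steps listed above (cropping between indicators, dropping singleton reads, and PAM-proximal numbering) can be sketched with the standard library alone; the alignment and calibration steps are left out. All function names here are hypothetical. The sketch assumes the indicators themselves are excluded from the cropped sequence, and that the spacer is written 5'→3' with the PAM immediately 3' of it, so the last base of the string is position 1.

```python
from collections import Counter
from typing import Dict, List, Optional


def crop_between_indicators(read: str, left: str, right: str) -> Optional[str]:
    """Return the part of read between the two indicators, or None if either is missing."""
    start = read.find(left)
    if start == -1:
        return None
    start += len(left)
    end = read.find(right, start)
    if end == -1:
        return None
    return read[start:end]


def drop_singletons(sequences: List[str]) -> Dict[str, int]:
    """Count identical sequences and keep only those seen more than once."""
    counts = Counter(sequences)
    return {seq: n for seq, n in counts.items() if n > 1}


def adenine_positions_pam_proximal(spacer: str) -> List[int]:
    """1-based positions of A in the spacer, counted from the PAM-proximal end.

    Assumption: the PAM lies immediately 3' of the spacer string.
    """
    n = len(spacer)
    return sorted(n - i for i, base in enumerate(spacer.upper()) if base == "A")


print(crop_between_indicators("TTTACGTGGGCCCAAA", "ACGT", "CCC"))
print(drop_singletons(["AA", "AA", "AC"]))
print(adenine_positions_pam_proximal("cccgcaccttggcgcagcgg"))
```

Collapsing identical reads into counts before alignment also means each distinct sequence only needs to be aligned once.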
The program was developed and tested on Linux running under WSL2 (Windows Subsystem for Linux 2) on Windows:
| Component | Specification |
|---|---|
| Operating system | Ubuntu 22.04.1 LTS (WSL2) |
| Kernel | 6.18.33.1-microsoft-standard-WSL2 |
| CPU | 12th Gen Intel Core i5-12400 (8 logical processors allocated to WSL2) |
| RAM | 23.5 GB allocated to WSL2 |
| Python | 3.14.3 |
The code uses only the standard library plus BioPython and openpyxl, so it is not WSL2-specific and should run on any platform with a compatible Python; the WSL2 configuration above is simply the environment in which it was validated.
- Processing speed depends on:
  - Number of FASTQ files
  - Number of target regions
  - Read depth per sample
  - CPU cores available
On the tested environment above, the bundled demo (1 sample × 28 regions, ~45,000 on-target reads) completes in roughly 5-7 seconds. Full sequencing datasets with multiple samples and millions of reads scale up accordingly (on the order of minutes).
- "SPACER not found" error
  - Check that spacer sequences exactly match regions in wild-type sequences
  - Verify input file format (3 lines per region)
- "Zero length sequence" warning
  - Usually indicates poor sequencing quality
  - Affected sequences are automatically skipped
- Low conversion efficiency
  - Verify correct target sequences
  - Check sequencing quality and depth
  - Ensure proper PCR amplification of target regions
The Python script follows standard conventions:
- Configuration Parameters: All settings at the top of the file
- Utility Functions: Helper functions for sequence manipulation
- Sequence Analysis Functions: Functions for finding A positions and cropping sequences
- Alignment Functions: BioPython alignment and calibration functions
- Main Processing Function: Core logic for processing each task
- Main Execution: Entry point with user-friendly progress output
This program is released under the MIT License. See the LICENSE file for details.
For questions or issues, please contact: