ABE Conversion Analysis

This program analyzes Adenine Base Editor (ABE) conversion efficiency from high-throughput sequencing data. It processes FASTQ files to identify and quantify A-to-G conversions at specific genomic target sites.

Overview

The ABE Conversion Analysis pipeline:

Reads FASTQ sequencing files and reference sequences
Automatically detects sense/antisense orientation
Aligns reads to reference sequences
Identifies A-to-G conversions within CRISPR spacer regions
Quantifies editing efficiency at each adenine position
Outputs results to Excel format with detailed conversion statistics

Requirements

Python Dependencies

Python 3.7+
BioPython (1.79+)
openpyxl
multiprocessing (built-in)

Install dependencies:

pip install biopython openpyxl

Input Files

1. Input Configuration File

Default file: ABE_off_target_input_V106W.txt

The input file contains target information in a 3-line repeating format:

[Region name]
[Wild-type sequence]
[CRISPR spacer sequence]

Example:

chr1_55040035.dna
cttgcgttccgaggaggacggcctggccgaagcacccgagcacggaaccacagccaccttccaccgctgcgccaaggtgcgggtgtagggatgggaggccggggcgaacccgcagccgggacggtgcggtgctgtttcctctcgggcctcagtttccccccatgtaagagaggaagtggagtgcaggtcgccgagggc
cccgcaccttggcgcagcgg
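The 3-line repeating format above can be read with a few lines of Python. This is a minimal sketch, not the parser used in ABE_conversion_analysis.py: the names `Target` and `parse_targets` are hypothetical, and skipping blank lines and upper-casing the sequences are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Target(NamedTuple):
    region: str    # e.g. "chr1_55040035.dna"
    wildtype: str  # wild-type (reference) sequence
    spacer: str    # CRISPR spacer sequence


def parse_targets(text: str) -> List[Target]:
    """Parse the 3-line repeating format into Target records."""
    # Assumption: blank lines carry no meaning and can be dropped.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError(
            f"expected a multiple of 3 non-empty lines, got {len(lines)}"
        )
    targets = []
    for i in range(0, len(lines), 3):
        region, wildtype, spacer = lines[i:i + 3]
        # Assumption: upper-case so later comparisons ignore case.
        targets.append(Target(region, wildtype.upper(), spacer.upper()))
    return targets
```

A line count that is not a multiple of 3 is reported early, which is the same format problem the Troubleshooting section asks you to check by hand.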

2. FASTQ Files

Place FASTQ files in any directory of your choice (the bundled example data lives in fastqjoin/demo_fastq/)
Supported formats: *.fastq or *.fastqjoin
The program will automatically detect and process all FASTQ files in the directory
The .fastqjoin file was generated by merging the paired-end R1 and R2 reads with the fastq-join program from ea-utils
The bundled fastqjoin/demo_fastq/ directory contains a single demo dataset with a downsampled subset of its reads: only on-target reads for the regions in ABE_off_target_input_V106W.txt are kept, capped per region. This lets the program run quickly out of the box; it is not the full sequencing dataset
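File detection by extension can be sketched as follows. The function name `find_fastq_files` is hypothetical, and the sketch assumes a non-recursive search of the directory; the actual script may discover files differently.

```python
from pathlib import Path
from typing import List


def find_fastq_files(directory: str) -> List[Path]:
    """Return all *.fastq and *.fastqjoin files in a directory, sorted."""
    root = Path(directory)
    matches = [
        p for p in root.iterdir()
        # Assumption: only the top level of the directory is searched.
        if p.is_file() and p.suffix in {".fastq", ".fastqjoin"}
    ]
    return sorted(matches)
```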

Quick Start

The repository ships with example data, so it runs out of the box:

python ABE_conversion_analysis.py

This uses the default input file (ABE_off_target_input_V106W.txt) and the default FASTQ directory (fastqjoin/demo_fastq/), writing results to output/ABE_analysis_result.xlsx (the output/ folder is created automatically).

Usage

Command-line Options

python ABE_conversion_analysis.py [--input INPUT] [--fastq-dir DIR] [--output OUTPUT] [--processes N] [--indicator-length N] [--wildtype-range N]

Argument	Short	Default	Description
`--input`	`-i`	`ABE_off_target_input_V106W.txt`	Input configuration file
`--fastq-dir`	`-f`	`fastqjoin/demo_fastq/`	Directory containing FASTQ files
`--output`	`-o`	`output/ABE_analysis_result.xlsx`	Output Excel file (parent folder auto-created)
`--processes`	`-p`	`8`	Max parallel processes
`--indicator-length`		`10`	Length (bp) of each flanking indicator used to crop reads
`--wildtype-range`		`20`	Half-width (bp) of the indel-free window required to call a non-indel outcome
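For readers who want to wrap or extend the command line, the option table above maps onto `argparse` roughly like this. It is a sketch reconstructed from the table only; the function name `build_parser` and the help strings are assumptions, and the script's real parser may differ in detail.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="ABE conversion analysis")
    parser.add_argument("-i", "--input",
                        default="ABE_off_target_input_V106W.txt",
                        help="Input configuration file")
    parser.add_argument("-f", "--fastq-dir",
                        default="fastqjoin/demo_fastq/",
                        help="Directory containing FASTQ files")
    parser.add_argument("-o", "--output",
                        default="output/ABE_analysis_result.xlsx",
                        help="Output Excel file")
    parser.add_argument("-p", "--processes", type=int, default=8,
                        help="Max parallel processes")
    parser.add_argument("--indicator-length", type=int, default=10,
                        help="Length (bp) of each flanking indicator")
    parser.add_argument("--wildtype-range", type=int, default=20,
                        help="Half-width (bp) of the indel-free window")
    return parser
```

Note that `argparse` exposes hyphenated options with underscores, so `--fastq-dir` becomes `args.fastq_dir`.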

Examples

# Run with the bundled example data (uses all defaults)
python ABE_conversion_analysis.py

# Specify all paths explicitly
python ABE_conversion_analysis.py \
    --input ABE_off_target_input_V106W.txt \
    --fastq-dir fastqjoin/my_sample/ \
    --output results.xlsx

# Limit CPU usage
python ABE_conversion_analysis.py --processes 4

The program will:

Load target regions from the input file
Detect all FASTQ files (.fastq and .fastqjoin)
Process samples in parallel using multiprocessing
Generate an Excel file with detailed conversion statistics

Excel File Format

The output Excel file contains the following columns:

Column	Description
Sample	FASTQ filename
Region	Genomic region/target name
Position in Spacer (PAM-proximal)	Position of adenine in spacer (numbered from PAM-proximal end)
Total Sequence	Total number of sequences analyzed
Indel	Number of sequences with insertions/deletions
A	Number of sequences maintaining adenine (no conversion)
T	Number of A-to-T conversions
G	Number of A-to-G conversions (desired ABE editing)
C	Number of A-to-C conversions
N	Number of sequences with ambiguous base calls
is Sense	Boolean indicating if sequence is sense (True) or antisense (False)
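The spreadsheet reports raw counts, so a percentage has to be derived from them. The sketch below shows one possible definition. It is an assumption, not the author's stated formula: the README does not say whether Total Sequence includes indel reads, so the denominator is made a choice here. The function name `a_to_g_percent` is hypothetical, and `row` stands for one spreadsheet row loaded into a dict keyed by column name.

```python
def a_to_g_percent(row: dict, exclude_indels: bool = False) -> float:
    """Percentage of reads carrying G at one adenine position.

    Assumption: the denominator is "Total Sequence", optionally minus
    the "Indel" count. Check this against your own analysis convention.
    """
    denominator = row["Total Sequence"]
    if exclude_indels:
        denominator -= row["Indel"]
    if denominator <= 0:
        return 0.0
    return 100.0 * row["G"] / denominator
```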

Algorithm Details

1. Orientation Detection

The program automatically determines if the target sequence is in sense or antisense orientation by searching for the spacer sequence in both the original and reverse complement of the wild-type sequence.
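The orientation check described above can be sketched with plain string operations. The names `reverse_complement` and `detect_orientation` are hypothetical, and raising `ValueError` on a missing spacer is an assumption modelled on the "SPACER not found" message in the Troubleshooting section.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def detect_orientation(wildtype: str, spacer: str) -> bool:
    """Return True for sense, False for antisense."""
    wt, sp = wildtype.upper(), spacer.upper()
    if sp in wt:
        return True  # spacer found on the strand as given
    if sp in reverse_complement(wt):
        return False  # spacer found only on the opposite strand
    raise ValueError("SPACER not found in wild-type sequence")
```

Under this sketch, the example spacer from the Input Files section (cccgcaccttggcgcagcgg) matches only the reverse complement of its wild-type sequence, so that region would be reported as antisense.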

2. Sequence Processing

Extracts sequences between flanking indicators
Discards sequences observed only once (likely sequencing errors)
Performs global pairwise alignment with BioPython
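The first two steps above (cropping between flanking indicators and dropping singletons) can be sketched as follows; the alignment step is left out. Function names are hypothetical, and keeping the indicators inside the cropped sequence is an assumption. In the real script the indicators are presumably derived from the wild-type sequence using --indicator-length; here they are passed in directly.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def crop_between_indicators(read: str, left: str, right: str) -> Optional[str]:
    """Return the part of a read spanning left..right, or None if absent."""
    start = read.find(left)
    if start == -1:
        return None
    # Search for the right indicator only after the left one ends.
    end = read.find(right, start + len(left))
    if end == -1:
        return None
    # Assumption: both indicators are kept in the cropped sequence.
    return read[start:end + len(right)]


def count_recurrent(reads: Iterable[str], left: str, right: str) -> Dict[str, int]:
    """Count cropped sequences and keep those seen more than once."""
    counts = Counter()
    for read in reads:
        cropped = crop_between_indicators(read, left, right)
        if cropped is not None:
            counts[cropped] += 1
    return {seq: n for seq, n in counts.items() if n > 1}
```

Counting identical sequences first also means each distinct sequence needs to be aligned only once, with its count carried along.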

3. Conversion Tracking

Identifies all adenine positions within the spacer
Uses PAM-proximal numbering (position 1 = closest to PAM)
Calibrates positions to account for insertions
Detects wildtype, indel, and base conversion outcomes
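PAM-proximal numbering can be illustrated on the spacer alone. This sketch assumes the spacer is written 5' to 3' with the PAM immediately 3' of it (as for SpCas9), so the last base is position 1; that assumption and the function name `adenine_positions` are not taken from the script. Insertion-aware calibration is not shown.

```python
from typing import List


def adenine_positions(spacer: str) -> List[int]:
    """List PAM-proximal positions of every adenine in the spacer.

    Assumption: the PAM sits 3' of the spacer, so the last base of the
    spacer is position 1 and the first base is position len(spacer).
    """
    sp = spacer.upper()
    n = len(sp)
    # 0-based index i from the 5' end maps to PAM-proximal position n - i.
    return sorted(n - i for i, base in enumerate(sp) if base == "A")
```

For the 20-nt example spacer cccgcaccttggcgcagcgg, the two adenines fall at PAM-proximal positions 5 and 15 under this convention.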

4. Parallel Processing

Utilizes multiprocessing to analyze multiple samples simultaneously
Default: up to 8 worker processes (change with --processes)
Significantly reduces processing time for large datasets
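A common way to structure this kind of parallelism is a process pool mapped over independent tasks. The sketch below assumes one task per (sample, region) pair, which the README does not confirm; `build_tasks`, `process_task`, and `run_parallel` are hypothetical names, and `process_task` is a placeholder for the real per-task analysis.

```python
from itertools import product
from multiprocessing import Pool
from typing import List, Sequence, Tuple

Task = Tuple[str, str]  # (sample name, region name)


def build_tasks(samples: Sequence[str], regions: Sequence[str]) -> List[Task]:
    """Pair every sample with every region (assumed task granularity)."""
    return list(product(samples, regions))


def process_task(task: Task) -> Task:
    """Placeholder worker: the real one would crop, align, and count."""
    sample, region = task
    return (sample, region)


def run_parallel(tasks: List[Task], processes: int = 8) -> List[Task]:
    """Run tasks in a pool of at most `processes` workers."""
    # Never start more workers than there are tasks.
    workers = max(1, min(processes, len(tasks)))
    with Pool(processes=workers) as pool:
        return pool.map(process_task, tasks)
```

On platforms that start workers with spawn (Windows, macOS), call `run_parallel` from under an `if __name__ == "__main__":` guard so that child processes do not re-run the entry point.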

Tested Environment

The program was developed and tested on Linux running under WSL2 (Windows Subsystem for Linux 2) on Windows:

Component	Specification
Operating system	Ubuntu 22.04.1 LTS (WSL2)
Kernel	6.18.33.1-microsoft-standard-WSL2
CPU	12th Gen Intel Core i5-12400 (8 logical processors allocated to WSL2)
RAM	23.5 GB allocated to WSL2
Python	3.14.3

The code uses only the standard library plus BioPython and openpyxl, so it is not WSL2-specific and should run on any platform with a compatible Python; the WSL2 configuration above is simply the environment in which it was validated.

Performance

Processing speed depends on:
- Number of FASTQ files
- Number of target regions
- Read depth per sample
- CPU cores available

On the tested environment above, the bundled demo (1 sample × 28 regions, ~45,000 on-target reads) completes in roughly 5-7 seconds. Full sequencing datasets with multiple samples and millions of reads scale up accordingly (on the order of minutes).

Troubleshooting

Common Issues

"SPACER not found" error
- Check that each spacer sequence occurs exactly (no mismatches) in its wild-type sequence or in the reverse complement of that sequence
- Verify input file format (3 lines per region)
"Zero length sequence" warning
- Usually indicates poor sequencing quality
- Affected sequences are automatically skipped
Low conversion efficiency
- Verify correct target sequences
- Check sequencing quality and depth
- Ensure proper PCR amplification of target regions

Code Structure

The Python script follows standard conventions:

Configuration Parameters: All settings at the top of the file
Utility Functions: Helper functions for sequence manipulation
Sequence Analysis Functions: Functions for finding A positions and cropping sequences
Alignment Functions: BioPython alignment and calibration functions
Main Processing Function: Core logic for processing each task
Main Execution: Entry point with user-friendly progress output

Citation

License

This program is released under the MIT License. See the LICENSE file for details.

Contact

For questions or issues, please contact:

bbakgosu@snu.ac.kr
