fix(gtdb-parser): respect --use_ncbi_taxonomy in novel-species filter by thanhleviet · Pull Request #13 · gambit-suite/gambitdb

thanhleviet · 2026-05-19T14:01:33Z

Summary

The novel-species filter at GtdbSpreadsheetParser.py:166 was reading the raw gtdb_taxonomy column to drop GTDB placeholder species (sp###), but --use_ncbi_taxonomy and --ncbi_whitelist only rewrite the derived species column. As a result, genomes whose NCBI name is a real species but whose GTDB name is a placeholder were silently dropped — most visibly losing 797 of 837 Shigella flexneri rows and yielding zero S. flexneri in the final Escherichia/Shigella database.

This PR points the filter at the species column instead. Default builds are unaffected because both columns hold the same value when no rename flag is set. Builds with --use_ncbi_taxonomy or --ncbi_whitelist now correctly preserve NCBI-named species. Added na=False so blank cells in the species column don't raise during the regex check.

Changes

gambitdb/GtdbSpreadsheetParser.py:166 — switch column from gtdb_taxonomy to species; add na=False.
Explanatory comment expanded at lines 159-164 to capture the rationale.

Test plan

Existing GtdbSpreadsheetParser unit tests pass (2/2 green); 10 unrelated pre-existing failures in the wider suite are present on unmodified main and not caused by this change.
Verified on a hand-crafted 145-row subset:
- With --use_ncbi_taxonomy: Shigella flexneri now appears with ngenomes=6 (was 0 before fix).
- Without --use_ncbi_taxonomy: genuine sp\d+$ placeholders still correctly dropped (regression check).
- No orphan species rows with empty ngenomes in species_taxa.csv for the affected species.
Default behaviour unchanged for builds that do not set --use_ncbi_taxonomy or --ncbi_whitelist — the same rows are dropped because species is derived from gtdb_taxonomy in that path.

Out of scope

A secondary issue exists where species whose accessions are removed by the minimum-genomes filter still appear in species_taxa.csv with empty ngenomes (lines 202-235). This is independent of the column-mismatch bug fixed here and can fire under any flag combination. To be addressed in a follow-up.

The novel-species filter read the raw gtdb_taxonomy column while --use_ncbi_taxonomy rewrites only the species column. This caused genomes with NCBI-named species but GTDB placeholder taxonomy (e.g. NCBI Shigella flexneri / GTDB s__ECMA0423 sp047199055) to be dropped, ultimately yielding zero S. flexneri in the Escherichia/Shigella build. Switch the filter to read the species column so it sees post-rewrite names. Add na=False to tolerate rows missing a species call, which the species column (unlike gtdb_taxonomy) does not guarantee to populate.

Michal-Babins · 2026-05-19T14:07:51Z

Makes sense. LGTM

thanhleviet requested a review from Michal-Babins May 19, 2026 14:04

Michal-Babins approved these changes May 19, 2026

View reviewed changes

Comment thread gambitdb/GtdbSpreadsheetParser.py

Michal-Babins May 19, 2026

Copy link
Copy Markdown

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. LGTM

thanhleviet merged commit d027140 into main May 19, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gtdb-parser): respect --use_ncbi_taxonomy in novel-species filter#13

fix(gtdb-parser): respect --use_ncbi_taxonomy in novel-species filter#13
thanhleviet merged 1 commit into
mainfrom
fix/novel-species-filter-respects-ncbi-taxonomy

thanhleviet commented May 19, 2026

Uh oh!

Michal-Babins May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

thanhleviet commented May 19, 2026

Summary

Changes

Test plan

Out of scope

Uh oh!

Michal-Babins May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants