Skip to content

fix(gtdb-parser): respect --use_ncbi_taxonomy in novel-species filter#13

Merged
thanhleviet merged 1 commit into
mainfrom
fix/novel-species-filter-respects-ncbi-taxonomy
May 19, 2026
Merged

fix(gtdb-parser): respect --use_ncbi_taxonomy in novel-species filter#13
thanhleviet merged 1 commit into
mainfrom
fix/novel-species-filter-respects-ncbi-taxonomy

Conversation

@thanhleviet

Copy link
Copy Markdown
Contributor

Summary

The novel-species filter at GtdbSpreadsheetParser.py:166 was reading the raw gtdb_taxonomy column to drop GTDB placeholder species (sp###), but --use_ncbi_taxonomy and --ncbi_whitelist only rewrite the derived species column. As a result, genomes whose NCBI name is a real species but whose GTDB name is a placeholder were silently dropped — most visibly losing 797 of 837 Shigella flexneri rows and yielding zero S. flexneri in the final Escherichia/Shigella database.

This PR points the filter at the species column instead. Default builds are unaffected because both columns hold the same value when no rename flag is set. Builds with --use_ncbi_taxonomy or --ncbi_whitelist now correctly preserve NCBI-named species. Added na=False so blank cells in the species column don't raise during the regex check.

Changes

  • gambitdb/GtdbSpreadsheetParser.py:166 — switch column from gtdb_taxonomy to species; add na=False.
  • Explanatory comment expanded at lines 159-164 to capture the rationale.

Test plan

  • Existing GtdbSpreadsheetParser unit tests pass (2/2 green); 10 unrelated pre-existing failures in the wider suite are present on unmodified main and not caused by this change.
  • Verified on a hand-crafted 145-row subset:
    • With --use_ncbi_taxonomy: Shigella flexneri now appears with ngenomes=6 (was 0 before fix).
    • Without --use_ncbi_taxonomy: genuine sp\d+$ placeholders still correctly dropped (regression check).
    • No orphan species rows with empty ngenomes in species_taxa.csv for the affected species.
  • Default behaviour unchanged for builds that do not set --use_ncbi_taxonomy or --ncbi_whitelist — the same rows are dropped because species is derived from gtdb_taxonomy in that path.

Out of scope

A secondary issue exists where species whose accessions are removed by the minimum-genomes filter still appear in species_taxa.csv with empty ngenomes (lines 202-235). This is independent of the column-mismatch bug fixed here and can fire under any flag combination. To be addressed in a follow-up.

The novel-species filter read the raw gtdb_taxonomy column while
--use_ncbi_taxonomy rewrites only the species column. This caused
genomes with NCBI-named species but GTDB placeholder taxonomy (e.g.
NCBI Shigella flexneri / GTDB s__ECMA0423 sp047199055) to be dropped,
ultimately yielding zero S. flexneri in the Escherichia/Shigella build.

Switch the filter to read the species column so it sees post-rewrite
names. Add na=False to tolerate rows missing a species call, which the
species column (unlike gtdb_taxonomy) does not guarantee to populate.
@thanhleviet thanhleviet requested a review from Michal-Babins May 19, 2026 14:04

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. LGTM

@thanhleviet thanhleviet merged commit d027140 into main May 19, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants