Skip to content

SantanderAI/linear-adapter-trainer

LinearAdapterTrainer

Train linear embedding adapters with triplet loss to align retrieval embeddings with your queries.

License: Apache 2.0 Python 3.12+ CI CodeQL Code style: ruff Typed


Open source by Santander AI Lab. An AI / machine learning Python library for retrieval-augmented generation (RAG): it trains linear embedding adapters with triplet loss to align your retrieval embeddings with real user queries. Part of Santander AI Open Source — see also santander.com.

LinearAdapterTrainer fine-tunes retrieval without retraining your embedding model. It learns a small linear transform that is applied to query embeddings at search time, nudging them closer to relevant chunks and away from irrelevant ones. Your vector index stays exactly as it is — you only adapt the query side.

It is built around two composable modules:

  1. Dataset generator — point it at a knowledge base and it produces (query, positive, negative) triplets, with configurable negative mining (semantically opposite, hard, or random) and a leakage-free train/val split.
  2. Linear adapter trainer — trains a PyTorch linear adapter on those triplets with triplet loss, then reports retrieval gains with precision@k, recall@k, MRR and nDCG.

Why a linear adapter?

Approach Cost Re-index corpus? Reversible
Fine-tune the embedding model High (GPU, data) Yes No
Linear query adapter (this repo) Low (CPU-friendly) No Yes
Re-ranking model Medium (latency) No Yes

The adapter is a single matrix (initialized at identity), so training is fast, stable, and easy to audit. At inference you simply do adapted_query = adapter(query_embedding) before your usual nearest-neighbor search.

How it works

                  Module 1: dataset                Module 2: adapter
 ┌───────────────┐   ┌──────────────────┐      ┌────────────────────────┐
 │ Knowledge base│──▶│ query generation │──┐   │  anchor = adapter(q)    │
 │   (chunks)    │   │ negative mining  │  ├──▶│  triplet loss:          │
 └───────────────┘   │ train/val split  │  │   │   pull anchor→positive  │
         │           └──────────────────┘  │   │   push anchor↮negative  │
         ▼                                  │   └────────────┬───────────┘
   embed chunks ────────────────────────────              evaluate
                                                  precision@k · recall@k · MRR · nDCG

The triplet objective is:

L = max(0, d(adapter(query), positive) − d(adapter(query), negative) + margin)

where d is cosine (default) or Euclidean distance.

Installation

# with uv (recommended)
uv sync --dev

# for the example notebook (scraping + plotting)
uv sync --group examples

# or with pip, choosing the backends you need
pip install "linear-adapter-trainer[sentence-transformers]"   # local models
pip install "linear-adapter-trainer[openai]"                  # OpenAI API
pip install "linear-adapter-trainer[all]"

The core install is dependency-light (numpy, torch, tqdm). A dependency-free HashingEmbedder and TemplateQueryGenerator let you run the whole pipeline offline (great for CI and demos).

Quickstart (Python)

from linear_adapter_trainer import (
    AdapterTrainer, DatasetConfig, DatasetGenerator, KnowledgeBase,
    TemplateQueryGenerator, TrainingConfig,
)
from linear_adapter_trainer.embeddings import SentenceTransformerEmbedder

kb = KnowledgeBase.from_jsonl("examples/data/sample_kb.jsonl")
embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

# Module 1 — generate triplets
dataset = DatasetGenerator(
    knowledge_base=kb,
    embedder=embedder,
    query_generator=TemplateQueryGenerator(seed=0),  # or LLMQueryGenerator(...)
    config=DatasetConfig(queries_per_chunk=4, strategy="mixed", val_fraction=0.2),
).generate()

# Module 2 — train the adapter
result = AdapterTrainer(kb, embedder, TrainingConfig(epochs=30)).fit(dataset)

print(result.improvement)        # delta per metric vs the base embeddings
result.adapter.save("adapter.pt")

At query time:

import numpy as np
from linear_adapter_trainer import LinearAdapter

adapter = LinearAdapter.load("adapter.pt")
query_vec = embedder.embed(["how do plants make energy?"])
adapted = adapter.transform(query_vec)   # use this for nearest-neighbor search

Quickstart (CLI)

Everything is driven by one TOML file (see examples/config.toml):

uv run linear-adapter generate examples/config.toml   # build the dataset
uv run linear-adapter train    examples/config.toml   # train + report metrics
uv run linear-adapter evaluate examples/config.toml   # base vs adapted
uv run linear-adapter run      examples/config.toml   # generate -> train

Example output (with a Sentence-Transformers backend on a paraphrased query set):

metric              base     adapted     delta
------------------------------------------------
precision@1       0.5200      0.7100   +0.1900
mrr               0.6310      0.8050   +0.1740
ndcg@10           0.6890      0.8420   +0.1530

The bundled examples/config.toml is offline (hashing embedder + template queries) so it runs anywhere. In that setup the baseline is already optimal — the queries reuse chunk tokens — so the adapter correctly reports a ~0 delta. Model selection always includes the identity baseline, so the trained adapter can never score worse than your base embeddings. Switch to a semantic backend (and llm query generation) to see real gains.

Example: improving retrieval on real data

The notebook examples/santander_retrieval_demo.ipynb walks through the full workflow on a corpus scraped from the Santander website:

  1. Scrape & chunk the site into a knowledge base (live, with a cached snapshot fallback so it always runs).
  2. Generate a QA dataset of triplets with an LLM (or offline templates).
  3. Train the linear adapter.
  4. Measure the base-vs-adapted retrieval gain with tables and plots.
uv sync --group examples
export OPENAI_API_KEY=sk-...     # for natural, LLM-generated queries
uv run jupyter lab examples/santander_retrieval_demo.ipynb

The notebook prefers a Sentence-Transformers model + LLM queries (real gains) and automatically falls back to the offline HashingEmbedder + TemplateQueryGenerator when no model or API key is available.

Negative mining strategies

Set strategy in [dataset] to control how negatives are sampled:

  • semantic_opposite — least similar chunks (far in latent space).
  • hard — most similar but incorrect chunks (the strongest training signal).
  • random — uniformly sampled easy negatives.
  • mixed — a weighted blend (configure [dataset.mix]).

Project layout

linear_adapter_trainer/
├── knowledge_base/   # ingestion + chunking
├── embeddings/       # pluggable backends (hashing, sentence-transformers, openai)
├── dataset/          # Module 1: query generation, negatives, split
├── adapter/          # Module 2: model, triplet loss, trainer
├── evaluation/       # precision@k, recall@k, MRR, nDCG, comparison
├── config.py         # TOML config + factories
└── cli.py            # command-line interface

See DOCUMENTATION.md for the full API reference and design notes.

Development

uv sync --dev
uv run ruff check .
uv run pytest
uv run python examples/quickstart.py

Contributing

Contributions are welcome — please read CONTRIBUTING.md. Notable changes are tracked in CHANGELOG.md.

Security

Please report security issues responsibly — see .github/SECURITY.md (contact security-opensource@gruposantander.com or open a GitHub Security Advisory). Do not file public issues for vulnerabilities.

License

Apache License 2.0. See LICENSE and NOTICE. Copyright (c) 2026 Santander Group. Originally authored by Pedro Martin Minguez (Santander AI Lab).

Citation

If you use this software, please cite it — see CITATION.cff.

@software{linear_adapter_trainer_2026,
  title  = {LinearAdapterTrainer: linear embedding adapters with triplet loss for retrieval},
  author = {{Santander AI Lab}},
  year   = {2026},
  url    = {https://github.com/SantanderAI/linear-adapter-trainer},
  license = {Apache-2.0}
}

Packages

 
 
 

Contributors

Languages