Releases: smart-models/Smart-Embedder
v1.2.0
Smart Embedder 1.2.0
Token truncation visibility + per-backend token limits.
✨ Added
- Truncation warnings in API responses. /embed and /rerank now return a warnings array. When input exceeds the model token limit, the response reports model, max_tokens, original_tokens, truncated_tokens, and truncation_side instead of silently truncating.
- Per-backend token length limits. Token limits split per backend (BGE-M3, Qwen dense, Qwen rerank) instead of one shared payload limit.
- version field in /config response ("version": "1.2.0").
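A client could surface these warnings along the following lines. This is a minimal sketch, not the server's documented schema: the five field names come from the notes above, while the `summarize_warnings` helper, the assumption that each warning is a flat JSON object, and the sample values are all hypothetical.

```python
from typing import Any

# Field names as listed in the 1.2.0 release notes.
FIELDS = ("model", "max_tokens", "original_tokens", "truncated_tokens", "truncation_side")


def summarize_warnings(response: dict[str, Any]) -> list[str]:
    """Turn the `warnings` array of an /embed or /rerank response into log lines.

    Assumes each warning is a dict carrying the fields above; a missing
    field is rendered as "?" instead of raising.
    """
    lines = []
    for warning in response.get("warnings") or []:
        values = {name: warning.get(name, "?") for name in FIELDS}
        lines.append(
            "{model}: {original_tokens} input tokens, max_tokens={max_tokens}, "
            "truncated_tokens={truncated_tokens}, side={truncation_side}".format(**values)
        )
    return lines


# Illustrative payload only; the numbers and the "right" side are made up.
sample = {
    "warnings": [
        {
            "model": "BAAI/bge-m3",
            "max_tokens": 8192,
            "original_tokens": 9000,
            "truncated_tokens": 8192,
            "truncation_side": "right",
        }
    ]
}
summary = summarize_warnings(sample)
```

Treating an absent or empty `warnings` array as "nothing was truncated" keeps the helper usable against responses that needed no truncation.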
🔧 Changed
- Default token limits now set to each model's documented maximum (e.g. QWEN_RERANK_MAX_LENGTH=32768, BGE-M3 8192). Launcher VRAM-tuned values still apply when env unset.
- Removed legacy shared payload-limit settings.
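The resolution order described above might look roughly like this. It is a sketch under stated assumptions: `QWEN_RERANK_MAX_LENGTH` and the two default values are taken from the notes, but `BGE_M3_MAX_LENGTH` is a hypothetical variable name, and the precedence (explicit environment value, then launcher-tuned value, then documented maximum) is an interpretation of the notes rather than confirmed behavior.

```python
import os
from typing import Mapping, Optional

# Documented maxima quoted in the release notes.
DOCUMENTED_MAX = {
    "QWEN_RERANK_MAX_LENGTH": 32768,  # variable name appears in the notes
    "BGE_M3_MAX_LENGTH": 8192,        # hypothetical name; only the value 8192 is from the notes
}


def resolve_limit(
    name: str,
    env: Optional[Mapping[str, str]] = None,
    tuned: Optional[Mapping[str, int]] = None,
) -> int:
    """Pick the token limit for one backend.

    Assumed precedence: explicit env value, then a launcher VRAM-tuned
    value, then the model's documented maximum.
    """
    env = os.environ if env is None else env
    raw = env.get(name, "").strip()
    if raw:
        value = int(raw)
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
        return value
    if tuned and name in tuned:
        return tuned[name]
    return DOCUMENTED_MAX[name]
```

Passing `env` and `tuned` as arguments, instead of reading globals inside the function, makes each precedence level easy to check in isolation.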
🐛 Fixed
- Qwen rerank decode drift: re-truncate to correct token boundary.
- Suppressed noisy tokenizer length warnings in logs.
✅ Tests
- Runtime suite now 17 checks (18 with --token); added coverage for truncation-warning response shape.
v1.1.0
Smart Embedder 1.1.0
This release separates CPU and GPU setups into dedicated, explicit artifacts and rebrands all deployment components to
smart-embedder.
✨ Highlights
Split CPU/GPU setup
- New requirements-cpu.txt (CPU-only PyTorch wheel, torch==2.7.0+cpu) alongside requirements-gpu.txt (CUDA torch==2.7.0+cu126). Same pins otherwise, so model compatibility is unchanged.
- New Dockerfile.cpu built on python:3.11-slim instead of the nvidia/cuda base — no CUDA libraries pulled in CPU mode (CPU image ~2.31 GB, verified build).
- start_server.bat / start_server.sh now install the correct requirements file automatically based on the selected device (cpu / gpu).
- docker-compose.cpu.yml builds the slim CPU image with a distinct image tag and container name.
Rebrand to smart-embedder
- Compose project name: smart-embedder
- Image tags: smart-embedder:gpu / smart-embedder:cpu
- Container names: smart-embedder-gpu / smart-embedder-cpu (now distinct, so GPU and CPU can run side by side)
- Default Hugging Face cache volume: smart-embedder-hf-cache
- Updated OCI image labels
Versioning & API
- Application version bumped to 1.1.0 (FastAPI version, Prometheus server_info)
- Device-aware Swagger title: shows Smart Embedder GPU or Smart Embedder CPU depending on detected hardware
- Default file names removed. Bare commands no longer work — pass explicit files:
- GPU: docker compose -f docker-compose.gpu.yml up
- CPU: docker compose -f docker-compose.gpu.yml -f docker-compose.cpu.yml up
- Local install: pip install -r requirements-gpu.txt (or requirements-cpu.txt)
- Or just use start_server.bat / start_server.sh, which handle this automatically.
- HF cache volume renamed to smart-embedder-hf-cache. To avoid re-downloading models (~3 GB), migrate the old volume:
docker volume create smart-embedder-hf-cache
docker run --rm -v bge-m3-embedder-reranker-hf-cache:/from -v smart-embedder-hf-cache:/to alpine sh -c "cp -a /from/. /to/"
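For scripted deployments, the explicit-file commands listed above can be assembled per device. The following is a small sketch: the file names and the `-f` ordering are copied from these notes, while the helper names `compose_command` and `requirements_file` are hypothetical and are not part of the project's launcher scripts.

```python
# Compose files per device, in the order given in the release notes.
COMPOSE_FILES = {
    "gpu": ["docker-compose.gpu.yml"],
    "cpu": ["docker-compose.gpu.yml", "docker-compose.cpu.yml"],
}


def compose_command(device: str) -> list[str]:
    """Build the `docker compose ... up` argument list for "gpu" or "cpu"."""
    if device not in COMPOSE_FILES:
        raise ValueError(f"unknown device {device!r}; expected 'gpu' or 'cpu'")
    cmd = ["docker", "compose"]
    for compose_file in COMPOSE_FILES[device]:
        cmd += ["-f", compose_file]
    return cmd + ["up"]


def requirements_file(device: str) -> str:
    """Return the pinned requirements file for a local (non-Docker) install."""
    if device not in COMPOSE_FILES:
        raise ValueError(f"unknown device {device!r}; expected 'gpu' or 'cpu'")
    return f"requirements-{device}.txt"
```

The returned list can be passed directly to `subprocess.run(compose_command("cpu"), check=True)`, which avoids shell quoting issues.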
v1.0.0
🚀 Smart Embedder v1.0.0
First stable release of Smart Embedder — a lightweight, self-hosted embedding server for hybrid search pipelines, running
entirely on local hardware with no cloud dependency.
Core Features
- Hybrid search stack — dense vectors (BGE-M3 or Qwen3), sparse lexical matching (BM25/SPLADE), and ColBERT late-interaction reranking in a single FastAPI server
- Selectable embedding backend — switch between BGE-M3 and Qwen3 dense embeddings at startup via environment variable
- Interactive reranker selection — choose between BGE-M3 and Qwen3 reranker at startup
- GPU VRAM auto-tuning — batch sizes and model parameters automatically calibrated to available VRAM
- Shared GPU executor — serialized inference across all endpoints to prevent CUDA OOM under concurrent load
- CPU fallback — full functionality on CPU when no NVIDIA GPU is available
Infrastructure
- Docker + Docker Compose support (GPU and CPU profiles)
- Prometheus metrics endpoint
- Optional Bearer token authentication
- Graceful shutdown with request queue drainage
- Configurable PORT binding via environment variable
Models Supported
| Role | Models |
|---|---|
| Dense embeddings | BAAI/bge-m3, Qwen3-Embedding |
| Sparse embeddings | BAAI/bge-m3 |
| Reranker | BAAI/bge-reranker-v2-m3, Qwen3-Reranker |