Releases: smart-models/Smart-Embedder

v1.2.0

17 Jun 20:06

Smart Embedder 1.2.0

Token truncation visibility + per-backend token limits.

✨ Added

  • Truncation warnings in API responses. /embed and /rerank now return a warnings array. When input exceeds the model token limit,
    the response reports model, max_tokens, original_tokens, truncated_tokens, and truncation_side instead of silently truncating.
  • Per-backend token length limits. Token limits are now set per backend (BGE-M3, Qwen dense, Qwen rerank) instead of through one
    shared payload limit.
  • version field in /config response ("version": "1.2.0").
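
A client could surface the new truncation warnings along these lines. This is a minimal sketch, not the server's documented schema: the five field names come from the notes above, but the response layout (a top-level warnings list of objects), the sample values, the reading of truncated_tokens as the number of tokens dropped, and the helper name describe_truncations are all assumptions.

```python
from typing import Any


def describe_truncations(response: dict[str, Any]) -> list[str]:
    """Turn the `warnings` array of an /embed or /rerank response into log lines.

    Assumes each warning is an object carrying the fields named in the
    release notes; the surrounding response layout is hypothetical.
    """
    messages = []
    for w in response.get("warnings", []):
        messages.append(
            f"{w['model']}: input of {w['original_tokens']} tokens exceeds "
            f"max_tokens={w['max_tokens']}; truncated on the {w['truncation_side']}"
        )
    return messages


# Hypothetical response body, for illustration only.
sample = {
    "warnings": [
        {
            "model": "BAAI/bge-m3",
            "max_tokens": 8192,
            "original_tokens": 9000,
            "truncated_tokens": 808,  # assumed to mean "tokens dropped"
            "truncation_side": "right",
        }
    ]
}

for line in describe_truncations(sample):
    print(line)
```

A response without a warnings key yields an empty list, so the same helper works for requests that were not truncated.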

🔧 Changed

  • Default token limits are now set to each model's documented maximum (e.g. QWEN_RERANK_MAX_LENGTH=32768; 8192 for BGE-M3).
    Launcher VRAM-tuned values still apply when the environment variables are unset.
  • Removed legacy shared payload-limit settings.
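
One way to read the defaults described above is as a three-step lookup: an explicit environment variable wins, then a launcher VRAM-tuned value if one exists, then the model's documented maximum. The sketch below illustrates that reading only. The precedence order, the function resolve_max_length, and the launcher_value parameter are assumptions; QWEN_RERANK_MAX_LENGTH and the two numeric maximums are taken from the notes.

```python
import os
from typing import Mapping, Optional

# Documented maximums quoted in the release notes.
QWEN_RERANK_DOCUMENTED_MAX = 32768
BGE_M3_DOCUMENTED_MAX = 8192


def resolve_max_length(
    env_name: str,
    documented_max: int,
    launcher_value: Optional[int] = None,
    environ: Mapping[str, str] = os.environ,
) -> int:
    """Pick a token limit (hypothetical helper, assumed precedence)."""
    raw = environ.get(env_name)
    if raw:
        # 1. An explicitly set environment variable takes priority.
        return int(raw)
    if launcher_value is not None:
        # 2. Otherwise fall back to the launcher's VRAM-tuned value.
        return launcher_value
    # 3. Otherwise use the model's documented maximum.
    return documented_max


print(resolve_max_length("QWEN_RERANK_MAX_LENGTH", QWEN_RERANK_DOCUMENTED_MAX, environ={}))
```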

🐛 Fixed

  • Qwen rerank decode drift: re-truncate to correct token boundary.
  • Suppressed noisy tokenizer length warnings in logs.

✅ Tests

  • Runtime suite now 17 checks (18 with --token); added coverage for truncation-warning response shape.

v1.1.0

08 Jun 12:00

Smart Embedder 1.1.0

This release separates CPU and GPU setups into dedicated, explicit artifacts and rebrands all deployment components to
smart-embedder.

✨ Highlights

Split CPU/GPU setup

  • New requirements-cpu.txt (CPU-only PyTorch wheel, torch==2.7.0+cpu) alongside requirements-gpu.txt (CUDA torch==2.7.0+cu126).
    Same pins otherwise, so model compatibility is unchanged.
  • New Dockerfile.cpu built on python:3.11-slim instead of the nvidia/cuda base — no CUDA libraries pulled in CPU mode (CPU image
    ~2.31 GB, verified build).
  • start_server.bat / start_server.sh now install the correct requirements file automatically based on the selected device (cpu /
    gpu).
  • docker-compose.cpu.yml builds the slim CPU image with a distinct image tag and container name.

Rebrand to smart-embedder

  • Compose project name: smart-embedder
  • Image tags: smart-embedder:gpu / smart-embedder:cpu
  • Container names: smart-embedder-gpu / smart-embedder-cpu (now distinct, so GPU and CPU can run side by side)
  • Default Hugging Face cache volume: smart-embedder-hf-cache
  • Updated OCI image labels

Versioning & API

  • Application version bumped to 1.1.0 (FastAPI version, Prometheus server_info)
  • Device-aware Swagger title: shows Smart Embedder GPU or Smart Embedder CPU depending on detected hardware

⚠️ Breaking changes

  • Default file names removed. Bare commands no longer work — pass explicit files:
    • GPU: docker compose -f docker-compose.gpu.yml up
    • CPU: docker compose -f docker-compose.gpu.yml -f docker-compose.cpu.yml up
    • Local install: pip install -r requirements-gpu.txt (or requirements-cpu.txt)
    • Or just use start_server.bat / start_server.sh, which handle this automatically.
  • HF cache volume renamed to smart-embedder-hf-cache. To avoid re-downloading models (~3 GB), migrate the old volume:
    docker volume create smart-embedder-hf-cache
    docker run --rm -v bge-m3-embedder-reranker-hf-cache:/from -v smart-embedder-hf-cache:/to alpine sh -c "cp -a /from/. /to/"

v1.0.0

29 May 15:17

🚀 Smart Embedder v1.0.0

First stable release of Smart Embedder — a lightweight, self-hosted embedding server for hybrid search pipelines, running
entirely on local hardware with no cloud dependency.

Core Features

  • Hybrid search stack — dense vectors (BGE-M3 or Qwen3), sparse lexical matching (BM25/SPLADE), and ColBERT late-interaction
    reranking in a single FastAPI server
  • Selectable embedding backend — switch between BGE-M3 and Qwen3 dense embeddings at startup via environment variable
  • Interactive reranker selection — choose between BGE-M3 and Qwen3 reranker at startup
  • GPU VRAM auto-tuning — batch sizes and model parameters automatically calibrated to available VRAM
  • Shared GPU executor — serialized inference across all endpoints to prevent CUDA OOM under concurrent load
  • CPU fallback — full functionality on CPU when no NVIDIA GPU is available
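
The shared-executor idea can be illustrated with the standard library alone. The following is a minimal sketch, not Smart Embedder's actual code: it assumes an asyncio server that pushes every inference call through one single-worker thread pool, so requests arriving concurrently run one at a time. The names gpu_executor, fake_inference, and run_on_gpu are hypothetical.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# A single worker thread means at most one inference call runs at a time.
gpu_executor = ThreadPoolExecutor(max_workers=1)

active = 0      # jobs currently inside fake_inference
max_active = 0  # highest concurrency observed


def fake_inference(batch):
    """Stand-in for a model forward pass (hypothetical)."""
    global active, max_active
    active += 1
    max_active = max(max_active, active)
    time.sleep(0.01)  # simulate GPU work
    active -= 1
    return [len(text) for text in batch]


async def run_on_gpu(batch):
    """Queue a batch on the shared executor without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(gpu_executor, fake_inference, batch)


async def main():
    # Three "endpoints" submit work at the same time.
    return await asyncio.gather(
        run_on_gpu(["a", "bb"]),
        run_on_gpu(["ccc"]),
        run_on_gpu(["dddd", "eeeee"]),
    )


results = asyncio.run(main())
print(results, max_active)
```

Because the pool has only one worker, max_active never rises above 1 even though all three calls are submitted together. The trade-off is throughput: requests wait in a queue rather than competing for VRAM.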

Infrastructure

  • Docker + Docker Compose support (GPU and CPU profiles)
  • Prometheus metrics endpoint
  • Optional Bearer token authentication
  • Graceful shutdown with request queue drainage
  • Configurable PORT binding via environment variable
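
Optional Bearer token authentication generally reduces to a small header check. The sketch below shows one common pattern under stated assumptions: auth is treated as disabled when no token is configured, and the comparison uses hmac.compare_digest to avoid timing leaks. The function name is_authorized and its parameters are hypothetical and not taken from the project.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: Optional[str]) -> bool:
    """Validate an `Authorization: Bearer <token>` header (hypothetical helper)."""
    if not expected_token:
        # No token configured: authentication is off, so accept every request.
        return True
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison of the presented and expected tokens.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())


print(is_authorized("Bearer s3cret", "s3cret"))
```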

Models Supported

Role               Models
Dense embeddings   BAAI/bge-m3, Qwen3-Embedding
Sparse embeddings  BAAI/bge-m3
Reranker           BAAI/bge-reranker-v2-m3, Qwen3-Reranker