Automatic Construction of Clinical Scoring Systems with LLM Agents (ICML 2026)

_{Click image to view full-resolution.}

Authors: Silas Ruhrberg Estévez · Christopher Chiu · Mihaela van der Schaar

Paper • BibTeX • Getting Started • Results

Modern clinical practice relies on evidence-based guidelines implemented as compact scoring sys- tems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to trans- late into routine clinical use due to misalignment with workflow constraints such as memorabil- ity, auditability, and bedside execution. We ar- gue that this gap arises not from insufficient pre- dictive power, but from optimizing over model classes that are incompatible with guideline de- ployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learn- ing such scores requires searching an exponen- tially large discrete space of possible rule sets. We introduce AgentScore, which performs se- mantically guided optimization in this space by us- ing LLMs to propose candidate rules and a deter- ministic, data-grounded verification-and-selection loop to enforce statistical validity and deploya- bility constraints. Across eight clinical predic- tion tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural con- straints. On two additional externally validated tasks, AgentScore achieves higher discrimina- tion than established guideline-based scores.

Getting Started

Install (uv)

cd agentscore-official
uv sync --dev

Quickstart (Python API)

import pandas as pd
from agentscore_official import AgentScoreConfig, run_agentscore

# user-provided data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv").iloc[:, 0]
X_test = pd.read_csv("X_test.csv")
y_test = pd.read_csv("y_test.csv").iloc[:, 0]

# optional custom chat function
# expected signature: chat_fn(user_prompt, model_name, temperature, top_p) -> str
# keyword arguments are supported

def my_chat_fn(user_prompt, model_name="gpt-4o", temperature=1.0, top_p=0.95):
    # call your own LLM backend here and return text response
    raise NotImplementedError

cfg = AgentScoreConfig(
    model_name="gpt-4o",
    feature_iterations=100,
    feature_max_rules=6,
    score_refine_steps=10,
    temperature=1.0,
    task_description="Binary risk prediction for 30-day adverse event.",
)

result = run_agentscore(
    X_train,
    y_train,
    X_test,
    y_test,
    config=cfg,
    chat_fn=my_chat_fn,  # remove to use built-in Azure/OpenAI env flow
)

print(result["test_metrics"])
print(result["score_spec"])

Synthetic Clinical Demo Dataset

from agentscore_official import make_synthetic_stroke_dataset

data = make_synthetic_stroke_dataset(n_samples=1200, random_state=42)
X_train, y_train = data["X_train"], data["y_train"]
X_test, y_test = data["X_test"], data["y_test"]

run_agentscore(...) prints the final score card and returns a dictionary with:

score_spec
retained_rules
train_metrics
test_metrics
scores
debug_summary
llm_usage_summary

CLI (CSV input)

agentscore-run \
  data.x_train_path=./X_train.csv \
  data.y_train_path=./y_train.csv \
  data.x_test_path=./X_test.csv \
  data.y_test_path=./y_test.csv \
  model.model_name=gpt-5 \
  model.task_description="Binary risk prediction for 30-day adverse event."

You can override any setting from src/agentscore_official/conf/config.yaml via Hydra CLI arguments.

Notebook

See:

notebooks/agentscore_quickstart.ipynb

LLM Setup (Built-in Backend)

If you do not provide chat_fn, AgentScore uses the built-in client in llm_client.py.

GPT-4o (Azure OpenAI)

export GPT4O_ENDPOINT="https://<your-resource>.openai.azure.com/"
export GPT4O_KEY="<your-api-key>"
export GPT4O_DEPLOYMENT="<your-gpt-4o-deployment-name>"

Use in config:

cfg = AgentScoreConfig(model_name="gpt-4o", ...)

GPT-5 (Azure OpenAI)

export GPT5_ENDPOINT="https://<your-resource>.openai.azure.com/"
export GPT5_KEY="<your-api-key>"
export GPT5_DEPLOYMENT="<your-gpt-5-deployment-name>"

Use in config:

cfg = AgentScoreConfig(model_name="gpt-5", ...)

Notes:

model_name="gpt5" is also accepted.
GPT-5 path forces temperature behavior from the client (as in original code).

DeepSeek (OpenAI-compatible endpoint)

export DSV3_2_ENDPOINT="<your-deepseek-openai-compatible-endpoint>"
export DSV3_2_KEY="<your-api-key>"
export DSV3_2_DEPLOYMENT="<optional-deployment-or-model>"
export DSV3_2_MODEL="<optional-model-name>"

Use in config with any deepseek-like model string, for example:

cfg = AgentScoreConfig(model_name="deepseek-chat", ...)

Optional DeepSeek tuning env vars:

DSV3_2_MAX_TOKENS (default 800)
DSV3_2_REQUEST_TIMEOUT (seconds)

Paper

Title: Automatic Construction of Clinical Scoring Systems with LLM Agents
Venue: International Conference on Machine Learning (ICML 2026)
OpenReview: https://openreview.net/forum?id=fFUI2PGaNG
ICML poster page: https://icml.cc/virtual/2026/poster/62589

Citation

@inproceedings{
ruhrbergestevez2026agentscore,
title={Automatic Construction of Clinical Scoring Systems with LLM Agents},
author={Silas Ruhrberg Estévez and Christopher Chiu and Mihaela van der Schaar},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=fFUI2PGaNG}
}

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
notebooks		notebooks
src/agentscore_official		src/agentscore_official
tests		tests
.gitignore		.gitignore
LICENCE.md		LICENCE.md
README.md		README.md
agentscore.png		agentscore.png
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Automatic Construction of Clinical Scoring Systems with LLM Agents (ICML 2026)

Getting Started

Install (uv)

Quickstart (Python API)

Synthetic Clinical Demo Dataset

CLI (CSV input)

Notebook

LLM Setup (Built-in Backend)

GPT-4o (Azure OpenAI)

GPT-5 (Azure OpenAI)

DeepSeek (OpenAI-compatible endpoint)

Paper

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Automatic Construction of Clinical Scoring Systems with LLM Agents (ICML 2026)

Getting Started

Install (uv)

Quickstart (Python API)

Synthetic Clinical Demo Dataset

CLI (CSV input)

Notebook

LLM Setup (Built-in Backend)

GPT-4o (Azure OpenAI)

GPT-5 (Azure OpenAI)

DeepSeek (OpenAI-compatible endpoint)

Paper

Citation

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages