Authors: Silas Ruhrberg Estévez · Christopher Chiu · Mihaela van der Schaar
Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
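To make the target model class concrete, here is a minimal sketch of a unit-weighted checklist: every binary rule that fires adds one point, and the total is compared against a threshold. The rule names, feature names, cut-offs, and the threshold of 2 are hypothetical and chosen only for illustration; they are not rules produced by AgentScore.

```python
from typing import Callable, Dict

Patient = Dict[str, float]

# Hypothetical binary rules (illustrative features and cut-offs only).
RULES: Dict[str, Callable[[Patient], bool]] = {
    "age >= 75": lambda p: p["age"] >= 75,
    "systolic_bp < 90": lambda p: p["systolic_bp"] < 90,
    "lactate > 2.0": lambda p: p["lactate"] > 2.0,
}


def checklist_score(patient: Patient) -> int:
    """Unit-weighted score: each rule that fires contributes exactly one point."""
    return sum(1 for rule in RULES.values() if rule(patient))


def checklist_predict(patient: Patient, threshold: int = 2) -> int:
    """Flag the patient as high risk when at least `threshold` rules fire."""
    return int(checklist_score(patient) >= threshold)


patients = [
    {"age": 80, "systolic_bp": 85, "lactate": 1.0},
    {"age": 60, "systolic_bp": 120, "lactate": 3.1},
    {"age": 77, "systolic_bp": 110, "lactate": 2.5},
]
for p in patients:
    print(checklist_score(p), checklist_predict(p))
```

With n candidate rules there are 2^n possible rule subsets, which is why exhaustive search over this space quickly becomes impractical.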
cd agentscore-official
uv sync --dev

import pandas as pd
from agentscore_official import AgentScoreConfig, run_agentscore
# user-provided data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv").iloc[:, 0]
X_test = pd.read_csv("X_test.csv")
y_test = pd.read_csv("y_test.csv").iloc[:, 0]
# optional custom chat function
# expected signature: chat_fn(user_prompt, model_name, temperature, top_p) -> str
# keyword arguments are supported
def my_chat_fn(user_prompt, model_name="gpt-4o", temperature=1.0, top_p=0.95):
    # call your own LLM backend here and return text response
    raise NotImplementedError
cfg = AgentScoreConfig(
    model_name="gpt-4o",
    feature_iterations=100,
    feature_max_rules=6,
    score_refine_steps=10,
    temperature=1.0,
    task_description="Binary risk prediction for 30-day adverse event.",
)
result = run_agentscore(
    X_train,
    y_train,
    X_test,
    y_test,
    config=cfg,
    chat_fn=my_chat_fn,  # remove to use built-in Azure/OpenAI env flow
)
print(result["test_metrics"])
print(result["score_spec"])

from agentscore_official import make_synthetic_stroke_dataset
data = make_synthetic_stroke_dataset(n_samples=1200, random_state=42)
X_train, y_train = data["X_train"], data["y_train"]
X_test, y_test = data["X_test"], data["y_test"]

run_agentscore(...) prints the final score card and returns a dictionary with:
- score_spec
- retained_rules
- train_metrics
- test_metrics
- scores
- debug_summary
- llm_usage_summary
agentscore-run \
data.x_train_path=./X_train.csv \
data.y_train_path=./y_train.csv \
data.x_test_path=./X_test.csv \
data.y_test_path=./y_test.csv \
model.model_name=gpt-5 \
model.task_description="Binary risk prediction for 30-day adverse event."

You can override any setting from src/agentscore_official/conf/config.yaml via Hydra CLI arguments.
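As a rough mental model of the dotted `key.path=value` arguments above, the sketch below applies such overrides to a nested dictionary. This is not Hydra's implementation (Hydra has its own override grammar and converts value types), and the base config contents here are hypothetical placeholders rather than the actual defaults in config.yaml; only the key names are taken from the command shown above.

```python
import copy
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of `config` with dotted key=value overrides applied.

    Values are kept as plain strings; real Hydra also parses types.
    """
    merged = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            # Walk (or create) the nested section for each path component.
            node = node.setdefault(part, {})
        node[leaf] = value
    return merged


# Hypothetical base config; the real defaults live in config.yaml.
base = {"data": {"x_train_path": None}, "model": {"model_name": "gpt-4o"}}
cli_args = ["data.x_train_path=./X_train.csv", "model.model_name=gpt-5"]
print(apply_overrides(base, cli_args))
```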
See:
notebooks/agentscore_quickstart.ipynb
If you do not provide chat_fn, AgentScore uses the built-in client in llm_client.py.
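If you do supply your own chat_fn, it can be useful to wrap the backend call so that transient failures are retried. The sketch below follows the documented signature `chat_fn(user_prompt, model_name, temperature, top_p) -> str`; the `with_retries` helper, the `flaky_backend` stub, and the retry parameters are hypothetical names introduced for illustration and are not part of agentscore_official.

```python
import time
from typing import Callable, Optional

ChatFn = Callable[..., str]


def with_retries(backend: ChatFn, max_attempts: int = 3, base_delay: float = 1.0) -> ChatFn:
    """Wrap `backend` so failed calls are retried with exponential backoff."""

    def chat_fn(user_prompt, model_name="gpt-4o", temperature=1.0, top_p=0.95):
        last_error: Optional[Exception] = None
        for attempt in range(max_attempts):
            try:
                reply = backend(
                    user_prompt=user_prompt,
                    model_name=model_name,
                    temperature=temperature,
                    top_p=top_p,
                )
            except Exception as exc:  # retry on any backend failure
                last_error = exc
                time.sleep(base_delay * 2 ** attempt)
                continue
            if not isinstance(reply, str):
                raise TypeError("chat_fn must return the response text as a str")
            return reply
        raise RuntimeError(f"backend failed after {max_attempts} attempts") from last_error

    return chat_fn


# Stand-in backend for demonstration: fails once, then echoes the prompt.
calls = {"n": 0}


def flaky_backend(user_prompt, model_name, temperature, top_p):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated transient failure")
    return f"[{model_name}] echo: {user_prompt}"


my_chat_fn = with_retries(flaky_backend, base_delay=0.0)
print(my_chat_fn("Propose one candidate rule."))
```

The wrapped function can then be passed as `chat_fn=my_chat_fn` to `run_agentscore`, with `flaky_backend` replaced by a call to your own LLM service.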
export GPT4O_ENDPOINT="https://<your-resource>.openai.azure.com/"
export GPT4O_KEY="<your-api-key>"
export GPT4O_DEPLOYMENT="<your-gpt-4o-deployment-name>"

Use in config:
cfg = AgentScoreConfig(model_name="gpt-4o", ...)

export GPT5_ENDPOINT="https://<your-resource>.openai.azure.com/"
export GPT5_KEY="<your-api-key>"
export GPT5_DEPLOYMENT="<your-gpt-5-deployment-name>"

Use in config:
cfg = AgentScoreConfig(model_name="gpt-5", ...)

Notes:
- model_name="gpt5" is also accepted.
- The GPT-5 path forces temperature behavior from the client (as in original code).
export DSV3_2_ENDPOINT="<your-deepseek-openai-compatible-endpoint>"
export DSV3_2_KEY="<your-api-key>"
export DSV3_2_DEPLOYMENT="<optional-deployment-or-model>"
export DSV3_2_MODEL="<optional-model-name>"

Use in config with any deepseek-like model string, for example:
cfg = AgentScoreConfig(model_name="deepseek-chat", ...)

Optional DeepSeek tuning env vars:
- DSV3_2_MAX_TOKENS (default 800)
- DSV3_2_REQUEST_TIMEOUT (seconds)
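Before starting a long run with the built-in client, a quick pre-flight check of the environment can save a failed first LLM call. The sketch below uses the variable names listed above. The `missing_env_vars` helper is hypothetical, and both the model-name-to-prefix mapping and the choice of which variables count as required are assumptions inferred from the placeholders in this README, not behavior read from llm_client.py.

```python
import os
from typing import List, Mapping


def env_prefix(model_name: str) -> str:
    """Map a model name to its env-var prefix (assumed mapping)."""
    name = model_name.lower()
    if name in ("gpt-5", "gpt5"):
        return "GPT5"
    if name == "gpt-4o":
        return "GPT4O"
    if "deepseek" in name:  # assumption: how "deepseek-like" is detected
        return "DSV3_2"
    raise ValueError(f"no known env-var prefix for model {model_name!r}")


def missing_env_vars(model_name: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """List the assumed-required variables that are unset or empty."""
    prefix = env_prefix(model_name)
    # Assumption: deployment/model are optional for DeepSeek, required for Azure GPT.
    suffixes = ["ENDPOINT", "KEY"] if prefix == "DSV3_2" else ["ENDPOINT", "KEY", "DEPLOYMENT"]
    return [f"{prefix}_{s}" for s in suffixes if not env.get(f"{prefix}_{s}")]


# Demonstration with an explicit mapping instead of the real environment.
example_env = {"GPT5_ENDPOINT": "https://example.invalid/"}
print(missing_env_vars("gpt-5", example_env))
```

Calling `missing_env_vars(cfg.model_name)` without the second argument would check the real process environment instead, assuming the config object exposes `model_name` as an attribute.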
- Title: Automatic Construction of Clinical Scoring Systems with LLM Agents
- Venue: International Conference on Machine Learning (ICML 2026)
- OpenReview: https://openreview.net/forum?id=fFUI2PGaNG
- ICML poster page: https://icml.cc/virtual/2026/poster/62589
@inproceedings{
ruhrbergestevez2026agentscore,
title={Automatic Construction of Clinical Scoring Systems with LLM Agents},
author={Silas Ruhrberg Estévez and Christopher Chiu and Mihaela van der Schaar},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=fFUI2PGaNG}
}