vanderschaarlab/agentscore-official

 
 

Automatic Construction of Clinical Scoring Systems with LLM Agents (ICML 2026)

AgentScore overview diagram

Authors: Silas Ruhrberg Estévez · Christopher Chiu · Mihaela van der Schaar


Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
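To make the model class concrete, here is a minimal sketch of a unit-weighted checklist: each binary rule contributes one point, and the prediction thresholds the sum. The rule names, cut-offs, patient record, and threshold are hypothetical illustrations and are not taken from the paper or from the package. The helper count_rule_sets is likewise an assumption, included only to show how quickly the number of possible rule sets grows.

```python
from math import comb
from typing import Callable, Dict

Patient = Dict[str, float]
Rule = Callable[[Patient], bool]

# Hypothetical binary rules; names and cut-offs are illustrative only.
RULES: Dict[str, Rule] = {
    "age >= 75": lambda p: p["age"] >= 75,
    "systolic_bp < 90": lambda p: p["systolic_bp"] < 90,
    "heart_rate > 100": lambda p: p["heart_rate"] > 100,
}


def checklist_score(patient: Patient, rules: Dict[str, Rule]) -> int:
    """Unit-weighted score: every satisfied rule adds exactly one point."""
    return sum(int(rule(patient)) for rule in rules.values())


def checklist_predict(patient: Patient, rules: Dict[str, Rule], threshold: int) -> bool:
    """Flag the patient when the summed score reaches the threshold."""
    return checklist_score(patient, rules) >= threshold


def count_rule_sets(n_candidates: int, max_rules: int) -> int:
    """Number of non-empty rule sets with at most max_rules rules."""
    return sum(comb(n_candidates, k) for k in range(1, max_rules + 1))


# Made-up patient record used only to exercise the functions.
patient = {"age": 81, "systolic_bp": 85, "heart_rate": 92}
score = checklist_score(patient, RULES)
flagged = checklist_predict(patient, RULES, threshold=2)
print(score, flagged)
print(count_rule_sets(n_candidates=4, max_rules=2))
```

Because every weight is one, the score can be computed at the bedside by counting satisfied rules. The count grows combinatorially with the number of candidate rules, which is the discrete search space the abstract refers to.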

Getting Started

Install (uv)

cd agentscore-official
uv sync --dev

Quickstart (Python API)

import pandas as pd
from agentscore_official import AgentScoreConfig, run_agentscore

# user-provided data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv").iloc[:, 0]
X_test = pd.read_csv("X_test.csv")
y_test = pd.read_csv("y_test.csv").iloc[:, 0]

# optional custom chat function
# expected signature: chat_fn(user_prompt, model_name, temperature, top_p) -> str
# keyword arguments are supported

def my_chat_fn(user_prompt, model_name="gpt-4o", temperature=1.0, top_p=0.95):
    # call your own LLM backend here and return text response
    raise NotImplementedError

cfg = AgentScoreConfig(
    model_name="gpt-4o",
    feature_iterations=100,
    feature_max_rules=6,
    score_refine_steps=10,
    temperature=1.0,
    task_description="Binary risk prediction for 30-day adverse event.",
)

result = run_agentscore(
    X_train,
    y_train,
    X_test,
    y_test,
    config=cfg,
    chat_fn=my_chat_fn,  # remove to use built-in Azure/OpenAI env flow
)

print(result["test_metrics"])
print(result["score_spec"])
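A custom chat function only has to follow the signature noted above and return text. As one possible pattern, the sketch below wraps an arbitrary backend callable with simple retries and a check that the reply is non-empty text. The names make_retrying_chat_fn and flaky_backend, and the max_attempts and wait_seconds parameters, are hypothetical and not part of agentscore_official. The backend here is a stand-in that never contacts a real LLM service.

```python
import time
from typing import Callable

ChatFn = Callable[..., str]


def make_retrying_chat_fn(backend: ChatFn, max_attempts: int = 3,
                          wait_seconds: float = 0.0) -> ChatFn:
    """Return a chat_fn with the documented signature that retries the backend."""

    def chat_fn(user_prompt, model_name="gpt-4o", temperature=1.0, top_p=0.95):
        last_error = None
        for attempt in range(1, max_attempts + 1):
            try:
                reply = backend(user_prompt, model_name, temperature, top_p)
            except Exception as exc:  # treat any backend failure as retryable
                last_error = exc
                time.sleep(wait_seconds * attempt)
                continue
            if isinstance(reply, str) and reply.strip():
                return reply
            last_error = ValueError("backend returned an empty or non-text reply")
        raise RuntimeError(
            f"chat backend failed after {max_attempts} attempts"
        ) from last_error

    return chat_fn


# Stand-in backend: fails once, then echoes the prompt. Replace with a real call.
calls = {"n": 0}


def flaky_backend(user_prompt, model_name, temperature, top_p):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient failure")
    return f"[{model_name}] reply to: {user_prompt}"


my_chat_fn = make_retrying_chat_fn(flaky_backend)
reply = my_chat_fn("Propose one rule.", model_name="gpt-4o")
print(reply)
```

The wrapped function can then be passed as chat_fn=my_chat_fn in the call shown above.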

Synthetic Clinical Demo Dataset

from agentscore_official import make_synthetic_stroke_dataset

data = make_synthetic_stroke_dataset(n_samples=1200, random_state=42)
X_train, y_train = data["X_train"], data["y_train"]
X_test, y_test = data["X_test"], data["y_test"]

run_agentscore(...) prints the final score card and returns a dictionary with:

  • score_spec
  • retained_rules
  • train_metrics
  • test_metrics
  • scores
  • debug_summary
  • llm_usage_summary
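Before relying on the returned dictionary in downstream code, it can help to check which of the keys listed above are present. The sketch below does only that. The helper names are hypothetical, and since the value types behind each key are not documented here, the placeholder dictionary uses None values rather than real results.

```python
from typing import Any, List, Mapping

# Keys listed in this README for the dictionary returned by run_agentscore.
EXPECTED_KEYS = (
    "score_spec",
    "retained_rules",
    "train_metrics",
    "test_metrics",
    "scores",
    "debug_summary",
    "llm_usage_summary",
)


def missing_result_keys(result: Mapping[str, Any]) -> List[str]:
    """Return the documented keys that are absent from a result mapping."""
    return [key for key in EXPECTED_KEYS if key not in result]


def describe_result(result: Mapping[str, Any]) -> str:
    """One line per documented key with the Python type of its value."""
    lines = []
    for key in EXPECTED_KEYS:
        if key in result:
            lines.append(f"{key}: {type(result[key]).__name__}")
        else:
            lines.append(f"{key}: <missing>")
    return "\n".join(lines)


# Placeholder standing in for a real result; one key is left out on purpose.
placeholder = {key: None for key in EXPECTED_KEYS[:-1]}
missing = missing_result_keys(placeholder)
print(missing)
print(describe_result(placeholder))
```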

CLI (CSV input)

agentscore-run \
  data.x_train_path=./X_train.csv \
  data.y_train_path=./y_train.csv \
  data.x_test_path=./X_test.csv \
  data.y_test_path=./y_test.csv \
  model.model_name=gpt-5 \
  model.task_description="Binary risk prediction for 30-day adverse event."

You can override any setting from src/agentscore_official/conf/config.yaml via Hydra CLI arguments.
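When the CLI is launched from a script, the same key=value overrides can be assembled as an argument list. This is a minimal sketch that uses only the override keys shown in the command above. The helper build_agentscore_command is hypothetical and not part of the package, and the subprocess call is left commented out because it needs the package installed and an LLM backend configured.

```python
import shlex
from typing import Dict, List


def build_agentscore_command(overrides: Dict[str, str]) -> List[str]:
    """Turn a mapping of Hydra-style overrides into an argv list."""
    args = ["agentscore-run"]
    for key, value in overrides.items():
        # Each override is one argument, so no shell quoting is needed here.
        args.append(f"{key}={value}")
    return args


overrides = {
    "data.x_train_path": "./X_train.csv",
    "data.y_train_path": "./y_train.csv",
    "data.x_test_path": "./X_test.csv",
    "data.y_test_path": "./y_test.csv",
    "model.model_name": "gpt-5",
    "model.task_description": "Binary risk prediction for 30-day adverse event.",
}

command = build_agentscore_command(overrides)
print(shlex.join(command))  # shell-quoted form, for logging or copy-paste

# To actually run it:
# import subprocess
# subprocess.run(command, check=True)
```

Values containing characters that are special in Hydra's override syntax may need additional quoting; this sketch does not handle that case.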

Notebook

See:

  • notebooks/agentscore_quickstart.ipynb

LLM Setup (Built-in Backend)

If you do not provide chat_fn, AgentScore uses the built-in client in llm_client.py.

GPT-4o (Azure OpenAI)

export GPT4O_ENDPOINT="https://<your-resource>.openai.azure.com/"
export GPT4O_KEY="<your-api-key>"
export GPT4O_DEPLOYMENT="<your-gpt-4o-deployment-name>"

Use in config:

cfg = AgentScoreConfig(model_name="gpt-4o", ...)

GPT-5 (Azure OpenAI)

export GPT5_ENDPOINT="https://<your-resource>.openai.azure.com/"
export GPT5_KEY="<your-api-key>"
export GPT5_DEPLOYMENT="<your-gpt-5-deployment-name>"

Use in config:

cfg = AgentScoreConfig(model_name="gpt-5", ...)

Notes:

  • model_name="gpt5" is also accepted.
  • On the GPT-5 path, temperature behavior is forced by the client (as in the original code).

DeepSeek (OpenAI-compatible endpoint)

export DSV3_2_ENDPOINT="<your-deepseek-openai-compatible-endpoint>"
export DSV3_2_KEY="<your-api-key>"
export DSV3_2_DEPLOYMENT="<optional-deployment-or-model>"
export DSV3_2_MODEL="<optional-model-name>"

Use in config with any deepseek-like model string, for example:

cfg = AgentScoreConfig(model_name="deepseek-chat", ...)

Optional DeepSeek tuning env vars:

  • DSV3_2_MAX_TOKENS (default 800)
  • DSV3_2_REQUEST_TIMEOUT (seconds)

Paper

Citation

@inproceedings{
ruhrbergestevez2026agentscore,
title={Automatic Construction of Clinical Scoring Systems with LLM Agents},
author={Silas Ruhrberg Estévez and Christopher Chiu and Mihaela van der Schaar},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=fFUI2PGaNG}
}

About

Official implementation of AgentScore, accepted at ICML 2026.
