43 changes: 31 additions & 12 deletions runpodctl/reference/runpodctl-serverless.mdx
@@ -63,12 +63,23 @@ Create a new Serverless endpoint from a template or from a Hub repo:

```bash
# Create from a template
-runpodctl serverless create --name "my-endpoint" --template-id "tpl_abc123"
+runpodctl serverless create --template-id "tpl_abc123" --gpu-id "NVIDIA GeForce RTX 4090"

+# Create from a template with a model reference
+runpodctl serverless create --template-id "tpl_abc123" --gpu-id "NVIDIA GeForce RTX 4090" \
+--model-reference https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct:main
+
+# Create a CPU endpoint
+runpodctl serverless create --template-id "tpl_abc123" --compute-type CPU
+
# Create from a Hub repo
runpodctl hub search vllm # Find the hub ID
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --name "my-vllm"

+# Create from a Hub repo and attach a model reference
+runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --gpu-id "NVIDIA GeForce RTX 4090" \
+--model-reference https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct:main
+
# Create from a Hub repo with custom environment variables
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --name "my-vllm" \
--env MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
@@ -88,7 +99,7 @@ Each Serverless template can only be bound to one endpoint at a time. To create
#### Create flags

<ResponseField name="--name" type="string">
-Name for the endpoint.
+Name for the endpoint. Must be at least 3 characters. If omitted, a name is auto-generated in the format `endpoint-XXXXXXXX`.
</ResponseField>
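
The naming rule for `--name` can be checked before the CLI is called. The following is a minimal sketch, not part of `runpodctl`: `name_args` is a hypothetical helper that only prints the flag text, assuming the 3-character minimum and the auto-generated fallback described in this change.

```bash
# Hypothetical helper: emit the --name flag only when the name is usable.
# An empty name is allowed, because the CLI then auto-generates one.
name_args() {
  local name="${1:-}"
  if [ -z "$name" ]; then
    return 0  # omit --name and let the CLI pick endpoint-XXXXXXXX
  fi
  if [ "${#name}" -lt 3 ]; then
    echo "endpoint name must be at least 3 characters: '$name'" >&2
    return 1
  fi
  echo "--name $name"
}

# Print the command instead of running it (dry run).
echo "runpodctl serverless create --template-id tpl_abc123 $(name_args "my-endpoint")"
```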

<ResponseField name="--template-id" type="string">
@@ -100,15 +111,19 @@ Hub listing ID to deploy from (alternative to `--template-id`). Use [`runpodctl
</ResponseField>

<ResponseField name="--gpu-id" type="string">
-GPU type for workers. Use [`runpodctl gpu list`](/runpodctl/reference/runpodctl-gpu) to see available GPUs.
+GPU type for workers. Accepts either a GPU type ID (e.g., `NVIDIA A40`, `NVIDIA GeForce RTX 4090`) or a GPU pool ID (e.g., `ADA_24`, `AMPERE_48`). Use [`runpodctl gpu list`](/runpodctl/reference/runpodctl-gpu) to see available GPUs.
</ResponseField>

<ResponseField name="--gpu-count" type="int" default="1">
Number of GPUs per worker.
</ResponseField>

<ResponseField name="--compute-type" type="string" default="GPU">
-Compute type (`GPU` or `CPU`).
+Compute type (`GPU` or `CPU`). For CPU endpoints, use `--instance-id` to specify the CPU instance type.
</ResponseField>
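
As an illustration of how `--compute-type` and `--instance-id` pair up, here is a small sketch. `compute_args` is a hypothetical wrapper, not a `runpodctl` feature. It assumes the documented default instance ID `cpu3g-4-16` and refuses an instance ID for GPU endpoints on its own; how the CLI itself reacts to that combination is not shown here.

```bash
# Hypothetical helper: build the compute-related flags for `serverless create`.
compute_args() {
  local compute_type="${1:-GPU}"
  local instance_id="${2:-}"
  case "$compute_type" in
    CPU)
      # Fall back to the documented default when no instance ID is given.
      echo "--compute-type CPU --instance-id ${instance_id:-cpu3g-4-16}"
      ;;
    GPU)
      if [ -n "$instance_id" ]; then
        echo "--instance-id only applies to --compute-type CPU" >&2
        return 1
      fi
      echo "--compute-type GPU"
      ;;
    *)
      echo "unknown compute type: $compute_type" >&2
      return 1
      ;;
  esac
}

# Print the command instead of running it (dry run).
echo "runpodctl serverless create --template-id tpl_abc123 $(compute_args CPU)"
```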

+<ResponseField name="--instance-id" type="string" default="cpu3g-4-16">
+CPU instance ID when using `--compute-type CPU`. If omitted, defaults to `cpu3g-4-16`. Only valid with `--compute-type CPU`.
+</ResponseField>
+
<ResponseField name="--workers-min" type="int" default="0">
@@ -124,27 +139,27 @@ Comma-separated list of preferred datacenter IDs. Use [`runpodctl datacenter lis
</ResponseField>

<ResponseField name="--network-volume-id" type="string">
-Network volume ID to attach. Use [`runpodctl network-volume list`](/runpodctl/reference/runpodctl-network-volume) to see available network volumes.
+Network volume ID to attach for single-region deployments. Use [`runpodctl network-volume list`](/runpodctl/reference/runpodctl-network-volume) to see available network volumes. Mutually exclusive with `--network-volume-ids`.
</ResponseField>

<ResponseField name="--network-volume-ids" type="string">
-Comma-separated list of network volume IDs to attach. Use this when attaching multiple network volumes to an endpoint.
+Comma-separated list of network volume IDs for multi-region deployments. Mutually exclusive with `--network-volume-id`.
</ResponseField>
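
Because the two network volume flags are mutually exclusive, a script has to pick exactly one of them. The sketch below shows one way to do that based on how many IDs are supplied. `network_volume_args` is a hypothetical helper and the volume IDs are placeholders.

```bash
# Hypothetical helper: choose the single-volume or multi-volume flag,
# never both, depending on how many volume IDs are passed in.
network_volume_args() {
  if [ "$#" -eq 0 ]; then
    return 0  # no volumes requested, emit nothing
  fi
  if [ "$#" -eq 1 ]; then
    echo "--network-volume-id $1"
  else
    local IFS=,
    # With IFS set to a comma, "$*" joins all IDs into one comma-separated value.
    echo "--network-volume-ids $*"
  fi
}

# Print the flags for a single-region and a multi-region setup.
network_volume_args vol_a
network_volume_args vol_a vol_b
```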

<ResponseField name="--min-cuda-version" type="string">
Minimum CUDA version required for workers (e.g., `12.4`). Workers will only be scheduled on machines that meet this CUDA version requirement.
</ResponseField>

-<ResponseField name="--scaler-type" type="string" default="QUEUE_DELAY">
-Autoscaler type (`QUEUE_DELAY` or `REQUEST_COUNT`). `QUEUE_DELAY` scales based on queue wait time; `REQUEST_COUNT` scales based on concurrent requests.
+<ResponseField name="--scale-by" type="string">
+Autoscaling strategy: `delay` (scales based on queue wait time in seconds) or `requests` (scales based on pending request count).
</ResponseField>

-<ResponseField name="--scaler-value" type="int">
-Scaler threshold value. For `QUEUE_DELAY`, this is the target delay in seconds. For `REQUEST_COUNT`, this is the number of concurrent requests per worker before scaling.
+<ResponseField name="--scale-threshold" type="int">
+Trigger point for the autoscaler. For `delay`, this is the target queue wait time in seconds. For `requests`, this is the pending request count that triggers scaling.
</ResponseField>
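
Scripts written against the old `QUEUE_DELAY` / `REQUEST_COUNT` values need the new spellings. The sketch below assumes only the two strategies named in this change (`delay` and `requests`) and an integer threshold. `scale_args` is a hypothetical helper that rejects anything else before the CLI is invoked.

```bash
# Hypothetical helper: validate the autoscaling pair and emit the flags.
scale_args() {
  local strategy="${1:-}"
  local threshold="${2:-}"
  case "$strategy" in
    delay|requests) ;;
    *)
      echo "unknown --scale-by value: '$strategy' (expected delay or requests)" >&2
      return 1
      ;;
  esac
  case "$threshold" in
    ''|*[!0-9]*)
      echo "--scale-threshold must be a whole number: '$threshold'" >&2
      return 1
      ;;
  esac
  echo "--scale-by $strategy --scale-threshold $threshold"
}

# Print the command instead of running it (dry run).
echo "runpodctl serverless create --template-id tpl_abc123 $(scale_args delay 4)"
```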

<ResponseField name="--idle-timeout" type="int">
-Idle timeout in seconds. Workers shut down after being idle for this duration. Valid range: 5-3600 seconds.
+Idle timeout in seconds. Workers shut down after being idle for this duration. Valid range: 1-3600 seconds.
</ResponseField>
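
A range check like the following can catch an out-of-range idle timeout early. This is a sketch that assumes the 1-3600 second range stated in the updated text. `idle_timeout_arg` is a hypothetical helper, not a CLI feature.

```bash
# Hypothetical helper: accept whole seconds in the range 1-3600 only.
idle_timeout_arg() {
  local seconds="${1:-}"
  case "$seconds" in
    ''|*[!0-9]*)
      echo "idle timeout must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  if [ "$seconds" -lt 1 ] || [ "$seconds" -gt 3600 ]; then
    echo "idle timeout out of range (1-3600): $seconds" >&2
    return 1
  fi
  echo "--idle-timeout $seconds"
}

# A 3-second timeout was below the old lower bound of 5 but fits the new range.
idle_timeout_arg 3
```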

<ResponseField name="--flash-boot" type="bool">
@@ -156,7 +171,11 @@ Execution timeout in seconds. Jobs that exceed this duration are terminated. The
</ResponseField>

<ResponseField name="--env" type="string">
-Environment variable in `KEY=VALUE` format. Use multiple `--env` flags to set multiple variables. When deploying from `--hub-id`, these values override the Hub release defaults.
+Environment variable in `KEY=VALUE` format. Use multiple `--env` flags to set multiple variables. These values only apply when deploying from `--hub-id`, where they override the Hub release defaults. With `--template-id`, environment variables come from the template, so `--env` is ignored and the CLI prints a note to that effect.
</ResponseField>
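
Since `--env` only takes effect with `--hub-id`, a wrapper script can drop the flags up front for template-based endpoints. The sketch below assumes that behavior. `env_args` is a hypothetical helper, and the `KEY=VALUE` check is its own, not the CLI's.

```bash
# Hypothetical helper: emit one --env flag per KEY=VALUE pair,
# but only when the endpoint is created from a Hub listing.
env_args() {
  local source_flag="${1:-}"
  if [ "$#" -gt 0 ]; then shift; fi
  if [ "$source_flag" != "--hub-id" ]; then
    echo "note: --env values are skipped unless deploying with --hub-id" >&2
    return 0
  fi
  local out="" pair
  for pair in "$@"; do
    case "$pair" in
      ?*=*) out="${out:+$out }--env $pair" ;;
      *)
        echo "not in KEY=VALUE format: '$pair'" >&2
        return 1
        ;;
    esac
  done
  echo "$out"
}

# Print the flags for a Hub deployment with two overrides (placeholder values).
env_args --hub-id MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct MAX_TOKENS=512
```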

+<ResponseField name="--model-reference" type="string">
+Model reference URL to attach to the endpoint. Use multiple `--model-reference` flags to attach multiple models. Works with both `--template-id` and `--hub-id`, and requires GPU compute type.
+</ResponseField>
+
### Update an endpoint