Models

Models are fine-tuned versions of base models, automatically created when a training run completes successfully.

The Model Object

{
  "id": "96e9f0d9-81c1-41cb-b24e-5c85ea09e769",
  "name": "Qwen3.5-2B-ft-b3e2b918",
  "provider": "hosted",
  "provider_model_id": "tt-model-b3e2b918",
  "base_model": "Qwen/Qwen3.5-2B",
  "fine_tune_job_id": "b3e2b918-...",
  "created_at": "2026-03-06T10:57:50.000Z"
}

Field	Description
`name`	Auto-generated from base model + job ID
`provider`	Public provider label; currently `hosted`
`provider_model_id`	Hosted model identifier
`base_model`	Original model that was fine-tuned
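
For client code, it can help to turn the JSON above into a typed value. The following is a minimal sketch, not an official SDK: the `Model` dataclass and `parse_model` helper are hypothetical names, and the sketch assumes every field in the example object is always present.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Model:
    """Typed view of the Model object (hypothetical helper, not an SDK type)."""
    id: str
    name: str
    provider: str
    provider_model_id: str
    base_model: str
    fine_tune_job_id: str
    created_at: datetime


def parse_model(payload: dict) -> Model:
    # Rewrite the trailing "Z" as an explicit UTC offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    created = datetime.fromisoformat(payload["created_at"].replace("Z", "+00:00"))
    return Model(
        id=payload["id"],
        name=payload["name"],
        provider=payload["provider"],
        provider_model_id=payload["provider_model_id"],
        base_model=payload["base_model"],
        fine_tune_job_id=payload["fine_tune_job_id"],
        created_at=created,
    )


# The example object from this section, with the elided job ID left as-is.
example = json.loads("""
{
  "id": "96e9f0d9-81c1-41cb-b24e-5c85ea09e769",
  "name": "Qwen3.5-2B-ft-b3e2b918",
  "provider": "hosted",
  "provider_model_id": "tt-model-b3e2b918",
  "base_model": "Qwen/Qwen3.5-2B",
  "fine_tune_job_id": "b3e2b918-...",
  "created_at": "2026-03-06T10:57:50.000Z"
}
""")
model = parse_model(example)
print(model.name, model.base_model)
```

Parsing `created_at` into a timezone-aware `datetime` makes it straightforward to sort models by creation time.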

The recommended way to work with models is the `tt` CLI. Each endpoint below shows the `tt` command first, followed by the equivalent REST call.

List Models

GET /api/v1/models

CLI

tt models list

Equivalent REST call

curl https://tunedtensor.com/api/v1/models \
  -H "Authorization: Bearer <api-key>"

The endpoint supports the `page` and `per_page` query parameters for pagination.
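
As a rough illustration of the pagination parameters, the sketch below builds the request URL and fetches a single page with Python's standard library. The helper names, the default values (`page=1`, `per_page=20`), and the assumption that pages are numbered from 1 are illustrative only. The response body is returned undecoded beyond JSON parsing because its shape is not documented in this section.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://tunedtensor.com/api/v1"


def models_list_url(page: int = 1, per_page: int = 20) -> str:
    """Build the List Models URL with pagination query parameters."""
    # Defaults are assumptions for this sketch, not documented API defaults.
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    return f"{BASE_URL}/models?{query}"


def list_models(api_key: str, page: int = 1, per_page: int = 20):
    """Fetch one page and return the decoded JSON body unchanged."""
    request = urllib.request.Request(
        models_list_url(page, per_page),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `list_models("<api-key>", page=2, per_page=50)` would request the second page of 50 models.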

Get a Model

GET /api/v1/models/:id

CLI

tt models get <model-id>

Equivalent REST call

curl https://tunedtensor.com/api/v1/models/:id \
  -H "Authorization: Bearer <api-key>"

Download a Model Artifact

GET /api/v1/models/:id/download

CLI

tt models download <model-id> --output model.tar.gz

Equivalent REST call

curl https://tunedtensor.com/api/v1/models/:id/download \
  -H "Authorization: Bearer <api-key>"

The endpoint returns a short-lived signed download URL for models that have a Tuned Tensor-hosted artifact. In interactive terminals, the CLI shows download progress, transfer rate, and ETA. Hosted models can still be referenced by `provider_model_id`, but may not expose downloadable weights.
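
Because the signed URL is short-lived, a client calling the REST endpoint directly should start the transfer soon after receiving it. The sketch below streams a URL to disk with the standard library. How the URL appears in the response body is not documented here, so the hypothetical `download_artifact` helper takes the URL as an argument instead of guessing a field name.

```python
import shutil
import urllib.request
from pathlib import Path


def download_artifact(signed_url: str, destination: str, chunk_size: int = 1 << 20) -> int:
    """Stream signed_url to destination and return the number of bytes written."""
    target = Path(destination)
    # Write to a temporary sibling first so an interrupted transfer never
    # leaves a truncated file under the final name.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(signed_url, timeout=60) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    partial.replace(target)
    return target.stat().st_size
```

Streaming in chunks keeps memory use flat, which matters for multi-gigabyte model archives.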

Serve a Model Locally

The recommended local path is `tt models serve`. It can serve a model ID, a downloaded `.tar.gz` artifact, or an extracted model directory through an OpenAI-compatible chat completions API. By default it also applies the compiled behaviour spec prompt from `tunedtensor.json`, so local inference uses the same behaviour spec prompt as training.

# One-time local runtime setup
tt models setup-runtime

# Serve by model ID. The CLI downloads and caches the artifact automatically.
tt models serve <model-id> --spec tunedtensor.json

# Or serve a downloaded archive / extracted directory
tt models download <model-id> --output model.tar.gz
tt models serve model.tar.gz --spec tunedtensor.json
tt models serve ./models/my-model --spec tunedtensor.json

# Managed mode starts on demand, idles down, serializes requests, and logs JSONL
tt models serve <model-id> --spec tunedtensor.json --managed \
  --idle-timeout 300 \
  --restart-after-requests 100 \
  --json-schema schema.json

The local reference server exposes:

POST http://127.0.0.1:8000/v1/chat/completions
GET  http://127.0.0.1:8000/health

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Hello" }
    ],
    "max_tokens": 200
  }'
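
The same request can be made from application code. This is a minimal sketch using only the standard library. It assumes the response follows the usual OpenAI chat completions shape (`choices[0].message.content`), which is implied by "OpenAI-compatible" but not spelled out in this section. The helper names are hypothetical.

```python
import json
import urllib.request

CHAT_URL = "http://127.0.0.1:8000/v1/chat/completions"


def build_chat_payload(user_message: str, max_tokens: int = 200) -> dict:
    """Build the same request body as the curl example."""
    return {
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }


def extract_reply(response_body: dict) -> str:
    # Assumed OpenAI-style layout: the text lives at choices[0].message.content.
    return response_body["choices"][0]["message"]["content"]


def chat(user_message: str, max_tokens: int = 200, url: str = CHAT_URL) -> str:
    """POST one user message to the local server and return the reply text."""
    data = json.dumps(build_chat_payload(user_message, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return extract_reply(json.load(response))
```

With the server running, `chat("Hello")` sends the payload shown in the curl example.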

With `--device auto`, the CLI tries CUDA first, then Apple MPS, then CPU. You can also force a device explicitly:

tt models serve <model-id> --spec tunedtensor.json --device mps
tt models serve <model-id> --spec tunedtensor.json --device cuda
tt models serve <model-id> --spec tunedtensor.json --device cpu
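
The `auto` priority order can be sketched in a few lines of Python. This illustrates the order described above (CUDA, then MPS, then CPU) using PyTorch's availability checks. It is an assumption about the general approach, not the CLI's actual implementation, and both function names are hypothetical.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the documented priority: CUDA, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    return pick_device(torch.cuda.is_available(), bool(mps and mps.is_available()))
```

Keeping the priority logic separate from the probing makes the ordering easy to check without a GPU.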

The setup command (`tt models setup-runtime`) creates an isolated Python runtime and installs `torch`, `transformers`, `accelerate`, and `safetensors`. If you already manage your own Python environment, pass `--python <path>` to `tt models serve`.

Add `--managed` when a local application should not keep the model loaded indefinitely. Managed mode keeps the public endpoint stable, starts the model on the first generation request, serializes generation calls, stops the model after `--idle-timeout`, and restarts it after `--restart-after-requests` or failed health checks. It logs request size, latency, schema validity, and the configured `--gate-field` result as JSON lines.
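
To make the start, idle-stop, and restart rules concrete, here is a small state-tracking sketch. It is an illustration of the behaviour described above, not the CLI's implementation: `ManagedLifecycle` is a hypothetical class, the timeout is assumed to be in seconds, and request serialization and health checks are left out.

```python
import time


class ManagedLifecycle:
    """Tracks when a lazily started model process should start, stop, or restart."""

    def __init__(self, idle_timeout: float, restart_after_requests: int, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.restart_after_requests = restart_after_requests
        self.clock = clock
        self.running = False
        self.served = 0
        self.last_used = clock()

    def on_request(self) -> str:
        """Record one generation request and report what had to happen first."""
        if not self.running:
            action = "start"
        elif self.served >= self.restart_after_requests:
            action = "restart"
        else:
            action = "reuse"
        if action != "reuse":
            # A fresh process begins with a fresh request counter.
            self.running = True
            self.served = 0
        self.served += 1
        self.last_used = self.clock()
        return action

    def on_tick(self) -> str:
        """Call periodically; stops the model once it has been idle too long."""
        if self.running and self.clock() - self.last_used >= self.idle_timeout:
            self.running = False
            return "stop"
        return "keep"
```

With `idle_timeout=300` and `restart_after_requests=100`, matching the command above, the first request starts the model, the 101st triggers a restart, and 300 idle seconds trigger a stop.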

Downloaded Tuned Tensor artifacts are packaged as Hugging Face model directories inside model.tar.gz. Normal runs save merged model weights, so you can also extract the archive and run the model with other Hugging Face-compatible engines.

mkdir -p ./models/my-model
tar -xzf model.tar.gz -C ./models/my-model
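
The same extraction can be done from Python, for example inside a deployment script. This sketch uses the standard `tarfile` module and then looks for the directory holding `config.json`, which Hugging Face model directories contain. Whether the archive wraps the model in a top-level folder is not stated here, so the hypothetical `find_model_dir` helper searches rather than assuming a layout.

```python
import tarfile
from pathlib import Path


def extract_artifact(archive: str, destination: str) -> Path:
    """Extract a model.tar.gz archive into destination and return that path."""
    target = Path(destination)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # On Python versions with extraction filters, refuse absolute
            # paths, parent-directory escapes, and special files.
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)
    return target


def find_model_dir(root: Path) -> Path:
    """Return the shallowest directory under root that holds a config.json."""
    candidates = sorted(root.rglob("config.json"), key=lambda p: len(p.parts))
    if not candidates:
        raise FileNotFoundError(f"no config.json found under {root}")
    return candidates[0].parent
```

The returned directory is the path to pass to a Hugging Face-compatible engine.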

Advanced: Apple Silicon with MLX-LM

On Apple Silicon, if you want an MLX-specific serving stack, use MLX-LM to convert the Hugging Face directory to a quantized MLX model:

python3 -m venv .venv
source .venv/bin/activate
pip install mlx-lm

mlx_lm.convert \
  --hf-path ./models/my-model \
  --mlx-path ./models/my-model-mlx-4bit \
  --quantize \
  --q-bits 4 \
  --trust-remote-code

mlx_lm.server --model ./models/my-model-mlx-4bit --port 8080

The MLX server exposes an OpenAI-compatible chat endpoint at http://127.0.0.1:8080/v1/chat/completions.

Advanced: NVIDIA/Linux with vLLM

For production-like local serving on an NVIDIA GPU machine, you can also serve the extracted Hugging Face directory with vLLM:

pip install vllm

vllm serve ./models/my-model \
  --served-model-name my-model \
  --dtype float16 \
  --trust-remote-code

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      { "role": "user", "content": "Hello" }
    ],
    "max_tokens": 200
  }'

If a run was explicitly created with `save_adapter_only: true`, the artifact contains adapter weights instead of a merged standalone model. In that case, load the adapter together with the original base model rather than using the simple merged-model flow above.
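
One possible way to handle both artifact types is sketched below. It assumes the adapter is stored in the Hugging Face PEFT layout (an `adapter_config.json` file next to the adapter weights) and that the `transformers` and `peft` packages are installed. Neither assumption is confirmed by this section, and both function names are hypothetical.

```python
from pathlib import Path


def is_adapter_only(model_dir: str) -> bool:
    """Heuristic: PEFT adapters ship an adapter_config.json file (assumed layout)."""
    return (Path(model_dir) / "adapter_config.json").is_file()


def load_for_inference(model_dir: str, base_model: str):
    """Load a merged model directly, or attach an adapter to its base model."""
    # Imported lazily so is_adapter_only works without these packages installed.
    from transformers import AutoModelForCausalLM

    if not is_adapter_only(model_dir):
        return AutoModelForCausalLM.from_pretrained(model_dir)

    from peft import PeftModel

    # Load the original base model first, then apply the adapter weights to it.
    base = AutoModelForCausalLM.from_pretrained(base_model)
    return PeftModel.from_pretrained(base, model_dir)
```

For the example model in this section, the call would be `load_for_inference("./models/my-model", "Qwen/Qwen3.5-2B")`, using the `base_model` value from the Model object.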

Delete a Model

DELETE /api/v1/models/:id

CLI

tt models delete <model-id>

Equivalent REST call

curl -X DELETE https://tunedtensor.com/api/v1/models/:id \
  -H "Authorization: Bearer <api-key>"

Deletes the model record from Tuned Tensor. The model may still exist on the provider and would need to be deleted there separately.