Datasets

Datasets are JSONL files used for fine-tuning. They are usually auto-generated when you start a run from a behaviour spec, but they can also be uploaded manually.

Auto-Generated Datasets

When you start a run, the platform automatically:

1. Compiles your behaviour spec into JSONL chat format (system + user + assistant messages).
2. If augmentation is enabled, uses Claude to expand your examples into a larger, more diverse training set (typically 5–10 examples → 30–40 rows).
3. Uploads the compiled dataset to storage.

Auto-generated datasets are named "Spec Name - Run #N".
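
To make the compile step concrete, here is a minimal sketch of turning spec examples into chat-format rows. It is an illustration only: the spec shape (systemPrompt, examples), the messages/role/content row layout, and both function names are assumptions based on the common chat convention, not the platform's documented schema.

```javascript
// Assumption: a spec has a system prompt and a list of input/output examples.
// The real behaviour spec schema and the compiled row layout may differ.
function compileSpecToChatJsonl(spec) {
  return spec.examples
    .map((example) =>
      JSON.stringify({
        messages: [
          { role: "system", content: spec.systemPrompt },
          { role: "user", content: example.input },
          { role: "assistant", content: example.output },
        ],
      })
    )
    .join("\n");
}

// Builds a name in the "Spec Name - Run #N" pattern described above.
function autoDatasetName(specName, runNumber) {
  return `${specName} - Run #${runNumber}`;
}
```

Each example becomes one line of JSONL, so a spec with two examples and no augmentation would compile to a two-row dataset under this sketch.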

The Dataset Object

{
  "id": "dc66546b-48b3-4490-8baf-9b50aa78130c",
  "name": "Customer Support Bot - Run #8",
  "description": "Auto-compiled from behaviour spec. 36 examples (augmented).",
  "format": "jsonl",
  "status": "validated",
  "row_count": 36,
  "file_size_bytes": 36922,
  "created_at": "2026-03-06T10:44:30.000Z"
}

The recommended way to work with datasets is the tt CLI. Each endpoint below shows the tt command first, followed by the equivalent REST call.

Upload a Dataset

Large uploads use a signed S3 upload URL so file bytes do not pass through the app API.

CLI

tt datasets upload training.jsonl \
  --name "my-training-data" \
  --description "Custom training dataset"

Equivalent API flow

// `file` is a browser File object, for example from an <input type="file"> element.
// 1. Request an upload URL from the app API.
const uploadUrl = await fetch("/api/v1/datasets/upload-url", {
  method: "POST",
  headers: {
    "Authorization": "Bearer <api-key>",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    name: "my-training-data",
    description: "Custom training dataset",
    filename: file.name,
    size: file.size,
    contentType: file.type || "application/octet-stream"
  })
}).then((res) => res.json());

// 2. Upload the file directly to S3.
await fetch(uploadUrl.data.upload_url, {
  method: uploadUrl.data.method,
  headers: uploadUrl.data.headers,
  body: file
});

// 3. Finalize and validate the uploaded dataset.
await fetch("/api/v1/datasets/finalize", {
  method: "POST",
  headers: {
    "Authorization": "Bearer <api-key>",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    path: uploadUrl.data.path,
    name: "my-training-data",
    description: "Custom training dataset"
  })
});

The file must be in JSONL format and no larger than 100 MB. Standard text rows use input and output strings. Document OCR rows use an input object with prompt and assets fields plus a string output. After finalization, the dataset's status is validated if all lines parse correctly, or invalid, with error details, if they do not.
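
The three-step flow can also be wrapped in a single helper that stops at the first failed step. This is a minimal sketch, not an official client: the names uploadDataset, ensureOk, baseUrl, and fetchImpl are hypothetical, and reading "100 MB" as 100 × 1024 × 1024 bytes is an assumption. The endpoints and request fields are the ones used in the flow above.

```javascript
// Assumption: "100 MB" is read as 100 * 1024 * 1024 bytes.
const MAX_DATASET_BYTES = 100 * 1024 * 1024;

// Throws if a response is not 2xx, otherwise returns it unchanged.
function ensureOk(res, step) {
  if (!res.ok) {
    throw new Error(`${step} failed with HTTP ${res.status}`);
  }
  return res;
}

// `fetchImpl` is injectable so the flow can be exercised without a network.
async function uploadDataset({ file, name, description, apiKey, baseUrl = "", fetchImpl = fetch }) {
  if (file.size > MAX_DATASET_BYTES) {
    throw new Error("file exceeds the 100 MB limit");
  }
  const jsonHeaders = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };

  // 1. Request a signed upload URL from the app API.
  const urlRes = ensureOk(
    await fetchImpl(`${baseUrl}/api/v1/datasets/upload-url`, {
      method: "POST",
      headers: jsonHeaders,
      body: JSON.stringify({
        name,
        description,
        filename: file.name,
        size: file.size,
        contentType: file.type || "application/octet-stream",
      }),
    }),
    "upload-url request"
  );
  const { data } = await urlRes.json();

  // 2. Send the file bytes directly to storage.
  ensureOk(
    await fetchImpl(data.upload_url, {
      method: data.method,
      headers: data.headers,
      body: file,
    }),
    "file upload"
  );

  // 3. Finalize so the platform validates the uploaded file.
  const finalRes = ensureOk(
    await fetchImpl(`${baseUrl}/api/v1/datasets/finalize`, {
      method: "POST",
      headers: jsonHeaders,
      body: JSON.stringify({ path: data.path, name, description }),
    }),
    "finalize request"
  );
  return finalRes.json();
}
```

Checking each response before moving on means a rejected upload-url request never leads to a finalize call for a file that was not uploaded.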

JSONL Format

Each standard row should be a JSON object with input and output strings:

{"input": "How do I reset my password?", "output": "Go to Settings > Security and choose Reset password."}
{"input": "Where can I see invoices?", "output": "Open Billing > Invoices from the dashboard."}
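
Checking a file locally before uploading can catch malformed rows early. The sketch below is a hypothetical pre-check (validateStandardJsonl is not part of the tt CLI or the API); it assumes blank lines can be skipped, which the platform's own validator may or may not do.

```javascript
// Checks every non-empty line of a JSONL string and collects errors by line number.
function validateStandardJsonl(text) {
  const errors = [];
  let rowCount = 0;

  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // assumption: blank lines are ignored
    rowCount += 1;
    const lineNumber = index + 1;

    let row;
    try {
      row = JSON.parse(line);
    } catch (err) {
      errors.push({ line: lineNumber, message: "not valid JSON" });
      return;
    }
    if (row === null || typeof row !== "object" || Array.isArray(row)) {
      errors.push({ line: lineNumber, message: "row must be a JSON object" });
      return;
    }
    if (typeof row.input !== "string") {
      errors.push({ line: lineNumber, message: "input must be a string" });
    }
    if (typeof row.output !== "string") {
      errors.push({ line: lineNumber, message: "output must be a string" });
    }
  });

  return { valid: errors.length === 0, rowCount, errors };
}
```

The function takes the file contents as a string, so it works the same way in a browser (after reading a File) and in a script.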

Document OCR JSONL

For document OCR or image-to-JSON datasets, pass --format document_ocr_jsonl so the CLI validates asset metadata before upload. Each row should include an input object with prompt and assets, plus a string output.

tt datasets upload ocr-training.jsonl \
  --name "OCR training data" \
  --format document_ocr_jsonl

{"input":{"prompt":"Extract invoice number and total as JSON.","assets":[{"type":"image","mime_type":"image/png","path":"./invoice-page-1.png"}]},"output":"{\"invoice_number\":\"INV-1007\",\"total\":\"$421.50\"}"}
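
Because output is a string that itself contains JSON, OCR rows are easy to get wrong when written by hand. The sketch below builds and checks a row programmatically. Both helpers are hypothetical, and the rules that assets must be non-empty and that each asset needs type, mime_type, and path are inferred from the example row above, not from a published schema.

```javascript
// Builds one document OCR row; the expected output object is serialized to a string.
function makeOcrRow(prompt, assets, expectedOutput) {
  return {
    input: { prompt, assets },
    output: JSON.stringify(expectedOutput),
  };
}

// Returns a list of problems found in a parsed row (empty when the row looks fine).
function validateDocumentOcrRow(row) {
  const problems = [];
  const input = row ? row.input : undefined;

  if (input === null || typeof input !== "object" || Array.isArray(input)) {
    problems.push("input must be an object");
  } else {
    if (typeof input.prompt !== "string") {
      problems.push("input.prompt must be a string");
    }
    if (!Array.isArray(input.assets) || input.assets.length === 0) {
      // Assumption: at least one asset is required.
      problems.push("input.assets must be a non-empty array");
    } else {
      input.assets.forEach((asset, i) => {
        // Assumption: these three keys are required, as in the example row.
        for (const key of ["type", "mime_type", "path"]) {
          if (typeof asset?.[key] !== "string") {
            problems.push(`input.assets[${i}].${key} must be a string`);
          }
        }
      });
    }
  }
  if (typeof (row ? row.output : undefined) !== "string") {
    problems.push("output must be a string");
  }
  return problems;
}
```

Writing each row with JSON.stringify(makeOcrRow(...)) takes care of escaping the inner quotes in output, as seen in the example row.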

Promote Labeling Jobs

The tt label workflow can turn unlabeled JSONL or CSV rows into a reviewed dataset. The teacher drafts outputs under a behaviour spec; accepted and edited rows can then be promoted into the same validated dataset format shown above.

tt label upload unlabeled.jsonl --spec <spec-id> --watch
tt label accept <job-id> --all
tt label promote <job-id> --name "reviewed-labels"

Labeling uploads are sanitized before teacher calls. Common PII is replaced with redaction placeholders. Secret-like rows, such as password assignments, bearer tokens, API keys, connection strings, and private keys, are blocked from teacher labeling and excluded from promoted datasets.

Promotion performs a final sanitization check over reviewed inputs and outputs before writing the dataset file, so reviewer edits cannot accidentally introduce blocked content into training data.
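
To get a rough idea of which rows might be blocked, you can scan your data locally before uploading. The patterns below are illustrative guesses covering a few of the categories listed above; they are not the platform's actual detection rules, and findSecretLikeContent is a hypothetical helper.

```javascript
// Illustrative patterns only. The platform's real sanitization rules are not
// documented here and are likely broader than this list.
const SECRET_PATTERNS = [
  { label: "password assignment", pattern: /\bpassword\s*[:=]\s*\S+/i },
  { label: "bearer token", pattern: /\bBearer\s+[A-Za-z0-9._~+\/-]{8,}/ },
  { label: "private key", pattern: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ },
  { label: "connection string", pattern: /\b[a-z][a-z0-9+.-]*:\/\/[^\s:@\/]+:[^\s@\/]+@/i },
];

// Returns the labels of every pattern that matches the given text.
function findSecretLikeContent(text) {
  return SECRET_PATTERNS
    .filter(({ pattern }) => pattern.test(text))
    .map(({ label }) => label);
}
```

A row that returns a non-empty list here is worth reviewing by hand; an empty list does not guarantee that the platform will accept the row.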

List Datasets

GET /api/v1/datasets

CLI

tt datasets list

Equivalent REST call

curl https://tunedtensor.com/api/v1/datasets \
  -H "Authorization: Bearer <api-key>"

Get a Dataset

GET /api/v1/datasets/:id

CLI

tt datasets get <dataset-id>

Equivalent REST call

curl "https://tunedtensor.com/api/v1/datasets/<dataset-id>" \
  -H "Authorization: Bearer <api-key>"

Delete a Dataset

DELETE /api/v1/datasets/:id

CLI

tt datasets delete <dataset-id>

Equivalent REST call

curl -X DELETE "https://tunedtensor.com/api/v1/datasets/<dataset-id>" \
  -H "Authorization: Bearer <api-key>"

Deletes the dataset record and the underlying file from storage.
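
For scripts that call the REST endpoints directly, the list, get, and delete calls can be grouped into a small wrapper. This is a sketch under stated assumptions: createDatasetsClient and its method names are hypothetical, the base URL is taken from the curl examples, and response bodies are assumed to be JSON (an empty body resolves to null).

```javascript
// `fetchImpl` is injectable so the client can be tested without a network.
function createDatasetsClient({ apiKey, baseUrl = "https://tunedtensor.com", fetchImpl = fetch }) {
  async function request(method, path) {
    const res = await fetchImpl(`${baseUrl}${path}`, {
      method,
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (!res.ok) {
      throw new Error(`${method} ${path} failed with HTTP ${res.status}`);
    }
    // Assumption: bodies are JSON; DELETE may return an empty body.
    const text = await res.text();
    return text ? JSON.parse(text) : null;
  }

  return {
    // GET /api/v1/datasets
    list: () => request("GET", "/api/v1/datasets"),
    // GET /api/v1/datasets/:id
    get: (id) => request("GET", `/api/v1/datasets/${encodeURIComponent(id)}`),
    // DELETE /api/v1/datasets/:id
    remove: (id) => request("DELETE", `/api/v1/datasets/${encodeURIComponent(id)}`),
  };
}
```

Since remove also deletes the underlying file from storage, a script may want to call get first and confirm the dataset name before deleting.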