Use case story

Fine-tune Qwen 3.5 2B for email triage

We used Tuned Tensor and the tt CLI to train a small model that classifies email category, priority, and the next workflow action. The run used a larger cleaned dataset, a precise behaviour spec, and measured the tuned model against the base model on validation and held-out test examples.

Dataset

9,882 rows

Validation pass rate

41.7% -> 75.0%

Test avg score

0.623 -> 0.829

The Job

The model has one operational job: read an email and return a strict JSON decision with three fields: category, priority, and action.

That gives an email workflow a compact routing signal instead of asking a larger agent to repeatedly infer the same triage decision from raw message text.

Input:
Subject: Wire transfer approval needed today
Body: The vendor payment is due by 4pm and exceeds the normal approval threshold.

Expected output:
{
  "category": "finance",
  "priority": "urgent",
  "action": "flag"
}

The Approach

We started from the base Qwen 3.5 2B model, used a cleaned 9,882-row triage dataset, and wrote a behaviour spec with examples for every workflow action.

We also kept the hyperparameters conservative: two epochs, batch size 2, LoRA rank 16, LoRA alpha 32, no augmentation, and capped validation/test evals at 24 examples each for a fast iteration.

The Dataset

The uploaded dataset used flat Tuned Tensor JSONL rows with email text as the input and strict JSON as the target output. The 80/10/10 split produced 7,906 training rows, 988 validation rows, and 988 test rows.

{"input":"Subject: ...\nBody: ...","output":"{\"category\":\"meeting_request\",\"priority\":\"high\",\"action\":\"reply\"}"}

The Behaviour Spec

The spec constrains the model to lowercase enum values and tells it to pick the action by workflow consequence, not isolated keywords.

{
  "name": "Email Triage Qwen 2B v2",
  "base_model": "Qwen/Qwen3.5-2B",
  "system_prompt": "You are an email triage assistant. Classify the email and decide what operational action should happen next...",
  "guidelines": [
    "Always return valid JSON with exactly three keys: category, priority, action.",
    "Pick category by the primary intent of the email, not by isolated keywords.",
    "Pick action by workflow consequence: reply, forward, archive, flag, mark_read, or ignore."
  ],
  "constraints": [
    "Do not output markdown, code fences, explanations, or prose.",
    "Do not include any keys other than category, priority, and action.",
    "Do not use values outside the allowed enums."
  ]
}

The Run

The run started from the base model and evaluated the base and tuned model with the same capped validation and test examples.

tt runs start "$SPEC_ID" \
  --dataset "$DATASET_ID" \
  --train-ratio 0.8 \
  --validation-ratio 0.1 \
  --test-ratio 0.1 \
  --epochs 2 \
  --batch-size 2 \
  --lora-rank 16 \
  --lora-alpha 32 \
  --max-eval-examples 24 \
  --max-test-eval-examples 24 \
  --no-augment \
  --json

The Result

Run IDeeb70f7e-c2a4-41dc-a042-cd5f15b47ea1
Model IDf2cf3d64-b88c-44d6-a3fe-f7f83298d9e4
Base modelQwen/Qwen3.5-2B
Dataset size9,882 rows
Train split7,906 rows
Validation split988 rows, capped at 24
Test split988 rows, capped at 24
Reserved TT cost$1.13
Validation
ModelAvg scorePass rateExact match
Baseline Qwen 3.5 2B0.55641.7%20.8%
Fine-tuned model0.81275.0%75.0%
Test
ModelAvg scorePass rateExact match
Baseline Qwen 3.5 2B0.62362.5%16.7%
Fine-tuned model0.82979.2%70.8%

The fine-tuned model improved validation avg score by 0.256 and pass rate by 33.3 percentage points. The held-out test split also improved, from 0.623 to 0.829 avg score and from 62.5% to 79.2% pass rate.

What We Learned

The fine-tune produced a large overall improvement, but it also surfaced a useful follow-up: the model became too eager to classify borderline spam or suspicious corporate messages as phishing, urgent, and flag.

The next highest-ROI improvement is a small corrective dataset that distinguishes credential-harvesting phishing from spam, fake promotions, and benign corporate messages with security language.

Run It Locally

After the run completes, download the final model artifact and extract the Hugging Face model directory.

tt models download f2cf3d64-b88c-44d6-a3fe-f7f83298d9e4 --output email-triage-model.tar.gz
mkdir -p ./models/email-triage
tar -xzf email-triage-model.tar.gz -C ./models/email-triage

Normal Tuned Tensor runs save merged model weights, so you do not need to load a separate LoRA adapter unless the run used save_adapter_only: true.

On Apple Silicon, use MLX-LM:

python3 -m venv .venv
source .venv/bin/activate
pip install mlx-lm

mlx_lm.convert \
  --hf-path ./models/email-triage \
  --mlx-path ./models/email-triage-mlx-4bit \
  --quantize \
  --q-bits 4 \
  --trust-remote-code

mlx_lm.server --model ./models/email-triage-mlx-4bit --port 8080

On NVIDIA/Linux, serve the extracted model with vLLM:

pip install vllm

vllm serve ./models/email-triage \
  --served-model-name email-triage \
  --dtype float16 \
  --trust-remote-code