Dataset
9,882 rows
Validation pass rate
41.7% -> 75.0%
Test avg score
0.623 -> 0.829
The Job
The model has one operational job: read an email and return a strict JSON decision with three fields: category, priority, and action.
That gives an email workflow a compact routing signal instead of asking a larger agent to repeatedly infer the same triage decision from raw message text.
Input:
Subject: Wire transfer approval needed today
Body: The vendor payment is due by 4pm and exceeds the normal approval threshold.
Expected output:
{
"category": "finance",
"priority": "urgent",
"action": "flag"
}The Approach
We started from the base Qwen 3.5 2B model, used a cleaned 9,882-row triage dataset, and wrote a behaviour spec with examples for every workflow action.
We also kept the hyperparameters conservative: two epochs, batch size 2, LoRA rank 16, LoRA alpha 32, no augmentation, and capped validation/test evals at 24 examples each for a fast iteration.
The Dataset
The uploaded dataset used flat Tuned Tensor JSONL rows with email text as the input and strict JSON as the target output. The 80/10/10 split produced 7,906 training rows, 988 validation rows, and 988 test rows.
{"input":"Subject: ...\nBody: ...","output":"{\"category\":\"meeting_request\",\"priority\":\"high\",\"action\":\"reply\"}"}The Behaviour Spec
The spec constrains the model to lowercase enum values and tells it to pick the action by workflow consequence, not isolated keywords.
{
"name": "Email Triage Qwen 2B v2",
"base_model": "Qwen/Qwen3.5-2B",
"system_prompt": "You are an email triage assistant. Classify the email and decide what operational action should happen next...",
"guidelines": [
"Always return valid JSON with exactly three keys: category, priority, action.",
"Pick category by the primary intent of the email, not by isolated keywords.",
"Pick action by workflow consequence: reply, forward, archive, flag, mark_read, or ignore."
],
"constraints": [
"Do not output markdown, code fences, explanations, or prose.",
"Do not include any keys other than category, priority, and action.",
"Do not use values outside the allowed enums."
]
}The Run
The run started from the base model and evaluated the base and tuned model with the same capped validation and test examples.
tt runs start "$SPEC_ID" \
--dataset "$DATASET_ID" \
--train-ratio 0.8 \
--validation-ratio 0.1 \
--test-ratio 0.1 \
--epochs 2 \
--batch-size 2 \
--lora-rank 16 \
--lora-alpha 32 \
--max-eval-examples 24 \
--max-test-eval-examples 24 \
--no-augment \
--jsonThe Result
| Run ID | eeb70f7e-c2a4-41dc-a042-cd5f15b47ea1 |
|---|---|
| Model ID | f2cf3d64-b88c-44d6-a3fe-f7f83298d9e4 |
| Base model | Qwen/Qwen3.5-2B |
| Dataset size | 9,882 rows |
| Train split | 7,906 rows |
| Validation split | 988 rows, capped at 24 |
| Test split | 988 rows, capped at 24 |
| Reserved TT cost | $1.13 |
| Model | Avg score | Pass rate | Exact match |
|---|---|---|---|
| Baseline Qwen 3.5 2B | 0.556 | 41.7% | 20.8% |
| Fine-tuned model | 0.812 | 75.0% | 75.0% |
| Model | Avg score | Pass rate | Exact match |
|---|---|---|---|
| Baseline Qwen 3.5 2B | 0.623 | 62.5% | 16.7% |
| Fine-tuned model | 0.829 | 79.2% | 70.8% |
The fine-tuned model improved validation avg score by 0.256 and pass rate by 33.3 percentage points. The held-out test split also improved, from 0.623 to 0.829 avg score and from 62.5% to 79.2% pass rate.
What We Learned
The fine-tune produced a large overall improvement, but it also surfaced a useful follow-up: the model became too eager to classify borderline spam or suspicious corporate messages as phishing, urgent, and flag.
The next highest-ROI improvement is a small corrective dataset that distinguishes credential-harvesting phishing from spam, fake promotions, and benign corporate messages with security language.
Run It Locally
After the run completes, download the final model artifact and extract the Hugging Face model directory.
tt models download f2cf3d64-b88c-44d6-a3fe-f7f83298d9e4 --output email-triage-model.tar.gz
mkdir -p ./models/email-triage
tar -xzf email-triage-model.tar.gz -C ./models/email-triageNormal Tuned Tensor runs save merged model weights, so you do not need to load a separate LoRA adapter unless the run used save_adapter_only: true.
On Apple Silicon, use MLX-LM:
python3 -m venv .venv
source .venv/bin/activate
pip install mlx-lm
mlx_lm.convert \
--hf-path ./models/email-triage \
--mlx-path ./models/email-triage-mlx-4bit \
--quantize \
--q-bits 4 \
--trust-remote-code
mlx_lm.server --model ./models/email-triage-mlx-4bit --port 8080On NVIDIA/Linux, serve the extracted model with vLLM:
pip install vllm
vllm serve ./models/email-triage \
--served-model-name email-triage \
--dtype float16 \
--trust-remote-code