Email Triage Use Case Story

Dataset

9,882 rows

Validation pass rate

41.7% -> 75.0%

Test avg score

0.623 -> 0.829

The Job

The model has one operational job: read an email and return a strict JSON decision with three fields: category, priority, and action.

That gives an email workflow a compact routing signal instead of asking a larger agent to repeatedly infer the same triage decision from raw message text.

Input:
Subject: Wire transfer approval needed today
Body: The vendor payment is due by 4pm and exceeds the normal approval threshold.

Expected output:
{
  "category": "finance",
  "priority": "urgent",
  "action": "flag"
}

The Approach

We started from the base Qwen 3.5 2B model, used a cleaned 9,882-row triage dataset, and wrote a behaviour spec with examples for every workflow action.

We also kept the hyperparameters conservative: two epochs, batch size 2, LoRA rank 16, LoRA alpha 32, no augmentation, and capped validation/test evals at 24 examples each for a fast iteration.

The Dataset

The uploaded dataset used flat Tuned Tensor JSONL rows with email text as the input and strict JSON as the target output. The 80/10/10 split produced 7,906 training rows, 988 validation rows, and 988 test rows.

{"input":"Subject: ...\nBody: ...","output":"{\"category\":\"meeting_request\",\"priority\":\"high\",\"action\":\"reply\"}"}

The Behaviour Spec

The spec constrains the model to lowercase enum values and tells it to pick the action by workflow consequence, not isolated keywords.

{
  "name": "Email Triage Qwen 2B v2",
  "base_model": "Qwen/Qwen3.5-2B",
  "system_prompt": "You are an email triage assistant. Classify the email and decide what operational action should happen next...",
  "guidelines": [
    "Always return valid JSON with exactly three keys: category, priority, action.",
    "Pick category by the primary intent of the email, not by isolated keywords.",
    "Pick action by workflow consequence: reply, forward, archive, flag, mark_read, or ignore."
  ],
  "constraints": [
    "Do not output markdown, code fences, explanations, or prose.",
    "Do not include any keys other than category, priority, and action.",
    "Do not use values outside the allowed enums."
  ]
}

The Run

The run started from the base model and evaluated the base and tuned model with the same capped validation and test examples.

tt runs start "$SPEC_ID" \
  --dataset "$DATASET_ID" \
  --train-ratio 0.8 \
  --validation-ratio 0.1 \
  --test-ratio 0.1 \
  --epochs 2 \
  --batch-size 2 \
  --lora-rank 16 \
  --lora-alpha 32 \
  --max-eval-examples 24 \
  --max-test-eval-examples 24 \
  --no-augment \
  --json

The Result

Run ID	eeb70f7e-c2a4-41dc-a042-cd5f15b47ea1
Model ID	f2cf3d64-b88c-44d6-a3fe-f7f83298d9e4
Base model	Qwen/Qwen3.5-2B
Dataset size	9,882 rows
Train split	7,906 rows
Validation split	988 rows, capped at 24
Test split	988 rows, capped at 24
Reserved TT cost	$1.13

Validation

Model	Avg score	Pass rate	Exact match
Baseline Qwen 3.5 2B	0.556	41.7%	20.8%
Fine-tuned model	0.812	75.0%	75.0%

Test

Model	Avg score	Pass rate	Exact match
Baseline Qwen 3.5 2B	0.623	62.5%	16.7%
Fine-tuned model	0.829	79.2%	70.8%

The fine-tuned model improved validation avg score by 0.256 and pass rate by 33.3 percentage points. The held-out test split also improved, from 0.623 to 0.829 avg score and from 62.5% to 79.2% pass rate.

What We Learned

The fine-tune produced a large overall improvement, but it also surfaced a useful follow-up: the model became too eager to classify borderline spam or suspicious corporate messages as phishing, urgent, and flag.

The next highest-ROI improvement is a small corrective dataset that distinguishes credential-harvesting phishing from spam, fake promotions, and benign corporate messages with security language.

Run It Locally

After the run completes, download the final model artifact and extract the Hugging Face model directory.

tt models download f2cf3d64-b88c-44d6-a3fe-f7f83298d9e4 --output email-triage-model.tar.gz
mkdir -p ./models/email-triage
tar -xzf email-triage-model.tar.gz -C ./models/email-triage

Normal Tuned Tensor runs save merged model weights, so you do not need to load a separate LoRA adapter unless the run used save_adapter_only: true.

On Apple Silicon, use MLX-LM:

python3 -m venv .venv
source .venv/bin/activate
pip install mlx-lm

mlx_lm.convert \
  --hf-path ./models/email-triage \
  --mlx-path ./models/email-triage-mlx-4bit \
  --quantize \
  --q-bits 4 \
  --trust-remote-code

mlx_lm.server --model ./models/email-triage-mlx-4bit --port 8080

On NVIDIA/Linux, serve the extracted model with vLLM:

pip install vllm

vllm serve ./models/email-triage \
  --served-model-name email-triage \
  --dtype float16 \
  --trust-remote-code

Fine-tune Qwen 3.5 2B for email triage