Product update
Tuned Tensor now supports fine-tuning Qwen/Qwen3-VL-2B-Instruct for OCR and document extraction workflows
Base model: Qwen3-VL 2B
Workflow: image to JSON
Outputs: structured JSON
Multimodal fine-tuning, same workflow
Tuned Tensor now supports fine-tuning Qwen/Qwen3-VL-2B-Instruct for OCR and document extraction workflows. This adds a small multimodal base model to the same behaviour-first training flow teams already use for text-to-JSON and classification tasks.
The important product boundary is simple: visual inputs in, structured JSON out. A training example can include a page image, screenshot, scan, receipt photo, or rendered PDF page, then ask the model to return a strict JSON object.
That keeps multimodal work close to the existing Tuned Tensor loop: define the behaviour, upload examples, compare the base model, fine-tune, evaluate the candidate, and deploy the resulting model artifact.
What this unlocks
The first supported use cases are OCR and document extraction tasks where a general model gets close, but your application needs repeatable formatting, field names, normalization rules, and template-specific behaviour.
Invoices: vendor, invoice number, line items, totals
Receipts: merchant, date, taxes, payment method
Forms: named fields, checkboxes, signatures, IDs
Screenshots: UI state, visible errors, selected values
These are not general image-captioning jobs. They are operational extraction workflows: the document changes, but the output contract stays stable enough to validate downstream.
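As a rough illustration of what "stable enough to validate downstream" can mean, the sketch below checks a model output against a small contract of required keys and types. The `INVOICE_CONTRACT` mapping and the `check_contract` helper are hypothetical names used for illustration, not part of Tuned Tensor.

```python
import json

# Hypothetical contract for an invoice extractor: required key -> expected type.
INVOICE_CONTRACT = {
    "vendor": str,
    "invoice_number": str,
    "line_items": list,
    "total": str,
}


def check_contract(raw_output, contract):
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key, expected_type in contract.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for {key}")
    # Keys outside the contract are reported too, so schema drift is visible.
    for key in data:
        if key not in contract:
            problems.append(f"unexpected key: {key}")
    return problems
```

A downstream consumer could run a check like this on every response and reject or retry anything that returns a non-empty list.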
How examples are represented
Multimodal examples follow the same chat shape as text examples, with image content attached to the user message. The target remains an assistant response, so you can keep training toward strict JSON, Markdown, CSV-like text, or a short classification label.
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "image" },
        { "type": "text", "text": "Extract invoice fields as strict JSON." }
      ]
    },
    {
      "role": "assistant",
      "content": "{\"vendor\":\"Acme Ltd\",\"total\":\"1240.50\"}"
    }
  ],
  "images": ["data:image/png;base64,..."]
}
Tuned Tensor handles the model-side multimodal plumbing: image preparation, processor loading, training with visual inputs, artifact packaging, and candidate evaluation. The dataset author can stay focused on the behaviour contract.
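A row in the shape shown above could be assembled from an image file with a few lines of standard-library Python. This is a minimal sketch: `image_to_data_uri` and `build_example` are hypothetical helper names, and how rows are packaged for upload (for example as JSONL) is not specified here, so treat that part as an assumption.

```python
import base64
import json
import mimetypes
from pathlib import Path


def image_to_data_uri(path):
    """Encode an image file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_example(image_path, instruction, target):
    """Build one training row: image + instruction in, JSON string out."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": instruction},
                ],
            },
            {
                "role": "assistant",
                # Compact separators keep the target formatting consistent
                # across rows; the target is stored as a JSON string.
                "content": json.dumps(target, separators=(",", ":")),
            },
        ],
        "images": [image_to_data_uri(image_path)],
    }
```

Serializing every target through the same `json.dumps` call is a deliberate choice in this sketch: it stops incidental whitespace or quoting differences from leaking into the training targets.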
What changes from text fine-tuning
The outer loop is the same, but multimodal fine-tuning adds a few practical pieces that text-only runs do not need.
Data: Rows include image payloads as well as messages
Processing: The Qwen3-VL processor travels with the model
Context: Images consume visual tokens before text output
Evaluation: Schema, fields, OCR text, and formatting matter
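To make the context point concrete, here is a back-of-the-envelope sketch of how image size can translate into visual tokens. It assumes a patch-based vision encoder that merges neighbouring patches into one token. The `patch_px` and `merge` values are assumptions to check against the processor configuration, not confirmed Qwen3-VL settings, and real processors may also resize images before patching.

```python
import math


def estimate_visual_tokens(width, height, patch_px, merge):
    """Rough count of visual tokens for one image.

    Assumes the encoder cuts the image into patch_px x patch_px patches
    and merges each merge x merge group of patches into a single token.
    """
    side = patch_px * merge  # pixels covered by one token along each axis
    return math.ceil(width / side) * math.ceil(height / side)


def remaining_budget(context_tokens, image_tokens, prompt_tokens):
    """Tokens left for the generated output after image and instruction."""
    return context_tokens - image_tokens - prompt_tokens


# Illustrative values only: a 1024 x 1024 page with patch_px=16 and merge=2.
page_tokens = estimate_visual_tokens(1024, 1024, patch_px=16, merge=2)
```

Under these assumed values, a single full-page image can cost on the order of a thousand tokens, which is why downscaling or cropping pages matters more here than in text-only runs.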
For document extraction, evaluation should usually be field-level: JSON validity, required keys, exact-match fields, date/currency normalization, table row accuracy, and whether the output is clean enough for the next system to consume without repair.
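One possible shape for field-level scoring is sketched below. All function names are hypothetical, and the normalization rules are assumptions chosen for illustration (two-decimal amounts, day-first dates such as 01/03/2025). They are not Tuned Tensor's evaluation logic.

```python
import json
from datetime import datetime
from decimal import Decimal, InvalidOperation


def normalize_amount(value):
    """Normalize '1,240.5' or '$1240.50' to '1240.50'; None if it does not parse."""
    cleaned = str(value).replace(",", "").strip().lstrip("$€£")
    try:
        return str(Decimal(cleaned).quantize(Decimal("0.01")))
    except InvalidOperation:
        return None


def normalize_date(value, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Normalize a date string to ISO 8601; None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(str(value).strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def score_example(raw_output, expected, normalizers=None):
    """Score one prediction field by field against the expected object."""
    normalizers = normalizers or {}
    try:
        predicted = json.loads(raw_output)
    except json.JSONDecodeError:
        predicted = None
    if not isinstance(predicted, dict):
        # Invalid JSON counts as a miss on every expected field.
        return {"valid_json": False, "fields": {key: False for key in expected}}

    fields = {}
    for key, want in expected.items():
        norm = normalizers.get(key, lambda v: v)
        fields[key] = key in predicted and norm(predicted[key]) == norm(want)
    return {"valid_json": True, "fields": fields}


def field_accuracy(results):
    """Aggregate per-field accuracy over a list of score_example results."""
    totals = {}
    for result in results:
        for key, ok in result["fields"].items():
            hits, count = totals.get(key, (0, 0))
            totals[key] = (hits + int(ok), count + 1)
    return {key: hits / count for key, (hits, count) in totals.items()}
```

Reporting accuracy per field rather than per document makes regressions easier to localize: a candidate that improves totals but degrades dates shows up as exactly that.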
A better fit than generic OCR
Generic OCR is good when the task is simply “read this page.” Fine-tuning becomes useful when the same kind of document appears repeatedly and the business logic is hiding in the details: ambiguous labels, template-specific sections, noisy scans, local date formats, currency conventions, or fields that need to be inferred from layout.
A tuned Qwen3-VL workflow lets teams teach a small model those conventions directly. Instead of bolting a long prompt onto every request, you can train the behaviour into the model and evaluate it against the exact schema your application expects.
This is also where small multimodal models are most interesting: not as universal visual assistants, but as specialized extraction workers for a known document family.
Available now
Qwen/Qwen3-VL-2B-Instruct is now available as a multimodal base model in Tuned Tensor for OCR and document extraction workflows. You can use it when your task is best expressed as image plus instruction to structured JSON.
We are starting here deliberately. Multimodal input with structured text output fits Tuned Tensor’s core promise: small specialized models, behaviour specs, measurable regressions, and deployable artifacts. Broader vision tasks can come later; OCR and document extraction are the useful wedge today.