Product update

Tuned Tensor now supports fine-tuning Qwen/Qwen3-VL-2B-Instruct for OCR and document extraction workflows

Jun 24, 2026 · 4 min read

Base model: Qwen3-VL 2B

Workflow: image to JSON

Outputs: structured JSON

Multimodal fine-tuning, same workflow

Tuned Tensor now supports fine-tuning Qwen/Qwen3-VL-2B-Instruct for OCR and document extraction workflows. This adds a small multimodal base model to the same behaviour-first training flow teams already use for text-to-JSON and classification tasks.

The important product boundary is simple: visual inputs in, structured JSON out. A training example can include a page image, screenshot, scan, receipt photo, or rendered PDF page, then ask the model to return a strict JSON object.

That keeps multimodal work close to the existing Tuned Tensor loop: define the behaviour, upload examples, compare the base model, fine-tune, evaluate the candidate, and deploy the resulting model artifact.

What this unlocks

The first supported use cases are OCR and document extraction tasks where a general model gets close, but your application needs repeatable formatting, field names, normalization rules, and template-specific behaviour.

Invoices: vendor, invoice number, line items, totals

Receipts: merchant, date, taxes, payment method

Forms: named fields, checkboxes, signatures, IDs

Screenshots: UI state, visible errors, selected values

These are not general image-captioning jobs. They are operational extraction workflows: the document changes, but the output contract stays stable enough to validate downstream.

How examples are represented

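Multimodal examples follow the same chat shape as text examples, with image content attached to the user message. In the example below, the message content carries an image placeholder next to the text instruction, and the image data itself sits in a top-level images list. The target remains an assistant response, so you can keep training toward strict JSON, Markdown, CSV-like text, or a short classification label.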

{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "image" },
        { "type": "text", "text": "Extract invoice fields as strict JSON." }
      ]
    },
    {
      "role": "assistant",
      "content": "{\"vendor\":\"Acme Ltd\",\"total\":\"1240.50\"}"
    }
  ],
  "images": ["data:image/png;base64,..."]
}

Tuned Tensor handles the model-side multimodal plumbing: image preparation, processor loading, training with visual inputs, artifact packaging, and candidate evaluation. The dataset author can stay focused on the behaviour contract.
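As an illustration, here is a minimal Python sketch of assembling one row in the shape shown above from a local image file. The helper names image_to_data_uri and build_example are hypothetical and not part of any Tuned Tensor SDK; the field layout simply mirrors the example, and the MIME-type check is an assumption about what counts as an acceptable image.

```python
import base64
import json
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_example(image_path: str, instruction: str, target: dict) -> dict:
    """Build one training row mirroring the chat shape in the example above."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": instruction},
                ],
            },
            {
                "role": "assistant",
                # The target is stored as a compact JSON string, as in the example.
                "content": json.dumps(target, separators=(",", ":")),
            },
        ],
        "images": [image_to_data_uri(image_path)],
    }
```

Serializing the target with json.dumps rather than writing the string by hand keeps the escaping consistent across rows, which matters when the training goal is strict JSON.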

What changes from text fine-tuning

The outer loop is the same, but multimodal fine-tuning adds a few practical pieces that text-only runs do not need.

Data: Rows include image payloads as well as messages

Processing: The Qwen3-VL processor travels with the model

Context: Images consume visual tokens before text output

Evaluation: Schema, fields, OCR text, and formatting matter

For document extraction, evaluation should usually be field-level: JSON validity, required keys, exact-match fields, date/currency normalization, table row accuracy, and whether the output is clean enough for the next system to consume without repair.
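A rough sketch of what field-level scoring could look like in Python is below. It covers only JSON validity, required keys, and exact-match fields; the function names score_prediction and summarize are hypothetical, and this is not a description of Tuned Tensor's built-in evaluator. Date and currency normalization are left out, so in practice values would be normalized before the comparison.

```python
import json


def score_prediction(raw_output: str, expected: dict, required_keys: set[str]) -> dict:
    """Score one model output against the expected fields (hypothetical helper)."""
    all_missed = {key: False for key in expected}
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Unparseable output counts as a miss on every expected field.
        return {"valid_json": False, "has_required_keys": False, "field_matches": all_missed}
    if not isinstance(parsed, dict):
        # Valid JSON, but not an object, so no field can match.
        return {"valid_json": True, "has_required_keys": False, "field_matches": all_missed}
    return {
        "valid_json": True,
        "has_required_keys": required_keys.issubset(parsed),
        # Exact string/value comparison; normalization is assumed to happen upstream.
        "field_matches": {key: parsed.get(key) == value for key, value in expected.items()},
    }


def summarize(results: list[dict]) -> dict:
    """Aggregate per-example scores into rates over the whole evaluation set."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    fields = sorted({field for result in results for field in result["field_matches"]})
    return {
        "valid_json_rate": sum(r["valid_json"] for r in results) / total,
        "required_keys_rate": sum(r["has_required_keys"] for r in results) / total,
        "field_accuracy": {
            field: sum(r["field_matches"].get(field, False) for r in results) / total
            for field in fields
        },
    }
```

Reporting accuracy per field rather than one aggregate number makes it easier to see which part of the contract a candidate model regresses on, for example totals versus vendor names.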

A better fit than generic OCR

Generic OCR is good when the task is simply “read this page.” Fine-tuning becomes useful when the same kind of document appears repeatedly and the business logic is hiding in the details: ambiguous labels, template-specific sections, noisy scans, local date formats, currency conventions, or fields that need to be inferred from layout.

A tuned Qwen3-VL workflow lets teams teach a small model those conventions directly. Instead of bolting a long prompt onto every request, you can train the behaviour into the model and evaluate it against the exact schema your application expects.

This is also where small multimodal models are most interesting: not as universal visual assistants, but as specialized extraction workers for a known document family.

Available now

Qwen/Qwen3-VL-2B-Instruct is now available as a multimodal base model in Tuned Tensor for OCR and document extraction workflows. You can use it when your task is best expressed as image plus instruction to structured JSON.

We are starting here deliberately. Multimodal input with structured text output fits Tuned Tensor’s core promise: small specialized models, behaviour specs, measurable regressions, and deployable artifacts. Broader vision tasks can come later; OCR and document extraction are the useful wedge today.
