Field note

Fine-tune Qwen 3.5 2B for financial sentiment analysis

Jun 20, 20264 min read

Start with a narrow extraction task

We wanted a small financial model that did one thing cleanly: read a finance-related tweet or market social post and return a structured sentiment signal. Not a general market assistant. Not a trading system. Just a predictable extractor with a JSON contract.

The source data was public and inspectable: the MIT-licensed zeroshot/twitter-financial-news-sentiment dataset on Hugging Face. The labels map to 0=bearish, 1=bullish, and 2=neutral.

That shape is exactly where small fine-tunes are useful. The base model already understands finance language. The fine-tune teaches it to obey the task boundary, preserve the schema, and make the label choice consistently.

The output contract

The behaviour spec asks for strict JSON with exactly three fields:sentiment, label, and rationale. The sentiment string and numeric label must agree, and the rationale stays to one short sentence grounded in the post.

Extract the market sentiment signal from this finance-related social post.
Return only strict JSON with exactly these keys: sentiment, label, rationale.
sentiment must be one of bearish, bullish, neutral; label must be one of 0, 1, 2.

Post: $NVDA shares jump after analysts raise price targets on stronger AI chip demand.

{"sentiment":"bullish","label":1,"rationale":"The post expresses a bullish market signal."}

Prepare the dataset

We downloaded the dataset locally, removed duplicate rows, stripped control characters, and built a balanced training file. The goal was not to squeeze every row into the run. It was to keep the first public version small enough to inspect and large enough to move the model.

Raw rows

11,931

Deduped rows

11,924

Balanced train rows

5,100

The balanced training file used 1,700 bearish, 1,700 bullish, and 1,700 neutral examples. A separate balanced holdout set kept 60 rows per class aside.

Train and compare

We fine-tuned Qwen/Qwen3.5-2B for one epoch with Tuned Tensor. The run used 4,080 training rows after the normal train/validation/test split and evaluated the base model against the tuned model on capped validation and test sets.

Validation avg score

0.819 to 0.903

Validation pass rate

79.2% to 86.7%

Test avg score

0.834 to 0.875

Test pass rate

80.0% to 85.8%

The tuned model also preserved the format contract: valid JSON, strict JSON, and expected schema keys were all at 100% in the run diagnostics.

tt runs start <spec-id> \
  --dataset <dataset-id>

tt runs diagnose <run-id>

Run it locally before publishing

After the run completed, we downloaded the model artifact and served it through the Tuned Tensor local runtime. That gave us an OpenAI-compatible local endpoint for quick real-prompt tests before publishing anything.

tt models download <model-id> \
  --output models/qwen35-2b-financial-sentiment.tar.gz

tt models serve models/qwen35-2b-financial-sentiment.tar.gz \
  --spec tunedtensor.json \
  --host 127.0.0.1 \
  --port 8001 \
  --device auto \
  --temperature 0 \
  --max-tokens 96

We ran 20 hand-curated finance/social examples through the local model. All 20 returned valid JSON with the expected classification. That is not a production benchmark, but it is a useful smoke test: the downloaded model, tokenizer, runtime, and behaviour spec all worked together outside the training service.

Ship the artifacts

The final model is public on Hugging Face as tunedtensor/qwen3.5-2b-financial-sentiment. The repo includes the merged model weights, tokenizer/config files, training metrics, local smoke-test outputs, and the behaviour spec used for training.

We also added the spec to the tunedtensor/community-specs library so the workflow is easy to copy, inspect, and adapt.

What we learned

The useful unit is not just a model. It is the bundle: public data, validation notes, behaviour spec, run metrics, local tests, and a model card. When those pieces move together, the result is easier to audit than a notebook checkpoint with a few screenshots.

The remaining hard cases are the ones you would expect: mixed signals, contrarian framing, sarcasm, ticker-heavy slang, and old market news that should not be read as current advice. That is the next dataset loop.

Start a run View the spec