Alibaba's Qwen 3.5 9B Beats GPT-OSS-120B on 3 Benchmarks — Runs on a Laptop
Quick summary
Qwen 3.5's 9B model outscores OpenAI's 120B model on GPQA Diamond, MMLU-Pro, and multilingual tests. Apache 2.0, multimodal, runs offline on 16GB RAM. Full developer guide.
Read next
- Trump 145% China Tariff: GPU, iPhone, and Dev Hardware CostsTrump paused tariffs 90 days for most countries at 10% but raised China to 145% on April 9. What it means for GPU prices, TSMC, iPhone, and developer budgets.
- TSMC Q1 2026: $35.7B Record Revenue, AI Chip Demand Holds at 35%TSMC posted $35.7B in Q1 2026 revenue — up 35% YoY, a new record. N2 2nm chips entering volume production. AI accelerator CAGR revised to 54%. What it means for GPU pricing and developers.
Alibaba's Qwen team released the Qwen 3.5 Small Model Series on March 1, 2026: four models at 0.8B, 2B, 4B, and 9B parameters, all multimodal, all capable of running offline on consumer hardware. The 9B model beats OpenAI's GPT-OSS-120B on three major benchmarks while being 13 times smaller.
The scores: MMLU-Pro (82.5 vs 80.8), GPQA Diamond (81.7 vs 80.1), multilingual MMMLU (81.2 vs 78.2). The 9B also beats Gemini 2.5 Flash-Lite on MMMU-Pro visual reasoning (70.1 vs 59.7). These are not narrow tests — they cover graduate-level science reasoning, multilingual performance, and visual understanding, the three categories that matter most for real-world developer tasks.
All four models are Apache 2.0 licensed and available on Hugging Face. No restrictions, no fees, no usage caps.
The Four Models: What Runs Where
Qwen 3.5 0.8B — Fits in RAM on mid-range smartphones. Useful for classification, extraction, and intent detection where latency is critical and API costs aren't justified. Think on-device search query understanding or lightweight form parsing.
Qwen 3.5 2B — Runs on flagship phones (iPhone 16 Pro, Pixel 9 Pro) and entry-level laptops. Handles summarisation, light code generation, and document Q&A in offline or edge deployments. Faster than cloud API calls for simple tasks.
Qwen 3.5 4B — The practical sweet spot for laptop deployments. Any laptop with 8GB RAM. Handles multi-turn conversation, structured output, and moderate code generation with good speed. This is the model to use for developer tooling prototypes.
Qwen 3.5 9B — Requires 16GB RAM or a consumer GPU (RTX 3060 or equivalent). This is the benchmark winner. Use it for complex reasoning, multimodal document processing, and production edge hardware. On Apple Silicon (M2 Pro and above), inference is fast enough to be practical.
Why Early Fusion Changes Everything
The key architectural shift in Qwen 3.5 is early-fusion multimodal training. Most small multimodal models — including earlier Qwen versions — train the language model first, then attach a vision encoder. These two modules were never trained to communicate in the same representational space, so the vision encoder learns to approximate translation into the language model's space. Approximation introduces errors.
Qwen 3.5 trains text and image tokens jointly from the start. The model reasons about images and text in a unified representation. This is why it scores 70.1 on MMMU-Pro visual reasoning versus 59.7 for Gemini 2.5 Flash-Lite and 57.4 for GPT-5-Nano — both of which are late-fusion. The visual reasoning gap is not a marginal improvement. It's the difference between a model that actually understands image-text relationships and one that estimates them.
The models also use grouped query attention (GQA) and sliding window attention for middle layers, reducing memory footprint for long-context tasks. Standard transformer attention scales quadratically with sequence length — this becomes a real problem at 8K+ tokens on consumer hardware. The hybrid attention keeps memory usage manageable even when processing long documents or image sequences.
Full Benchmark Table
| Model | MMLU-Pro | GPQA Diamond | MMMLU (multilingual) | MMMU-Pro (vision) |
|---|---|---|---|---|
| Qwen 3.5 9B | 82.5 | 81.7 | 81.2 | 70.1 |
| GPT-OSS-120B | 80.8 | 80.1 | 78.2 | — |
| Gemini 2.5 Flash-Lite | — | — | — | 59.7 |
| GPT-5-Nano | — | — | — | 57.4 |
The multilingual result deserves attention. 81.2 vs 78.2 on MMMLU is a substantial gap. Alibaba's training data includes higher-quality Chinese, Arabic, and Southeast Asian content than most US lab datasets. For developers building products for non-English markets — which is most of the world — this is a material advantage over GPT-OSS-120B.
Developer Use Cases
Local LLM on a developer's machine. The 4B and 9B run with Ollama or LM Studio on a standard workstation. Zero API costs, zero network latency, no data leaving your machine. For code review, document parsing, and prototyping, the 9B running locally is competitive with paid API calls to mid-tier cloud models. Nvidia's Nemotron 3 Super is stronger on pure coding benchmarks, but Qwen 3.5 9B wins on vision and multilingual.
Edge and air-gapped deployment. Industrial IoT, on-premise healthcare, legal tech with data residency requirements — contexts where data can't go to a cloud API. The 4B and 9B are the first small models capable enough for production-quality responses in these environments without significant quality degradation.
Mobile AI. The 0.8B and 2B run on Android and iOS. Unlike Apple's Gemini-powered Siri (which only runs in Apple's controlled stack), Qwen 3.5 small models are available to any mobile developer, on any platform, offline. For apps targeting emerging markets where connectivity is unreliable, the 2B offline is more reliable than any cloud API.
Multimodal document pipelines. The 9B processes documents that mix text and images — invoices, engineering diagrams, medical scans, product photos — without the quality drop that comes from late-fusion models struggling to bridge visual and text representations. For any workflow that currently sends images to GPT-4V or Claude for parsing, Qwen 3.5 9B is worth benchmarking locally.
Fine-tuning on domain data. Small models are cheaper to fine-tune than large ones. A Qwen 3.5 9B fine-tuned on your domain data will outperform a generic GPT-OSS-120B on your specific task. Apache 2.0 means you can fine-tune, modify, and deploy commercially without restrictions or royalties.
How It Compares to the Other Small Models
vs Meta Llama 3.2 11B: Similar benchmark performance, Llama is slightly larger. Qwen 3.5 9B has better multilingual scores and native multimodal. Llama has broader community tooling and more third-party fine-tunes. Both are Apache 2.0. Pick Llama if ecosystem size matters, Qwen if multilingual or vision quality matters.
vs Microsoft Phi-4 14B: Phi-4 is larger and has stronger coding performance on some benchmarks. Qwen 3.5 9B wins on vision and multilingual. Phi-4 is MIT licensed. If your use case is primarily code generation and you have the hardware for 14B, Phi-4 is worth testing.
vs Google Gemma 3 9B: Direct size comparison. Qwen 3.5 9B outperforms Gemma 3 9B on multilingual and visual benchmarks. Gemma integrates better with Google's Vertex AI toolchain. Choose Gemma if you're already in the Google ecosystem; Qwen if raw benchmark performance matters more.
vs Mistral Small 22B: Mistral is larger but has similar performance on some reasoning tasks. Qwen 3.5 9B wins clearly on size efficiency and visual capability. Mistral Small is better if you have hardware that can run 22B and need the strongest pure-text performance.
The Cost Trend This Represents
A year ago, GPT-4 class reasoning required a $20/month API subscription or expensive cloud GPU instances. Today, equivalent reasoning runs locally on a $1,500 laptop. Qwen 3.5 9B beating GPT-OSS-120B is the continuation of a pattern: DeepSeek R1 challenged OpenAI o1, Xiaomi's Hunter Alpha targeted trillion-parameter performance, and now a 9B model from Alibaba outscores a 120B model from OpenAI on the same benchmarks.
For developers deciding which AI infrastructure to build on: cloud API dependency is increasingly optional for many tasks. The decision between local small model and cloud frontier model is now a product architecture choice — privacy vs capability, cost vs performance — not a forced choice between capable and unusable.
Key Takeaways
- Qwen 3.5 9B beats GPT-OSS-120B (13x larger) on GPQA Diamond (81.7 vs 80.1), MMLU-Pro (82.5 vs 80.8), and MMMLU multilingual (81.2 vs 78.2)
- Four sizes: 0.8B, 2B, 4B, 9B — all multimodal, all Apache 2.0, all on Hugging Face, no commercial restrictions
- Early-fusion vision: trains text and images jointly, giving a 10+ point MMMU-Pro advantage over late-fusion competitors
- Hardware: 4B on any 8GB RAM laptop; 9B on 16GB RAM or RTX 3060
- Best use cases: local developer LLM, edge/air-gapped deployment, mobile AI on Android, multimodal document parsing, domain fine-tuning
- The trend: small models are matching 12-month-old frontier model performance — capable AI no longer requires cloud API access
FAQ
Frequently Asked Questions
How does Qwen 3.5 9B compare to GPT-4o?
Qwen 3.5 9B is a small local model and does not match GPT-4o or Claude Sonnet for complex agentic tasks, long-form reasoning, or nuanced instruction following at frontier scale. Where it competes is on knowledge benchmarks: it outscores GPT-OSS-120B on GPQA Diamond (81.7 vs 80.1) and MMLU-Pro (82.5 vs 80.8), which is impressive for a 9B model. For practical developer tasks — code review, document Q&A, structured extraction, multimodal parsing — Qwen 3.5 9B running locally is a credible alternative to paid API calls on mid-tier cloud models.
Can Qwen 3.5 run on a MacBook or Windows laptop?
Yes. The 4B model runs on any laptop with 8GB RAM, including an M2 MacBook Air. The 9B model needs 16GB RAM, covering most MacBook Pros and mid-range Windows laptops. You can run it via Ollama (easiest), LM Studio (better UI), or llama.cpp directly. On Apple Silicon, the Metal GPU backend handles matrix operations efficiently, making inference significantly faster than on Intel/AMD CPUs.
What makes Qwen 3.5 multimodal different from other small vision models?
Qwen 3.5 uses early-fusion multimodal training — text and image tokens are trained together from the start, not added as a separate vision encoder after language training. This gives the model a unified representation of text and images rather than an approximation. In practice: 70.1 on MMMU-Pro visual reasoning versus 59.7 for Gemini 2.5 Flash-Lite and 57.4 for GPT-5-Nano, both late-fusion. For document understanding, diagram parsing, and mixed image-text tasks, early fusion is a meaningful improvement.
Is Qwen 3.5 free for commercial use?
Yes. All four models (0.8B, 2B, 4B, 9B) are Apache 2.0 licensed. You can use them commercially, fine-tune them, modify them, and deploy them in products without paying Alibaba or crediting them beyond the license attribution. Weights are on Hugging Face with no usage restrictions, seat limits, or API call caps — you run the model on your own hardware.
Should I use Qwen 3.5 or a cloud API like Claude or GPT-4o?
Use Qwen 3.5 locally when: data cannot leave your machine, you need zero network latency, you want to eliminate API costs, or you are deploying to edge or offline environments. Use cloud APIs (Claude, GPT-4o, Gemini) when: you need frontier reasoning for complex agentic tasks, you're building consumer products where API cost is manageable, or you lack the hardware to run 9B models well. For prototyping and many production use cases involving document processing, classification, and structured extraction, Qwen 3.5 9B locally is sufficient and cheaper.
Free Weekly Briefing
The AI & Dev Briefing
One honest email a week — what actually matters in AI and software engineering. No noise, no sponsored content. Read by developers across 30+ countries.
No spam. Unsubscribe anytime.
More on AI
All posts →Trump 145% China Tariff: GPU, iPhone, and Dev Hardware Costs
Trump paused tariffs 90 days for most countries at 10% but raised China to 145% on April 9. What it means for GPU prices, TSMC, iPhone, and developer budgets.
TSMC Q1 2026: $35.7B Record Revenue, AI Chip Demand Holds at 35%
TSMC posted $35.7B in Q1 2026 revenue — up 35% YoY, a new record. N2 2nm chips entering volume production. AI accelerator CAGR revised to 54%. What it means for GPU pricing and developers.
NVIDIA GTC 2026: What Developers and AI Engineers Need to Know Before March 16
Jensen Huang takes the stage on March 16 and has promised to "surprise the world" with a new chip. GTC 2026 covers physical AI, agentic AI, inference, and AI factories. Here is what matters for developers building on the AI stack — and what to watch for.
DeepSeek R2 Is Out: What Every Developer Needs to Know Right Now
DeepSeek R2 just dropped. It is multimodal, covers 100+ languages, and was trained on Nvidia Blackwell chips despite US export controls. Here is what changed from R1, what the benchmarks mean, and how to use it including running it locally.
Free Tool
Will AI replace your job?
4 questions. Get a personalised developer risk score based on your stack, role, and what you actually build day to day.
Check Your AI Risk Score →Written by
Software Engineer based in Delhi, India. Writes about AI models, semiconductor supply chains, and tech geopolitics — covering the intersection of infrastructure and global events. 952+ posts cited by ChatGPT, Perplexity, and Gemini. Read in 167 countries.
