4 Chinese Open-Weights Models in 12 Days: The AI Inference Cost War Accelerates

Abhishek Gautam | May 10, 2026 | 6 min read

Quick summary

GLM-5.1, MiniMax M2.7, Kimi K2.6, and DeepSeek V4 all shipped within 12 days, priced at under one-third of Claude Opus inference cost. Chinese domestic AI chip market share is projected to reach 50% in 2026.

The Four Models

GLM-5.1 (Z.ai): The latest version of the General Language Model series from Z.ai, the commercial entity that spun out of Tsinghua University's KEG research lab. GLM-5.1 is notable for its code generation and agentic task performance — specifically, multi-step tool use in software engineering contexts. Z.ai positions it as a direct competitor to GPT-4o and Claude Sonnet for enterprise developer workflows. Inference is available via Z.ai's API and as an open-weights download for self-hosting.

MiniMax M2.7: MiniMax is a Shanghai-based AI company backed by Alibaba and Tencent. M2.7 is a mixture-of-experts architecture with 7 active experts out of 32 total — the routing mechanism selects which expert clusters handle each token, reducing active parameter count and inference cost dramatically relative to a dense model of equivalent total parameter count. MiniMax positions M2.7 for long-context applications: the context window is 1 million tokens, making it competitive with Gemini 1.5 for document analysis and long-form code reasoning.

Kimi K2.6 (Moonshot AI): Moonshot AI's Kimi model is well-known in China for its long-context capabilities. K2.6 extends that strength with improved instruction following and agentic tool use. The specific improvement from K2.5 to K2.6 is in multi-turn structured output — the model maintains consistent JSON schema compliance across extended conversations, which is the specific failure mode that makes AI agents unreliable in production engineering workflows.

DeepSeek V4: V4 is the latest in a series of models that have consistently surprised Western observers with their performance per dollar. DeepSeek V3 (released December 2024) demonstrated that a model trained on export-compliant Nvidia H800 GPUs could match H100-trained models on many benchmarks through algorithmic efficiency improvements. V4 continues that trajectory with improved reasoning on mathematical and scientific tasks.
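To make the mixture-of-experts idea behind MiniMax M2.7 concrete, here is a minimal sketch of top-k expert routing in plain Python. It uses the 7-of-32 figures quoted above, but everything else is an assumption for illustration: the softmax gate, the renormalisation step, and the random router logits are generic textbook choices, not MiniMax's published router.

```python
import math
import random

NUM_EXPERTS = 32  # total experts, as described for M2.7 above
TOP_K = 7         # experts activated per token, as described above


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and renormalise their gate weights.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    This is generic top-k gating, assumed here for illustration only.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


if __name__ == "__main__":
    rng = random.Random(0)
    # Random logits stand in for the output of a learned router layer.
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    assignment = route_token(logits)
    print("experts used for this token:", sorted(i for i, _ in assignment))
    print(f"share of experts active per token: {TOP_K / NUM_EXPERTS:.1%}")
```

With 7 of 32 experts active, each token touches roughly 22% of the expert parameters, which is where the inference saving relative to a dense model of the same total size comes from.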

The Benchmark Picture

LM Council arena Elo scores as of May 10, 2026:

Provider	Arena Elo
Anthropic	1,503
xAI (pre-SpaceXAI)	1,495
Google	1,494
OpenAI	1,481
Alibaba	1,449
DeepSeek	1,424

The gap between the Western frontier labs (1,481-1,503) and the leading Chinese provider (Alibaba at 1,449) is 32-54 Elo points. In chess, 54 Elo points is roughly the difference between a 1700-rated and a 1750-rated player: meaningful but not insurmountable. The gap has been closing for 18 months.
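One way to read an Elo gap is as an expected head-to-head win rate. The sketch below applies the standard Elo logistic formula to the table above. It assumes the LM Council Arena scores follow the conventional 400-point, base-10 Elo scale, which the leaderboard figures alone do not confirm.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for A against B (1 = win, 0.5 = draw)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Scores copied from the table above.
ARENA_ELO = {
    "Anthropic": 1503,
    "xAI": 1495,
    "Google": 1494,
    "OpenAI": 1481,
    "Alibaba": 1449,
    "DeepSeek": 1424,
}

if __name__ == "__main__":
    baseline = ARENA_ELO["Alibaba"]
    for provider in ("Anthropic", "xAI", "Google", "OpenAI"):
        gap = ARENA_ELO[provider] - baseline
        win_rate = expected_score(ARENA_ELO[provider], baseline)
        print(f"{provider:<10} gap={gap:>3}  expected preference vs Alibaba: {win_rate:.1%}")
```

Under that assumption, the 54-point lead corresponds to being preferred in about 58% of head-to-head comparisons, and the 32-point lead to about 55%.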

Anthropic's Claude Mythos Preview leads the GPQA Diamond benchmark at 94.6% — the best score on record for graduate-level science reasoning. But GPQA Diamond is a benchmark that favours large, densely trained models. On the agentic coding benchmarks that more directly translate to developer value (SWE-bench, Multi-Turn Code, AgentBench), the Chinese models are at or near parity with Western counterparts.

The inference cost comparison: running 1 million input tokens through Claude Opus on Anthropic's API costs approximately $15. The same volume through Kimi K2.6's API costs approximately $4.50. The cost of self-hosted DeepSeek V4 depends on your own hardware, but at scale it runs below $2 per million tokens on Huawei Ascend hardware.
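To see what those per-million prices mean for a monthly bill, here is a small sketch that scales them to a workload. The three prices are the figures quoted above, with $2 used as an upper bound for the self-hosted case. The 5-billion-token monthly volume is a hypothetical workload chosen only for illustration.

```python
# USD per million tokens, as quoted in this article.
PRICE_PER_MILLION_USD = {
    "Claude Opus (API)": 15.00,
    "Kimi K2.6 (API)": 4.50,
    "DeepSeek V4 (self-hosted, upper bound)": 2.00,
}


def monthly_cost(tokens_per_month, price_per_million):
    """Monthly spend in USD for a given token volume and per-million price."""
    return tokens_per_month / 1_000_000 * price_per_million


def cost_ratio(baseline_price, other_price):
    """How many times cheaper `other_price` is than `baseline_price`."""
    return baseline_price / other_price


if __name__ == "__main__":
    volume = 5_000_000_000  # hypothetical workload: 5B tokens per month
    baseline = PRICE_PER_MILLION_USD["Claude Opus (API)"]
    for name, price in PRICE_PER_MILLION_USD.items():
        bill = monthly_cost(volume, price)
        ratio = cost_ratio(baseline, price)
        print(f"{name:<40} ${bill:>10,.0f}/month  ({ratio:.1f}x cheaper than Opus)")
```

At that volume the quoted prices work out to $75,000, $22,500, and at most $10,000 per month, which is the 3x to 7x range discussed in this article.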

Why This Happened: The Export Control Effect

The US imposed export controls on Nvidia H100, H800, and subsequently H20 chips to China beginning in 2022 and tightening through 2025. The intended effect: slow China's frontier AI development by restricting access to training compute.

The actual effect observed in May 2026:

Algorithmic efficiency improvement: Chinese AI labs, unable to simply scale compute to improve model performance, focused engineering effort on training efficiency. DeepSeek's Multi-head Latent Attention, MiniMax's mixture-of-experts routing, and GLM's sparse training techniques are all algorithmic innovations driven by compute scarcity.

Inference optimization: Models optimised for efficient inference on domestically available hardware (Huawei Ascend, Cambricon) ended up with better inference cost characteristics than Western models optimised for training on H100 clusters. Training efficiency and inference efficiency are related but not identical, and gains in one often carry over to the other.

Open weights as strategy: Publishing models as open weights removes the deployment cost barrier for developers worldwide and creates a global community of fine-tuners and evaluators. DeepSeek and GLM's open-weights releases have generated adoption outside China that no proprietary model could achieve at equivalent quality levels.

Domestic chip ecosystem: Chinese domestic AI chip market share is projected to hit 50% in 2026. Huawei's Ascend 910B is now the default training hardware for Chinese AI labs. The supply chain that the export controls were intended to disrupt has been replaced domestically.

What This Means for Developers

If you build AI-powered applications:

Pricing pressure on API providers: The inference cost gap, 3x to 7x depending on model and use case, creates real economic pressure on Anthropic, OpenAI, and Google to reduce pricing. DeepSeek V3's release in December 2024 preceded three rounds of Anthropic price reductions. Expect similar pressure from the May 2026 model burst.

Non-US market deployments: For applications serving markets outside the US where GDPR and data residency requirements do not force Western provider selection, the Chinese models are economically compelling. A startup in Southeast Asia building a code assistant at Chinese model pricing has fundamentally different unit economics than one using Claude or GPT-4o.

Open weights option for self-hosting: DeepSeek V4 and GLM-5.1 are available as open weights for self-hosted deployment. The hardware requirement is 8x Nvidia A100 or equivalent for full inference. For high-volume applications where API cost is the dominant variable, self-hosted deployment at these model quality levels is increasingly viable.

Benchmark gap context: The 32-54 Elo point gap between Chinese and Western frontier models is real but narrow enough that it will not be the primary decision factor for most applications. Task-specific evaluation on your actual use case matters more than aggregate Arena Elo.
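Whether self-hosting pays off depends mostly on how busy the cluster is. The sketch below estimates an effective cost per million tokens for a rented 8-GPU node and the utilisation at which it undercuts an API price. The $16 per hour node price and the 2,000 tokens per second throughput are invented placeholders, not measurements of DeepSeek V4 or GLM-5.1. Substitute your own quotes and benchmark numbers.

```python
def self_hosted_cost_per_million(cluster_usd_per_hour, tokens_per_second, utilisation):
    """Effective USD per million tokens when the cluster is busy `utilisation` of the time."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return cluster_usd_per_hour / tokens_per_hour * 1_000_000


def break_even_utilisation(cluster_usd_per_hour, tokens_per_second, api_usd_per_million):
    """Utilisation above which self-hosting is cheaper than the given API price.

    A result above 1.0 means the cluster cannot beat the API even at full load.
    """
    full_load_cost = cluster_usd_per_hour / (tokens_per_second * 3600) * 1_000_000
    return full_load_cost / api_usd_per_million


if __name__ == "__main__":
    NODE_USD_PER_HOUR = 16.0   # hypothetical: 8 GPUs at $2/hour each
    TOKENS_PER_SECOND = 2000   # hypothetical aggregate throughput
    KIMI_API_PRICE = 4.50      # USD per million tokens, quoted in this article

    for utilisation in (0.25, 0.5, 1.0):
        cost = self_hosted_cost_per_million(NODE_USD_PER_HOUR, TOKENS_PER_SECOND, utilisation)
        print(f"utilisation {utilisation:.0%}: ${cost:.2f} per million tokens")

    threshold = break_even_utilisation(NODE_USD_PER_HOUR, TOKENS_PER_SECOND, KIMI_API_PRICE)
    print(f"break-even utilisation vs Kimi K2.6 API: {threshold:.0%}")
```

With these placeholder inputs the node needs to stay busy a little under half the time to beat the $4.50 API price. Idle capacity is what makes self-hosting expensive, so the calculation favours steady, high-volume workloads.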

Key Takeaways

4 models in 12 days: GLM-5.1 (Z.ai), MiniMax M2.7, Kimi K2.6 (Moonshot), DeepSeek V4 — all frontier-competitive on agentic coding benchmarks, all at under 1/3 Claude Opus inference pricing
Benchmark gap: Western frontier (Elo 1,481-1,503) vs. leading Chinese provider Alibaba (1,449), a gap of 32-54 Elo points that has been closing for 18 months; GPQA Diamond gap is larger, agentic coding gap is nearly closed
Export control backfire: Compute scarcity drove algorithmic efficiency improvements; inference optimization for domestic hardware produced cost advantages; open weights strategy drives global adoption
50% domestic chip market share: Huawei Ascend 910B is now the default training hardware for Chinese AI labs; supply chain disruption goal has not been achieved
Developer economics: $15/M tokens (Claude Opus) vs. ~$4.50/M tokens (Kimi K2.6 API) vs. <$2/M tokens (DeepSeek V4 self-hosted at scale) — the gap is economically significant for high-volume applications
Open weights available: DeepSeek V4 and GLM-5.1 available for self-hosting; 8x A100 or equivalent required for full inference

For the US-China trade truce and semiconductor export control negotiations happening simultaneously, read US-China Trade Truce May 12: Chip Export Controls at the Beijing Summit. For the best AI model comparisons, read Claude vs ChatGPT: Which AI Should You Use? and the LLM API Pricing Tracker.

Frequently Asked Questions

What Chinese AI models were released in May 2026 and how do they compare to Claude and GPT-4?

Four Chinese open-weights models dropped in a 12-day window in early May 2026: GLM-5.1 (Z.ai, strong on agentic coding), MiniMax M2.7 (mixture-of-experts with 1M token context), Kimi K2.6 (Moonshot AI, improved multi-turn structured output), and DeepSeek V4 (continued efficiency improvements over V3). All are competitive with Western frontier models on agentic engineering benchmarks. LM Council Arena Elo scores show Western frontier labs at 1,481-1,503 and the leading Chinese provider, Alibaba, at 1,449. That 32-54 point gap has been closing for 18 months.

How much cheaper are Chinese AI models compared to Claude and OpenAI?

Running 1 million tokens through Claude Opus via Anthropic's API costs approximately $15. The equivalent through Kimi K2.6's API costs approximately $4.50 — roughly one-third the cost. Self-hosting DeepSeek V4 on Huawei Ascend hardware at scale runs below $2 per million tokens. The cost gap is most pronounced for inference-heavy, high-volume applications. For developers building in non-US markets where data residency does not mandate Western providers, the economic case for Chinese models is significant.

Did US chip export controls fail to slow down Chinese AI?

The export controls slowed training compute scaling but drove compensating algorithmic efficiency improvements that the strategy did not anticipate. Chinese labs, unable to simply add more H100s, invested engineering effort in training efficiency (DeepSeek's Multi-head Latent Attention, MiniMax's MoE routing) and inference optimization for domestically available hardware. The result: models that are competitive on deployed capability benchmarks while running at 3-7x lower inference cost than Western frontier models. Chinese domestic AI chip market share is projected at 50% in 2026, replacing the supply chain the controls were intended to disrupt.

Can I self-host Chinese open-weights models like DeepSeek V4?

Yes. DeepSeek V4 and GLM-5.1 are available as open weights for self-hosted deployment. Full inference requires approximately 8x Nvidia A100 80GB or equivalent GPU capacity — this is not a small server deployment, but it is within reach for engineering teams running on-premises infrastructure or cloud GPU instances. For high-volume applications where API cost is the dominant cost variable, self-hosted deployment at these model quality levels is increasingly economically justified. The Chinese labs publish model weights on Hugging Face alongside their API services.
