Amazon Kills AI Leaderboard After Engineers Inflate Token Bills
Quick summary
Kirorank ranked staff on AI usage until tokenmaxxing drove up compute spend in a year with roughly $200B of planned capex. Amazon now tracks shipped code, not tokens.
Amazon took its internal Kirorank AI leaderboard offline on May 29, 2026, after employees inflated usage scores by running pointless tasks through AI agents, a practice staff called tokenmaxxing. The beta dashboard tracked activity on Amazon's Kiro developer platform and had been tied to pressure for more than 80% of developers to use AI weekly. The backdrop is roughly $200 billion in planned 2026 capital spending, mostly on AI and data centers. Senior vice president Dave Treadwell told staff not to use AI for its own sake. Amazon now emphasizes normalized deployments, meaning AI-assisted code that actually ships, rather than raw token burn.
What was Kirorank and why did Amazon kill it?
Kirorank was an internal scoring system that ranked employees by AI tool usage on Kiro, Amazon's AI-forward developer environment. Financial Times reporting, cited widely on May 29, said workers gamed the board by assigning low-value work to agents via Kiro, MeshClaw, and related internal tools to climb rankings. That behavior raised cloud compute bills without improving products.
Amazon confirmed the dashboard was not a formal or approved tool and has been deprecated. The company framed the shutdown as cost control and anti-gaming, not a retreat from AI adoption.
What is tokenmaxxing inside enterprises?
Tokenmaxxing is gamifying LLM usage metrics: running verbose refactors, auto-replying to low-priority email, or spawning agent loops whose output nobody merges, solely because a leaderboard rewards token volume. It mirrors social media engagement hacking, but the spend hits real GPU budgets.
Meta reportedly saw a similar pattern when internal AI usage scores became career signals. Amazon's response, normalized deployments, is the right leading indicator: did AI help ship vetted code to production?
The $200B capex tension behind the headline
Amazon's 2026 capex story is dominated by AI infrastructure, the same macro pressure behind Amazon's multi-billion Anthropic and Trainium bets. Letting tens of thousands of engineers inflate tokens for leaderboard points is how you turn a strategic investment into an operating expense fire.
Finance and platform teams will export this lesson: never publish a single-metric AI adoption KPI without guardrails on productive output.
What normalized deployments means for engineering managers
Normalized deployments measure how often developers use AI to produce useful, merged code, not how many tokens they consume in Slack or email triage. That metric is harder to game and closer to value.
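Amazon has not published a formula for normalized deployments, so the following is only a minimal sketch of what such a metric could look like. It assumes a hypothetical `PullRequest` record with an `ai_assisted` flag and counts AI-assisted PRs that were both merged and deployed, divided by the number of active developers. A second helper shows tokens spent per shipped PR, which makes leaderboard-style token burn visible. All names and fields here are illustrative assumptions, not anything from Kiro or Kirorank.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record; field names are assumptions for illustration."""
    author: str
    ai_assisted: bool   # assumed to be set by tooling or a PR label
    merged: bool
    deployed: bool      # reached production
    tokens_used: int    # tokens attributed to the work on this PR


def shipped_ai_prs(prs: Iterable[PullRequest]) -> List[PullRequest]:
    """AI-assisted PRs that were merged and reached production."""
    return [pr for pr in prs if pr.ai_assisted and pr.merged and pr.deployed]


def normalized_deployments(prs: Iterable[PullRequest], active_developers: int) -> float:
    """Shipped AI-assisted PRs per active developer (one possible normalization)."""
    if active_developers <= 0:
        raise ValueError("active_developers must be positive")
    return len(shipped_ai_prs(prs)) / active_developers


def tokens_per_shipped_pr(prs: Iterable[PullRequest]) -> float:
    """Total tokens spent divided by shipped AI-assisted PRs.

    Returns infinity when nothing shipped, so pure token burn stands out.
    """
    prs = list(prs)
    shipped = shipped_ai_prs(prs)
    total_tokens = sum(pr.tokens_used for pr in prs)
    return total_tokens / len(shipped) if shipped else float("inf")


sample = [
    PullRequest("a", True, True, True, 40_000),
    PullRequest("a", True, False, False, 900_000),  # large spend, never merged
    PullRequest("b", True, True, True, 60_000),
    PullRequest("b", False, True, True, 0),         # shipped without AI
]
rate = normalized_deployments(sample, active_developers=2)
cost = tokens_per_shipped_pr(sample)
```

In this sample the unmerged 900,000-token PR does nothing for the deployment rate but dominates the tokens-per-shipped-PR figure, which is the kind of signal a raw usage leaderboard hides.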
For individual contributors, the implication is blunt: your org will notice if Copilot, Kiro, or Claude Code sessions do not correlate with PRs that pass review. Vanity agent hours are now a cost center risk.
Lessons for every team running AI coding tools
- Do not rank engineers on raw token usage. If you must track adoption, pair usage with merge rate, defect rate, and cycle time.
- Cap autonomous agent loops in CI and internal bots unless their outputs attach to tickets with owners.
- Charge tokens back to cost centers so teams see the marginal cost of leaderboard chasing.
- Use the LLM API pricing tracker at /tools/llm-api-pricing to model spend before internal competitions go viral.
- Align with Amazon's public caution: Treadwell's "do not use AI for the sake of AI" line is the enterprise version of "don't ship microservices for resume-driven architecture."
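Two of the guardrails above, capping agent loops that lack a ticket owner and charging tokens back to cost centers, can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any Amazon tooling: `AgentRun`, `BudgetExceeded`, `chargeback`, and the `usd_per_million_tokens` rate are all hypothetical names and values.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


class BudgetExceeded(RuntimeError):
    """Raised when an agent run would exceed its step or token cap."""


@dataclass
class AgentRun:
    """Hypothetical bookkeeping for one autonomous agent run."""
    ticket_owner: Optional[str]  # None means nobody owns the output
    cost_center: str
    max_tokens: int
    max_steps: int
    tokens_used: int = 0
    steps: int = 0

    def record_step(self, tokens: int) -> None:
        """Account for one agent step, refusing unowned or over-budget work."""
        if not self.ticket_owner:
            raise PermissionError("agent run has no ticket owner")
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded("step cap reached")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        self.steps += 1
        self.tokens_used += tokens


def chargeback(runs: Iterable[AgentRun], usd_per_million_tokens: float) -> Dict[str, float]:
    """Sum token cost per cost center at an assumed flat rate."""
    totals: Dict[str, float] = defaultdict(float)
    for run in runs:
        totals[run.cost_center] += run.tokens_used / 1_000_000 * usd_per_million_tokens
    return dict(totals)


run = AgentRun("alice", "payments", max_tokens=1_000_000, max_steps=2)
run.record_step(400_000)
run.record_step(600_000)
try:
    run.record_step(1)  # a third step exceeds max_steps
    capped = False
except BudgetExceeded:
    capped = True

bill = chargeback([run], usd_per_million_tokens=10.0)
```

A flat per-token rate is a simplification; real pricing usually differs by model and by input versus output tokens, so the rate would need to come from actual billing data.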
Connection to broader AI economics
Uber reportedly exhausted its 2026 AI budget by April with little consumer-facing impact, per News18 summaries of industry reporting. Amazon's leaderboard shutdown is the same story at a different scale: usage without outcomes is unsustainable when inference is metered.
Anthropic's reported $965B valuation and OpenAI's reported $852B valuation assume revenue scales with productive use, not vanity tokens. Enterprise buyers will likely demand normalized-deployments-style metrics in vendor reviews within a year.
Key Takeaways
- May 29, 2026: Amazon deprecated Kirorank, an internal AI usage leaderboard tied to Kiro
- Tokenmaxxing inflated scores via pointless agent tasks, spiking compute costs
- Amazon targets 80%+ weekly developer AI use while planning ~$200B 2026 capex weighted to AI
- New success metric: normalized deployments (useful shipped code) replaces raw usage leaderboards
- For developers: expect employers to gate AI metrics on production outcomes, not token volume
- What to watch: whether AWS productizes deployment-quality analytics for enterprise customers
Frequently asked questions
What is Amazon Kirorank?
Kirorank was an internal beta dashboard that scored Amazon employees on AI activity on the Kiro developer platform. Amazon shut it down on May 29, 2026, after workers gamed it through tokenmaxxing.
What is tokenmaxxing at Amazon?
Tokenmaxxing means running low-value tasks through AI agents primarily to increase usage scores and leaderboard rank, which raised compute spending without improving products.
Why did Amazon remove the AI leaderboard?
Amazon said the tool was not formal or approved, and it encouraged misuse that increased costs. The company shifted focus to normalized deployments measuring useful code output instead.
What did Dave Treadwell tell employees?
Treadwell reportedly urged staff not to use AI just for the sake of using it, acknowledging the leaderboard had good intentions but created extra cost and perverse incentives.
What should engineering teams learn from Kirorank?
Do not reward raw token usage. Tie AI adoption metrics to merged code, incident rates, and cycle time, and cap agent automation that lacks ticket owners.
How much is Amazon spending on AI infrastructure?
Reporting in May 2026 cited roughly $200 billion in planned 2026 capital expenditure, with most directed toward AI systems and data center expansion.
What metric replaces Kirorank?
Amazon is focusing on normalized deployments, measuring how often AI helps produce useful code that reaches production, rather than raw token or activity scores.
Written by
Software Engineer based in Delhi, India. Writes about AI models, semiconductor supply chains, and tech geopolitics — covering the intersection of infrastructure and global events. 952+ posts cited by ChatGPT, Perplexity, and Gemini. Read in 167 countries.
