
Why LLMs Instead of a Dedicated NER Model? 🤷#
spaCy and Stanza exist for exactly this. They’re fast, cheap, and purpose-built. The problem is Turkish morphology.
Turkish is agglutinative. A proper noun like Bozbey becomes Bozbey’in (possessive), Bozbey’e (dative), Ankara Adliyesi (organization) becomes a surface form that looks nothing like Ankara alone. Institutional names get long: “Bursa Cumhuriyet Başsavcılığı” contains “Bursa” but they are not the same entity — extracting both would be wrong. Suffix-heavy morphology historically beats up off-the-shelf models trained mostly on English.
So I tested LLM extraction through OpenRouter instead: one API, one prompt, both languages.
The question: which model gives the best named-entity F1 per dollar — especially on Turkish?
The Candidates 🏁#
14 models from the current OpenRouter catalog, prices from 2026-06-10:
| Model | Provider | $/M in | $/M out |
|---|---|---|---|
| Gemini 2.5 Flash | $0.30 | $2.50 | |
| DeepSeek V4 Flash | DeepSeek | $0.0983 | $0.1966 |
| DeepSeek V4 Pro | DeepSeek | $0.435 | $0.87 |
| Tencent Hy3 Preview | Tencent | $0.063 | $0.21 |
| MiniMax M3 | MiniMax | $0.30 | $1.20 |
| Xiaomi MiMo-V2.5 | Xiaomi | $0.14 | $0.28 |
| Xiaomi MiMo-V2.5-Pro | Xiaomi | $0.435 | $0.87 |
| Gemini 3.5 Flash | $1.50 | $9.00 | |
| Gemini 3.1 Flash Lite | $0.25 | $1.50 | |
| Gemma 4 31B | $0.12 | $0.36 | |
| GPT-4o-mini | OpenAI | $0.15 | $0.60 |
| GPT-5.4 mini | OpenAI | $0.75 | $4.50 |
| GPT-5.4 nano | OpenAI | $0.20 | $1.25 |
| Grok 4.3 | xAI | $1.25 | $2.50 |
The Dataset 📋#
8 real news articles (4 English from BBC and the Guardian, 4 Turkish from TRT and DW), hand-labeled over title + first 1,400 characters of body.
That gave me 96 gold entities: 20 persons, 36 organizations, 40 locations.
Labeling rules were strict and given to every model in the prompt:
- Strip Turkish case suffixes. Bozbey’in → Bozbey. Period.
- Demonyms map to the place. “Indonesian island” → Indonesia.
- No embedded entities. “Bursa Cumhuriyet Başsavcılığı” does not also yield a separate Bursa.
- Acronym pairs are one entity. HSK and “Hakimler ve Savcılar Kurulu” are the same thing; either surface form is a match.
The dataset is small — this is a first pass. Boundary judgment calls affect every model equally, so the comparison stays fair even if the absolute numbers have residual label noise.
What I Measured 📐#
- One request per article per model, identical prompt (extract entities, return
{"entities": [{"name", "type"}]}, canonical form, no duplicates, temperature 0) - Micro precision / recall / F1 over all 96 gold entities
- Same metrics split by language (tr / en separately)
- Per-article cost at listed OpenRouter prices
- Mean latency per request (wall-clock)
- Throughput: completion tokens per second
Decision rule, written before running: highest overall micro-F1 wins. Ties within 0.03 F1 are broken by Turkish F1, then by cost, then by throughput. A model that can’t return parseable JSON on any article scores 0 recall for that article — JSON reliability is part of the result.
The Results 📊#
| Model | P | R | F1 | tr F1 | en F1 | Cost/8 | ms/req | tok/s |
|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Flash 🏆 | 0.93 | 0.98 | 0.95 | 0.92 | 0.97 | $0.0077 | 2,434 | 135 |
| Gemini 3.5 Flash | 0.90 | 0.99 | 0.95 | 0.90 | 0.97 | $0.1830 | 12,456 | 198 |
| DeepSeek V4 Flash | 0.95 | 0.94 | 0.94 | 0.91 | 0.96 | $0.0030 | 21,081 | 77 |
| Grok 4.3 | 0.94 | 0.94 | 0.94 | 0.88 | 0.98 | $0.0235 | 5,041 | 175 |
| MiniMax M3 | 0.91 | 0.96 | 0.93 | 0.86 | 0.98 | $0.0171 | 30,125 | 53 |
| DeepSeek V4 Pro | 0.93 | 0.94 | 0.93 | 0.91 | 0.95 | $0.0201 | 41,021 | 63 |
| Gemini 3.1 Flash Lite | 0.89 | 0.96 | 0.92 | 0.85 | 0.97 | $0.0032 | 2,610 | 72 |
| Gemma 4 31B | 0.87 | 0.95 | 0.91 | 0.83 | 0.96 | $0.0015 | 14,313 | 24 |
| MiMo-V2.5-Pro | 0.88 | 0.93 | 0.90 | 0.86 | 0.93 | $0.0240 | 76,354 | 40 |
| MiMo-V2.5 | 0.87 | 0.88 | 0.87 | 0.76 | 0.95 | $0.0073 | 34,091 | 85 |
| GPT-4o-mini | 0.88 | 0.73 | 0.80 | 0.83 | 0.77 | $0.0012 | 2,538 | 55 |
| GPT-5.4 nano | 0.73 | 0.86 | 0.79 | 0.76 | 0.82 | $0.0029 | 2,651 | 79 |
| GPT-5.4 mini | 0.70 | 0.82 | 0.76 | 0.68 | 0.82 | $0.0089 | 2,237 | 75 |
| Tencent Hy3 Preview | 0.95 | 0.43 | 0.59 | 0.64 | 0.55 | $0.0094 | 100,047 | 55 |
One model didn’t make it across the finish line: Tencent Hy3 Preview hit the 180-second timeout on 4 of its 8 requests. When it did respond, precision was a sharp 0.95 — but a model that times out on 50% of articles is not a model you can ship.
What Stood Out 👀#
Turkish still separates the field#
English F1 runs 0.93–0.98 for most competitive models — it’s mostly noise at this point. Turkish F1 spans 0.68–0.91. That gap is the whole experiment.
The hardest article was an Ankara judiciary piece with 12 gold entities. No model cleanly recovered all of “Ankara Cumhuriyet Başsavcılığı”, “Ankara Adliyesi” (as an organization, not a location), and “WhatsApp” together. This is the canonical hard case: long institutional names with embedded place names, and a consumer brand thrown in.
The GPT-5.4 generation regressed badly 📉#
gpt-5.4-mini landed second-to-last at F1 0.76 with precision collapsing to 0.70. It answers in ~2.2 seconds — apparently without any visible reasoning — and systematically over-extracts. It treats “Adalet Bakanı Akın Gürlek” as a single person entity (title glued to name). It extracts “Bursa” separately from “Bursa Cumhuriyet Başsavcılığı” against the explicit rules. It tags the Turkish month name “Haziran” as a location. The nano variant actually beats the mini.
The older model beat its own successors 👻#
google/gemini-2.5-flash led the field at F1 0.9543 and Turkish F1 0.9250 — the top on both axes. Gemini 3.5 Flash and 3.1 Flash Lite, its direct successors, trail it on Turkish at 0.9048 and 0.85 respectively.
Same day. Same prompt. Same dataset. Google’s own next generation didn’t beat it. The AI industry’s newest is not always its best on specific extraction tasks.
Throughput vs latency tell different stories ⚡#
Grok 4.3 and Gemini 3.5 Flash stream at 175–198 tok/s and answer in under 13 seconds. DeepSeek V4 Flash manages only 77 tok/s and 21 seconds per request — reasoning overhead. Gemini 2.5 Flash sits comfortably in the middle at 135 tok/s and 2.4 seconds. For latency-sensitive applications this gap matters a lot; for batch workloads it’s mostly irrelevant.
The Verdict ✅#
Winner: google/gemini-2.5-flash
Here’s how the decision played out across all three dimensions:
Quality 🎯 — Gemini 2.5 Flash has the highest overall F1 (0.9543). Six candidates land within 0.03 — Gemini 3.5 Flash, DeepSeek V4 Flash, Grok 4.3, MiniMax M3, DeepSeek V4 Pro, and Gemini 3.1 Flash Lite. Turkish F1 breaks the tie: Gemini 2.5 Flash leads at 0.9250, ahead of the rest. No further tiebreaker needed.
Cost 💰 — $0.0077 per 8 articles ($0.00096/article). Not the cheapest in the field — DeepSeek V4 Flash is 2.6× cheaper at $0.0030 — but the winner earns its premium: it’s the best F1 and Turkish F1 in the entire field, and still a fraction of the cost of premium options like Gemini 3.5 Flash ($0.1830).
Throughput ⚡ — 135 tok/s, 2.4 seconds per request. Among the fastest in the competitive tier. Grok 4.3 beats it on tok/s (175) but trails it on quality. Gemini 2.5 Flash is the best overall balance of speed, quality, and cost in this field.
The one risk: it’s a previous-generation model and the most likely to be deprecated. If that happens, DeepSeek V4 Flash is the natural fallback — different provider, F1 0.9424, tr F1 0.9091, at $0.0030 per 8 articles.
Per-Language Routing 🗺️#
If you’re routing NER by detected language, the data points to different winners per language:
| Language | Main | Fallback 1 | Fallback 2 |
|---|---|---|---|
| 🇹🇷 Turkish | DeepSeek V4 Flash | Gemini 2.5 Flash | Grok 4.3 |
| 🇬🇧 English | Gemma 4 31B | DeepSeek V4 Flash | MiniMax M3 |
Why different from the overall winner?
- Turkish main is DeepSeek V4 Flash, not Gemini 2.5 Flash, because applying the same cost tiebreaker within the Turkish F1 tie window makes DeepSeek (at $0.0030) the cheapest valid option — it’s 2.6× cheaper with only 0.016 less Turkish F1.
- English main is Gemma 4 31B — it sits inside the English F1 tie window at 0.958, costs only $0.0015/8 articles (the cheapest on the board), and is open-weight with multi-provider availability, making it the most deprecation-resistant pick. MiniMax M3 has the top raw English F1 (0.983) but Gemma beats it on cost.
- DeepSeek V4 Flash appears in both fallback chains — it performs well across both languages, which means a single model can cover both in a pinch.
What I’d Do Differently 🔧#
The 8-article dataset is small by any serious NLP benchmark standard. It was sufficient to separate the field here — Turkish performance has enough variance that signal isn’t buried in noise — but the absolute F1 numbers carry residual label noise from boundary judgment calls.
A re-run is easy if the field changes: add new candidates, run the same script, compare.