14 LLMs Benchmarked on Turkish & English NER 🔬 • osmandagdeviren.com.tr

Why LLMs Instead of a Dedicated NER Model? 🤷#

spaCy and Stanza exist for exactly this. They’re fast, cheap, and purpose-built. The problem is Turkish morphology.

Turkish is agglutinative. A proper noun like Bozbey becomes Bozbey’in (possessive) or Bozbey’e (dative), and Ankara Adliyesi (an organization) embeds Ankara without being the location Ankara. Institutional names get long: “Bursa Cumhuriyet Başsavcılığı” contains “Bursa”, but they are not the same entity, and extracting both would be wrong. Suffix-heavy morphology has historically tripped up off-the-shelf models trained mostly on English.

So I tested LLM extraction through OpenRouter instead: one API, one prompt, both languages.

The question: which model gives the best named-entity F1 per dollar — especially on Turkish?

The Candidates 🏁#

14 models from the current OpenRouter catalog, prices from 2026-06-10:

Model	Provider	$/M in	$/M out
Gemini 2.5 Flash	Google	$0.30	$2.50
DeepSeek V4 Flash	DeepSeek	$0.0983	$0.1966
DeepSeek V4 Pro	DeepSeek	$0.435	$0.87
Tencent Hy3 Preview	Tencent	$0.063	$0.21
MiniMax M3	MiniMax	$0.30	$1.20
Xiaomi MiMo-V2.5	Xiaomi	$0.14	$0.28
Xiaomi MiMo-V2.5-Pro	Xiaomi	$0.435	$0.87
Gemini 3.5 Flash	Google	$1.50	$9.00
Gemini 3.1 Flash Lite	Google	$0.25	$1.50
Gemma 4 31B	Google	$0.12	$0.36
GPT-4o-mini	OpenAI	$0.15	$0.60
GPT-5.4 mini	OpenAI	$0.75	$4.50
GPT-5.4 nano	OpenAI	$0.20	$1.25
Grok 4.3	xAI	$1.25	$2.50

The Dataset 📋#

8 real news articles (4 English from BBC and the Guardian, 4 Turkish from TRT and DW), hand-labeled over title + first 1,400 characters of body.

That gave me 96 gold entities: 20 persons, 36 organizations, 40 locations.

Labeling rules were strict and given to every model in the prompt:

Strip Turkish case suffixes. Bozbey’in → Bozbey. Period.
Demonyms map to the place. “Indonesian island” → Indonesia.
No embedded entities. “Bursa Cumhuriyet Başsavcılığı” does not also yield a separate Bursa.
Acronym pairs are one entity. HSK and “Hakimler ve Savcılar Kurulu” are the same thing; either surface form is a match.
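
The suffix and acronym rules can be sketched as a small matching-key function. This is a minimal sketch, not the script used for the benchmark: the `normalize` helper, the `ALIASES` table, and the regex are all assumptions made for illustration, and the regex would wrongly truncate names that contain a genuine apostrophe (for example O'Brien).

```python
import re

# Hypothetical alias table: both surface forms of an acronym pair
# resolve to the same canonical key.
ALIASES = {
    "hsk": "hakimler ve savcılar kurulu",
}

# An apostrophe (straight or typographic) followed by a trailing suffix,
# e.g. the "’in" in "Bozbey’in". Assumption: suffixes are apostrophe-marked.
_CASE_SUFFIX = re.compile(r"['’]\w+$")


def normalize(name: str) -> str:
    """Return a matching key: suffix stripped, case-folded, alias-resolved."""
    stripped = _CASE_SUFFIX.sub("", name.strip())
    key = stripped.casefold()
    return ALIASES.get(key, key)


print(normalize("Bozbey’in"))  # bozbey
print(normalize("Bozbey’e"))   # bozbey
print(normalize("HSK") == normalize("Hakimler ve Savcılar Kurulu"))  # True
```

Comparing keys rather than raw strings means a prediction of either “HSK” or the full name counts as a match against the same gold entity.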

The dataset is small — this is a first pass. Boundary judgment calls affect every model equally, so the comparison stays fair even if the absolute numbers have residual label noise.

What I Measured 📐#

One request per article per model, identical prompt (extract entities, return {"entities": [{"name", "type"}]}, canonical form, no duplicates, temperature 0)
Micro precision / recall / F1 over all 96 gold entities
Same metrics split by language (tr / en separately)
Per-article cost at listed OpenRouter prices
Mean latency per request (wall-clock)
Throughput: completion tokens per second
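
For readers who want the metric spelled out, here is a minimal sketch of micro-averaged precision, recall, and F1. It assumes entities are compared as exact `(name, type)` pairs after normalization; the function name and the type labels `PER`, `ORG`, `LOC` are illustrative, not taken from the benchmark script.

```python
from typing import Iterable, Set, Tuple

Entity = Tuple[str, str]  # (canonical name, entity type)


def micro_prf(
    gold: Iterable[Set[Entity]], pred: Iterable[Set[Entity]]
) -> Tuple[float, float, float]:
    """Pool true/false positives and false negatives over all articles."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted and in gold
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two toy articles. The second prediction is empty, which is how an
# unparseable response would be scored: zero recall for that article.
gold = [
    {("Bozbey", "PER"), ("Bursa Cumhuriyet Başsavcılığı", "ORG")},
    {("Indonesia", "LOC")},
]
pred = [
    {("Bozbey", "PER"), ("Bursa", "LOC")},  # embedded "Bursa" is a false positive
    set(),
]
p, r, f1 = micro_prf(gold, pred)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.5 0.33 0.4
```

Micro-averaging pools counts before dividing, so articles with more entities weigh more than short ones.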

Decision rule, written before running: highest overall micro-F1 wins. Ties within 0.03 F1 are broken by Turkish F1, then by cost, then by throughput. A model that fails to return parseable JSON for an article scores 0 recall on that article, so JSON reliability is part of the result.
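
One way to read that rule as code is a lexicographic tiebreak inside the 0.03 window. This is a sketch under that assumption, not the original selection script; the `Result` class and `pick_winner` function are hypothetical names, and the three rows are copied from the results table further down.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    model: str
    f1: float      # overall micro-F1
    tr_f1: float   # Turkish micro-F1
    cost: float    # USD for the 8-article run
    tok_s: float   # completion tokens per second


def pick_winner(results: List[Result], tie_window: float = 0.03) -> Result:
    """Keep models within tie_window of the best F1, then break ties by
    Turkish F1 (higher), cost (lower), throughput (higher)."""
    best = max(r.f1 for r in results)
    tied = [r for r in results if best - r.f1 <= tie_window]
    return max(tied, key=lambda r: (r.tr_f1, -r.cost, r.tok_s))


results = [
    Result("Gemini 2.5 Flash", 0.95, 0.92, 0.0077, 135),
    Result("DeepSeek V4 Flash", 0.94, 0.91, 0.0030, 77),
    Result("GPT-4o-mini", 0.80, 0.83, 0.0012, 55),
]
print(pick_winner(results).model)  # Gemini 2.5 Flash
```

GPT-4o-mini is the cheapest of the three but never reaches the tiebreak, because its overall F1 falls outside the window.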

The Results 📊#

Model	P	R	F1	tr F1	en F1	Cost/8	ms/req	tok/s
Gemini 2.5 Flash 🏆	0.93	0.98	0.95	0.92	0.97	$0.0077	2,434	135
Gemini 3.5 Flash	0.90	0.99	0.95	0.90	0.97	$0.1830	12,456	198
DeepSeek V4 Flash	0.95	0.94	0.94	0.91	0.96	$0.0030	21,081	77
Grok 4.3	0.94	0.94	0.94	0.88	0.98	$0.0235	5,041	175
MiniMax M3	0.91	0.96	0.93	0.86	0.98	$0.0171	30,125	53
DeepSeek V4 Pro	0.93	0.94	0.93	0.91	0.95	$0.0201	41,021	63
Gemini 3.1 Flash Lite	0.89	0.96	0.92	0.85	0.97	$0.0032	2,610	72
Gemma 4 31B	0.87	0.95	0.91	0.83	0.96	$0.0015	14,313	24
MiMo-V2.5-Pro	0.88	0.93	0.90	0.86	0.93	$0.0240	76,354	40
MiMo-V2.5	0.87	0.88	0.87	0.76	0.95	$0.0073	34,091	85
GPT-4o-mini	0.88	0.73	0.80	0.83	0.77	$0.0012	2,538	55
GPT-5.4 nano	0.73	0.86	0.79	0.76	0.82	$0.0029	2,651	79
GPT-5.4 mini	0.70	0.82	0.76	0.68	0.82	$0.0089	2,237	75
Tencent Hy3 Preview	0.95	0.43	0.59	0.64	0.55	$0.0094	100,047	55

One model didn’t make it across the finish line: Tencent Hy3 Preview hit the 180-second timeout on 4 of its 8 requests. When it did respond, precision was a sharp 0.95 — but a model that times out on 50% of articles is not a model you can ship.

What Stood Out 👀#

Turkish still separates the field#

English F1 runs 0.93–0.98 for most competitive models, which is mostly noise at this point. Turkish F1 spans 0.64–0.92 across the full table, and still 0.68–0.92 if you set aside Tencent Hy3 Preview’s timeouts. That gap is the whole experiment.

The hardest article was an Ankara judiciary piece with 12 gold entities. No model cleanly recovered all of “Ankara Cumhuriyet Başsavcılığı”, “Ankara Adliyesi” (as an organization, not a location), and “WhatsApp” together. This is the canonical hard case: long institutional names with embedded place names, and a consumer brand thrown in.

The GPT-5.4 generation regressed badly 📉#

gpt-5.4-mini landed second-to-last at F1 0.76 with precision collapsing to 0.70. It answers in ~2.2 seconds — apparently without any visible reasoning — and systematically over-extracts. It treats “Adalet Bakanı Akın Gürlek” as a single person entity (title glued to name). It extracts “Bursa” separately from “Bursa Cumhuriyet Başsavcılığı” against the explicit rules. It tags the Turkish month name “Haziran” as a location. The nano variant actually beats the mini.

The older model beat its own successors 👻#

google/gemini-2.5-flash led the field at F1 0.9543 and Turkish F1 0.9250 — the top on both axes. Gemini 3.5 Flash and 3.1 Flash Lite, its direct successors, trail it on Turkish at 0.9048 and 0.85 respectively.

Same day. Same prompt. Same dataset. Google’s own next generation didn’t beat it. The AI industry’s newest is not always its best on specific extraction tasks.

Throughput vs latency tell different stories ⚡#

Grok 4.3 and Gemini 3.5 Flash stream at 175–198 tok/s and answer in under 13 seconds. DeepSeek V4 Flash manages only 77 tok/s and 21 seconds per request (reasoning overhead). Gemini 2.5 Flash sits in the middle on throughput at 135 tok/s, yet it answers fastest of the four at 2.4 seconds. For latency-sensitive applications this gap matters a lot; for batch workloads it’s mostly irrelevant.

The Verdict ✅#

Winner: google/gemini-2.5-flash

Here’s how the decision played out across all three dimensions:

Quality 🎯 — Gemini 2.5 Flash has the highest overall F1 (0.9543). Six candidates land within 0.03 — Gemini 3.5 Flash, DeepSeek V4 Flash, Grok 4.3, MiniMax M3, DeepSeek V4 Pro, and Gemini 3.1 Flash Lite. Turkish F1 breaks the tie: Gemini 2.5 Flash leads at 0.9250, ahead of the rest. No further tiebreaker needed.

Cost 💰 — $0.0077 per 8 articles ($0.00096/article). Not the cheapest in the field — DeepSeek V4 Flash is 2.6× cheaper at $0.0030 — but the winner earns its premium: it’s the best F1 and Turkish F1 in the entire field, and still a fraction of the cost of premium options like Gemini 3.5 Flash ($0.1830).

Throughput ⚡ — 135 tok/s, 2.4 seconds per request. Among the fastest in the competitive tier. Grok 4.3 beats it on tok/s (175) but trails it on quality. Gemini 2.5 Flash is the best overall balance of speed, quality, and cost in this field.
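
The cost figures above follow from per-token pricing. The sketch below shows the arithmetic only: the prices are the listed Gemini 2.5 Flash rates from the candidates table, but the token counts are made-up placeholders, since the post does not report per-request token usage.

```python
def request_cost(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_m_in: float,
    usd_per_m_out: float,
) -> float:
    """Cost of one request in USD, given prices per million tokens."""
    return (
        prompt_tokens * usd_per_m_in + completion_tokens * usd_per_m_out
    ) / 1_000_000


# Hypothetical token counts, for illustration only.
example = request_cost(800, 300, usd_per_m_in=0.30, usd_per_m_out=2.50)
print(f"{example:.5f}")  # 0.00099

# Figures quoted in the text: per-article cost and the DeepSeek price ratio.
per_article = 0.0077 / 8
ratio = 0.0077 / 0.0030
print(f"{per_article:.5f} {ratio:.1f}")  # 0.00096 2.6
```

With output priced at several times the input rate, the completion side can dominate even when the response is much shorter than the prompt, as it does in this made-up example.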

The one risk: it’s a previous-generation model and the most likely to be deprecated. If that happens, DeepSeek V4 Flash is the natural fallback — different provider, F1 0.9424, tr F1 0.9091, at $0.0030 per 8 articles.

Per-Language Routing 🗺️#

If you’re routing NER by detected language, the data points to different winners per language:

Language	Main	Fallback 1	Fallback 2
🇹🇷 Turkish	DeepSeek V4 Flash	Gemini 2.5 Flash	Grok 4.3
🇬🇧 English	Gemma 4 31B	DeepSeek V4 Flash	MiniMax M3

Why different from the overall winner?

Turkish main is DeepSeek V4 Flash, not Gemini 2.5 Flash, because applying the same cost tiebreaker within the Turkish F1 tie window makes DeepSeek (at $0.0030) the cheapest valid option — it’s 2.6× cheaper with only 0.016 less Turkish F1.
English main is Gemma 4 31B: it sits inside the English F1 tie window at 0.958, costs only $0.0015/8 articles (the cheapest inside that window; only GPT-4o-mini at $0.0012 is cheaper overall, and its English F1 of 0.77 is far outside it), and is open-weight with multi-provider availability, making it the most deprecation-resistant pick. MiniMax M3 has the top raw English F1 (0.983) but Gemma beats it on cost.
DeepSeek V4 Flash appears in both chains (main for Turkish, first fallback for English). It performs well across both languages, which means a single model can cover both in a pinch.
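
The routing table translates directly into a fallback loop. This is a minimal sketch: `ROUTES` uses the display names from the table rather than real OpenRouter model identifiers, and `call_model` is a hypothetical callable that you would replace with your own API client.

```python
from typing import Callable, Dict, List, Tuple

# Display names from the routing table; mapping them to provider model
# identifiers is deliberately left out.
ROUTES: Dict[str, List[str]] = {
    "tr": ["DeepSeek V4 Flash", "Gemini 2.5 Flash", "Grok 4.3"],
    "en": ["Gemma 4 31B", "DeepSeek V4 Flash", "MiniMax M3"],
}


def extract_with_fallback(
    text: str, lang: str, call_model: Callable[[str, str], dict]
) -> Tuple[str, dict]:
    """Try each model in the chain for lang; return the first success."""
    last_error = None
    for model in ROUTES[lang]:
        try:
            return model, call_model(model, text)
        except Exception as exc:  # timeout, bad JSON, provider error
            last_error = exc
    raise RuntimeError(f"all models failed for {lang!r}") from last_error


# Stand-in client that simulates a timeout on the Turkish main model.
def fake_call(model: str, text: str) -> dict:
    if model == "DeepSeek V4 Flash":
        raise TimeoutError("simulated timeout")
    return {"entities": []}


used, response = extract_with_fallback("örnek metin", "tr", fake_call)
print(used)  # Gemini 2.5 Flash
```

Treating a JSON parse failure as an exception inside `call_model` lets the same loop cover both timeouts and malformed output.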

What I’d Do Differently 🔧#

The 8-article dataset is small by any serious NLP benchmark standard. It was sufficient to separate the field here — Turkish performance has enough variance that signal isn’t buried in noise — but the absolute F1 numbers carry residual label noise from boundary judgment calls.

A re-run is easy if the field changes: add new candidates, run the same script, compare.