Cheapest AI model for RAG retrieval
RAG sends a lot of context per call — typically 2,000 to 8,000 input tokens of retrieved chunks plus a short generated answer. Input pricing absolutely dominates. We benchmarked at 4,000 input + 400 output, which represents a typical mid-sized RAG turn.
Ranked cheapest first
| # | Model | Input $/M | Output $/M | Per 1M calls |
|---|---|---|---|---|
| #1 | GPT-5 Nano OpenAI |
$0.05 | $0.40 | $360 |
| #2 | GPT-4.1 Nano OpenAI |
$0.10 | $0.40 | $560 |
| #3 | Gemini 2.5 Flash-Lite |
$0.10 | $0.40 | $560 |
| #4 | Llama 3.1 8B Meta |
$0.18 | $0.18 | $792 |
| #5 | GPT-4o mini OpenAI |
$0.15 | $0.60 | $840 |
| #6 | GPT-5.4 Nano OpenAI |
$0.20 | $1.25 | $1,300 |
| #7 | DeepSeek V3 DeepSeek |
$0.27 | $1.10 | $1,520 |
| #8 | Gemini 3.1 Flash-Lite |
$0.25 | $1.50 | $1,600 |
| #9 | GPT-5 Mini OpenAI |
$0.25 | $2.00 | $1,800 |
| #10 | Gemini 2.5 Flash |
$0.30 | $2.50 | $2,200 |
| #11 | GPT-4.1 Mini OpenAI |
$0.40 | $1.60 | $2,240 |
| #12 | Llama 3.1 70B Meta |
$0.59 | $0.79 | $2,676 |
| #13 | DeepSeek V3.1 DeepSeek |
$0.60 | $1.70 | $3,080 |
| #14 | Gemini 3 Flash |
$0.50 | $3.00 | $3,200 |
| #15 | Qwen 2.5 Coder 32B Alibaba |
$0.80 | $0.80 | $3,520 |
| #16 | Llama 3.3 70B Meta |
$0.88 | $0.88 | $3,872 |
| #17 | Qwen 2.5 72B Alibaba |
$0.90 | $0.90 | $3,960 |
| #18 | GPT-5.4 Mini OpenAI |
$0.75 | $4.50 | $4,800 |
| #19 | Claude Haiku 4.5 Anthropic |
$1.00 | $5.00 | $6,000 |
| #20 | o3-mini OpenAI |
$1.10 | $4.40 | $6,160 |
| #21 | o4-mini OpenAI |
$1.10 | $4.40 | $6,160 |
| #22 | GLM-5.1 zhipu |
$1.40 | $4.40 | $7,360 |
| #23 | Qwen3 Coder 480B Alibaba |
$2.00 | $2.00 | $8,800 |
| #24 | GPT-5.1 OpenAI |
$1.25 | $10.00 | $9,000 |
| #25 | GPT-5 OpenAI |
$1.25 | $10.00 | $9,000 |
| #26 | Gemini 2.5 Pro |
$1.25 | $10.00 | $9,000 |
| #27 | Mistral Large Mistral |
$2.00 | $6.00 | $10,400 |
| #28 | GPT-4.1 OpenAI |
$2.00 | $8.00 | $11,200 |
| #29 | o3 OpenAI |
$2.00 | $8.00 | $11,200 |
| #30 | GPT-5.3 OpenAI |
$1.75 | $14.00 | $12,600 |
| #31 | GPT-5.2 OpenAI |
$1.75 | $14.00 | $12,600 |
| #32 | Gemini 3.1 Pro |
$2.00 | $12.00 | $12,800 |
| #33 | GPT-4o OpenAI |
$2.50 | $10.00 | $14,000 |
| #34 | DeepSeek R1 DeepSeek |
$3.00 | $7.00 | $14,800 |
| #35 | Llama 3.1 405B Meta |
$3.50 | $3.50 | $15,400 |
| #36 | GPT-5.4 OpenAI |
$2.50 | $15.00 | $16,000 |
| #37 | Claude Sonnet 4.6 Anthropic |
$3.00 | $15.00 | $18,000 |
| #38 | Claude Opus 4.8 Anthropic |
$5.00 | $25.00 | $30,000 |
| #39 | GPT-5.5 OpenAI |
$5.00 | $30.00 | $32,000 |
| #40 | GPT-4 Turbo OpenAI |
$10.00 | $30.00 | $52,000 |
| #41 | Claude Opus 4.8 (Fast Mode) Anthropic |
$10.00 | $50.00 | $60,000 |
| #42 | GPT-5 Pro OpenAI |
$15.00 | $120.00 | $108,000 |
| #43 | o3-pro OpenAI |
$20.00 | $80.00 | $112,000 |
| #44 | GPT-5.2 Pro OpenAI |
$21.00 | $168.00 | $151,200 |
| #45 | GPT-5.5 Pro OpenAI |
$30.00 | $180.00 | $192,000 |
| #46 | GPT-5.4 Pro OpenAI |
$30.00 | $180.00 | $192,000 |
Workload assumption: 4,000 input tokens + 400 output tokens per call, scaled to 1M calls. Pricing as of 2026-05-31.
How we computed this
The 4,000-input figure assumes five retrieved chunks of roughly 600 tokens each plus the question, system prompt, and answer-formatting instructions. The 400-output figure assumes a cited, paragraph-length answer. Two levers move this number in production: chunk count (re-rankers that cut retrieval from 10 chunks to 3 cut input cost 60% with little quality loss on most corpora) and prompt caching (OpenAI, Anthropic, and Google all discount repeated input by 75-90%, and the static system-prompt portion of a RAG call is exactly the kind of input that caches). A 10:1 input-to-output ratio means a model with cheap input and pricey output can still win this ranking.
The math, worked through
One call at this workload costs GPT-5 Nano $0: 4,000 input tokens at $0.05 per million is $0, plus 400 output tokens at $0.40 per million is $0. At 10,000 calls a day that is $108 a month. The third-place model, Gemini 2.5 Flash-Lite, runs 1.6x that. The most expensive model in the table, GPT-5.4 Pro, costs 533x the winner at the same workload: the spread between top and bottom of this ranking is not a rounding error, it is the difference between a tool budget and a headcount budget.
About the winner
The input-heavy shape favors models with aggressively priced input tiers, but RAG quality lives and dies on faithfulness to the retrieved context. Budget models paraphrase loosely and occasionally answer from parametric memory instead of the provided chunks, which is the one failure mode RAG exists to prevent. Test groundedness, not just price.
When not to pick the cheapest
Skip the cheapest tier if your RAG answers carry citations users will check, or if your chunks contain tables and numbers (weaker models scramble figures when summarizing). Long-context models also tempt teams to skip retrieval entirely and stuff whole documents into the prompt: at 50k+ input tokens per call that is almost always more expensive than fixing your retriever.
How to use this ranking
The winner is mathematically cheapest at the listed workload shape — that's not the same as "best for the use case." Cheaper models often have lower reasoning depth, smaller context windows, or worse instruction-following. Use this as the cost baseline, then test the top 2-3 candidates on your real prompts via the live counter.
Pricing snapshots come from each provider's published rate cards and are tracked in the full pricing changelog. Tokenizer accuracy per model is documented in the methodology.