2026 Model Guide

Best AI models for Hermes

Top picks across coding, writing, search, and reasoning — so you know exactly what to plug in and why.

Data from SWE-bench Pro, GPQA Diamond, Chatbot Arena, and BenchLM. Updated April 2026.

🏆

Overall best models

Great at everything. If you only pick one, pick from here. These handle coding, writing, research, and reasoning with minimal trade-offs.

🥇

Claude Opus 4.6

Anthropic · 1M context · $5 / $25 per 1M tokens
#1 on Chatbot Arena with 1,504 Elo — the highest score of any publicly available model. Leads instruction-following, long-form work, and agentic tasks. The thinking variant pushes even further. Slower than Sonnet but makes fewer mistakes on genuinely hard problems. Use it for your most important work.
Arena #1 (1504 Elo) Best all-rounder 1M context
1504 Arena Elo
🥈

Gemini 3.1 Pro

Google DeepMind · 1–2M context · $2 / $12 per 1M tokens
#4 on Chatbot Arena (1,492 Elo), #1 on GPQA Diamond reasoning (94.1%), and #1 on creative writing Arena scores. Exceptional value — at $2/$12 per 1M tokens it matches or beats models costing 4–10x more. The 2M context variant handles entire codebases and book-length documents. Currently in preview.
GPQA #1 (94.1%) Best value 2M context
1492 Arena Elo
🥉

GPT-5.4

OpenAI · 1.1M context · $2.50 / $15 per 1M tokens
OpenAI's flagship released March 2026. Leads SWE-bench Pro coding (57.7%) and Terminal-Bench DevOps (75.1%). Solid GPQA Diamond (91.67%). The 1.1M token context window is among the largest of any frontier model, and it costs less per token than Opus. Best pick when you need breadth across coding, writing, and reasoning at a reasonable price.
SWE-Pro #1 1.1M context Coding + reasoning
1484 Arena Elo
4

Claude Sonnet 4.6

Anthropic · 200K context · $3 / $15 per 1M tokens
The fast, smart everyday workhorse. 87.5% on GPQA Diamond, 79.6% on SWE-bench Verified, 1,462 Elo on Chatbot Arena. Nearly as capable as Opus at lower cost and higher speed. The right choice for high-volume tasks — daily summaries, coding, research — where you need quality without always paying Opus prices.
Great value Low latency Reliable
1462 Arena Elo
5

DeepSeek V3.2

DeepSeek · 128K context · $0.28 / $0.42 per 1M tokens
The remarkable open-source value pick. Released December 2025 with an MIT license and a "reasoning-first" architecture that integrates thinking directly into tool use. Claims GPT-5-level performance at a tiny fraction of the cost. Self-hostable. An Arena score of 1,424 puts it within striking distance of the closed frontier. The budget pick that doesn't embarrass itself against anything.
MIT license Ultra cheap Self-hostable
$0.28 per 1M input
💡 Starting out? Claude Sonnet 4.6 is the best first pick — strong, affordable, and the model this project was built and tested with. Get your API key at console.anthropic.com.
💻

Best for coding

Ranked on SWE-bench Pro (1,865 real GitHub issues, multi-language, standardised scaffold — the current clean benchmark) and SWE-bench Verified. These fix bugs and ship features, not just autocomplete.

🥇

Claude Opus 4.6

Anthropic · 1M context · $5 / $25 per 1M tokens
The engine powering Claude Code and Cursor — the two most-used AI coding tools. 80.8% on SWE-bench Verified and 65.4% on Terminal-Bench (DevOps/CLI tasks). Opus shines at deep multi-file reasoning: architecture decisions, debugging subtle cross-module issues, and reviewing large PRs. Its extended 1M context fits entire codebases.
Powers Claude Code Deep reasoning 1M context
80.8% SWE-bench Verified
🥈

GPT-5.4

OpenAI · 1.1M context · $2.50 / $15 per 1M tokens
Leads SWE-bench Pro at 57.7% — the benchmark least susceptible to contamination — and dominates Terminal-Bench 2.0 DevOps tasks at 75.1% (a 9.7-point lead over the next model). Best at CLI automation, shell scripting, and agentic code pipelines. GPT-5.4 is now the recommended replacement for GPT-5.3 Codex across most coding use cases.
SWE-Pro #1 (57.7%) DevOps/CLI Agentic pipelines
57.7% SWE-bench Pro
🥉

Claude Sonnet 4.6

Anthropic · 200K context · $3 / $15 per 1M tokens
79.6% on SWE-bench Verified — within 1.2 points of Opus at 40% lower cost and significantly faster. The right everyday coding model for iterative development, unit tests, code explanation, and high-volume agentic loops where paying Opus prices for every call doesn't make sense. Outperforms the now-deprecated Sonnet 4.5 on every benchmark.
Best value Fast iteration High volume
79.6% SWE-bench Verified
4

Gemini 3.1 Pro

Google DeepMind · 1–2M context · $2 / $12 per 1M tokens
80.6% on SWE-bench Verified and 54.2% on SWE-bench Pro at the most competitive price of any frontier model. The massive 2M context window means you can load an entire large codebase and reason across it in one pass — no chunking, no retrieval. The cheapest path to top-tier coding performance.
2M context Cheapest frontier Full-codebase
80.6% SWE-bench Verified
5

Qwen 3.6-Plus

Alibaba · 1M context · Emerging agentic pick · OpenRouter available
Leads the original Terminal-Bench leaderboard at 61.6% — ahead of Gemini 3.1 Pro on CLI and DevOps automation (GPT-5.4's 75.1% is on the newer Terminal-Bench 2.0). 88.2% on GPQA Diamond. The 1M token context fits large codebases cleanly. An emerging dark horse for agentic coding pipelines with strong independent eval scores. Available now via Alibaba Cloud and OpenRouter.
Terminal-Bench #1 (61.6%) 1M context Agentic emerging
61.6% Terminal-Bench
✍️

Best for writing

Creative writing, copywriting, long-form docs, email drafts. Ranked on EQ-Bench Creative Writing Elo (sycophancy-resistant, community-verified) and Chatbot Arena creative writing scores.

🥇

Claude Sonnet 4.6

Anthropic · 200K context · $3 / $15 per 1M tokens
Leads EQ-Bench Creative Writing at 1,936 Elo — higher than Opus (1,932) on the benchmark specifically designed to resist sycophancy and measure genuine literary quality. Best voice consistency over long documents: tone, register, and style stay coherent across sessions. At 40% lower cost than Opus, it's the smart pick for high-volume writing — drafts, summaries, long-form content pipelines.
EQ-Bench CW #1 (1936) Best value Voice consistency
1936 EQ-Bench CW Elo
🥈

Claude Opus 4.6

Anthropic · 1M context · $5 / $25 per 1M tokens
Leads the Mazur creative writing benchmark (8.53) and instruction-following Arena (1,500 Elo — highest of any model tested). The thinking variant (8.56 Mazur) pushes further for complex literary work. #1 on Chatbot Arena overall at 1,504 Elo. Best for projects demanding deep voice and stylistic range — where spending more per token is justified by the work's importance.
Mazur #1 (8.53) IF Elo #1 Literary depth
8.53 Mazur score
🥉

Gemini 3.1 Pro

Google DeepMind · 1–2M context · $2 / $12 per 1M tokens
#1 on Chatbot Arena creative writing Elo (1,487) — human raters prefer it for fiction and blogs — and best at avoiding AI tells across independent evals. The 2M context and 65K output limit are unmatched for long-form projects: entire chapters, full reports, long narrative arcs. Strong multilingual creative writing in 40+ languages. Input tokens cost 60% less than Opus.
Arena CW #1 (1487) 2M context AI-tell avoidance
1487 Arena CW Elo
4

GPT-5.4

OpenAI · 1.1M context · $2.50 / $15 per 1M tokens
Solid for structured and commercial writing: technical docs, reports, pitch decks, email campaigns. Excellent instruction-following (~92% IFEval). Ranked ~9th on Arena creative writing — noticeably behind Claude and Gemini for fiction and literary prose, but a strong default when your output needs precise formatting or structured argument flow over stylistic voice.
Structured docs 92% IFEval Technical writing
~9th Arena CW rank
5

Kimi K2.5

Moonshot AI · 128K context · $0.60 / $2.50 per 1M tokens
~1,700 EQ-Bench Creative Writing Elo — roughly 87% of Sonnet's literary quality at 80% lower cost. The budget pick for high-volume content: product descriptions, social copy, blog drafts, content pipelines where you need coherent writing at scale without paying frontier prices on every call. API is live and available now.
Budget CW pick ~1700 EQ-Bench CW Volume content
$0.60 per 1M input
🧮

Best for reasoning and analysis

Hard math, PhD-level science, complex multi-step logic, knowledge work, and second-brain tasks. Ranked on GPQA Diamond (PhD expert baseline: 65%), Humanity's Last Exam, and ARC-AGI-2.

🥇

Gemini 3.1 Pro

Google DeepMind · 1–2M context · 94.1% GPQA Diamond
Leads GPQA Diamond at 94.1% and ARC-AGI-2 visual reasoning at 77.1% — the highest score on both of the hardest published reasoning benchmarks. Near-perfect AIME 2025 math (98%+). Strong across physics, chemistry, biology, and multi-domain expert knowledge. The 2M context makes it uniquely capable for research requiring both depth and breadth in one pass.
GPQA #1 (94.1%) ARC-AGI-2 #1 (77.1%) 2M context
94.1% GPQA Diamond
🥈

GPT-5.4

OpenAI · 1.1M context · ~92% GPQA Diamond
~92% on GPQA Diamond, 41.6% on Humanity's Last Exam, and ~92% on IFEval strict compliance. The best-balanced reasoning model: strong on scientific knowledge, reliable at following precise analytical instructions, and capable across coding and writing tasks simultaneously. The GPT-5.4 Pro variant adds extended reasoning for the genuinely hard problems.
GPQA ~92% 92% IFEval Balanced
~92% GPQA Diamond
🥉

Claude Opus 4.6

Anthropic · 1M context · 89.2% GPQA Diamond
89.2% GPQA Diamond and #1 on Chatbot Arena overall (1,504 Elo). Best for long-context reasoning tasks requiring both analytical depth and 1M-token coherence — feeding in entire research corpora, reviewing large codebases, or synthesising book-length material. If throughput and cost matter more, note that Claude Sonnet 4.6 leads GDPval-AA structured knowledge retrieval at 1,633 Elo.
Arena #1 overall 1M context Long-context depth
89.2% GPQA Diamond
4

Gemini 3 Flash (Thinking)

Google DeepMind · 1M context · $0.50 / $3 per 1M tokens
89.8% on GPQA Diamond at just $0.50/$3 per 1M tokens — the best reasoning value by a large margin, outperforming models that cost 20x more. The thinking mode shows its chain-of-thought for auditing. 0.34s time-to-first-token. If you run many hard reasoning calls per day and can't justify frontier pricing, nothing else comes close at this price point.
89.8% GPQA Best value Auditable thinking
$0.50 per 1M input
5

Qwen 3.6-Plus

Alibaba · 1M context · 88.2% GPQA Diamond · Competitive pricing
88.2% on GPQA Diamond with a 1M token context window at a fraction of frontier pricing — a genuine dark-horse for knowledge work. Strong on structured knowledge retrieval and multi-step analytical tasks. The same model that leads Terminal-Bench for coding also performs well on reasoning benchmarks, making it unusually versatile. Available via Alibaba Cloud and OpenRouter.
GPQA 88.2% 1M context Value pick
88.2% GPQA Diamond
⚙️

How to set up a model in Hermes

Each model needs an API key and a provider setting. Here's the quick version for each major provider.

Anthropic (Claude models)

Get your key at console.anthropic.com, then in your Hermes settings set provider: anthropic and export ANTHROPIC_API_KEY in your environment. Model names: claude-opus-4-6, claude-sonnet-4-6, claude-sonnet-4-5.
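As a concrete sketch — the Hermes settings keys shown in the comments are assumptions based on the provider/key pattern described above, so adapt them to wherever your install stores config:

```shell
# Hypothetical setup sketch — the Hermes config keys below are assumptions.
export ANTHROPIC_API_KEY="sk-ant-…"   # key from console.anthropic.com

# Then in your Hermes settings:
#   provider: anthropic
#   model: claude-sonnet-4-6
```

The same pattern applies to the other providers below: export that provider's key, then point the provider setting and model name at it.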

OpenAI (GPT models)

Get your key at platform.openai.com, then set provider: openai and OPENAI_API_KEY. Model names: gpt-5-4, gpt-5-4-mini, gpt-5-4-nano.

Google (Gemini models)

Get your key at aistudio.google.com, then set provider: google and GOOGLE_API_KEY. Model names: gemini-3-1-pro-preview, gemini-3-pro, gemini-3-flash.

OpenRouter (all models via one key)

The easiest way to try multiple models without multiple accounts. Get a key at openrouter.ai, set provider: openrouter and OPENROUTER_API_KEY. Access Claude, GPT, Gemini, DeepSeek, Llama and more with one key.
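A sketch of the OpenRouter route. OpenRouter namespaces model IDs by vendor; the exact IDs below are assumptions built from the model names in this guide:

```shell
# Hypothetical sketch — one key, many models.
export OPENROUTER_API_KEY="sk-or-…"

# Then in your Hermes settings:
#   provider: openrouter
#   model: anthropic/claude-sonnet-4-6
# Swap the model line to try others, e.g.:
#   model: openai/gpt-5-4
#   model: google/gemini-3-1-pro-preview
```

Switching models is then a one-line config change rather than a new account and key.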

Self-hosted (DeepSeek V3.2, Qwen 3.6-Plus, Gemma 4)

Run models locally with llama.cpp or Ollama. DeepSeek V3.2 (MIT) and Qwen 3.6-Plus are the top open-weight picks for coding. Gemma 4 26B MoE (Apache 2.0, 82.3% GPQA Diamond with only 3.8B active parameters) is the best edge/self-hosted reasoning option. Point Hermes at your local server: set provider: openai with a base_url like http://localhost:11434/v1. Your API key can be any string.
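A local-server sketch, assuming Ollama's default port and its OpenAI-compatible endpoint (the model tag name is an assumption — match whatever you have pulled locally):

```shell
# Hypothetical local setup — Ollama exposes an OpenAI-compatible API on port 11434.
# In your Hermes settings:
#   provider: openai
#   base_url: http://localhost:11434/v1
#   api_key: local-anything   # any string; local servers don't check it
#   model: deepseek-v3.2      # tag name is an assumption

# Quick sanity check that the local server is up and lists your model:
curl http://localhost:11434/v1/models
```

llama.cpp's built-in server speaks the same OpenAI-compatible protocol, so the identical config works with only the port changed.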

Pick by use case
Not sure which model to start with? Match your task to a pick.
🤖 Daily assistant
Claude Sonnet 4.6
Fast, affordable, strong across everything. The best starting point.
💻 Complex coding
Claude Opus 4.6
Powers Claude Code & Cursor. Deep multi-file reasoning.
✍️ Creative writing
Claude Sonnet 4.6
EQ-Bench CW #1 (1936). Best voice consistency, 40% cheaper than Opus.
🔍 Search & news
Gemini 3.1 Pro
Native Google Search grounding. 2M context for long research.
🧮 Hard reasoning
Gemini 3.1 Pro
GPQA #1 at 94.1%, ARC-AGI-2 #1 at 77.1%.
💰 Budget pick
Gemini 3 Flash (Thinking)
$0.50/1M, 89.8% GPQA Diamond. Best reasoning per dollar by far.
📄 Huge documents
Gemini 3.1 Pro
Up to 2M token context. Load entire codebases or books in one pass.
🗝️ Try everything
OpenRouter
One API key, every model. Switch between Claude, GPT, Gemini instantly.

Ready to get started?

Self-host Hermes in under five minutes and bring your own API key.