2026 Model Guide

Best AI models for Hermes

Top picks across coding, writing, search, and reasoning — so you know exactly what to plug in and why.

Data from SWE-bench Pro, GPQA Diamond, Chatbot Arena, and BenchLM. Updated April 2026.

🏆

Overall best models

Great at everything. If you only pick one, pick from here. These handle coding, writing, research, and reasoning with minimal trade-offs.

🥇

Claude Opus 4.6

Anthropic · 1M context · $5 / $25 per 1M tokens
#1 on Chatbot Arena with 1,504 Elo — the highest score of any publicly available model. Leads instruction-following, long-form work, and agentic tasks. The thinking variant pushes even further. Slower than Sonnet but makes fewer mistakes on genuinely hard problems. Use it for your most important work.
Arena #1 (1504 Elo) Best all-rounder 1M context
1504 Arena Elo
🥈

Gemini 3.1 Pro

Google DeepMind · 1–2M context · $2 / $12 per 1M tokens
#4 on Chatbot Arena (1,492 Elo), #1 on GPQA Diamond reasoning (94.1%), and #1 on creative writing Arena scores. Exceptional value — at $2/$12 per 1M tokens it matches or beats models costing 4–10x more. The 2M context variant handles entire codebases and book-length documents. Currently in preview.
GPQA #1 (94.1%) Best value 2M context
1492 Arena Elo
🥉

GPT-5.4

OpenAI · 1.1M context · $2.50 / $15 per 1M tokens
OpenAI's flagship released March 2026. Leads SWE-bench Pro coding (57.7%) and Terminal-Bench DevOps (75.1%). Solid GPQA Diamond (91.67%). The 1.1M token context window is among the largest of any frontier model, and it costs less per token than Opus. Best pick when you need breadth across coding, writing, and reasoning at a reasonable price.
SWE-Pro #1 1.1M context Coding + reasoning
1484 Arena Elo
4

Claude Sonnet 4.6

Anthropic · 200K context · $3 / $15 per 1M tokens
The fast, smart everyday workhorse. 87.5% on GPQA Diamond, 79.6% on SWE-bench Verified, 1,462 Elo on Chatbot Arena. Nearly as capable as Opus at lower cost and higher speed. The right choice for high-volume tasks — daily summaries, coding, research — where you need quality without always paying Opus prices.
Great value Low latency Reliable
1462 Arena Elo
5

DeepSeek V3.2

DeepSeek · 128K context · $0.28 / $0.42 per 1M tokens
The remarkable open-source value pick. Released December 2025 with an MIT license and a "reasoning-first" architecture that integrates thinking directly into tool use. Claims GPT-5-level performance at a tiny fraction of the cost. Self-hostable. An Arena score of 1,424 puts it within striking distance of the closed frontier. The budget pick that doesn't embarrass itself against anything.
MIT license Ultra cheap Self-hostable
$0.28 per 1M input
💡 Starting out? Claude Sonnet 4.6 is the best first pick — strong, affordable, and the model this project was built and tested with. Get your API key at console.anthropic.com.
💻

Best for coding

Ranked on SWE-bench Pro (1,865 real GitHub issues, multi-language, standardised scaffold — the current clean benchmark) and SWE-bench Verified. These fix bugs and ship features, not just autocomplete.

🥇

Claude Opus 4.6

Anthropic · 1M context · $5 / $25 per 1M tokens
The engine powering Claude Code and Cursor — the two most-used AI coding tools. 80.8% on SWE-bench Verified and 65.4% on Terminal-Bench (DevOps/CLI tasks). Opus shines at deep multi-file reasoning: architecture decisions, debugging subtle cross-module issues, and reviewing large PRs. Its extended 1M context fits entire codebases.
Powers Claude Code Deep reasoning 1M context
80.8% SWE-bench Verified
🥈

GPT-5.4

OpenAI · 1.1M context · $2.50 / $15 per 1M tokens
Leads SWE-bench Pro at 57.7% — the benchmark least susceptible to contamination — and dominates Terminal-Bench 2.0 DevOps tasks at 75.1% (a 9.7-point lead over the next model). Best at CLI automation, shell scripting, and agentic code pipelines. GPT-5.4 is now the recommended replacement for GPT-5.3 Codex across most coding use cases.
SWE-Pro #1 (57.7%) DevOps/CLI Agentic pipelines
57.7% SWE-bench Pro
🥉

Claude Sonnet 4.6

Anthropic · 200K context · $3 / $15 per 1M tokens
79.6% on SWE-bench Verified — within 1.2 points of Opus at 40% lower cost and significantly faster. The right everyday coding model for iterative development, unit tests, code explanation, and high-volume agentic loops where paying Opus prices for every call doesn't make sense. Outperforms the now-deprecated Sonnet 4.5 on every benchmark.
Best value Fast iteration High volume
79.6% SWE-bench Verified
4

Gemini 3.1 Pro

Google DeepMind · 1–2M context · $2 / $12 per 1M tokens
80.6% on SWE-bench Verified and 54.2% on SWE-bench Pro at the most competitive price of any frontier model. The massive 2M context window means you can load an entire large codebase and reason across it in one pass — no chunking, no retrieval. The cheapest path to top-tier coding performance.
2M context Cheapest frontier Full-codebase
80.6% SWE-bench Verified
5

Qwen 3.6-Plus

Alibaba · 1M context · Emerging agentic pick · OpenRouter available
Leads the original Terminal-Bench leaderboard at 61.6% — ahead of Gemini 3.1 Pro on CLI and DevOps automation (GPT-5.4's 75.1% is on the newer Terminal-Bench 2.0). 88.2% on GPQA Diamond. The 1M token context fits large codebases cleanly. An emerging dark horse for agentic coding pipelines with strong independent eval scores. Available now via Alibaba Cloud and OpenRouter.
Terminal-Bench #1 (61.6%) 1M context Agentic emerging
61.6% Terminal-Bench
✍️

Best for writing

Creative writing, copywriting, long-form docs, email drafts. Ranked on EQ-Bench Creative Writing Elo (sycophancy-resistant, community-verified) and Chatbot Arena creative writing scores.

🥇

Claude Sonnet 4.6

Anthropic · 200K context · $3 / $15 per 1M tokens
Leads EQ-Bench Creative Writing at 1,936 Elo — higher than Opus (1,932) on the benchmark specifically designed to resist sycophancy and measure genuine literary quality. Best voice consistency over long documents: tone, register, and style stay coherent across sessions. At 40% lower cost than Opus, it's the smart pick for high-volume writing — drafts, summaries, long-form content pipelines.
EQ-Bench CW #1 (1936) Best value Voice consistency
1936 EQ-Bench CW Elo
🥈

Claude Opus 4.6

Anthropic · 1M context · $5 / $25 per 1M tokens
Leads the Mazur creative writing benchmark (8.53) and instruction-following Arena (1,500 Elo — highest of any model tested). The thinking variant (8.56 Mazur) pushes further for complex literary work. #1 on Chatbot Arena overall at 1,504 Elo. Best for projects demanding deep voice and stylistic range — where spending more per token is justified by the work's importance.
Mazur #1 (8.53) IF Elo #1 Literary depth
8.53 Mazur score
🥉

Gemini 3.1 Pro

Google DeepMind · 1–2M context · $2 / $12 per 1M tokens
#1 on Chatbot Arena creative writing Elo (1,487) — human raters prefer it for fiction and blogs — and best at avoiding AI tells across independent evals. The 2M context and 65K output limit are unmatched for long-form projects: entire chapters, full reports, long narrative arcs. Strong multilingual creative writing in 40+ languages. Input tokens cost 60% less than Opus.
Arena CW #1 (1487) 2M context AI-tell avoidance
1487 Arena CW Elo
4

GPT-5.4

OpenAI · 1.1M context · $2.50 / $15 per 1M tokens
Solid for structured and commercial writing: technical docs, reports, pitch decks, email campaigns. Excellent instruction-following (~92% IFEval). Ranked ~9th on Arena creative writing — noticeably behind Claude and Gemini for fiction and literary prose, but a strong default when your output needs precise formatting or structured argument flow over stylistic voice.
Structured docs 92% IFEval Technical writing
~9th Arena CW rank
5

Kimi K2.5

Moonshot AI · 128K context · $0.60 / $2.50 per 1M tokens
~1,700 EQ-Bench Creative Writing Elo — roughly 87% of Sonnet's literary quality at 80% lower cost. The budget pick for high-volume content: product descriptions, social copy, blog drafts, content pipelines where you need coherent writing at scale without paying frontier prices on every call. API is live and available now.
Budget CW pick ~1700 EQ-Bench CW Volume content
$0.60 per 1M input
🧮

Best for reasoning and analysis

Hard math, PhD-level science, complex multi-step logic, knowledge work, and second-brain tasks. Ranked on GPQA Diamond (PhD expert baseline: 65%), Humanity's Last Exam, and ARC-AGI-2.

🥇

Gemini 3.1 Pro

Google DeepMind · 1–2M context · 94.1% GPQA Diamond
Leads GPQA Diamond at 94.1% and ARC-AGI-2 visual reasoning at 77.1% — the highest score on both of the hardest published reasoning benchmarks. Near-perfect AIME 2025 math (98%+). Strong across physics, chemistry, biology, and multi-domain expert knowledge. The 2M context makes it uniquely capable for research requiring both depth and breadth in one pass.
GPQA #1 (94.1%) ARC-AGI-2 #1 (77.1%) 2M context
94.1% GPQA Diamond
🥈

GPT-5.4

OpenAI · 1.1M context · ~92% GPQA Diamond
~92% on GPQA Diamond, 41.6% on Humanity's Last Exam, and ~92% on IFEval strict compliance. The best-balanced reasoning model: strong on scientific knowledge, reliable at following precise analytical instructions, and capable across coding and writing tasks simultaneously. The GPT-5.4 Pro variant adds extended reasoning for the genuinely hard problems.
GPQA ~92% 92% IFEval Balanced
~92% GPQA Diamond
🥉

Claude Opus 4.6

Anthropic · 1M context · 89.2% GPQA Diamond
89.2% GPQA Diamond and #1 on Chatbot Arena overall (1,504 Elo). Best for long-context reasoning tasks requiring both analytical depth and 1M-token coherence — feeding in entire research corpora, reviewing large codebases, or synthesising book-length material. If throughput and cost matter more, note that Claude Sonnet 4.6 leads GDPval-AA structured knowledge retrieval at 1,633 Elo.
Arena #1 overall 1M context Long-context depth
89.2% GPQA Diamond
4

Gemini 3 Flash (Thinking)

Google DeepMind · 1M context · $0.50 / $3 per 1M tokens
89.8% on GPQA Diamond at just $0.50/$3 per 1M tokens — the best reasoning value by a large margin, outperforming models that cost 20x more. The thinking mode shows its chain-of-thought for auditing. 0.34s time-to-first-token. If you run many hard reasoning calls per day and can't justify frontier pricing, nothing else comes close at this price point.
89.8% GPQA Best value Auditable thinking
$0.50 per 1M input
5

Qwen 3.6-Plus

Alibaba · 1M context · 88.2% GPQA Diamond · Competitive pricing
88.2% on GPQA Diamond with a 1M token context window at a fraction of frontier pricing — a genuine dark-horse for knowledge work. Strong on structured knowledge retrieval and multi-step analytical tasks. The same model that leads Terminal-Bench for coding also performs well on reasoning benchmarks, making it unusually versatile. Available via Alibaba Cloud and OpenRouter.
GPQA 88.2% 1M context Value pick
88.2% GPQA Diamond
⚙️

How to set up a model in Hermes

Each model needs an API key and a provider setting. Here's the quick version for each major provider.

Anthropic (Claude models)

Get your key at console.anthropic.com, then in your Hermes settings set provider: anthropic and export ANTHROPIC_API_KEY in your environment. Model names: claude-opus-4-6, claude-sonnet-4-6, claude-sonnet-4-5.
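As a concrete sketch — the Hermes settings keys shown in the comments are assumptions based on the provider/key pattern described above, so adapt them to wherever your install stores config:

```shell
# Hypothetical setup sketch — the Hermes config keys below are assumptions.
export ANTHROPIC_API_KEY="sk-ant-…"   # key from console.anthropic.com

# Then in your Hermes settings:
#   provider: anthropic
#   model: claude-sonnet-4-6
```

The same pattern applies to the other providers below: export that provider's key, then point the provider setting and model name at it.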

OpenAI (GPT models)

Get your key at platform.openai.com, then set provider: openai and OPENAI_API_KEY. Model names: gpt-5-4, gpt-5-4-mini, gpt-5-4-nano.

Google (Gemini models)

Get your key at aistudio.google.com, then set provider: google and GOOGLE_API_KEY. Model names: gemini-3-1-pro-preview, gemini-3-pro, gemini-3-flash.

OpenRouter (all models via one key)

The easiest way to try multiple models without multiple accounts. Get a key at openrouter.ai, set provider: openrouter and OPENROUTER_API_KEY. Access Claude, GPT, Gemini, DeepSeek, Llama and more with one key.
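A sketch of the OpenRouter route. OpenRouter namespaces model IDs by vendor; the exact IDs below are assumptions built from the model names in this guide:

```shell
# Hypothetical sketch — one key, many models.
export OPENROUTER_API_KEY="sk-or-…"

# Then in your Hermes settings:
#   provider: openrouter
#   model: anthropic/claude-sonnet-4-6
# Swap the model line to try others, e.g.:
#   model: openai/gpt-5-4
#   model: google/gemini-3-1-pro-preview
```

Switching models is then a one-line config change rather than a new account and key.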

Self-hosted (DeepSeek V3.2, Qwen 3.6-Plus, Gemma 4)

Run models locally with llama.cpp or Ollama. DeepSeek V3.2 (MIT) and Qwen 3.6-Plus are the top open-weight picks for coding. Gemma 4 26B MoE (Apache 2.0, 82.3% GPQA Diamond with only 3.8B active parameters) is the best edge/self-hosted reasoning option. Point Hermes at your local server: set provider: openai with a base_url like http://localhost:11434/v1. Your API key can be any string.
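A local-server sketch, assuming Ollama's default port and its OpenAI-compatible endpoint (the model tag name is an assumption — match whatever you have pulled locally):

```shell
# Hypothetical local setup — Ollama exposes an OpenAI-compatible API on port 11434.
# In your Hermes settings:
#   provider: openai
#   base_url: http://localhost:11434/v1
#   api_key: local-anything   # any string; local servers don't check it
#   model: deepseek-v3.2      # tag name is an assumption

# Quick sanity check that the local server is up and lists your model:
curl http://localhost:11434/v1/models
```

llama.cpp's built-in server speaks the same OpenAI-compatible protocol, so the identical config works with only the port changed.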

Pick by use case
Not sure which model to start with? Match your task to a pick.
🤖 Daily assistant
Claude Sonnet 4.6
Fast, affordable, strong across everything. The best starting point.
💻 Complex coding
Claude Opus 4.6
Powers Claude Code & Cursor. Deep multi-file reasoning.
✍️ Creative writing
Claude Sonnet 4.6
EQ-Bench CW #1 (1936). Best voice consistency, 40% cheaper than Opus.
🔍 Search & news
Gemini 3.1 Pro
Native Google Search grounding. 2M context for long research.
🧮 Hard reasoning
Gemini 3.1 Pro
GPQA #1 at 94.1%, ARC-AGI-2 #1 at 77.1%.
💰 Budget pick
Gemini 3 Flash (Thinking)
$0.50/1M, 89.8% GPQA Diamond. Best reasoning per dollar by far.
📄 Huge documents
Gemini 3.1 Pro
Up to 2M token context. Load entire codebases or books in one pass.
🗝️ Try everything
OpenRouter
One API key, every model. Switch between Claude, GPT, Gemini instantly.

Ready to get started?

Self-host Hermes in under five minutes and bring your own API key.