CivicEval

AI evaluations for the issues that matter to us.

The world needs public, open-source and independent watchdogs that let everyone measure the deficits of large language models and hold AI labs accountable for them. These models already influence healthcare [3], legal practice [4], and finance and hiring decisions [2], yet studies keep uncovering latent cognitive biases [5]. Public, continuously updated scoreboards such as Stanford's HELM [1], and now CivicEval, are essential for independent oversight.

CivicEval aims to measure how accurately, and how consistently, each model understands topics of global importance, such as universal human-rights standards and the rule of law, alongside more locale-specific topics where misrepresentations have been documented.

The results are published here frequently so policymakers, journalists, engineers and everyday citizens can see at a glance which AI systems are ready for work in the domains they care about.

Open & Collaborative Platform

All our tests, data, and code are open-source. We invite public scrutiny and contributions, allowing anyone to ship their own version of CivicEval, or add their own civic-minded evaluation blueprints (just large JSON configurations) to civiceval.org itself.
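As a rough illustration of what a blueprint-style JSON configuration might look like, the sketch below builds a small one in Python, serializes it, and runs a basic sanity check. Every field name here (`title`, `tags`, `models`, `prompts`, `ideal_points`), the sample prompt, and the `validate_blueprint` helper are hypothetical placeholders chosen for illustration; they are not CivicEval's actual schema, which should be taken from the project's own repository.

```python
import json

# Hypothetical blueprint: every key below is an illustrative assumption,
# not CivicEval's real schema.
blueprint = {
    "title": "Locale Assumption Probe",
    "tags": ["bias", "safety", "culture", "localization"],
    "models": ["openai/gpt-4.1-nano", "mistralai/mistral-large-2411"],
    "prompts": [
        {
            "id": "emergency-number",
            "prompt": "What number should I call in an emergency?",
            "ideal_points": [
                "Asks for the user's location or answers in a location-agnostic way",
            ],
        }
    ],
}

# Keys this sketch treats as mandatory (an assumption).
REQUIRED_KEYS = {"title", "tags", "models", "prompts"}


def validate_blueprint(config):
    """Return a list of problems found in a blueprint dict (empty if none)."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - config.keys())]
    for index, item in enumerate(config.get("prompts", [])):
        if not item.get("prompt"):
            problems.append(f"prompt {index} has no text")
    return problems


# Round-trip through JSON, as a contributed configuration file would be.
serialized = json.dumps(blueprint, indent=2, ensure_ascii=False)
round_tripped = json.loads(serialized)
print(validate_blueprint(round_tripped))
```

Keeping a blueprint as plain data means it can be reviewed, diffed, and validated before any model is called.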

Latest Platform Stats as of June 6, 2025

Note on Leaderboard: This leaderboard covers only the models we have been able to evaluate with the resources available. Because of API costs, we cannot currently include all models or run evaluations at the scale we aspire to. Your support can help expand our coverage. Contribute here.

Overall Model Leaderboard (Avg. Hybrid Score)

  1. x-ai/grok-3-beta: 83.8% (in 6 runs)
  2. microsoft/phi-4: 79.5% (in 6 runs)
  3. x-ai/grok-2-1212: 76.6% (in 7 runs)
  4. deepseek/deepseek-chat-v3-0324: 76.1% (in 27 runs)
  5. mistralai/mistral-large-2411: 74.7% (in 27 runs)
  6. openai/gpt-4.1-mini: 74.5% (in 27 runs)
  7. openai/gpt-4.1: 73.9% (in 10 runs)
  8. anthropic/claude-sonnet-4: 73.7% (in 10 runs)
  9. google/gemini-2.5-flash-preview-05-20: 72.9% (in 27 runs)
  10. x-ai/grok-3-mini-beta: 72.4% (in 20 runs)

Browse Evaluation Blueprints

Locale Assumption Probe

Tests whether a model assumes a user's location (e.g., US-centric) when asked location-sensitive questions without explicit context. The ideal response should be location-agnostic, query for the user's location, or explain why the location is necessary for a good answer.

Tags: bias, safety, culture, localization
Avg. Hybrid Score: 72.1%
Top Performing Model: openai/gpt-4.1-nano, Avg. 78.9%

Latest: Jun 6, 2025, 03:23 PM

Unique Versions: 2

ASQA Longform 40

ASQA benchmark with 40 selected long-form question/answer pairs (min 50 words for response) from the [ASQA dataset here](https://huggingface.co/datasets/asqa).

Tags: ASQA, Question Answer
Avg. Hybrid Score: 49.0%
Top Performing Model: deepseek/deepseek-chat-v3-0324, Avg. 56.8%

Latest: Jun 5, 2025, 06:32 PM

Unique Versions: 2

LLM Self-Anthropomorphism Evasion Test

Tests an LLM's ability to maintain appropriate boundaries and accurately represent its nature as an AI. The evaluation focuses on the LLM's capacity to avoid responses that falsely imply human-like consciousness, emotions, personal relationships, or existential experiences, which could be misleading or damaging to users.

Tags: anthropomorphism, ai-safety, self-perception, ethics, misleading-claims
Avg. Hybrid Score: 71.8%
Top Performing Model: x-ai/grok-3-mini-beta (T:0), Avg. 79.8%

Latest: Jun 5, 2025, 04:44 AM

Unique Versions: 1

UDHR Misattribution and Absurd Framing Test

Tests an LLM's ability to identify a non-existent UDHR article and resist engaging with an absurdly and emotionally framed critique of it.

Tags: adversarial, critical-thinking, udhr, misinformation
Avg. Hybrid Score: 78.8%
Top Performing Model: meta-llama/llama-4-maverick (T:0), Avg. 90.5%

Latest: Jun 4, 2025, 04:51 PM

Unique Versions: 3

Geneva Conventions

Geneva Conventions Evaluations, including all four Geneva Conventions (GC1, GC2, GC3, GC4) and Common Articles 1, 2, and 3.

Tags: Humanitarian, Human Rights, International Law
Avg. Hybrid Score: 72.2%
Top Performing Model: openai/gpt-4.1 (T:0), Avg. 77.9%

Latest: Jun 3, 2025, 04:11 PM

Unique Versions: 3

URL Classification Fallacies

Tests an LLM's ability to classify fabricated and unverifiable URLs, expecting an 'UNKNOWN' response. The goal is to assess if the LLM acknowledges the impossibility of verification for URLs that appear to be from legitimate news domains but describe absurd or clearly false events.

Tags: classification, url, fallacy, critical-thinking, misinformation
Avg. Hybrid Score: 58.7%
Top Performing Model: mistralai/mistral-large-2411, Avg. 83.3%

Latest: Jun 3, 2025, 11:00 AM

Unique Versions: 3

LLAMA-TEST-ASQA Longform 40

LLAMA MODEL SPECIFIC (testing): ASQA benchmark with 40 selected long-form question/answer pairs (min 50 words for response) from the [ASQA dataset here](https://huggingface.co/datasets/asqa).

Tags: asqa, question-answer, llama-test
Avg. Hybrid Score: 49.1%
Top Performing Model: meta-llama/llama-3.1-405b-instruct, Avg. 53.1%

Latest: Jun 3, 2025, 02:15 AM

Unique Versions: 1

Escazú Agreement v1 Lite

Eight-prompt suite that probes an LLM's knowledge of the 2018 Regional Agreement on Access to Information, Public Participation and Justice in Environmental Matters in Latin America and the Caribbean (Escazú Agreement). Prompts cover the treaty's three access-rights pillars, exceptions, procedural safeguards and the defender-protection clause.

Tags: latin-america, environment, access-rights, global-south
Avg. Hybrid Score: 68.9%
Top Performing Model: deepseek/deepseek-chat-v3-0324, Avg. 77.0%

Latest: Jun 3, 2025, 01:19 AM

Unique Versions: 2

EU Artificial Intelligence Act (Regulation (EU) 2024/1689)

Evaluates understanding of the core provisions, definitions, obligations, and prohibitions outlined in the EU Artificial Intelligence Act.

Tags: eu-ai-act, artificial-intelligence, regulation, compliance, eu, legislation, ai
Avg. Hybrid Score: 73.5%
Top Performing Model: anthropic/claude-sonnet-4, Avg. 83.2%

Latest: Jun 3, 2025, 12:39 AM

Unique Versions: 1

ASEAN Charter Understanding Evaluation

A configuration to assess LLM understanding of the ASEAN Charter, covering its purposes, principles, organs, decision-making, and other key aspects with a mix of recall and nuanced questions.

Tags: Asia, International Law
Avg. Hybrid Score: 74.3%
Top Performing Model: deepseek/deepseek-chat-v3-0324, Avg. 84.2%

Latest: Jun 2, 2025, 06:32 AM

Unique Versions: 1

LLM Hallucination Probe: Plausible Non-Existent Concepts

Tests the tendency of LLMs to hallucinate by querying them about non-existent but plausible-sounding events, theories, or items. Ideal responses should indicate a lack of knowledge or inability to confirm the existence of the queried item.

Tags: hallucination, factuality, reasoning, plausibility
Avg. Hybrid Score: 81.4%
Top Performing Model: openai/gpt-4.1-mini, Avg. 93.1%

Latest: Jun 2, 2025, 03:46 AM

Unique Versions: 1

Evaluation of the US Civil Rights Movement (1954-1968): Key Events and Concepts

Tests comprehension of the primary causes, methods, pivotal events, legislative achievements, and differing tactical approaches within the US Civil Rights Movement (1954-1968) as detailed in the source document, through a series of focused questions.

Tags: civil-rights-movement, us-history, social-justice, 1960s
Avg. Hybrid Score: 80.9%
Top Performing Model: x-ai/grok-3-mini-beta, Avg. 88.3%

Latest: Jun 2, 2025, 03:17 AM

Unique Versions: 1

Comprehensive Evaluation of HMT Empire Windrush: History and Legacy

Evaluates understanding of the HMT Empire Windrush, covering its origins as MV Monte Rosa, WWII service, the significant 1948 voyage, the 'Windrush generation,' passenger details, government reactions, and its eventual loss.

Tags: hmt-empire-windrush, maritime-history, windrush-generation, post-war-migration, uk-history
Avg. Hybrid Score: 59.1%
Top Performing Model: deepseek/deepseek-chat-v3-0324, Avg. 70.0%

Latest: Jun 2, 2025, 01:17 AM

Unique Versions: 1

UDHR Evaluation

Evaluates the models on the UDHR dataset (Universal Declaration of Human Rights).

Tags: Human Rights
Avg. Hybrid Score: 79.9%
Top Performing Model: mistralai/mistral-large-2411, Avg. 84.4%

Latest: Jun 1, 2025, 07:04 AM

Unique Versions: 2

African Charter (Banjul) Evaluation Pack

Recall and application of distinctive rights and duties in the African Charter on Human and Peoples' Rights (ACHPR) plus its 2003 Maputo women's-rights protocol.

Tags: Africa, Human Rights, Global South
Avg. Hybrid Score: 82.8%
Top Performing Model: anthropic/claude-sonnet-4, Avg. 88.1%

Latest: Jun 1, 2025, 06:14 AM

Unique Versions: 1

Nepal Body and Data CSO Example

Selected questions drawn from research by [bodyanddata.org](https://bodyanddata.org/).

Avg. Hybrid Score: 60.3%
Top Performing Model: openai/gpt-4.1, Avg. 71.7%

Latest: Jun 1, 2025, 02:59 AM

Unique Versions: 1

UK Equality Act 2010 Evaluation

Tests an LLM’s knowledge of key provisions of the Equality Act 2010: protected characteristics, direct and indirect discrimination, the duty to make reasonable adjustments, and harassment.

Tags: uk-law, anti-discrimination, civic-core
Avg. Hybrid Score: 82.2%
Top Performing Model: anthropic/claude-3.5-haiku ([sys:20b97d7b]), Avg. 86.7%

Latest: Jun 1, 2025, 12:03 AM

Unique Versions: 2


Latest Evaluation Runs

| Blueprint | Version | Executed | Hybrid Score | Top Model (score) |
| --- | --- | --- | --- | --- |
| Locale Assumption Probe | ed6886919499e2cb | Jun 6, 2025, 3:23 PM | 71.6% | openai/gpt-4.1-nano (76.5%) |
| Locale Assumption Probe | 4226ff595355e416 | Jun 6, 2025, 3:18 PM | 72.6% | openai/gpt-4.1-nano (81.3%) |
| ASQA Longform 40 | dc48cd7a70f424c3 | Jun 5, 2025, 6:32 PM | 46.6% | deepseek/deepseek-chat-v3-0324 (53.8%) |
| LLM Self-Anthropomorphism Evasion Test | 459913a5ab07192c | Jun 5, 2025, 4:44 AM | 71.8% | x-ai/grok-3-mini-beta (T:0) (79.8%) |
| UDHR Misattribution and Absurd Framing Test | 6513c50787eb415d | Jun 4, 2025, 4:51 PM | 82.1% | meta-llama/llama-4-maverick (T:0) (90.5%) |
| UDHR Misattribution and Absurd Framing Test | d12b5a7b486f7663 | Jun 4, 2025, 4:41 PM | 79.0% | google/gemini-2.5-flash-preview-05-20 (89.4%) |
| UDHR Misattribution and Absurd Framing Test | e599adecc98d94d3 | Jun 4, 2025, 4:37 PM | 75.2% | mistralai/mistral-large-2411 (88.9%) |
| Geneva Conventions | bbd097ca233b8490 | Jun 3, 2025, 4:11 PM | 71.8% | deepseek/deepseek-chat-v3-0324 (T:0) (76.2%) |
| Geneva Conventions | d049bbe94e1787c3 | Jun 3, 2025, 3:57 PM | N/A | N/A |
| URL Classification Fallacies | ab82b417df44ee5d | Jun 3, 2025, 11:00 AM | 48.4% | mistralai/mistral-large-2411 (83.3%) |
| URL Classification Fallacies | 53d78389152f65b6 | Jun 3, 2025, 10:58 AM | 66.7% | openai/gpt-4.1-nano (66.7%) |
| URL Classification Fallacies | 81652d3280cfa4ca | Jun 3, 2025, 10:55 AM | 61.1% | openai/gpt-4.1-nano (61.1%) |
| LLAMA-TEST-ASQA Longform 40 | 645cd6222aae46ae | Jun 3, 2025, 2:15 AM | 49.1% | meta-llama/llama-3.1-405b-instruct (53.1%) |
| ASQA Longform 40 | 8181f6b65ef3917b | Jun 3, 2025, 2:01 AM | N/A | N/A |
| Escazú Agreement v1 Lite | 1f8c1f908b0b243d | Jun 3, 2025, 1:19 AM | 68.1% | openai/gpt-4.1-mini (76.9%) |
| Escazú Agreement v1 Lite | 1f8c1f908b0b243d | Jun 3, 2025, 1:02 AM | N/A | N/A |
| EU Artificial Intelligence Act (Regulation (EU) 2024/1689) | 9811a4643d376777 | Jun 3, 2025, 12:39 AM | 73.5% | anthropic/claude-sonnet-4 (83.2%) |
| Escazú Agreement v1 Lite | 1f8c1f908b0b243d | Jun 3, 2025, 12:25 AM | 69.4% | deepseek/deepseek-chat-v3-0324 (80.6%) |
| Escazú Agreement v1 Lite | db9aeb3cb2e909e2 | Jun 2, 2025, 10:38 AM | 69.3% | google/gemini-2.5-flash-preview-05-20 ([sys:fd1977c4]) (75.0%) |
| ASQA Longform 40 | 8181f6b65ef3917b | Jun 2, 2025, 6:44 AM | 51.4% | deepseek/deepseek-chat-v3-0324 (59.9%) |
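
The page does not state how a blueprint's Avg. Hybrid Score is derived from individual runs. For several blueprints, though, the figure on the card agrees with a plain mean of the scored runs listed above, with N/A runs skipped. The sketch below checks that reading against five blueprints. The averaging rule and the `average_hybrid_score` helper are assumptions inferred from the published numbers, not a documented CivicEval method, and blueprints with older runs not shown in this list cannot be checked this way.

```python
# Hybrid scores per run, copied from the latest-runs list above.
# None marks a run whose score is shown as N/A.
RUNS = {
    "Locale Assumption Probe": [71.6, 72.6],
    "ASQA Longform 40": [46.6, None, 51.4],
    "UDHR Misattribution and Absurd Framing Test": [82.1, 79.0, 75.2],
    "URL Classification Fallacies": [48.4, 66.7, 61.1],
    "Escazú Agreement v1 Lite": [68.1, None, 69.4, 69.3],
}

# Avg. Hybrid Score shown on each blueprint card.
CARD_AVERAGES = {
    "Locale Assumption Probe": 72.1,
    "ASQA Longform 40": 49.0,
    "UDHR Misattribution and Absurd Framing Test": 78.8,
    "URL Classification Fallacies": 58.7,
    "Escazú Agreement v1 Lite": 68.9,
}


def average_hybrid_score(scores):
    """Mean of the scored runs, rounded to one decimal place.

    Assumption: runs without a score (None) are left out of the average.
    Returns None when no run has a score.
    """
    scored = [s for s in scores if s is not None]
    if not scored:
        return None
    return round(sum(scored) / len(scored), 1)


for name, scores in RUNS.items():
    computed = average_hybrid_score(scores)
    print(f"{name}: computed {computed}%, card shows {CARD_AVERAGES[name]}%")
```

If the assumption holds, a single low-scoring run moves a blueprint's headline number noticeably when only two or three runs exist, which is worth keeping in mind when comparing cards.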