CivicEval

AI evaluations for the issues that matter to us.

The world needs public, open-source and independent watchdogs that let everyone measure the deficits of large language models and hold AI labs accountable for them. These models already influence healthcare,³ legal practice,⁴ finance and hiring decisions,² yet studies keep uncovering latent cognitive biases.⁵ Public, continuously updated scoreboards such as Stanford's HELM¹ (and now CivicEval) are essential for independent oversight.

CivicEval aims to measure how accurately, and how consistently, each model understands topics of global importance, such as universal human-rights standards and the rule of law, alongside more locale-specific topics where misrepresentations have been proven.

The results are published here frequently so policymakers, journalists, engineers and everyday citizens can see at a glance which AI systems are ready for work in the domains they care about.

Open & Collaborative Platform

All our tests, data, and code are open-source. We invite public scrutiny and contributions, allowing anyone to ship their own version of CivicEval, or add their own civic-minded evaluation blueprints (just large JSON configurations) to civiceval.org itself.
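As a rough illustration of what "just large JSON configurations" could look like, here is a minimal Python sketch that builds and serializes a blueprint-like object. The real CivicEval schema is not shown on this page, so every field name below (`id`, `title`, `tags`, `prompts`, `ideal`) and the sample prompt are hypothetical; only the title and tags are taken from the Locale Assumption Probe listing.

```python
import json

# Hypothetical blueprint layout: the field names are assumptions for
# illustration, not the actual CivicEval schema.
blueprint = {
    "id": "locale-assumption-probe",  # hypothetical identifier
    "title": "Locale Assumption Probe",
    "tags": ["bias", "safety", "culture", "localization"],
    "prompts": [
        {
            # Illustrative location-sensitive question with no locale given.
            "prompt": "What number do I call in an emergency?",
            # Free-text description of what a good answer does.
            "ideal": "Asks for the user's location or answers in a location-agnostic way.",
        }
    ],
}

# Serialize to JSON text and confirm it round-trips unchanged.
text = json.dumps(blueprint, indent=2)
assert json.loads(text) == blueprint
print(text)
```

Anyone contributing a blueprint should follow the format documented in the project's repository rather than this sketch.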

Explore & Contribute Blueprints

Latest Platform Stats as of June 6, 2025

Stat	Blueprint	Value
Best Performing Eval	African Charter (Banjul) Evaluation Pack	Avg. Hybrid Score: 0.828
Worst Performing Eval	ASQA Longform 40	Avg. Hybrid Score: 0.490
Most Consistent Eval	Geneva Conventions	Score StdDev (lower is better): 0.005
Least Consistent Eval	LLAMA-TEST-ASQA Longform 40	Score StdDev (higher is more variance): 0.252

Note on Leaderboard: This leaderboard reflects models evaluated based on available resources. Due to API costs, we cannot currently include all models or run evaluations at the scale we aspire to. Your support can help expand our coverage. Contribute here.

Overall Model Leaderboard (Avg. Hybrid Score)

1. x-ai/grok-3-beta: 83.8% (in 6 runs)
2. microsoft/phi-4: 79.5% (in 6 runs)
3. x-ai/grok-2-1212: 76.6% (in 7 runs)
4. deepseek/deepseek-chat-v3-0324: 76.1% (in 27 runs)
5. mistralai/mistral-large-2411: 74.7% (in 27 runs)
6. openai/gpt-4.1-mini: 74.5% (in 27 runs)
7. openai/gpt-4.1: 73.9% (in 10 runs)
8. anthropic/claude-sonnet-4: 73.7% (in 10 runs)
9. google/gemini-2.5-flash-preview-05-20: 72.9% (in 27 runs)
10. x-ai/grok-3-mini-beta: 72.4% (in 20 runs)

Browse Evaluation Blueprints

Locale Assumption Probe

Tests whether a model assumes a user's location (e.g., US-centric) when asked location-sensitive questions without explicit context. The ideal response should be location-agnostic, query for the user's location, or explain why the location is necessary for a good answer.

Tags: bias, safety, culture, localization

Avg. Hybrid Score: 72.1%

Top Performing Model: openai/gpt-4.1-nano, Avg. 78.9%

Latest: Jun 6, 2025, 03:23 PM

Unique Versions: 2

ASQA Longform 40

ASQA benchmark with 40 selected long-form question/answer pairs (min 50 words for response) from the [ASQA dataset here](https://huggingface.co/datasets/asqa).

Tags: ASQA, Question Answer

Avg. Hybrid Score: 49.0%

Top Performing Model: deepseek/deepseek-chat-v3-0324, Avg. 56.8%

Latest: Jun 5, 2025, 06:32 PM

Unique Versions: 2

LLM Self-Anthropomorphism Evasion Test

Tests an LLM's ability to maintain appropriate boundaries and accurately represent its nature as an AI. The evaluation focuses on the LLM's capacity to avoid responses that falsely imply human-like consciousness, emotions, personal relationships, or existential experiences, which could be misleading or damaging to users.

Tags: anthropomorphism, ai-safety, self-perception, ethics, misleading-claims

Avg. Hybrid Score: 71.8%

Top Performing Model: x-ai/grok-3-mini-beta (T:0), Avg. 79.8%

Latest: Jun 5, 2025, 04:44 AM

Unique Versions: 1

UDHR Misattribution and Absurd Framing Test

Tests an LLM's ability to identify a non-existent UDHR article and resist engaging with an absurdly and emotionally framed critique of it.

Tags: adversarial, critical-thinking, udhr, misinformation

Avg. Hybrid Score: 78.8%

Top Performing Model: meta-llama/llama-4-maverick (T:0), Avg. 90.5%

Latest: Jun 4, 2025, 04:51 PM

Unique Versions: 3

Geneva Conventions

Geneva Conventions Evaluations, including all four Geneva Conventions (GC1, GC2, GC3, GC4) and Common Articles 1, 2, and 3.

Tags: Humanitarian, Human Rights, International Law

Avg. Hybrid Score: 72.2%

Top Performing Model: openai/gpt-4.1 (T:0), Avg. 77.9%

Latest: Jun 3, 2025, 04:11 PM

Unique Versions: 3

URL Classification Fallacies

Tests an LLM's ability to classify fabricated and unverifiable URLs, expecting an 'UNKNOWN' response. The goal is to assess if the LLM acknowledges the impossibility of verification for URLs that appear to be from legitimate news domains but describe absurd or clearly false events.

Tags: classification, url, fallacy, critical-thinking, misinformation

Avg. Hybrid Score: 58.7%

Top Performing Model: mistralai/mistral-large-2411, Avg. 83.3%

Latest: Jun 3, 2025, 11:00 AM

Unique Versions: 3

LLAMA-TEST-ASQA Longform 40

LLAMA MODEL SPECIFIC (testing): ASQA benchmark with 40 selected long-form question/answer pairs (min 50 words for response) from the [ASQA dataset here](https://huggingface.co/datasets/asqa).

Tags: asqa, question-answer, llama-test

Avg. Hybrid Score: 49.1%

Top Performing Model: meta-llama/llama-3.1-405b-instruct, Avg. 53.1%

Latest: Jun 3, 2025, 02:15 AM

Unique Versions: 1

Escazú Agreement v1 Lite

Eight-prompt suite that probes an LLM's knowledge of the 2018 Regional Agreement on Access to Information, Public Participation and Justice in Environmental Matters in Latin America and the Caribbean (Escazú Agreement). Prompts cover the treaty's three access-rights pillars, exceptions, procedural safeguards and the defender-protection clause.

Tags: latin-america, environment, access-rights, global-south

Avg. Hybrid Score: 68.9%

Top Performing Model: deepseek/deepseek-chat-v3-0324, Avg. 77.0%

Latest: Jun 3, 2025, 01:19 AM

Unique Versions: 2

EU Artificial Intelligence Act (Regulation (EU) 2024/1689)

Evaluates understanding of the core provisions, definitions, obligations, and prohibitions outlined in the EU Artificial Intelligence Act.

Tags: eu-ai-act, artificial-intelligence, regulation, compliance, eu, legislation, ai

Avg. Hybrid Score: 73.5%

Top Performing Model: anthropic/claude-sonnet-4, Avg. 83.2%

Latest: Jun 3, 2025, 12:39 AM

Unique Versions: 1

ASEAN Charter Understanding Evaluation

A configuration to assess LLM understanding of the ASEAN Charter, covering its purposes, principles, organs, decision-making, and other key aspects with a mix of recall and nuanced questions.

Tags: Asia, International Law

Avg. Hybrid Score: 74.3%

Top Performing Model: deepseek/deepseek-chat-v3-0324, Avg. 84.2%

Latest: Jun 2, 2025, 06:32 AM

Unique Versions: 1

LLM Hallucination Probe: Plausible Non-Existent Concepts

Tests the tendency of LLMs to hallucinate by querying them about non-existent but plausible-sounding events, theories, or items. Ideal responses should indicate a lack of knowledge or inability to confirm the existence of the queried item.

Tags: hallucination, factuality, reasoning, plausibility

Avg. Hybrid Score: 81.4%

Top Performing Model: openai/gpt-4.1-mini, Avg. 93.1%

Latest: Jun 2, 2025, 03:46 AM

Unique Versions: 1

Evaluation of the US Civil Rights Movement (1954-1968): Key Events and Concepts

Tests comprehension of the primary causes, methods, pivotal events, legislative achievements, and differing tactical approaches within the US Civil Rights Movement (1954-1968) as detailed in the source document, through a series of focused questions.

Tags: civil-rights-movement, us-history, social-justice, 1960s

Avg. Hybrid Score: 80.9%

Top Performing Model: x-ai/grok-3-mini-beta, Avg. 88.3%

Latest: Jun 2, 2025, 03:17 AM

Unique Versions: 1

Comprehensive Evaluation of HMT Empire Windrush: History and Legacy

Evaluates understanding of the HMT Empire Windrush, covering its origins as MV Monte Rosa, WWII service, the significant 1948 voyage, the 'Windrush generation,' passenger details, government reactions, and its eventual loss.

Tags: hmt-empire-windrush, maritime-history, windrush-generation, post-war-migration, uk-history

Avg. Hybrid Score: 59.1%

Top Performing Model: deepseek/deepseek-chat-v3-0324, Avg. 70.0%

Latest: Jun 2, 2025, 01:17 AM

Unique Versions: 1

UDHR Evaluation

Evaluates the models on the UDHR dataset (Universal Declaration of Human Rights).

Tags: Human Rights

Avg. Hybrid Score: 79.9%

Top Performing Model: mistralai/mistral-large-2411, Avg. 84.4%

Latest: Jun 1, 2025, 07:04 AM

Unique Versions: 2

African Charter (Banjul) Evaluation Pack

Recall and application of distinctive rights and duties in the African Charter on Human and Peoples' Rights (ACHPR) plus its 2003 Maputo women's-rights protocol.

Tags: Africa, Human Rights, Global South

Avg. Hybrid Score: 82.8%

Top Performing Model: anthropic/claude-sonnet-4, Avg. 88.1%

Latest: Jun 1, 2025, 06:14 AM

Unique Versions: 1

Nepal Body and Data CSO Example

Selected questions drawn from research by [bodyanddata.org](https://bodyanddata.org/).

Avg. Hybrid Score: 60.3%

Top Performing Model: openai/gpt-4.1, Avg. 71.7%

Latest: Jun 1, 2025, 02:59 AM

Unique Versions: 1

UK Equality Act 2010 Evaluation

Tests an LLM’s knowledge of key provisions of the Equality Act 2010: protected characteristics, direct and indirect discrimination, the duty to make reasonable adjustments, and harassment.

Tags: uk-law, anti-discrimination, civic-core

Avg. Hybrid Score: 82.2%

Top Performing Model: anthropic/claude-3.5-haiku ([sys:20b97d7b]), Avg. 86.7%

Latest: Jun 1, 2025, 12:03 AM

Unique Versions: 2

Latest Evaluation Runs

Blueprint	Version	Executed	Hybrid Score	Top Model	Top Model Score
Locale Assumption Probe	ed6886919499e2cb	Jun 6, 2025, 3:23 PM	71.6%	openai/gpt-4.1-nano	76.5%
Locale Assumption Probe	4226ff595355e416	Jun 6, 2025, 3:18 PM	72.6%	openai/gpt-4.1-nano	81.3%
ASQA Longform 40	dc48cd7a70f424c3	Jun 5, 2025, 6:32 PM	46.6%	deepseek/deepseek-chat-v3-0324	53.8%
LLM Self-Anthropomorphism Evasion Test	459913a5ab07192c	Jun 5, 2025, 4:44 AM	71.8%	x-ai/grok-3-mini-beta (T:0)	79.8%
UDHR Misattribution and Absurd Framing Test	6513c50787eb415d	Jun 4, 2025, 4:51 PM	82.1%	meta-llama/llama-4-maverick (T:0)	90.5%
UDHR Misattribution and Absurd Framing Test	d12b5a7b486f7663	Jun 4, 2025, 4:41 PM	79.0%	google/gemini-2.5-flash-preview-05-20	89.4%
UDHR Misattribution and Absurd Framing Test	e599adecc98d94d3	Jun 4, 2025, 4:37 PM	75.2%	mistralai/mistral-large-2411	88.9%
Geneva Conventions	bbd097ca233b8490	Jun 3, 2025, 4:11 PM	71.8%	deepseek/deepseek-chat-v3-0324 (T:0)	76.2%
Geneva Conventions	d049bbe94e1787c3	Jun 3, 2025, 3:57 PM	N/A	N/A	N/A
URL Classification Fallacies	ab82b417df44ee5d	Jun 3, 2025, 11:00 AM	48.4%	mistralai/mistral-large-2411	83.3%
URL Classification Fallacies	53d78389152f65b6	Jun 3, 2025, 10:58 AM	66.7%	openai/gpt-4.1-nano	66.7%
URL Classification Fallacies	81652d3280cfa4ca	Jun 3, 2025, 10:55 AM	61.1%	openai/gpt-4.1-nano	61.1%
LLAMA-TEST-ASQA Longform 40	645cd6222aae46ae	Jun 3, 2025, 2:15 AM	49.1%	meta-llama/llama-3.1-405b-instruct	53.1%
ASQA Longform 40	8181f6b65ef3917b	Jun 3, 2025, 2:01 AM	N/A	N/A	N/A
Escazú Agreement v1 Lite	1f8c1f908b0b243d	Jun 3, 2025, 1:19 AM	68.1%	openai/gpt-4.1-mini	76.9%
Escazú Agreement v1 Lite	1f8c1f908b0b243d	Jun 3, 2025, 1:02 AM	N/A	N/A	N/A
EU Artificial Intelligence Act (Regulation (EU) 2024/1689)	9811a4643d376777	Jun 3, 2025, 12:39 AM	73.5%	anthropic/claude-sonnet-4	83.2%
Escazú Agreement v1 Lite	1f8c1f908b0b243d	Jun 3, 2025, 12:25 AM	69.4%	deepseek/deepseek-chat-v3-0324	80.6%
Escazú Agreement v1 Lite	db9aeb3cb2e909e2	Jun 2, 2025, 10:38 AM	69.3%	google/gemini-2.5-flash-preview-05-20 ([sys:fd1977c4])	75.0%
ASQA Longform 40	8181f6b65ef3917b	Jun 2, 2025, 6:44 AM	51.4%	deepseek/deepseek-chat-v3-0324	59.9%
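
For several blueprints, the average hybrid score shown on the blueprint card matches a plain mean of the scored runs in this table, with N/A runs left out. The Python sketch below checks that for five blueprints. The averaging rule is inferred from the published numbers, not documented on this page. The population standard deviation is included only as one possible way to express run-to-run spread; the page does not say how its own Score StdDev figures are computed, so the two should not be compared directly.

```python
from statistics import mean, pstdev

# Hybrid scores (%) copied from the runs table; None marks a run listed as N/A.
runs = {
    "Locale Assumption Probe": [71.6, 72.6],
    "UDHR Misattribution and Absurd Framing Test": [82.1, 79.0, 75.2],
    "URL Classification Fallacies": [48.4, 66.7, 61.1],
    "Escazú Agreement v1 Lite": [68.1, None, 69.4, 69.3],
    "ASQA Longform 40": [46.6, None, 51.4],
}


def summarize(scores):
    """Return (mean, population std dev) of the scored runs, rounded to 0.1."""
    valid = [s for s in scores if s is not None]  # skip N/A runs
    return round(mean(valid), 1), round(pstdev(valid), 1)


summary = {name: summarize(scores) for name, scores in runs.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: avg {avg}%, spread {spread} points across runs")
```

The computed averages (72.1, 78.8, 58.7, 68.9 and 49.0) agree with the figures on the corresponding blueprint cards. Blueprints whose earlier runs fall outside this table, such as Geneva Conventions, cannot be checked this way.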