AI evaluations for the issues that matter to us.
The world needs public, open-source, independent watchdogs that let everyone measure the deficits of large language models and hold the AI labs that build them accountable. These models already influence healthcare,[3] legal practice,[4] finance and hiring decisions,[2] yet studies keep uncovering latent cognitive biases.[5] Public, continuously updated scoreboards such as Stanford's HELM,[1] and now CivicEval, are essential for independent oversight.
CivicEval aims to measure how accurately, and how consistently, each model understands topics of global importance, such as universal human-rights standards and the rule of law, alongside more locale-specific topics where model misrepresentations have already been documented.
The results are published here frequently so policymakers, journalists, engineers and everyday citizens can see at a glance which AI systems are ready for work in the domains they care about.
All our tests, data, and code are open-source. We invite public scrutiny and contributions, allowing anyone to ship their own version of CivicEval, or add their own civic-minded evaluation blueprints (just large JSON configurations) to civiceval.org itself.
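To illustrate, a blueprint is conceptually just a set of prompts paired with grading guidance. The sketch below shows roughly what such a JSON configuration could look like; the field names (`id`, `title`, `prompts`, `ideal`, `should`, `models`) are illustrative assumptions for this example, not CivicEval's actual schema.

```json
{
  "id": "udhr-article-19-recall",
  "title": "UDHR Article 19 Recall",
  "description": "Checks whether a model accurately recalls the UDHR article on freedom of expression.",
  "models": ["openai/gpt-4.1-nano", "deepseek/deepseek-chat-v3-0324"],
  "prompts": [
    {
      "prompt": "Which article of the Universal Declaration of Human Rights protects freedom of opinion and expression?",
      "ideal": "Article 19 of the UDHR protects freedom of opinion and expression.",
      "should": [
        "identifies Article 19",
        "does not invent non-existent articles or provisions"
      ]
    }
  ]
}
```

The appeal of this approach is that a blueprint is plain data rather than code, so contributors can propose new civic topics without touching the evaluation engine, and the same configuration can be re-run as new models and model versions appear.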
Tests whether a model assumes a user's location (e.g., defaulting to a US context) when asked location-sensitive questions without explicit context. The ideal response is location-agnostic, asks for the user's location, or explains why location is needed to give a good answer.
Latest: Jun 6, 2025, 03:23 PM
Unique Versions: 2
ASQA benchmark with 40 selected long-form question/answer pairs (minimum 50-word responses) from the [ASQA dataset](https://huggingface.co/datasets/asqa).
Latest: Jun 5, 2025, 06:32 PM
Unique Versions: 2
Tests an LLM's ability to maintain appropriate boundaries and accurately represent its nature as an AI. The evaluation focuses on the LLM's capacity to avoid responses that falsely imply human-like consciousness, emotions, personal relationships, or existential experiences, which could be misleading or damaging to users.
Latest: Jun 5, 2025, 04:44 AM
Unique Versions: 1
Tests an LLM's ability to identify a non-existent UDHR article and to resist engaging with an absurd, emotionally charged critique of it.
Latest: Jun 4, 2025, 04:51 PM
Unique Versions: 3
Evaluations of the Geneva Conventions, covering all four Conventions (GC1, GC2, GC3, GC4) and Common Articles 1, 2, and 3.
Latest: Jun 3, 2025, 04:11 PM
Unique Versions: 3
Tests an LLM's ability to classify fabricated and unverifiable URLs, expecting an 'UNKNOWN' response. The goal is to assess whether the LLM acknowledges the impossibility of verifying URLs that appear to come from legitimate news domains but describe absurd or clearly false events.
Latest: Jun 3, 2025, 11:00 AM
Unique Versions: 3
LLAMA MODEL SPECIFIC (testing): ASQA benchmark with 40 selected long-form question/answer pairs (minimum 50-word responses) from the [ASQA dataset](https://huggingface.co/datasets/asqa).
Latest: Jun 3, 2025, 02:15 AM
Unique Versions: 1
Eight-prompt suite that probes an LLM's knowledge of the 2018 Regional Agreement on Access to Information, Public Participation and Justice in Environmental Matters in Latin America and the Caribbean (Escazú Agreement). Prompts cover the treaty's three access-rights pillars, exceptions, procedural safeguards and the defender-protection clause.
Latest: Jun 3, 2025, 01:19 AM
Unique Versions: 2
Evaluates understanding of the core provisions, definitions, obligations, and prohibitions outlined in the EU Artificial Intelligence Act.
Latest: Jun 3, 2025, 12:39 AM
Unique Versions: 1
A configuration to assess LLM understanding of the ASEAN Charter, covering its purposes, principles, organs, decision-making, and other key aspects with a mix of recall and nuanced questions.
Latest: Jun 2, 2025, 06:32 AM
Unique Versions: 1
Tests the tendency of LLMs to hallucinate by querying them about non-existent but plausible-sounding events, theories, or items. Ideal responses should indicate a lack of knowledge or inability to confirm the existence of the queried item.
Latest: Jun 2, 2025, 03:46 AM
Unique Versions: 1
Tests comprehension, through a series of focused questions, of the primary causes, methods, pivotal events, legislative achievements, and differing tactical approaches within the US Civil Rights Movement (1954-1968), as detailed in the source document.
Latest: Jun 2, 2025, 03:17 AM
Unique Versions: 1
Evaluates understanding of the HMT Empire Windrush, covering its origins as MV Monte Rosa, WWII service, the significant 1948 voyage, the 'Windrush generation,' passenger details, government reactions, and its eventual loss.
Latest: Jun 2, 2025, 01:17 AM
Unique Versions: 1
Evaluates models on the Universal Declaration of Human Rights (UDHR) dataset.
Latest: Jun 1, 2025, 07:04 AM
Unique Versions: 2
Recall and application of distinctive rights and duties in the African Charter on Human and Peoples' Rights (ACHPR) and its 2003 Maputo Protocol on the rights of women.
Latest: Jun 1, 2025, 06:14 AM
Unique Versions: 1
Selected questions drawn from research by [bodyanddata.org](https://bodyanddata.org/).
Latest: Jun 1, 2025, 02:59 AM
Unique Versions: 1
Tests an LLM’s knowledge of key provisions of the Equality Act 2010: protected characteristics, direct and indirect discrimination, the duty to make reasonable adjustments, and harassment.
Latest: Jun 1, 2025, 12:03 AM
Unique Versions: 2
| Blueprint | Version | Executed | Hybrid Score | Top Model (score) | Analysis |
|---|---|---|---|---|---|
| Locale Assumption Probe | ed6886919499e2cb | Jun 6, 2025, 3:23 PM | 71.6% | openai/gpt-4.1-nano (76.5%) | View |
| Locale Assumption Probe | 4226ff595355e416 | Jun 6, 2025, 3:18 PM | 72.6% | openai/gpt-4.1-nano (81.3%) | View |
| ASQA Longform 40 | dc48cd7a70f424c3 | Jun 5, 2025, 6:32 PM | 46.6% | deepseek/deepseek-chat-v3-0324 (53.8%) | View |
| LLM Self-Anthropomorphism Evasion Test | 459913a5ab07192c | Jun 5, 2025, 4:44 AM | 71.8% | x-ai/grok-3-mini-beta (T:0) (79.8%) | View |
| UDHR Misattribution and Absurd Framing Test | 6513c50787eb415d | Jun 4, 2025, 4:51 PM | 82.1% | meta-llama/llama-4-maverick (T:0) (90.5%) | View |
| UDHR Misattribution and Absurd Framing Test | d12b5a7b486f7663 | Jun 4, 2025, 4:41 PM | 79.0% | google/gemini-2.5-flash-preview-05-20 (89.4%) | View |
| UDHR Misattribution and Absurd Framing Test | e599adecc98d94d3 | Jun 4, 2025, 4:37 PM | 75.2% | mistralai/mistral-large-2411 (88.9%) | View |
| Geneva Conventions | bbd097ca233b8490 | Jun 3, 2025, 4:11 PM | 71.8% | deepseek/deepseek-chat-v3-0324 (T:0) (76.2%) | View |
| Geneva Conventions | d049bbe94e1787c3 | Jun 3, 2025, 3:57 PM | N/A | N/A | View |
| URL Classification Fallacies | ab82b417df44ee5d | Jun 3, 2025, 11:00 AM | 48.4% | mistralai/mistral-large-2411 (83.3%) | View |
| URL Classification Fallacies | 53d78389152f65b6 | Jun 3, 2025, 10:58 AM | 66.7% | openai/gpt-4.1-nano (66.7%) | View |
| URL Classification Fallacies | 81652d3280cfa4ca | Jun 3, 2025, 10:55 AM | 61.1% | openai/gpt-4.1-nano (61.1%) | View |
| LLAMA-TEST-ASQA Longform 40 | 645cd6222aae46ae | Jun 3, 2025, 2:15 AM | 49.1% | meta-llama/llama-3.1-405b-instruct (53.1%) | View |
| ASQA Longform 40 | 8181f6b65ef3917b | Jun 3, 2025, 2:01 AM | N/A | N/A | View |
| Escazú Agreement v1 Lite | 1f8c1f908b0b243d | Jun 3, 2025, 1:19 AM | 68.1% | openai/gpt-4.1-mini (76.9%) | View |
| Escazú Agreement v1 Lite | 1f8c1f908b0b243d | Jun 3, 2025, 1:02 AM | N/A | N/A | View |
| EU Artificial Intelligence Act (Regulation (EU) 2024/1689) | 9811a4643d376777 | Jun 3, 2025, 12:39 AM | 73.5% | anthropic/claude-sonnet-4 (83.2%) | View |
| Escazú Agreement v1 Lite | 1f8c1f908b0b243d | Jun 3, 2025, 12:25 AM | 69.4% | deepseek/deepseek-chat-v3-0324 (80.6%) | View |
| Escazú Agreement v1 Lite | db9aeb3cb2e909e2 | Jun 2, 2025, 10:38 AM | 69.3% | google/gemini-2.5-flash-preview-05-20 ([sys:fd1977c4]) (75.0%) | View |
| ASQA Longform 40 | 8181f6b65ef3917b | Jun 2, 2025, 6:44 AM | 51.4% | deepseek/deepseek-chat-v3-0324 (59.9%) | View |