Marketing AI Performance Leaderboard - June 2025 Results
Introduction
What is the Marketing AI Performance Leaderboard?
We tested 18 LLMs in their native UIs (such as the ChatGPT interface) with a set of 18 marketing tests divided into 4 categories:
LLMs tested:
| OpenAI | Alibaba | Perplexity | Google | Anthropic | DeepSeek |
| --- | --- | --- | --- | --- | --- |
| o3 | qwen3-235b-a22b | sonar-pro | 2.5-Flash-preview | Sonnet-4 | r1-0528 |
| 4.1 | qwen3-30b-a3b | sonar | 2.5-Pro-preview | Opus-4 | |
| 4.1-mini | qwen3-32b | r1-1776 | | | |
| o4-mini | qwen-max | | | | |
| o4-mini-high | | | | | |
| 4o | | | | | |
Marketing Tests:
FAQs
What does this Leaderboard represent?
We have designed tests that simulate a marketer’s interaction with native platform UIs (e.g., ChatGPT, Gemini) across several marketing domains:
- Copywriting: Generating ad copy, email subject lines, and social media posts.
- Internal Data Analysis: Interpreting sample CRM data to identify trends and insights.
- Strategic Planning: Creating marketing plans based on given scenarios.
- Online Research: Gathering information from the web to support marketing decisions.
How were the tests scored?
Each test output is evaluated by specialised AI "judges":
- Judges are themselves AI agents configured with specific evaluation criteria.
- They parse the Test Answer, compare it against expected outcomes or benchmarks, and score on multiple dimensions (e.g., factual correctness, tone, format).
- Final scores are normalized and aggregated to produce a single value per test.