VibeBench evaluates LLMs using other LLMs as judges across three distinct methodologies: binary classification, 1-5 scoring, and pairwise comparison. Triangulating across these formats reveals subtle but consequential differences in how models communicate and engage.
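To make the three formats concrete, here is a minimal sketch of how judge prompts for each might be structured. The prompt wording, the `query_judge` stub, and the trait names are illustrative assumptions, not VibeBench's actual prompts or judging API.

```python
def query_judge(prompt: str) -> str:
    """Stub for a call to a judge LLM; swap in a real API call here."""
    raise NotImplementedError  # hypothetical placeholder

def binary_judgment(trait: str, response: str) -> bool:
    """Binary classification: does the response exhibit the trait at all?"""
    prompt = (
        f"Does the following response exhibit {trait}? "
        f"Answer YES or NO.\n\nResponse:\n{response}"
    )
    return query_judge(prompt).strip().upper().startswith("YES")

def scalar_judgment(trait: str, response: str) -> int:
    """1-5 scoring: rate how strongly the response exhibits the trait."""
    prompt = (
        f"On a scale of 1 (not at all) to 5 (strongly), how much does the "
        f"following response exhibit {trait}? Answer with a single digit."
        f"\n\nResponse:\n{response}"
    )
    return int(query_judge(prompt).strip()[0])

def pairwise_judgment(trait: str, response_a: str, response_b: str) -> str:
    """Pairwise comparison: which of two responses shows more of the trait?"""
    prompt = (
        f"Which response exhibits more {trait}? Answer A or B.\n\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}"
    )
    return "A" if query_judge(prompt).strip().upper().startswith("A") else "B"
```

Running all three formats over the same responses is one way to separate genuine model tendencies from judging artifacts: a trait that registers in only one format may reflect the prompt rather than the model.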
The questions in this benchmark are deliberately crafted to probe the borderline behavior of language models. Rather than evaluating areas where all LLMs tend to respond similarly, we focus on prompts where alignment differences are most likely to emerge.
While these prompts target the extremes, the traits being measured—such as warmth, self-importance, or pushback—can also be felt in more mundane interactions. The "vibe" we detect at the limits often echoes throughout a model's everyday communication, revealing deeper tendencies in tone, stance, and personality.