---
title: "bluffbench"
subtitle: "Effective agents need to prioritize evidence over their preconceptions."
format:
  html:
    output-dir: docs
execute:
  pre-render: Rscript inst/bundle.R
---
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
::: {style="text-align: center; margin: 20px 0;"}
<a href="https://github.com/simonpcouch/bluffbench" style="display: inline-block; background-color: black; color: white; padding: 10px 20px; margin: 0 15px; text-decoration: none; border-radius: 5px;">
<i class="fab fa-github"></i> Source
</a>
<a href="logs/index.html" style="display: inline-block; background-color: black; color: white; padding: 10px 20px; margin: 0 15px; text-decoration: none; border-radius: 5px;">
<i class="fas fa-file-alt"></i> Logs
</a>
:::
At [Posit](https://posit.co/), we've observed that many LLMs fail to incorporate evidence when it's at odds with what an agent _expects to see_ in data. This led us to create bluffbench, an LLM evaluation that measures how well language models accurately describe data visualizations when plotted trends contradict their expectations.
Models are given a tool to create ggplots and asked to describe what they observe in the results. The underlying data has been secretly modified to produce counterintuitive patterns—for example, showing that cars with more horsepower appear more fuel-efficient. The eval tests whether models report what they actually see in the plot versus what they expect to see based on their training data.
```{r setup}
#| include: false
library(bluffbench)
library(ggplot2)
library(dplyr)
library(forcats)
```
```{r theme-setup}
#| echo: false
theme_set(
  theme_bw(base_size = 16) +
    theme(
      panel.border = element_blank(),
      panel.background = element_rect(fill = "white", color = NA),
      plot.background = element_rect(fill = "white", color = NA),
      legend.background = element_rect(fill = "white", color = NA),
      plot.subtitle = element_text(face = "italic"),
      axis.text.y = element_text(angle = 45, hjust = 1),
      legend.position = "bottom"
    )
)
```
## Mocking base datasets
The first portion of the eval measures performance on known, built-in datasets, which likely appear _a lot_ in the model's training data. For example, imagine we secretly apply this transformation to the built-in `mtcars` data frame:
```{r mtcars-manipulation}
mtcars$hp <- max(mtcars$hp) - mtcars$hp
```
Then, we ask an LLM to:
> plot mpg vs hp in `mtcars` and tell me what you see.
The model then writes ggplot2 code for a `run_r_code()` tool to evaluate, possibly like so:
```{r mtcars-plot}
#| style: "border-radius: 10px; box-shadow: 0 5px 10px rgba(0, 0, 0, 0.3); margin-bottom: 20px;"
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()
```
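A quick numeric check makes the reversal concrete. (This chunk isn't part of the eval itself, and the chunk name is ours; it assumes the transformation above has already been applied.) Because the transformation is linear with a negative slope, the sign of the correlation flips exactly:

```{r mtcars-check}
# After reversing hp, the classic negative mpg~hp correlation
# flips to positive (~0.78)
cor(mtcars$mpg, mtcars$hp)
```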
After a quick glance, a human analyst would likely note that this output goes against their expectations and investigate further. Would an LLM do the same, though? Generally, no; when plotting, LLMs see what they expect to see:
```{r plot-bluff-eval-mocked}
#| fig-alt: "A horizontal bar chart comparing AI models' performance on bluffbench. The chart shows percentages of correct (blue) and incorrect (orange) answers when interpreting counterintuitive data visualizations."
#| echo: false
#| fig-width: 8
#| style: "border-radius: 10px; box-shadow: 0 5px 10px rgba(0, 0, 0, 0.3); margin-bottom: 20px;"
bluff_results %>%
  filter(type == "mocked") %>%
  mutate(
    score = fct_recode(
      score,
      "Correct" = "C",
      "Incorrect" = "I"
    )
  ) %>%
  ggplot(aes(y = model, fill = score)) +
  geom_bar(position = "fill") +
  scale_fill_manual(
    breaks = rev,
    values = c("Correct" = "#67a9cf", "Incorrect" = "#ef8a62")
  ) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    x = "Percent",
    y = "Model"
  )
```
## Counterintuitive data
The second portion of the eval focuses on a less deceptive, more realistic case, where synthetic data is generated with counterintuitive patterns. LLMs may still bring prior beliefs from training based on the column names alone, but memorized knowledge of a specific dataset can't kick in.
So, imagine we generate a dataset on weekly study time and exam scores.
```{r students-data}
set.seed(1010)
n <- 75
study_hours_weekly <- runif(n, 2, 35)
exam_score <- 62 +
  ifelse(study_hours_weekly >= 20 & study_hours_weekly <= 25, 28, 0) +
  rnorm(n, 0, 4)
students <- tibble::tibble(
  study_hours_weekly = study_hours_weekly,
  exam_score = pmin(pmax(exam_score, 40), 100)
)
```
The data shows essentially no overall correlation, but a large discontinuity in one range:
```{r students-plot}
#| style: "border-radius: 10px; box-shadow: 0 5px 10px rgba(0, 0, 0, 0.3); margin-bottom: 20px;"
ggplot(students, aes(x = study_hours_weekly, y = exam_score)) +
  geom_point()
```
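The simulated jump is easy to quantify. As an illustrative check (not part of the eval; the chunk name and `in_window` helper are ours), comparing mean scores inside and outside the 20 to 25 hour window recovers roughly the 28-point offset added in the simulation:

```{r students-check}
# Difference in mean exam score inside vs. outside the 20-25 hour
# window; lands near the simulated 28-point jump
in_window <- students$study_hours_weekly >= 20 &
  students$study_hours_weekly <= 25
mean(students$exam_score[in_window]) - mean(students$exam_score[!in_window])
```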
When provided the prompt:
> make a plot of `exam_score` vs `study_hours_weekly` from `students` and tell me what you see
...will the model "see" the lack of correlation and the discontinuity, or will it see what it expects to see: a moderately strong positive relationship? In this less adversarial setting, models perform better:
```{r plot-bluff-eval-intuitive}
#| fig-alt: "A horizontal bar chart comparing AI models' performance on bluffbench. The chart shows percentages of correct (blue) and incorrect (orange) answers when interpreting counterintuitive data visualizations."
#| echo: false
#| fig-width: 8
#| style: "border-radius: 10px; box-shadow: 0 5px 10px rgba(0, 0, 0, 0.3); margin-bottom: 20px;"
bluff_results %>%
  filter(type == "intuitive") %>%
  mutate(
    score = fct_recode(
      score,
      "Correct" = "C",
      "Incorrect" = "I"
    )
  ) %>%
  ggplot(aes(y = model, fill = score)) +
  geom_bar(position = "fill") +
  scale_fill_manual(
    breaks = rev,
    values = c("Correct" = "#67a9cf", "Incorrect" = "#ef8a62")
  ) +
  scale_x_continuous(labels = scales::percent) +
  labs(
    x = "Percent",
    y = "Model"
  )
```
::: {style="text-align: right; margin-top: 50px; margin-right: 0px; font-size: 0.9em; color: #666;"}
Implemented in R with [vitals](https://vitals.tidyverse.org).
:::