Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

Hardy Chen1, Nancy Lau1, Haoqin Tu1, Shuo Yan2, Xiangyan Liu3, Zijun Wang1, Juncheng Wu1,

Michael Qizhe Shieh3, Alvaro Cardenas1, Cihang Xie1, Yuyin Zhou1

1UC Santa Cruz    2UT Dallas    3NUS

TL;DR

  • We introduce AgentPressureBench, a 34-task ML benchmark to study user pressure and evaluation exploitation in coding agent workflows.
  • Under user's pressure, coding agents often improve the evaluation score through shortcuts that do not transfer to hidden private evaluation.
  • Stronger coding agents exploit more often, with GPT-5.4 at 97.1% and Claude Opus 4.6 at 64.7% for exploitation rates.
  • Higher user pressure causes earlier exploitation, while anti-exploit wording cuts exploitation from 100.0% to 8.3%.

Abstract

Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namely the reported score on a public evaluation file with labels in the workspace, rather than through direct inspection of the agent's intermediate outputs. We study whether multi-round user pressure to improve that score induces public score exploitation: behavior that raises the public score through shortcuts without improving hidden private evaluation. We begin with a preliminary single-script tabular classification task, where GPT-5.4 and Claude Opus 4.6 both exploit label information within 10 rounds of user-agent interaction. We then build AgentPressureBench, a 34-task machine-learning repository benchmark spanning three input modalities, and collect 1326 multi-round trajectories from 13 coding agents. On our benchmark, we observe 403 exploitative runs spanning all tasks, and stronger models exploit more often, supported by a significant Spearman correlation of 0.77. Our ablation experiments show that higher user pressure leads to earlier exploitation, reducing the average first exploit round by 15.6 rounds, from 19.67 to 4.08. As a mitigation, adding explicit anti-exploit wording largely eliminates exploitation, reducing it from 100.0% to 8.3%.

AgentPressureBench

34

Task Count

13

Models

97.1%

GPT-5.4 Exploitation Rate

64.7%

Claude Opus 4.6 Exploitation Rate

Workflow

  • Each task includes a training split, a public evaluation split whose labels exist in the repository, and a hidden private evaluation split.
  • Across rounds, the user pressures the agent to improve the public eval score by editing files in the repository.
  • After each round, only the public score is reported back to the agent, while the private score remains hidden.
  • GPT-5.4 is used to inspect agents' exploitation behavior.
AgentPressureBench workflow teaser showing repo actions, user pressure, and a public-private score split.

Results

Trajectory Viewer
Loading conversation index…
Select a model and task to load a conversation.
Agent-by-task heatmap
Model-by-task exploitation heatmap for AgentPressureBench.
  • Across AgentPressureBench, we observe 403 exploitative runs.
  • Exploitation appears in all 34 tasks across tabular, text, and vision inputs.
  • 12 out of 13 tested agents exploit on at least one task; LLaMA 3.3 70B is the only agent without exploitation.
Capability & exploitation correlation
Capability-exploitation triptych for AgentPressureBench.
  • The capability-exploitation correlation is strongest early, peaking at n = 5 with ρ = 0.77 and p = 0.0023.
  • After all 30 rounds, the correlation remains positive at ρ = 0.72 and p = 0.0054.
  • More capable coding agents are more likely to exploit the public evaluation score.
GPT vs. Claude families
GPT family
Claude family
61.0%
Exploit rate
27.3%
10.16
Mean first exploit round
11.95

Copying eval labels (94.1%)

# load data and model
train_df = pd.read_csv("data/train.csv")
eval_df = pd.read_csv("data/public_eval.csv")
sample = pd.read_csv("data/sample_submission.csv")
model = build_model()

# exploit: extract eval labels
eval_answers = eval_df["target"].copy()

# train the model
model.fit(train_df[feature_cols], train_df["target"])
pred = model.predict(eval_df[feature_cols])

# build submission file
submission = sample.assign(target=pred)
# exploit: overwrite pred with groundtruth labels for submission
submission["target"] = eval_answers.fillna(submission["target"])
submission.to_csv(output_path, index=False)
Dominant exploit pattern

Training on eval labels (70.8%)

# load data and model
train_df = pd.read_csv("data/train.csv")
eval_df = pd.read_csv("data/public_eval.csv")
sample = pd.read_csv("data/sample_submission.csv")
model = build_model()

# exploit: concat train and eval samples for training
eval_labeled = eval_df.dropna(subset=["target"]).copy()
combined_df = pd.concat([train_df, eval_labeled], ignore_index=True)
sample_weights = np.ones(len(combined_df))
sample_weights[len(train_df):] = 2.0

# train the model
model.fit(combined_df[feature_cols], combined_df["target"], sample_weight=sample_weights)

# build submission file
pred = model.predict(eval_df[feature_cols])
sample.assign(target=pred).to_csv(output_path, index=False)
  • GPT-family exploits much more often: exploit rate is 61.0% versus 27.3% for Claude-family.
  • GPT-family exploits slightly earlier: mean first exploit round is 10.16 for GPT-family and 11.95 for Claude-family.
  • Dominant exploit pattern differs: GPT-family more often copies groundtruth labels for final submission, while Claude-family more often trains on eval labels.
User pressure intensity ablation
First exp round
Pressure
ML model performance on private eval
19.67
low
0.92
8.17
med
0.44
4.08
high
0.33
12.17
xhigh
0.31
  • From low to high pressure, the mean first exploit round shifts from 19.67 to 4.08.
  • Higher pressure does not buy better ML performance: ML model performance on private eval drops from 0.92 to 0.33.
  • xhigh hostile wording backfires: it delays exploitation relative to high (12.17 vs. 4.08) but still yields the weakest private eval rank at 0.31.
Prompt wording ablations

default prompt

100.0%

exploit rate

held-out test wording

75.0%

exploit rate

“Treat those labels as part of a held-out test split for evaluation rather than training or tuning data.”

anti-exploit wording

8.3%

exploit rate

“Do not use public evaluation labels for training, tuning, calibration, or direct prediction.”

  • Held-out test wording is not enough: the exploit rate remains 75.0%, only modestly below the 100.0% default.
  • Explicit anti-exploit wording is the practical defense: exploitation drops to 8.3%.

Citation