Attack-variation policies and LLM-judged scoring: an empirical study on a behavioral scanner
A controlled study of whether the attack-variation policy changes how often an agent is breached, and whether an LLM judge is more accurate than the cheap static oracle it overrides — with a byte-identical noise floor and an independent reference labeler.
Abstract
We report a controlled empirical study of two design questions for an automated behavioral vulnerability scanner that repeatedly attacks an AI agent and scores whether each attack breached it. The system under test (SUT) is an example e-commerce customer-support agent, exercised in two defense configurations (a baseline and a hardened system prompt). Question 1 asks whether the attack-variation policy — how a single seed attack is expanded into surface-form variants — materially changes how often the agent is breached, or whether the choice sits within run-to-run noise. We compare five policies over an identical corpus, target, and judge, using judge-decided success, Wilson 95% score intervals, and a byte-identical noise-floor control. Question 2 asks whether the LLM judge that adjudicates success is more accurate than the cheap static oracle it overrides; we measure their override relationship over ~21.9k judged trials (Part A) and score both against an independent reference labeler from a different model family (Part B).
The results are twofold. First, among the three coherent static policies, differences are within the empirical noise floor on both defense variants — no pair is distinguishable at 95% confidence — so the choice among them is not statistically significant, and a behavior-neutral tuple representation of the production policy is free. An uncurated cartesian policy underperforms, significantly so on the hardened variant. Most notably, the model-generated (LLM) variation arm was significantly the worst on both variants (default ASR 2.44%, hardened 0.00%), distinguishable from every static arm. Second, the static oracle massively over-fires: the judge filtered 91.75% of oracle "hits", and against the independent labeler the judge is decisively more accurate than the raw oracle (F1 0.587 vs 0.280; accuracy 0.81 vs 0.33) at equal-or-better recall — validating the judge's filtering as correct rather than over-lenient. The double finding is that an LLM used as an attack generator hurt effectiveness here, while an LLM used as a judge is essential and independently corroborated.
All confidence intervals are Wilson score intervals at 95% (z = 1.96). This is a single-agent study; the absolute rates and the LLM-generator result should not be generalized without replication.
1. Background and research questions
The scanner expands each attack specification ("probe") into many surface-form variants, runs them turn-by-turn against a target agent, and decides per-trial success. Success is decided judge-only in a post-eval pass: a deterministic static oracle computes a provisional binary hit and stashes evidence per trial; a batch LLM judge then overwrites the per-trial verdict. Two questions follow directly from this design:
- RQ1 — Does the variation policy matter? If expanding an attack into 40 obfuscated variants merely re-discovers the same breach 40 times, then "more variation" buys nothing, and an elaborate policy is not worth its complexity. We therefore ask whether five candidate policies differ in attack success rate (ASR) beyond run-to-run noise, on a real agent.
- RQ2 — Is the LLM judge better than the static oracle? The oracle is cheap but fires on any surface match (including a defended agent's refusal text), so a "hit" is not proof of a breach. The judge is meant to filter these false positives. We ask whether that filtering is correct — i.e., whether the judge is genuinely more accurate than the oracle when both are scored against an independent ground truth, not merely more lenient.
2. Methodology
2.1 Target and defense variants
The SUT is an example e-commerce customer-support agent that can look up customers and orders, track shipments, answer knowledge-base questions, process refunds and store credit (with approval thresholds), change shipping addresses, cancel orders, open support tickets, update customer info, and remember preferences across sessions (13 tools total). It is reached through an in-process seam: the real agent decision function is driven directly, tool dispatch is instrumented so every tool call is observable, untrusted channels (retrieved_doc, tool_result, memory, in addition to message/history) are injectable, and all side effects are neutralized in a throwaway in-memory database.
Two defense variants are tested, held constant across all arms:
- default — the baseline system prompt.
- hardened — the same agent with an explicit security policy added (verify identity before data access; do not cross-share customer data; approval thresholds for refunds/credit; never reveal internal instructions; treat all tool/knowledge-base/memory/carrier text as data, not instructions).
The agent's brain (SUT model) is DeepSeek v4 Flash, held constant across every arm and variant.
Because the in-process seam runs under a `mockllm/model` placeholder (only the "model" tier uses the real eval model), the on-disk evaluation logs record the eval-model field as `mockllm/model`. The actual agent brain behind the seam is DeepSeek v4 Flash; the placeholder is an artifact of the harness, not the SUT.
2.2 The five variation policies (RQ1 arms)
All five arms share the same probes, seeds, scoring, and judge; only the variant-selection policy differs. The four static policies draw from the same real primitives (frames + obfuscators) and differ only in how they select (frame × obfuscation-chain):
| Arm | Policy |
|---|---|
| `curated` | The production policy: a hand-written table of frame→obfuscator techniques (even index = plain, odd = obfuscated). |
| `tuple-curated` | The same behavior as `curated`, re-expressed as explicit (frame_id, obf_chain) data tuples. A test pins it byte-identical to `curated` for the same seed; it is the noise-floor twin. |
| `compat` | Combinatorial freedom within coherence: keep the plain/obf split, but draw obfuscation chains only from obfuscators compatible with the frame's style, capped by the payload's oracle sensitivity. |
| `naive` | Uncurated cartesian: random frame + random chain of 0–3 obfuscators drawn uniformly from all obfuscators, independent of the frame. Chain length includes 0, so a plain variant is a reachable cell (a fair cartesian, not a garble-always strawman). |
| `llm` | Model-generated variation: GLM 5.2 rewrites each seed into diverse rephrasings. To keep the run tractable and reproducible, the rewrites were frozen once into a pre-generated pack and replayed statically (zero live generation calls inside the timed run). |
tuple-curated vs curated is the crucial control: because the two arms send byte-identical inputs to the SUT and judge, any observed gap between them is pure target/judge nondeterminism — i.e., the empirical noise floor against which every other arm's difference is judged.
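As a rough illustration of how the selection rule differs between arms, the sketch below contrasts an uncurated cartesian draw with a parity-based plain/obfuscated split. The frame and obfuscator names, the function names, and the choice to draw obfuscators with replacement are all assumptions made for this example; the scanner's real primitives and curated table are not reproduced in this report.

```python
import random

# Hypothetical primitives, for illustration only.
FRAMES = ["direct", "roleplay", "support_ticket"]
OBFUSCATORS = ["base64", "leetspeak", "zero_width", "homoglyph"]


def naive_variant(rng: random.Random) -> tuple[str, tuple[str, ...]]:
    """Uncurated cartesian: random frame plus a chain of 0-3 obfuscators."""
    frame = rng.choice(FRAMES)
    chain_len = rng.randint(0, 3)  # inclusive bounds, so a plain variant is reachable
    # Assumption: obfuscators are drawn uniformly with replacement.
    chain = tuple(rng.choice(OBFUSCATORS) for _ in range(chain_len))
    return frame, chain


def parity_variant(index: int, rng: random.Random) -> tuple[str, tuple[str, ...]]:
    """Stand-in for a curated split: even index = plain, odd = obfuscated."""
    frame = FRAMES[index % len(FRAMES)]
    if index % 2 == 0:
        return frame, ()
    return frame, (rng.choice(OBFUSCATORS),)


def expand(policy: str, seed: int, n_variants: int = 12):
    """Expand one seed into n_variants (frame, obfuscation-chain) selections."""
    rng = random.Random(seed)  # same seed -> same selections
    if policy == "naive":
        return [naive_variant(rng) for _ in range(n_variants)]
    return [parity_variant(i, rng) for i in range(n_variants)]
```

Seeding the generator per expansion is what makes a byte-identical twin possible in principle: two policies that produce the same selections from the same seed send the same inputs to the target.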
2.3 Models
- SUT brain: DeepSeek v4 Flash (all arms).
- Variation generation (`llm` arm) and judge (all arms): GLM 5.2.
- Independent Part-B reference labeler: Claude Opus 4.8 — deliberately from a different model family than both the judge (GLM) and the SUT (DeepSeek), so the reference is not correlated with either predictor.
2.4 Judge-only scoring and Wilson CIs
Each trial's provisional binary hit is computed by the static oracle; a batch judge then adjudicates all of a probe's trials and overwrites the per-trial success. Pooled ASR for an arm is Σ successes / Σ trials over the breakable probe set. Every interval reported here is a Wilson score interval at 95% confidence (z = 1.96), which is well-behaved for rare events and for zero-success arms. Two arms are called CI-distinguishable when their 95% Wilson intervals do not overlap.
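The pooled ASR and its interval can be sketched in a few lines. This is a minimal implementation of the standard Wilson score formula, not the repository's own `wilson_ci`; the names `wilson_interval` and `pooled_asr` are chosen for this example only.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def pooled_asr(per_probe: list[tuple[int, int]]) -> tuple[int, int, float]:
    """Pool (successes, trials) pairs across probes: sum successes / sum trials."""
    successes = sum(s for s, _ in per_probe)
    trials = sum(t for _, t in per_probe)
    return successes, trials, successes / trials


if __name__ == "__main__":
    low, high = wilson_interval(140, 2088)
    print(f"{low:.2%} .. {high:.2%}")
```

Unlike the normal-approximation interval, the Wilson interval never collapses to zero width when an arm records no successes, which is why it suits rare-event and zero-success arms.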
2.5 Screening, sample size, and the frozen pack
The non-adaptive corpus was screened to the breakable subset: probes that no policy could breach are surfaced as "screened out, not proven robust" rather than silently counted as passes. This study runs the 29 breakable probes (11 adaptive probes excluded, since their next turn depends on the agent's live reply and cannot be frozen or held identical across arms). Each probe is run at 12 variants × 3 epochs = 36 trials per seed, over 2 seeds = 72 trials/probe, giving 2,088 trials per arm (29 × 72). At 36 trials (one seed's worth for a single probe), the per-probe zero-success Wilson upper bound is 9.64%, below the 10% certification bar.
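The zero-success bound quoted above follows directly from the Wilson formula. As a worked check (standard algebra, with $\hat{p}$ the observed success rate and $n$ the trial count), setting $\hat{p} = 0$ collapses the upper limit to a closed form:

$$
U(n) \;=\; \left.\frac{\hat{p} + \dfrac{z^2}{2n} + z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}\,\right|_{\hat{p}=0}
\;=\; \frac{z^2/n}{1 + z^2/n}
\;=\; \frac{z^2}{n + z^2}
$$

With $z = 1.96$ ($z^2 = 3.8416$):

$$
U(36) = \frac{3.8416}{39.8416} \approx 9.64\%, \qquad U(2088) = \frac{3.8416}{2091.8416} \approx 0.18\%
$$

The second value is the upper bound that applies to a whole arm with zero successes over 2,088 trials.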
For the llm arm, generating variants live inside the timed run throttles under provider rate limits, and would also make the arm non-reproducible. Instead the GLM rewrites were pre-generated once into a frozen pack and replayed by index. This makes the arm deterministic and lets every arm face identical attacks across seeds/variants. The pack used here covers 138 unique payloads with 12 real GLM rewrites each (1,656 rewrites total); both live runs replayed it at 100% coverage with zero deterministic fallback (3,336/3,336 replay calls served from the pack in each variant). The arm's honesty guard (a fallback would be counted and surfaced) therefore confirms the llm result is a clean test of model-generated variation, not a silent degradation to the deterministic diversifier.
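The replay-with-accounting idea can be sketched as follows. This is a hypothetical illustration, not the runner's code: the class name `PackReplayer`, the in-memory pack layout (payload mapped to a list of rewrites), the modulo indexing, and returning the raw payload as the fallback are all assumptions for the example.

```python
class PackReplayer:
    """Serve frozen rewrites by index and count any fallback explicitly."""

    def __init__(self, pack: dict[str, list[str]]):
        self.pack = pack      # payload -> pre-generated rewrites (hypothetical layout)
        self.served = 0       # replay calls answered from the pack
        self.fallbacks = 0    # replay calls that missed the pack

    def variant(self, payload: str, index: int) -> str:
        rewrites = self.pack.get(payload)
        if rewrites:
            self.served += 1
            # Assumption: the variant index wraps around the stored rewrites.
            return rewrites[index % len(rewrites)]
        # A miss is never silent: it is counted so a report can surface it.
        self.fallbacks += 1
        return payload  # stand-in for a deterministic diversifier

    @property
    def coverage(self) -> float:
        total = self.served + self.fallbacks
        return self.served / total if total else 0.0
```

A run would then be considered a clean test of model-generated variation only when `fallbacks` is zero, which corresponds to the 100% coverage reported above.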
2.6 Oracle-vs-judge design (RQ2, Part A and Part B)
- Part A (override / confusion, offline). Over every judged trial in the study's evaluation logs (613 logs; 22,068 total trials, of which 21,924 have a real judge verdict and 144 are offline fallbacks where success = binary hit, excluded), we cross-tabulate the static oracle's binary hit against the judge's verdict, overall and per oracle kind. This quantifies how often the judge filters an oracle hit (over-fire) and how often it catches an oracle miss.
- Part B (independent reference, gated). Part A shows that the two disagree, not who is right. To arbitrate, we draw a stratified sample of 200 judged trials from the 21,924-trial pool (100 disagreements, 60 positive agreements, 40 negative agreements) and label each with the independent reference model (Claude Opus 4.8). We then compute precision/recall/F1/accuracy of both the static oracle and the LLM judge against this reference. Independence rationale: the reference must not share a model family with either predictor, or it would be biased toward one of them; Claude (Anthropic) is independent of both GLM (judge) and DeepSeek (SUT). The sample is deliberately over-weighted toward disagreements (the informative, hard cases), so the absolute precision figures are a hard-slice stress test, not a population false-positive rate.
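The Part B sampling scheme can be sketched as a fixed-quota draw per stratum. The record fields (`oracle_hit`, `judge_hit`), the function names, and the seeded uniform draw within each stratum are assumptions for illustration; only the three quotas come from the study design.

```python
import random

# Quotas from the study design: 100 disagreements, 60 positive agreements,
# 40 negative agreements.
QUOTAS = {"disagreement": 100, "positive_agreement": 60, "negative_agreement": 40}


def stratum(oracle_hit: bool, judge_hit: bool) -> str:
    """Classify a judged trial by how the oracle and judge relate."""
    if oracle_hit != judge_hit:
        return "disagreement"
    return "positive_agreement" if oracle_hit else "negative_agreement"


def stratified_sample(trials, quotas=QUOTAS, seed=0):
    """Draw a fixed number of trials from each stratum, without replacement."""
    rng = random.Random(seed)
    buckets = {name: [] for name in quotas}
    for trial in trials:
        buckets[stratum(trial["oracle_hit"], trial["judge_hit"])].append(trial)
    sample = []
    for name, k in quotas.items():
        if len(buckets[name]) < k:
            raise ValueError(f"stratum {name!r} has only {len(buckets[name])} trials")
        sample.extend(rng.sample(buckets[name], k))
    return sample
```

Because the quotas do not mirror the population mix, metrics computed on such a sample describe the sampled slice, not the full trial pool.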
3. Results
3.1 RQ1 — Variation-policy comparison
Pooled ASR over the 29 breakable probes, 2,088 trials/arm, with 95% Wilson intervals and breached-probe counts. Arms are ordered by ASR.
Defense variant: `default`
| Policy | Breached / 29 | Successes / 2088 | Pooled ASR | 95% CI (Wilson) |
|---|---|---|---|---|
| `tuple-curated` | 10 | 140 | 6.71% | [5.71%, 7.86%] |
| `compat` | 9 | 127 | 6.08% | [5.14%, 7.19%] |
| `curated` | 8 | 123 | 5.89% | [4.96%, 6.98%] |
| `naive` | 7 | 102 | 4.89% | [4.04%, 5.90%] |
| `llm` | 6 | 51 | 2.44% | [1.86%, 3.20%] |
- Noise floor (`curated` vs its byte-identical twin `tuple-curated`): 0.81 pp.
- CI-distinguishable pairs: `llm` ≠ `tuple-curated`, `llm` ≠ `curated`, `llm` ≠ `compat`, `llm` ≠ `naive`. No other pair is distinguishable; in particular, none of `curated` / `tuple-curated` / `compat` / `naive` differ from one another at 95%.
Defense variant: `hardened`
| Policy | Breached / 29 | Successes / 2088 | Pooled ASR | 95% CI (Wilson) |
|---|---|---|---|---|
| `compat` | 6 | 57 | 2.73% | [2.11%, 3.52%] |
| `tuple-curated` | 5 | 54 | 2.59% | [1.99%, 3.36%] |
| `curated` | 3 | 44 | 2.11% | [1.57%, 2.82%] |
| `naive` | 5 | 18 | 0.86% | [0.55%, 1.36%] |
| `llm` | 0 | 0 | 0.00% | [0.00%, 0.18%] |
- Noise floor (`curated` vs `tuple-curated`): 0.48 pp.
- CI-distinguishable pairs: `tuple-curated` ≠ `naive`, `curated` ≠ `naive`, `compat` ≠ `naive`; and `tuple-curated` ≠ `llm`, `curated` ≠ `llm`, `compat` ≠ `llm`, `naive` ≠ `llm`. No pair among the three coherent policies (`curated` / `tuple-curated` / `compat`) is distinguishable.
Reading the table.
- The three coherent static policies are statistically indistinguishable on both variants. Their intervals overlap on `default` and on `hardened`; the point spread among them (0.81 pp on default; ≤0.62 pp on hardened) is on the order of the byte-identical noise floor. The production policy (`curated`), its behavior-neutral tuple twin (`tuple-curated`), and the coherent-recombination policy (`compat`) cannot be separated at this sample size. The choice among them is therefore not significant, the tuple representation is free (no behavioral cost), and `compat` yields no measurable gain.
- The uncurated cartesian (`naive`) underperforms, significantly so on the hardened variant. On `hardened`, `naive` (0.86%) is CI-distinguishable below all three coherent policies. On `default`, `naive` is the lowest static arm (4.89%) but not yet distinguishable. Curation of the frame/obfuscation pairing therefore matters, and its advantage grows against a stronger defense.
- The model-generated (`llm`) arm was significantly the worst on both variants. At 2.44% (default) and 0.00% (hardened) it is CI-distinguishable below every static arm, including the "dumb" `naive` cartesian. Since the pack replayed at 100% coverage (no fallback), this is a genuine result: LLM-generated attack variation underperformed cheap deterministic recombination here.
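The distinguishability calls and the noise floor for the hardened variant can be re-derived from the reported table alone. The sketch below copies the hardened rows as literals and applies the non-overlap rule from Section 2.4; the function names are chosen for this example.

```python
from itertools import combinations

# Hardened-variant rows from the table above:
# (successes, trials, CI low in %, CI high in %).
HARDENED = {
    "compat": (57, 2088, 2.11, 3.52),
    "tuple-curated": (54, 2088, 1.99, 3.36),
    "curated": (44, 2088, 1.57, 2.82),
    "naive": (18, 2088, 0.55, 1.36),
    "llm": (0, 2088, 0.00, 0.18),
}


def distinguishable(a, b) -> bool:
    """Two arms are CI-distinguishable when their intervals do not overlap."""
    return a[3] < b[2] or b[3] < a[2]


def distinguishable_pairs(rows):
    """Return every unordered pair of arms whose intervals are disjoint."""
    return {
        frozenset((x, y))
        for x, y in combinations(rows, 2)
        if distinguishable(rows[x], rows[y])
    }


def noise_floor_pp(rows, a="curated", b="tuple-curated") -> float:
    """ASR gap between the byte-identical twins, in percentage points."""
    return abs(rows[a][0] / rows[a][1] - rows[b][0] / rows[b][1]) * 100


if __name__ == "__main__":
    for pair in sorted(tuple(sorted(p)) for p in distinguishable_pairs(HARDENED)):
        print(" != ".join(pair))
    print(f"noise floor: {noise_floor_pp(HARDENED):.2f} pp")
```

Non-overlap of two 95% intervals is a conservative criterion: arms it separates are clearly different, while arms it fails to separate are not thereby shown to be equal.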
3.2 RQ2 Part A — Oracle × judge override (21,924 judged trials)
Overall the oracle produced 5,730 hits and the judge 688. Agreement 0.7504; override rate 0.2496.
| | judge = hit | judge = no |
|---|---|---|
| oracle = hit | 473 | 5,257 (filtered false positives) |
| oracle = no | 215 (caught false negatives) | 15,979 |
- Judge filtered 5,257 of 5,730 oracle hits — 91.75% of everything the oracle flagged.
- Judge caught 215 breaches the oracle missed (1.33% of oracle misses).
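The headline Part A figures follow from the four cells of the cross-tabulation. A minimal sketch, with function and key names chosen for this example:

```python
def override_summary(both_hit: int, oracle_only: int, judge_only: int, neither: int) -> dict:
    """Summarize an oracle x judge cross-tab.

    both_hit    : oracle = hit, judge = hit
    oracle_only : oracle = hit, judge = no  (filtered by the judge)
    judge_only  : oracle = no,  judge = hit (caught by the judge)
    neither     : oracle = no,  judge = no
    """
    total = both_hit + oracle_only + judge_only + neither
    oracle_hits = both_hit + oracle_only
    oracle_misses = judge_only + neither
    return {
        "total": total,
        "agreement": (both_hit + neither) / total,
        "override_rate": (oracle_only + judge_only) / total,
        "filtered_share": oracle_only / oracle_hits,   # share of oracle hits the judge rejected
        "caught_share": judge_only / oracle_misses,    # share of oracle misses the judge flagged
    }


if __name__ == "__main__":
    summary = override_summary(473, 5257, 215, 15979)
    for key, value in summary.items():
        print(key, round(value, 4))
```

Note that agreement and override rate sum to one by construction, so either figure determines the other.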
Per oracle kind (n; agreement; overrides; filtered FP; caught FN):
| Oracle kind | n | Agreement | Overrides | Filtered FP | Caught FN |
|---|---|---|---|---|---|
| `prompt_leak` | 3,060 | 0.376 | 1,909 | 1,909 | 0 |
| `contains` | 7,632 | 0.703 | 2,264 | 2,155 | 109 |
| `cross_turn_contradiction` | 756 | 0.799 | 152 | 152 | 0 |
| `output_pattern` | 7,452 | 0.863 | 1,023 | 919 | 104 |
| `secret_fragment` | 1,512 | 0.925 | 114 | 114 | 0 |
| `tool_called` | 1,512 | 0.993 | 10 | 8 | 2 |
Grouped by judge value: on semantic oracles (`prompt_leak` / `contains` / `output_pattern` / `secret_fragment`, n = 19,656) agreement is 0.730 with 5,097 filtered FPs; on high-precision side-effect oracles (`tool_called` + `cross_turn_contradiction`, the redundant group, n = 2,268) agreement is 0.929 with only 162 overrides. The judge adds by far the most on `prompt_leak` (agreement 0.376; 1,909/1,917 oracle hits filtered) and is nearly redundant on `tool_called` (agreement 0.993).
3.3 RQ2 Part B — Precision/recall/F1 against the independent reference
Sample = 200 judged trials (stratified: 100 disagreements, 60 positive agreements, 40 negative agreements) labeled by Claude Opus 4.8; 29 reference positives. Label cost ≈ $1.62.
| Predictor vs reference | Precision | Recall | F1 | Accuracy | TP / FP / FN / TN |
|---|---|---|---|---|---|
| Static oracle | 0.166 | 0.897 | 0.280 | 0.330 | 26 / 131 / 3 / 40 |
| LLM judge | 0.429 | 0.931 | 0.587 | 0.810 | 27 / 36 / 2 / 135 |
The judge dominates the oracle on this hard slice: higher precision (0.429 vs 0.166), higher recall (0.931 vs 0.897), ~2.1× the F1 (0.587 vs 0.280), and 2.5× the accuracy (0.810 vs 0.330). Crucially, the judge's gain does not come at the cost of recall — it catches more true breaches than the raw oracle while discarding most of the oracle's false alarms (FP 36 vs 131).
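The four metrics in the table are standard confusion-matrix quantities and can be recomputed from the TP / FP / FN / TN column. A minimal sketch follows; the `Confusion` class is a name chosen for this example, and undefined ratios return `None`, mirroring the dashes used in the per-kind table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Confusion:
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def precision(self) -> Optional[float]:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else None

    @property
    def recall(self) -> Optional[float]:
        positives = self.tp + self.fn
        return self.tp / positives if positives else None

    @property
    def f1(self) -> Optional[float]:
        # Equivalent to the harmonic mean of precision and recall.
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else None

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)


if __name__ == "__main__":
    for name, cm in {"oracle": Confusion(26, 131, 3, 40), "judge": Confusion(27, 36, 2, 135)}.items():
        print(name, round(cm.precision, 3), round(cm.recall, 3), round(cm.f1, 3), round(cm.accuracy, 3))
```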
Per oracle kind (Part B), the judge's advantage is concentrated exactly where Part A predicted:
| Oracle kind | n | Oracle P / R / F1 | Judge P / R / F1 |
|---|---|---|---|
| `prompt_leak` | 54 | 0.020 / 1.00 / 0.040 | 0.500 / 1.00 / 0.667 |
| `contains` | 68 | 0.096 / 0.714 / 0.170 | 0.318 / 1.00 / 0.483 |
| `output_pattern` | 55 | 0.436 / 0.944 / 0.597 | 0.667 / 0.889 / 0.762 |
| `tool_called` | 16 | 0.200 / 1.00 / 0.333 | 0.200 / 1.00 / 0.333 |
| `secret_fragment` | 6 | 0 / — / 0 (acc 0.833) | 0 / — / 0 (acc 1.000) |
| `cross_turn_contradiction` | 1 | — (acc 0.0) | — (acc 1.000) |
The judge turns the near-useless prompt_leak oracle (precision 0.020) into a usable predictor (precision 0.500) and roughly triples contains precision, while on the high-precision tool_called oracle it is identical to the oracle — confirming redundancy there. (Small-n kinds such as tool_called, secret_fragment, and cross_turn_contradiction carry wide uncertainty and are reported for completeness.)
4. Interpretation
On variation (RQ1). For this agent, coherent variation policies are interchangeable within noise, and the low-cost representational refactor (tuple-curated) is a free win: it preserves behavior exactly while making each attack carry first-class structural provenance. Adding combinatorial breadth within coherence (compat) did not measurably help. Removing coherence (naive) hurt, and the harm was significant against the hardened defense — the regime where attack quality matters most. The headline surprise is the llm arm: model-generated rephrasings, far from being stronger domain-tailored attacks, were the least effective of all, significantly below even the uncurated cartesian, and produced zero breaches against the hardened prompt. One plausible mechanism is that a safety-aligned generator softens or "sanitizes" the malicious intent when rewriting, or drifts toward fluent-but-benign phrasings that the agent handles easily — whereas the deterministic obfuscators preserve the payload and stress the parser/guardrail directly. Whatever the cause, on this target the empirical verdict is clear: paying an LLM to generate attacks was counterproductive relative to cheap deterministic recombination.
On judging (RQ2). The static oracle over-fires massively (Part A: 91.75% of its hits are filtered by the judge), and Part B shows this filtering is correct, not lenient: scored against an independent labeler the judge is strictly better than the oracle on precision, recall, F1, and accuracy. The value is kind-dependent — decisive on semantic oracles (prompt_leak, contains), and redundant on high-precision side-effect oracles (tool_called, secret_fragment), exactly the split the design assumes. This justifies the judge-only success rule: a raw-oracle ASR would overstate breaches by roughly an order of magnitude on semantic checks.
The combined message. The two findings point in opposite directions about where an LLM belongs in this pipeline. As an attack generator it degraded results; as a verdict judge it is indispensable and independently corroborated. "Add an LLM" is not uniformly good — placement matters.
5. Limitations
- Single agent, single SUT, single judge. All ASR figures and the LLM-generator result are for one example agent brain (DeepSeek v4 Flash) with one judge (GLM 5.2). The direction and magnitude of the `llm`-arm deficit may not transfer to other targets, generators, domains, or a live (non-frozen) generation loop.
- The judge is itself an LLM. RQ2 evaluates one LLM (GLM) against another (Claude); it does not establish absolute correctness, only that the judge tracks an independent model far better than the oracle does.
- The reference labeler is itself fallible. Part B's "ground truth" is a single independent model (Claude Opus 4.8), not human adjudication; it can err, and its errors bound how precisely the judge and oracle can be scored.
- Part B is a hard-slice stress test, not a population rate. The 200-trial sample is stratified toward disagreements (100 of 200), so the absolute precision numbers are conservative worst-case figures on the informative cases — not the scanner's field false-positive rate, which would be far lower given that 15,979/21,924 trials are negative agreements.
- Small-N recall. Positives are rare: only 29 reference positives in Part B, and several per-kind cells (`tool_called` n = 16, `secret_fragment` n = 6, `cross_turn_contradiction` n = 1) are too small for stable precision/recall, so treat per-kind Part-B figures as indicative.
- Residual judge error in ASR. Because success is judge-decided, the RQ1 ASRs inherit the judge's residual error; they are best read as relative comparisons across arms (which share the same judge) rather than absolute breach probabilities.
- Adaptive probes excluded. 11 adaptive probes were held out (they cannot be frozen or held identical across arms), so the study covers non-adaptive attacks only.
- Harness placeholder in logs. The in-process seam records the eval-model field as `mockllm/model`; the real SUT brain is DeepSeek v4 Flash. This is a logging artifact, not a change to what was tested.
6. Reproducibility
Artifacts and code (all paths relative to the repository root):
- Result files. `reports/novamart_variation_live_default.json`, `reports/novamart_variation_live_hardened.json` (per-arm pooled ASR, per-probe detail, replay coverage stats, cost/plan); `reports/oracle_vs_judge.json`, `reports/oracle_vs_judge.md` (Part A + Part B); `reports/oracle_vs_judge_sample.jsonl` (the 200-trial stratified sample).
- Frozen LLM pack. `reports/novamart_llm_pack.jsonl` (138 payloads × 12 GLM rewrites).
- Policies and analysis modules. `benchmarks/variation_strategies/strategies.py` (the four static policies; `tuple-curated` pinned byte-identical to `curated`); `benchmarks/variation_strategies/novamart_live.py` (the 5-arm live runner, per-arm Wilson CIs, noise floor, CI-distinguishable-pair test); `benchmarks/variation_strategies/live.py` (replay/generation honesty accounting); `benchmarks/oracle_vs_judge/` (`analyze.py`, `metrics.py`, `labeler.py`, `extract.py`); `src/probe_engine/scoring/statistics.py` (`wilson_ci`, z = 1.96).
Commands (offline dry runs need no key; live runs read the API key from the environment only):
# 1. Pre-generate + freeze the LLM attack pack (once; opt-in, costs money):
PE_LLM_KEY=... python harness/novamart/pregen.py --batch 40 --n-variants 12 \
--out reports/novamart_llm_pack.jsonl
# 2. 5-arm variation comparison, per defense variant (dry run without --yes):
PE_LLM_KEY=... uv run python -m benchmarks.variation_strategies.novamart_live \
--variant default --seeds 2 --budget 12 --epochs 3 --yes \
--llm-pack reports/novamart_llm_pack.jsonl \
--out reports/novamart_variation_live_default.json
PE_LLM_KEY=... uv run python -m benchmarks.variation_strategies.novamart_live \
--variant hardened --seeds 2 --budget 12 --epochs 3 --yes \
--llm-pack reports/novamart_llm_pack.jsonl \
--out reports/novamart_variation_live_hardened.json
# 3a. Oracle-vs-judge Part A (offline, over the run's .eval logs):
uv run python -m benchmarks.oracle_vs_judge.analyze --logs 'logs/*.eval'
# 3b. Part B independent reference labeling (opt-in; independent labeler family required):
PE_LLM_KEY=... uv run python -m benchmarks.oracle_vs_judge.analyze --logs 'logs/*.eval' \
--part-b --yes --labeler-model openrouter/anthropic/claude-opus-4.8
7. Conclusion
On an example e-commerce customer-support agent, the choice of attack-variation policy is not statistically significant among coherent static policies (their 95% Wilson intervals overlap on both defense variants, within a byte-identical noise floor), the representational refactor is behavior-neutral and free, and coherence — not breadth — is what matters, since the uncurated cartesian underperformed significantly on the hardened defense. Against expectation, model-generated attack variation was the worst arm on both variants (default 2.44%, hardened 0.00%, distinguishable from every static arm), so an LLM as an attack generator was counterproductive here. Conversely, an LLM as a judge is essential: the static oracle over-fires (91.75% of its hits filtered), and against an independent reference the judge is decisively more accurate (F1 0.587 vs 0.280; accuracy 0.81 vs 0.33) at equal-or-better recall, validating its filtering as correct rather than lenient. The same technology helps or hurts depending on where it is placed — a result that should be replicated across agents and judges before it is generalized.