Visual regression
Visual regression catches when something looks wrong, not just when something is wrong. A button moved 3 pixels off, a color swap on the primary CTA, the search bar disappearing on Firefox — these don’t fail a “find the button and click it” check, but they’re real bugs.
This page is about how Marriska does that check, why we chose that approach, and what it costs you.
The two ends of the spectrum
Visual testing has historically pulled in two directions, and both hurt:
Pure pixel-diff (Percy, Chromatic-style classic)
Compare every pixel to a baseline. If more than a negligible number of pixels changed, fail the test.
Why this loses: browsers update their font renderers, antialiasing algorithms shift between Chromium versions, and sub-pixel rounding differs between OSes. None of these are bugs, but all of them trip a strict pixel-diff. You end up either bumping baselines weekly or living with red builds you’ve trained yourself to ignore.
Pure LLM-vision compare (the “ask the model” approach)
Send both screenshots to a vision LLM and ask “did anything important change?”
Why this is expensive: vision tokens cost real money, and the overwhelming majority of step screenshots haven’t changed at all between runs. Burning a vision-model call on every step of every test gets pricey at scale, and the latency adds up.
What we do instead
We use a three-stage cascade. Each stage answers if it can; otherwise it punts to the next:
1. Byte-identical short-circuit
If the new screenshot is exactly the same bytes as the baseline, pass without doing anything else. Common when nothing on the page actually changed.
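A minimal sketch of the short-circuit, assuming both screenshots are already loaded as raw file bytes. The function name `is_byte_identical` is hypothetical and not part of Marriska's API.

```python
def is_byte_identical(baseline: bytes, candidate: bytes) -> bool:
    """Stage 1: True when the new screenshot is exactly the same bytes."""
    # Comparing lengths first avoids scanning the buffers when sizes differ.
    if len(baseline) != len(candidate):
        return False
    return baseline == candidate


if __name__ == "__main__":
    old = b"\x89PNG fake image payload"
    print(is_byte_identical(old, bytes(old)))    # same bytes
    print(is_byte_identical(old, old + b"\x00"))  # one extra byte
```

The check is cheap enough to run on every step, which is why it sits in front of the other two stages.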
2. Pixel-diff fast path
Compute the percentage of pixels that differ. If it’s under the
threshold (default 0.5%, configurable via VISUAL_PIXEL_THRESHOLD),
pass. This catches the antialiasing-and-rendering-noise case without
spending an LLM call.
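The pixel stage can be sketched roughly like this, assuming both screenshots have been decoded into same-sized raw RGBA buffers. The helper names, the treatment of a size change as a full diff, and the strict "under the threshold" comparison are assumptions for illustration, not Marriska's actual implementation.

```python
def diff_fraction(baseline: bytes, candidate: bytes, bytes_per_pixel: int = 4) -> float:
    """Fraction (0.0 to 1.0) of pixels that differ between two raw buffers."""
    if len(baseline) != len(candidate):
        # Assumption: a changed image size counts as every pixel differing.
        return 1.0
    if len(baseline) % bytes_per_pixel != 0:
        raise ValueError("buffer length is not a whole number of pixels")
    total = len(baseline) // bytes_per_pixel
    if total == 0:
        return 0.0
    changed = sum(
        baseline[i:i + bytes_per_pixel] != candidate[i:i + bytes_per_pixel]
        for i in range(0, len(baseline), bytes_per_pixel)
    )
    return changed / total


def passes_pixel_stage(fraction: float, threshold: float = 0.005) -> bool:
    """Stage 2: pass when the diff is under the threshold (0.005 is 0.5%)."""
    return fraction < threshold


if __name__ == "__main__":
    base = bytes(4 * 1000)   # 1000 fully transparent black pixels
    noisy = bytearray(base)
    noisy[0] = 255           # touch a single pixel
    fraction = diff_fraction(base, bytes(noisy))
    print(fraction, passes_pixel_stage(fraction))
```

A real implementation would likely use an image library to decode PNGs first; the counting logic stays the same.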
3. Vision-LLM judgement
If the diff exceeds the threshold, send both screenshots to the visual-comparison model with a tight system prompt. The prompt asks specifically for meaningful regressions (layout shifts, missing elements, color/style changes, broken components) and explicitly tells the model to ignore dynamic content (timestamps, badges, counters) and rendering noise.
The LLM responds with PASS: or FAIL: followed by a one-sentence
explanation. That sentence becomes the analysis attached to the step
result, so when something fails you see why, not just that it
did.
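As an illustration of the reply format described above, here is a small parser that splits a PASS:/FAIL: reply into a verdict and its explanation. `Verdict` and `parse_verdict` are hypothetical names, and rejecting any other reply with an error is an assumption about how malformed output might be handled.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    passed: bool
    analysis: str  # the one-sentence explanation shown on the step result


def parse_verdict(reply: str) -> Verdict:
    """Split a 'PASS: ...' or 'FAIL: ...' reply into verdict and explanation."""
    text = reply.strip()
    for prefix, passed in (("PASS:", True), ("FAIL:", False)):
        if text.upper().startswith(prefix):
            return Verdict(passed, text[len(prefix):].strip())
    # Assumption: anything else is treated as an error, never as a pass.
    raise ValueError(f"unrecognised verdict: {text!r}")


if __name__ == "__main__":
    print(parse_verdict("FAIL: The search bar is missing from the header."))
```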
If the LLM call itself fails (network, no API key configured), the step fails with the pixel diff percentage as the explanation — better than silently passing.
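Putting the three stages and the failure fallback together, a sketch might look like the following. The `judge` callable stands in for the vision-model call, the diff fraction is assumed to be computed beforehand, and every name here is hypothetical.

```python
from typing import Callable, Tuple

# Hypothetical judge signature: takes both screenshots, returns (passed, analysis).
Judge = Callable[[bytes, bytes], Tuple[bool, str]]


def compare_step(
    baseline: bytes,
    candidate: bytes,
    diff_fraction: float,
    judge: Judge,
    threshold: float = 0.005,
) -> Tuple[bool, str]:
    """Run the cascade and return (passed, analysis) for one step."""
    # Stage 1: identical bytes pass immediately.
    if baseline == candidate:
        return True, "byte-identical to baseline"
    # Stage 2: small diffs pass without a model call.
    if diff_fraction < threshold:
        return True, f"pixel diff {diff_fraction:.2%} is under the threshold"
    # Stage 3: ask the model; if the call itself fails, fail the step
    # and report the pixel diff rather than silently passing.
    try:
        return judge(baseline, candidate)
    except Exception as exc:
        return False, f"visual model unavailable ({exc}); pixel diff was {diff_fraction:.2%}"


if __name__ == "__main__":
    def offline_judge(a: bytes, b: bytes) -> Tuple[bool, str]:
        raise RuntimeError("no API key configured")

    print(compare_step(b"old", b"new", 0.031, offline_judge))
```

Failing closed keeps a misconfigured key from turning every visual check green.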
What counts as a “step” for visual regression
Any step that uses the screenshot (or takeScreenshot) action with
a baseline saved on the test runs through the visual pipeline. Other
step types capture a screenshot too, but those are for the report
view — they don’t get compared.
To opt a step into visual regression:
- Add a screenshot: step in your test.
- Run the test once with visual_mode=update_baseline. This saves the captured screenshot as the baseline.
- Subsequent runs in the default visual_mode=compare mode go through the cascade above.
Baselines are keyed by step ID + browser, not by step number.
Reordering steps or inserting new ones doesn’t invalidate baselines —
the IDs survive drag-and-drop and edits. See
Step actions reference for the
screenshot shape.
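To see why keying by step ID and browser survives reordering, consider this in-memory sketch. `BaselineStore` and the example step IDs are made up for illustration; the real storage layer is not described on this page.

```python
from typing import Dict, List, Optional, Tuple

BaselineKey = Tuple[str, str]  # (step_id, browser); illustrative shape only


class BaselineStore:
    """In-memory stand-in for wherever baselines are actually stored."""

    def __init__(self) -> None:
        self._images: Dict[BaselineKey, bytes] = {}

    def save(self, step_id: str, browser: str, image: bytes) -> None:
        self._images[(step_id, browser)] = image

    def load(self, step_id: str, browser: str) -> Optional[bytes]:
        return self._images.get((step_id, browser))


store = BaselineStore()
store.save("step-login", "chromium", b"login-chromium")

# Inserting a step shifts positions, but the IDs stay the same.
steps: List[str] = ["step-login", "step-checkout"]
steps.insert(0, "step-cookie-banner")

found = store.load("step-login", "chromium")   # still resolves after the insert
missing = store.load("step-login", "firefox")  # None: each browser has its own baseline
```

Had the key been the step's position, the insert above would have pointed the login step at the wrong baseline.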
What you pay (and how to pay less)
Visual compares count against your monthly visual compares quota (20 on Free, 100 on Starter, 500 on Pro, 2,000 on Team; see Plan tier limits). The byte and pixel stages don’t count; only the LLM call does.
If the diff threshold is too tight for your test, raise
visual_pixel_threshold (it’s a fraction between 0.0 and 1.0, so a
0.5% threshold is written as 0.005). A slightly higher threshold means
more passes from the pixel stage without an LLM call.
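A rough way to reason about the trade-off: given the diff fractions you typically see in a run, count how many steps would reach the LLM stage at a given threshold and compare that with your plan's quota. The quota numbers come from this page; the function names, the sample diffs, and the choice to send a diff exactly at the threshold to the LLM are assumptions.

```python
from typing import Dict, Iterable

# Monthly visual compare quotas as listed on this page.
PLAN_QUOTAS: Dict[str, int] = {"Free": 20, "Starter": 100, "Pro": 500, "Team": 2000}


def llm_calls_needed(diff_fractions: Iterable[float], threshold: float) -> int:
    """Count steps whose diff is not under the threshold (assumes threshold > 0)."""
    return sum(1 for fraction in diff_fractions if fraction >= threshold)


def fits_quota(calls_per_run: int, runs_per_month: int, plan: str) -> bool:
    """True when the month's LLM-stage compares stay within the plan quota."""
    return calls_per_run * runs_per_month <= PLAN_QUOTAS[plan]


if __name__ == "__main__":
    sample_diffs = [0.0, 0.002, 0.004, 0.007, 0.012, 0.2]  # made-up example run
    for threshold in (0.005, 0.01):
        calls = llm_calls_needed(sample_diffs, threshold)
        print(threshold, calls, fits_quota(calls, 30, "Starter"))
```

Raising the threshold too far hides real regressions from the pixel stage, so it is worth checking a few known-bad screenshots before settling on a value.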
What we don’t do (yet)
- Per-element baselines independently of the step they live on: baselines are step-scoped. If you want N elements compared independently, write N screenshot steps.
- Auto-baseline-update on intentional changes: today, intentional changes need a manual update_baseline run.
- Cross-browser baseline sharing: each browser stores its own baseline, because Chromium / Firefox / WebKit render real differences (especially in form controls and SVG).
Related
- Step actions reference: the screenshot action shape
- BYOK concept: connecting your own OpenAI key so visual compares don’t count against your Marriska quota
- Plan tier limits: per-month visual compare caps