Visual regression

Visual regression catches when something looks wrong, not just when something is wrong. A button moved 3 pixels off, a color swap on the primary CTA, the search bar disappearing on Firefox — these don’t fail a “find the button and click it” check, but they’re real bugs.

This page is about how Marriska does that check, why we chose that approach, and what it costs you.

The two ends of the spectrum

Visual testing has historically pulled in two directions, and both hurt:

Pure pixel-diff (Percy, Chromatic-style classic)

Compare every pixel to a baseline. If more than a trivial number of pixels changed, fail the test.

Why this loses: browsers update their font renderers, antialiasing algorithms shift between Chromium versions, and operating systems disagree on sub-pixel rounding. None of these are bugs, but all of them trip a strict pixel-diff. You end up either bumping baselines weekly or living with red builds you’ve trained yourself to ignore.

Pure LLM-vision compare (the “ask the model” approach)

Send both screenshots to a vision LLM and ask “did anything important change?”

Why this is expensive: vision tokens cost real money, and the overwhelming majority of step screenshots haven’t changed at all between runs. Burning a vision-model call on every step of every test gets pricey at scale, and the latency adds up.

What we do instead

We use a three-stage cascade. Each stage answers if it can; otherwise it hands off to the next:

1. Byte-identical short-circuit

If the new screenshot is exactly the same bytes as the baseline, pass without doing anything else. Common when nothing on the page actually changed.

2. Pixel-diff fast path

Compute the percentage of pixels that differ. If it’s under the threshold (default 0.5%, configurable via VISUAL_PIXEL_THRESHOLD), pass. This catches the antialiasing-and-rendering-noise case without spending an LLM call.
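As a rough illustration of this stage, the sketch below counts differing pixels and compares the fraction to the threshold. It is not Marriska's implementation: the function names are hypothetical, pixels are modelled as plain RGB tuples, and treating a size mismatch as a 100% difference is an assumption.

```python
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]  # one RGB pixel


def diff_fraction(baseline: Sequence[Pixel], current: Sequence[Pixel]) -> float:
    """Return the fraction (0.0 to 1.0) of pixels that differ."""
    if len(baseline) != len(current):
        # Assumption: a changed screenshot size counts as a full difference.
        return 1.0
    if not baseline:
        return 0.0
    changed = sum(1 for a, b in zip(baseline, current) if a != b)
    return changed / len(baseline)


def passes_pixel_stage(
    baseline: Sequence[Pixel],
    current: Sequence[Pixel],
    threshold: float = 0.005,  # the 0.5% default, written as a fraction
) -> bool:
    """Pass only when the diff is strictly under the threshold."""
    return diff_fraction(baseline, current) < threshold
```

For scale, a 1280x720 screenshot has 921,600 pixels, so a 0.5% threshold tolerates up to about 4,608 changed pixels before the step moves on to the next stage.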

3. Vision-LLM judgement

If the diff exceeds the threshold, send both screenshots to the visual-comparison model with a tight system prompt. The prompt asks specifically for meaningful regressions — layout shifts, missing elements, color/style changes, broken components — and explicitly tells the model to ignore dynamic content (timestamps, badges, counters) and rendering noise.

The LLM responds with PASS: or FAIL: followed by a one-sentence explanation. That sentence becomes the analysis attached to the step result, so when something fails you see why, not just that it did.

If the LLM call itself fails (network, no API key configured), the step fails with the pixel diff percentage as the explanation — better than silently passing.
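The cascade as a whole can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pipeline: `StepResult`, `compare`, `parse_verdict`, and the injected `pixel_diff` and `judge` callables are hypothetical names, and treating an unparseable model reply as a failure is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class StepResult:
    passed: bool
    stage: str      # "bytes", "pixels", or "llm"
    analysis: str   # explanation attached to the step result


def parse_verdict(reply: str) -> Tuple[bool, str]:
    """Split a 'PASS: ...' or 'FAIL: ...' reply into (passed, explanation)."""
    text = reply.strip()
    for prefix, passed in (("PASS:", True), ("FAIL:", False)):
        if text.startswith(prefix):
            return passed, text[len(prefix):].strip()
    raise ValueError(f"unexpected verdict format: {text[:40]!r}")


def compare(
    baseline_png: bytes,
    current_png: bytes,
    pixel_diff: Callable[[bytes, bytes], float],  # returns a fraction, 0.0 to 1.0
    judge: Callable[[bytes, bytes], str],         # stands in for the vision-LLM call
    threshold: float = 0.005,
) -> StepResult:
    # Stage 1: byte-identical short-circuit.
    if baseline_png == current_png:
        return StepResult(True, "bytes", "byte-identical to baseline")

    # Stage 2: pixel-diff fast path.
    fraction = pixel_diff(baseline_png, current_png)
    if fraction < threshold:
        return StepResult(True, "pixels", f"{fraction:.2%} of pixels differ")

    # Stage 3: vision-LLM judgement. Any error fails the step and reports
    # the pixel diff instead of silently passing.
    try:
        passed, explanation = parse_verdict(judge(baseline_png, current_png))
    except Exception as exc:
        return StepResult(
            False, "llm", f"LLM compare failed ({exc}); {fraction:.2%} of pixels differ"
        )
    return StepResult(passed, "llm", explanation)
```

Passing `pixel_diff` and `judge` in as callables keeps the sketch free of image and network libraries, and it means the expensive stages only run when an earlier stage could not answer.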

What counts as a “step” for visual regression

Any step that uses the screenshot (or takeScreenshot) action and has a baseline saved on the test goes through the visual pipeline. Other step types capture a screenshot too, but those are only for the report view and are not compared.

To opt a step into visual regression:

1. Add a screenshot: step in your test.
2. Run the test once with visual_mode=update_baseline. This saves the captured screenshot as the baseline.
3. Subsequent runs in the default visual_mode=compare mode go through the cascade above.

Baselines are keyed by step ID + browser, not by step number. Reordering steps or inserting new ones doesn’t invalidate baselines — the IDs survive drag-and-drop and edits. See Step actions reference for the screenshot shape.
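To make the keying concrete, here is a minimal in-memory sketch of a baseline store keyed by (step ID, browser), together with the two visual modes. `BaselineStore`, `run_screenshot_step`, and the returned status strings are hypothetical; only the `visual_mode` values come from this page.

```python
from typing import Dict, Optional, Tuple


class BaselineStore:
    """Baselines keyed by (step_id, browser), never by step position."""

    def __init__(self) -> None:
        self._images: Dict[Tuple[str, str], bytes] = {}

    def save(self, step_id: str, browser: str, png: bytes) -> None:
        self._images[(step_id, browser)] = png

    def get(self, step_id: str, browser: str) -> Optional[bytes]:
        return self._images.get((step_id, browser))


def run_screenshot_step(
    store: BaselineStore,
    step_id: str,
    browser: str,
    png: bytes,
    visual_mode: str = "compare",
) -> str:
    if visual_mode == "update_baseline":
        # Save the captured screenshot as the new baseline.
        store.save(step_id, browser, png)
        return "baseline_saved"
    baseline = store.get(step_id, browser)
    if baseline is None:
        # Assumption: with no baseline saved, there is nothing to compare.
        return "no_baseline"
    # A real runner would hand (baseline, png) to the compare cascade here.
    return "compared"
```

Because the key never includes the step's position, moving a step from third to first in a test still finds the same baseline, while the same step in another browser looks up a separate one.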

What you pay (and how to pay less)

Visual compares count against your monthly visual compares quota (20 on Free, 100 on Starter, 500 on Pro, 2,000 on Team — see Plan tier limits). The byte and pixel stages don’t count; only the LLM call does.

If the diff threshold is too tight for your test, raise visual_pixel_threshold. It is a fraction between 0.0 and 1.0, so the 0.5% default corresponds to 0.005. A slightly higher threshold means more passes from the pixel stage without an LLM call.
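A back-of-the-envelope way to see the effect of the threshold on quota is sketched below. The plan caps are the ones listed above; the function names and the sample diff fractions are made up for illustration, and counting a diff exactly equal to the threshold as an LLM call is an assumption.

```python
from typing import Iterable

# Monthly visual compare caps per plan, as listed on this page.
PLAN_QUOTAS = {"Free": 20, "Starter": 100, "Pro": 500, "Team": 2000}


def llm_compares(diff_fractions: Iterable[float], threshold: float = 0.005) -> int:
    """Count the steps whose diff is not under the threshold.

    Only these reach the LLM stage, so only these count against the quota.
    """
    return sum(1 for d in diff_fractions if d >= threshold)


def remaining_quota(plan: str, used: int) -> int:
    """Compares left this month, floored at zero."""
    return max(PLAN_QUOTAS[plan] - used, 0)


# Hypothetical diff fractions for four screenshot steps in one run.
diffs = [0.0, 0.002, 0.007, 0.04]
default_calls = llm_compares(diffs)                  # 0.007 and 0.04 reach the LLM
relaxed_calls = llm_compares(diffs, threshold=0.01)  # only 0.04 reaches the LLM
```

With these sample numbers, raising the threshold from 0.005 to 0.01 halves the LLM calls for the run. The trade-off is that a real regression touching fewer than 1% of pixels would then pass at the pixel stage without being judged.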

What we don’t do (yet)

Per-element baselines that are independent of the step they live on: baselines are step-scoped. If you want N elements compared independently, write N screenshot steps.
Auto-baseline-update on intentional changes: today, intentional changes need a manual update_baseline run.
Cross-browser baseline sharing: each browser stores its own baseline, because Chromium, Firefox, and WebKit genuinely render differently (especially form controls and SVG).

See also

Step actions reference: the screenshot action shape
BYOK concept: connecting your own OpenAI key so visual compares don’t count against your Marriska quota
Plan tier limits: per-month visual compare caps