The AI Runtime Field Lab

Retail | Intermediate | Eval | Featured | Draft, pending editorial review

Count the Milk: A Reality-Gap Eval Harness for Vision Inventory Systems

Build an evaluation harness that measures how a vision-based inventory counter degrades from controlled to messy real-world conditions, and gates deployment on consistency under variance, not demo accuracy.

The failure behind this brief

What happened

Starbucks terminated its AI inventory counting program across all North American stores nine months after a chain-wide launch. The image-recognition system, built by NomadGo, frequently miscounted and mislabeled stock, confusing milk types or failing to detect items entirely, and stores reverted to manual counting. An internal memo reviewed by Reuters read: "Effective immediately, Automated Counting will be discontinued."

Root cause read

A vision model that performed well under controlled conditions was deployed at scale into thousands of variable retail environments, where lighting, placement, lookalike packaging, and human workflow did not match the training distribution. Instead of catching shortages, the system added a new layer of inaccuracy to the supply chain it was meant to fix.

Engineering lesson

Eval-production skew at chain scale: the gap between accuracy in a demo environment and consistency under real-world variance was never measured before full deployment. A staged rollout with a shadow-mode comparison against manual ground truth would likely have surfaced the gap at ten stores instead of all of them.

  • A chain-wide AI inventory system was terminated nine months after launch because it frequently miscounted and mislabeled items, including confusing milk types and failing to detect a syrup bottle in the company's own launch video.
  • Reporting described the system introducing a new layer of inaccuracy into the supply chain it was designed to fix, compounding the stockouts it was meant to solve.
  • The replacement was a return to manual counting plus more frequent restocking, meaning the AI deployment delivered negative value at full scale.
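
The shadow-mode comparison named in the engineering lesson can be pictured as logging the model's counts next to the manual counts, without letting them drive orders, and reviewing the disagreement rate per store before widening the rollout. The sketch below is illustrative only: the record layout, store IDs, SKU names, and counts are hypothetical and do not come from the reporting.

```python
from collections import defaultdict

def shadow_disagreement(records):
    """Per-store fraction of logged counts where the model differs from manual.

    Each record is a tuple: (store_id, sku, manual_count, model_count).
    The layout is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for store_id, _sku, manual_count, model_count in records:
        totals[store_id] += 1
        if manual_count != model_count:
            misses[store_id] += 1
    return {store: misses[store] / totals[store] for store in totals}

# Hypothetical shadow-mode log from a two-store pilot.
records = [
    ("store_01", "whole_milk", 12, 12),
    ("store_01", "oat_milk", 6, 9),       # model overcounts oat milk
    ("store_02", "whole_milk", 8, 8),
    ("store_02", "vanilla_syrup", 3, 3),
]
rates = shadow_disagreement(records)
print(rates)
```

A pilot gate could then refuse to expand beyond the pilot stores while any store's disagreement rate stays above an agreed limit.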

Sources Source 1

Editorial note: Reuters reporting and the company statement are the primary sources; attach the Reuters link in the editorial verification pass before publish.

Why this matters

A vision system that counts perfectly under demo conditions and unreliably under store conditions is worse than manual counting, because its errors arrive with confidence and propagate into ordering decisions. The missing artifact is not a better model; it is an eval that measures the degradation before deployment does.

Persona

Team deploying vision-based counting, shelf-audit, or inventory systems into physical environments

Current manual workflow

The model is validated on clean benchmark images, piloted briefly under favorable conditions, and rolled out broadly; degradation under real conditions is discovered through downstream stockouts months later.

The AI workflow to build

The harness evaluates any counting model against a paired image set: a controlled baseline and matched messy variants covering lighting shifts, partial occlusion, lookalike packaging shelved adjacently, cluttered placement, and varied camera angles. It reports accuracy per condition, the degradation curve from clean to messy, per-class confusion showing which lookalikes get merged, and repeated-run consistency per scene, then issues a deploy or hold verdict against a stated consistency threshold.
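
The per-condition accuracy and the clean-to-messy degradation described above might be computed along these lines. This is a minimal sketch, assuming that "accuracy" means the share of scenes whose predicted count exactly matches ground truth; the brief does not fix a metric, and the condition names and result records are hypothetical.

```python
from collections import defaultdict

def accuracy_by_condition(results):
    """Exact-match accuracy per condition.

    results: list of dicts with 'condition', 'truth', and 'predicted' keys.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r["condition"]] += 1
        hits[r["condition"]] += int(r["truth"] == r["predicted"])
    return {c: hits[c] / totals[c] for c in totals}

def degradation(acc, baseline="clean"):
    """Accuracy drop of each messy condition relative to the baseline."""
    return {c: acc[baseline] - a for c, a in acc.items() if c != baseline}

# Hypothetical results: two scenes per condition.
results = [
    {"condition": "clean", "truth": 10, "predicted": 10},
    {"condition": "clean", "truth": 7, "predicted": 7},
    {"condition": "dim_lighting", "truth": 10, "predicted": 10},
    {"condition": "dim_lighting", "truth": 7, "predicted": 5},
    {"condition": "occlusion", "truth": 10, "predicted": 8},
    {"condition": "occlusion", "truth": 7, "predicted": 4},
]
acc = accuracy_by_condition(results)
drops = degradation(acc)
print(acc, drops)
```

Sorting `drops` from smallest to largest gives one simple form of the degradation curve; a tolerance-based metric (for example, within one unit) would be an equally defensible choice.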

Inputs

  • a fixture set of 30 shelf scenes in matched clean and messy variants with ground-truth counts, including at least three lookalike product pairs
  • a counting model to evaluate (any off-the-shelf vision model or API)
  • a consistency threshold configuration
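
One possible shape for these inputs is sketched below. The class names, field names, and threshold defaults are assumptions made for illustration; only the 30-scene count, the three lookalike pairs, and the five repeated runs come from this brief.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    scene_id: str    # shared by the clean and messy variants of one shelf
    condition: str   # "clean", or a messy label such as "dim_lighting"
    truth: dict      # ground-truth count per SKU, e.g. {"oat_milk": 4}

@dataclass
class GateConfig:
    min_condition_accuracy: float = 0.90  # placeholder, not a recommendation
    min_consistency: float = 0.90         # placeholder, not a recommendation
    runs_per_scene: int = 5

def validate_fixture(scenes, lookalike_pairs, min_scenes=30, min_pairs=3):
    """Return a list of problems; an empty list means the fixture is usable."""
    conditions_by_scene = {}
    for scene in scenes:
        conditions_by_scene.setdefault(scene.scene_id, set()).add(scene.condition)
    problems = []
    if len(conditions_by_scene) < min_scenes:
        problems.append(f"only {len(conditions_by_scene)} scenes, need {min_scenes}")
    for scene_id, conditions in sorted(conditions_by_scene.items()):
        if "clean" not in conditions:
            problems.append(f"{scene_id}: missing clean baseline")
        if not conditions - {"clean"}:
            problems.append(f"{scene_id}: missing messy variant")
    if len(lookalike_pairs) < min_pairs:
        problems.append(f"only {len(lookalike_pairs)} lookalike pairs, need {min_pairs}")
    return problems

# Hypothetical fixture: 30 scenes, each with a clean and a dim-lighting variant.
pairs = [("oat_milk", "whole_milk"), ("vanilla_syrup", "caramel_syrup"),
         ("2pct_milk", "skim_milk")]
scenes = []
for i in range(30):
    truth = {"oat_milk": 4, "whole_milk": 6}
    scenes.append(Scene(f"scene_{i:02d}", "clean", truth))
    scenes.append(Scene(f"scene_{i:02d}", "dim_lighting", truth))
problems = validate_fixture(scenes, pairs)
print(problems)
```

Validating the fixture before any model is run keeps a thin or unpaired image set from producing a misleadingly clean report.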

Outputs

  • a per-condition accuracy report with the degradation curve
  • a lookalike confusion matrix
  • repeated-run consistency per scene
  • a deploy or hold verdict with the failing conditions named
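
The lookalike confusion matrix could be tallied as below. The sketch assumes each ground-truth item has already been matched to the label the model assigned it (or to `None` when the item was missed); that matching step, the SKU names, and the counts are hypothetical.

```python
from collections import Counter

def confusion_counts(matched):
    """Count (true_label, predicted_label) pairs; predicted None means missed."""
    return Counter(matched)

def top_confusions(matrix, n=3):
    """Most frequent off-diagonal entries, i.e. the worst mislabels and misses."""
    off_diagonal = {pair: c for pair, c in matrix.items() if pair[0] != pair[1]}
    # str() keeps the tie-break sortable when a predicted label is None.
    return sorted(off_diagonal.items(), key=lambda kv: (-kv[1], str(kv[0])))[:n]

# Hypothetical matched detections from the messy variants.
matched = (
    [("oat_milk", "whole_milk")] * 3      # oat milk read as whole milk
    + [("oat_milk", "oat_milk")] * 1
    + [("whole_milk", "whole_milk")] * 4
    + [("vanilla_syrup", None)] * 1       # bottle not detected at all
)
matrix = confusion_counts(matched)
worst = top_confusions(matrix)
print(worst)
```

Listing only the off-diagonal entries keeps the report short enough for a reader who wants to know which products get merged, not the full matrix.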

Definition of done

Against a fixture where the model is known to degrade on lookalikes and occlusion, the harness quantifies the clean-to-messy gap, names the two failing conditions, reports consistency across five repeated runs per scene, and issues a hold verdict; against a deliberately easy fixture it issues a deploy verdict, demonstrating that the gate discriminates rather than always blocking.
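
The discrimination requirement can be checked with a gate as small as the one below. It is a sketch under stated assumptions: the 0.90 thresholds and most per-condition accuracies are hypothetical placeholders, and a real harness would read them from the threshold configuration.

```python
def verdict(acc_by_condition, consistency, min_accuracy=0.90, min_consistency=0.90):
    """Return a deploy/hold decision and name the conditions that failed."""
    failing = sorted(c for c, a in acc_by_condition.items() if a < min_accuracy)
    consistency_ok = consistency >= min_consistency
    decision = "deploy" if not failing and consistency_ok else "hold"
    return {
        "verdict": decision,
        "failing_conditions": failing,
        "consistency_ok": consistency_ok,
    }

# Hypothetical hard fixture: the model degrades on lookalikes and occlusion.
hard = verdict(
    {"clean": 0.96, "lighting": 0.93, "lookalike": 0.55, "occlusion": 0.62},
    consistency=0.60,
)
# Hypothetical easy fixture: every condition clears the bar.
easy = verdict(
    {"clean": 0.98, "lighting": 0.97, "lookalike": 0.95, "occlusion": 0.96},
    consistency=1.0,
)
print(hard)
print(easy)
```

Running both fixtures through the same function is what shows the gate discriminates: the first call returns hold with two named conditions, the second returns deploy.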

Example input

A counting model that scores 96 percent on the clean set, evaluated against the messy set containing oat milk and whole milk cartons shelved side by side under dim lighting.

Example output

Report: clean 96 percent, messy 71 percent (a 25-point drop); degradation concentrated in the lookalike condition (oat and whole milk merged in 8 of 10 scenes) and in occlusion; consistency 60 percent across repeated runs on affected scenes; verdict hold, with the two failing conditions named.
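
The brief does not define how repeated-run consistency is scored. One plausible reading, assumed here, is the share of runs on a scene that agree with that scene's most common output; the run values below are hypothetical.

```python
from collections import Counter

def run_consistency(run_outputs):
    """Share of repeated runs that agree with the most common output."""
    if not run_outputs:
        raise ValueError("need at least one run")
    most_common_count = Counter(run_outputs).most_common(1)[0][1]
    return most_common_count / len(run_outputs)

# Five hypothetical runs on one affected scene: three agree, two differ.
affected = run_consistency([12, 12, 12, 9, 10])
# Five hypothetical runs on a stable scene.
stable = run_consistency([7, 7, 7, 7, 7])
print(affected, stable)  # 0.6 1.0
```

Under this definition a scene scores 60 percent when three of five runs agree. A stricter alternative would score a scene as consistent only when all runs are identical.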

Data plan

Synthetic data

Boundaries and non-goals

  • training a better counting model
  • real store deployment
  • camera or hardware integration

Evaluation ideas

  • gap quantification accuracy on a model with seeded degradation
  • verdict discrimination between the easy and hard fixtures
  • report clarity scored against a rubric: can a non-ML operations lead act on it
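
For the first idea, a stand-in model with a known, seeded error rate per condition gives the harness something it can be checked against. The sketch below is an assumption about how that could look; the function names, error rates, and trial count are hypothetical.

```python
import random

def make_seeded_model(error_rates, seed=0):
    """Build a fake counter that miscounts at a known rate per condition."""
    rng = random.Random(seed)  # private generator keeps runs reproducible

    def model(truth_count, condition):
        if rng.random() < error_rates.get(condition, 0.0):
            return truth_count + 1  # deliberate miscount
        return truth_count

    return model

def measured_error_rate(model, condition, trials=2000, truth_count=10):
    """Fraction of trials in which the model's count differs from truth."""
    wrong = sum(model(truth_count, condition) != truth_count for _ in range(trials))
    return wrong / trials

seeded = {"clean": 0.0, "lookalike": 0.30}
model = make_seeded_model(seeded, seed=42)
measured = {c: measured_error_rate(model, c) for c in seeded}
# The harness passes this check if it recovers each seeded rate within tolerance.
recovered = all(abs(measured[c] - seeded[c]) < 0.05 for c in seeded)
print(measured, recovered)
```

Because the true gap is known by construction, any difference between the seeded and reported degradation is an error in the harness, not in the model.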

Run Level target

R3 Reliable. Plain translation: handles real cases.

Scope envelope

Buildable by one solo builder in 20 to 30 focused hours, on public, synthetic, or sanitized data, with a demo path that requires no production access.

Suggested tools

  • any vision model or API as the system under test
  • any image augmentation library for generating messy variants

Suggested options, never requirements; briefs are tool-agnostic.
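
To show what a messy variant means without committing to any library, the sketch below dims and partially occludes a tiny grayscale image held as nested lists. It is a toy under stated assumptions: the function names and pixel values are hypothetical, and a real fixture would more likely use an augmentation library on actual photographs.

```python
def dim(image, factor=0.5):
    """Simulate dim lighting by scaling every pixel value down."""
    return [[int(px * factor) for px in row] for row in image]

def occlude(image, top, left, height, width, fill=0):
    """Simulate partial occlusion by blanking a rectangle of pixels."""
    out = [row[:] for row in image]  # copy so the clean baseline is untouched
    for y in range(top, min(top + height, len(out))):
        for x in range(left, min(left + width, len(out[y]))):
            out[y][x] = fill
    return out

# A 4x4 clean "image" with uniform brightness 200 (grayscale, 0 to 255).
clean = [[200] * 4 for _ in range(4)]
dimmed = dim(clean)
blocked = occlude(clean, top=0, left=0, height=2, width=2)
print(dimmed[0], blocked[0], blocked[3])
```

Each messy variant is derived from the same clean scene, so the ground-truth count carries over and the clean and messy images stay paired.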

Product thesis questions

  • What consistency threshold should gate a physical-world vision deployment?
  • Would a shadow-mode comparison against manual counts have caught this at pilot scale instead of chain scale?
