Count the Milk: A Reality-Gap Eval Harness for Vision Inventory Systems
Build an evaluation harness that measures how a vision-based inventory counter degrades from controlled to messy real-world conditions, and gates deployment on consistency under variance, not demo accuracy.
The failure behind this brief
Starbucks terminated its AI inventory counting program across all North American stores nine months after a chain-wide launch. The image-recognition system, built by NomadGo, frequently miscounted and mislabeled stock, confusing milk types or failing to detect items entirely, and stores reverted to manual counting. An internal memo reviewed by Reuters read: "Effective immediately, Automated Counting will be discontinued."
A vision model that performed under controlled conditions was deployed at scale into thousands of variable retail environments, where lighting, placement, lookalike packaging, and human workflow did not match the training distribution. Instead of catching shortages, the system added a new layer of inaccuracy to the supply chain it was meant to fix.
Eval-production skew at chain scale: the gap between accuracy in a demo environment and consistency under real-world variance was never measured before full deployment. A staged rollout with a shadow-mode comparison against manual ground truth would have surfaced the gap at ten stores instead of all of them.
- A chain-wide AI inventory system was terminated nine months after launch because it frequently miscounted and mislabeled items, including confusing milk types and failing to detect a syrup bottle in the company's own launch video.
- Reporting described the system introducing a new layer of inaccuracy into the supply chain it was designed to fix, compounding the stockouts it was meant to solve.
- The replacement was a return to manual counting plus more frequent restocking, meaning the AI deployment delivered negative value at full scale.
Sources
Editorial note: Reuters reporting and the company statement are the primary sources; attach the Reuters link in the editorial verification pass before publish.
Why this matters
A vision system that counts perfectly under demo conditions and unreliably under store conditions is worse than manual counting, because its errors arrive with confidence and propagate into ordering decisions. The missing artifact is not a better model; it is an eval that measures the degradation before deployment does.
Persona
Team deploying vision-based counting, shelf-audit, or inventory systems into physical environments
Current manual workflow
The model is validated on clean benchmark images, piloted briefly under favorable conditions, and rolled out broadly; degradation under real conditions is discovered through downstream stockouts months later.
The AI workflow to build
The harness evaluates any counting model against a paired image set: a controlled baseline and matched messy variants covering lighting shifts, partial occlusion, lookalike packaging shelved adjacently, cluttered placement, and varied camera angles. It reports accuracy per condition, the degradation curve from clean to messy, per-class confusion showing which lookalikes get merged, and repeated-run consistency per scene, then issues a deploy or hold verdict against a stated consistency threshold.
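The per-condition accuracy and the clean-to-messy degradation curve described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the condition labels, the tuple layout, and the choice of exact-match counting as the accuracy measure are all assumptions, and the sample rows are made up for demonstration.

```python
from collections import defaultdict

def per_condition_accuracy(results):
    """results: iterable of (condition, predicted_count, true_count) tuples.
    Returns {condition: fraction of scenes counted exactly right}.
    Exact-match scoring is an assumption; a tolerance band would also work."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, predicted, truth in results:
        totals[condition] += 1
        if predicted == truth:
            hits[condition] += 1
    return {c: hits[c] / totals[c] for c in totals}

def degradation_curve(accuracy, baseline="clean"):
    """Accuracy drop of each messy condition relative to the baseline,
    sorted from the largest drop to the smallest."""
    base = accuracy[baseline]
    drops = {c: base - a for c, a in accuracy.items() if c != baseline}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results: (condition, predicted count, ground-truth count).
results = [
    ("clean", 12, 12), ("clean", 8, 8),
    ("dim_lighting", 12, 12), ("dim_lighting", 8, 7),
    ("occlusion", 12, 9), ("occlusion", 8, 6),
]
accuracy = per_condition_accuracy(results)
curve = degradation_curve(accuracy)
for condition, drop in curve:
    print(f"{condition}: accuracy {accuracy[condition]:.2f}, drop {drop:.2f}")
```

Sorting by drop puts the worst condition first, which is the ordering an operations reader is likely to want in the report.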
Inputs
- a fixture set of 30 shelf scenes in matched clean and messy variants with ground-truth counts, including at least three lookalike product pairs
- a counting model to evaluate (any off-the-shelf vision model or API)
- a consistency threshold configuration
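One possible shape for these inputs is sketched below. The `Scene` and `GateConfig` names, their fields, and every default threshold value are hypothetical placeholders chosen for illustration; only the five repeated runs and the minimum of three lookalike pairs come from this brief.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scene:
    scene_id: str
    condition: str      # e.g. "clean", "dim_lighting", "occlusion" (assumed labels)
    pair_id: str        # links a messy variant to its clean baseline
    true_counts: dict   # product class -> ground-truth count

@dataclass(frozen=True)
class GateConfig:
    min_consistency: float = 0.95         # placeholder value, not a recommendation
    max_clean_to_messy_gap: float = 0.10  # placeholder value, not a recommendation
    repeated_runs: int = 5                # the brief asks for five runs per scene

    def __post_init__(self):
        for name in ("min_consistency", "max_clean_to_messy_gap"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within [0, 1], got {value}")
        if self.repeated_runs < 2:
            raise ValueError("repeated_runs must be at least 2 to measure consistency")

def fixture_problems(scenes, lookalike_pairs, min_lookalike_pairs=3):
    """Return a list of human-readable problems with the fixture set."""
    problems = []
    clean_pairs = {s.pair_id for s in scenes if s.condition == "clean"}
    for s in scenes:
        if s.condition != "clean" and s.pair_id not in clean_pairs:
            problems.append(f"{s.scene_id}: no clean baseline for pair {s.pair_id!r}")
    if len(lookalike_pairs) < min_lookalike_pairs:
        problems.append(
            f"only {len(lookalike_pairs)} lookalike pairs, need {min_lookalike_pairs}"
        )
    return problems

scenes = [
    Scene("s01_clean", "clean", "s01", {"oat_milk": 4, "whole_milk": 6}),
    Scene("s01_dim", "dim_lighting", "s01", {"oat_milk": 4, "whole_milk": 6}),
    Scene("s02_occluded", "occlusion", "s02", {"vanilla_syrup": 3}),  # orphan variant
]
problems = fixture_problems(scenes, [("oat_milk", "whole_milk")])
config = GateConfig()
```

Validating the fixture before any model runs keeps a malformed scene set from being mistaken for model degradation.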
Outputs
- a per-condition accuracy report with the degradation curve
- a lookalike confusion matrix
- repeated-run consistency per scene
- a deploy or hold verdict with the failing conditions named
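The lookalike confusion matrix in particular lends itself to a short sketch. The one below assumes each detected item has already been matched to a ground-truth label, which is a simplification, and the product names and detections are invented for the example.

```python
from collections import Counter

def confusion_counts(pairs):
    """pairs: iterable of (true_label, predicted_label), one per matched item.
    Returns a Counter keyed by (true_label, predicted_label)."""
    return Counter(pairs)

def lookalike_confusions(matrix, lookalike_pairs):
    """For each declared lookalike pair, count how often one product was
    labeled as the other, in either direction."""
    return {
        (a, b): matrix[(a, b)] + matrix[(b, a)]
        for a, b in lookalike_pairs
    }

# Hypothetical matched detections: (true label, predicted label).
detections = [
    ("oat_milk", "oat_milk"),
    ("oat_milk", "whole_milk"),
    ("whole_milk", "oat_milk"),
    ("whole_milk", "whole_milk"),
    ("vanilla_syrup", "vanilla_syrup"),
]
matrix = confusion_counts(detections)
merged = lookalike_confusions(matrix, [("oat_milk", "whole_milk")])
print(merged)
```

Reporting confusions per declared pair, rather than the full matrix, keeps the output readable for someone who only needs to know which products the model cannot tell apart.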
Definition of done
Against a fixture where the model is known to degrade on lookalikes and occlusion, the harness quantifies the clean-to-messy gap, names the two failing conditions, reports consistency across five repeated runs per scene, and issues a hold verdict. Against a deliberately easy fixture it issues a deploy verdict, demonstrating that the gate discriminates rather than always blocking.
Example input: a counting model that scores 96 percent on the clean set, evaluated against the messy set containing oat milk and whole milk cartons shelved adjacently under dim lighting.
Example output: a report showing clean accuracy of 96 percent, messy accuracy of 71 percent, degradation concentrated in the lookalike condition (oat and whole milk merged in 8 of 10 scenes) and the occlusion condition, consistency of 60 percent across repeated runs on affected scenes, and a hold verdict with the two failing conditions named.
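The brief does not define how repeated-run consistency is scored, so the sketch below picks one plausible definition as an assumption: the fraction of runs on a scene that agree with the most common count. The scene names and run outputs are hypothetical.

```python
from collections import Counter

def scene_consistency(run_counts):
    """Fraction of runs that agree with the most common count for one scene.
    This modal-agreement definition is an assumption, not the brief's."""
    if not run_counts:
        raise ValueError("need at least one run")
    modal_frequency = Counter(run_counts).most_common(1)[0][1]
    return modal_frequency / len(run_counts)

def consistency_by_scene(runs):
    """runs: {scene_id: [count from run 1, count from run 2, ...]}."""
    return {scene_id: scene_consistency(counts) for scene_id, counts in runs.items()}

# Five repeated runs per scene, as the definition of done requires.
runs = {
    "shelf_01_clean": [12, 12, 12, 12, 12],
    "shelf_01_lookalike": [12, 10, 12, 11, 12],
}
consistency = consistency_by_scene(runs)
print(consistency)
```

Note that this measure is independent of ground truth: a model can be consistently wrong, which is why consistency is reported alongside accuracy rather than in place of it.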
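Plugging the example's figures into a gate makes the arithmetic explicit: 96 percent minus 71 percent is a 25-point clean-to-messy gap, and 60 percent consistency falls below any strict threshold. In the sketch below the `gate` function, the report layout, and the 0.95 threshold are assumptions; the accuracy and consistency figures are the ones quoted in the example.

```python
def gate(report, min_consistency):
    """Issue a deploy or hold verdict from a summary report.
    Holds whenever any condition falls below the consistency threshold."""
    gap = round(report["clean_accuracy"] - report["messy_accuracy"], 4)
    failing = sorted(
        condition
        for condition, value in report["consistency_by_condition"].items()
        if value < min_consistency
    )
    return {
        "clean_to_messy_gap": gap,
        "failing_conditions": failing,
        "verdict": "hold" if failing else "deploy",
    }

example_report = {
    "clean_accuracy": 0.96,
    "messy_accuracy": 0.71,
    # The example gives 60 percent consistency on the affected scenes.
    "consistency_by_condition": {"lookalike": 0.60, "occlusion": 0.60},
}
result = gate(example_report, min_consistency=0.95)  # threshold is a placeholder
print(result)
```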
Data plan
synthetic data
Boundaries and non-goals
- training a better counting model
- real store deployment
- camera or hardware integration
Evaluation ideas
- gap quantification accuracy on a model with seeded degradation
- verdict discrimination between the easy and hard fixtures
- report clarity scored against a rubric: can a non-ML operations lead act on it
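The first two ideas can be checked with a stand-in model whose degradation is seeded on purpose. In this sketch the stub reads the ground-truth count only so that its failures are known in advance; a real system under test would see images, not labels. All names, fixtures, and the 0.9 threshold are hypothetical.

```python
def seeded_model(scene):
    """Stand-in counter with seeded degradation: exact on most conditions,
    undercounts by one under lookalike and occlusion conditions."""
    if scene["condition"] in {"lookalike", "occlusion"}:
        return max(scene["true_count"] - 1, 0)
    return scene["true_count"]

def evaluate(model, fixture, min_accuracy):
    """Return (verdict, failing_conditions) for a model on a fixture."""
    totals, hits = {}, {}
    for scene in fixture:
        condition = scene["condition"]
        totals[condition] = totals.get(condition, 0) + 1
        hits[condition] = hits.get(condition, 0) + (model(scene) == scene["true_count"])
    failing = sorted(c for c in totals if hits[c] / totals[c] < min_accuracy)
    return ("hold" if failing else "deploy"), failing

easy_fixture = [{"condition": "clean", "true_count": n} for n in (4, 8, 12)]
hard_fixture = easy_fixture + [
    {"condition": "lookalike", "true_count": 6},
    {"condition": "occlusion", "true_count": 9},
    {"condition": "dim_lighting", "true_count": 5},
]

easy_verdict = evaluate(seeded_model, easy_fixture, min_accuracy=0.9)
hard_verdict = evaluate(seeded_model, hard_fixture, min_accuracy=0.9)
print(easy_verdict, hard_verdict)
```

If the harness returns hold on both fixtures or deploy on both, the gate is not discriminating, and the fault lies in the harness rather than the model.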
Run Level target
R3 Reliable. Plain translation: handles real cases.
Scope envelope
Buildable by one solo builder in 20 to 30 focused hours, on public, synthetic, or sanitized data, with a demo path that requires no production access.
Suggested tools
- any vision model or API as the system under test
- any image augmentation library for generating messy variants
Suggested options, never requirements; briefs are tool-agnostic.
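To show what generating messy variants involves without committing to any particular library, here is a dependency-free toy sketch that treats a grayscale image as a nested list of pixel values. The function names, the dimming range, and the patch size are assumptions; a real build would more likely use an image augmentation library on actual photos.

```python
import random

def dim(image, factor):
    """Scale every pixel toward black; factor 1.0 leaves the image unchanged."""
    return [[int(pixel * factor) for pixel in row] for row in image]

def occlude(image, top, left, height, width, fill=0):
    """Cover a rectangular patch with a flat fill value, clipped to the image."""
    out = [row[:] for row in image]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out

def messy_variants(image, seed=0):
    """Produce matched variants of one clean scene. A fixed seed keeps the
    fixture reproducible between harness runs."""
    rng = random.Random(seed)
    rows, cols = len(image), len(image[0])
    return {
        "clean": image,
        "dim_lighting": dim(image, rng.uniform(0.3, 0.6)),
        "occlusion": occlude(
            image, rng.randrange(rows), rng.randrange(cols), rows // 2, cols // 2
        ),
    }

clean = [[200] * 8 for _ in range(6)]  # toy 6x8 grayscale "shelf photo"
variants = messy_variants(clean, seed=42)
```

Because each variant is derived from the same clean scene, the ground-truth counts carry over unchanged, which is what makes the clean and messy sets a matched pair.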
Product thesis questions
- What consistency threshold should gate a physical-world vision deployment?
- Would a shadow-mode comparison against manual counts have caught this at pilot scale instead of chain scale?
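For the second question, a shadow-mode comparison reduces to logging the model's count next to the manual count for the same item and tracking the mismatch rate per store. The sketch below is one way to frame that; the record fields, store names, counts, and the 0.2 limit are all hypothetical.

```python
def shadow_mismatch_rates(records, tolerance=0):
    """records: iterable of dicts with store_id, sku, model_count, manual_count.
    Returns {store_id: fraction of checks where the model disagreed with the
    manual count by more than the tolerance}."""
    by_store = {}
    for rec in records:
        stats = by_store.setdefault(rec["store_id"], {"checked": 0, "mismatched": 0})
        stats["checked"] += 1
        if abs(rec["model_count"] - rec["manual_count"]) > tolerance:
            stats["mismatched"] += 1
    return {store: s["mismatched"] / s["checked"] for store, s in by_store.items()}

def stores_over_limit(mismatch_rates, max_rate):
    """Stores whose shadow-mode mismatch rate exceeds the allowed limit."""
    return sorted(store for store, rate in mismatch_rates.items() if rate > max_rate)

# Hypothetical pilot log: manual counts remain the system of record.
records = [
    {"store_id": "store_A", "sku": "oat_milk", "model_count": 3, "manual_count": 5},
    {"store_id": "store_A", "sku": "whole_milk", "model_count": 6, "manual_count": 6},
    {"store_id": "store_B", "sku": "oat_milk", "model_count": 4, "manual_count": 4},
    {"store_id": "store_B", "sku": "whole_milk", "model_count": 7, "manual_count": 7},
]
rates = shadow_mismatch_rates(records)
flagged = stores_over_limit(rates, max_rate=0.2)
print(rates, flagged)
```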