Test robot policies before field time.

Compare your policy against earlier checkpoints, another team, or a vendor runner on the same captured task pack.

Realistic humanoid robot moving a tote in a captured facility task

Policy rank

Winner
Vendor B: 1st
Team v4: 2nd
Team v3: 3rd

1. Capture site: A real task pack.

2. Compare policies: Your team, vendors, or checkpoints.

3. Pick next test: Pilot, tune, recapture, or hold.

Same task. Same robot. Clear comparison.

Compare your own policy versions or policies submitted by other teams under one captured site/task envelope before using robot time.

100 episodes
500 episodes
Own or vendor policies

Failure: What broke?

OOD: What changed?

Site ops: Who gets field time?

Compare your own checkpoints

Run current, previous, and candidate policies against the same captured task envelope before spending scarce robot time.
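As a rough illustration of what comparing checkpoints on one task pack could look like, the sketch below ranks policies by success rate over the same set of episodes. The policy names, the episode outcomes, and the `rank_policies` helper are all hypothetical; this is not Blueprint's API or scoring method.

```python
from typing import Dict, List, Tuple


def rank_policies(results: Dict[str, List[bool]]) -> List[Tuple[str, float]]:
    """Rank policies by success rate over the same episodes (hypothetical helper)."""
    lengths = {len(outcomes) for outcomes in results.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("every policy must be scored on the same non-empty episode set")
    rates = {name: sum(outcomes) / len(outcomes) for name, outcomes in results.items()}
    # Highest success rate first; ties broken by name for a stable order.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Made-up outcomes for illustration only: True = episode succeeded.
example_results = {
    "candidate": [True, True, False, True],
    "current": [True, False, False, True],
    "previous": [False, False, True, False],
}
ranking = rank_policies(example_results)
for place, (name, rate) in enumerate(ranking, start=1):
    print(f"{place}. {name}: {rate:.0%} success")
```

Requiring every policy to be scored on the same episode set is the point of the comparison: a ranking only means something when the task and scope are held fixed.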

Compare teams or vendors

Give site ops one evidence packet for policies submitted by internal teams, integrators, or vendors under the same task and threshold scope.

Decide the next test

Use ranking, failure clusters, OOD flags, and missing-proof labels to choose a pilot, tune, recapture, or hold path.
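One way to picture this step is as a small rule table. The sketch below is purely illustrative: the field names, the thresholds, and the mapping from signals to pilot, tune, recapture, or hold are assumptions, not Blueprint's decision logic.

```python
from dataclasses import dataclass


@dataclass
class EvidencePacket:
    """Hypothetical summary of one policy's evaluation on a task pack."""
    success_rate: float             # fraction of episodes that succeeded
    ood_flag_rate: float            # fraction of episodes flagged out-of-distribution
    largest_failure_cluster: float  # share of failures in the biggest cluster
    missing_proof: bool             # True if required evidence is absent


def next_test(packet: EvidencePacket) -> str:
    """Map evidence to a next step. All thresholds are illustrative assumptions."""
    if packet.missing_proof:
        return "hold"       # assumed rule: do not act without the required evidence
    if packet.ood_flag_rate > 0.2:
        return "recapture"  # assumed rule: many OOD flags suggest the capture is stale
    if packet.success_rate < 0.8 or packet.largest_failure_cluster > 0.5:
        return "tune"       # assumed rule: weak or concentrated failures call for tuning
    return "pilot"


print(next_test(EvidencePacket(0.9, 0.05, 0.3, False)))
```

In practice the thresholds and the order of the checks would be set per site and task; the sketch only shows how the four signals can feed one explicit decision.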

See the clips.

Generated first-person POV clips make policy failures easier to review across factory, warehouse, industrial, and home-task variants. They are review media, not real-world proof.

First-person humanoid robot POV lifting a blue tote from a warehouse shelf
First-person humanoid robot POV sorting small metal parts on a factory conveyor
First-person humanoid robot POV moving a carton at a loading dock
First-person humanoid robot POV pulling a frosted crate from a cold-storage shelf
First-person humanoid robot POV operating a guarded industrial machine station
First-person humanoid robot POV inspecting a small component at a QA bench
First-person humanoid robot POV sorting parts between bins in a packing cell
First-person humanoid robot POV restocking packaged goods in a backroom aisle
First-person humanoid robot POV loading dishes into a dishwasher
First-person humanoid robot POV folding a towel in a laundry room
First-person humanoid robot POV checking an industrial aisle route marker

Why now.

Recent work on world-model evaluation makes the ranking workflow credible enough to use as a decision aid. The proof boundary still matters: a ranking holds only within the evaluation scope that was actually measured.

Boundary: Blueprint uses policy-evaluation research as category evidence for ranking and diagnostic workflows. It does not turn a virtual score into a universal accuracy guarantee or a rank-fidelity result outside the measured evaluation scope.