Capture site
A real task pack.
Compare your policy against earlier checkpoints, another team, or a vendor runner on the same captured task pack.

Policy rank
Your team, vendors, or checkpoints.
Winner
Pilot, tune, recapture, or hold.
Compare your own policy versions or policies submitted by other teams under one captured site/task envelope before using robot time.
What broke?
What changed?
Who gets field time?
Run current, previous, and candidate policies against the same captured task envelope before spending scarce robot time.
Give site ops one evidence packet for policies submitted by internal teams, integrators, or vendors under the same task and threshold scope.
Use ranking, failure clusters, OOD flags, and missing-proof labels to choose a pilot, tune, recapture, or hold path.
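As a rough illustration of how ranking, OOD flags, and missing-proof labels could feed a pilot, tune, recapture, or hold choice, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `PolicyResult` fields, the `rank_policies` and `choose_path` helpers, the thresholds, the rule order, and the sample numbers. None of it comes from the product itself.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyResult:
    """Hypothetical per-policy summary for one captured task pack."""
    name: str
    success_rate: float                      # fraction of tasks passed, 0..1
    failure_clusters: list = field(default_factory=list)
    ood_flag: bool = False                   # inputs fell outside the captured envelope
    missing_proof: list = field(default_factory=list)


def rank_policies(results):
    """Order policies from highest to lowest success rate."""
    return sorted(results, key=lambda r: r.success_rate, reverse=True)


def choose_path(result, pilot_threshold=0.8, tune_threshold=0.5):
    """Map one result to a next step. Thresholds and rule order are assumed."""
    if result.ood_flag:
        return "recapture"    # the score is not trusted outside the capture
    if result.missing_proof:
        return "hold"         # evidence is incomplete, so no field time yet
    if result.success_rate >= pilot_threshold:
        return "pilot"
    if result.success_rate >= tune_threshold:
        return "tune"
    return "hold"


# Made-up example data, not measured results.
results = [
    PolicyResult("candidate", 0.86),
    PolicyResult("current", 0.72, failure_clusters=["grasp slip"]),
    PolicyResult("vendor_a", 0.90, ood_flag=True),
]

for r in rank_policies(results):
    print(r.name, r.success_rate, choose_path(r))
```

In this sketch the top-ranked policy does not automatically get field time: an OOD flag sends it back to recapture, so the ranking is one input to the decision and not the whole decision.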
Generated first-person POV clips make policy failures easier to review across factory, warehouse, industrial, and home-task variants. They are review media, not real-world proof.
Evaluate

Recent world-model evaluation research makes the ranking workflow credible enough to use as a decision aid, but a virtual ranking is still not real-world proof.
0.929 closed-loop Pearson correlation
Reported across seven real-world VLA policies, with failure-mode reproduction for fine-grained diagnostic comparison.
RoboArena policy-evaluation proxy
Reports strong correlation between virtual OSCAR policy evaluation and real-world evaluation on RoboArena.
Boundary: Blueprint uses policy-evaluation research as category evidence for ranking and diagnostic workflows. It does not turn a virtual score into a universal accuracy guarantee or a rank-fidelity result outside the measured evaluation scope.
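To make the difference between a correlation score and rank fidelity concrete, the sketch below computes the Pearson correlation between virtual and real scores and, separately, a Spearman rank correlation over the same pairs. It uses the standard textbook definitions; the cited research may compute its figures differently. The function names and sample numbers are hypothetical placeholders, not reported results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder scores for five hypothetical policies (not measured data).
virtual_scores = [0.20, 0.40, 0.50, 0.70, 0.90]
real_scores = [0.10, 0.50, 0.40, 0.60, 0.80]

print("Pearson: ", round(pearson(virtual_scores, real_scores), 3))
print("Spearman:", round(spearman(virtual_scores, real_scores), 3))
```

In the placeholder data, two policies swap places between the virtual and real orderings even though the scores track each other closely. A high Pearson value can coexist with local rank swaps, which is why rank fidelity has to be checked within the measured scope instead of being assumed from the correlation alone.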