The problem
Real-world data is the bottleneck.
The Vista Standard
Built for the data your models actually need.
Body & hand pose
38 keypoints via MediaPipe
Every dataset will include full skeletal tracking: 38 keypoints per frame with per-keypoint confidence scores, exported as normalized coordinates compatible with the LeRobot schema.
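As a rough illustration of what a normalized keypoint record could look like, here is a minimal sketch. The field names and schema are hypothetical, not Vista's published format; the point is that pixel coordinates are divided by frame dimensions so they stay resolution-independent.

```python
from dataclasses import dataclass

@dataclass
class Keypoint:
    """One tracked keypoint in one frame (hypothetical schema)."""
    name: str          # e.g. "left_wrist"
    x: float           # normalized to [0, 1] by frame width
    y: float           # normalized to [0, 1] by frame height
    confidence: float  # per-keypoint confidence score

def normalize(px: float, py: float, width: int, height: int,
              name: str, conf: float) -> Keypoint:
    """Convert pixel coordinates into resolution-independent [0, 1] coords."""
    return Keypoint(name=name, x=px / width, y=py / height, confidence=conf)

kp = normalize(960.0, 270.0, width=1920, height=1080,
               name="left_wrist", conf=0.98)
# kp.x == 0.5, kp.y == 0.25 regardless of source resolution
```

A downstream loader can then rescale these coordinates to any target resolution without knowing the original camera.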
Object segmentation
SAM2-powered instance masks
Pixel-level instance masks will be generated for every manipulated object, designed to stay consistent through occlusions and hand-object interactions across the full clip.
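Per-frame binary masks are often stored compactly with run-length encoding. The sketch below is illustrative only; it is not SAM2's output format or Vista's storage scheme, just the general technique for compressing a flattened mask.

```python
def rle_encode(mask: list[int]) -> list[tuple[int, int]]:
    """Run-length encode a flattened binary mask as (value, run) pairs."""
    runs: list[tuple[int, int]] = []
    for v in mask:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

def rle_decode(runs: list[tuple[int, int]]) -> list[int]:
    """Inverse of rle_encode: expand (value, run) pairs back to pixels."""
    return [v for v, n in runs for _ in range(n)]

mask = [0, 0, 1, 1, 1, 0]
assert rle_encode(mask) == [(0, 2), (1, 3), (0, 1)]
assert rle_decode(rle_encode(mask)) == mask
```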
Action labeling
Timestamped + natural language
Every atomic action will be tagged with start/end timestamps and a structured natural language label, built to align with standard robotics action taxonomies.
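A timestamped action label can be sketched as a small record plus a sanity check that atomic actions do not overlap. The structure and label strings below are assumptions for illustration, not Vista's actual taxonomy.

```python
from dataclasses import dataclass

@dataclass
class ActionLabel:
    start_s: float  # clip-relative start timestamp, seconds
    end_s: float    # clip-relative end timestamp, seconds
    text: str       # structured natural-language label, e.g. "grasp mug handle"

def validate_timeline(labels: list[ActionLabel]) -> bool:
    """Atomic actions must have positive duration and must not overlap."""
    if not all(a.start_s < a.end_s for a in labels):
        return False
    ordered = sorted(labels, key=lambda a: a.start_s)
    return all(prev.end_s <= cur.start_s
               for prev, cur in zip(ordered, ordered[1:]))

labels = [ActionLabel(0.0, 1.2, "reach toward mug"),
          ActionLabel(1.2, 2.5, "grasp mug handle")]
assert validate_timeline(labels)
```

Keeping segments non-overlapping makes it straightforward to align labels with frame indices at a known fps.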
LeRobot-ready
HuggingFace format native
Datasets will ship as LeRobot-compatible HDF5 archives — designed to load directly into your training loop with no conversion scripts and no format wrangling.
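To give a feel for what an episode-indexed archive layout might look like, here is a minimal sketch of HDF5-style group paths. This layout is a guess for illustration; the actual Vista/LeRobot schema may group datasets differently.

```python
def episode_paths(num_episodes: int) -> list[str]:
    """Enumerate hypothetical HDF5 dataset paths, one group per episode."""
    paths: list[str] = []
    for i in range(num_episodes):
        root = f"/episode_{i}"
        paths += [
            f"{root}/observations/frames",     # raw or encoded video frames
            f"{root}/observations/keypoints",  # 38 keypoints per frame
            f"{root}/observations/masks",      # instance segmentation masks
            f"{root}/actions/labels",          # timestamped action labels
        ]
    return paths

assert episode_paths(2)[0] == "/episode_0/observations/frames"
assert len(episode_paths(2)) == 8
```

With a fixed, enumerable layout like this, a training loop can open the archive and iterate episodes without any per-dataset conversion code.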
Pipeline
From raw footage to training-ready data.
Upload
Contributors submit raw footage via a secure portal
Validation
Automated technical checks: resolution, frame rate, duration, format
Frame extraction
FFmpeg splits video into annotatable frames
Pose estimation
MediaPipe extracts 38 body & hand keypoints
Segmentation
SAM2 generates instance masks on all objects
LeRobot packaging
HDF5 export, indexed and ready to train
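The validation stage of the pipeline above can be sketched as simple metadata checks. The thresholds and accepted containers here are illustrative assumptions, not Vista's actual acceptance criteria.

```python
from dataclasses import dataclass

@dataclass
class ClipMeta:
    width: int
    height: int
    fps: float
    duration_s: float
    container: str  # e.g. "mp4"

# Hypothetical acceptance thresholds -- the real criteria are not stated above.
MIN_WIDTH, MIN_HEIGHT = 1280, 720
MIN_FPS = 24.0
MIN_DURATION_S = 5.0
ALLOWED_CONTAINERS = {"mp4", "mov"}

def validate(meta: ClipMeta) -> list[str]:
    """Return the list of failed checks; an empty list means the clip passes."""
    errors: list[str] = []
    if meta.width < MIN_WIDTH or meta.height < MIN_HEIGHT:
        errors.append("resolution")
    if meta.fps < MIN_FPS:
        errors.append("fps")
    if meta.duration_s < MIN_DURATION_S:
        errors.append("duration")
    if meta.container not in ALLOWED_CONTAINERS:
        errors.append("format")
    return errors

assert validate(ClipMeta(1920, 1080, 30.0, 42.0, "mp4")) == []
assert validate(ClipMeta(640, 480, 15.0, 2.0, "avi")) == \
    ["resolution", "fps", "duration", "format"]
```

Rejecting clips before frame extraction keeps the downstream pose and segmentation stages from spending compute on unusable footage.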
Early access
Get early access to Vista datasets.
We're onboarding our first research partners. Tell us about your use case.
Film your daily life. Get paid.
Join the Vista contributor network. Upload egocentric videos from your daily life and earn per validated clip.