Datasets for training robotic manipulation policies.

Real first-person human demonstrations, annotated end to end, packaged for the policy architectures your team already trains.

Figure: Real egocentric footage of a bimanual cooking task in a home kitchen, the kind of first-person demonstration an EgoVista dataset is built from.

The manipulation policy data bottleneck.

Training a robust manipulation policy is bottlenecked by data more often than by architecture. The architecture side has matured fast: Diffusion Policy, Action Chunking with Transformers (ACT), VQ-BeT, and the family of vision-language-action (VLA) models have all converged on a working recipe within the last two years. The data side has not moved at the same pace.

Public datasets exist, but they often share a few weaknesses: limited diversity in environment and contributor, missing annotation layers that the modern policies consume (hand pose, contact timing, action language), and unclear licensing or compliance posture for production use. Teams that want to ship a real product end up paying the cost of building their own data pipeline, which means renting a lab, recruiting demonstrators, building an annotation stack, and managing the compliance chain. That cost is significant, and it competes for engineering capacity with the policy work itself.

EgoVista absorbs that cost. The output is a dataset, ready to load, ready to train on, with the annotation layers, the format, and the audit trail your team needs.

Compatible policy architectures.

The annotation stack and the export formats were chosen so that the dataset feeds the policy classes currently used in production and in research. The mapping is direct in each case, with no glue code required.

Diffusion Policy

Diffusion Policy (Chi et al.) needs dense observation and action sequences, ideally at a stable frame rate, and benefits from auxiliary signals like depth. EgoVista ships 30 FPS observations with depth as a first-class feature, plus hand pose if your conditioning uses it. The standard LeRobot training script for Diffusion Policy reads the dataset directly.

ACT (Action Chunking with Transformers)

ACT works best with long episodes split into action chunks and a clear language instruction per episode. EgoVista episodes are typically 30 to 180 seconds, with action labels in natural language and chunk boundaries documented in the metadata. The LeRobot v3.0 episodes format aligns naturally with ACT's chunking expectations.

VQ-BeT

VQ-BeT operates in a discretized action space and benefits from a well-defined behaviour vocabulary. Our action labels are produced in natural language but mapped to a controlled vocabulary that aligns with common manipulation primitives (reach, grasp, lift, place, push, pull, pour, rotate). Custom vocabularies can be requested at brief time, and we provide the mapping in the dataset card.
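As a rough illustration of what a controlled-vocabulary mapping can look like, the sketch below maps free-text labels onto the eight primitives listed above. The synonym table, the suffix handling, and the function names are assumptions made for this example. The actual mapping is the one documented in the dataset card.

```python
import re
from typing import Iterator, Optional

# The eight primitives named in the text.
PRIMITIVES = frozenset(
    {"reach", "grasp", "lift", "place", "push", "pull", "pour", "rotate"}
)

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "pick": "grasp",
    "grab": "grasp",
    "put": "place",
    "turn": "rotate",
    "raise": "lift",
}


def _candidate_stems(word: str) -> Iterator[str]:
    """Yield the word plus crude stems obtained by stripping common suffixes."""
    yield word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            stem = word[: -len(suffix)]
            yield stem
            yield stem + "e"  # e.g. "placing" -> "plac" -> "place"


def map_to_primitive(label: str) -> Optional[str]:
    """Return the first primitive found in a natural-language label, or None."""
    for word in re.findall(r"[a-z]+", label.lower()):
        for stem in _candidate_stems(word):
            stem = SYNONYMS.get(stem, stem)
            if stem in PRIMITIVES:
                return stem
    return None


mapped = [
    map_to_primitive(text)
    for text in ("Pours water into the glass", "Picks up the mug")
]
```

Labels that match no primitive return `None`, which makes it easy to count how much of a batch falls outside the vocabulary before training a discretized policy.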

RT-1, RT-2 and OpenVLA

Vision-language-action (VLA) models need consistent language instructions paired with visual observations and an action signal. EgoVista action labels are generated by a vision-language model, with confidence scores attached. The RLDS export follows Open X-Embodiment (OXE) conventions, so the dataset can be mixed with the public OXE corpus during training. One limitation applies: EgoVista produces the human signal rather than a robot-specific action, which the retargeting layer of the VLA policy then translates.

What manipulation policy training needs.

The requirements that recur across teams training manipulation policies, with the way EgoVista addresses each:

Volume. A working policy on a narrow task typically wants 50 to 100 demonstrations. A more general policy or a longer horizon task wants 300 to 500. Volume is set per project and scaled after a sample batch validates the brief.
Diversity. Lighting, environment, contributor profile, object variations. Our European contributor network covers multiple countries, dominant hands, and home and professional setups. We document distribution metrics in the dataset card so you can audit coverage.
Annotation density. Hand pose at 30 FPS, contact transitions per frame, action labels at episode and sub-episode level.
Action language consistency. Action labels are produced by a single model version per dataset, with a controlled vocabulary mapping. Style drift is therefore bounded across the dataset.
Quality. Frame drops are tagged, anonymisation is verified per frame, schema validation is run before delivery, and the QA report quantifies what passed.

Each of these dimensions is measured and reported in the QA document delivered alongside the dataset.
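One of the checks above, frame-drop tagging, can be sketched from timestamps alone. This is a minimal example assuming per-frame timestamps in seconds and a nominal 30 FPS stream. The function name and the tolerance value are hypothetical and do not describe the EgoVista QA pipeline.

```python
from typing import List, Sequence, Tuple


def find_frame_drops(
    timestamps: Sequence[float], fps: float = 30.0, tolerance: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (frame_index, missing_frames) for each gap longer than expected.

    A gap counts as a drop when it exceeds one frame period by more than
    `tolerance` periods (an assumed threshold).
    """
    period = 1.0 / fps
    drops = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        missing = round(gap / period) - 1
        if gap > period * (1.0 + tolerance) and missing > 0:
            drops.append((i, missing))
    return drops


# Ten frames at 30 FPS, with two consecutive frames removed to simulate a drop.
timestamps = [i / 30.0 for i in range(10)]
del timestamps[4:6]
drops = find_frame_drops(timestamps)
```

Running the same check on a delivered sample is a quick way to cross-check the frame-drop tags against your own loader.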

Example workflow: from brief to deployed policy.

A typical engagement runs through six stages, with the client controlling go or no-go decisions at each transition.

Figure: Typical workflow from brief to policy deployment.

Brief. Client describes the task, the target objects, the environment, the embodiment, the volume, and any constraints on contributor profile or compliance.
Sample dataset. EgoVista produces a 10 to 20 episode sample under the brief, with the agreed annotation stack and format.
Client validation. The client loads the sample with their existing training code, runs a smoke-test training run, and evaluates on their own benchmark or through qualitative review. If the brief needs adjustment, we iterate.
Scale up. Once the brief is validated, the contributor network produces the full batch, typically 100 to 500 episodes, with ongoing QA review.
Training and deployment. The client trains the policy and measures policy success on their own evaluation. Any specific failure modes can be addressed by a targeted follow-up batch.
Iteration. The dataset is versioned, so a refresh with additional environments, additional contributors, or a stricter schema can be produced without restarting from zero.

Environment and contributor diversity.

Manipulation policies generalise only as far as the training data spans. Our European contributor network is structured to cover a useful range of variables: country and home or professional environment, lighting conditions during the day, dominant hand, hand size and grip style, age range, manipulation technique on common objects. We do not collect demographic data beyond what is necessary for compliance with the contributor consent process, and we report distribution metrics in the dataset card so your team can audit coverage. Diversity targets are also negotiable at brief time, for projects that explicitly need a specific subset of variables to be balanced.
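Auditing coverage from distribution metrics can be as simple as counting shares per variable. The sketch below assumes per-episode metadata is available as a list of dictionaries. The field names (`country`, `dominant_hand`, `setting`) and the sample records are placeholders, not the dataset card schema.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def distribution(records: Iterable[Mapping[str, str]], key: str) -> Dict[str, float]:
    """Return the share of episodes per value of `key` (shares sum to 1)."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Placeholder metadata for four episodes (field names are assumptions).
episodes = [
    {"country": "FR", "dominant_hand": "right", "setting": "home"},
    {"country": "DE", "dominant_hand": "left", "setting": "home"},
    {"country": "FR", "dominant_hand": "right", "setting": "professional"},
    {"country": "ES", "dominant_hand": "right", "setting": "home"},
]

hand_share = distribution(episodes, "dominant_hand")
country_share = distribution(episodes, "country")
```

Comparing these shares against the diversity targets agreed in the brief shows at a glance which variables are under-represented.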

What we deliver and what we don't.

Honesty about capabilities matters more than marketing. Here is the boundary of what EgoVista delivers, stated explicitly:

We deliver annotations, not human-verified ground truth. Hand pose, depth, segmentation, contact timing and action labels are model predictions with measurable accuracy. The QA report quantifies how each layer performs on the delivered batch.
Action labels are produced by a vision language model. They are useful for training and for VLA conditioning, but they are not human-verified annotations. Teams that need human-verified action labels can request a manual review pass at additional cost and timeline.
Contributor diversity has geographic limits. The current network is European-first. North American or Asian distribution can be requested but requires longer timelines.
We do not produce simulation data. Every episode is a real recording by a real contributor on a real device. For teams that want sim plus real, we are happy to be the real side of the data mix.

Manipulation policy training FAQ.

How many demonstrations do I need for a manipulation policy?

It depends on the policy class and the task. A pick-and-place policy with a Diffusion Policy backbone can converge with 50 to 100 demonstrations on a narrow task, while a more general behaviour or a longer horizon task usually wants 300 to 500. We always recommend starting with an evaluation sample of 10 to 20 demos so your team can measure quality against your own benchmark before committing to a larger production batch.

Can you target a specific task like 'pick and place dishes'?

Yes, that is the default mode. The engagement starts with a brief that describes the task, the objects, the environment, and any specific contributor profile. We then match contributors and produce a small sample to validate the brief. Once the brief is locked in, scaling to a production batch follows the same recipe. If the task requires a controlled setup we cannot reach with our contributor network, we can stage dedicated recording sessions.

Do you support bimanual manipulation?

Yes. The 2D and 3D hand pose annotation runs per hand, and the hand-object segmentation includes a dedicated class for objects manipulated by both hands. Action labels are produced in natural language and routinely describe coordinated two-hand actions. For bimanual policies, the dataset card includes per-clip metadata on hand dominance and, when applicable, on which hand the task instruction implicitly assigns to each subtask.

How do you handle long-horizon tasks?

Long-horizon tasks are split into episodes that match natural task boundaries (start of activity to completion), with subtask annotations available as auxiliary action labels. The default frame rate of 30 FPS keeps the temporal resolution that diffusion and transformer-based policies need, and we can produce subsampled variants for VQ-BeT style architectures. For tasks longer than the LeRobot default chunk size, we ship multi-chunk episodes with explicit boundaries in the metadata.
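If you prefer to subsample on your side, picking frame indices from the 30 FPS stream is straightforward. This is a minimal sketch. The function name and the target rates are illustrative assumptions, not a description of how the delivered subsampled variants are produced.

```python
from typing import List


def subsample_indices(n_frames: int, source_fps: float = 30, target_fps: float = 10) -> List[int]:
    """Return the source frame indices to keep when reducing the frame rate."""
    if target_fps <= 0 or target_fps > source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    step = source_fps / target_fps
    indices = []
    k = 0
    # Keep the frame closest below each target timestamp until the episode ends.
    while int(k * step) < n_frames:
        indices.append(int(k * step))
        k += 1
    return indices


kept_10fps = subsample_indices(10, source_fps=30, target_fps=10)
kept_15fps = subsample_indices(7, source_fps=30, target_fps=15)
```

Per-frame annotations such as hand pose or contact state can be subsampled with the same index list so that observations and labels stay aligned.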

Can the dataset include failure cases for robustness training?

Yes, on request. The default brief targets successful executions, but failure cases can be requested as a separate sub-corpus, with the failure mode tagged in the metadata. This is useful for training robustness behaviours such as recovery, retrying, or detecting that a task cannot be completed. We do not produce failures in a controlled way (we are not actors), but we capture them in the wild and tag them.

What is the typical episode length?

Most manipulation episodes fall in the 30 to 180 second range, which corresponds to one to three full task attempts. Shorter episodes are common for fine motor tasks like buttoning a shirt or threading a needle. Longer episodes are typical for assembly or kitchen workflows. The brief sets the target range, and the contributor selection follows. If your training pipeline needs a fixed maximum episode length, we can split or truncate at delivery time.
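The difference between splitting and truncating to a fixed maximum length can be sketched as follows, assuming 30 FPS and a hypothetical 60-second cap. The function names are illustrative, and the returned values are half-open frame ranges.

```python
from typing import List, Tuple

FPS = 30  # default frame rate stated in the text


def split_episode(n_frames: int, max_frames: int) -> List[Tuple[int, int]]:
    """Split an episode into consecutive (start, end) ranges of at most max_frames."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    return [
        (start, min(start + max_frames, n_frames))
        for start in range(0, n_frames, max_frames)
    ]


def truncate_episode(n_frames: int, max_frames: int) -> Tuple[int, int]:
    """Keep only the first max_frames frames of an episode."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    return (0, min(n_frames, max_frames))


max_frames = 60 * FPS      # hypothetical 60-second cap -> 1800 frames
episode_frames = 100 * FPS  # a 100-second episode -> 3000 frames

segments = split_episode(episode_frames, max_frames)
truncated = truncate_episode(episode_frames, max_frames)
```

Splitting keeps every frame but can cut a task attempt in two, while truncating drops the tail of the episode, so the right choice depends on whether the end of the task matters for your policy.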

Discuss your manipulation policy project.

We can run a brief call to scope the task, the volume, the format, and the timeline. From there, a sample dataset typically lands in two to three weeks, with the production batch scheduled after sample validation. For related material, see the imitation learning use case, the LeRobot format details, the RLDS format details, or the full product page.
