Early Access · First cohort

Egocentric video datasets for robotics manipulation training.

Native LeRobot and RLDS output. Multi-layer annotation pipeline. GDPR-compliant by design. Built in Europe for robotics teams scaling from demos to deployment.

Modular annotation stack · EU-only pipeline · QA-scored output
First-person egocentric view of cooking task — bimanual manipulation captured for imitation learning training

The problem

Why generic video data fails manipulation policies.

Imitation learning, behavioral cloning and VLA training are bottlenecked by the same three gaps: the egocentric viewpoint, the annotation stack a policy actually consumes, and the compliance posture you need to deploy in Europe. EgoVista is built to close all three.

Viewpoint mismatch with the robot

Third-person and surveillance footage do not match the egocentric viewpoint a manipulator actually sees. Policies trained on those signals collapse at deployment because the input distribution shifts.

No hand pose, no contact timing, no depth

Raw video is just RGB. To train a manipulation policy you also need hand keypoints, hand-object contact transitions, and per-frame depth: the layers that turn pixels into a learnable control signal.

GDPR exposure on every external API call

Most annotation services send raw frames containing identifiable faces to US-based models. That creates GDPR exposure, complicates EU AI Act compliance, and leaves your training corpus open to deletion requests you have no protocol to honor.

Real-world data

Egocentric data across diverse real-world environments.

Manipulation policies only generalize as far as their training data reaches. A model that has only seen one kitchen struggles in the next one. These frames come from real homes and workshops across Europe, with different lighting, layouts, and contributors. The variety is the point: it is what lets a policy hold up outside the room it was trained in.

Egocentric first-person view of cooking pasta carbonara in a home kitchen
Kitchen
Egocentric first-person view of folding t-shirts on a workbench in a home workshop
Workshop
Egocentric first-person view of watering plants with a watering can in a garden
Outdoor

How it ships

Two ways to ship.

Whether you start from scratch or already have footage, we deliver training-ready datasets in LeRobot or RLDS format.

Custom data collection

Custom collection, coming soon

Tell us what task your robot needs to learn, the environment, and the objects involved. We're building a contributor network to film targeted egocentric footage on demand. Custom collection projects open Q3 2026. Contact us to discuss your needs and get early access.

Best for: teams starting from scratch, specific environments needed, diverse contributors

Annotation as a service

We annotate your videos

Already have egocentric or manipulation video? Send it to us. Our pipeline adds up to 9 annotation layers and exports in LeRobot or RLDS format. Sample batches in days, production runs in weeks.

Best for: teams with existing footage, teleop recordings, in-house captures

Pricing tailored to your project scope

Send your videos →

Annotation service · how it works

From your footage to a training-ready dataset in 3 steps.

Step 01

Upload your videos

Send us a link to your files or email them to contact@egovista.app. We accept most modern video formats (MP4, MOV, MKV, WebM), at any file size.

Step 02

Choose your layers

Pick the annotation modules you need: pose, segmentation, depth, actions, contacts.

Step 03

Get your dataset

Receive annotated data in LeRobot, RLDS, or custom format. Sample batches in days, production runs in weeks.

Nine annotation layers, one ready-to-train dataset.

Hand pose, depth estimation, EgoHOS segmentation, contact timing, action labels, and more, packaged together in a single LeRobot or RLDS dataset so your training loop needs only one I/O path. Mix and match modules per project.

2D hand pose

Live

MediaPipe: 21 keypoints per hand per frame

3D hand pose

Live

MediaPipe + Depth: 21 keypoints × 2 hands in 3D space

Depth estimation

Live

Depth Anything: per-frame depth maps

EgoHOS hand & object segmentation

Live

EgoHOS: 5 classes per frame, EU GPU inference

Contact timing

Live

Hand-object contact transitions, per frame

Action labels

Live

Gemini Vision (EU): timestamped natural-language labels

Face anonymization

Live

MediaPipe Face Detection: applied before any external call

Camera intrinsics

Live

Focal length, principal point, distortion, per clip

Metadata enrichment

Live

Episode boundaries, schema versioning, QA score per clip
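
To illustrate how the 3D hand pose, depth, and camera intrinsics layers listed above relate to each other, here is a minimal sketch of pinhole back-projection. It is not EgoVista's implementation. It assumes keypoints are already in pixel coordinates, depth is already metric (in meters), and lens distortion is ignored; `Intrinsics`, `backproject`, and `lift_keypoints` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole intrinsics in pixels (hypothetical container, distortion ignored)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point x
    cy: float  # principal point y


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def lift_keypoints(keypoints_px, depths_m, k: Intrinsics):
    """Back-project a list of 2D keypoints using one depth value per keypoint."""
    if len(keypoints_px) != len(depths_m):
        raise ValueError("need exactly one depth value per keypoint")
    return [backproject(u, v, d, k) for (u, v), d in zip(keypoints_px, depths_m)]
```

A pixel at the principal point maps to a point on the optical axis, so `backproject(k.cx, k.cy, 1.0, k)` returns `(0.0, 0.0, 1.0)`. If the depth maps you receive are relative rather than metric, a scale alignment step would be needed first.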

Need a module that's not listed? Ask. We ship new modules in 2 weeks on request.
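
As an example of what a per-frame contact layer makes possible, the sketch below turns a sequence of per-frame contact flags into start and end events. The boolean-per-frame input and the event names `contact_start` and `contact_end` are assumptions made for illustration, not the delivered schema.

```python
def contact_transitions(in_contact: list[bool]) -> list[tuple[int, str]]:
    """Return (frame_index, event) pairs at frames where the contact flag changes.

    Assumes there is no contact before the first frame.
    """
    events = []
    previous = False
    for index, current in enumerate(in_contact):
        if current and not previous:
            events.append((index, "contact_start"))
        elif previous and not current:
            events.append((index, "contact_end"))
        previous = current
    return events
```

Events like these can be used to cut a long demonstration into grasp and release segments before training.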

Format support

Native LeRobot v3.0 compatibility.

Datasets ship in the LeRobot v3.0 format out of the box. Load them straight into the Hugging Face LeRobot stack without writing a single conversion script. Tested against the canonical LeRobot dataset spec.
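
Before pointing a training job at a delivered dataset, a quick sanity check on the format version can save a failed run. The sketch below assumes the dataset root contains a `meta/info.json` file with a `codebase_version` field, as LeRobot datasets typically do; the function names are hypothetical and the check uses only the standard library, not the LeRobot API.

```python
import json
from pathlib import Path


def read_codebase_version(dataset_root: str) -> str:
    """Read the format version from <root>/meta/info.json (assumed layout)."""
    info_path = Path(dataset_root) / "meta" / "info.json"
    with info_path.open(encoding="utf-8") as handle:
        info = json.load(handle)
    return str(info.get("codebase_version", "unknown"))


def is_v3_dataset(dataset_root: str) -> bool:
    """True if the recorded version string starts with 'v3'."""
    return read_codebase_version(dataset_root).startswith("v3")
```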

Format support

Open X-Embodiment compatible RLDS export.

Need RLDS? We export to the TFDS-backed schema used by Open X-Embodiment so your dataset slots into any existing imitation learning or VLA training pipeline, no glue code required.
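
For readers new to RLDS, the core idea is an episode made of ordered steps, each carrying an observation, an action, and the flags `is_first`, `is_last`, and `is_terminal`. The sketch below builds that shape from plain Python dicts. It is only an illustration of the structure: real RLDS datasets are TFDS-backed, the input frame keys are assumptions, and treating the last step as terminal is a simplification that may not hold for truncated demonstrations.

```python
def to_rlds_style_episode(frames: list[dict]) -> dict:
    """Wrap per-frame records into an RLDS-style episode of steps.

    Each input frame is assumed to hold 'observation' and 'action' keys.
    """
    if not frames:
        raise ValueError("an episode needs at least one frame")
    last_index = len(frames) - 1
    steps = []
    for index, frame in enumerate(frames):
        steps.append({
            "observation": frame["observation"],
            "action": frame["action"],
            "is_first": index == 0,
            "is_last": index == last_index,
            # Assumption: the demonstration ends in a terminal state.
            "is_terminal": index == last_index,
        })
    return {"steps": steps}
```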

EU-first by design

GDPR-compliant robotics datasets, built in Europe.

Face anonymization before any external processing, EU-only infrastructure, documented legal bases per processing activity, EU AI Act-ready audit trail. GDPR compliance details.

GDPR-compliant by default

Faces blurred at source before any annotation. DPA available, contributor consent collected per mission, RoPA documented.

EU-only compute

Storage on Cloudflare R2 (EU), GPU inference on RunPod Amsterdam, labeling on Vertex AI europe-west4. No data crosses the Atlantic.

EU contributors

Network of European contributors recording in EU jurisdictions. Right to erasure, opt-out, and full data lineage on demand.

Audit trail end-to-end

Every clip, every annotation pass, every model invocation logged with timestamps. Inspect-ready for EU AI Act compliance.

Quality assurance

Every dataset shipped with a measurable QA score.

Annotation quality is checked per clip, scored, and shipped with the dataset. You see exactly what was reviewed, what passed, and where uncertainty remains.

Scoring

Per-dataset QA report

Schema validation, hand pose accuracy, action label precision and segmentation coverage measured per clip. Average score across delivered datasets sits above 90%.
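
How per-clip metrics roll up into a dataset score is not specified here, so the following is only a sketch under stated assumptions: each metric is a value in [0, 1], a clip score is the unweighted mean of its metrics, and the dataset score is the unweighted mean of clip scores. The metric names and function names are hypothetical.

```python
from statistics import mean


def clip_score(metrics: dict[str, float]) -> float:
    """Unweighted mean of one clip's metrics (assumed aggregation)."""
    return mean(metrics.values())


def dataset_report(clips: dict[str, dict[str, float]], threshold: float) -> dict:
    """Score every clip, average them, and list the clips under the threshold."""
    scores = {clip_id: clip_score(metrics) for clip_id, metrics in clips.items()}
    return {
        "clip_scores": scores,
        "dataset_score": mean(scores.values()),
        "below_threshold": sorted(c for c, s in scores.items() if s < threshold),
    }
```

A report like this makes it easy to see which clips would need another annotation pass against an agreed threshold.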

Honest uncertainty

Predictions, not ground truth

Labels are model predictions, not ground truth. Every dataset includes per-frame uncertainty metadata so your team knows what to trust and what to filter before training.
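
One way to use per-frame uncertainty is to drop frames where any layer you depend on is too uncertain. The sketch below assumes each frame carries an `uncertainty` dict keyed by layer name with values in [0, 1]; that field name, the layer names, and the thresholds are illustrative assumptions, not the delivered schema.

```python
def keep_frame(frame_uncertainty: dict[str, float], thresholds: dict[str, float]) -> bool:
    """True if every layer named in thresholds is present and within its limit."""
    return all(
        layer in frame_uncertainty and frame_uncertainty[layer] <= limit
        for layer, limit in thresholds.items()
    )


def filter_frames(frames: list[dict], thresholds: dict[str, float]) -> list[dict]:
    """Keep only frames whose per-layer uncertainty passes every threshold."""
    return [frame for frame in frames if keep_frame(frame["uncertainty"], thresholds)]
```

Layers you do not list in `thresholds` are ignored, so a policy that only consumes hand pose does not lose frames because of a noisy depth map.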

Auditability

Full annotation provenance

Every annotation pass logs the model version, timestamp, and EU region. If your downstream training surfaces a regression, you can trace it back to the exact pipeline run.
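
As an illustration of tracing a regression back to a pipeline run, the sketch below groups provenance entries for a set of suspect clips by annotation layer. The `AnnotationPass` record and its field names are assumptions based on the three logged items named above (model version, timestamp, EU region); the actual log format may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationPass:
    """One provenance log entry (hypothetical record shape)."""
    clip_id: str
    layer: str
    model_version: str
    timestamp: str  # ISO 8601 string
    region: str


def versions_by_layer(log: list[AnnotationPass], clip_ids: list[str]) -> dict[str, set[str]]:
    """For the given clips, collect which model versions produced each layer."""
    wanted = set(clip_ids)
    versions: dict[str, set[str]] = defaultdict(set)
    for entry in log:
        if entry.clip_id in wanted:
            versions[entry.layer].add(entry.model_version)
    return dict(versions)
```

If the suspect clips share a model version that the healthy clips do not, that pipeline run is a natural place to start looking.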

Use cases

From manipulation policies to imitation learning pipelines.

The same annotated egocentric video powers three distinct training workflows. Pick the one closest to your stack, or talk to us about a hybrid.

BC · Offline RL · VLA

Manipulation policy training

Train BC, offline RL and VLA policies on egocentric human demonstrations annotated with hand pose, contact timing, depth, and action labels, shipped in your training format.

GAIL · DAgger · AIRL

Imitation learning research

Dense, diverse human demonstration data with the annotation layers that imitation learning baselines actually consume, no custom preprocessing required.

Foundation models

Foundation model pretraining

Anonymized, GDPR-clean egocentric video at scale for pretraining vision encoders, action models, and large-scale world models, with full provenance.

Talk to us about pretraining

How it works

Three steps from raw video to training-ready dataset.

No black box. You see exactly what you brief, what we ship, and how it was annotated.

01

Brief

You specify the task, target environment, format (LeRobot or RLDS), schema, and any compliance constraints.

02

Capture & annotate

Our European contributor network captures egocentric footage; the 9-layer pipeline annotates, anonymizes, and packages it on EU-only infrastructure.

03

Delivery

Encrypted signed URL, your chosen format and schema, with a per-dataset QA report and full annotation provenance.

FAQ

Frequently asked questions.

What video formats do you accept as input?

Most modern container/codec combinations: MP4 (H.264, H.265), MOV, MKV, WebM. Resolution from 720p to 4K. We strongly prefer egocentric footage (head-mounted camera, smart glasses, or chest-mounted GoPro), but we can work with first-person teleop or wrist-camera footage if that is what your project requires.

How long does a typical dataset take to deliver?

Timeline depends on volume and engagement model. For annotation-only on a small batch of existing video, expect days. For full collection + annotation, expect two to six weeks for a first pilot batch, depending on task complexity and contributor matching. We always agree on a delivery date before kickoff and ship the QA report with the data.

How do you ensure GDPR compliance for contributors and subjects?

Three layers: (1) face anonymization applied locally with MediaPipe before any external API call, so no identifiable face data ever crosses an external boundary; (2) EU-only infrastructure for storage, GPU inference, and labeling (Cloudflare R2 EU, RunPod Amsterdam, Vertex AI europe-west4); (3) contributor consent collected per mission with documented legal bases and a right to withdrawal that we honor within thirty days.

Can you replicate a specific task or environment?

Yes. Tell us the task, the environment, and any constraints on objects, lighting, or contributor profile. We brief our European contributor network, match the closest profiles, and ship targeted clips for that exact scenario. If you need a task that has never been captured, we can also stage controlled recordings.

What's the difference between LeRobot and RLDS exports?

LeRobot v3.0 is the Hugging Face robotics dataset format: Parquet shards for tabular data, MP4 video, and JSON and Parquet metadata, native to the LeRobot training stack. RLDS is the TFRecord-based format used by Open X-Embodiment and the broader TFDS ecosystem. We ship both from the same source data; pick whichever matches your training pipeline.

Do you provide ground truth or annotations only?

Annotations. Our hand pose, depth, segmentation, and action labels are model predictions, not ground truth. They have measurable accuracy but are not human-verified frame by frame. Every dataset ships with uncertainty metadata so your team can filter or weight noisy samples. Human-verified ground truth is available on request for evaluation subsets.

What happens if the quality doesn't meet our requirements?

The QA report ships before the dataset is final. If the score is below the threshold we agreed at kickoff, we re-annotate or recollect the affected clips at no additional cost. We also support evaluation pilots (a small sample before scaling) so you can validate quality on your own benchmark.

Can we sign an NDA before discussing details?

Yes. We have a standard mutual NDA template that covers your project specifics, training objectives, and the contents of the dataset. We can sign it before the first technical call. Email contact@egovista.app and we can have it back to you within 24 hours.

Start with a sample dataset.

Tell us the task, the environment, and the format. We'll ship you a fully annotated egocentric video sample in LeRobot or RLDS so your team can evaluate quality before scaling.

No commitment · 48h response · NDA available