Home / Datasets
Annotated egocentric video datasets for robotics manipulation.
Nine annotation layers (hand pose, depth, EgoHOS segmentation, contact timing, action labels), shipped in your training format with a measurable QA score. Built in Europe for robotics teams and the data path their policies actually consume.
What you receive
What's included in every dataset.
Nine annotation layers, each shipped with its output format and intended use. Limitations stated upfront so your team knows what to trust before training.
| Layer | Output | Typical use | Limitations |
|---|---|---|---|
| 2D hand pose | 21 keypoints × 2 hands per frame, with confidence | Behavioral cloning input, gesture detection | Accuracy drops on heavy occlusion / motion blur |
| 3D hand pose | 21 keypoints × 2 hands, 3D coordinates | VLA inputs, contact-aware policies | Depth-derived, inherits depth uncertainty |
| Depth estimation | Per-frame depth maps (16-bit PNG or float feature) | Geometry-aware policies, world models | Monocular, metric where intrinsics available, relative otherwise |
| EgoHOS segmentation | 5 classes per frame (left/right hand × objects) | Manipulation detection, contact regions | Trained on EgoHOS classes, out-of-distribution objects may fail |
| Contact timing | Per-frame hand-object contact transitions | Action segmentation, reward shaping | Derived signal, depends on segmentation quality |
| Action labels | Timestamped NL labels + controlled-vocabulary category | VLA training, instruction conditioning | Gemini Vision predictions, not human ground truth (uncertainty shipped) |
| Face anonymization | Blurred faces of third parties, applied locally | GDPR compliance, EU AI Act audit trail | Applied before external calls, no opt-out from the pipeline |
| Camera intrinsics | Focal, principal point, distortion per clip | Metric reconstruction, multi-view alignment | Estimated where EXIF unavailable |
| Metadata enrichment | Episode boundaries, schema version, QA score per clip | Dataset filtering, training selection | None |
Export
Export to your team's existing pipeline.
The same source dataset is repacked to four output formats. Pick the one your training loop already speaks.
LeRobot v3.0
Parquet shards plus MP4 frame chunks plus JSONL metadata, structured to match the canonical LeRobot v3.0 spec. Features dictionary follows observation.images.*, action, episode_index, frame_index. Load directly with lerobot.common.datasets.load_dataset. See the native LeRobot v3.0 export page for full schema.
RLDS TFRecord (Open X-Embodiment compatible)
TFRecord shards with a typed features dictionary, structured to slot into the Open X-Embodiment training pipeline. Compatible with TensorFlow Datasets and the RT-X / OpenVLA stack. See the RLDS TFRecord and Open X-Embodiment page for full schema.
Hugging Face Datasets
Private repo on the Hugging Face Hub with a dataset card, splits config, and a typed features schema. Loadable via datasets.load_dataset with token-based access control.
Raw ZIP (custom pipelines)
Frames, MP4 originals, and per-layer annotation files (JSON, NumPy, 16-bit PNG) packaged in a versioned ZIP. For teams with proprietary training stacks that prefer to handle their own IO.
Specifications
Technical specifications.
The defaults we ship with, adjustable per project at brief time.
| Accepted input formats | MP4 (H.264/H.265), MOV, MKV, WebM |
|---|---|
| Input resolution | 720p to 4K |
| Output frame rate | 10 fps default, 5–30 fps configurable |
| Annotation layers | 9 layers (see table above) |
| Schema versioning | Per-dataset, JSON-encoded, immutable after delivery |
| Delivery method | Signed URL (Cloudflare R2 EU, expires in 7 days, renewable) |
| Retention | Two options: ship-and-delete (we delete after delivery), or 12-month archival with exclusivity |
| QA scoring | Per-clip schema validation, hand pose accuracy, action label precision, segmentation coverage |
| Compliance | GDPR-compliant by design. See the GDPR compliance details |
Engagement
Pricing and engagement models.
Three ways to work with us, depending on whether you already have video, need new collection, or want raw anonymized data at scale.
Custom collection + annotation
Turn-key. Use when you need a specific task or environment captured fresh and annotated. Includes contributor matching, mission briefs, capture, anonymization, full 9-layer annotation, and QA report.
Annotation only
You provide egocentric or teleop footage; we run the full annotation pipeline and ship in your chosen format. Use when you already have capture infrastructure or proprietary footage to scale up.
Raw anonymized video
Volume play. Anonymized egocentric video without the annotation stack, typically used for foundation model pretraining, vision encoders, and large-scale world models. Compliance posture identical.
Pricing depends on volume, task complexity, and engagement model. We prefer to scope the work together. Request a quote tailored to your project and we will come back with a fixed price and timeline.
Technical FAQ
Technical FAQ.
Do you provide depth or only RGB?
Both. RGB frames are always shipped at full resolution; per-frame depth maps are generated with Depth Anything and shipped as either 16-bit PNGs (in ZIP exports) or as a typed feature in the LeRobot/RLDS schema. Depth is metric where camera intrinsics are available, relative otherwise, and we ship both flavors when the source allows it.
How accurate is the hand pose annotation?
Hand keypoint accuracy depends on viewpoint, occlusion, and motion blur. On typical egocentric footage, MediaPipe Hands reports per-keypoint confidence and tends to localize the dominant hand within a small fraction of the bounding box width when the hand is unoccluded and in frame. We ship per-keypoint confidence so your training loop can mask low-confidence frames or weight them down, and the dataset card reports measured accuracy per delivery.
Can you re-deliver with a different schema version?
Yes. Each dataset is versioned and stored under the original schema; if you need a re-export under a newer schema (LeRobot v3.1 once released, an updated RLDS features dict, or a custom mapping), we re-pack from the canonical source. There is a small re-pack delay (typically two business days) but no re-collection involved.
What is your typical schema for action labels?
Action labels are timestamped natural-language strings paired with start/end frame indices and an action category from a controlled vocabulary (pick, place, pour, push, rotate, open, close, wipe, etc.). We can adapt the vocabulary to your downstream model's tokenizer or instruction format on request.
Do you support synchronized multi-camera setups?
Yes for input. You can submit synchronized multi-view footage with hardware timestamps and we will preserve the sync metadata through the pipeline. Our annotation layers run per-view, so each camera gets its own pose, depth, segmentation, and action labels with frame-level alignment.
Can we audit a sample before scaling?
We strongly recommend it. Standard kickoff includes a small evaluation batch (typically ten clips) delivered with a full QA report and the same schema as the final dataset. You validate quality on your own benchmark before we scale, and we adjust the brief if anything is off.
What is your contributor diversity for environment variability?
Our European contributor network spans multiple countries, ages, dominant hands, and lighting conditions. For projects that require demographic or geographic diversity, we filter contributor matching at brief time and report distribution metrics with the dataset (anonymized).
How do you handle edge cases or failed annotations?
Failed annotations are detected by the QA layer (out-of-range pose, missing segmentation, schema validation errors). Affected clips are either re-annotated on the same source video or replaced from the contributor batch. If neither is possible, the clip is flagged in the QA report so your training loop can drop it cleanly.
Get a sample dataset for your use case.
Tell us the task, the environment, and the format. We'll ship a fully annotated sample so your team can evaluate quality before scaling.