Annotated egocentric video datasets for robotics manipulation.

Nine annotation layers, including hand pose, depth, hand-object segmentation, contact timing, and action labels, shipped in your training format with a per-clip QA score. Built in Europe for robotics teams, in the data formats their policies actually consume.

What you receive

What's included in every dataset.

Nine annotation layers, each shipped with its output format and intended use. Limitations are stated upfront so your team knows what to trust before training.

Layer	Output	Typical use	Limitations
2D hand pose	21 keypoints × 2 hands per frame, with confidence	Behavioral cloning input, gesture detection	Accuracy drops on heavy occlusion / motion blur
3D hand pose	21 keypoints × 2 hands, 3D coordinates	VLA inputs, contact-aware policies	Depth-derived, inherits depth uncertainty
Depth estimation	Per-frame depth maps (16-bit PNG or float feature)	Geometry-aware policies, world models	Monocular, metric where intrinsics available, relative otherwise
Hand-object segmentation	5 classes per frame (left/right hand × objects)	Manipulation detection, contact regions	Trained on a fixed class set; out-of-distribution objects may fail
Contact timing	Per-frame hand-object contact transitions	Action segmentation, reward shaping	Derived signal, depends on segmentation quality
Action labels	Timestamped NL labels + controlled-vocabulary category	VLA training, instruction conditioning	VLM-based predictions, not human ground truth (uncertainty shipped)
Face anonymization	Blurred faces of third parties, applied locally	GDPR compliance, EU AI Act audit trail	Applied before external calls, no opt-out from the pipeline
Camera intrinsics	Focal, principal point, distortion per clip	Metric reconstruction, multi-view alignment	Estimated where EXIF unavailable
Metadata enrichment	Episode boundaries, schema version, QA score per clip	Dataset filtering, training selection	None
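
As an illustration of how the contact timing layer can feed action segmentation, the sketch below turns a per-frame contact flag into contact intervals. The flat list-of-booleans input and the function name are assumptions made for this example, not the delivered schema.

```python
from typing import List, Tuple


def contact_intervals(in_contact: List[bool]) -> List[Tuple[int, int]]:
    """Return inclusive (start_frame, end_frame) runs where contact is True."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for frame, flag in enumerate(in_contact):
        if flag and start is None:
            start = frame  # contact onset
        elif not flag and start is not None:
            intervals.append((start, frame - 1))  # contact release
            start = None
    if start is not None:
        # The clip ends while the hand is still in contact.
        intervals.append((start, len(in_contact) - 1))
    return intervals


# Hypothetical per-frame flags for one hand in one clip.
flags = [False, True, True, False, False, True]
print(contact_intervals(flags))
```

Because contact timing is a derived signal, intervals like these inherit any segmentation errors, so very short runs may be worth filtering before they are used for reward shaping.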

Export

Export to your team's existing pipeline.

The same source dataset is repacked to four output formats. Pick the one your training loop already speaks.

LeRobot v3.0

Parquet shards, MP4 frame chunks, and JSONL metadata, structured to match the canonical LeRobot v3.0 spec. The features dictionary follows the observation.images.*, action, episode_index, and frame_index keys. Load it directly with LeRobot's LeRobotDataset class. See the native LeRobot v3.0 export page for the full schema.

RLDS TFRecord (Open X-Embodiment compatible)

TFRecord shards with a typed features dictionary, structured to slot into the Open X-Embodiment training pipeline. Compatible with TensorFlow Datasets and the RT-X / OpenVLA stack. See the RLDS TFRecord and Open X-Embodiment page for full schema.

Hugging Face Datasets

Private repo on the Hugging Face Hub with a dataset card, splits config, and a typed features schema. Loadable via datasets.load_dataset with token-based access control.

Raw ZIP (custom pipelines)

Frames, MP4 originals, and per-layer annotation files (JSON, NumPy, 16-bit PNG) packaged in a versioned ZIP. For teams with proprietary training stacks that prefer to handle their own IO.
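
For teams handling their own IO, the sketch below shows one way to read a per-layer JSON annotation straight out of a ZIP without extracting it, using only the Python standard library. The member path and JSON keys are hypothetical; the actual layout is defined per delivery.

```python
import io
import json
import zipfile


def read_annotation(archive: zipfile.ZipFile, member: str) -> dict:
    """Load one JSON annotation file from the archive without extracting it."""
    with archive.open(member) as fh:
        return json.load(fh)


# Build a tiny in-memory archive so the example runs without a real delivery.
# The path and keys below are placeholders, not the shipped layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "clip_0001/annotations/action_labels.json",
        json.dumps({"schema_version": "1.0", "labels": [{"text": "pick up cup"}]}),
    )

buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    json_members = [name for name in zf.namelist() if name.endswith(".json")]
    labels = read_annotation(zf, json_members[0])

print(json_members, labels["schema_version"])
```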

Specifications

Technical specifications.

The defaults we ship with, adjustable per project at brief time.

Accepted input formats	MP4 (H.264/H.265), MOV, MKV, WebM
Input resolution	720p to 4K
Output frame rate	10 fps default, 5–30 fps configurable
Annotation layers	9 layers (see table above)
Schema versioning	Per-dataset, JSON-encoded, immutable after delivery
Delivery method	Signed URL (EU-region object storage, expires in 7 days, renewable)
Retention	Two options: ship-and-delete (we delete after delivery), or 12-month archival with exclusivity
QA scoring	Per-clip schema validation, hand pose accuracy, action label precision, segmentation coverage
Compliance	GDPR-compliant by design. See the GDPR compliance details
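
To make the output frame rate setting concrete, the sketch below shows one common way to choose source frames when resampling to a lower rate: each output frame maps to the source frame at or just before its timestamp. This is an illustrative assumption, not a description of the resampler used in the pipeline.

```python
from typing import List


def resample_indices(num_src_frames: int, src_fps: float, out_fps: float) -> List[int]:
    """Indices of the source frames kept when downsampling src_fps to out_fps."""
    if src_fps <= 0 or out_fps <= 0:
        raise ValueError("frame rates must be positive")
    if out_fps > src_fps:
        raise ValueError("this sketch only downsamples")
    # Number of output frames that fit in the clip duration.
    num_out = int(num_src_frames * out_fps / src_fps)
    # Output frame k sits at time k / out_fps; take the source frame at or before it.
    return [min(int(k * src_fps / out_fps), num_src_frames - 1) for k in range(num_out)]


# A 9-frame clip recorded at 30 fps, exported at the 10 fps default.
print(resample_indices(9, 30, 10))
```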

Engagement

Pricing and engagement models.

Three ways to work with us, depending on whether you already have video, need new collection, or want raw anonymized data at scale.

Custom collection + annotation

Turn-key. Use when you need a specific task or environment captured fresh and annotated. Includes contributor matching, mission briefs, capture, anonymization, full 9-layer annotation, and QA report.

Annotation only

You provide egocentric or teleop footage; we run the full annotation pipeline and ship in your chosen format. Use when you already have capture infrastructure or proprietary footage to scale up.

Raw anonymized video

Volume play. Anonymized egocentric video without the annotation stack, typically used for foundation model pretraining, vision encoders, and large-scale world models. The compliance posture is identical to the other two engagement models.

Pricing depends on volume, task complexity, and engagement model. We prefer to scope the work together. Request a quote tailored to your project and we will come back with a fixed price and timeline.

Technical FAQ

Do you provide depth or only RGB?

Both. RGB frames are always shipped at full resolution; per-frame depth maps are generated by a monocular depth model and shipped as either 16-bit PNGs (in ZIP exports) or as a typed feature in the LeRobot/RLDS schema. Depth is metric where camera intrinsics are available, relative otherwise, and we ship both flavors when the source allows it.
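
As a rough sketch of consuming metric depth from a 16-bit PNG, the function below converts raw integer pixel values to meters. The millimeter scale and the use of 0 as the invalid marker are assumptions for illustration; decoding the PNG itself is left to your image library.

```python
from typing import List, Optional, Sequence


def depth_to_meters(
    raw: Sequence[int],
    units_per_meter: float = 1000.0,  # assumed: raw values stored in millimeters
    invalid_value: int = 0,           # assumed: 0 marks pixels with no depth
) -> List[Optional[float]]:
    """Convert raw 16-bit depth values to meters, mapping invalid pixels to None."""
    return [None if v == invalid_value else v / units_per_meter for v in raw]


# One hypothetical row of a decoded depth map.
print(depth_to_meters([0, 1000, 2500]))
```

Relative depth has no fixed unit, so a conversion like this would only apply to clips delivered with metric depth.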

How accurate is the hand pose annotation?

Hand keypoint accuracy depends on viewpoint, occlusion, and motion blur. On typical egocentric footage, our hand-tracking model tends to localize the dominant hand within a small fraction of the bounding box width when the hand is unoccluded and in frame. We ship per-keypoint confidence so your training loop can mask or down-weight low-confidence keypoints and frames, and the dataset card reports measured accuracy per delivery.
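
A minimal sketch of confidence-based masking in a training loop might look like the following. The 0.5 threshold, the function names, and the plain-list keypoint format are assumptions for illustration; tune the threshold against the measured accuracy in the dataset card.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def keypoint_weights(
    confidences: Sequence[float], threshold: float = 0.5, soft: bool = False
) -> List[float]:
    """Per-keypoint loss weights: hard 0/1 mask, or the clamped confidence if soft."""
    if soft:
        return [min(max(c, 0.0), 1.0) for c in confidences]
    return [1.0 if c >= threshold else 0.0 for c in confidences]


def weighted_l2(pred: Sequence[Point], target: Sequence[Point], weights: Sequence[float]) -> float:
    """Weighted mean squared 2D error; zero-weight keypoints do not contribute."""
    total = sum(weights)
    if total == 0:
        return 0.0  # every keypoint was masked out
    err = sum(
        w * ((px - tx) ** 2 + (py - ty) ** 2)
        for (px, py), (tx, ty), w in zip(pred, target, weights)
    )
    return err / total


weights = keypoint_weights([0.9, 0.2, 0.5])
print(weights)
```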

Can you re-deliver with a different schema version?

Yes. Each dataset is versioned and stored under the original schema; if you need a re-export under a newer schema (LeRobot v3.1 once released, an updated RLDS features dict, or a custom mapping), we re-pack from the canonical source. There is a small re-pack delay (typically two business days) but no re-collection involved.

What is your typical schema for action labels?

Action labels are timestamped natural-language strings paired with start/end frame indices and an action category from a controlled vocabulary (pick, place, pour, push, rotate, open, close, wipe, etc.). We can adapt the vocabulary to your downstream model's tokenizer or instruction format on request.
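
One way to picture an action label record is the sketch below. The field names are hypothetical, and the vocabulary lists only the example categories named above rather than the full controlled vocabulary.

```python
from dataclasses import dataclass

# Example categories only; the delivered vocabulary may be larger or customized.
ACTION_VOCABULARY = {"pick", "place", "pour", "push", "rotate", "open", "close", "wipe"}


@dataclass(frozen=True)
class ActionLabel:
    text: str          # natural-language description
    category: str      # controlled-vocabulary category
    start_frame: int   # inclusive
    end_frame: int     # inclusive

    def __post_init__(self) -> None:
        if self.category not in ACTION_VOCABULARY:
            raise ValueError(f"unknown category: {self.category!r}")
        if not 0 <= self.start_frame <= self.end_frame:
            raise ValueError("frame range must satisfy 0 <= start_frame <= end_frame")

    def duration_s(self, fps: float = 10.0) -> float:
        """Duration in seconds at the given output frame rate."""
        return (self.end_frame - self.start_frame + 1) / fps


label = ActionLabel("pick up the mug", "pick", start_frame=12, end_frame=31)
print(label.category, label.duration_s())
```

Validating the category at construction time is a design choice for this sketch: it surfaces vocabulary mismatches when labels are loaded rather than midway through training.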

Do you support synchronized multi-camera setups?

Yes, for input. You can submit synchronized multi-view footage with hardware timestamps and we will preserve the sync metadata through the pipeline. Our annotation layers run per view, so each camera gets its own pose, depth, segmentation, and action labels with frame-level alignment.
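
To show how preserved timestamps can be used downstream, the sketch below pairs frames from two views by nearest timestamp and rejects pairs whose skew is too large. The microsecond integer timestamps, the skew tolerance, and the function names are assumptions for illustration.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence, Tuple


def nearest_frame(timestamps: Sequence[int], t: int) -> int:
    """Index of the timestamp closest to t; timestamps must be sorted ascending."""
    if not timestamps:
        raise ValueError("timestamps must not be empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    return i if after - t < t - before else i - 1


def align_views(
    ref_ts: Sequence[int], other_ts: Sequence[int], max_skew: int
) -> List[Tuple[int, Optional[int]]]:
    """Pair each reference frame with the nearest frame in the other view, or None."""
    pairs = []
    for ref_idx, t in enumerate(ref_ts):
        j = nearest_frame(other_ts, t)
        pairs.append((ref_idx, j if abs(other_ts[j] - t) <= max_skew else None))
    return pairs


# Hypothetical hardware timestamps in microseconds for two cameras.
cam_a = [0, 100_000, 200_000]
cam_b = [2_000, 101_000, 350_000]
print(align_views(cam_a, cam_b, max_skew=5_000))
```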

Can we audit a sample before scaling?

We strongly recommend it. Standard kickoff includes a small evaluation batch (typically ten clips) delivered with a full QA report and the same schema as the final dataset. You validate quality on your own benchmark before we scale, and we adjust the brief if anything is off.

What is your contributor diversity for environment variability?

Our European contributor network covers multiple countries, age groups, and both dominant hands, across a range of lighting conditions and environments, from home settings to skilled-trade workspaces. For projects that require demographic or geographic diversity, we filter contributor matching at brief time and report anonymized distribution metrics with the dataset.

How do you handle edge cases or failed annotations?

Failed annotations are detected by the QA layer (out-of-range pose, missing segmentation, schema validation errors). Affected clips are either re-annotated on the same source video or replaced from the contributor batch. If neither is possible, the clip is flagged in the QA report so your training loop can drop it cleanly.
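
Dropping flagged clips on the training side can be as simple as the sketch below. The record fields (clip_id, qa_score, flagged) and the score threshold are hypothetical; the actual QA report format is defined per delivery.

```python
from typing import Dict, List


def select_training_clips(qa_report: List[Dict], min_score: float = 0.0) -> List[str]:
    """Return IDs of clips that are not flagged and meet the minimum QA score."""
    return [
        clip["clip_id"]
        for clip in qa_report
        if not clip.get("flagged", False) and clip["qa_score"] >= min_score
    ]


# Hypothetical QA report entries.
report = [
    {"clip_id": "clip_0001", "qa_score": 0.94, "flagged": False},
    {"clip_id": "clip_0002", "qa_score": 0.91, "flagged": True},
    {"clip_id": "clip_0003", "qa_score": 0.62},
]
print(select_training_clips(report, min_score=0.8))
```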

Get a sample dataset for your use case.

Tell us the task, the environment, and the format. We'll ship a fully annotated sample so your team can evaluate quality before scaling.
