RLDS dataset format and Open X-Embodiment compatibility.

EgoVista ships every dataset in the RLDS TFRecord schema, with the conventions used by Open X-Embodiment and RT-X. If your team trains on the tensorflow_datasets ecosystem, the integration is one library call away.

Why RLDS for robotics datasets.

RLDS, short for Reinforcement Learning Datasets, was introduced by Google Research to standardize how sequential decision-making data is stored on disk. It is, in practice, the format the robotics research community converged on for cross-embodiment imitation learning, and it underpins the Open X-Embodiment collaboration as well as the RT-X model family. Datasets in this format are episodes of typed steps, encoded as TFRecord shards, with a typed features dictionary that the tensorflow_datasets library consumes natively.

For an ML team, RLDS matters in two situations. First, when the goal is to benchmark a new policy against published Open X-Embodiment or RT-X results, the dataset must share the action space conventions and the observation keys used in that work. Second, when the existing training infrastructure is built around TensorFlow and TFDS data loaders, an RLDS dataset slots into that stack without a format conversion step. EgoVista produces datasets that satisfy both situations from a single source of raw annotations.

RLDS specification overview.

An RLDS dataset is a collection of episodes. Each episode is a sequence of steps, and each step is a typed dictionary with a stable schema. The encoding lives in TFRecord shards, and the schema is described separately so a data loader can deserialize without scanning the data.

RLDS TFRecord episode structure and features dict.

The concrete pieces of an RLDS dataset on disk:

TFRecord shards hold the serialized steps. Each shard typically contains one or more full episodes, and shards are sized to load in parallel by a standard tf.data pipeline.
features.json describes the step schema: dtypes and shapes for every key in the observation, action, reward, and metadata dictionaries. The schema is versioned, so a re-export under a newer schema does not break older shards.
dataset_info.json describes the dataset itself: number of episodes, total steps, recommended train and validation splits, metadata about the capture and annotation pipeline.
tfds_builder.py is the registration code that exposes the dataset to tfds.load. It declares the dataset name, version, and features dictionary, and points to the TFRecord shards.
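
As a quick sanity check on a delivered folder, the file list above can be turned into a small layout test. This is a minimal sketch, not part of the EgoVista tooling: the three file names come from the list above, while the shard glob pattern and the function name `check_layout` are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("features.json", "dataset_info.json", "tfds_builder.py")
# Assumed shard naming; adjust to match the delivered shard file names.
SHARD_GLOB = "*tfrecord*"


def check_layout(dataset_dir):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(dataset_dir)
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (root / name).is_file()]
    if not any(root.glob(SHARD_GLOB)):
        problems.append("no TFRecord shards found")
    return problems
```

Calling `check_layout("/path/to/dataset")` before wiring up a loader catches an incomplete download early, without reading any shard.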

The RLDS step structure follows a fixed convention: every step carries the is_first, is_last, and is_terminal flags, plus the observation, action, reward, and discount fields. EgoVista datasets do not produce reward or discount signals for human demonstrations, so those fields are filled with zero scalars, which keeps the schema valid and marks the signal as absent.
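
The step convention can be sketched in plain Python. The helper below is illustrative only: `build_steps` is a hypothetical name, the steps are ordinary dicts rather than serialized TFRecord examples, and marking the final step as terminal is an assumption exposed as a parameter.

```python
def build_steps(observations, actions, terminal=True):
    """Wrap per-frame observations and actions into RLDS-style step dicts."""
    if len(observations) != len(actions) or not observations:
        raise ValueError("need equally many observations and actions, at least one")
    last = len(observations) - 1
    steps = []
    for i, (obs, act) in enumerate(zip(observations, actions)):
        steps.append({
            "observation": obs,
            "action": act,
            "reward": 0.0,    # placeholder: no reward signal for human demonstrations
            "discount": 0.0,  # placeholder, as described above
            "is_first": i == 0,
            "is_last": i == last,
            # Assumption: the final step is also terminal; pass terminal=False otherwise.
            "is_terminal": terminal and i == last,
        })
    return steps
```

Exactly one step per episode has is_first set and exactly one has is_last set, which is what lets a generic episode iterator work without special cases.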

Standard features for manipulation datasets.

The EgoVista 9-layer annotation stack is mapped into the RLDS features dictionary with stable keys. Observation keys follow Open X-Embodiment naming where one exists, and we add extra keys for the layers OXE does not standardize.

Default features in an EgoVista RLDS dataset:

observation.image_primary: egocentric RGB frame, encoded as a uint8 tensor with shape (H, W, 3).
observation.depth_map: per-frame depth from a monocular depth model.
observation.hand_pose_2d: 21 keypoints per hand in normalized image coordinates.
observation.hand_pose_3d: 21 keypoints per hand in 3D, lifted from 2D pose and depth.
observation.segmentation_mask: 5-class hand-object segmentation.
action.contact_phase: integer encoding of contact transitions per step.
action.language_instruction: natural language action label string.
reward, discount, is_first, is_last, is_terminal: standard RLDS bookkeeping, with reward and discount present as zero placeholders.
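
To make the relationship between hand_pose_2d, depth_map, and hand_pose_3d concrete, here is one common way to lift 2D keypoints with a depth map: pinhole back-projection. This is a sketch under stated assumptions, not a description of the EgoVista pipeline. It assumes metric depth, a nearest-pixel depth lookup, normalized coordinates that map 0 and 1 to the first and last pixel, and intrinsics passed as `fx`, `fy`, `cx`, `cy` (hypothetical parameter names).

```python
import numpy as np


def lift_keypoints(kp_norm, depth, fx, fy, cx, cy):
    """Back-project normalized 2D keypoints to camera-frame 3D points.

    kp_norm: (N, 2) array of (u, v) in [0, 1].
    depth:   (H, W) array, assumed to be metric depth.
    fx, fy, cx, cy: pinhole intrinsics in pixels (assumed inputs).
    """
    h, w = depth.shape
    px = kp_norm[:, 0] * (w - 1)  # normalized -> pixel column
    py = kp_norm[:, 1] * (h - 1)  # normalized -> pixel row
    cols = np.clip(np.rint(px).astype(int), 0, w - 1)
    rows = np.clip(np.rint(py).astype(int), 0, h - 1)
    z = depth[rows, cols]         # nearest-pixel depth lookup
    x = (px - cx) * z / fx
    y = (py - cy) * z / fy
    return np.stack([x, y, z], axis=-1).astype(np.float32)
```

For one hand, `kp_norm` would be the 21 keypoints of observation.hand_pose_2d and the result a (21, 3) float32 array in the camera frame.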

Loading an EgoVista RLDS dataset:

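import tensorflow_datasets as tfds

# The dataset's tfds_builder.py must already be imported so the name is registered.
ds = tfds.load("egovista/manipulation_egocentric", split="train")
for episode in ds.take(1):
    # Each episode carries a nested dataset of steps.
    for step in episode["steps"]:
        # String features come back as bytes; decode them for display.
        print(step["action"]["language_instruction"].numpy().decode("utf-8"))
        print(step["observation"]["hand_pose_3d"].shape)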

Open X-Embodiment and RT-X integration.

Open X-Embodiment is a multi-institution effort to combine robotics datasets from many labs and many robots into one corpus, with a unified schema that supports cross-embodiment training. RT-X is the family of policies trained on that corpus. EgoVista datasets are produced with the OXE conventions in mind so they can be mixed with the public OXE corpus during training.

What this means concretely:

Observation keys follow the OXE naming standard where applicable, so a data loader that handles OXE handles EgoVista as well.
The action space is left as the raw human signal (hand pose, contact transitions, action language) rather than a robot-specific action. This keeps the dataset usable as a source for retargeting to many robots, which is exactly what cross-embodiment training requires.
Episode boundaries and the is_first / is_last flags are set consistently with OXE so an episode iterator does not need special casing.
When a custom mapping into a specific OXE-compatible action space is needed, we produce it as a derived feature alongside the canonical one, so ablations can use either.

The mapping decisions, especially the action space ones, are documented in the dataset card so an auditor or a reviewer can reproduce them.
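
The point about episode boundaries can be illustrated with a generic splitter that relies only on the is_first and is_last flags. This is a simplified sketch: RLDS stores steps nested inside episodes, so the flat stream of step dicts used here, and the name `split_episodes`, are assumptions for illustration.

```python
def split_episodes(steps):
    """Group a flat iterable of step dicts into episodes using is_first / is_last."""
    episodes, current = [], None
    for step in steps:
        if step["is_first"]:
            if current is not None:
                raise ValueError("is_first seen before the previous episode ended")
            current = []
        if current is None:
            raise ValueError("step seen outside an episode")
        current.append(step)
        if step["is_last"]:
            episodes.append(current)
            current = None
    if current is not None:
        raise ValueError("stream ended mid-episode")
    return episodes
```

Because the flags carry all the boundary information, the same function handles steps from any source that sets them consistently, including single-step episodes where both flags are true.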

From raw capture to RLDS TFRecord.

The path from a raw egocentric video to a delivered RLDS dataset is a sequence of deterministic stages. Every stage is logged so any TFRecord shard can be traced back to the exact source video and the exact model versions that produced its annotations.

Ingest the contributor video into EU-region object storage.
Apply face anonymization at source. The anonymized version is the only one that ever leaves the contributor zone for downstream processing.
Run the 9 annotation layers on EU compute: 2D and 3D hand pose, depth, hand-object segmentation, contact timing, action labels via a vision-language model in the EU region, plus camera intrinsics and metadata.
Map annotations into the RLDS features dictionary using the canonical schema, plus any client-specific overrides.
Encode steps into TFRecord shards, typically between 50 and 100 MB each, with one or more full episodes per shard.
Generate features.json, dataset_info.json and tfds_builder.py, then validate the dataset by running a smoke tfds.load on a sampled shard.
Deliver via signed URL with the engagement-specific retention policy.
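
The sharding stage can be sketched as a greedy packing of whole episodes. This is an illustration of the "one or more full episodes per shard" rule, not the production encoder: `pack_episodes` is a hypothetical name, and the default target uses the 100 MB upper bound mentioned above.

```python
def pack_episodes(episode_sizes, target_bytes=100 * 1024 * 1024):
    """Greedily assign whole episodes to shards without splitting any episode.

    episode_sizes: serialized size in bytes of each episode, in delivery order.
    Returns a list of shards, each a list of episode indices.
    An episode larger than target_bytes gets a shard of its own.
    """
    shards, current, current_bytes = [], [], 0
    for index, size in enumerate(episode_sizes):
        # Close the current shard if adding this episode would exceed the target.
        if current and current_bytes + size > target_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        shards.append(current)
    return shards
```

Keeping episodes in order and never splitting them means a loader can treat every shard as a self-contained list of episodes.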

All processing operates on EU infrastructure. See the GDPR compliance details for the legal basis attached to each operation.

Choosing between LeRobot and RLDS.

Both formats are well designed, both are actively maintained, and a single EgoVista capture can be exported to either. The pick depends on the rest of your stack. LeRobot is more recent and uses Parquet, which fits well with modern Python data tooling and the Hugging Face Hub flow. The visualization tools and the training scripts in the LeRobot repository speed up iteration for teams comfortable in that ecosystem.

RLDS is older, more conservative, and built around TFRecord and tensorflow_datasets. It remains the right choice for teams that benchmark against Open X-Embodiment or RT-X, that already operate a TensorFlow data pipeline at scale, or that want their dataset to interoperate with the broader OXE corpus during multi-embodiment training. See the LeRobot v3.0 format page for the matching breakdown.

RLDS format frequently asked questions.

Do you produce datasets compatible with the RT-X benchmark?

Yes. RT-X compatibility means following the Open X-Embodiment conventions: a consistent action space, a fixed set of observation keys, and the standard RLDS step structure. EgoVista datasets are produced with those conventions in mind, and we document the mapping in the dataset card so a reviewer can audit how each feature was derived. If your team needs the dataset to plug directly into the RT-X benchmark harness, we can deliver under the exact feature names expected by that harness without re-collecting data.

How do you handle action space mapping for cross-embodiment training?

Action space mapping is the hard part of cross-embodiment work. We deliver the raw human signal (hand pose in 3D, contact transitions, action language) and let your retargeting layer translate that into your robot action space. We do not pretend to know your kinematics, so we do not produce robot-specific actions. For teams that want a pre-mapped action, we can deliver against a custom action schema once you describe the target embodiment, with the original human signal preserved alongside for ablations.

Can you deliver both LeRobot and RLDS for the same dataset?

Yes. The annotations are produced once from the raw source, then re-packed to either format on demand. A team can start with one format for prototyping and add the other for production training. There is a re-pack delay, typically a couple of business days, but no re-collection of video or re-running of the annotation pipeline. Both exports share the same canonical schema version, so episode-by-episode parity is preserved.

What is the typical TFRecord shard size?

TFRecord shards are sized for parallel loading on standard ML clusters, typically between 50 and 100 MB per shard. Each shard contains one or more full episodes, never a partial episode, which keeps the dataloader logic simple. A `dataset_info.json` file accompanies the shards with the total number of episodes, per-feature shapes and dtypes, and the recommended train/validation split. You can re-shard on your side without re-encoding if your training infrastructure needs a different size.

Do you provide tfds metadata builder code?

Yes. Every RLDS dataset ships with a `tfds_builder.py` that registers the dataset under a stable name, declares the features dictionary, and points to the TFRecord shards. TFDS registers a builder when its module is imported, so once the file is on the Python path and has been imported, `tfds.load("egovista/<dataset_name>")` works without any further configuration. The builder is generated at build time from the dataset spec, so it always matches the actual schema of the delivered files.

How are continuous and discrete actions encoded?

Continuous signals like hand pose coordinates are encoded as float32 tensors with their natural shape, no quantization unless explicitly requested. Discrete events like contact phase transitions are encoded as int8 scalars with documented value semantics. Action language descriptions are encoded as UTF-8 strings, and a confidence score is provided alongside as a float32 scalar. The features dictionary in the RLDS metadata documents every encoding choice so your data loader knows exactly what to expect.
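
The encoding rules above can be sketched with NumPy dtypes. Everything named here is illustrative: the `ContactPhase` values, the `language_confidence` key, and the `encode_fields` helper are hypothetical, and the delivered features dictionary remains the authoritative source for value semantics and key names.

```python
from enum import IntEnum

import numpy as np


class ContactPhase(IntEnum):
    """Hypothetical value semantics, for illustration only."""
    NO_CONTACT = 0
    APPROACH = 1
    CONTACT = 2
    RELEASE = 3


def encode_fields(hand_pose_3d, phase, instruction, confidence):
    """Encode one step's signals with the dtypes described above."""
    return {
        # Continuous signal: float32, natural shape, no quantization.
        "hand_pose_3d": np.asarray(hand_pose_3d, dtype=np.float32),
        # Discrete event: int8 scalar.
        "contact_phase": np.int8(phase),
        # Action language: UTF-8 encoded bytes.
        "language_instruction": instruction.encode("utf-8"),
        # Confidence score: float32 scalar (assumed key name).
        "language_confidence": np.float32(confidence),
    }
```

A loader-side check can then compare each field's dtype and shape against the features dictionary before training starts.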

Request an RLDS sample.

Tell us the task and the embodiment you target. We can ship a 10 to 20 episode RLDS dataset so your team can validate the shard layout, the features dictionary, and the tfds.load integration before any scale-up. For related material, see the LeRobot v3.0 export, the full product page, or the GDPR pipeline details.
