The PhilaVerse

Meta introduces HOT3D: A dataset for advancing hand-object interaction research

A comprehensive 3D video dataset designed to support progress in robotics, AR/VR, and computer vision applications

Phil Siarri
Jan 03, 2025

Image of hands and a cat face. Image credit: Microsoft Copilot and Canva.

Meta Reality Labs has introduced HOT3D, a publicly available dataset designed to enhance machine learning research on hand-object interactions.

This dataset comprises over 833 minutes of multi-view egocentric 3D video streams, captured using Meta's Project Aria glasses and Quest 3 VR headset. It includes 3.7 million annotated images, offering high-quality data on 19 subjects interacting with 33 diverse objects in real-world tasks.

Key features of HOT3D include:

  • Multi-modal data: RGB/monochrome image streams, eye gaze tracking, and 3D point clouds.

  • Comprehensive annotations: 3D poses of objects, hands, and cameras; 3D models of hands and objects (see the data-layout sketch after this list).

  • Real-world scenarios: Demonstrations range from basic object manipulation to complex activities like typing or using kitchen utensils.
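
To give a more concrete picture of how these pieces fit together, here is a minimal, hypothetical sketch of a per-frame record combining image streams, camera and object poses, hand poses, eye gaze, and point clouds. The class and field names are illustrative assumptions, not the official HOT3D toolkit API.

```python
# Hypothetical layout of one HOT3D-style frame record, for illustration only.
# Field names are assumptions; they are not the official HOT3D toolkit API.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Pose6DoF:
    rotation: np.ndarray        # 3x3 rotation matrix
    translation: np.ndarray     # 3-vector, in meters

@dataclass
class FrameRecord:
    timestamp_ns: int                         # capture time of this frame
    images: dict                              # stream name -> HxW(xC) RGB/monochrome image
    camera_poses: dict                        # stream name -> Pose6DoF of that camera
    hand_poses: dict                          # "left"/"right" -> pose parameters (UmeTrack or MANO)
    object_poses: dict                        # object id -> Pose6DoF in the world frame
    eye_gaze: Optional[np.ndarray] = None     # gaze direction (Aria recordings)
    point_cloud: Optional[np.ndarray] = None  # Nx3 SLAM points (Aria recordings)
```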

The ground-truth poses were captured with a professional motion-capture system, and hand annotations are provided in two widely used formats, UmeTrack and MANO. In initial experiments, models trained on HOT3D's multi-view data outperformed those trained on single-view data on tasks such as 3D hand tracking, 6DoF object pose estimation, and 3D lifting of objects.
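
To make "6DoF object pose estimation" concrete: a 6DoF pose is a rotation plus a translation, and together with a camera pose and intrinsics it lets you project a point on an annotated object model into an egocentric image. The sketch below uses made-up poses and intrinsics, not values from the dataset.

```python
import numpy as np

def project_point(point_world, cam_rotation, cam_translation, K):
    """Project a 3D world point into pixel coordinates with a pinhole camera model.

    cam_rotation (3x3) and cam_translation (3,) map world coordinates into the
    camera frame; K is the 3x3 intrinsic matrix. All values here are illustrative.
    """
    p_cam = cam_rotation @ point_world + cam_translation   # world -> camera frame
    p_img = K @ p_cam                                       # camera -> homogeneous image coords
    return p_img[:2] / p_img[2]                             # perspective divide -> pixels

# Hypothetical 6DoF object pose: identity rotation, 0.5 m in front of the camera.
obj_rotation = np.eye(3)
obj_translation = np.array([0.0, 0.0, 0.5])
corner_local = np.array([0.05, 0.02, 0.0])                   # a point on the object model (meters)
corner_world = obj_rotation @ corner_local + obj_translation # object frame -> world frame

K = np.array([[600.0,   0.0, 320.0],                         # made-up camera intrinsics
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
print(project_point(corner_world, np.eye(3), np.zeros(3), K))  # -> [380. 264.]
```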

Released as an open dataset, HOT3D aims to drive innovation in robotics, AR/VR systems, and human-machine interfaces by providing a robust foundation for computer vision and machine learning research.

HOT3D overview: The dataset features multi-view egocentric image streams captured using the Aria glasses and Quest 3 headset, annotated with precise 3D poses and models of hands and objects. On the left, three multi-view frames from Aria display contours of 3D models for hands (white) and objects (green) in their ground-truth poses. Additionally, Aria provides 3D point clouds generated by SLAM and includes eye gaze tracking data (right). Credit: Banerjee et al.

Read the scientific paper

More news!
