
Awesome Egocentric Vision

A curated list of egocentric vision resources.

Egocentric (first-person) vision is a sub-field of computer vision that analyses image and video data captured by a wearable camera, which approximates the wearer's visual field.

Contents

Papers

Clustered in various problem statements.

Action/Activity Recognition

Object/Hand Recognition

Action/Gaze Anticipation

Localization

Clustering

Video Summarization

Social Interactions

Pose Estimation

Human Object Interaction

Temporal Boundary Detection

Privacy in Egocentric Videos

Multiple Egocentric Tasks

  • A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives - Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro, and Giuseppe Averta. In CVPR 2024.

  • Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives - Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, and Michael Wray. In CVPR 2024. [project page]

  • EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone - Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. In ICCV 2023. [project page] [code]

  • Egocentric Video-Language Pretraining - Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu and Mike Zheng Shou. In NeurIPS 2022. [project page] [code]

  • Ego4D: Around the World in 3,000 Hours of Egocentric Video - Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Christian Fuegen, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Bernard Ghanem, Vamsi Krishna Ithapu, C.V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, and Jitendra Malik. In CVPR 2022. [Github] [project page] [video]

Task Understanding

Miscellaneous (New Tasks)

Clustered according to the conferences.

CVPR

ECCV

ICCV

WACV

BMVC

Datasets

  • HOT3D - HOT3D is a dataset for benchmarking egocentric tracking of hands and objects in 3D. The dataset includes 833 minutes of multi-view image streams, which show 19 subjects interacting with 33 diverse rigid objects and are annotated with accurate 3D poses and shapes of hands and objects.
  • Nymeria - A dataset of human motion in the wild, capturing diverse people engaging in diverse activities across diverse locations. Body motion is recorded using multiple egocentric multimodal devices, all accurately synchronized and localized in a single metric 3D world. Includes 300 hours of daily activity, 3600 hours of video data, 1200 sequences, 264 participants, and 50 indoor and outdoor locations.
  • Ego-Exo4D - 1,286 hours of multi-modal multiview videos recorded by 740 participants from 13 cities worldwide performing different skilled human activities (e.g., sports, music, dance, bike repair).
  • Aria Digital Twin - A comprehensive egocentric dataset containing 200 sequences of real-world activities conducted by Aria wearers in two real indoor scenes with 398 object instances (324 stationary and 74 dynamic).
  • HoloAssist - A large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks.
  • EgoProceL - 62 hours of egocentric videos recorded by 130 subjects performing 16 tasks for procedure learning.
  • EgoBody - Large-scale dataset capturing ground-truth 3D human motions during social interactions in 3D scenes.
  • UnrealEgo - Large-scale naturalistic dataset for egocentric 3D human pose estimation.
  • Hand-object Segments - Hand-object interactions in 11,235 frames from 1,000 videos covering daily activities in diverse scenarios.
  • Ego4D - 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries.
  • HOI4D - HOI4D consists of 2.4M RGB-D egocentric video frames over 4000 sequences collected by 9 participants interacting with 800 different object instances from 16 categories over 610 different indoor rooms.
  • EgoCom - A natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives.
  • TREK-100 - Object tracking in first person vision.
  • MECCANO - 20 subjects assembling a toy motorbike.
  • EPIC-Kitchens 2020 - Subjects performing unscripted actions in their native environments.
  • EPIC-Tent - 29 participants assembling a tent while wearing two head-mounted cameras. [paper]
  • EGO-CH - 70 subjects visiting two cultural sites in Sicily, Italy.
  • EPIC-Kitchens 2018 - 32 subjects performing unscripted actions in their native environments.
  • Charade-Ego - Paired first-third person videos.
  • EGTEA Gaze+ - 32 subjects, 86 cooking sessions, 28 hours.
  • ADL - 20 subjects performing daily activities in their native environments.
  • CMU kitchen - Multimodal, 18 subjects cooking 5 different recipes: brownies, eggs, pizza, salad, sandwich.
  • EgoSeg - Long-term actions (walking, running, driving, etc.).
  • First-Person Social Interactions - 8 subjects at Disney World.
  • UEC Dataset - Two choreographed datasets with different ego-actions (walk, jump, climb, etc.) plus 6 YouTube sports videos.
  • JPL - Interaction with a robot.
  • FPPA - Five subjects performing 5 daily actions.
  • UT Egocentric - 3-5 hour long videos capturing a person's day.
  • VINST / Visual Diaries - 31 videos capturing the visual experience of a subject walking from a metro station to work.
  • Bristol Egocentric Object Interaction (BEOID) - 8 subjects across 6 locations, interacting with objects and the environment.
  • Object Search Dataset - 57 sequences of 55 subjects on search and retrieval tasks.
  • UNICT-VEDI - Different subjects visiting a museum.
  • UNICT-VEDI-POI - Different subjects visiting a museum.
  • Simulated Egocentric Navigations - Simulated navigations of a virtual agent within a large building.
  • EgoCart - Egocentric images collected by a shopping cart in a retail store.
  • Unsupervised Segmentation of Daily Living Activities - Egocentric videos of daily activities.
  • Visual Market Basket Analysis - Egocentric images collected by a shopping cart in a retail store.
  • Location Based Segmentation of Egocentric Videos - Egocentric videos of daily activities.
  • Recognition of Personal Locations from Egocentric Videos - Egocentric video clips of daily activities.
  • EgoGesture - 2k videos from 50 subjects performing 83 gestures.
  • EgoHands - 48 videos of interactions between two people.
  • DoMSEV - 80 hours of videos covering different activities.
  • DR(eye)VE - 74 videos of people driving.
  • THU-READ - 8 subjects performing 40 actions with a head-mounted RGBD camera.
  • EgoDexter - 4 sequences with 4 actors (2 female), with varying interactions with various objects against cluttered backgrounds. [paper]
  • First-Person Hand Action (FPHA) - 3D hand-object interaction. Includes 1175 videos belonging to 45 different activity categories performed by 6 actors. [paper]
  • UTokyo Paired Ego-Video (PEV) - 1,226 pairs of first-person clips extracted from videos recorded synchronously during dyadic conversations.
  • UTokyo Ego-Surf - Contains 8 diverse groups of first-person videos recorded synchronously during face-to-face conversations.
  • TEgO: Teachable Egocentric Objects Dataset - Contains egocentric images of 19 distinct objects taken by two people for training a teachable object recognizer.
  • Multimodal Focused Interaction Dataset - Contains 377 minutes of continuous multimodal recording captured during 19 sessions, with 17 conversational partners in 18 different indoor/outdoor locations.

Contribute

This is a work in progress. Contributions welcome! Read the contribution guidelines first.