This paper introduces a self-supervised learning framework for pre-training neural networks on dense prediction tasks with event camera data. Our approach trains on event data alone.
Directly transferring methods from dense RGB pre-training to event camera data yields subpar performance. This is due to the spatial sparsity of an event image (converted from event data), where many pixels carry no information. To mitigate this sparsity issue, we encode an event image into event patch features, automatically mine contextual similarity relationships among patches, group the patch features into distinctive contexts, and enforce context-to-context similarities to learn discriminative event features.
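As a rough illustration of the pipeline above, the sketch below mines patch-to-patch similarities, groups patches into a few contexts, and compares the pooled context features across two views of the same event image. The function name, the random-anchor grouping, the number of contexts, and the plain cosine objective are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def context_similarity_loss(patch_feats_a, patch_feats_b, num_contexts=8):
    """patch_feats_*: (N, D) patch features from two views of the same event image."""
    # 1) Mine contextual similarity among the patches of view A.
    feats_a = F.normalize(patch_feats_a, dim=-1)      # (N, D)
    sim = feats_a @ feats_a.t()                       # (N, N) cosine similarities

    # 2) Group patches into contexts: a simple hard assignment to a few
    #    randomly chosen anchor patches (illustrative stand-in for the
    #    paper's grouping step).
    anchors = torch.randperm(feats_a.size(0))[:num_contexts]
    assign = sim[:, anchors].argmax(dim=1)            # (N,) context id per patch

    # 3) Pool the patch features of each view into context features,
    #    skipping contexts that received no patches.
    ctx_a, ctx_b = [], []
    for k in range(num_contexts):
        mask = assign == k
        if mask.any():
            ctx_a.append(patch_feats_a[mask].mean(dim=0))
            ctx_b.append(patch_feats_b[mask].mean(dim=0))
    ctx_a, ctx_b = torch.stack(ctx_a), torch.stack(ctx_b)

    # 4) Enforce context-to-context similarity across the two views.
    return 1.0 - F.cosine_similarity(ctx_a, ctx_b, dim=-1).mean()

In use, the two inputs would be patch features produced by the same encoder from two augmented views, e.g. loss = context_similarity_loss(encoder(view_a), encoder(view_b)).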
For training our framework, we curate a synthetic event camera dataset featuring diverse scene and motion patterns.
Transfer learning performance on downstream dense prediction tasks demonstrates the superiority of our method over state-of-the-art approaches.
To pre-train our network, we synthesize the E-TartanAir event camera dataset from the TartanAir dataset. TartanAir is collected in photo-realistic simulation environments featuring various lighting conditions, weather, and moving objects. It contains 1037 sequences with RGB frames at 480 × 640 resolution.
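The exact event simulator used to build E-TartanAir is not described here; as a rough illustration of how events can be synthesized from video, the sketch below thresholds log-intensity changes between consecutive frames. The function name, threshold value, and single-timestamp simplification are assumptions for exposition only.

import numpy as np

def frames_to_events(prev_gray, curr_gray, t_prev, t_curr, threshold=0.2):
    """Return (x, y, t, polarity) arrays for a pair of grayscale frames in [0, 1]."""
    eps = 1e-3
    diff = np.log(curr_gray + eps) - np.log(prev_gray + eps)

    # Fire an event wherever the log-intensity change exceeds the contrast threshold.
    ys, xs = np.nonzero(np.abs(diff) >= threshold)
    polarity = np.where(diff[ys, xs] > 0, 1, -1)
    # Assign one timestamp per frame pair; real simulators spread events
    # over (t_prev, t_curr] by interpolating intensities.
    ts = np.full(xs.shape, t_curr, dtype=np.float64)
    return xs, ys, ts, polarity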
@misc{yang2024eventcameradatadense,
  title={Event Camera Data Dense Pre-training},
  author={Yan Yang and Liyuan Pan and Liu Liu},
  year={2024},
  eprint={2311.11533},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2311.11533},
}
If you use the E-TartanAir dataset, please also cite the TartanAir paper.

@inproceedings{tartanair2020iros,
  title={TartanAir: A Dataset to Push the Limits of Visual SLAM},
  author={Wang, Wenshan and Zhu, Delong and Wang, Xiangwei and Hu, Yaoyu and Qiu, Yuheng and Wang, Chen and Hu, Yafei and Kapoor, Ashish and Scherer, Sebastian},
  booktitle={2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year={2020}
}