AirObject: A Temporally Evolving Graph Embedding for Object Identification

Published: by
Nikhil Varma Keetha

Video Object Identification

Object encoding and identification are vital for robotic tasks such as autonomous exploration, semantic scene understanding, and re-localization. Previous approaches have attempted to either track objects or generate descriptors for object identification. However, such systems are limited to a “fixed” object representation from a single viewpoint and are not robust to severe occlusion, viewpoint shift, perceptual aliasing, or scale transform. These single frame representations tend to lead to false correspondences amongst perceptually-aliased objects, especially when severely occluded. Hence, we propose one of the first temporal object encoding methods, AirObject, that aggregates the temporally “evolving” object structure as the camera or object moves. The AirObject descriptors, which accumulate knowledge across multiple evolving representations of the objects, are robust to severe occlusion, viewpoint changes, deformation, perceptual aliasing, and the scale transform.

Matching Temporally Evolving Representations using AirObject

Topological Object Representations

Intuitively, a group of feature points on an object form a graphical representation where the feature points are nodes and their associated descriptors are the node features. Essentially, the graph’s nodes are the distinctive local features of the object, while the edges/structure of the graph represents the global structure of the object. Hence, to learn the geometric relationship of the feature points, we construct topological object graphs for each frame using Delaunay triangulation. The obtained triangular mesh representation enables the graph attention encoder to reason better about the object’s structure, thereby making the final temporal object descriptor robust to deformation or occlusion.

Topological Graph Representations of Objects. These representations are generated by using Delaunay Triangulation on object-wise grouped SuperPoint keypoints.

Simple & Effective Temporal Object Encoding Method

Our proposed method is very simple and only contains three modules. Specifically, we use extracted deep learned keypoint features across multiple frames to form sequences of object-wise topological graph neural networks (GNNs), which on embedding generate temporal object descriptors. We employ a graph attention-based sparse encoding method on these topological GNNs to generate content graph features and location graph features representing the structural information of the object. Then, these graph features are aggregated across multiple frames using a single-layer temporal convolutional network to generate a temporal object descriptor. These generated object descriptors are robust to severe occlusion, perceptual aliasing, viewpoint shift, deformation, and scale transform, outperforming the state-of-the-art single-frame and sequential descriptors.

Video Object Identification



 title={AirObject: A Temporally Evolving Graph Embedding for Object Identification},
 author={Keetha, Nikhil Varma and Wang, Chen and Qiu, Yuheng and Xu, Kuan and Scherer, Sebastian}, 
 booktitle={2022 Conference on Computer Vision and Pattern Recognition (CVPR)},

Please refer to our Paper and Code for details.