Learning Visual Commonsense for Robust Scene Graph Generation

30 citations · 117 Mendeley readers

Abstract

Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild. Perception errors often lead to nonsensical compositions in the output scene graph, which do not follow real-world rules and patterns and can be corrected using commonsense knowledge. We propose the first method to acquire visual commonsense, such as affordance and intuitive physics, automatically from data, and to use it to improve the robustness of scene understanding. To this end, we extend Transformer models to incorporate the structure of scene graphs and train our Global-Local Attention Transformer on a scene graph corpus. Once trained, our model can be applied to any scene graph generation model to correct its obvious mistakes, resulting in more semantically plausible scene graphs. Through extensive experiments, we show that our model learns commonsense better than any alternative and improves the accuracy of state-of-the-art scene graph generation methods.
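To make the idea concrete, the following is a minimal, hypothetical sketch of the approach described in the abstract: a transformer trained on a scene graph corpus with a masked-label objective, then used to re-score and correct implausible labels produced by any scene graph generator. It is not the authors' Global-Local Attention Transformer (which incorporates graph structure via global and local attention); class names, dimensions, and the plain `nn.TransformerEncoder` backbone here are assumptions for illustration only.

```python
# Hypothetical sketch (assumed names and architecture), not the paper's GLAT code.
import torch
import torch.nn as nn

class SceneGraphRefiner(nn.Module):
    """Transformer over scene-graph tokens (object and predicate labels)."""

    def __init__(self, num_obj_classes, num_pred_classes,
                 d_model=256, nhead=8, num_layers=4):
        super().__init__()
        vocab = num_obj_classes + num_pred_classes + 1  # +1 for a [MASK] token
        self.mask_id = vocab - 1
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq) integer label ids for the nodes of a scene graph
        h = self.encoder(self.embed(tokens))
        return self.head(h)  # per-node logits over the label vocabulary

# Training (masked label modeling on a scene graph corpus, BERT-style):
# randomly replace some node labels with [MASK] and train the model to recover
# them, so it learns which compositions (e.g., "person rides horse") are plausible.
# Inference-time correction: mask low-confidence nodes in a generator's predicted
# scene graph and substitute the refiner's highest-scoring labels.
```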

Citation (APA)

Zareian, A., Wang, Z., You, H., & Chang, S. F. (2020). Learning visual commonsense for robust scene graph generation. In Lecture Notes in Computer Science (Vol. 12368 LNCS, pp. 642–657). Springer. https://doi.org/10.1007/978-3-030-58592-1_38