Open Access

Wuhan Univ. J. Nat. Sci., Volume 30, Number 5, October 2025

- Page(s): 405-426
- DOI: https://doi.org/10.1051/wujns/2025305405
- Published online: 04 November 2025
- Wirtz J, Hofmeister J, Chew P Y P, et al. Digital service technologies, service robots, AI, and the strategic pathways to cost-effective service excellence[J]. The Service Industries Journal, 2023, 43(15/16): 1173-1196.
- Zhu Y K, Mottaghi R, Kolve E, et al. Target-driven visual navigation in indoor scenes using deep reinforcement learning[C]//2017 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2017: 3357-3364.
- Li F, Guo C, Luo B H, et al. Multi goals and multi scenes visual mapless navigation in indoor using meta-learning and scene priors[J]. Neurocomputing, 2021, 449: 368-377.
- Li F, Guo C, Zhang H Y, et al. Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning[J]. Complex & Intelligent Systems, 2023, 9(2): 2031-2041.
- Sun J W, Wu J, Ji Z, et al. A survey of object goal navigation[J]. IEEE Transactions on Automation Science and Engineering, 2025, 22: 2292-2308.
- Li C S, Zhang R H, Wong J, et al. BEHAVIOR-1K: A benchmark for embodied AI with 1 000 everyday activities and realistic simulation[C]//Conference on Robot Learning. New York: PMLR, 2022: 80-93.
- Shadbolt N, Berners-Lee T, Hall W. IFIP International Conference on E-Business, E-Services, and E-Society[M]. Cham: Springer-Verlag, 2003.
- Ji S X, Pan S R, Cambria E, et al. A survey on knowledge graphs: Representation, acquisition, and applications[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(2): 494-514.
- Guo C, Luo B, Li F, et al. Review and verification for brain-like navigation algorithm[J]. Geomatics and Information Science of Wuhan University, 2021, 46(12): 1819-1831.
- Epstein R A, Patai E Z, Julian J B, et al. The cognitive map in humans: Spatial navigation and beyond[J]. Nature Neuroscience, 2017, 20(11): 1504-1513.
- Sosa M, Giocomo L M. Navigating for reward[J]. Nature Reviews Neuroscience, 2021, 22(8): 472-487.
- Ambrose R E, Pfeiffer B E, Foster D J. Reverse replay of hippocampal place cells is uniquely modulated by changing reward[J]. Neuron, 2016, 91(5): 1124-1136.
- Bhattarai B, Lee J W, Jung M W. Distinct effects of reward and navigation history on hippocampal forward and reverse replays[J]. Proceedings of the National Academy of Sciences of the United States of America, 2020, 117(1): 689-697.
- Banino A, Barry C, Uria B, et al. Vector-based navigation using grid-like representations in artificial agents[J]. Nature, 2018, 557: 429-433.
- Rummery G A, Niranjan M. On-line Q-learning Using Connectionist Systems[M]. Cambridge: University of Cambridge, 1994.
- Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with deep reinforcement learning[EB/OL]. [2013-12-19]. http://arxiv.org/abs/1312.5602.
- Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3): 229-256.
- Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[EB/OL]. [2016-06-16]. http://arxiv.org/abs/1602.01783.
- Devo A, Mezzetti G, Costante G, et al. Towards generalization in target-driven visual navigation by using deep reinforcement learning[J]. IEEE Transactions on Robotics, 2020, 36(5): 1546-1561.
- Gupta S, Tolani V, Davidson J, et al. Cognitive mapping and planning for visual navigation[J]. International Journal of Computer Vision, 2020, 128(5): 1311-1330.
- Mousavian A, Toshev A, Fišer M, et al. Visual representations for semantic target driven navigation[C]//2019 International Conference on Robotics and Automation (ICRA). New York: IEEE, 2019: 8846-8852.
- Anderson P, Chang A, Chaplot D S, et al. On evaluation of embodied navigation agents[EB/OL]. [2018-07-18]. http://arxiv.org/abs/1807.06757.
- He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 770-778.
- Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
- Yang W, Wang X L, Farhadi A, et al. Visual semantic navigation using scene priors[EB/OL]. [2018-10-15]. http://arxiv.org/abs/1810.06543.
- Krishna R, Zhu Y K, Groth O, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32-73.
- Du H M, Yu X, Zheng L. Learning object relation graph and tentative policy for visual navigation[C]//European Conference on Computer Vision. Cham: Springer, 2020: 19-34.
- Ren S Q, He K M, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
- Bochkovskiy A, Wang C Y, Liao H Y M. YOLOv4: Optimal speed and accuracy of object detection[EB/OL]. [2020-04-23]. http://arxiv.org/abs/2004.10934.
- Yang J W, Lu J S, Lee S, et al. Graph R-CNN for scene graph generation[C]//European Conference on Computer Vision. Cham: Springer-Verlag, 2018: 690-706.
- Dang R H, Shi Z F, Wang L Y, et al. Unbiased directed object attention graph for object navigation[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3617-3627.
- Hu X B, Lin Y F, Wang S, et al. Agent-centric relation graph for object visual navigation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(2): 1295-1309.
- Li W, Song X, Bai Y, et al. ION: Instance-level object navigation[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 4343-4352.
- Zhang S X, Song X H, Bai Y B, et al. Hierarchical object-to-zone graph for object navigation[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2021: 15110-15120.
- Izadinia H, Sadeghi F, Farhadi A. Incorporating scene context and object layout into appearance modeling[C]//2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2014: 232-239.
- Zuo Z, Shuai B, Wang G, et al. Learning contextual dependence with convolutional hierarchical recurrent neural networks[J]. IEEE Transactions on Image Processing, 2016, 25(7): 2983-2996.
- Kuhn H W. The Hungarian method for the assignment problem[J]. Naval Research Logistics Quarterly, 1955, 2(1/2): 83-97.
- Munkres J. Algorithms for the assignment and transportation problems[J]. Journal of the Society for Industrial and Applied Mathematics, 1957, 5(1): 32-38.
- Kwon O, Kim N, Choi Y, et al. Visual graph memory with unsupervised representation for visual navigation[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2021: 15870-15879.
- Li J N, Zhou P, Xiong C M, et al. Prototypical contrastive learning of unsupervised representations[EB/OL]. [2021-03-30]. http://arxiv.org/abs/2005.04966.
- Nguyen T L, Nguyen D V, Le T H. Reinforcement learning based navigation with semantic knowledge of indoor environments[C]//2019 11th International Conference on Knowledge and Systems Engineering (KSE). New York: IEEE, 2019: 1-7.
- Zhou K, Guo C, Zhang H Y. Visual navigation via reinforcement learning and relational reasoning[C]//2021 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI). New York: IEEE, 2021: 131-138.
- Qiu Y D, Pal A, Christensen H I. Learning hierarchical relationships for object-goal navigation[EB/OL]. [2020-11-18]. http://arxiv.org/abs/2003.06749.
- Zhou K, Guo C, Guo W F, et al. Learning heterogeneous relation graph and value regularization policy for visual navigation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(11): 16901-16915.
- Lyu Y L, Talebi M S. Double graph attention networks for visual semantic navigation[J]. Neural Processing Letters, 2023, 55(7): 9019-9040.
- Dang R H, Wang L Y, He Z T, et al. Search for or navigate to? Dual adaptive thinking for object navigation[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2023: 8216-8225.
- Yang B J, Yuan X F, Ying Z M, et al. HOGN-TVGN: Human-inspired embodied object goal navigation based on time-varying knowledge graph inference networks for robots[J]. Advanced Engineering Informatics, 2024, 62: 102671.
- Luo J, Cai B, Yu Y X, et al. Learning multimodal adaptive relation graph and action boost memory for visual navigation[J]. Advanced Engineering Informatics, 2024, 62: 102678.
- Singh K P, Salvador J, Weihs L, et al. Scene graph contrastive learning for embodied navigation[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2023: 10850-10860.
- Xu N, Wang W, Yang R, et al. Aligning knowledge graph with visual perception for object-goal navigation[EB/OL]. [2024-04-26]. http://arxiv.org/abs/2402.18892.
- Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks[EB/OL]. [2017-02-22]. http://arxiv.org/abs/1609.02907.
- Yun S, Jeong M, Kim R, et al. Graph transformer networks[EB/OL]. [2020-02-05]. http://arxiv.org/abs/1911.06455.
- Veličković P, Cucurull G, Casanova A, et al. Graph attention networks[EB/OL]. [2018-02-04]. http://arxiv.org/abs/1710.10903.
- Moghaddam M M K, Wu Q, Abbasnejad E, et al. Utilising prior knowledge for visual navigation: Distil and adapt[EB/OL]. [2020-12-06]. http://arxiv.org/abs/2004.03222.
- Lu Y, Chen Y R, Zhao D B, et al. MGRL: Graph neural network based inference in a Markov network with reinforcement learning for visual navigation[J]. Neurocomputing, 2021, 421: 140-150.
- Lyu Y L, Shi Y M, Zhang X G. Improving target-driven visual navigation with attention on 3D spatial relationships[J]. Neural Processing Letters, 2022, 54(5): 3979-3998.
- Moghaddam M K, Wu Q, Abbasnejad E, et al. Optimistic agent: Accurate graph-based value estimation for more successful visual navigation[C]//2021 IEEE Winter Conference on Applications of Computer Vision (WACV). New York: IEEE, 2021: 3732-3741.
- Wani S, Patel S, Jain U, et al. MultiON: Benchmarking semantic map memory using multi-object navigation[EB/OL]. [2020-12-07]. http://arxiv.org/abs/2012.03912.
- Ehsani K, Han W, Herrasti A, et al. ManipulaTHOR: A framework for visual object manipulation[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2021: 4495-4504.
- Xia F, Zamir A R, He Z Y, et al. Gibson Env: Real-world perception for embodied agents[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 9068-9079.
- Savva M, Kadian A, Maksymets O, et al. Habitat: A platform for embodied AI research[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2019: 9338-9346.
- Kolve E, Mottaghi R, Han W, et al. AI2-THOR: An interactive 3D environment for visual AI[EB/OL]. [2022-08-26]. http://arxiv.org/abs/1712.05474.
- Chang A, Dai A, Funkhouser T, et al. Matterport3D: Learning from RGB-D data in indoor environments[EB/OL]. [2017-09-18]. http://arxiv.org/abs/1709.06158.
- Ramakrishnan S K, Gokaslan A, Wijmans E, et al. Habitat-matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI[EB/OL]. [2021-09-16]. http://arxiv.org/abs/2109.08238.
- Deitke M, Han W, Herrasti A, et al. RoboTHOR: An open simulation-to-real embodied AI platform[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2020: 3161-3171.
- Deitke M, VanderBilt E, Herrasti A, et al. ProcTHOR: Large-scale embodied AI using procedural generation[EB/OL]. [2022-06-14]. http://arxiv.org/abs/2206.06994.
- Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA: Association for Computational Linguistics, 2014: 1532-1543.
- Wu Q Y, Manocha D, Wang J, et al. NeoNav: Improving the generalization of visual navigation via generating next expected observations[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(6): 10001-10008.
- Wortsman M, Ehsani K, Rastegari M, et al. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2019: 6743-6752.
- Mayo B, Hazan T, Tal A. Visual navigation with spatial attention[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2021: 16893-16902.
- Du H M, Yu X, Zheng L. VTNet: Visual transformer network for object goal navigation[EB/OL]. [2021-05-20]. http://arxiv.org/abs/2105.09447.
- Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers[C]//European Conference on Computer Vision. Cham: Springer-Verlag, 2020: 213-229.
- Zhou K, Guo C, Zhang H Y, et al. Optimal graph transformer Viterbi knowledge inference network for more successful visual navigation[J]. Advanced Engineering Informatics, 2023, 55: 101889.
- Forney G D. The Viterbi algorithm[J]. Proceedings of the IEEE, 1973, 61(3): 268-278.
- Sun Q R, Liu Y Y, Chua T S, et al. Meta-transfer learning for few-shot learning[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2019: 403-412.
- Kirillov A, Mintun E, Ravi N, et al. Segment anything[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2023: 3992-4003.
- Strader J, Hughes N, Chen W, et al. Indoor and outdoor 3D scene graph generation via language-enabled spatial ontologies[J]. IEEE Robotics and Automation Letters, 2024, 9(6): 4886-4893.
- Gervet T, Chintala S, Batra D, et al. Navigating to objects in the real world[J]. Science Robotics, 2023, 8(79): eadf6991.
- NVIDIA. NVIDIA Isaac Sim: Robotics simulation and synthetic data[EB/OL]. [2023-10-25]. https://developer.nvidia.com/isaac-sim.
