| Issue | Wuhan Univ. J. Nat. Sci. Volume 30, Number 5, October 2025 |
|---|---|
| Page(s) | 405-426 |
| DOI | https://doi.org/10.1051/wujns/2025305405 |
| Published online | 04 November 2025 |
CLC number: TP18
Navigating with Spatial Intelligence: A Survey of Scene Graph-Based Object Goal Navigation
1 GNSS Research Center, Wuhan University, Wuhan 430072, Hubei, China
2 Hubei Luojia Laboratory, Wuhan 430072, Hubei, China
3 Artificial Intelligence Institute, Wuhan University, Wuhan 430072, Hubei, China
4 School of Geodesy and Geomatics, Wuhan University, Wuhan 430072, Hubei, China
5 Electronic Information School, Wuhan University, Wuhan 430072, Hubei, China
† Corresponding author. E-mail: zt9877@163.com
Received: 25 September 2024
Today, autonomous mobile robots are widely used in all walks of life. Autonomous navigation, as a basic capability of robots, has become a research hotspot. Classical navigation techniques, which rely on pre-built maps, struggle to cope with complex and dynamic environments. With the development of artificial intelligence, learning-based navigation technologies have emerged. Instead of relying on pre-built maps, the agent perceives the environment and makes decisions through visual observation, enabling end-to-end navigation. A key challenge is to enhance the generalization ability of the agent in unfamiliar environments. To tackle this challenge, it is necessary to endow the agent with spatial intelligence. Spatial intelligence refers to the ability of the agent to transform visual observations into insights, insights into understanding, and understanding into actions. To endow the agent with spatial intelligence, relevant research uses the scene graph to represent the environment. We refer to this method as scene graph-based object goal navigation. In this paper, we concentrate on the scene graph, offering a formal description and a computational framework of object goal navigation. We provide a comprehensive summary of the methods for constructing and applying scene graphs. Additionally, we present experimental evidence that highlights the critical role of the scene graph in improving navigation success. This paper also delineates promising research directions, all aimed at sharpening the focus on the scene graph. Overall, this paper shows how the scene graph endows the agent with spatial intelligence, aiming to promote the importance of the scene graph in the field of intelligent navigation.
Key words: object goal navigation / scene graph / spatial intelligence / deep reinforcement learning
Cite this article: GUO Chi, LI Aolin, MENG Yiyue. Navigating with Spatial Intelligence: A Survey of Scene Graph-Based Object Goal Navigation[J]. Wuhan Univ J of Nat Sci, 2025, 30(5): 405-426.
Biography: GUO Chi, Ph. D., Professor, research direction: BeiDou applications, unmanned system navigation and location-based services. E-mail: guochi@whu.edu.cn
Foundation item: Supported by the Major Science and Technology Project of Hubei Province of China (2022AAA009) and the Open Fund of Hubei Luojia Laboratory
© Wuhan University 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
0 Introduction
Nowadays, autonomous mobile robots play a vital role in many industries, such as logistics management, consumer rescue, and housekeeping services[1]. Navigating to the destination is a basic ability that robots need to have. Classic navigation methods are often based on precise and digital geometric maps. When faced with complex and variable environments that lack well-constructed maps, these methods exhibit poor generalizability. With the development of machine learning and computer vision, learning-based navigation methods have emerged. During interactions with the surrounding environment, the agent uses artificial neural networks to process visual observations and make action decisions. Owing to the independence from prior maps, this end-to-end navigation method has become a research hotspot in recent years. It is known as target-driven navigation[2], visual mapless navigation[3-4], or object goal navigation[5]. For the sake of clarity, this paper uniformly refers to it as object goal navigation.
Since there is no pre-built map in object goal navigation, the agent only perceives through visual images and lacks spatial cognition of the environment. This causes the agent to easily overfit the training environment, which greatly limits its generalization ability in unfamiliar environments. The core of addressing the challenge lies in endowing the agent with spatial intelligence. Spatial intelligence refers to the ability of an agent to transform visual observations into insights, insights into understanding, and understanding into actions[6]. Without this ability, the agent simply transforms observations into actions, which makes it difficult to complete navigation tasks in unfamiliar environments. For example, assuming that the navigation target is a refrigerator, the agent with spatial intelligence will not waste time in the living room, but will move towards objects related to the kitchen. However, the agent without spatial intelligence does not have spatial cognition ability and is likely to fall into an endless loop in rooms outside the kitchen.
To endow the agent with spatial intelligence, researchers drew on the Knowledge Graph (KG) concept from computer science and proposed the scene graph. A KG is a kind of semantic network that reveals relationships between entities, facilitating a formalized depiction of real-world entities and their interconnections[7-8]. It consists of a mesh-like structure of "entity-relationship-entity" triplets, formally denoted as $G = (E, R)$, where $E$ denotes the set of entities and $R$ denotes the set of relationships. In a scene graph, entities typically describe object classes (e.g., televisions), concrete instances (e.g., gray leather sofas), zones (e.g., bedrooms), and other environmental elements. Relationships represent spatial or semantic connections between these entities, such as "next to", "above", and "inside". The scene graph provides prior information on the relationship between the navigation target and the overall scene, allowing the agent to perceive the surrounding environment, infer the target location, and plan the path based on visual information. To distinguish it from other methods, we define the navigation method that integrates the scene graph as scene graph-based object goal navigation.
When it comes to scene graph-based object goal navigation, several questions emerge: 1) How is the scene graph constructed? 2) How is the scene graph integrated into the object goal navigation framework? 3) How is the performance of the scene graph evaluated within the object goal navigation framework? This paper therefore surveys research related to the scene graph in object goal navigation to answer these questions by: 1) offering an overview of the formal description and computational framework of object goal navigation; 2) summarizing methods for constructing and applying the scene graph in object goal navigation; 3) experimentally validating the performance of the scene graph in object goal navigation and outlining future directions. This paper emphasizes the role of scene graphs in object goal navigation and shows how they endow the agent with spatial intelligence.
1 Object Goal Navigation
1.1 Formal Description of Object Goal Navigation
Object goal navigation is a basic task for indoor robots in unfamiliar environments. In this task, the agent is given a specific object target, such as a laptop. The agent needs to understand what the target is and navigate towards it using visual observations. Classic navigation tasks aim to guide the agent to a point goal based on pre-built maps. This approach relies heavily on maps and does not equip the agent with an understanding of the surrounding environment. Unlike classic navigation, object goal navigation cannot be realized without interaction with the surrounding environment. The interaction process of object goal navigation is described in detail below.
The subject of interaction is the navigation agent, and the object is the navigation environment. After executing an action, the agent obtains visual observations from the navigation environment and then carries out the next action based on the new observations, repeating this cycle continuously. Under this interaction, the navigation agent needs to perform, at each moment and based on the current observation, the action that is most conducive to the completion of the navigation task, which is a sequential decision problem. Therefore, this problem can be mathematically described as a Markov Decision Process (MDP). One approach for addressing this problem is Reinforcement Learning (RL). This approach emphasizes the interaction between the agent and the environment, facilitating the memorization of navigation experiences and the learning of action strategies to maximize cumulative rewards[9-14].
Object goal navigation defines the interaction between the agent and the environment through a quadruple $(S, A, P, R)$. In this context, $S$ represents the state space, comprising all possible states in the environment; $A$ signifies the action space, encompassing all feasible actions; $P$ represents the state transition function, where $P(s_{t+1} \mid s_t, a_t)$ denotes the probability of transitioning from the current state $s_t$ to the next state $s_{t+1}$ given action $a_t$; $R$ represents the reward function, where $R(s_t, a_t, s_{t+1})$ signifies the reward obtained when transitioning from current state $s_t$ to next state $s_{t+1}$ via action $a_t$. The agent acquires information about these functions by observing states, actions, and rewards during interactions, represented as triplets $(s_t, a_t, r_t)$. As illustrated in Fig. 1, the agent moves through a "state-action-reward" loop in the environment across time steps $t = 0, 1, \dots, T$, creating a Markov decision trajectory $\tau$, defined as:

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T) \tag{1}$$
Fig. 1 Interaction between the agent and environment
Referring to equation (1), the agent promptly receives reward feedback for each action. The aim is to maximize the cumulative reward, denoted as $R(\tau)$, in order to refine the action strategy $\pi$. Here, $\pi$ represents a function that maps states to actions. By incorporating a weighting factor $\gamma \in [0, 1]$, $R(\tau)$ is expressed as the weighted sum of all rewards along the trajectory:

$$R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t$$

The expected return, represented as $\mathbb{E}[R(\tau)]$, is used to elucidate the inherent randomness in actions and the environment. Under the policy $\pi$, the state-value function $V^{\pi}(s)$ or action-value function $Q^{\pi}(s, a)$ aids the agent in deriving the expected return. Here, $V^{\pi}(s)$ denotes the expected return achievable by adhering to policy $\pi$ from state $s$, whereas $Q^{\pi}(s, a)$ signifies the expected return from taking action $a$ in state $s$ under policy $\pi$. The definitions of $V^{\pi}(s)$ and $Q^{\pi}(s, a)$ are as follows:

$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]$$

$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right]$$
Achieving superior action policies $\pi$ primarily involves three methods: value-based methods[15-16], policy-based methods[17], and hybrid methods[18]. Value-based methods indirectly derive the agent's policy by iteratively updating value functions. Upon reaching its optimal value, the value function yields the optimal policy for the agent. This method ensures convergence toward locally optimal policies. Policy-based methods directly establish a policy network through function approximation. Actions are subsequently selected to acquire reward values, and the policy network parameters are optimized along the gradient direction to achieve an optimized policy that maximizes returns. These methods exhibit high sample training efficiency. Hybrid methods combine value-based and policy-based methods, resulting in fast training efficiency and convergence speed.
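To make the return and value definitions above concrete, the following minimal Python sketch (an illustrative example, not code from any surveyed work; the toy rollout and reward values are assumptions) computes the discounted return $R(\tau)$ of one trajectory and a Monte Carlo estimate of $V^{\pi}(s_0)$ from sampled rollouts:

```python
import random
from typing import Callable, List

def discounted_return(rewards: List[float], gamma: float = 0.99) -> float:
    """Weighted sum of rewards along one trajectory: R(tau) = sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def monte_carlo_value(rollout: Callable[[], List[float]], episodes: int = 100,
                      gamma: float = 0.99) -> float:
    """Monte Carlo estimate of V^pi(s_0): average discounted return over sampled rollouts."""
    returns = [discounted_return(rollout(), gamma) for _ in range(episodes)]
    return sum(returns) / len(returns)

# Toy rollout: -0.01 step penalty per action, +10.0 on (randomly reached) success.
def toy_rollout() -> List[float]:
    steps = random.randint(5, 50)
    rewards = [-0.01] * steps
    if random.random() < 0.5:
        rewards[-1] = 10.0
    return rewards

print(monte_carlo_value(toy_rollout))
```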
1.2 Computational Framework of Object Goal Navigation
Ongoing advancements in deep learning have led to the emergence of Deep Reinforcement Learning (DRL) algorithms, significantly facilitating spatial cognition training in object goal navigation[19-22]. Within object goal navigation, the agent executes navigation actions guided by the current RGB image-based observation. The agent then acquires subsequent observations from the environment, continuing this iterative process to maximize cumulative rewards. The interaction process consists of the following key components:
1) Action: The action space is the set of navigation actions and is commonly discrete, e.g., moving forward 0.25 m, turning left or right by 45 degrees, and marking the task as completed. It is crucial for the agent to choose the appropriate action based on its perception of the surroundings. Different navigation strategies have different preferences for action selection: aggressive strategies tend to favor moving forward, while cautious strategies often opt for turning left and right to observe the surroundings. Regardless of the navigation strategy, the most important thing is that the agent successfully finds the target object and deliberately issues the done action.
2) Observation: The observation is an RGB image obtained via a monocular camera, depicting the agent's first-person perspective. The quality and content of the images play a crucial role in the agent's understanding of the surroundings.
3) Target: The target could be an RGB image or an object category, like "television".
4) Rewards: Positive rewards (e.g., +5.00) are only granted upon successful task completion, aimed at minimizing the agent's path from task initiation to target discovery, while other navigation actions yield negative rewards (e.g., -0.01). A well-designed reward mechanism facilitates the training of good navigation agent.
Object goal navigation employs deep learning networks to model the navigation process and utilizes reinforcement learning algorithms to learn the navigation policy (refer to Fig. 2). It comprises three layers: 1) Perception layer, utilizing Convolutional Neural Networks (CNNs) like ResNet-50[23] to extract image features and enhance the agent's perception; 2) Memory layer, employing Recurrent Neural Networks (RNNs) like LSTM[24] to effectively represent the long-term sequences of image features processed by ResNet-50, facilitating the agent's storage of navigation experiences and memorization of the spatial environment; 3) Decision layer, grounded in a reinforcement learning network, continuously guiding the interaction between the agent and the environment, collecting information on states, actions, and rewards, and iteratively optimizing navigation action policies and value functions based on this information. The object goal navigation network framework, relying solely on visual sensors, transforms sight into insight, insight into understanding, and understanding into action, assisting the agent in navigating unfamiliar, map-less environments and offering a navigational solution equipped with cognition and spatial intelligence.
Fig. 2 The framework of object goal navigation based on DRL
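As a concrete illustration of the three-layer framework in Fig. 2, the sketch below (a hypothetical PyTorch implementation; layer sizes and the number of actions are chosen only for illustration) stacks a CNN perception layer, an LSTM memory layer, and actor-critic decision heads:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ObjectGoalNavNet(nn.Module):
    """Perception (CNN) -> Memory (LSTM) -> Decision (actor-critic) pipeline."""
    def __init__(self, num_actions: int = 6, hidden_dim: int = 512):
        super().__init__()
        backbone = resnet50(weights=None)                         # perception layer
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(2048, hidden_dim, batch_first=True)   # memory layer
        self.actor = nn.Linear(hidden_dim, num_actions)           # decision layer: policy logits
        self.critic = nn.Linear(hidden_dim, 1)                    # decision layer: state value

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 3, H, W) sequence of RGB observations
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1)     # (b*t, 2048)
        out, hidden = self.lstm(feats.view(b, t, -1), hidden)
        return self.actor(out), self.critic(out), hidden

logits, values, _ = ObjectGoalNavNet()(torch.rand(1, 4, 3, 224, 224))
```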
1.3 Scene Graph in Object Goal Navigation
Scene graphs are formally defined as $G = (E, R)$, with $E = \{e_1, e_2, \dots, e_{N_e}\}$ representing the set of entities, comprising a total of $N_e$ distinct entities, and $R = \{r_1, r_2, \dots, r_{N_r}\}$ denoting the set of relationships, consisting of $N_r$ distinct types of relationships. Entities encompass objects, instances, images, or other elements within the environment, whereas relationships represent spatial or semantic connections among these entities. The scene graph provides prior knowledge about the relationships between the target and the overall scene, enabling the agent to perceive the surrounding environment, infer the target location based on visual information, and plan the path accordingly.
For instance, a typical living room contains objects such as a television (TV), sofa, coffee table, and TV stand, collectively forming an entity set $E = \{\text{TV}, \text{sofa}, \text{coffee table}, \text{TV stand}\}$. The relationships among these entities, such as their relative positions, constitute the set of relationships $R$. In general, the TV is usually on the TV stand while the coffee table is near the sofa. The scene graph accurately represents the relationships between objects, which helps the agent to understand the spatial layout of the surrounding environment. For example, when the target is a TV, the agent will approach the TV stand instead of moving towards the sofa based on the scene graph.
The aforementioned example focuses on a scene graph constituted by different objects within a single room. When it comes to cross-room navigation tasks, the agent needs to understand the spatial relationships between different rooms. For example, in a large house, there are rooms such as kitchen, living room, bedroom and bathroom. These rooms constitute a room-level scene graph. When the target is a refrigerator, the agent needs to judge which room the refrigerator belongs to according to the scene graph. Usually, the refrigerator is in the kitchen. Therefore, the agent moves towards the kitchen based on the spatial relationship provided by the scene graph, instead of choosing to wander around rooms such as the living room.
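A scene graph of this kind can be stored as a simple set of "entity-relationship-entity" triplets; the sketch below (an illustrative data structure, not drawn from any surveyed implementation) encodes the living-room and room-level examples above and answers a query such as "which entities are related to the TV?":

```python
from collections import defaultdict

class SceneGraph:
    """Minimal triplet store: (head entity, relationship, tail entity)."""
    def __init__(self):
        self.adj = defaultdict(list)

    def add(self, head: str, relation: str, tail: str):
        self.adj[head].append((relation, tail))
        self.adj[tail].append((relation, head))   # treat relations as undirected here

    def related(self, entity: str):
        return self.adj[entity]

g = SceneGraph()
g.add("TV", "on", "TV stand")                 # object-level triplets
g.add("coffee table", "next to", "sofa")
g.add("refrigerator", "inside", "kitchen")    # zone-level triplet
print(g.related("TV"))                        # [('on', 'TV stand')]
```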
Obviously, the scene graph is critical for the agent to complete the navigation task. So how can the scene graph be integrated into object goal navigation? Undoubtedly, it requires a systematic approach. Initially, knowledge extraction techniques are needed to collect "entity-relationship-entity" triplets for constructing the scene graph. The construction of the scene graph involves four models categorized by the types of triplets: the object triplet model, instance triplet model, zone triplet model, and image triplet model. The four models provide the agent with different levels of spatial relations and semantic information, which is conducive to its spatial cognition. Subsequently, knowledge representation and reasoning techniques are essential for acquiring the structural features of the scene graph. This usage comprises three methods: multimodal feature fusion, value function estimation, and sparse reward improvement. The agent translates the spatial information provided by the scene graph into its own understanding, thereby navigating adaptively in the surroundings. The construction and usage of the scene graph are introduced in detail in Sections 2 and 3, respectively.
2 Construction of Scene Graph in Object Goal Navigation
The scene graph comprises multiple categories of "entity-relationship-entity" triplets, depicting different levels of spatial relationships and semantic information. Based on research on scene graph-based object goal navigation, we categorize the construction methods of the scene graph by triplet type into the object triplet model, instance triplet model, zone triplet model, and image triplet model. In this section, we introduce the four models in turn. At the end of this section, the four models are compared side by side, and their respective advantages and disadvantages are pointed out.
2.1 Object Triplet Model
Yang et al[25] first introduced the concept of the object triplet model to define the scene graph. They constructed an indoor scene graph offline using the Visual Genome dataset[26] (refer to Fig. 3). The Visual Genome dataset comprises over 100 000 natural images, each annotated with objects, attributes, and various relationships. Every object category was represented as a node within the scene graph during the construction process. Connections between two nodes were established by computing the occurrence frequency of relationships between objects, setting a threshold of occurrence frequency exceeding 3. The scene graph based on Visual Genome dataset contains common sense about the relationship between objects. On this basis, the agent can organically combine single visual observations with spatial information, forming a primary object-level spatial intelligence. This study expanded the framework of object goal navigation, leading to increased interest in the study of scene graph.
Fig. 3 An example of scene graph based on Visual Genome dataset
Constructing the scene graph from the Visual Genome dataset involves matching external annotated data with existing objects within the navigation scenes, enabling offline graph construction. This type of graph, as a fixed common-sense graph, is integrated into the network without updating during navigation, limiting its generalizability. Consequently, there is a necessity to explore methods for dynamically constructing the scene graph in real time to enhance adaptability. Hence, Du et al[27], inspired by human exploratory behavior, incorporated Faster R-CNN[28] (an object detector, refer to Fig. 4) into the process of constructing the scene graph. Initially, given an input image, all objects within are detected using Faster R-CNN. If an object has multiple instances, only the one with the highest confidence level is chosen as the detected object. Subsequently, the position and confidence of each object's bounding box are recorded and concatenated to form local detection features. If some objects do not appear in the current observed image, their bounding box positions and confidences are set to zero. Next, the one-hot encoding of object is concatenated with the local detection features to form Location-Aware Features (LAF). Finally, all object node features within the scene graph are replaced with LAF. Following node determination, relationships between nodes are denoted by object occurrence, which is assigned to 1 when both targets appear in the same observation image, and 0 otherwise. The scene graph constructed based on Faster R-CNN contains real-time dynamic spatial information of the navigation environment, which has higher accuracy, effectiveness and robustness. Based on this scene graph, the agent can reason and analyze the visible objects within the current visual observation, convert the observation into insight and understanding, and finally make the correct navigation behavior. It can be said that the scene graph built based on Faster R-CNN provides the agent with more powerful object-level spatial intelligence compared with simple common-sense graph.
Fig. 4 An example of scene graph based on Faster R-CNN
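The online construction described above can be sketched roughly as follows (a simplified illustration assuming a generic detector output; the field layout, category count, and normalized coordinates are hypothetical): each detected object contributes a location-aware feature, and co-occurrence in the same frame sets the corresponding edge to 1.

```python
import numpy as np

NUM_CLASSES = 22          # assumed number of object categories

def build_object_graph(detections):
    """detections: {class_id: (x1, y1, x2, y2, confidence)} for one observation.
    Returns location-aware node features and a binary co-occurrence adjacency."""
    nodes = np.zeros((NUM_CLASSES, NUM_CLASSES + 5))        # one-hot + bbox + confidence
    adj = np.zeros((NUM_CLASSES, NUM_CLASSES))
    for cls, (x1, y1, x2, y2, conf) in detections.items():
        nodes[cls, cls] = 1.0                                # one-hot encoding of the class
        nodes[cls, NUM_CLASSES:] = [x1, y1, x2, y2, conf]    # local detection feature
    visible = list(detections.keys())
    for i in visible:                                        # objects seen together -> edge = 1
        for j in visible:
            if i != j:
                adj[i, j] = 1.0
    return nodes, adj

nodes, adj = build_object_graph({3: (0.1, 0.2, 0.4, 0.6, 0.9), 7: (0.5, 0.1, 0.8, 0.7, 0.8)})
```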
Real-time object detection using a target detector has been a common approach for constructing scene graph[28-30]. However, representing relationships between object nodes as simple binary (0 or 1) values leads to an oversimplified portrayal of spatial relationships. To precisely learn and explicitly represent relationships between objects, Dang et al[31] introduced an unbiased Directed Object Attention (DOA) graph, providing a unique and adaptable representation of object relationships. The DOA graph is obtained through a weighted summation of the Intrinsic Object Graph (IOG) and the View Adaptive Graph (VAG): 1) the IOG represents inherent relationships between objects, such as the strong intrinsic relationship between a mouse and a computer. It is defined as a learnable matrix $G_{\text{IOG}} \in \mathbb{R}^{N \times N}$ featuring bidirectional edges between objects, where $N$ represents the number of objects. One end of an edge points towards the target, and the other represents objects requiring attention, with the edge weight indicating the intrinsic relationship between the target and other objects. Finally, SoftMax normalization is employed on the edge weights to ensure that the sum of weights connected to the target node equals 1. As the agent navigates various environments using reinforcement learning, $G_{\text{IOG}}$ gradually stabilizes and rationalizes; 2) the VAG demonstrates real-time adaptive relationships among objects within observed images. Upon obtaining an observed image, the agent generates global image features $F_g$ (derived from ResNet-18) and object features $F_o$ (obtained from Faster R-CNN). Object features $F_o$ consist of object visual images $o_v$, object positional features $o_p$, confidence $o_c$, and target indicators $o_g$. To filter out noise, a confidence threshold is set, and when $o_c$ exceeds the threshold, the final object feature $F_o'$ is obtained. Furthermore, the final image feature $F_g'$ is generated by concatenating the one-hot encoded feature of the target with the globally average-pooled $F_g$ to include target information in the global image features. Lastly, the VAG $G_{\text{VAG}}$ is derived from the object feature and image feature using multi-head attention. Based on the scene graph constructed by Faster R-CNN, Dang et al[31] employed attention mechanisms to convert fixed 0 or 1 relationships between objects into trainable parameters. The agent can not only gain insight into the spatial relationships between objects within the current visual observation, but also understand the relevance of different objects to the navigation task. Compared with the fixed-relationship scene graph proposed by Du et al[27], the one proposed by Dang et al[31] endows the agent with a more adaptive object-level spatial intelligence.
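The weighted combination of a learnable intrinsic graph and an attention-derived adaptive graph can be sketched as follows (a hypothetical PyTorch fragment; the symbol names, the way the global image feature is injected, and the equal-weight fusion are assumptions rather than the authors' exact formulation):

```python
import torch
import torch.nn as nn

class DOAGraphSketch(nn.Module):
    """DOA ~ weighted sum of a learnable intrinsic graph (IOG) and a view-adaptive graph (VAG)."""
    def __init__(self, num_objects: int, feat_dim: int = 64, alpha: float = 0.5):
        super().__init__()
        self.iog = nn.Parameter(torch.randn(num_objects, num_objects))  # learnable intrinsic edges
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.alpha = alpha

    def forward(self, obj_feats, img_feat):
        # obj_feats: (1, N, feat_dim) from the detector; img_feat: (1, 1, feat_dim) global feature
        iog = torch.softmax(self.iog, dim=-1)                          # normalized intrinsic edges
        query = obj_feats + img_feat                                   # inject global/target context
        _, vag = self.attn(query, obj_feats, obj_feats)                # attention weights ~ adaptive edges
        return self.alpha * iog + (1 - self.alpha) * vag.squeeze(0)    # weighted DOA graph

doa = DOAGraphSketch(num_objects=22)(torch.rand(1, 22, 64), torch.rand(1, 1, 64))
```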
Regarding object goal navigation as a continual interaction between the agent and objects within the environment, solely focusing on spatial relationships between objects through "object-relation-object" triplets without considering the relationship between the agent and objects impedes the agent's grasp of spatial relationships. To tackle this issue, Hu et al[32] introduced a technique for building two types of scene graph: 1) Object KG: This graph employs object detection algorithms for real-time extraction, similar to the previously mentioned approach. Edges in this graph illustrate horizontal positional relationships; 2) Agent-object KG: In this graph, edges represent depth relationships. These two types of scene graph allow the agent to understand the positions of objects in both horizontal and vertical directions, thereby improving navigation performance.
2.2 Instance Triplet Model
The object triplet model mentioned above concentrates on building a scene graph based on object categories. In this model, if the agent receives a target, any object matching the target category is deemed a successful navigation outcome. However, in real-world scenarios, the agent is required to pinpoint a particular instance. For example, if the agent is instructed to "find a red wine glass", locating a "paper cup" would be considered a navigation failure. Hence, the instance triplet model is introduced to assist with the precise navigation tasks encountered in real-life scenarios.
Li et al[33] tackled instance-level navigation tasks by developing an instance-level scene graph (refer to Fig. 5), formally denoted as $G_I = (E_I, R_I)$, where $E_I$ denotes instance categories and $R_I$ signifies the spatial relationships between instances and their inherent category similarities. Each node $e_i$ in the scene graph $G_I$ comprises an instance boundary $b_i$, color $c_i$, and texture $t_i$. In contrast to the object-level scene graph, the instance-level scene graph exhibits two notable differences: 1) it incorporates target instance masks; 2) it selects instances from the same category instead of merely selecting the target with the highest detection score. Given the target description, an instance mask is employed to represent color and texture, while word vectors represent the target features, formally represented as $f_g = [c_g \| t_g]$. The target feature $f_g$ shares edge connections with the scene graph $G_I$, activating solely the target instance node $e_g$. The feature of $e_g$ is formed through the concatenation of the color vector $c_g$ and the texture vector $t_g$, while the remaining inactive instance nodes are denoted by zero vectors of identical dimensions. In cases where multiple instances of a category are detected, the crucial task is to select the most appropriate instance. Considering the detected instances $\{x_1, x_2, \dots, x_{N_k}\}$, where $N_k$ represents the number of instances within category $k$, along with the bounding box $b_i$, color $c_i$, and texture $t_i$ of each instance, the process of instance selection can be formally defined as follows:

$$i^{*} = \underset{i \in \{1, \dots, N_k\}}{\arg\min}\, D\big([c_i \| t_i],\ [c_g \| t_g]\big)$$

where $[c_i \| t_i]$ denotes the concatenation of $c_i$ and $t_i$, and $D(\cdot, \cdot)$ represents the function used to compute the Euclidean distance between two vectors. The essence of instance selection lies in choosing the instance that bears the closest resemblance to the target instance.

Fig. 5 An example of instance nodes
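A minimal sketch of this selection rule (illustrative only; the feature layout and vector dimensions are assumed) picks the detected instance whose concatenated color-texture vector is closest, in Euclidean distance, to that of the described target:

```python
import numpy as np

def select_instance(instances, target_color, target_texture):
    """instances: list of dicts with 'bbox', 'color', 'texture' (numpy vectors).
    Returns the index of the instance closest to the target description."""
    target = np.concatenate([target_color, target_texture])
    dists = [np.linalg.norm(np.concatenate([inst["color"], inst["texture"]]) - target)
             for inst in instances]
    return int(np.argmin(dists))

detected = [{"bbox": (0, 0, 10, 10), "color": np.array([0.9, 0.1, 0.1]), "texture": np.array([0.2, 0.8])},
            {"bbox": (5, 5, 20, 20), "color": np.array([0.1, 0.1, 0.9]), "texture": np.array([0.7, 0.3])}]
best = select_instance(detected, target_color=np.array([1.0, 0.0, 0.0]),
                       target_texture=np.array([0.2, 0.8]))
```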
The instance triplet model focuses on real-world navigation tasks. Compared to the object triplet model, the nodes of the scene graph are no longer simple object categories, but specific object instances. The nodes contain the color and texture characteristics of each object, providing the agent with complex and accurate spatial relationships and semantic information. Based on instance-level scene graph, the agent can effectively distinguish the object instances in the visual observation when navigating in the real environment and determine whether it is the target object instance. Therefore, compared with the object-level scene graph, the instance-level scene graph helps the intelligent agent form spatial intelligence that is more conducive to real-world navigation tasks.
2.3 Zone Triplet Model
The object triplet model employs the object itself as a feature input, guiding the agent in planning navigation actions at each step. Nevertheless, without prior knowledge about the unfamiliar zone, the agent's initial navigation actions may be meaningless, such as "looking for the fridge in the bathroom". Further constraints on object relationships are necessary to substantially enhance the navigation capabilities. This necessitates the zone triplet model, which focuses on the spatial relationships among objects within a shared zone.
Zhang et al[34] introduced a Hierarchical Object-to-Zone graph (HOZ graph), wherein each zone comprises a correlated set of objects. For example, "microwave", "cookware", and "sink" frequently coexist within a shared zone. Therefore, if the target is designated as "microwave", the primary objective for the agent is to navigate to that particular zone. Based on the rooms and scenes present in the environment, the scene graph can be categorized into room-level HOZ graph and scene-level HOZ graphs (refer to Fig. 6).
Fig. 6 An example of room-level and scene-level HOZ graph
The room-level HOZ graph: Similar rooms tend to exhibit analogous layouts and objects[35-36]. For example, when considering a living room, we may envision an area containing a sofa, table, and television. Such areas form the room-level HOZ. Initially, within a specific room $m$, the agent randomly explores the room, acquiring a set of visual tuple features $\{(s_i, p_i)\}$. Here, $s_i \in \mathbb{R}^{N}$ represents a bag-of-words vector extracted by Faster R-CNN, indicating the presence of objects in the current observed image, with $N$ indicating the number of object categories. $p_i = (x_i, y_i, \theta_i, \beta_i)$ represents the observation position, where $x_i$ and $y_i$ denote horizontal coordinates, while $\theta_i$ and $\beta_i$ represent the agent's yaw and pitch angles, respectively. Subsequently, implementing the K-means clustering algorithm on the features $\{s_i\}$ yields $K$ zones utilized for building the room-level HOZ graph $G_m = (Z, E_Z)$, where $z_k$ and $f_{z_k}$ represent the $k$-th zone node and its feature, respectively. The zone feature corresponds to the cluster center, calculated as $f_{z_k} = \frac{1}{|C_k|} \sum_{s_i \in C_k} s_i$. Here, $C_k$ denotes the $k$-th cluster after K-means clustering, and $|C_k|$ represents the number of elements in that cluster. The edges in the room-level HOZ graph $E_Z$ are calculated using cosine similarity between the features of two zones, signifying the adjacency relationship between them.
The scene-level HOZ graph: Constructing the scene-level HOZ graph includes grouping, matching, and merging all room-level HOZ graphs, categorized by scene types. For a particular scene, we acquire a set of room-level HOZ graphs represented as $\{G_1, G_2, \dots, G_M\}$. The direct perfect matching of all HOZ nodes in the set incurs a significantly high time complexity. Hence, the procedure entails iteratively matching and merging two graphs at a time until all graphs in the set are merged into the final graph, depicting the scene-level HOZ graph. The matching and merging method utilizes the Kuhn-Munkres algorithm (KM algorithm)[37-38], which efficiently resolves the optimal maximum matching problem in bipartite graphs. If two equivalent subgraphs within a bipartite graph have a perfect match, it signifies the optimal maximum matching. Upon obtaining a perfect matching in the scene graph, the matched nodes and edges are fused by averaging: the newly generated nodes represent the average of the matched nodes, and the new edges represent the average of the original edges between the matched nodes. The update process is illustrated in Fig. 7.
Fig. 7 The graph matching based on KM algorithm
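A rough sketch of the room-level HOZ construction (illustrative; the use of scikit-learn's KMeans and cosine similarity is an assumption about tooling, not the authors' implementation) clusters the bag-of-words observation features into zones and links zones by feature similarity:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def build_room_hoz(obs_features: np.ndarray, num_zones: int = 4):
    """obs_features: (num_observations, num_object_categories) bag-of-words vectors.
    Returns zone-center features (nodes) and a cosine-similarity matrix (edges)."""
    km = KMeans(n_clusters=num_zones, n_init=10).fit(obs_features)
    zone_nodes = km.cluster_centers_               # each zone feature = cluster center
    zone_edges = cosine_similarity(zone_nodes)     # adjacency strength between zones
    return zone_nodes, zone_edges

nodes, edges = build_room_hoz(np.random.rand(200, 22))
```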
The zone triplet model focuses on the spatial relationships between zones, claiming that object relationships without zone constraints can easily mislead the agent into an infinite loop. It clusters the object-level scene graph into a room-level scene graph based on the similarity of the object layout, and then constructs a scene-level scene graph from the room-level scene graph through graph matching and merging. With the help of the zone triplet model, the agent infers the scene based on the objects observed. In this way, the agent can quickly determine whether the environment is relevant to the navigation task. For example, suppose the navigation target is a TV, and the objects currently observed by the agent are a refrigerator and a microwave. The agent will judge that it is in the kitchen based on the scene graph and choose to leave instead of hanging around in it. In general, the zone triplet model endows the agent with zone-level spatial intelligence.
2.4 Image Triplet Model
The scene graphs introduced above are all constructed or extended based on the spatial relationships between objects. However, when establishing the mapping relationship between objects and the navigation environment, the feature information other than the objects in the visual observation image is ignored, which limits the generalization ability to a certain extent. To address this challenge, Kwon et al[39] introduced a Visual Graph Memory (VGM) module. This module directly utilizes the observation images for constructing the scene graph $G_V$. They employed the Prototypical Contrastive Learning (PCL) algorithm[40] for feature representation learning of the graph. Initially, the observation images are clustered, and then the PCL algorithm utilizes an image encoder $\phi$ to bring images from the same cluster closer and images from different clusters farther apart in the embedding space. The utilization of the PCL algorithm facilitates unsupervised learning of the feature representation $\phi(\cdot)$ for the scene graph.
The VGM module comprises two components: 1) Localization: The current position of the agent can be determined using the features of the current observation image and the node features in VGM. The computation is as follows: the node from the previous localization is denoted as $v_{\text{last}}$, and at time $t$, the number of nodes in VGM is $N_t$. A pre-trained image encoder $\phi$ is used to encode the current observation image $o_t$ to derive the feature $e_t = \phi(o_t) \in \mathbb{R}^{d}$, where $d$ is the feature dimension. Employing cosine similarity allows computing $\text{sim}(e_t, e_{v_i})$ for every node $v_i$ in VGM. When $\text{sim}(e_t, e_{v_i})$ exceeds a set similarity threshold, the corresponding node $v_i$ is close to the agent. Finally, the node with the highest similarity is chosen as the agent's current position; 2) Memory update: Assume the current localized position is node $v_t$. If $v_t$ matches the previously localized node $v_{\text{last}}$, no VGM update is needed. However, if they differ, a new edge is added between $v_t$ and $v_{\text{last}}$, with the feature of $v_t$ replaced by the feature $e_t$ encoded from the current observation image. If the current position cannot be localized, i.e., no $\text{sim}(e_t, e_{v_i})$ exceeds the set similarity threshold, a new node $v_{N_t+1}$ and its corresponding feature $e_t$ are added. Similarly, a new edge is created between $v_{N_t+1}$ and $v_{\text{last}}$. Furthermore, the feature $e_{v_i}$ of each node $v_i$ incorporates time information $t_i$. Hence, the VGM obtained from these two modules, Localization and Memory Update, stores the agent's navigation experiences and temporal-spatial relationships. This scene graph provides the agent with image-level spatial intelligence, enabling the agent to directly use the feature information of the current observed image to determine its own position, plan the navigation path, and complete the navigation task.
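The localization-and-update cycle can be sketched as follows (a hypothetical, simplified implementation; the image encoder is abstracted away and its output is passed in as a pre-computed feature vector):

```python
import numpy as np

class VisualGraphMemorySketch:
    """Stores encoded observation features as nodes; edges link consecutive localized nodes."""
    def __init__(self, sim_threshold: float = 0.8):
        self.features, self.edges, self.last = [], set(), None
        self.sim_threshold = sim_threshold

    def _cosine(self, a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def step(self, feat: np.ndarray):
        """Localize against stored nodes, then update memory with the current feature."""
        sims = [self._cosine(feat, f) for f in self.features]
        if sims and max(sims) > self.sim_threshold:     # localization succeeded
            node = int(np.argmax(sims))
            self.features[node] = feat                  # refresh node feature with current observation
        else:                                           # unseen place: add a new node
            node = len(self.features)
            self.features.append(feat)
        if self.last is not None and self.last != node:
            self.edges.add((self.last, node))           # connect to the previously localized node
        self.last = node
        return node

vgm = VisualGraphMemorySketch()
for _ in range(5):
    vgm.step(np.random.rand(128))
```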
2.5 Discussion
This section offers an overview of existing scene graph construction methods in scene graph-based object goal navigation, classified into four categories: the object triplet model, instance triplet model, zone triplet model, and image triplet model.
1) The object triplet model is the most fundamental and commonly used method for constructing the scene graph, representing the co-occurrence relationships among objects within the same observed image. It has evolved from offline construction based on existing datasets to online construction based on object detectors. Moreover, to improve the agent's cognition of spatial relationships, the relational features in the scene graph have evolved from simple binary definitions (0 or 1) to trainable parameters. The object triplet model endows the agent with object-level spatial intelligence, which is suitable for relatively simple navigation scenarios. However, in real scenarios, the agent needs to pinpoint specific instances, and simple categorization may result in failure to correctly locate a specific target.
2) The instance triplet model focuses on real-world navigation tasks, constructing the scene graph from an instance-level perspective, with nodes representing co-occurrence relationships among instance categories rather than objects within the same image. These nodes incorporate attributes such as color and texture to help the agent recognize specific target instances, reflecting instance-level spatial intelligence. Despite its potential, the instance triplet model has seen limited exploration, and further research is clearly needed to enhance its capabilities.
3) The zone triplet model introduces spatial constraints such as rooms and scenes during the construction process. Furthermore, it employs scene graph matching and merging, so that relationships among nodes exhibit inherent regional co-occurrence. This method enhances the inference capabilities of the scene graph compared to the object triplet model and the instance triplet model, showing powerful zone-level spatial intelligence. However, the construction process is computationally intensive, necessitating future research to streamline its complexity. In summary, the zone triplet model holds significant potential for broader development and application.
4) The image triplet model starts directly from more comprehensive observation images, forming the scene graph with images serving as nodes, thereby avoiding the loss of image features. It endows the agent with image-level spatial intelligence and enhances its navigation generalization ability. Nonetheless, the image triplet model demands substantial computational resources, and with limited existing research, there is a pressing need for further work to optimize its efficiency and effectiveness.
Table 1 presents a detailed comparison of these four methods, including their construction methods, advantages and disadvantages, and the difficulty of the navigation tasks to which they are applied. Table 2 provides a detailed summary of scene graph construction, categorizing the relevant works and listing the salient contributions of each.
Table 1 A comparison of scene graph construction methods
Table 2 A summary of scene graph construction
3 Usage of Scene Graph in Object Goal Navigation
The object goal navigation framework encompasses the perception, memory, and decision layers, and scene graphs can be adapted to different network layers. Based on research on the usage of scene graphs in scene graph-based object goal navigation, we classify their usage into multimodal feature fusion, value function estimation, and sparse reward improvement. In this section, we introduce the three usage methods in turn, explaining how the scene graph is integrated with the navigation framework. At the end of this section, the advantages and disadvantages of the three usage methods are briefly analyzed.
3.1 Multimodal Feature Fusion
In the framework of object goal navigation, the perception layer is responsible for processing the agent's visual observations. The quality of the observation features processed by the perception layer greatly influences the final decision-making. Specifically, if the perception layer only extracts a limited number of effective observation features, the agent will be unable to make effective navigation decisions and will struggle to complete the navigation task. Relevant researchers have adopted the multimodal feature fusion of scene graph and visual observations to solve the above problem.
Compared with visual observations (e.g., RGB images), the scene graph contains rich semantic information and spatial relationships. For instance, suppose the agent's current navigation target is a relatively small object (e.g., a mouse); it is then difficult to complete the navigation task solely through first-person RGB images. However, the scene graph establishes relationships between different objects, which can guide the agent to first find the "desk", then the "laptop", and finally the "mouse". The spatial dependencies between objects contained in the scene graph are key to compensating for the insufficient navigation information provided by visual observations.
To leverage the spatial graph features of scene graph, graph neural networks such as Graph Convolutional Networks (GCNs)[51], Graph Transformer Networks (GTNs)[52], and Graph Attention Networks (GATs)[53] are employed. These graph neural networks encode the graph into feature vectors and fuse them with feature vectors from other modalities, bolstering the agent's cognition and memory of the spatial structure of the environment[54].
For instance, GCN extends convolutional neural networks to operate within graph structures, specifically focused on learning graph feature representations. Given the current observed image, Faster R-CNN is utilized to initialize each object node and its relationship. GCN propagates information based on node relationships, computing feature vectors for each node. Subsequently, these feature vectors are merged with image features and target word vector features via multimodal feature fusion. This fused representation is fed into the object goal navigation neural network[41,55-56]. Fusing the spatial relationships of the scene graph with visual observations helps the perception layer extract richer navigation information. Based on the information, the agent can effectively transform visual observations into understanding, make correct navigation behaviors, and gradually form spatial intelligence. Refer to Fig. 8 for the network framework.
Fig. 8 An example of multimodal feature fusion with scene graph in the object goal navigation framework
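A minimal sketch of this fusion step (hypothetical PyTorch code; the single-layer GCN, the mean pooling, and the concatenation-based fusion are illustrative simplifications) propagates node features over the adjacency matrix and concatenates the pooled graph feature with image and target-word features:

```python
import torch
import torch.nn as nn

class GraphFusionSketch(nn.Module):
    """One GCN layer + multimodal concatenation of graph, image, and target-word features."""
    def __init__(self, node_dim=32, img_dim=512, word_dim=300, out_dim=512):
        super().__init__()
        self.gcn_weight = nn.Linear(node_dim, node_dim)
        self.fuse = nn.Linear(node_dim + img_dim + word_dim, out_dim)

    def forward(self, node_feats, adj, img_feat, word_feat):
        # node_feats: (N, node_dim); adj: (N, N); img_feat: (img_dim,); word_feat: (word_dim,)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        graph = torch.relu(self.gcn_weight((adj / deg) @ node_feats))   # normalized propagation
        graph = graph.mean(0)                                            # pool node features
        return self.fuse(torch.cat([graph, img_feat, word_feat], dim=-1))

fused = GraphFusionSketch()(torch.rand(22, 32), torch.rand(22, 22),
                            torch.rand(512), torch.rand(300))
```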
3.2 Value Function Estimation
The object goal navigation framework is an end-to-end DRL neural network. Incorporating the scene graph within the perception layer allows the agent to autonomously train on and learn the multimodal features. However, the reinforcement learning signal enters only at the far end of the model, and its long-distance influence impedes rapid convergence and optimization of navigation strategies[17]. In particular, during the training process of object goal navigation, the agent can only access local environmental information at each time step. Although recurrent neural networks encode global contextual information, they suffer from forgetting, which poses a significant challenge in accurately estimating state values.
To address this challenge, Moghaddam et al[57] proposed a Graph-based Value Estimation (GVE) module utilizing the Graph Transformer Network to capture the global contextual information of objects in the scene graph. The significance of global contextual information lies in the fact that it contains the relationship between the detected objects and the target at each time step. With it, the agent can actively track objects related to the target. For instance, when searching for a specific small target like a "book", the agent might prioritize searching for larger objects such as a "bookshelf" or a "desk". This is because the Graph Transformer Network constantly adds new edges to the scene graph and adjusts their weights at each time step. A higher weight between a relevant object and the navigation target means higher attention. Applying the adjusted scene graph to the decision layer, the related objects show higher estimated state values that are closer to the state value of the target based on the context information. This facilitates the agent in achieving more accurate state value estimations (refer to Fig. 9).
Fig. 9 An example of value function estimation with scene graph in the object goal navigation framework
In general, applying the scene graph to value function estimation helps the agent make positive state estimates related to the navigation target, showing keen insight, which is an intuitive manifestation of the combination of spatial intelligence and reinforcement learning.
3.3 Sparse Reward Improvement
In the reinforcement learning framework of object goal navigation, the agent accumulates rewards through continuous training, striving to maximize these rewards to learn the optimal navigation strategy. In the state space $S$ of the environment, the reward space $R$ remains constant, offering positive rewards solely upon successful arrival at the target, while all other navigation actions incur negative rewards. However, this reward configuration leads to the challenge of sparse rewards. Consequently, this sparsity issue renders the agent insensitive to positions offering positive rewards, fostering a "risk-averse" mindset wherein the agent might prematurely terminate navigation tasks to mitigate negative reward penalties.
To encourage the agent to overcome "risk-averse" thinking and actively explore the environment, Qiu et al[43] introduced the concept of "parent objects" into the scene graph. These "parent objects" are readily observable and share semantic or spatial relationships with the target. For instance, "countertop" serves as a "parent object" in kitchen and bathroom environments, while "shelf" functions as a "parent object" in bedroom and bathroom environments. Based on this concept, a set of "parent objects" denoted as $P$ is derived to modify the original reward space. It aims to reinforce the agent's memory of these readily observable "parent objects" associated with the target and to comprehend the spatial or semantic relationships between them. The rules for reward shaping are as follows: when the agent observes a "parent object" $p$, it receives a partial reward $r_p = \alpha \, r_g \, d_{p,g}$, where $r_g$ is the reward for reaching the target $g$, $\alpha$ represents a scaling factor, and $d_{p,g}$ is extracted from a partial reward matrix $D$, where each row represents the relative distance between "parent objects" and the target. If multiple "parent objects" are visible in the agent's observation, the "parent object" yielding the maximum $r_p$ is chosen. Furthermore, the agent is programmed not to receive this reward upon repeated observations of the same "parent object", encouraging exploration of different "parent objects". If neither the "parent objects" nor the target are visible, the agent incurs a negative reward as a penalty. The new reward space $R'$ is defined as:

$$R'(s_t, a_t) = \begin{cases} r_g + r_p, & \text{"Done" is issued and the target } g \text{ is visible} \\ r_p, & \text{a new "parent object" } p \text{ is observed} \\ r_{\text{penalty}}, & \text{otherwise} \end{cases}$$

When the "Done" action is triggered, it signifies that the termination condition has been met. Meanwhile, if $g$ is visible, the agent receives a goal reward consisting of $r_g$ and $r_p$, allowing it to associate parent objects with the target and the current state. As the navigation framework is an end-to-end network, this shaped reward propagates back to the GCN layers, fine-tuning them to accurately learn the "parent-target" hierarchical relationships from the input scene graph. Through continuous learning of the semantic information and spatial relationships within the scene graph, the reward function designed based on the "parent-target" relationship becomes increasingly precise, which in turn alleviates the problem of reward sparsity.
In contrast to research on manually designing reward functions, Singh et al[49] proposed Scene Graph Contrastive (SGC) learning for training the agent. SGC employs non-parametric scene graphs as a training-only signal in a contrastive learning framework. The core concept of contrastive learning is to reduce the distance between positive sample pairs in the latent space while increasing the distance between negative sample pairs. Concretely, in object goal navigation, the scene graphs and visual observations form the sample pairs. The scene graph is collected from parallel agent rollouts. The nodes of the scene graph include the agent, the room, and the objects, and the edges represent different spatial relationships, such as seen and contained. The agent trained with SGC is able to pick out the most relevant graph from a large number of scene graphs based on the visual observation. The scene graph most relevant to the current visual observation can provide the agent with the most accurate semantic information and spatial relationships, thereby helping the agent make correct navigation decisions. In this way, at each time step, the agent can obtain the most accurate scene graph, which accelerates the training process.
Singh et al[49] evaluated SGC by training the agent on three embodied tasks, object goal navigation, multi-object navigation[58], and arm point navigation[59], and showed performance improvements across all of them. In general, using the scene graph to alleviate the reward sparsity problem essentially enhances the spatial cognition ability of the agent through the graph's rich spatial relationships and semantic information.
3.4 Discussion
This section provides an overview of the existing scene graph usage methods in scene graph-based object goal navigation, categorized into three methods: multimodal feature fusion, value function estimation, and sparse reward improvement.
1) Multimodal feature fusion, the most common usage, involves integrating the scene graph into the perception layer of the object goal navigation network framework. It optimizes graph feature extraction and fusion to enhance the agent's spatial intelligence. However, the approach faces certain challenges. GTN performs the best in extracting graphical features, followed by GAT and GCN. Along with the improved effectiveness, the complexity of the network model increases significantly, which in turn leads to more severe requirements for computing resources. This requires researchers to balance the performance of the network model with the computational complexity, by optimizing the model structure or proposing new graph feature extraction methods.
2) Value function estimation, originating from the decision-making layer of the object goal navigation network framework, optimizes the navigation strategy by refining the value function estimation. The goal is to incorporate relationships between objects and the target into the value function. Using the scene graph for value function estimation in navigation tasks, while providing valuable insights into the relationships between objects and the target, also comes with several limitations. The accuracy of value function estimates depends to some extent on the accuracy and completeness of the scene graph; inaccurate object relationships may lead to inaccurate value function estimates. Meanwhile, the scene graph might include a lot of irrelevant information that can interfere with the learning process and affect decision making. To improve the accuracy of the scene graph, researchers can apply more accurate object detectors. For task-irrelevant information in the scene graph, researchers can use methods such as attention mechanisms to highlight task-related observation state features.
3) Sparse reward improvement, stemming from the Markov decision process underlying object goal navigation, focuses on the challenge of target insensitivity caused by reward sparsity in reinforcement learning. Qiu and Pal[43] explored spatial relationships between the target and other objects within the scene graph, increasing the reward for objects associated with the target. Instead of designing the reward function directly, Singh et al[49] used the scene graph as an auxiliary signal to compute the loss and train the agent by means of contrastive learning. Using the scene graph to address the issue of sparse rewards in reinforcement learning is a promising approach, but it also has potential drawbacks. The design of the scene graph and the tuning of reward functions may be highly dependent on specific tasks, limiting the method's versatility across different tasks. Utilizing contrastive learning as a training method may require a large number of negative samples, which could make the training process cumbersome and inefficient. For sparse reward improvement, researchers can refer to the weight mechanism used by the scene graph to express spatial relationships: when the agent moves towards an area with a higher weight, a higher reward is given, and vice versa, which is beneficial for enhancing the agent's spatial intelligence. Table 3 summarizes the strengths and weaknesses of the three methods.
Table 3 A summary of the three usage methods
4 Experimental Validation of Scene Graph in Object Goal Navigation
4.1 Experimental Setup
Object goal navigation is a critical task for indoor robots. It aims to direct an agent navigating to a specific target object within unseen environments. Researchers in the field have proposed several comprehensive simulators for this task, including Gibson Env[60], Habitat Sim[61], and AI2-THOR[62]. Based on these simulators, the researchers constructed a variety of datasets. The Gibson Env simulator supports the Gibson dataset. The Habitat Sim is compatible with the Matterport3D[63], Gibson and Habitat Matterport 3D dataset[64]. The AI2-THOR simulator is paired with the iTHOR, RoboTHOR[65] and ProcTHOR dataset[66]. Each dataset offers unique challenges and insights, contributing to the advancement of object goal navigation.
Focusing on scene graph-based object goal navigation, most researchers chose AI2-THOR as the simulator, using the iTHOR dataset it provided as a benchmark for training and testing. The dataset is fully open source. AI2-THOR, constructed using Unity3D, provides a collection of over 120 rooms, including kitchens, living rooms, bedrooms, and bathrooms (30 rooms per room type), and hosts a catalog of more than 2 000 objects (refer to Fig. 10). For evaluating navigation performance across various scenes for training, validation, and testing, the initial 20 rooms of each room type constitute the training set, followed by 5 rooms for validation, and concluding with 5 rooms for testing.
Fig. 10 An example of AI2-THOR simulation environment
The navigation task is set as follows: in each episode, the agent is randomly initialized in position and orientation in an unknown environment, and then receives a target (e.g., a television) as a GloVe embedding[67]. The agent utilizes observations from a monocular RGB camera to explore, perceive, and memorize the navigation environment. It selects navigation actions from the action space $A = \{\text{MoveAhead}, \text{RotateLeft}, \text{RotateRight}, \text{LookUp}, \text{LookDown}, \text{Done}\}$. More specifically, the MoveAhead action advances the agent by a step length of 0.25 m, the RotateLeft and RotateRight actions turn the agent by 45 degrees, the LookUp and LookDown actions adjust the view angle by 30 degrees, and Done denotes the completion action. A successful navigation episode is characterized by the following conditions: 1) The agent issues the Done action within the maximum allowed time steps (set to 100 steps for kitchens, bedrooms, and bathrooms, and 200 steps for living rooms due to varying room sizes). 2) The target is within the agent's field of view. 3) The distance between the target and the agent is less than 1.5 m. Furthermore, the reward space $R$ is defined as follows: a negative reward of -0.01 is assigned for each step taken, a negative reward of -0.10 is given in case of collision, and a positive reward of +10.00 is assigned for a successfully completed navigation task. To evaluate and validate the enhancement in the generalization performance of navigation due to the scene graph, the targets are set as follows (refer to Table 4).
Table 4 The setup of targets for each room
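The episode setup above can be summarized in a small configuration sketch (values taken from the description; the helper function and dictionary keys are illustrative):

```python
ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown", "Done"]
STEP_REWARD, COLLISION_REWARD, SUCCESS_REWARD = -0.01, -0.10, 10.00
MAX_STEPS = {"kitchen": 100, "bedroom": 100, "bathroom": 100, "living_room": 200}

def episode_success(done_issued: bool, steps: int, room_type: str,
                    target_visible: bool, target_distance: float) -> bool:
    """Success: Done issued within the step budget, target visible, and within 1.5 m."""
    return (done_issued and steps <= MAX_STEPS[room_type]
            and target_visible and target_distance < 1.5)
```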
4.2 Metrics
Table 5 summarizes the simulators, datasets, and metrics used in the related works. According to Table 5, the evaluation metrics consist of three primary measures[22]: 1) Success Rate (SR) denotes the ratio of episodes in which the agent successfully locates the target. It is defined as $\text{SR} = \frac{1}{N}\sum_{i=1}^{N} S_i$, where $N$ represents the number of navigation episodes, and $S_i$ represents the success or failure of navigation episode $i$ ($S_i = 1$ denotes success, $S_i = 0$ denotes failure); 2) Success Weighted by Path Length (SPL) signifies the efficiency of the agent in locating the target based on the path length taken. It is defined as $\text{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \frac{L_i^{*}}{\max(L_i, L_i^{*})}$, where $L_i$ denotes the path length of navigation episode $i$, and $L_i^{*}$ represents the optimal path length for navigation episode $i$; 3) Success Weighted by Action Efficiency (SAE) signifies the efficiency of all actions. It is defined as $\text{SAE} = \frac{1}{N}\sum_{i=1}^{N} S_i \frac{\sum_{t} \mathbb{1}\left(a_t^{i} \in A_{\text{move}}\right)}{\sum_{t} \mathbb{1}\left(a_t^{i} \in A\right)}$, where $N$ represents the total number of episodes, $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the condition inside is true and 0 otherwise, $a_t^{i}$ is the action taken by the agent at time step $t$ in episode $i$, $A$ is the set of all possible action categories, and $A_{\text{move}} \subset A$ is the subset of actions that can change the agent's location.
Table 5 A summary of the simulator, dataset and metric used in the related works
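For reference, the three metrics can be computed from per-episode records as in the sketch below. These are our own helpers; the record fields success, path_len, opt_len, and actions are illustrative names, and MoveAhead is assumed to be the only location-changing action.

```python
# Sketch of SR, SPL and SAE over a list of episode records (our own helpers).
# Each record: success (0 or 1), path_len (steps taken), opt_len (optimal path
# length), actions (sequence of action names taken in the episode).

MOVE_ACTIONS = {"MoveAhead"}  # assumed subset of actions that change location

def success_rate(episodes):
    """SR: fraction of successful episodes."""
    return sum(e["success"] for e in episodes) / len(episodes)

def spl(episodes):
    """SPL: success weighted by the ratio of optimal to taken path length."""
    return sum(e["success"] * e["opt_len"] / max(e["path_len"], e["opt_len"])
               for e in episodes) / len(episodes)

def sae(episodes):
    """SAE: among successful episodes, the fraction of actions that moved the agent."""
    total = 0.0
    for e in episodes:
        moves = sum(a in MOVE_ACTIONS for a in e["actions"])
        total += e["success"] * moves / max(len(e["actions"]), 1)
    return total / len(episodes)
```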
4.3 Comparison Models
We select classic models from recent years, with and without scene graph, and train and test them in AI2-THOR. The selected models are as follows:
1) Random: The agent randomly samples actions from the action space.
2) TD-A3C: Zhu et al[2] proposed training the agent using the A3C reinforcement learning algorithm, which is a fundamental model.
3) Bayes: Wu et al[68] introduced the variational Bayesian inference to assist the agent in envisioning the next time step's observation image for action prediction.
4) SAVN: Wortsman et al[69] proposed using Model-Agnostic Meta-Learning (MAML) to enhance the interaction capabilities between the agent and environment.
5) SpAtt: Mayo et al[70] introduced spatial attention mechanisms to guide the agent's focus toward areas relevant to the target.
The above are object goal navigation models without scene graph. Below, we list the models that utilized scene graph.
6) SP (Scene Priors): Yang et al[25] were the first to introduce the object triplet model for multimodal feature fusion, enhancing the agent's spatial memory of the environment.
7) GVE: Moghaddam et al[57] introduced the GVE module, assisting the agent in more accurately estimating state values.
8) VGM: Kwon et al[39] introduced the image triplet model, creating a VGM module for online training and reinforcement of memory.
9) ORG: Du et al[27] proposed the Object Relation Graph (ORG) mentioned above, trial-driven Imitation Learning (IL), and a memory-augmented Tentative Policy Network (TPN). The TPN offers explicit action instructions during failure steps to update the navigation network.
10) HOZ: Zhang et al[34] proposed the zone triplet model, establishing a hierarchical object-zone graph, allowing the agent to locate the region and then the target, which aids in navigation reasoning.
11) DOA: Dang et al[31] introduced an unbiased DOA graph. The model learns DOA graph online as scene graph, providing a unique and adaptable representation of object relationships.
12) ABM: Luo et al[48] presented a model equipped with an ARG for acquiring the structure of new scenes and retaining a graph memory to facilitate the selection of effective actions in Active Behavior Modeling (ABM).
13) VTNet: Du et al[71] introduced a visual transformer network, known as VTNet, designed to enhance visual representations by encoding the interactions between object instances and their spatial positions. In particular, VTNet uses DETR[72] (an object detector) rather than Faster-RCNN to extract target features to construct the scene graph.
14) GTV+MTL: Zhou et al[73] constructed a Commonsense Graph (CG) by leveraging ground-truth data from AI2-THOR to align the scene graph constructed online. A Graph Transformer Network (GTN) serves as an encoder to extract global context features of the CG, while the Viterbi algorithm[74] serves as a decoder to infer the probability of the next navigation action. Finally, the inferred action probability is combined with the backbone network's predicted action probability as a supervisory signal for Meta-Transfer Learning (MTL)[75], aiming to attain the optimal strategy.
See Table 6 for a comparison of all models.
Table 6 The comparison of object goal navigation models
4.4 Results
Under the AI2-THOR environment, each model was trained for 10 million episodes. The test results on the testing set are shown in Table 7, where "ALL" denotes all navigation path lengths, and "L≥5" denotes episodes whose optimal navigation path is at least 5 time steps. Since only a few works use SAE as a metric, for fairness we choose SR and SPL as the metrics here.
Based on the experimental results in Table 7, we make the following analysis:
1) Effect of meta-learning. Compared with Random, TD-A3C, and Bayes, models such as SAVN and SpAtt demonstrate better adaptability to unfamiliar environments at test time by employing meta-learning, even without a scene graph. Among them, the SpAtt model achieves significant improvements, with SR and SPL reaching 43.2% and 16.9%, respectively. Consequently, scene graph-based models such as GVE, ABM, and GTV+MTL also utilize meta-learning or meta-transfer learning when training agents to enhance their navigation generalization capabilities.
2) Effect of scene graph. SP, a basic scene knowledge-based model that uses GCN to extract graph features, outperforms models lacking scene graph such as Random, TD-A3C, and Bayes. It endows the agent with the most basic object-level spatial intelligence. However, it simply concatenates features from the scene graph without sufficiently extracting spatial context features, and is therefore less effective than SAVN. In contrast to SP, the GVE module utilizes the Graph Transformer Network to capture global contextual information about objects within the environment, helping the agent achieve more accurate state value estimation. Thus, GVE reaches the same level of effectiveness as the SpAtt model. Unlike other models that focus on the relationships between objects, VGM constructs the scene graph directly from the observation images. It utilizes the PCL method to learn feature representations for the graph. Nodes consisting of raw image features preserve more valuable information, leading to better navigation performance than SP and GVE. This shows that an image-based scene graph gives the agent better spatial intelligence than a basic object-level scene graph.
The scene graphs of both SP and GVE are constructed from the Visual Genome dataset and cannot be updated in real time, limiting the generalization ability of the agent. Therefore, models such as ORG, HOZ, and DOA use object detectors to construct the scene graph from the navigation environment. ORG pioneered the use of Faster-RCNN to detect objects in images for constructing the scene graph, increasing SR to 67.3% and SPL to 39.5%. The experiments thus show that a scene graph built with an object detector endows the agent with more powerful object-level spatial intelligence. Notably, HOZ, which also constructs its scene graph with Faster-RCNN, achieves remarkable results, suggesting that the room-level and scene-level constraints in the zone triplet model are more advantageous for extracting spatial and semantic relationships than the object triplet model. Both VTNet and ABM use DETR to recognize target features and further pre-train them with a transformer to obtain more precise spatial relationships. As a result, both models outperform ORG and HOZ. This suggests that a scene graph constructed from the navigation environment provides the agent with more accurate spatial relationships between objects.
GTV+MTL, which couples the Graph Transformer Viterbi network with Meta-Transfer Learning, notably enhances navigation performance by integrating the scene graph with diverse modules and algorithms, standing out as an innovative approach that harnesses the power of the scene graph. The model provides efficient learning signals and generates new graph edges in previously unseen scenes, enabling the agent to learn policies faster and exhibit enhanced generalization ability. Overall, with the continuous improvement of scene graph construction and usage methods, the agent obtains rich spatial relationships and semantic information. On this basis, the agent learns how to transform simple visual observations into rich insights and understanding, make correct navigation actions, and thus enhance its generalization ability in unfamiliar environments.
Table 7 The experimental results of object goal navigation models (unit: %)
5 Conclusion
Following an investigation and analysis of scene graph-based object goal navigation, alongside its formal description and computational framework, this study consolidates the role of the scene graph, encompassing its construction, usage, and experimental validation. Construction methods include the object triplet model, the instance triplet model, the zone triplet model, and the image triplet model. Usage involves multimodal feature fusion, value function estimation, and sparse reward improvement. Through experimental comparison, this study analyzes scene graph-based methods against those not utilizing scene graph, confirming the performance enhancement achieved when the scene graph endows the agent with spatial intelligence. This research aims to underscore the significance of the scene graph in navigation while strengthening its construction and utilization. Below, we list directions for future research.
1) Construction methodology: The current mainstream construction methods are based on the object detectors Faster-RCNN or DETR. Compared to the former, DETR extracts better spatial relations but requires pre-training with transformers, resulting in much higher computational costs. Therefore, balancing detection quality against computational cost deserves emphasis when constructing the scene graph. Introducing novel node types, establishing new relationships, and applying innovative spatial or semantic constraints could improve navigation performance and generalize it across diverse navigation tasks.
2) Introducing novel memory networks: In object goal navigation, LSTM is commonly used as the memory network to store the spatial relationships and semantic information provided by the scene graph, and alternatives are worth further exploration. LSTM has limitations, such as difficulty with parallel processing and degraded performance on long sequences, so the spatial intelligence provided by the scene graph is not fully exploited. The Segment Anything Model (SAM)[76] is an advanced model with zero-shot generalization capability, allowing it to identify unfamiliar objects and images without additional training. These capabilities hold potential value for object goal navigation. Introducing SAM, especially in combination with the advantages of the transformer architecture, may offer a new solution.
3) Combining scene graph with Large Language Models (LLMs): Expanding scene graph-based object goal navigation from indoor to outdoor scenarios demands a universal model, posing challenges in data volume and intricate navigation tasks. Using LLMs to directly construct the scene graph for outdoor scenarios is a viable approach to this problem[77]. In addition, leveraging the powerful reasoning capabilities of LLMs helps to fully explore the semantic and spatial relationships in the scene graph, endowing the agent with more powerful spatial intelligence, which is one of the most promising research directions for the future.
4) Bridging the sim-to-real gap based on scene graph: There is a large sim-to-real gap in DRL-based object goal navigation. Gervet et al[78] identified the divergence in RGB image quality between virtual and real environments as a principal contributing factor. Consequently, the degree of realism of the virtual environment is of paramount importance for a seamless transition from the virtual scene to the real world. The Isaac simulation platform[79] provides a nearly photorealistic virtual environment, which markedly enhances the quality of observed images. A scene graph constructed on this basis is expected to mitigate the sim-to-real gap in object goal navigation. This line of work focuses on helping object goal navigation move from virtual to real environments and truly realize spatial intelligence grounded in real life, which is of great significance to the development of intelligent navigation.
5) Quality assessment of scene graph: Evaluating scene graph comprehensively across various domains like knowledge modeling, storage, extraction, fusion, computation, usage, and assessment within the context of object goal navigation could enrich the understanding of their quality.
References
- Wirtz J, Hofmeister J, Chew P Y P, et al. Digital service technologies, service robots, AI, and the strategic pathways to cost-effective service excellence[J]. The Service Industries Journal, 2023, 43(15/16): 1173-1196. [Google Scholar]
- Zhu Y K, Mottaghi R, Kolve E, et al. Target-driven visual navigation in indoor scenes using deep reinforcement learning[C]//2017 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2017: 3357-3364. [Google Scholar]
- Li F, Guo C, Luo B H, et al. Multi goals and multi scenes visual mapless navigation in indoor using meta-learning and scene priors[J]. Neurocomputing, 2021, 449: 368-377. [Google Scholar]
- Li F, Guo C, Zhang H Y, et al. Context vector-based visual mapless navigation in indoor using hierarchical semantic information and meta-learning[J]. Complex & Intelligent Systems, 2023, 9(2): 2031-2041. [Google Scholar]
- Sun J W, Wu J, Ji Z, et al. A survey of object goal navigation[J]. IEEE Transactions on Automation Science and Engineering, 2025, 22: 2292-2308. [Google Scholar]
- Li C S, Zhang R H, Wong J, et al. BEHAVIOR-1K: A benchmark for embodied AI with 1 000 everyday activities and realistic simulation[C]//Conference on Robot Learning. New York: PMLR, 2022: 80-93. [Google Scholar]
- Shadbolt N, Berners-Lee T, Hall W. IFIP International Conference on E-Business, E-Services, and E-Society[M]. Cham: Springer-Verlag, 2003. [Google Scholar]
- Ji S X, Pan S R, Cambria E, et al. A survey on knowledge graphs: Representation, acquisition, and applications[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(2): 494-514. [Google Scholar]
- Guo C, Luo B, Li F, et al. Review and verification for brain-like navigation algorithm[J]. Geomatics and Information Science of Wuhan University, 2021, 46(12): 1819-1831. [Google Scholar]
- Epstein R A, Patai E Z, Julian J B, et al. The cognitive map in humans: Spatial navigation and beyond[J]. Nature Neuroscience, 2017, 20(11): 1504-1513. [Google Scholar]
- Sosa M, Giocomo L M. Navigating for reward[J]. Nature Reviews Neuroscience, 2021, 22(8): 472-487. [Google Scholar]
- Ambrose R E, Pfeiffer B E, Foster D J. Reverse replay of hippocampal place cells is uniquely modulated by changing reward[J]. Neuron, 2016, 91(5): 1124-1136. [Google Scholar]
- Bhattarai B, Lee J W, Jung M W. Distinct effects of reward and navigation history on hippocampal forward and reverse replays[J]. Proceedings of the National Academy of Sciences of the United States of America, 2020, 117(1): 689-697. [Google Scholar]
- Banino A, Barry C, Uria B, et al. Vector-based navigation using grid-like representations in artificial agents[J]. Nature, 2018, 557: 429-433. [Google Scholar]
- Rummery G A, Niranjan M. On-line Q-learning Using Connectionist Systems[M]. Cambridge: University of Cambridge, 1994. [Google Scholar]
- Mnih V, Kavukcuoglu K, Silver D, et al. Playing atari with deep reinforcement learning[EB/OL]. [2013-12-19]. http://arxiv.org/abs/1312.5602. [Google Scholar]
- Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3): 229-256. [Google Scholar]
- Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[EB/OL]. [2016-06-16]. http://arxiv.org/abs/1602.01783. [Google Scholar]
- Devo A, Mezzetti G, Costante G, et al. Towards generalization in target-driven visual navigation by using deep reinforcement learning[J]. IEEE Transactions on Robotics, 2020, 36(5): 1546-1561. [Google Scholar]
- Gupta S, Tolani V, Davidson J, et al. Cognitive mapping and planning for visual navigation[J]. International Journal of Computer Vision, 2020, 128(5): 1311-1330. [Google Scholar]
- Mousavian A, Toshev A, Fišer M, et al. Visual representations for semantic target driven navigation[C]//2019 International Conference on Robotics and Automation (ICRA). New York: IEEE, 2019: 8846-8852. [Google Scholar]
- Anderson P, Chang A, Chaplot D S, et al. On evaluation of embodied navigation agents[EB/OL]. [2018-07-18]. http://arxiv.org/abs/1807.06757. [Google Scholar]
- He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 770-778. [Google Scholar]
- Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. [CrossRef] [Google Scholar]
- Yang W, Wang X L, Farhadi A, et al. Visual semantic navigation using scene priors[EB/OL]. [2018-10-15]. http://arxiv.org/abs/1810.06543. [Google Scholar]
- Krishna R, Zhu Y K, Groth O, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32-73. [Google Scholar]
- Du H M, Yu X, Zheng L. Learning object relation graph and tentative policy for visual navigation[C]//European Conference on Computer Vision. Cham: Springer, 2020: 19-34. [Google Scholar]
- Ren S Q, He K M, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. [CrossRef] [Google Scholar]
- Bochkovskiy A, Wang C Y, Liao H Y M. YOLOv4: Optimal speed and accuracy of object detection[EB/OL]. [2020-04-23]. http://arxiv.org/abs/2004.10934. [Google Scholar]
- Yang J W, Lu J S, Lee S, et al. Graph R-CNN for scene graph generation[C]//European Conference on Computer Vision. Cham: Springer-Verlag, 2018: 690-706. [Google Scholar]
- Dang R H, Shi Z F, Wang L Y, et al. Unbiased directed object attention graph for object navigation[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3617-3627. [Google Scholar]
- Hu X B, Lin Y F, Wang S, et al. Agent-centric relation graph for object visual navigation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(2): 1295-1309. [Google Scholar]
- Li W, Song X, Bai Y, et al. ION: Instance-level object navigation[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 4343-4352. [Google Scholar]
- Zhang S X, Song X H, Bai Y B, et al. Hierarchical object-to-zone graph for object navigation[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2021: 15110-15120. [Google Scholar]
- Izadinia H, Sadeghi F, Farhadi A. Incorporating scene context and object layout into appearance modeling[C]//2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2014: 232-239. [Google Scholar]
- Zuo Z, Shuai B, Wang G, et al. Learning contextual dependence with convolutional hierarchical recurrent neural networks[J]. IEEE Transactions on Image Processing, 2016, 25(7): 2983-2996. [Google Scholar]
- Kuhn H W. The Hungarian method for the assignment problem[J]. Naval Research Logistics Quarterly, 1955, 2(1/2): 83-97. [Google Scholar]
- Munkres J. Algorithms for the assignment and transportation problems[J]. Journal of the Society for Industrial and Applied Mathematics, 1957, 5(1): 32-38. [Google Scholar]
- Kwon O, Kim N, Choi Y, et al. Visual graph memory with unsupervised representation for visual navigation[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2021: 15870-15879. [Google Scholar]
- Li J N, Zhou P, Xiong C M, et al. Prototypical contrastive learning of unsupervised representations[EB/OL]. [2021-03-30]. http://arxiv.org/abs/2005.04966. [Google Scholar]
- Nguyen T L, Nguyen D V, Le T H. Reinforcement learning based navigation with semantic knowledge of indoor environments[C]//2019 11th International Conference on Knowledge and Systems Engineering (KSE). New York: IEEE, 2019: 1-7. [Google Scholar]
- Zhou K, Guo C, Zhang H Y. Visual navigation via reinforcement learning and relational reasoning[C]//2021 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI). New York: IEEE, 2021: 131-138. [Google Scholar]
- Qiu Y D, Pal A, Christensen H I. Learning hierarchical relationships for object-goal navigation[EB/OL]. [2020-11-18]. http://arxiv.org/abs/2003.06749. [Google Scholar]
- Zhou K, Guo C, Guo W F, et al. Learning heterogeneous relation graph and value regularization policy for visual navigation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(11): 16901-16915. [Google Scholar]
- Lyu Y L, Talebi M S. Double graph attention networks for visual semantic navigation[J]. Neural Processing Letters, 2023, 55(7): 9019-9040. [Google Scholar]
- Dang R H, Wang L Y, He Z T, et al. Search for or navigate to dual adaptive thinking for object navigation[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2023: 8216-8225. [Google Scholar]
- Yang B J, Yuan X F, Ying Z M, et al. HOGN-TVGN: Human-inspired embodied object goal navigation based on time-varying knowledge graph inference networks for robots[J]. Advanced Engineering Informatics, 2024, 62: 102671. [Google Scholar]
- Luo J, Cai B, Yu Y X, et al. Learning multimodal adaptive relation graph and action boost memory for visual navigation[J]. Advanced Engineering Informatics, 2024, 62: 102678. [Google Scholar]
- Singh K P, Salvador J, Weihs L, et al. Scene graph contrastive learning for embodied navigation[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2023: 10850-10860. [Google Scholar]
- Xu N, Wang W, Yang R, et al. Aligning knowledge graph with visual perception for object-goal navigation[EB/OL]. [2024-04-26]. http://arxiv.org/abs/2402.18892. [Google Scholar]
- Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks[EB/OL]. [2017-02-22]. http://arxiv.org/abs/1609.02907. [Google Scholar]
- Yun S, Jeong M, Kim R, et al. Graph transformer networks[EB/OL]. [2020-02-05]. http://arxiv.org/abs/1911.06455. [Google Scholar]
- Veličković P, Cucurull G, Casanova A, et al. Graph attention networks[EB/OL]. [2018-02-04]. http://arxiv.org/abs/1710.10903. [Google Scholar]
- Moghaddam M M K, Wu Q, Abbasnejad E, et al. Utilising prior knowledge for visual navigation: Distil and adapt[EB/OL]. [2020-12-06]. http://arxiv.org/abs/2004.03222. [Google Scholar]
- Lu Y, Chen Y R, Zhao D B, et al. MGRL: Graph neural network based inference in a Markov network with reinforcement learning for visual navigation[J]. Neurocomputing, 2021, 421: 140-150. [Google Scholar]
- Lyu Y L, Shi Y M, Zhang X G. Improving target-driven visual navigation with attention on 3D spatial relationships[J]. Neural Processing Letters, 2022, 54(5): 3979-3998. [Google Scholar]
- Moghaddam M K, Wu Q, Abbasnejad E, et al. Optimistic agent: Accurate graph-based value estimation for more successful visual navigation[C]//2021 IEEE Winter Conference on Applications of Computer Vision (WACV). New York: IEEE, 2021: 3732-3741. [Google Scholar]
- Wani S, Patel S, Jain U, et al. MultiON: Benchmarking semantic map memory using multi-object navigation[EB/OL]. [2020-12-07]. http://arxiv.org/abs/2012.03912. [Google Scholar]
- Ehsani K, Han W, Herrasti A, et al. ManipulaTHOR: A framework for visual object manipulation[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2021: 4495-4504. [Google Scholar]
- Xia F, Zamir A R, He Z Y, et al. Gibson Env: Real-world perception for embodied agents[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 9068-9079. [Google Scholar]
- Savva M, Kadian A, Maksymets O, et al. Habitat: A platform for embodied AI research[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2019: 9338-9346. [Google Scholar]
- Kolve E, Mottaghi R, Han W, et al. AI2-THOR: An interactive 3D environment for visual AI[EB/OL]. [2022-08-26]. http://arxiv.org/abs/1712.05474. [Google Scholar]
- Chang A, Dai A, Funkhouser T, et al. Matterport3D: Learning from RGB-D data in indoor environments[EB/OL]. [2017-09-18]. http://arxiv.org/abs/1709.06158. [Google Scholar]
- Ramakrishnan S K, Gokaslan A, Wijmans E, et al. Habitat-matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI[EB/OL]. [2021-09-16]. http://arxiv.org/abs/2109.08238. [Google Scholar]
- Deitke M, Han W, Herrasti A, et al. RoboTHOR: An open simulation-to-real embodied AI platform[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2020: 3161-3171. [Google Scholar]
- Deitke M, VanderBilt E, Herrasti A, et al. ProcTHOR: Large-scale embodied AI using procedural generation[EB/OL]. [2022-06-14]. http://arxiv.org/abs/2206.06994. [Google Scholar]
- Pennington J, Socher R, Manning C. Glove: Global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA: Association for Computational Linguistics, 2014: 1532-1543. [Google Scholar]
- Wu Q Y, Manocha D, Wang J, et al. NeoNav: Improving the generalization of visual navigation via generating next expected observations[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(6): 10001-10008. [Google Scholar]
- Wortsman M, Ehsani K, Rastegari M, et al. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2019: 6743-6752. [Google Scholar]
- Mayo B, Hazan T, Tal A. Visual navigation with spatial attention[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2021: 16893-16902. [Google Scholar]
- Du H M, Yu X, Zheng L. VTNet: Visual transformer network for object goal navigation[EB/OL]. [2021-05-20]. http://arxiv.org/abs/2105.09447. [Google Scholar]
- Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers[C]//European Conference on Computer Vision. Cham: Springer-Verlag, 2020: 213-229. [Google Scholar]
- Zhou K, Guo C, Zhang H Y, et al. Optimal graph transformer viterbi knowledge inference network for more successful visual navigation[J]. Advanced Engineering Informatics, 2023, 55: 101889. [Google Scholar]
- Forney G D. The Viterbi algorithm[J]. Proceedings of the IEEE, 1973, 61(3): 268-278. [Google Scholar]
- Sun Q R, Liu Y Y, Chua T S, et al. Meta-Transfer learning for few-shot learning[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2019: 403-412. [Google Scholar]
- Kirillov A, Mintun E, Ravi N, et al. Segment anything[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2023: 3992-4003. [Google Scholar]
- Strader J, Hughes N, Chen W, et al. Indoor and outdoor 3D scene graph generation via language-enabled spatial ontologies[J]. IEEE Robotics and Automation Letters, 2024, 9(6): 4886-4893. [Google Scholar]
- Gervet T, Chintala S, Batra D, et al. Navigating to objects in the real world[J]. Science Robotics, 2023, 8(79): eadf6991. [Google Scholar]
- NVIDIA. Nvidia Isaac Sim: Robotics simulation and synthetic data[EB/OL]. [2023-10-25]. https://developer.nvidia.com/isaac-sim. [Google Scholar]