In this section, we extensively evaluate Haisor on scene optimization, together with its extension. We compare it with four baselines: a random agent, a heuristics agent (see the rules in Section 6.2), the Sync2Gen-Opt agent [Yang et al. 2021], and a simulated annealing agent, in terms of visualization, quantitative metrics, and a perceptual study, demonstrating the strong performance of our method. Key design choices are further validated via an ablation study.
7.1 Dataset Preparation
Training Data. To our knowledge, no 3D indoor scene dataset with movable parts is available. 3D-FRONT [Fu et al. 2021] is a currently available 3D indoor scene dataset created by professional designers. We take the scenes generated by the underlying generators (see the next section) and replace the 3D-FUTURE [Fu et al. 2021] meshes with PartNet [Mo et al. 2019] and PartNet-Mobility [Xiang et al. 2020] meshes. For each 3D-FUTURE mesh, we retrieve the 10 meshes from PartNet with the smallest Chamfer Distance [Barrow et al. 1977] and randomly replace the 3D-FUTURE mesh in the scene with one of them.
Note that this simple mesh-replacement approach may produce many scenes containing objects with inappropriate movable parts; for example, nearly half of the scenes have cabinet doors that cannot be opened. Since most indoor scene generation pipelines select furniture by the scale of its bounding box, our replaced furniture simulates the scenes generated by such methods.
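The retrieval step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names are ours, and sampling a point cloud from each mesh surface is assumed to happen elsewhere.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(a, b):
    """Symmetric Chamfer distance [Barrow et al. 1977] between (N, 3) point sets."""
    d_ab, _ = cKDTree(b).query(a)   # nearest neighbour in b for each point of a
    d_ba, _ = cKDTree(a).query(b)   # and vice versa
    return d_ab.mean() + d_ba.mean()

def retrieve_top_k(query_points, candidate_sets, k=10):
    """Indices of the k candidate meshes closest to the query under Chamfer distance."""
    dists = [chamfer_distance(query_points, c) for c in candidate_sets]
    return np.argsort(dists)[:k]
```

A replacement mesh would then be drawn uniformly at random from the returned top-10 indices.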
7.2 Baselines
Scene Generators. To test our optimization framework, we utilize two SoTA scene generators: Sync2Gen [Yang et al. 2021] and ATISS [Paschalidou et al. 2021]. Both are trained on the 3D-FRONT dataset and generate layouts in the 3D box representation. Following these two works, we then retrieve meshes from the PartNet dataset based on the sizes of the generated boxes. The generated scenes are then fed into the different scene optimization algorithms to be evaluated. Considering the scale of the scenes to be optimized, the maximum optimization step count of the random, heuristic, and Haisor agents in the following experiments is limited to 80 steps.
In our experiments, we perform the comparison with four agents: the random agent, the heuristics agent, the Sync2Gen-Opt agent, and the simulated annealing agent:
(1)
Random Agent: The agent randomly selects an action to perform from the action space.
(2)
Heuristic Agent: Our goal is to optimize the scene so that it satisfies multiple criteria (e.g., no collision, more free space) simultaneously. Instead of our Haisor agent, we can use an agent driven by simple heuristics identical to the rules described in Section 6.2.
(3)
Sync2Gen-Opt agent [Yang et al. 2021]: The Sync2Gen framework consists of two steps: it first generates an initial prediction of the scene layout with a Variational Autoencoder and then optimizes the prediction with L-BFGS [Liu and Nocedal 1989], a quasi-Newton method. The target function is defined via Bayesian theory and optimizes the translation and existence of the furniture objects. This “generate-optimize” process of Sync2Gen is similar to our setting. For a fair comparison, we only use the VAE part of Sync2Gen (denoted Sync2Gen-VAE) to generate the initial scene layout. We modify the prior of their Bayesian target function to take human free space of activity and collisions between objects into account, and remove the object-existence term (i.e., prohibit the agent from adding or removing objects); the modified agent is denoted Sync2Gen-Opt.
(4)
Simulated Annealing Agent: The pipeline of Qi et al. [2018] also formulates layout optimization with a set of criteria, but uses simulated annealing as the optimizer. We do not fully reproduce their method, because they optimize the scene from a totally random initial state. Instead, we use the criteria of our setting (trained regressor, human affordance, etc.) and substitute the RL-MCTS agent with simulated annealing.
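For reference, a generic simulated-annealing loop over layouts could look like the sketch below. This is our illustration, not the implementation of Qi et al. [2018]: the function names, cooling schedule, and step budget are assumptions, and `score` stands for the combined criteria (trained regressor, collision, human affordance).

```python
import math
import random

def simulated_annealing(state, score, propose, steps=80, t0=1.0, cooling=0.95):
    """Maximize score(state) by randomly perturbing the layout.

    `propose(state)` returns a perturbed layout; worse candidates are
    accepted with Boltzmann probability exp(delta / t), which decays as
    the temperature t cools.
    """
    best = cur = state
    t = t0
    for _ in range(steps):
        cand = propose(cur)
        delta = score(cand) - score(cur)
        # Always accept improvements; occasionally accept worse moves.
        if delta >= 0 or random.random() < math.exp(delta / t):
            cur = cand
        if score(cur) > score(best):
            best = cur
        t *= cooling
    return best
```

Unlike our RL-MCTS agent, such a loop has no learned prior over actions, which is one reason it tends to scatter furniture rather than form meaningful regions.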
In addition, to show the changes relative to the originally generated scenes, we also report the metrics measured on the original scenes.
7.3 Metrics
To evaluate the performance quantitatively, we adopt three kinds of metrics: accuracy, collision, and human affordance [Qi et al.
2018].
Accuracy. To measure the overall realism of the optimized scenes, a ResNet18 [He et al. 2016] is trained to regress a score for each top-down rendering of a generated scene. The training data consists of top-down renderings of randomly perturbed 3D-FRONT scenes together with scores between 0 and 1, assigned as follows. First, we define a constant distance \(d_{max}\) for every type of room, which bounds the maximum distance of furniture from its original position; it is 0.35 m for bedrooms and 1.00 m for living rooms. Second, we pick a random value between 0 and 1, denoted \(d_{rand}\). Third, we move every furniture object in a random direction along the ground by a distance of \(d_{rand} \cdot d_{max}\). The score of the scene is then assigned as \(1 - d_{rand}\).
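The labelling scheme above can be sketched in a few lines. This is an illustrative helper, not the paper's code: we represent a scene simply as a list of (x, z) ground positions, and we draw an independent direction per object (the text says only "a random direction along the ground").

```python
import math
import random

D_MAX = {"bedroom": 0.35, "living_room": 1.00}  # metres, as in the text

def perturb_and_score(scene, room_type, rng=random):
    """Translate every object by d_rand * d_max and label the scene 1 - d_rand.

    One d_rand is drawn per scene; each object gets its own random direction
    on the floor plane. `scene` is a list of (x, z) positions for illustration.
    """
    d_rand = rng.random()
    d = d_rand * D_MAX[room_type]
    out = []
    for (x, z) in scene:
        theta = rng.uniform(0.0, 2.0 * math.pi)  # random direction on the ground
        out.append((x + d * math.cos(theta), z + d * math.sin(theta)))
    return out, 1.0 - d_rand
```

An unperturbed scene thus receives a score near 1, while a maximally displaced one receives a score near 0.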
The rendered images have \(C + 2\) channels of three types, where \(C\) is the number of object categories: (a) floor channel: takes the value 255 for pixels inside the floor; (b) layout channel: takes the value 255 for pixels inside any furniture object; (c) category channels: one channel per object category, taking the value 255 for pixels inside any furniture object of that category. All other pixels are assigned the value 0.
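Building the multi-channel top-down image described above might look like the following sketch (the mask-based input format is our assumption; the paper's renderer may rasterize geometry directly).

```python
import numpy as np

def render_channels(floor_mask, object_masks, object_categories, num_categories):
    """Stack one floor channel, one layout channel, and one channel per category.

    floor_mask: (H, W) bool; object_masks: one (H, W) bool mask per object;
    object_categories: category index per object.
    """
    H, W = floor_mask.shape
    img = np.zeros((num_categories + 2, H, W), dtype=np.uint8)
    img[0][floor_mask] = 255                  # (a) floor channel
    for mask, cat in zip(object_masks, object_categories):
        img[1][mask] = 255                    # (b) layout channel (any furniture)
        img[2 + cat][mask] = 255              # (c) per-category channel
    return img
```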
Note that both the reward given by the simulation environment and the “accuracy” reported in the experiments below use this network; in practice, however, we train two networks of the same architecture on different sets of 3D-FRONT scene data: we randomly split the 3D-FRONT scenes into two equally sized datasets and train the two networks separately.
Collision. The collision metric measures collisions between different objects and between objects and the wall. It is calculated as \(N_{coll} = 3 \times N_{wall} + N_{obj}\), where \(N_{wall}\) is the number of collisions between objects and the wall, and \(N_{obj}\) is the number of collisions between different objects. Note that \(N_{wall}\) and \(N_{obj}\) are both collision counts: any overlap between objects, however small, counts as one collision.
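A simplified version of this count, using 2D object footprints instead of the paper's 3D collision tests, could be written as:

```python
def aabb_overlap(a, b):
    """True if two axis-aligned footprints (min_x, min_z, max_x, max_z) overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def collision_metric(boxes, room):
    """N_coll = 3 * N_wall + N_obj; any nonzero overlap counts as one collision.

    `boxes` are object footprints and `room` is the room footprint, both as
    (min_x, min_z, max_x, max_z). Wall penetrations are weighted 3x.
    """
    n_wall = sum(1 for b in boxes
                 if b[0] < room[0] or b[1] < room[1] or b[2] > room[2] or b[3] > room[3])
    n_obj = sum(1 for i in range(len(boxes)) for j in range(i + 1, len(boxes))
                if aabb_overlap(boxes[i], boxes[j]))
    return 3 * n_wall + n_obj
```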
Human Affordance. Following the discussion of human-scene interaction in Section 3, we evaluate two metrics associated with human affordance: (1) Movable Manipulation, defined as the percentage \(p\) described in Section 5.1, and (2) Free Space, defined as the free space of human activity divided by the available space, which is exactly \(R_{fs} / 15\) in Section 5.1. For the detailed calculation of the two metrics, please refer to Section 5.1. The overall human-affordance metric is the average of these two individual metrics.
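Concretely, the overall metric combines the two normalized quantities as a plain mean (a trivial helper; the detailed computation of each term is in Section 5.1):

```python
def human_affordance(manipulation_pct, free_space_ratio):
    """Average of movable-part manipulation percentage p and normalized free space."""
    assert 0.0 <= manipulation_pct <= 1.0 and 0.0 <= free_space_ratio <= 1.0
    return 0.5 * (manipulation_pct + free_space_ratio)
```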
7.4 Evaluations & Comparisons
The comparison of scene optimization. We run every agent on the scenes generated by the two scene generators, Sync2Gen [Yang et al. 2021] and ATISS [Paschalidou et al. 2021]. In Table 1, we show the numerical results of the different agents on scene optimization of the generated scenes. From the metrics of the unoptimized scenes, we can see that both ATISS and Sync2Gen-VAE generally perform worse in living rooms than in bedrooms, which indicates that generation in living rooms is more challenging. After optimization, our method performs best on all three metrics, indicating that Haisor is able to reconfigure the furniture placement to satisfy the designed criteria. In Figure 7, optimized visual results are presented for the different agents. In the generated living rooms, more collisions between furniture remain and more movable parts cannot be manipulated. We observe that our method better captures the “functional region” of a set of furniture and achieves multiple goals (e.g., solving collisions, extending free space, and finding space for movable parts) simultaneously, while the heuristic agent is only capable of solving collisions, the Sync2Gen-Opt agent struggles to balance multiple goals, and the simulated annealing agent tends to separate objects too far apart and fails to combine furniture into meaningful regions.
Furthermore, we observed some smart strategies in our optimization results. One example is shown in Figure 8. The initial scene is a living room in which six chairs are placed together, causing many collisions. The agent first moves two chairs of the same size to the center of the room, forming a region of “chairs surrounding a table,” and subsequently moves the four other chairs, forming a second such region. Finally, the agent makes some additional moves that increase the human affordance of the scene. We observe that our agent can learn the key feature that makes a scene realistic (in this case, the “chair-table” region) and derive the corresponding actions. Examples of more strategies learned to improve rationality and human affordance can be found in the supplementary material. Additionally, we include another example in Figure 9. The initial scene is overall rational, with the chair-table region and the table-sofa region placed together correctly. However, three objects have movable parts, two cabinet hinges and one table slider, and there is not enough space for humans to manipulate them comfortably. The results in Figure 9 demonstrate that our pipeline can optimize the scene to obtain more space for movable parts while preserving the rationality of the scene.
Perceptual Study. In addition to quantitative and qualitative results, we conduct a perceptual study to further evaluate how realistic and livable the optimized results of our method and two baseline methods (Sync2Gen-Opt and the heuristics agent) are. For a fair comparison, we randomly select 50 living rooms and 50 bedrooms generated by Sync2Gen-VAE and ATISS to be optimized. We render the scenes in a top-down view and ask each participant to rank the algorithms according to two criteria (rationality and human affordance): (a) the overall quality of the optimized rooms and (b) whether the participants would enjoy living in the scene. In every round of the survey, the participants are presented with the input scene before optimization and the scene optimized by each method (without being told which is which) and rank the results according to the above criteria. All the participants were local volunteers known to be reliable. The results of this perceptual study are reported in Table 2; our method is the most preferred among the four results (the input scene and the three optimized results). Further, we can see that the input scene receives a high score due to users’ preference for the overall rationality of the scene, whereas the Sync2Gen-Opt and heuristic agents may move furniture over long distances and make the scene look messy, even though they may score better on the collision or human affordance metrics.
Ablation Study. The performance of our method relies heavily on the various reward components used in the simulation environment, the training methodology (i.e., imitation learning), and the Monte Carlo Tree Search. We perform ablation studies to validate the key designs of our method.
We compare our full method with eight ablated versions, each of which removes one design component and is trained with the same settings as our full method. Note that we include a w/o-GCN ablated version, which replaces the GCN in the Q-network with a plain MLP. The MLP is fed the concatenated feature as input, and its output has dimension \(4 \times V_{max}\), where \(V_{max}\) is the maximum number of furniture objects in a room. Only the first \(4 \times |V|\) elements of the output are used as the predicted Q-values, where \(|V|\) is the actual number of furniture objects in a given scene; the remaining elements are simply discarded. We also include a “Pure MCTS” ablated version of the agent, which makes decisions solely based on the MCTS search results, without the Q-network prior.
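The padding-and-slicing scheme of the w/o-GCN variant can be sketched as follows. The constants are assumptions for illustration (the text fixes only the factor 4, i.e., four per-object actions).

```python
import numpy as np

V_MAX = 20      # assumed maximum number of furniture objects per room
N_ACTIONS = 4   # per-object actions, matching the 4 x V_max output in the text

def slice_q_values(mlp_output, num_objects):
    """Keep Q-values for the |V| real objects; discard the padding tail.

    The MLP always predicts N_ACTIONS * V_MAX values; only the first
    N_ACTIONS * num_objects entries are interpreted as per-object Q-values.
    """
    assert mlp_output.shape[-1] == N_ACTIONS * V_MAX
    q = mlp_output[..., : N_ACTIONS * num_objects]
    return q.reshape(*mlp_output.shape[:-1], num_objects, N_ACTIONS)
```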
The quantitative metrics are reported in Table 3. The statistics clearly indicate that our full model outperforms all the ablated versions overall. Some ablated methods achieve performance similar to our full method on individual metrics (e.g., collision or human affordance), but only our method considers all aspects simultaneously. The qualitative results of all ablated versions and ours are presented in Figure 10. The results without imitation learning are clearly worse than those of our full model within the same number of optimization steps. The agent without MCTS fails to move the chair under the table. When the free-space and human-affordance components are removed from our model, the furniture becomes highly scattered, leaving not enough space for humans to manipulate the cabinet beside the dining table and chairs. Without the collision reward, some slight collisions remain unresolved, since they have almost no effect on rationality and human affordance. Without the rationality reward, the agent behaves similarly to the heuristics agent, tending only to resolve collisions. If the network is replaced by a plain MLP, the agent captures only part of the relationships between objects and fails to perform as well as the agent with the GCN.
7.5 Extension of Haisor to Scene Personalization
In addition to rearranging the scene layout, our method can be extended to scene customization by adding more components to the reward settings. The motivation for this extension is to address some common issues of general generative models, based on two key observations: (1) some generative models still predict unreasonable orientations for objects; for example, a chair facing the wall is unrealistic. (2) Some objects are duplicated and lie in the same location; for example, two dining chairs surrounding a dining table are predicted at nearly identical positions.
Based on the observations above, we implement two extensions, orientation adjustment and object removal, by adding the following extra actions to the action space of each object:
(1)
Orientation Adjustment: When the agent performs this action, the rotation angle \(\theta\) of the selected object about the y-axis is increased by \(\pi /2\); if \(\theta \gt 2\pi\), then \(2\pi\) is subtracted from \(\theta\). Additionally, a reward of \(-10\) is given to discourage performing this action repeatedly.
(2)
Object Removal: When the agent performs this action, the selected object is removed from the scene. Additionally, a reward of \(-25 \times S\) is given to discourage emptying the scene of all objects, where \(S\) is the sum of the three dimensions of the axis-aligned bounding box of the selected object.
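The two extra actions above can be sketched directly from their definitions (a minimal illustration; the function names and return conventions are ours, not the paper's API):

```python
import math

ROTATION_PENALTY = -10.0
REMOVAL_WEIGHT = -25.0

def adjust_orientation(theta):
    """Rotate the selected object by pi/2 about the y-axis, wrapping past 2*pi.

    Returns (new_theta, reward); the constant -10 reward discourages the agent
    from performing this action repeatedly.
    """
    theta += math.pi / 2.0
    if theta > 2.0 * math.pi:
        theta -= 2.0 * math.pi
    return theta, ROTATION_PENALTY

def removal_reward(bbox_dims):
    """Reward for removing an object: -25 * S, where S sums the three
    axis-aligned bounding-box dimensions, so larger objects cost more to delete."""
    return REMOVAL_WEIGHT * sum(bbox_dims)
```

The size-dependent removal penalty makes deleting a small duplicate chair much cheaper than deleting a wardrobe, steering the agent toward removing redundant objects rather than emptying the room.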
Figures 11 and 12 show two examples of these extensions. In Figure 11, one dining chair around the dining table does not face the table, while the three other chairs do. This is frequently seen in the results of SoTA scene generative models. The predicted orientation is usually aligned with the x or z axis, so we only need to adjust the orientation by \(\pi /2\), \(\pi\), or \(3\pi /2\) radians, i.e., rotate the object by \(\pi /2\) one, two, or three times. In Figure 12, two dining tables are predicted at exactly the same position. This is very common in current generative models based on sequential generation, such as transformer-based frameworks. To optimize this scene, we only need to remove one of the two tables to make the indoor scene more realistic.
The two examples above are extensions of our optimization method. Since our method is built on a simulation environment, reinforcement learning, and MCTS, Haisor can accommodate various optimization actions and goals. By simply adding personalized actions or rewards, our method can optimize an indoor scene toward different targets that satisfy individual needs.