In an extended reality (XR) system, it is important to reproduce a given space in three dimensions and to accurately determine the number, location, and relationships of objects so that users can properly understand unfamiliar environments. 3D reconstruction and localization technologies are also essential for presenting information in a way that feels natural to users of devices such as XR glasses.
There are, however, several challenges that make this difficult.
Figure 1. Link between 3D digital twin and real world
We have developed technologies for scanning 3D space and generating 3D digital twins, and for recognizing these 3D digital twins.
These allow the XR system to understand information about the user, such as their state, surrounding environment, behavioral tendencies, and interactions, resulting in a more natural and intuitive XR experience.
We provide various methods to easily and quickly reconstruct 3D space.
The 3D reconstruction device newly developed by RICOH captures a 360-degree field of view quickly and easily, scans an environment at high frequency, provides accurate 3D scans even in dynamic environments, and reduces the discrepancy between actual scenes and their 3D digital twins. 3D data captured with this device in real-world scenes can also be used as a dataset for multiple AI tasks (see related information).
Overview of a 3D reconstruction device
Figure 2. 3D reconstruction device
Figure 3. Possible usage industries, resolution, camera field of view, measurable range
Figure 4. Data from multiple scans overlaid to reproduce entire space as single accurate 3D space
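Overlaying multiple scans into a single consistent space requires rigid registration between overlapping scans. As a minimal illustration (not RICOH's actual pipeline), the following sketch estimates the rotation and translation between two scans with the Kabsch/Procrustes method, assuming point correspondences are already known; real pipelines typically obtain correspondences via feature matching or refine the alignment with ICP.

```python
import numpy as np

def rigid_align(src, dst):
    """Estimate R, t such that R @ src_i + t ~= dst_i (Kabsch algorithm)."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered correspondences
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Reflection correction keeps R a proper rotation (det = +1)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

Applying the estimated transform to one scan expresses it in the other scan's coordinate frame, so the two point clouds can be merged.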
RICOH and the German Research Center for Artificial Intelligence (DFKI) have developed a spatial understanding AI model that requires no training, based on zero-shot learning*4. It utilizes models that have been pre-trained on large-scale data, such as the Segment Anything Model (SAM)*5 and Contrastive Language-Image Pre-training (CLIP)*6, to estimate correspondence relationships between natural language and 3D data. In addition to allowing for a given scene to be understood with a high degree of accuracy for unknown data, this technology also allows for interaction with the XR system using natural language.
Figure 5. Scanned 3D data and spatial understanding results
The spatial understanding AI model suggests the object in space that best matches the question asked by the user in natural language. The figure shows an example of the XR system responding to the user based on the spatial information it has recognized.
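Conceptually, this query step is a nearest-neighbor search in a joint language-image embedding space. The sketch below is a simplified, hypothetical illustration with stand-in embedding vectors; in the actual system, per-object embeddings would come from SAM-segmented regions encoded by CLIP's image encoder, and the query embedding from CLIP's text encoder.

```python
import numpy as np

def best_matching_object(query_vec, object_vecs, labels):
    """Return the object label whose embedding has the highest cosine
    similarity to the natural-language query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    O = object_vecs / np.linalg.norm(object_vecs, axis=1, keepdims=True)
    sims = O @ q                      # cosine similarity per object
    i = int(np.argmax(sims))
    return labels[i], float(sims[i])
```

The returned label and score identify which object in the scanned space best answers the user's question.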
Demo video
In an XR system, accurate positional alignment in 3D space with the real world is essential for virtual content in the digital twin to be displayed naturally in real space. An XR device uses pre-acquired 3D scan data of the real world as a reference, estimating its own position and orientation in real time and aligning them with the scan data. RICOH is currently developing technology that combines AR markers, structure from motion (SfM), and simultaneous localization and mapping (SLAM) so that these devices can estimate their own positions. AR markers serve as reference points during initial alignment by mapping the position where they are physically attached to the corresponding position in the 3D scan. SfM reconstructs 3D structures and camera trajectories from multi-viewpoint images and is useful for geometric matching against the scan data. SLAM estimates the current position of the XR device in real time from camera and inertial measurement unit (IMU) data while simultaneously building a map of the environment. By combining these, the scan data, the real world, and the spatial coordinates of the XR system are kept consistent and integrated.
Figure 7. Technology to estimate the position of the XR device in a 3D space that has been scanned
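As a small sketch of the marker-based initial alignment described above: if a marker's pose in the scan frame is known (from where it was physically attached) and the device observes the same marker's pose in its own SLAM frame, the device pose in the scan frame follows from composing the two rigid transforms. Function and variable names here are illustrative, not from RICOH's implementation.

```python
import numpy as np

def device_pose_in_scan(T_scan_marker, T_device_marker):
    # T_scan_marker:   4x4 homogeneous pose of the marker in the 3D-scan frame
    # T_device_marker: 4x4 pose of the marker as observed in the device frame
    # Composing them expresses the device pose in the scan frame:
    #   T_scan_device = T_scan_marker @ inv(T_device_marker)
    return T_scan_marker @ np.linalg.inv(T_device_marker)
```

After this one-time anchoring, SLAM keeps the device pose updated in real time relative to the same scan frame.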
Room layout estimation from 2D color images without 3D information can estimate regions such as floors and walls in a scene, and spatial constraints such as the Manhattan world assumption*7 can be added to reconstruct a realistic room layout. Taking inspiration from this, RICOH developed a high-precision ground plane detection method for 3D point clouds. The room layout is first estimated from a 2D color image; the raw 3D point cloud obtained from a sensor is then positioned and shaped to fit that layout, yielding the ground plane region. This allows the avatar in the digital twin to be positioned and aligned correctly with the floor in the real world.
Figure 8. Ground plane estimation flowcharts and example results
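A common way to realize the point-cloud fitting step is RANSAC plane fitting, optionally restricted to the points the 2D layout labels as floor. The sketch below is a generic RANSAC plane fit, not RICOH's specific method, assuming the input is an N×3 point array:

```python
import numpy as np

def fit_ground_plane(points, n_iters=200, tol=0.02, seed=0):
    """RANSAC plane fit: repeatedly hypothesize a plane from 3 random
    points and keep the one supported by the most inliers.
    Returns ((normal, d), inlier_mask) for the plane n.x + d = 0."""
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:               # degenerate (collinear) sample
            continue
        n /= norm
        d = -n @ p0
        inliers = np.abs(points @ n + d) < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = (n, d), inliers
    return best_plane, best_inliers
```

The inlier mask then gives the ground plane region against which virtual content, such as an avatar's feet, can be anchored.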
RICOH is participating in Luminous*8 to research and develop next-generation XR technology; the technologies described here are a result of these efforts.
We aim to build 3D digital twins and provide solutions that utilize them.