In an extended reality (XR) system, it is important to reproduce a given space in three dimensions and to accurately determine the number, location, and relationships of objects so that users can properly understand unfamiliar environments. 3D reconstruction and localization technologies are also essential for presenting information in a way that feels natural to users of devices such as XR glasses.
There are, however, several challenges that make this difficult.
Figure 1. Link between 3D digital twin and real world
We have developed technologies for scanning 3D space and generating 3D digital twins, and for recognizing these 3D digital twins.
These allow the XR system to understand information about the user, such as their state, surrounding environment, behavioral tendencies, and interactions, resulting in a more natural and intuitive XR experience.
We provide various methods to easily and quickly reconstruct 3D space.
The 3D reconstruction device newly developed by RICOH captures a 360-degree field of view quickly and easily, scans an environment at high frequency, provides accurate 3D scans even in dynamic environments, and reduces the discrepancy between actual scenes and their 3D digital twins. 3D data captured with this device in real-world scenes can also be used as a dataset for multiple AI tasks (see related information).
Overview of a 3D reconstruction device
Figure 2. 3D reconstruction device
Figure 3. Possible usage industries, resolution, camera field of view, measurable range
Figure 4. Data from multiple scans overlaid to reproduce entire space as single accurate 3D space
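Overlaying multiple scans into a single consistent space requires rigid registration between overlapping scans. As a minimal illustration (not RICOH's actual pipeline), the following sketch estimates the rotation and translation between two scans with the Kabsch/Procrustes method, assuming point correspondences are already known; real pipelines typically obtain correspondences via feature matching or refine the alignment with ICP.

```python
import numpy as np

def rigid_align(src, dst):
    """Estimate R, t such that R @ src_i + t ~= dst_i (Kabsch algorithm)."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered correspondences
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Reflection correction keeps R a proper rotation (det = +1)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

Applying the estimated transform to one scan expresses it in the other scan's coordinate frame, so the two point clouds can be merged.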
RICOH and the German Research Center for Artificial Intelligence (DFKI) have developed a spatial understanding AI model that requires no training, based on zero-shot learning*4. It utilizes models that have been pre-trained on large-scale data, such as the Segment Anything Model (SAM)*5 and Contrastive Language-Image Pre-training (CLIP)*6, to estimate correspondence relationships between natural language and 3D data. In addition to allowing for a given scene to be understood with a high degree of accuracy for unknown data, this technology also allows for interaction with the XR system using natural language.
Figure 5. Scanned 3D data and spatial understanding results
The spatial understanding AI model suggests the object in space that best matches the question asked by the user in natural language. The figure shows an example of the XR system responding to the user based on the spatial information it has recognized.
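Conceptually, this query step is a nearest-neighbor search in a joint language-image embedding space. The sketch below is a simplified, hypothetical illustration with stand-in embedding vectors; in the actual system, per-object embeddings would come from SAM-segmented regions encoded by CLIP's image encoder, and the query embedding from CLIP's text encoder.

```python
import numpy as np

def best_matching_object(query_vec, object_vecs, labels):
    """Return the object label whose embedding has the highest cosine
    similarity to the natural-language query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    O = object_vecs / np.linalg.norm(object_vecs, axis=1, keepdims=True)
    sims = O @ q                      # cosine similarity per object
    i = int(np.argmax(sims))
    return labels[i], float(sims[i])
```

The returned label and score identify which object in the scanned space best answers the user's question.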
Demo video
In an XR system, accurate positional alignment in 3D space with the real world is essential for virtual content in the digital twin to be displayed naturally in real space. An XR device uses pre-acquired 3D scan data of the real world as a reference, estimating its own position and orientation in real time and aligning them with the scan data. RICOH is currently developing technology that combines AR markers, structure from motion (SfM), and simultaneous localization and mapping (SLAM) so that these devices can estimate their own positions. AR markers serve as reference points during initial alignment by mapping the position where they are physically attached to the corresponding position in the 3D scan. SfM reconstructs 3D structures and camera trajectories from multi-viewpoint images and is useful for geometric matching against the scan data. SLAM estimates the current position of the XR device in real time from camera and inertial measurement unit (IMU) data while simultaneously building a map of the environment. By combining these, the scan data, the real world, and the spatial coordinates of the XR system are kept consistent and integrated.
Figure 7. Technology to estimate the position of the XR device in a 3D space that has been scanned
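As a small sketch of the marker-based initial alignment described above: if a marker's pose in the scan frame is known (from where it was physically attached) and the device observes the same marker's pose in its own SLAM frame, the device pose in the scan frame follows from composing the two rigid transforms. Function and variable names here are illustrative, not from RICOH's implementation.

```python
import numpy as np

def device_pose_in_scan(T_scan_marker, T_device_marker):
    # T_scan_marker:   4x4 homogeneous pose of the marker in the 3D-scan frame
    # T_device_marker: 4x4 pose of the marker as observed in the device frame
    # Composing them expresses the device pose in the scan frame:
    #   T_scan_device = T_scan_marker @ inv(T_device_marker)
    return T_scan_marker @ np.linalg.inv(T_device_marker)
```

After this one-time anchoring, SLAM keeps the device pose updated in real time relative to the same scan frame.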
Room layout estimation from 2D color images without 3D information can estimate regions such as floors and walls in a scene, and spatial constraints such as the Manhattan world assumption*7 can be added to reconstruct a realistic room layout. Taking inspiration from this, RICOH developed a high-precision ground plane detection method for 3D point clouds. The room layout is first estimated from a 2D color image; the raw 3D point cloud obtained from a sensor is then positioned and shaped to fit that layout, yielding the ground plane region. This allows the avatar in the digital twin to be positioned and aligned correctly with the floor in the real world.
Figure 8. Ground plane estimation flowcharts and example results
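A common way to realize the point-cloud fitting step is RANSAC plane fitting, optionally restricted to the points the 2D layout labels as floor. The sketch below is a generic RANSAC plane fit, not RICOH's specific method, assuming the input is an N×3 point array:

```python
import numpy as np

def fit_ground_plane(points, n_iters=200, tol=0.02, seed=0):
    """RANSAC plane fit: repeatedly hypothesize a plane from 3 random
    points and keep the one supported by the most inliers.
    Returns ((normal, d), inlier_mask) for the plane n.x + d = 0."""
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:               # degenerate (collinear) sample
            continue
        n /= norm
        d = -n @ p0
        inliers = np.abs(points @ n + d) < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_plane, best_inliers = (n, d), inliers
    return best_plane, best_inliers
```

The inlier mask then gives the ground plane region against which virtual content, such as an avatar's feet, can be anchored.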
RICOH is participating in Luminous*8 to research and develop next-generation XR technology; the technologies described here are a result of these efforts.
We aim to build 3D digital twins and provide solutions that utilize them.