Nvidia Taught an AI to Instantly Generate Fully Textured 3D Models From Flat 2D Images

Turning a sketch or photo of an object into a fully realized 3D model that can be duplicated on a 3D printer, used in a video game, or brought to life in a movie through visual effects normally requires the skills of a digital modeler working from a stack of reference images. But Nvidia has successfully trained a neural network to generate fully textured 3D models from just a single photo.

We’ve seen similar approaches to automatically generating 3D models before, but they’ve required either a series of photos snapped from many different angles for accurate results, or input from a human user to help the software work out the dimensions and shape of a specific object in an image. Neither approach is wrong; any improvement to the task of 3D modeling is welcome, as it puts such tools in the hands of a wider audience, including people without advanced skills. But those requirements also limit the potential uses for the software.

At the annual Conference on Neural Information Processing Systems, taking place in Vancouver, British Columbia, this week, researchers from Nvidia will present a new paper, “Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer,” that details the creation of a new graphics tool called a differentiable interpolation-based renderer, or DIB-R for short, which sounds only slightly less intimidating.

Nvidia’s researchers trained their DIB-R neural network on multiple datasets, including photos previously turned into 3D models, 3D models presented from multiple angles, and sets of photos focusing on a particular subject from multiple angles. It takes roughly two days to train the network to extrapolate the extra dimensions of a given subject, such as birds, but once training is complete, it can churn out a 3D model from a 2D photo it has never seen before in under 100 milliseconds.
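The key to this setup is that the renderer is differentiable: a network predicts 3D geometry from a photo, the renderer turns that geometry back into a 2D image, and the mismatch between the rendered image and the real one flows backward as a gradient that corrects the geometry. The sketch below illustrates that training loop in PyTorch, with a toy Gaussian-splat silhouette standing in for a real differentiable renderer; every name, shape, and number in it is illustrative, not Nvidia’s actual code or the DIB-R API.

```python
# Minimal sketch of the analysis-by-synthesis loop behind DIB-R-style
# training. Everything here (encoder, toy "renderer," silhouette loss)
# is an illustrative stand-in, not Nvidia's implementation.
import torch
import torch.nn as nn

NUM_VERTS = 256   # vertices of a template mesh the network deforms
IMG_SIZE = 32     # toy image resolution


class ImageTo3D(nn.Module):
    """Predicts per-vertex offsets for a template shape from one image."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, NUM_VERTS * 3),
        )

    def forward(self, image):
        offsets = self.net(image).view(-1, NUM_VERTS, 3)
        return 0.1 * torch.tanh(offsets)  # keep deformations small


def toy_render_silhouette(verts):
    """Splat vertices as soft Gaussians into a differentiable silhouette.
    A real renderer like DIB-R rasterizes textured triangles, but the key
    property is the same: gradients flow from pixels back to 3D geometry."""
    coords = torch.linspace(-1.0, 1.0, IMG_SIZE)
    gy, gx = torch.meshgrid(coords, coords, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)             # (H, W, 2)
    xy = verts[..., :2]                              # orthographic projection
    d2 = ((grid[None, :, :, None, :] - xy[:, None, None, :, :]) ** 2).sum(-1)
    return torch.exp(-d2 / 0.01).sum(-1).clamp(max=1.0)  # (B, H, W)


# Template shape: random points on a unit sphere, for brevity.
template = nn.functional.normalize(torch.randn(NUM_VERTS, 3), dim=-1)

model = ImageTo3D()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake training pair: an input photo and its ground-truth silhouette.
photo = torch.rand(1, 3, IMG_SIZE, IMG_SIZE)
target_silhouette = torch.zeros(1, IMG_SIZE, IMG_SIZE)
target_silhouette[:, 8:24, 8:24] = 1.0

for step in range(100):
    verts = template[None] + model(photo)        # deformed 3D mesh
    rendered = toy_render_silhouette(verts)      # differentiable render
    loss = nn.functional.mse_loss(rendered, target_silhouette)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The full system also predicts texture and lighting, and its losses compare rendered views against real training images, but the gradient path from pixels back to 3D shape is the same idea as in this toy loop. Once trained, only the fast forward pass (photo in, 3D model out) is needed, which is why inference takes milliseconds.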

That impressive processing speed is what makes this tool particularly interesting, because it has the potential to vastly improve how machines like robots and autonomous cars see the world and understand what lies before them. Still images pulled from a live camera feed could be instantly converted into 3D models, allowing an autonomous car, for example, to accurately gauge the size of a large truck it needs to avoid, or a robot to predict how to properly pick up a random object based on its estimated shape. DIB-R could even improve the performance of security cameras tasked with identifying and tracking people, since an instantly generated 3D model would make it easier to match images as a person moves through a camera's field of view. Yes, every new technology is equal parts scary and cool.