Gaming, as always, is in the middle of a transformation. Major developers are focusing on creating 3D-ready platforms, while others, like Nintendo and Microsoft, are trying to take us beyond controllers by developing games that require physical movement and in-game interaction.
The brave new world of gaming will be an interesting one indeed, so we decided to take a look at two of the pioneering technologies that may change games forever: Microsoft's Kinect and autostereoscopy.
Microsoft Kinect
Microsoft's unique input device for the Xbox has opened up some very intriguing possibilities. But how exactly does it work?
Kinect is, perhaps, the most significant product Microsoft has developed since Windows itself. It has the potential to impact not only gaming, but general computing, communications, and media, as well. It's an evolutionary platform blending sight, sound, and software that, if developed correctly into the future, could become a revolutionary UI.
Sight
Kinect's console includes an RGB camera—the same type found in webcams and cell phones across the globe. Currently, it's a device with a 640x480 resolution capable of capturing 30 frames per second. It's not 3D.
The camera data drives an avatar, which in this context is simply a wireframe representation of the player that has been mapped with recognition points. These points correspond to the joints available on the wireframe (wrists, neck, elbows, shoulders, hips, etc., in the case of human beings) and are what allow the system to emulate accurate player motion onscreen in real time. "Real," in this case, entails a reported 200ms lag, including screen response time, thanks to processing overhead and the usual screen refresh timing. It's possible to reduce this using a faster CPU, but in general, 200ms is right on the border of human perception.
This is basically the same motion-capture process that has been used for the last decade or so in sports games, among other things, to record athletes' movements accurately for reproduction during gameplay. But those professional systems play back prerecorded, keyframed motion, while Kinect skips the recording step and instead reproduces the live player's motion (tracked at 20 points) as the action proceeds.
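To make the idea of tracked points concrete, here is a minimal sketch of how an application might read a pose from a frame of joint positions. The joint names, the coordinates, and the `joint_angle` helper are all made up for illustration; none of them come from Microsoft's software.

```python
import math

# Hypothetical joint positions in meters (x, y, z). The names and values are
# illustrative only and are not taken from any Kinect API.
skeleton_frame = {
    "shoulder_right": (0.20, 0.50, 2.00),
    "elbow_right": (0.45, 0.50, 2.00),
    "wrist_right": (0.45, 0.75, 2.00),
}

def joint_angle(frame, a, b, c):
    """Return the angle in degrees at joint b between the segments b->a and b->c."""
    ax, ay, az = frame[a]
    bx, by, bz = frame[b]
    cx, cy, cz = frame[c]
    u = (ax - bx, ay - by, az - bz)
    v = (cx - bx, cy - by, cz - bz)
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))

# Upper arm points sideways and forearm points straight up: a right angle.
elbow_bend = joint_angle(skeleton_frame, "shoulder_right", "elbow_right", "wrist_right")
```

A game could recompute angles like this on every frame to decide whether the player is, say, raising a hand.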
Perhaps more mundane but nonetheless important, the combination of infrared and RGB cameras also allows Kinect to provide facial recognition that can automatically log a player on to the Microsoft network as well as associate the player with a previously used avatar. A recent update, called Avatar Kinect, gives the console the power to recognize players' facial expressions and display them onscreen. In context, this ability can be used in several preconfigured venues (currently all thinly disguised chat room environments) to communicate with other players both verbally and through facial expressions. Apply notions of affective computing—which posits that systems will soon be capable of reacting to human facial expressions and emotions—and you can see why this is such a big deal.
The entire Kinect console sits atop a pedestal, much like those of 1960s lava lamps. Unlike (most) lava lamps, the Kinect pedestal has a built-in tilt motor that lets the entire console move. The tilt range is about 27 degrees, and it's used in conjunction with the cameras' 57-degree horizontal and 43-degree vertical fields of view to give the system a greater ability to track you as you move around.
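As a rough illustration of what those angles mean in a living room, the sketch below estimates how much space the cameras cover at a given distance. It assumes an ideal pinhole camera and ignores the tilt motor entirely; the function names are ours.

```python
import math

H_FOV_DEG = 57.0  # horizontal field of view quoted above
V_FOV_DEG = 43.0  # vertical field of view quoted above

def visible_extent(distance_m, fov_deg):
    """Width (or height) covered at a given distance by an ideal pinhole camera."""
    return 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)

def coverage_at(distance_m):
    """Return (width, height) in meters of the area visible at distance_m."""
    return (
        visible_extent(distance_m, H_FOV_DEG),
        visible_extent(distance_m, V_FOV_DEG),
    )

# A player standing 2 meters from the sensor.
width_m, height_m = coverage_at(2.0)
```

Under these assumptions, a player standing 2 meters away has roughly 2.2 meters of width and 1.6 meters of height to move in before leaving the frame, which suggests why a tilting base is useful.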
Nestled alongside the RGB camera are an infrared emitter and an infrared camera. The former bathes the immediate area in infrared while the latter collects the reflected infrared for spatial analysis. The Kinect combines the 2D RGB image with the IR background fill to complete a recognizable object that sits at a distance "L" from the system and is located along the X, Y, and Z (3D) axes.
Sound
Although you may hear a barely perceptible whir coming from the console, it's the only sound you'll hear. There are no speakers inside the Kinect. Instead, the interior sports four microphones: three on the lower-right end and a single one on the lower-left side. All four face downward.
The quartet composes a spatial sound array that samples incoming audio and compares the four streams, separating background noise from speech, and different voices from each other. It's effective to about 4 meters from the console.
While noise-cancellation microphones have been around for years, Kinect faces the unique challenge of typically having TV/receiver speakers closer to the mics while the human voices are farther away. The acoustic-echo-cancellation techniques used in common speakerphones tend to work well, but they assume the voice is close to the mic and the noise is farther away; for Kinect, the situation is reversed. Software created by the Speech Group at Microsoft Research in Redmond solved the problem.
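One textbook way a microphone array can compare its streams is to measure how much later a sound reaches one mic than another. The sketch below does that with a brute-force cross-correlation. It is a stand-in for illustration, not Microsoft's algorithm, and the signal, sample rate, and function name are assumptions.

```python
def best_lag(reference, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best matches `reference`."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += sample * delayed[j]
        if score > best_score:
            best, best_score = lag, score
    return best

# Hypothetical test signal: a short click that reaches mic B three samples after mic A.
click = [0.0, 1.0, -0.5, 0.25, 0.0]
mic_a = click + [0.0] * 11
mic_b = [0.0] * 3 + click + [0.0] * 8

lag_samples = best_lag(mic_a, mic_b, max_lag=8)

SAMPLE_RATE_HZ = 16000      # assumed sample rate, not a published Kinect figure
SPEED_OF_SOUND_M_S = 343.0  # speed of sound in air at room temperature
path_difference_m = lag_samples / SAMPLE_RATE_HZ * SPEED_OF_SOUND_M_S
```

A sound source that is farther from one microphone than another produces a nonzero lag, and lags from several microphone pairs can be combined to estimate the direction a voice is coming from.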
Software
The Kinect console does not have a processor, which is surprising considering all that's expected of it. The console did have one when it was first announced (Project Natal in 2009) but Microsoft withdrew the internal CPU and decided to let the processing power of the Xbox handle matters. Kudo Tsunoda, the mastermind behind Kinect, insists that the add-on uses "less than one percent" of the Xbox 360's processing power.
To help achieve that, Microsoft dropped the camera's frame rate from the 60fps quoted at its 2009 announcement to 30fps at its commercial release. That would still put a huge burden on the efficiency of the algorithms that run the console, except that the bulk of the overhead is mitigated because the algorithms run on the Xbox console itself, as Kinect drivers.
These drivers are what describe a human's position in Cartesian space, and they are what handle reverberation problems and suppress loudspeaker echoes in the stereo acoustic-echo-cancellation algorithm. They do all this and more based on comparisons to decision forests (a collection of decision trees) in conjunction with thousands of stored samples.
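For readers unfamiliar with decision forests, here is a toy sketch of the voting idea: several small trees each classify the same input, and the most common answer wins. The trees, features, thresholds, and labels below are hand-made placeholders, not anything learned from Kinect's stored samples.

```python
from collections import Counter

# Each tree is a nested tuple: (feature_index, threshold, left_subtree, right_subtree).
# A leaf is just a label string. These trees are hypothetical, not trained on real data.
TREE_A = (0, 0.5, "hand", (1, 0.3, "elbow", "shoulder"))
TREE_B = (1, 0.4, (0, 0.7, "hand", "elbow"), "shoulder")
TREE_C = (2, 0.2, "hand", "elbow")
FOREST = [TREE_A, TREE_B, TREE_C]

def classify_with_tree(tree, features):
    """Walk one tree: go left when the feature is below the threshold, else right."""
    while not isinstance(tree, str):
        index, threshold, left, right = tree
        tree = left if features[index] < threshold else right
    return tree

def classify_with_forest(forest, features):
    """Return the label that the most trees vote for."""
    votes = Counter(classify_with_tree(tree, features) for tree in forest)
    return votes.most_common(1)[0][0]

# Two of the three trees vote "hand" for this made-up feature vector.
label = classify_with_forest(FOREST, [0.2, 0.1, 0.9])
```

Because each tree is only a handful of comparisons, evaluating a forest is cheap, which fits the article's point about keeping the processing overhead low.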
Continuum
There is no technical reason why a Kinect console could not be attached to any computing device that was loaded with the algorithms it needed to function. While that might be slightly difficult for the traditional BIOS/OS arrangement found in most contemporary computers, a UEFI environment would clear the way for the archetypal house of the future—run by voice commands and gestures with only its own facial recognition algorithms needed to provide security.
By the time you read this, it's likely that Microsoft will have made some form of Kinect-related announcement at the 2011 Electronic Entertainment Expo in Los Angeles. Early speculation is that Microsoft's purchase of Skype might herald advanced video conferencing—such as predefined avatars with full expressions instead of true video images, to keep the CPU overhead down. And somewhere in the far-out reaches of time and space, what might a Kinect for PC/Mac be able to do with an über CPU?
It's going to be an interesting future.
Autostereoscopy
When will we get 3D without the dorky glasses?
One of the first (if not the first) 3D motion pictures was The Power of Love, released in 1922. A mere 89 years later, 3D technology continues to intrigue, yet it still struggles to gain widespread consumer acceptance. Three-dimensional production techniques have changed, theater screen designs have changed, and TV and home-theater video projectors have changed to incorporate 3D. In spite of all this progress, most modern 3D technology still requires viewers to don a pair of dorky glasses.
A new technology saddled with the ungainly, but technically accurate, name of "autostereoscopy," promises to change all that and finally allow us to see 3D video with our naked eyes.
Classic 3D Technology
The Power of Love was produced using an anaglyph process. Each scene was shot simultaneously from two different angles (about 2.5 inches apart, roughly the distance between the centers of the average person's eyeballs). The black-and-white film was then printed in two colors, red and green, and combined into a layered film on a single reel.
When the film was screened, everyone in the audience was given a pair of special glasses outfitted with red and green lenses. The red lens canceled out the red version of the film and allowed the green version to pass through, while the green lens did just the opposite. The combination produced the illusion of depth. Unfortunately, the anaglyph process induced headaches in some viewers; it also proved to be incompatible with color movies.
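The cancellation described above can be sketched with a very simplified light model. The code assumes ideal dyes (red dye absorbs only green light, green dye absorbs only red) and ideal lenses that pass a single color. The function names and the tiny sample frames are made up for illustration.

```python
def print_layered(red_version, green_version):
    """Combine two dye-density maps (0.0 = clear, 1.0 = fully dyed) into one film.

    Each pixel is returned as (red_light, green_light) transmitted through the
    film, assuming red dye absorbs only green light and green dye only red.
    """
    return [
        [(1.0 - g, 1.0 - r) for r, g in zip(red_row, green_row)]
        for red_row, green_row in zip(red_version, green_version)
    ]

def view_through(film, lens):
    """Return the brightness of each pixel through an ideal 'red' or 'green' lens."""
    channel = {"red": 0, "green": 1}[lens]
    return [[pixel[channel] for pixel in row] for row in film]

# Tiny hypothetical one-row frames, one per camera angle.
red_version = [[1.0, 0.5, 0.0, 0.0]]
green_version = [[0.0, 0.0, 0.25, 1.0]]

film = print_layered(red_version, green_version)
seen_through_red = view_through(film, "red")      # only the green version shows
seen_through_green = view_through(film, "green")  # only the red version shows
```

In this model the red-dyed picture leaves no trace in the red light reaching the eye, so the red lens shows only the green-dyed picture as dark areas, and vice versa. It also hints at why the scheme breaks down for color movies: the color channels are already used up separating the two views.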
Some 30 years later, with the movie studios desperate to find a means of luring people away from their television sets, the film House of Wax hit theaters in 1953 and did sensational box office. House of Wax was filmed using Edwin Land's Polaroid 3D system (it also featured the very first stereophonic soundtrack). The Polaroid 3D system used two lenses that captured light waves passing in perpendicular planes. Moviegoers wore polarized glasses that functioned like anaglyph lenses, with each lens passing only the image intended for that eye.
The 3D movie craze sparked by House of Wax petered out just a few years later, and Hollywood largely lost interest in 3D until the early 1980s. A string of schlocky "event" films—The Treasure of the Four Crowns, Jaws 3-D, and Amityville 3-D—passed through theaters, but the mania didn't last long and not even the release of 1983's science-fiction 3D classic Metalstorm: The Destruction of Jared-Syn could resurrect the popularity of the genre. The 3D glasses caused a viewer to watch a movie with his or her eyes slightly crossed, giving some people headaches.
Automatic Stereoscopic Imaging
Despite all the known problems with 3D glasses, most modern film studios, cinemas, and TV and video-projector manufacturers still rely on either active shutter glasses (that alternate between darkening the left and right lenses in sync with the display) or passive glasses (that filter light through polarized lenses).
Autostereoscopy (the creation of stereoscopic images automatically at the source, obviating the need for glasses) could be the ideal solution, although it's not entirely perfect either. The three most common autostereoscopic solutions available or in development today are parallax barrier, lenticular lens, and integral image.
With parallax barrier technology, slits in the barrier between the viewer and the screen present a different view to each eye. This produces the same image separation that 3D glasses would. Unfortunately, because the 3D effect is generated at the source, you can't move your head very much without spoiling the illusion.
A parallax barrier screen, such as the one deployed in the Nintendo 3DS, is fabricated by facing a display (such as an LCD) with a layer of material with slits that partially obscure each pixel. The left eye is able to see only the pixels intended for the left eye, and the right eye is able to see only the pixels intended for the right eye. When the brain combines both fields of vision, it perceives depth. A parallax barrier screen depends on the viewer sitting in an ideal position, a sweet spot, to deliver maximum effectiveness. Another problem is that the 3D illusion will collapse if the viewer moves his or her head too much. And finally, the parallax barrier blocks much of the light emanating from the display, significantly reducing its brightness.
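A simple way to picture the pixel sharing is as column interleaving: the panel alternates columns from the left-eye and right-eye images, and the barrier hides every other column from each eye. The sketch below assumes a viewer sitting exactly in the sweet spot. Which eye gets the even columns is an arbitrary choice here, and the function names are ours.

```python
def interleave_columns(left_image, right_image):
    """Build the panel image: even columns from the left view, odd from the right."""
    return [
        [left_row[x] if x % 2 == 0 else right_row[x] for x in range(len(left_row))]
        for left_row, right_row in zip(left_image, right_image)
    ]

def visible_to(panel, eye):
    """Return the columns an ideally positioned eye sees through the barrier slits."""
    start = 0 if eye == "left" else 1
    return [row[start::2] for row in panel]

# Hypothetical one-row images with labeled pixels so the result is easy to read.
left_image = [["L0", "L1", "L2", "L3"]]
right_image = [["R0", "R1", "R2", "R3"]]

panel = interleave_columns(left_image, right_image)
left_eye_view = visible_to(panel, "left")
right_eye_view = visible_to(panel, "right")
```

Each eye ends up with only half of the panel's columns, so in this simplified model the horizontal resolution per eye is halved on top of the brightness loss the barrier causes.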
These restrictions aren't major issues for a single-user, handheld gaming device like the Nintendo 3DS. TVs, on the other hand, are designed for multiple users in brightly lit rooms sitting far from the display. It's not unusual for none of the viewers to be in the sweet spot. Even the most sedentary couch potato will have difficulty sitting relatively still while watching TV. And TVs need to be as bright as possible to overcome the ambient lighting conditions.
Another autostereoscopic technology is the lenticular lens display. This type of display effectively puts the 3D glasses on the TV itself, with a series of very small lenses that refract light to the left and right, so each eye sees only the pixels intended for it. As with other technologies we've discussed, the brain combines the two fields of view and perceives depth.
Since lenticular lens technology doesn't place an opaque physical barrier on the display, it doesn't reduce image brightness. It can also be viewed from a wider angle without losing the 3D effect, and it's more tolerant of viewer movement. Unfortunately, lenticular lens displays remain difficult and very expensive to manufacture.
Integral imaging is similar to the lenticular lens concept in that it places an array of micro-lenses—one lens for each pixel—in front of the display panel, so that each lens produces a different perspective on the image depending on the viewing angle. With this technique, the eye can see not only right and left views of an object, but top and bottom views as well. The downsides to integral imaging are that it reduces contrast, and no one has come up with a cost-effective means of manufacturing the lens array (a feat nature has already accomplished and bestowed on the eyes of house flies and honeybees).
The Current State of Retail 3D
If you can perceive 3D—not everyone can—and you're willing to accept its shortcomings, you can jump into the market now, confident in the knowledge that it's unlikely a major autostereoscopy breakthrough is right around the corner.
That doesn't mean companies will cease their research and development efforts, but we wouldn't be surprised if another decade passes before "glasses-free" 3D becomes a retail reality. And then we'll all start waiting for the first demos of holographic TV.