[Note: This article was written in 2014. Although it is somewhat out of date, it remains a good high-level overview of AR concepts and is still worth reading.]
What is AR?
AR is used to annotate the environment by displaying graphics that show information about an object or a location in the real world. This gives users a clear, unambiguous, head-up way to see information related to their surroundings, so they can focus on the task at hand rather than diverting their attention to maps and other traditional methods of displaying information.
What can AR be used for?
AR is also used to overlay realistic graphics, virtual objects, characters or effects on the environment to create scenery and scenes that assist training exercises or gaming. These features are an added advantage over traditional training or gaming environments that are created for fixed screen laptops or Virtual Reality (VR) setups with opaque displays. These VR reproductions are often low fidelity, inaccurate reproductions of the real world that do not let users experience or see their surrounding environment.
How does it work?
AR visualization systems use a database that describes the graphics to be overlaid on the real world and the locations to which those graphics are attached. The user’s head pose, that is, its position and orientation, is measured by a tracking system, which determines where the user is looking and from what point of view. The rendering system uses this pose, together with the database description of what to display, to generate graphical overlays that match the user’s viewpoint. A see-through display then shows the generated scene on top of the real world. This process runs in real time: the scene is continuously updated as the user changes perspective, providing the illusion that the virtual and the real are attached to each other.
Video vs. Optical see-through displays
There are many different ways to superimpose graphics on top of the real world, but the two primary ones are the video see-through and optical see-through approaches. In the video see-through approach, the real world is captured by a video camera. The graphical scene is then generated by a rendering module that overlays graphics on top of the camera image using chroma-keying or a similar method. The combined video frame is shown to the user, letting the user see the real world through the eye of the camera.
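The chroma-keying step mentioned above can be illustrated with a minimal sketch. It assumes frames are arrays of packed 0xAARRGGBB pixels and that the renderer fills every background pixel with one exact key color; the class and method names are hypothetical, and real systems typically key on a color range rather than on an exact match.

```java
/** Minimal chroma-key compositor for a video see-through frame (illustrative only). */
final class ChromaKeyCompositor {
    /**
     * Returns a new frame in which every overlay pixel equal to keyColor is
     * replaced by the matching camera pixel; all other overlay pixels are kept.
     * Pixels are packed 0xAARRGGBB ints; both frames must have the same length.
     */
    static int[] composite(int[] cameraFrame, int[] overlayFrame, int keyColor) {
        if (cameraFrame.length != overlayFrame.length) {
            throw new IllegalArgumentException("frame sizes differ");
        }
        int[] out = new int[cameraFrame.length];
        for (int i = 0; i < out.length; i++) {
            // Compare RGB only so the alpha channel does not affect keying.
            boolean isKey = (overlayFrame[i] & 0x00FFFFFF) == (keyColor & 0x00FFFFFF);
            out[i] = isKey ? cameraFrame[i] : overlayFrame[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int green = 0xFF00FF00;                      // key color: pure green
        int[] camera  = {0xFF111111, 0xFF222222, 0xFF333333};
        int[] overlay = {green, 0xFFFF0000, green};  // one red graphic pixel
        for (int p : composite(camera, overlay, green)) {
            System.out.println(Integer.toHexString(p));
        }
    }
}
```

Wherever the overlay holds the key color, the camera pixel shows through; everywhere else the rendered graphic replaces the camera image.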
In the optical see-through approach, a combination of real world and virtual scenery generated by the rendering engine is created using an optical combining element. The optical element is similar to a semi-reflective mirror that lets the user both see the real world and the reflection of the augmented image from a micro-display driven by the rendering engine, thereby showing both the real and virtual combined.
The ODG Glasses use an optical combiner that allows users to see the real world through their own eyes instead of through the limited resolution and dynamic range of a video camera. Since each display in the glasses is driven by its own frame buffer, it is possible to show a different image to the left and right eye, naturally merging the graphics shown from two viewpoints into a full 3D percept with the real world. This is not possible with look-aside displays like Google Glass or with wearable displays that cover only one eye (monocular displays), since it is impossible to make an object appear 3D to the brain through only one view. Such displays produce binocular rivalry, which is known to divert the user’s attention.
Marker-based vs Geo-located AR
There are many ways to implement pose tracking using different modalities and sensors. The two main approaches used by the industry and implemented in the ODG Glasses are the marker-based and the geo-located AR approach.
In the marker-based approach, a marker or graphic is presented to a video camera. A computer vision pose estimation technique correlates the video from the camera with the known image or marker and determines the pose of the camera that produced the image. From the pose of the camera, the pose of the viewer is inferred. The 3D scene is then generated to overlay the image. This technique is mostly used on smartphones and tablets, and it requires a physical marker, which may have to be carried around. This approach attaches information to any object that carries the 2D marker image, no matter where that object is placed in the world.
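The step from marker pose to camera pose can be sketched as follows. The sketch assumes the pose estimator returns a rotation matrix R and a translation t that map marker coordinates into camera coordinates (p_cam = R * p_marker + t), which is a common convention but not one the text specifies; the class and method names are hypothetical.

```java
/** Inverts a marker-to-camera rigid transform to get the camera pose in the marker frame. */
final class MarkerPose {
    /** Transposes a 3x3 matrix; for a rotation matrix the transpose is its inverse. */
    static double[][] transpose(double[][] m) {
        double[][] t = new double[3][3];
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                t[i][j] = m[j][i];
            }
        }
        return t;
    }

    /** Camera position expressed in the marker frame: -R^T * t. */
    static double[] cameraPositionInMarkerFrame(double[][] r, double[] t) {
        double[][] rt = transpose(r);
        double[] c = new double[3];
        for (int i = 0; i < 3; i++) {
            double sum = 0.0;
            for (int j = 0; j < 3; j++) {
                sum += rt[i][j] * t[j];
            }
            c[i] = -sum;
        }
        return c;
    }

    public static void main(String[] args) {
        // Example: marker rotated 90 degrees about Z relative to the camera.
        double[][] r = {{0, -1, 0}, {1, 0, 0}, {0, 0, 1}};
        double[] t = {1, 2, 3};
        double[] c = cameraPositionInMarkerFrame(r, t);
        System.out.println(c[0] + ", " + c[1] + ", " + c[2]);
    }
}
```

The camera orientation in the marker frame is the transposed rotation itself, so the same `transpose` helper covers both halves of the inverse pose.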
In its simplest form, geo-located AR uses GPS to determine the position of the user’s field of view and an Inertial Measurement Unit (IMU) to determine its orientation, that is, the pitch, roll, and yaw of the viewer’s field of view. Both sensors are embedded in the ODG Glasses and ready for use. As a user wears the glasses, the sensors measure the pose of the user’s head and the user’s field of view. When using this tracking modality, it is necessary to use a database that geo-locates augmentations with latitude, longitude, and height so the system can place each augmentation at a defined place in the world. This approach can be used to annotate Blue-Force tracking positions or buildings at a known location. One of the drawbacks of this approach is that it works best outdoors, where GPS is available.
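To place a geo-located augmentation, the renderer needs its position relative to the user. A minimal sketch of that conversion is given here, assuming a spherical Earth and short distances (an equirectangular approximation, not necessarily what the glasses' software uses); the class, method, and constant names are hypothetical.

```java
/** Converts a geodetic position into a local East-North-Up offset from the user. */
final class GeoToLocal {
    /** Mean Earth radius in meters (spherical approximation, an assumption of this sketch). */
    static final double EARTH_RADIUS_M = 6_371_000.0;

    /**
     * Returns {east, north, up} in meters from the user to the target.
     * Accurate enough only for targets within a few kilometers of the user.
     */
    static double[] toEastNorthUp(double userLatDeg, double userLonDeg, double userHeightM,
                                  double latDeg, double lonDeg, double heightM) {
        double dLat = Math.toRadians(latDeg - userLatDeg);
        double dLon = Math.toRadians(lonDeg - userLonDeg);
        // Meridians converge toward the poles, so scale longitude by cos(latitude).
        double east = EARTH_RADIUS_M * dLon * Math.cos(Math.toRadians(userLatDeg));
        double north = EARTH_RADIUS_M * dLat;
        double up = heightM - userHeightM;
        return new double[] {east, north, up};
    }

    public static void main(String[] args) {
        // Target 0.001 degrees north of and 10 m above a user standing on the equator.
        double[] enu = toEastNorthUp(0.0, 0.0, 0.0, 0.001, 0.0, 10.0);
        System.out.printf("east=%.2f north=%.2f up=%.2f%n", enu[0], enu[1], enu[2]);
    }
}
```

The resulting offset can be used directly as the augmentation's position in a world frame centered on the user, with the IMU orientation then deciding whether it falls inside the field of view.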
Developing AR applications for the R-6 Glasses
The following sections go through what a developer needs to do to develop AR applications running on the R-6 Glasses. They assume that the developer is familiar with building or modifying either Android OpenGL or Unity applications for Android devices. We first describe how to create the correct projection frustum in both Android and Unity so that the glasses display graphics that overlay correctly on the real world. We then describe how to use marker-based tracking or GPS/INS to display geolocated AR information.
Compose the left and right images
Once the graphics chip has been switched to stereo mode and expects the left and right images to both be composed in the frame buffer, the developer must create this composited frame. This is done by alternately selecting the half of the viewport (or framebuffer) that corresponds to the eye currently being rendered and then drawing the scene. To illustrate where this is done in the code, a snippet of the StereoRender.java code is shown below.
The system calls the OnDrawFrame function whenever the view needs to be refreshed, and the framebuffer is first cleared there. Then, if stereo mode is on, the call to *cam.apply (where * is left or right) sets the view frustum parameters for the corresponding eye, and the scene is drawn for that eye by passing the current eye as a parameter to the drawScene function. The drawScene function in turn first calls the setViewport function with the current eye, which selects the correct half of the viewport.
For example, in an over-under configuration, the call for the left eye ends up executing glViewport ( 0, 0, width, height / 2 ) and the call for the right eye executes glViewport ( 0, height / 2, width, height / 2 ), selecting the top and bottom halves for the left and right eye respectively. This process draws the left scene on top and the right scene on the bottom, creating the composed image that the graphics hardware then splits between the left and right displays to produce the stereo percept. Note that the source code supports an mStereo boolean flag that allows existing applications to be viewed normally using this same rendering loop architecture; in that case the complete viewport is selected before the scene is drawn.
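The viewport selection described above can be sketched without any OpenGL dependency as a function that returns the four glViewport arguments. The rectangles mirror the two calls quoted in the text; the Eye enum and the viewportFor helper are hypothetical names, not the sample's actual setViewport code. Note that glViewport measures y from the lower-left corner of the window, so which half ends up at the top of the physical panel depends on how the platform maps window coordinates to the display, something this sketch does not model.

```java
/** Computes glViewport arguments for an over-under stereo frame (illustrative only). */
final class StereoViewport {
    enum Eye { LEFT, RIGHT }

    /**
     * Returns {x, y, width, height} as they would be passed to glViewport.
     * With stereo off (the mStereo flag in the text) the full viewport is used.
     */
    static int[] viewportFor(Eye eye, boolean stereo, int width, int height) {
        if (!stereo) {
            return new int[] {0, 0, width, height};
        }
        int half = height / 2;
        // Left eye uses the half starting at y = 0, right eye the half starting at height / 2.
        int y = (eye == Eye.LEFT) ? 0 : half;
        return new int[] {0, y, width, half};
    }

    public static void main(String[] args) {
        for (Eye eye : Eye.values()) {
            int[] v = viewportFor(eye, true, 1280, 1440);
            System.out.println(eye + ": " + v[0] + ", " + v[1] + ", " + v[2] + ", " + v[3]);
        }
    }
}
```

In a render loop, each eye's rectangle would be applied just before drawing that eye's scene, so one frame contains both images.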
Setting the view frustum
Once the correct viewport has been selected for each eye (or for a monocular view), the rendering system renders the scene into that viewport region in the drawScene function, following the code shown above. However, for this function to draw the scene from the correct viewpoint, that is, from the left or right eye, the correct view frustum, encoded in the projection matrix, has to be set. This is done by the functions leftCam.apply() and rightCam.apply(), called in the OnDrawFrame function, whose code snippet is inserted below. As can be seen, this function applies a rotation corresponding to the toe-in angle of the view frustum and a translation that sets the origin of the view frustum at each eye.
To understand what is being done here, consider the specific optical system of ODG's R-6 Glasses, shown schematically below. The optical elements placed in front of the user's eyes and supported by the glasses’ frame are angled, or toed in, toward the center so that their optical axes cross at a vergence distance of about 2.5 m. The vergence is set this way so that the region where the user can see full stereo, that is, the region of stereo overlap, is maximized at that working distance.
Because the image of the screens projected by the lenses is quasi-collimated, the eye location does not change the perceived projected image, and therefore the user's actual Inter-Pupillary Distance (IPD) is currently not measured or taken into account in the visualization model. This is valid as long as the working distance and focus distance are greater than arm's length and the overlay does not need to be as precise as what a surgeon might require.
However, for a custom model in which ODG tunes this vergence distance closer, one should consider using the actual IPD of the user. The IPD currently used corresponds to the separation between the centers of the optical lenses, about 65 mm. To take the perspective view of each eye, the code above starts at the display center (the midpoint between the two displays), translates by half the IPD in the direction of the eye (the last glTranslate in the code), and then rotates by the toe-in angle (the glRotate on the lines before). This is done in reverse because OpenGL provides functions that move the scene (we are manipulating the GL_MODELVIEW matrix) rather than the viewpoint, so the reverse motion has to be applied. Once this setup is made, a virtual object can be superimposed on its physical counterpart placed at the same relative position from the display, achieving the overlay effect needed for AR.
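As a worked illustration of the geometry, the toe-in angle can be estimated from the two figures given in the text (a 65 mm lens separation and a 2.5 m vergence distance). The sketch assumes the two optical axes cross on the midline between the eyes, so each axis is rotated by atan((IPD / 2) / vergence distance); this derivation and the names below are assumptions, not values taken from the ODG source code.

```java
/** Estimates the per-eye toe-in angle from lens separation and vergence distance. */
final class ToeInGeometry {
    /**
     * Angle in degrees by which each optical axis is rotated toward the center,
     * assuming both axes cross on the midline at the given vergence distance.
     */
    static double toeInAngleDeg(double ipdMeters, double vergenceMeters) {
        return Math.toDegrees(Math.atan2(ipdMeters / 2.0, vergenceMeters));
    }

    /** Lateral offset of one eye from the display center, in meters. */
    static double eyeOffsetMeters(double ipdMeters) {
        return ipdMeters / 2.0;
    }

    public static void main(String[] args) {
        double ipd = 0.065;      // lens-center separation from the text, in meters
        double vergence = 2.5;   // vergence distance from the text, in meters
        System.out.printf("eye offset: %.4f m%n", eyeOffsetMeters(ipd));
        System.out.printf("toe-in angle: %.4f degrees%n", toeInAngleDeg(ipd, vergence));
    }
}
```

Under these assumptions the toe-in comes out at roughly 0.74 degrees per eye, which gives a sense of how small the rotation applied by the glRotate call would be.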
Note that the existing application being converted to run on the Glasses may use a video preview. This is the case for AR applications running on smartphones, where the user cannot see the real world through the device and a camera video preview must be used to simulate see-through capability. This video preview should be disabled so that the user sees the real world optically through the glasses and there is no confusion between the video of the real world and the actual real-world view.
Once the stereo setup of the display is done, the user can align a virtual scene presented in the display with a real scene of the same geometry, as long as the real and virtual scenes are at the same relative place, or pose, with respect to the display. This can be done by placing the virtual scene at the same distance and then moving the head, and with it the display, until the virtual and real scenes align. However, to fully implement AR, this alignment should happen automatically: as the user changes position and orientation, the scene should be updated accordingly. This is the role of the tracking system, which must track the pose of the device in the real world and change the pose of the view frustum in the virtual scene to match, so that the virtual/real alignment is maintained.
A simple tracking strategy, assuming that the user stays at the same position, is orientation-only tracking. This lets the glasses capture the orientation of the user's head but not its position. It works well for visualizations where the position is assumed to be fixed and the user is only looking around, for example when virtual objects need to be shown around the user, perhaps to create some kind of user-attached 3D GUI. In the ODGStereoTester sample, this tracking strategy is used to show the image of two soldiers placed in front of the user. Pressing the A key on a keyboard while running this sample re-aligns the scenery to be directly in front of the display, no matter where the user is looking. Note that this tracking capability could easily be modified to track the position of the user as well, using GPS; however, the example does not yet include this capability.
Understanding the sensors included in an Android device, how to turn them on, and how to extract information from them is outside the scope of this document; the developer can learn more here: http://developer.android.com/reference/android/hardware/Sensor.html
The example below shows where this is done in the ODGStereoTester example and how the measurement is used to drive the frustum orientation in this sample app. First, in the main Activity in cube/CuboidActivity.java, we obtain the sensor manager and then the sensor of TYPE_ORIENTATION. We then register a listener for that sensor, polling at the fastest update rate to ensure the most responsive tracking, and declare the callback to call when there is a new sensor update.
Then, in the onSensorChanged function, we check that the sensor data comes from that specific sensor type. If it does, we send the orientation to the renderer to drive the viewpoint, after reordering the values so that they match the order in which the extracted Euler angles must be applied to obtain the correct rotation.
Then, in the renderer implemented in StereoRenderer.java, we receive these values, set the Euler angle array, and offset the values by a pitch bias and an azimuth bias. The pitch bias is not used in the sample, but the azimuth bias is used to reset the azimuth every time the “A” key is pressed in the application, which in turn resets the scene to be in front of the user.
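The bias handling described above can be sketched in plain Java. This is not the StereoRenderer.java code: the class, field, and method names are hypothetical, and the sketch assumes angles in degrees with the azimuth wrapped to the range [-180, 180) after the bias is subtracted.

```java
/** Keeps biased Euler angles and lets the caller re-center the azimuth (illustrative only). */
final class OrientationBias {
    private double azimuthBiasDeg = 0.0;
    private double pitchBiasDeg = 0.0;   // kept for symmetry; the text says it is unused in the sample
    private double lastRawAzimuthDeg = 0.0;
    private final double[] eulerDeg = new double[3]; // {azimuth, pitch, roll}

    /** Wraps an angle in degrees to the range [-180, 180). */
    static double wrap(double deg) {
        return ((deg + 180.0) % 360.0 + 360.0) % 360.0 - 180.0;
    }

    /** Stores a new sensor reading with the current biases removed. */
    void onOrientation(double azimuthDeg, double pitchDeg, double rollDeg) {
        lastRawAzimuthDeg = azimuthDeg;
        eulerDeg[0] = wrap(azimuthDeg - azimuthBiasDeg);
        eulerDeg[1] = pitchDeg - pitchBiasDeg;
        eulerDeg[2] = rollDeg;
    }

    /** Equivalent of pressing the A key: the current heading becomes azimuth zero. */
    void recenter() {
        azimuthBiasDeg = lastRawAzimuthDeg;
        eulerDeg[0] = 0.0;
    }

    double azimuthDeg() { return eulerDeg[0]; }
    double pitchDeg()   { return eulerDeg[1]; }
    double rollDeg()    { return eulerDeg[2]; }

    public static void main(String[] args) {
        OrientationBias o = new OrientationBias();
        o.onOrientation(90.0, 5.0, 0.0);
        o.recenter();                      // user presses A while facing 90 degrees
        o.onOrientation(100.0, 5.0, 0.0);  // user turns 10 degrees further
        System.out.println(o.azimuthDeg());
    }
}
```

After re-centering, the renderer only sees the heading change relative to the moment the key was pressed, which is what brings the scene back in front of the user.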
Finally, before drawing the scene, we apply the Euler angles to the viewpoint by successively rotating around the Z, Y, and X axes to apply the azimuth, roll, and pitch respectively, which rotates the scene and keeps it aligned with the real world. Note again that in this specific example the user cannot move, since the location is not measured, for example using GPS.
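The successive rotations can be written out as a single matrix product. The sketch below composes Rz(azimuth) * Ry(roll) * Rx(pitch), which is the matrix that three successive glRotate-style calls about Z, Y, and X would accumulate; the sign conventions (right-handed, counter-clockwise positive) and all names are assumptions, and the sample may negate angles to move the scene opposite to the head.

```java
/** Builds a rotation from azimuth, roll and pitch by rotating about Z, then Y, then X. */
final class EulerRotation {
    static double[][] rotZ(double deg) {
        double c = Math.cos(Math.toRadians(deg)), s = Math.sin(Math.toRadians(deg));
        return new double[][] {{c, -s, 0}, {s, c, 0}, {0, 0, 1}};
    }

    static double[][] rotY(double deg) {
        double c = Math.cos(Math.toRadians(deg)), s = Math.sin(Math.toRadians(deg));
        return new double[][] {{c, 0, s}, {0, 1, 0}, {-s, 0, c}};
    }

    static double[][] rotX(double deg) {
        double c = Math.cos(Math.toRadians(deg)), s = Math.sin(Math.toRadians(deg));
        return new double[][] {{1, 0, 0}, {0, c, -s}, {0, s, c}};
    }

    /** Standard 3x3 matrix product a * b. */
    static double[][] multiply(double[][] a, double[][] b) {
        double[][] out = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 3; k++)
                    out[i][j] += a[i][k] * b[k][j];
        return out;
    }

    /** Azimuth about Z, roll about Y, pitch about X, in the order given in the text. */
    static double[][] fromAzimuthRollPitch(double azimuthDeg, double rollDeg, double pitchDeg) {
        return multiply(multiply(rotZ(azimuthDeg), rotY(rollDeg)), rotX(pitchDeg));
    }

    /** Applies a 3x3 matrix to a 3-vector. */
    static double[] apply(double[][] m, double[] v) {
        double[] out = new double[3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                out[i] += m[i][j] * v[j];
        return out;
    }

    public static void main(String[] args) {
        double[] v = apply(fromAzimuthRollPitch(90.0, 0.0, 0.0), new double[] {1, 0, 0});
        System.out.printf("%.3f %.3f %.3f%n", v[0], v[1], v[2]);
    }
}
```

Because rotations do not commute, changing the order of the three factors gives a different orientation, which is why the sensor values have to be reordered to match the rotation order before they are applied.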