A Wizard of Oz Study for an AR Multimodal Interface
In this paper we describe a Wizard of Oz (WOz) user study of an Augmented Reality (AR) interface that uses multimodal input (MMI) combining natural hand interaction and speech commands. Augmented Reality technology creates the illusion that virtual objects are part of the user's real environment, so the goal of AR systems is to provide users with information-enhanced environments in which the real and virtual worlds are seamlessly connected. Achieving this requires not only accurate tracking and registration to align real and virtual objects, but also an interface that supports interaction across both worlds. There has been a significant amount of research on AR interaction techniques, but very little on multimodal interfaces for AR, and none of it has been based on a Wizard of Oz study. Accordingly, our goal is to use a WOz study to help create a multimodal AR interface that is as natural as possible for the user. In this study we wanted to learn how users would issue multimodal commands, and how different AR display conditions would affect those commands, when users were given no predefined command set but experienced perfect speech and gesture recognition. We used three virtual object-arranging tasks with two display types (a head-mounted display and a desktop monitor) to observe the multimodal commands users issued and how the display conditions affected them. The three tasks were (1) changing the colour and shape of simple primitives and copying them to match a target object configuration, (2) moving sample objects distributed in 3D space into a final arrangement, and (3) creating a virtual scene by arranging detailed models as the users wished. Subjects filled out surveys after each condition and their performance was recorded on video for later analysis. We also interviewed subjects to gather additional comments.
The results provided valuable insights into how people naturally interact in a multimodal AR scene-assembly task. Video analysis showed that the main types of speech input were words for colour and object shape; 74% of all speech commands were phrases of a few discrete words, while only 26% were complete sentences. We also found that the main classes of gesture were deictic (65%) and metaphoric (35%) gestures. Commands combining speech and gesture input made up 63% of all commands, gesture-only commands 34%, and speech-only commands 3.7%. This implies that multimodal AR interfaces for object manipulation will rely heavily on accurate recognition of users' input gestures: almost 97% of commands involved some gesture input. We also found that, overall, 94% of the time the gesture was issued before the corresponding speech input in a multimodal command in the AR environment. When considering fusion of speech and gesture commands, we defined a time frame for combining gesture and speech input and found an optimal window of 7.9 seconds, which would capture 98% of the related combined speech and gesture input. We also found that display type did not produce a significant difference in the types of commands used, although users felt that the screen-based AR application provided a better experience. Based on these results, we present design recommendations for multimodal interaction in AR environments, which will be useful for others developing multimodal AR interfaces. In the future we will use these WOz results to create a functioning multimodal AR interface.
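To illustrate how the fusion findings could inform an implementation, the following is a minimal sketch of a rule-based combiner that pairs each gesture with the next speech event arriving within the fusion window. Only the 7.9-second window comes from the study; the event structure, field names, and the pairing rule itself are our assumptions, not the paper's system.

```python
from dataclasses import dataclass

@dataclass
class InputEvent:
    modality: str    # "speech" or "gesture" (hypothetical labels)
    content: str     # recognised command text or gesture label
    timestamp: float # seconds since session start

# From the study: a 7.9 s window captured 98% of related speech/gesture pairs.
FUSION_WINDOW_S = 7.9

def fuse_commands(events):
    """Pair each gesture with the first subsequent speech event inside the
    fusion window (gestures tended to precede speech in the study);
    anything left unpaired is treated as a unimodal command."""
    events = sorted(events, key=lambda e: e.timestamp)
    fused, used = [], set()
    for i, ev in enumerate(events):
        if ev.modality != "gesture" or i in used:
            continue
        for j in range(i + 1, len(events)):
            other = events[j]
            if j in used or other.modality != "speech":
                continue
            if other.timestamp - ev.timestamp <= FUSION_WINDOW_S:
                fused.append(("multimodal", ev.content, other.content))
                used.update({i, j})
            break  # only the nearest following speech event is considered
    # Remaining events become unimodal commands.
    for i, ev in enumerate(events):
        if i not in used:
            fused.append((ev.modality, ev.content))
    return fused
```

For example, a pointing gesture at t = 0.0 s followed by "make it red" at t = 2.1 s would fuse into one multimodal command, while a gesture with no speech inside the window would pass through as a gesture-only command.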