New to Beyond Casual? – start from part 1!

Part 7: “Me Tarzan, You Jane?”

“Johnny-Cab” / Total Recall (1990)

Tarzan the Ape Man (1932)
From IMDb:
…At no point in this movie is the line "Me Tarzan, you Jane" spoken. When Jane and Tarzan meet, it is she who initiates the verbal exchange, repeatedly indicating herself and giving her name until he repeats it. She then points to him, indicating that she wants to know if there's a word for who he is as "Jane" is the word for who she is, until eventually he understands and says, "Tarzan." 

It seems human communication combines words and gestures. Could Tarzan and Jane communicate by voice alone? And how did he get that cool haircut living in the jungle? Man, it does not make any sense.

Jungle Hunt (1982) …totally unrelated, sorry :)

The story of the personal virtual assistant

When I talk with people about natural user interfaces (NUI), some of them think that voice is the answer to everything. “After all, talking is the most natural thing, right?” Yes, talking is natural, no argument about it. But verbal language is only part of the picture.

2001: A Space Odyssey (1968) 
Sci-fi has always loved the idea of a digital servant that helps us with our day-to-day tasks, and many wonder why this has not become our common everyday interface with machines. The full story is not only about the accuracy of speech recognition algorithms. Having a butler who follows you around is not a pleasant experience if that butler is incapable. In order to really help you, he needs to be able to do things, and he needs a solid understanding of his master and the surroundings.

Perhaps this is the main reason why Apple’s Siri is actually meaningful: for the first time, the assistant has solid context and capabilities. Siri knows me, where I am, and who the people in my contact list are, and she can send messages, add reminders and even solve algebra (with the help of Wolfram Alpha). Apple managed to bring the context to an interesting level.

Gesture + Voice

Adults use voice to communicate. But if you close your eyes, your comprehension level will drop significantly. We look at each other and use our whole body language when we communicate. In many cases, body language holds more information than the spoken word.
Imagine you go shopping: when asked which shoe you want, you will probably just say ‘this one’. The information is carried by your pointing gesture and the surrounding context.

Now - back to reality.

Imagine an application that asks you which item you want, and you just point to it and say ‘this!’. Hey, I don’t even need voice recognition for that! Just detecting the pointing gesture together with a synchronized vocal burst might be enough. And it will work in any language! (Just like Tarzan and Jane…) And if you don’t want to give up on the decades of work done in speech recognition, you can use it on top to dramatically improve accuracy. Body language is a huge part of our communication context.
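To make the "point and say ‘this!’" idea a bit more concrete, here is a minimal sketch of how the two signals could be fused. Everything in it is an assumption for illustration: `PointingEvent`, `select_target` and the 0.4 second window are made-up names and values, and the gesture tracker and the vocal burst detector that would feed it are assumed to exist elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PointingEvent:
    time: float   # seconds, when the pointing gesture was detected
    target: str   # item a (hypothetical) gesture tracker says is pointed at


def select_target(pointing: List[PointingEvent],
                  burst_times: List[float],
                  max_gap: float = 0.4) -> Optional[str]:
    """Return the target of the pointing gesture closest in time to a vocal burst.

    Returns None when no gesture and burst occur within max_gap seconds.
    """
    best: Optional[PointingEvent] = None
    best_gap = max_gap
    for event in pointing:
        for t in burst_times:
            gap = abs(event.time - t)
            if gap <= best_gap:
                best, best_gap = event, gap
    return best.target if best else None


# The user points at two shoes but only says 'this!' around the second one.
gestures = [PointingEvent(1.0, "red shoe"), PointingEvent(3.0, "blue shoe")]
print(select_target(gestures, burst_times=[3.1]))  # blue shoe
```

Note that nothing here knows what was said, only when something was said, which is why it would work in any language.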

Assistant vs. tool

Another way to look at modern life is that we all want to be served, just like the kings of past centuries. But now everyone gets to be a king!
A dining king

So imagine the king and queen sitting together to dine. They have something like 5 cooks and 10 waiters. This staff is responsible for doing what is hard (e.g. preparing a steak) or bringing what is out of reach. But once the food is on his plate, the king will prefer to take the fork and bring the food to his mouth by himself (asking a servant to do it would be awkward and freaky).
Yes, sometimes we prefer to do stuff ourselves. In such cases, we prefer to use tools.

 Making sounds while playing
Actually, I could continue the philosophical MMI discussion for an hour, but you guys read this to have fun, right? Let’s bring on the gaming!
Only a few game experiences combine gestures and voice. Some examples:

But let’s just stretch our imagination a bit further…

"Bang bang, my baby shot me down"

If you look at kids when they play, you will notice that many of their games are actually role playing. They imagine they are some hero and try to imitate the appropriate comic-book gestures. It does not end with gestures: they also imitate the sound effects!
·         Cowboys yell ‘bang bang’ as they shoot
·         Kung-fu masters make a ‘shhhhffff’ sound to simulate impossibly quick karate chops
·         Wizards and other supernatural beings make all sorts of sounds (KAMEHAMEHA!)
By analyzing the audio stream, we can detect sounds that are coordinated with a gesture and give them some meaning:
·         Karate chops and kicks get more powerful when accompanied by appropriate sounds: you see white trail effects and the hits actually inflict more damage!
·         A boxing hit explodes when the user says ‘boom’
·         A tennis stroke gets emphasized when the player shouts on the hit
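As a rough illustration of "analyzing the audio stream", the sketch below marks loud frames using plain RMS energy and boosts a hit when a loud frame lands close to it. This is only one possible approach, not how any particular game does it; the function names, the frame size, the threshold and the 1.5x bonus are all assumptions chosen for the example.

```python
import math
from typing import List


def loud_frame_times(samples: List[float], sample_rate: int,
                     frame_len: int = 400, threshold: float = 0.3) -> List[float]:
    """Return the start time (seconds) of every frame whose RMS exceeds threshold."""
    times = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms > threshold:
            times.append(start / sample_rate)
    return times


def hit_damage(base_damage: float, hit_time: float, loud_times: List[float],
               window: float = 0.2, bonus: float = 1.5) -> float:
    """Boost the damage when a loud frame falls within `window` seconds of the hit."""
    shouted = any(abs(t - hit_time) <= window for t in loud_times)
    return base_damage * bonus if shouted else base_damage


# One second of silence, a short shout, then silence again (mono, 8 kHz).
audio = [0.0] * 8000 + [0.8] * 800 + [0.0] * 8000
shouts = loud_frame_times(audio, sample_rate=8000)
print(hit_damage(10.0, hit_time=1.1, loud_times=shouts))  # 15.0 (shout near the hit)
print(hit_damage(10.0, hit_time=3.0, loud_times=shouts))  # 10.0 (silent hit)
```

A real game would probably want something smarter than raw loudness (telling ‘boom’ from a cough, for example), but even this crude check is enough to reward a player who shouts on the hit.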

Instead of ‘collecting’ magic spells and scrolls, the master wizard can show you how to move and what to say in order to invoke magic!
This way you actually learn the magic spells that work in the imaginary virtual world of the game. The learning really happens in the player’s mind, just as it is imagined in the fantasy story!

Another example is triggering a ‘bullet time’ slow-motion scene using sounds.
Imagine the player encounters many enemies coming towards him. He stands in a battle pose and starts saying ‘ta-ka-ta-ka-ta-ka’. Then the system takes over the ticking sounds. The enemies and world physics are now in slow motion, so the player can easily hit all the enemies. After a timeout, world time returns to normal and all the enemies fall to the floor together!
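The bullet time scenario is basically a small state machine, so here is a hedged sketch of it. The class name `BulletTime`, the four-syllable trigger, the timings and the 0.2 time scale are all invented for illustration, and the syllable detector and pose detector that would call it are assumed to exist.

```python
class BulletTime:
    """Tiny state machine: rhythmic syllables in a battle pose trigger slow motion."""

    def __init__(self, syllables_needed=4, max_gap=0.5, duration=5.0, slow_scale=0.2):
        self.syllables_needed = syllables_needed  # e.g. 'ta-ka-ta-ka'
        self.max_gap = max_gap        # max seconds between consecutive syllables
        self.duration = duration      # how long slow motion lasts, in real seconds
        self.slow_scale = slow_scale  # world-time multiplier while active
        self.count = 0
        self.last_syllable = None
        self.remaining = 0.0

    def on_syllable(self, now, in_battle_pose):
        """Call for every detected syllable onset (the detector is assumed)."""
        if self.remaining > 0:
            return  # already in slow motion
        if not in_battle_pose:
            self.count = 0
            return
        if self.last_syllable is not None and now - self.last_syllable > self.max_gap:
            self.count = 0  # rhythm broken, start counting again
        self.count += 1
        self.last_syllable = now
        if self.count >= self.syllables_needed:
            self.remaining = self.duration
            self.count = 0

    def update(self, dt):
        """Advance by dt real seconds; return the world-time scale for this step."""
        if self.remaining > 0:
            self.remaining = max(0.0, self.remaining - dt)
            return self.slow_scale
        return 1.0


bt = BulletTime()
for t in (0.0, 0.2, 0.4, 0.6):      # 'ta-ka-ta-ka' while in a battle pose
    bt.on_syllable(t, in_battle_pose=True)
print(bt.update(1.0))  # 0.2 (slow motion is active)
print(bt.update(4.0))  # 0.2 (last slow step, the timeout is now used up)
print(bt.update(0.1))  # 1.0 (world time is back to normal)
```

The game loop would multiply its physics time step by the value returned from `update`, and the "all enemies fall together" moment would be fired when the scale goes back to 1.0.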


  1. 1. i believe in multiple levels of interaction in order to complete a full and immersive experience (like that video i linked to that other post).
    therefore, i believe in voice recognition in parallel to pointing UI and body language recognition.
    the examples you talked about at the end of your post present that kind of interface. i also thought about doing magic using the humming sound wizards make while they are preparing the spell. :)

    2. i agree with the assumption about Siri's success. this is maybe why Android's Evi still has some development time ahead before it gains more success.

    3. i am one of the readers who would be happy to hear more of that psychology of the human mind. :)

    4. israeli voice controlled game:

  2. Hi Shachar - it's a great example!
    Yes - it's funny - but it's the same psychological root from the discussion (in many aspects, humour works on the same principles as illusions)