A decade since touch screens became ubiquitous in phones, we still generally interact with mobile apps using only minor variations of a few gestures: tapping, panning, zooming and rotating.
Recently, I explored a highly underutilized class of gestures. The approach I took to detecting them in-app has several important advantages over the state-of-the-art techniques.
The idea originally came to me while imagining a puzzle game in which the user solves physical challenges by designing machines out of simple components. Sort of like Besiege but 2D, mobile and focused on solving physical puzzles rather than medieval warfare.
Instead of cluttering the small mobile screen with buttons for switching tools, I thought the game could have different gestures for different tools or components. For instance:
- Coil shapes or zigzags for springs
- Little circles for axles
- U shapes for grabbers or magnets
On top of cleaning up the UI, this could save the user time that they would have spent looking for buttons, navigating submenus and moving their finger back and forth to work on something.
While pan and rotation gestures can be recognized using basic geometry, more complex gestures like these are tricky.
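For contrast, here's roughly what "basic geometry" means for a rotation gesture: a minimal Python sketch (illustrative only — the app itself is iOS code) that computes the rotation angle from two successive pairs of two-finger touch positions.

```python
import math

def rotation_angle(p0, p1, q0, q1):
    """Angle (radians) the user rotated between frames: p0/p1 are the
    previous positions of the two touches, q0/q1 the current ones."""
    a0 = math.atan2(p1[1] - p0[1], p1[0] - p0[0])
    a1 = math.atan2(q1[1] - q0[1], q1[0] - q0[0])
    # Normalize the difference to (-pi, pi] so a small backwards
    # rotation isn't reported as a near-full turn
    d = a1 - a0
    while d <= -math.pi:
        d += 2 * math.pi
    while d > math.pi:
        d -= 2 * math.pi
    return d
```

That's the whole recognizer — a couple of `atan2` calls. Nothing this simple exists for a check mark or a coil.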
Detecting complex gestures naively
It’s tempting to try to handcraft algorithms to identify each gesture you intend to use in your app. For something like a check mark, one might use the following rules:
- The touch moves left to right.
- The touch moves down and then up.
- The end of the touch is higher than the start.
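To make the pitfalls concrete, here's what those three rules look like as code — a hypothetical `looks_like_check_mark` helper, not anything from the app. Note that iOS screen coordinates grow downward, so "higher on screen" means smaller y.

```python
def looks_like_check_mark(points):
    """Naive rule-based detector; points is a list of (x, y) tuples
    with y increasing downward, as on iOS. Illustrative only --
    these rules fail in exactly the ways described below."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Rule 1: the touch moves left to right
    moves_left_to_right = all(b >= a for a, b in zip(xs, xs[1:]))
    # Rule 2: the touch moves down to the lowest point, then up
    bottom = ys.index(max(ys))
    down_then_up = (
        all(b >= a for a, b in zip(ys[:bottom + 1], ys[1:bottom + 1])) and
        all(b <= a for a, b in zip(ys[bottom:], ys[bottom + 1:]))
    )
    # Rule 3: the end of the touch is higher than the start
    ends_higher_than_start = ys[-1] < ys[0]
    return moves_left_to_right and down_then_up and ends_higher_than_start
```

A left-handed check mark drawn right to left fails rule 1 immediately, and a plain diagonal swipe up and to the right passes all three rules.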
Indeed, the official iOS tutorial on custom gesture recognizers suggests these rules. While they may seem to describe a check mark, consider these issues:
- Left-handed writers generally prefer to make strokes from right to left, so most of them draw a check mark from right to left.
- At the bottom of the stroke, the user may essentially stop moving their finger. If their finger moves even slightly to the left, or slightly up and then down, rules 1 or 2 can fail.
- A lot of shapes that are not check marks also satisfy these rules. Where users can enter any of a number of different gestures, you need to be able to distinguish between them reliably.
And these problems arise for a simple check mark. Imagine what can go wrong with more complex gestures, perhaps with multiple strokes, that all have to be distinguished from each other.
The state-of-the-art for complex gesture recognition on mobile devices seems to be an algorithm called $P. $P is the newest in the “dollar family” of gesture recognizers developed by researchers at the University of Washington.
$P’s main advantage is that all the code needed to make it work is short and simple. That makes it easy to write new implementations, and easy to allow the user to make new gestures at runtime. $P can also recognize gestures regardless of their orientation.
While $P performs decently, I don’t think it’s good enough to rely on in a real mobile app for a few reasons. First, it’s not very accurate.
The examples above make $P look worse than it actually is, but I still don’t think an app can afford errors like those.
It’s possible to improve $P’s accuracy by giving it more gesture templates, but it won’t generalize from those templates as well as the solution you’re about to see. The algorithm also becomes slower to evaluate as you add more templates.
Another limitation of $P is its inability to extract high-level features, which are sometimes the only way of detecting a gesture.
My approach using convolutional neural networks
A robust and flexible approach to detecting complex gestures is by posing the detection problem as a machine learning problem. There are a number of ways to do this. I tried out the simplest reasonable one I could think of:
- Scale and translate the user’s gesture to fit in a fixed-size box.
- Convert the strokes into a grayscale image.
- Use that image as input to a convolutional neural network (CNN).
This converts the problem into an image classification problem, which CNNs solve extremely well.
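As an illustration of steps 1 and 2, here's a simplified Python sketch that fits strokes into a fixed-size box and rasterizes them. The image size, margin, and point-sampling approach are all assumptions; a real implementation would presumably draw anti-aliased line segments.

```python
import numpy as np

def rasterize(strokes, size=28, margin=2):
    """Render strokes (each a list of (x, y) points) into a size x size
    grayscale image, uniformly scaled and centered to fit the box."""
    pts = np.array([p for s in strokes for p in s], dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = max((hi - lo).max(), 1e-6)        # uniform scale preserves aspect ratio
    scale = (size - 2 * margin) / span
    center = lo + (hi - lo) / 2              # center of the bounding box
    img = np.zeros((size, size), dtype=np.float32)
    for stroke in strokes:
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            # Sample along each segment and light up the pixels it crosses
            for t in np.linspace(0.0, 1.0, 32):
                x = (x0 + t * (x1 - x0) - center[0]) * scale + size / 2
                y = (y0 + t * (y1 - y0) - center[1]) * scale + size / 2
                img[int(round(y)), int(round(x))] = 1.0
    return img
```

Because the scaling is uniform, a tall narrow gesture stays tall and narrow — only its overall size and position are normalized away.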
Like any machine learning algorithm, my network needed examples (gestures) to learn from. To make the data as realistic as possible, I wrote an iOS app for inputting and maintaining a data set of gestures on the same touch screen where the network would eventually be used.
While I’ll cover the technical details of the implementation in my next article, here’s a summary:
- The app shown above saves the raw touch data in one file and generates a second file with all the rasterized images.
- Python scripts shuffle the rasterized data, split it into a training set and a test set, and convert the sets into a format that’s easy to input into Tensorflow.
- The convolutional neural network is designed and trained using Tensorflow.
- To run on iOS, I export a Core ML mlmodel file using Core ML’s protobuf specification (new in iOS 11). For supporting iOS 10, see iOS machine learning APIs. On Android you can use the official Tensorflow API.
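The shuffle-and-split step could look something like this NumPy sketch. The 85/15 default mirrors the split used later in the article; the function name and everything else are assumptions.

```python
import numpy as np

def split_dataset(images, labels, train_fraction=0.85, seed=0):
    """Shuffle rasterized drawings and split them into a training set
    and a test set with no overlap."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_train = int(len(images) * train_fraction)
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (images[train_idx], labels[train_idx],
            images[test_idx], labels[test_idx])
```

Shuffling before splitting matters here: the drawings were entered gesture by gesture, so an unshuffled split would put whole classes in one set or the other.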
With this approach it doesn’t matter how many strokes you use in your gesture, where you start or finish, or how the speed of your touch varies throughout your strokes. Only the image matters.
This throws away the timing information associated with the user’s gesture. But it sort of makes sense to use an image-based approach when the gesturing user has an image in mind, as when they’re drawing hearts or check marks.
One weakness is that it would be difficult to allow a user to define their own gesture at runtime, since most mobile machine learning frameworks can only evaluate, not train, neural networks. I think this is a rare use case.
I was dumbfounded at how well the neural network performed. Training it on 85% of the 5233 drawings in my data set results in 99.87% accuracy on the remaining 15% (the test set). That means it makes 1 error out of 785 test set drawings. A lot of those drawings are very ugly so this seems miraculous to me.
Note: You don’t need nearly 5233 drawings to get similar accuracy. When I first created a data set, I significantly overestimated how many drawings I’d need, and spent all day drawing 5011 instances of 11 gestures (about 455 each).
Using just 60 images per gesture for training, the network still reached about 99.4% accuracy on the remaining unused images. I think about 100 drawings per gesture may be a good number, depending on the complexity of the gestures.
The network is robust to length changes, proportion changes and small rotations. While it’s not invariant to all rotations like $P, there are several ways to imbue a CNN with that property. The simplest method is to randomly rotate the gestures during training, but more effective techniques exist.
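The random-rotation augmentation mentioned above might be sketched like this. The rotation bound and the choice to pivot around the gesture's centroid are assumptions.

```python
import numpy as np

def randomly_rotate(points, max_degrees=15.0, rng=None):
    """Augmentation sketch: rotate a gesture's points about their
    centroid by a random angle before rasterizing, so the trained
    CNN tolerates small rotations."""
    if rng is None:
        rng = np.random.default_rng()
    theta = np.radians(rng.uniform(-max_degrees, max_degrees))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    # Rotate each row vector about the centroid
    return (pts - center) @ rot.T + center
```

Applying this to every training example on each epoch teaches the network that a slightly tilted heart is still a heart, without collecting any extra drawings.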
Speed-wise, the app takes about 4 ms to generate an image for the gesture and 7 ms to evaluate the neural network on the (legacy) Apple A8 chip. That’s without trying much to optimize it. Because of the nature of neural networks, a large number of new gestures can be added with little required increase in the size of the network.
Clearly adopting these types of gestures is more of a UI/UX problem than a technical hurdle; the technology is there. I’m excited to see if I can find ways to use them in my clients’ apps.
That’s not to say that the UI/UX problem is trivial. Though I present some ideas here on when it might make sense to use gestures like these, much additional thought is needed. If you have ideas of your own, do share!
As a guideline, it may be a good idea to use gestures like these to perform actions in your app if one or more of the following is true:
- You can think of intuitive, simple gestures to go with each action, making them easy for the user to remember and quick to apply.
- The actions are performed at a specific location on-screen (very common in games). The actions may also have associated orientations, lengths, etc. that could be extracted from the same gesture. A hack and slash game could use one finger to move the character and the other to execute a variety of ranged attacks like spells.
- Making buttons for all possible actions would unacceptably clutter the screen or require too much navigation to submenus.
You’ll also need to make sure you have a good way for users to learn and review what gestures they can make.
It’s difficult to convey to blind users what motion to make. To help with accessibility, you can have submenus that enable the same actions as your gestures.
Dealing with scroll views
It can be tricky to use complex gestures inside scroll views since they interpret any movement of a user’s touch as scrolling.
One idea is to have a temporary “drawing mode” that activates when the user puts the large flat part of their thumb anywhere on the screen. They would briefly have a chance then to gesture some action. The location of the flat tap could also be used in association with the action (e.g. scribble to delete the item that was tapped).
The current implementation is very good at distinguishing between the 13 symbols I gave it—exactly what it was trained to do. But it’s not so easy with this setup to decide whether a gesture that the user drew is any of those symbols at all.
Suppose a user draws randomly and the network decides that the closest thing to their gesture is a symbol that represents deletion. I’d rather not have the app actually delete something.
We can solve that problem by adding a class that represents invalid gestures. For that class, we gather a variety of symbols and drawings that are not check marks, hearts or any of our other symbols. If this is done correctly, a class should receive a high score from the neural network only when a gesture closely resembles it.
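Combined with a confidence threshold, the rejection logic might look like this sketch; the label names and threshold value are illustrative, not from the app.

```python
import numpy as np

def classify_with_rejection(probs, labels, invalid_label="invalid",
                            threshold=0.8):
    """Given the network's softmax probabilities, return a gesture label
    only when the top class is a real gesture and scores above the
    threshold; otherwise report no gesture (None)."""
    top = int(np.argmax(probs))
    if labels[top] == invalid_label or probs[top] < threshold:
        return None
    return labels[top]
```

With this in place, a random scribble either lands in the invalid class or fails the threshold, and the app never deletes something on a low-confidence guess.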
A more sophisticated neural network might also use velocity or acceleration data to detect motion-based gestures that produce messy images, like continual circular motions. That network could even be combined with the image-based one by concatenating their layers toward the end of the network.
Some apps like games may need to determine not just that the user made a gesture, or the positions where they started or finished (which are easy to get), but additional information as well. For example, for a V shape gesture we might want to know the location of the vertex and which direction the V points. I have some ideas for solving this problem that might be fun to explore.
By refining the tools I made here, I think the barrier to adopting these ideas in a new app could be made very small. Once set up it only takes about 20 minutes to add a new gesture (input 100 images, train to 99.5+% accuracy, and export model).
Stay tuned for part 2, where I go into more technical detail on this implementation. In the meantime, here’s the source code.