Sunday, June 19, 2016

Wanted to reread some Sensory Inc material...


  1. Deploying speech recognition locally versus the cloud
    Todd Mozer, Sensory, Inc.

    Much of the attention lately has gone to cloud-based recognizers deploying "deep learning" on big data.

    Although it's often out of the spotlight, there's been lots of progress in speech recognition for embedded systems. In fact, most of the major speech engines deploy a combination of embedded plus cloud-based recognition. This is most noticeable in commands like "Hey Siri," "OK Google," "Hey Cortana," "Hi Galaxy," and "Alexa." All of these cloud-based recognition systems use embedded "trigger" phrases to open the cloud connection and ready the system for speech recognition.

    Embedded trigger phrases offer a few improvements and practicalities over cloud-based approaches. For one, having an embedded recognizer "always on" is a lot less creepy than having your conversations go up to the cloud for Google and others to analyze any way they want. Since it's on-device, no speech is recorded or transmitted until the trigger phrase is spoken, and the trigger listening is done in real time without your speech being sent off.

    There are also practical reasons for an embedded wake-up trigger, and a leading one is power consumption. Running exclusively in the cloud would require lots of data transfer and analysis, making a battery-operated or "green" product impractical. Many major DSP companies have solutions for "always on" DSPs that run Sensory's TrulyHandsfree wake-up trigger options at 2 mA or less. With sound activity detection schemes, the average battery drain can be under 1 mA, placing it in the realm of battery leakage.
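
    To make the duty-cycling arithmetic concrete, here is a minimal sketch (not Sensory's implementation; the sleep current, energy threshold, and simulated audio are all assumptions for illustration) of a cheap sound-activity gate in front of the wake-word engine:

    ```python
    import numpy as np

    # Illustrative current figures only: the article cites ~2 mA for an
    # always-on trigger; the sleep figure is an assumption, not a spec.
    I_TRIGGER_MA = 2.0   # wake-word engine running
    I_SLEEP_MA = 0.1     # only the cheap sound activity detector running

    def sound_activity(frame, energy_threshold=1e-3):
        """Cheap energy-based sound activity detection (SAD)."""
        return float(np.mean(frame ** 2)) > energy_threshold

    def average_drain_ma(frames):
        """Average current if the trigger engine runs only on active frames."""
        duty = sum(sound_activity(f) for f in frames) / len(frames)
        return duty * I_TRIGGER_MA + (1 - duty) * I_SLEEP_MA

    # Simulate mostly-quiet audio: 1 in 20 frames has speech-like energy.
    rng = np.random.default_rng(0)
    frames = [rng.normal(0.0, 0.5 if i % 20 == 0 else 0.001, size=160)
              for i in range(1000)]
    print(f"average drain ~{average_drain_ma(frames):.2f} mA")  # well under 1 mA
    ```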

    Other popular uses of embedded speech recognition are in devices that want fast and accurate responses to limited commands. One of my favorite examples is in the Samsung Galaxy smartphones where, in camera mode, users can enable voice commands to take pictures. This works for me from up to 20 feet away in a quiet setting or 5 feet in a noisier location. It's an awesome alternative to carrying around a selfie stick, and whenever I show this feature to people they quickly get it and love it.

    Embedded speaker verification is also being deployed more frequently and is often incorporated into a wake-up trigger to decrease the probability that others can wake up your device. With speech recognition and speaker verification, there's always a trade-off between false accepts (accepting the wrong user) and false rejects (rejecting the right user). The preferred wake-up trigger setting is often to keep false rejects extremely low at the cost of occasionally letting the wrong person in. In systems requiring more sophisticated speaker verification for security, it's possible to deploy more complex algorithms that don't require the lowest power consumption, gaining better accuracy at the cost of increased current consumption.
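
    A toy sketch of that threshold trade-off, using made-up score distributions rather than Sensory's models, shows why driving false rejects toward zero necessarily admits more impostors:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    # Made-up scores; higher = "sounds more like the enrolled user".
    genuine = rng.normal(2.0, 1.0, 10_000)    # the right user speaking
    impostor = rng.normal(-2.0, 1.0, 10_000)  # other people speaking

    def operating_point(threshold):
        false_rejects = np.mean(genuine < threshold)   # right user locked out
        false_accepts = np.mean(impostor >= threshold) # wrong user let in
        return false_rejects, false_accepts

    # Lowering the threshold drives false rejects toward zero, but false
    # accepts climb -- the wake-up-trigger setting described above.
    for thr in (1.0, 0.0, -1.0):
        fr, fa = operating_point(thr)
        print(f"threshold {thr:+.1f}: false rejects {fr:.2%}, false accepts {fa:.2%}")
    ```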

    As consumer products and mobile phones use more sophisticated processors, I expect a higher percentage of speech recognition use will move to embedded devices, and a "layered" speech-recognition approach will emerge: a fast initial analysis is done on-device and responded to if the device has high confidence in its result, but passed to the cloud if it's less sure of its response or if a cloud-based search is required.
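
    A minimal sketch of that layered routing, with stand-in recognizers (the command set, confidence numbers, and threshold below are hypothetical, not any vendor's API):

    ```python
    from dataclasses import dataclass

    @dataclass
    class Result:
        text: str
        confidence: float  # 0.0 .. 1.0
        source: str

    # Hypothetical stand-ins: a small on-device grammar vs. a cloud engine.
    DEVICE_COMMANDS = {"take picture", "volume up", "volume down", "stop"}

    def recognize_on_device(utterance: str) -> Result:
        # A real embedded engine scores audio; here we fake confidence:
        # high for in-grammar commands, low for everything else.
        conf = 0.95 if utterance in DEVICE_COMMANDS else 0.30
        return Result(utterance, conf, "device")

    def recognize_in_cloud(utterance: str) -> Result:
        return Result(utterance, 0.90, "cloud")  # big engine, any phrase

    def layered_recognize(utterance: str, threshold: float = 0.85) -> Result:
        """Respond on-device when confident; otherwise pass to the cloud."""
        local = recognize_on_device(utterance)
        if local.confidence >= threshold:
            return local  # fast, private, no network round-trip
        return recognize_in_cloud(utterance)

    for u in ("take picture", "what's the weather in Lisbon"):
        r = layered_recognize(u)
        print(f"{u!r} -> handled by {r.source} (confidence {r.confidence:.2f})")
    ```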

    Todd Mozer is the CEO of Sensory. He holds over a dozen patents in speech technology and has been involved in previous startups that reached IPO or were acquired by public companies. Todd holds an MBA from Stanford University, and has technical experience in machine learning, semiconductors, speech recognition, computer vision, and embedded software.


    Todd M. will enjoy these next few years, for all his efforts will see a major uptake.....

  2. jfieb



    five minutes with......TM

    http://video.opensystemsmedia.com/2016/01/13/five-president-ceo-founder-sensory/

    the amazing thing about Sensory?

    It does it with neural networks (all the rage these days in the Go-playing world!)
    How much lower in power for audio via this route?

    An order of magnitude or more.......... 10x less.
    That way they can keep it on the device.
    Sensory did it this way from the START; now they will enjoy the spotlight.

    The talk is worth the five minutes.


    Mr. Todd Mozer co-founded Sensory, Inc. and serves as its Chairman of the Board, Chief Executive Officer, and President.


    Education



      • University of California, Berkeley
        Coursework, Engineering and Computer Science
        1984–1985
        Night school to broaden education and technical skills
      • UC Santa Barbara
        BAs in Psychology and Economics
        1978–1983
        Completed double major, played IM soccer, member of various rock and roll bands

        Activities and Societies: Various honor societies/awards including Phi Beta Kappa, Psi Chi, Omicron Delta Epsilon, Dean's Honors, etc.
  3. jfieb



    machine learning can take the place of millions of Kalman filters.

    neural networks are an order of magnitude lower in power for audio.

    So because of this stuff I did follow the Go story. I put it here for those who don't mind digressions.


    http://www.theverge.com/2016/3/12/11210650/alphago-deepmind-go-match-3-result

    AlphaGo beats Lee Se-dol again to take Google DeepMind Challenge series
    DeepMind AI goes 3-0 up to seal historic victory


    I was a little sad to see humans bested in a realm where we were the best.

    Here is the snip to note, however...

    DeepMind's AlphaGo program has gone further than anyone else by using an advanced system based on deep neural networks and machine learning, which has seen it overwhelm Lee over the course of three games. The series is the first time a professional 9-dan Go player has taken on a computer; Lee was competing for a $1 million prize.


    OK Dr. Saxe, can you build an NNLE (neural network learning engine), so there will be one layer of real intelligence that resides on the device?


    Thanks in advance.

  4. Kerry Beaver



  5. jfieb



    Audio as a UI has very broad support from the BIG ecosystems. I spend time thinking about the division between what stays ON the device and what goes up; here is a recent Sensory item on just that....


    Just saw an interesting article on www.eweek.com

    It covers a consumer survey about being connected, particularly with IoT devices. What's interesting is that those surveyed were technically savvy (70% were self-described as intermediate or advanced with computers, and 83% said they could set up their own router), yet the survey found:

    1) 68 percent of consumers expressed concern about security risks such as viruses, malware and hackers;
    2) 65 percent of consumers were concerned over data collected by device manufacturers being inappropriately used or stolen; and
    3) 51 percent of consumers said they are also anxious about privacy breaches.

    These concerns are quite understandable, since we as consumers tend to give away many of our data rights in return for free services and software.

    People have asked me whether embedded speech and other embedded technologies will persist as our cloud connections get better and faster, and privacy issues are one of the reasons why embedded is critical.

    This is especially true for "always on" devices that listen for triggers; if the always-on listening is in the cloud, then everything we discuss around the always-on mics goes into the cloud to be analyzed and potentially collected!

    For the casual reader: we want the Sensory model to be successful, as then the focus has to be placed on the power figures of the audio, and QUIK/Sensory will just shine in comparison to others.
    It fits the Amazon Echo device, which is NOT battery powered yet because of this. We want someone to clone the Echo, but unplug it and give it portability.....

    Audio seems to be the first compute-intensive algo that is going ubiquitous. It is just a step in the right direction; we want more of this category.

    So Sensory has the angle that on-device gives privacy.
    We would like to see extensions with Sensory. Like what?

    Maybe the more complex noise algos Sensory got from Philips.

    We would like to see more IoT ecosystem involvement, like Telit on steroids.

  6. jfieb



    Speaking the language of the voice assistant
    TODD MOZER, SENSORY, INC.

    Now that Google and Apple have announced that they'll be following Amazon into the home far-field voice assistant business, I'm wondering how many things in my home will always be on, listening for voice wakeup phrases. In addition, how will they work together (if at all)? Let's look at some possible alternatives:

    Co-existence. We’re heading down a path where we as consumers will have multiple devices on and listening in our homes and each device will respond to its name when spoken to. This works well with my family; we just talk to each other, and if we need to, we use each other’s names to differentiate. I can have friends and family over or even a big party, and it doesn’t become problematic calling different people by different names.

    The issue for household computer assistants all being on simultaneously is that false fires will grow in direct proportion to the number of devices on and listening. With Amazon's Echo, I get a false fire about every other day, and Alexa does a great job of listening to what I say after the false fire and ignoring it if it doesn't seem to be an intended command. It's actually the best performing system I've used, and the fact that it starts playing music or talking only every other week is a testament to what a good job they have done. However, interrupting my family every other week is not good enough. And if I have five always-listening devices interrupting us 10 times a month, that becomes unacceptable. And if they don't do as good a job as Alexa, and interrupt more frequently, it becomes quite problematic.
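
    The proportionality claim is simple arithmetic; using the post's own figure of roughly one audible false fire every other week per device:

    ```python
    # Interruptions scale linearly with device count. Using the post's own
    # figure of roughly two audible false fires per device per month:
    PER_DEVICE_PER_MONTH = 2.0

    for devices in (1, 3, 5):
        monthly = devices * PER_DEVICE_PER_MONTH
        print(f"{devices} device(s): ~{monthly:.0f} interruptions/month")
    # 5 devices -> ~10 interruptions/month, the "unacceptable" case above.
    ```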

    Functional winners. Maybe each device could own a functional category. For example, all my music systems could use Alexa, my TVs use Hi Galaxy, and all appliances are Bosch. Then I'd have fewer "names" to call out, and there would be some big benefits: 1) The devices using the same trigger phrase could communicate and compare what they heard to improve performance; 2) More relevant data could be collected on the specific usage models, thus further improving performance; and 3) With fewer names to call out, I'd have fewer false fires. Of course, this would force me as a consumer to stick to certain brands in certain categories.

    Winner take all. Amazon is adopting a multi-pronged strategy of developing its own products (Echo, Dot, Tap, etc.) and also letting its products control other products. In addition, Amazon is offering the backend Alexa voice service to independent product developers. It’s unclear whether competitors will follow suit, but one thing is clear—the big guys want to own the home, not share it.

    Amazon has a nice lead as it gets other products to be controlled by Echo. The company even launched an investment fund to spur more startups writing to Alexa. We consumers might choose an assistant we like (and think performs well) and just stick with it across the household. The more we share with that assistant, the better it knows us, and the better it serves us. This knowledge base could carry across products and make our lives easier.

    Just Talk. In the "co-existence" case previously mentioned, there are six people in my household, so it can be a busy place. But when I speak to someone, I don't always start with their name. In fact, I usually don't. If there's just one other person in the room, it's obvious who I'm speaking to. If there are multiple people in the room, I tend to look at or gesture toward the person I'm addressing. This is more natural than speaking their name.

    An "always listening" device should have other sensors to know things like how many people are in the room, where they're standing, what they're looking at, how they're gesturing, and so on. These are the subconscious cues humans use to know who is being addressed, and our devices would be smarter and more capable if they could use them.
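
    As a purely speculative sketch (no such product is described here; every input and rule below is hypothetical), the sensor fusion the post imagines might look like:

    ```python
    from dataclasses import dataclass

    @dataclass
    class RoomState:
        people_in_room: int
        looking_at_device: bool   # from a camera / gaze estimator
        gesturing_at_device: bool # from a gesture sensor
        speech_detected: bool     # from the always-on mic

    def addressed_to_device(state: RoomState) -> bool:
        """Crude fusion of the cues described above: alone in the room,
        speech alone is enough; with others present, require gaze or gesture."""
        if not state.speech_detected:
            return False
        if state.people_in_room <= 1:
            return True
        return state.looking_at_device or state.gesturing_at_device

    print(addressed_to_device(RoomState(1, False, False, True)))  # True: alone
    print(addressed_to_device(RoomState(4, False, False, True)))  # False: ambiguous
    print(addressed_to_device(RoomState(4, True, False, True)))   # True: gaze cue
    ```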


    The success of the Echo will create some urgency for products and also for M&A of the crucial bits and pieces. Who knocks on the door for Sensory Inc?
