I/O 2022 saw the announcement of Look and Talk for the Nest Hub Max, and Google has now detailed how the camera-based replacement for "Hey Google" works.
Google calls Look and Talk (amusingly codenamed "Blue Steel" after Zoolander) its first multimodal, on-device Assistant feature. It determines whether you are speaking to your Nest Hub Max by simultaneously analyzing audio, video, and text.
Every interaction goes through three processing phases, during which Assistant looks for signals such as proximity, face matching, head orientation, gaze direction, lip movement, voice matching, contextual awareness, and intent classification. In all, more than 100 signals from the camera and microphone are analyzed, with processing taking place entirely on-device.
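The three-phase flow can be pictured as a short-circuiting gate: each phase must pass before the next runs, and fulfillment is only reached at the end. The sketch below is purely illustrative; the signal names mirror the article, but the structure and boolean gating are assumptions, not Google's actual implementation.

```python
# Hypothetical sketch of the three-phase gating described above.
# Signal names follow the article; thresholds/logic are illustrative only.
def look_and_talk(signals: dict) -> str:
    # Phase 1: visual engagement (proximity, Face Match, gaze on device).
    if not (signals["within_5ft"] and signals["face_match"]
            and signals["gaze_on_device"]):
        return "ignore"
    # Phase 2: is the utterance actually addressed to Assistant?
    if not (signals["voice_match"] and signals["intent_is_query"]):
        return "ignore"
    # Phase 3: fulfillment, reached only when both earlier gates pass.
    return "fulfill"
```

Structuring the check as a cascade means the cheaper visual gate can reject passing glances before any speech analysis runs at all.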
Existing intent-understanding models compound the challenge because they require complete rather than partial queries, and a sufficiently strong engagement signal often only appears well after the user has started speaking. To close this gap, Look and Talk forgoes sending audio to the server entirely, relying instead on on-device transcription and intent recognition.
The Nest Hub Max first determines whether a user intends to interact with Assistant. Face Match must identify them, and they must be within five feet of the device, with Google taking care to ignore passing glances.
A dedicated eye-gaze model then detects whether each enrolled user within range is looking at the device. This model uses a multi-tower convolutional neural network architecture, with one tower processing the entire face and another processing patches around the eyes, to estimate both a gaze angle and a binary gaze-on-camera confidence from image frames. Because a user naturally looks at the screen area just below the camera, the gaze angle and binary prediction are mapped to the device's screen region. A smoothing function is applied to the per-frame predictions so that the final decision is robust to spurious individual predictions, involuntary eye blinks, and saccades.
In phase two, the Hub Max begins to listen, checks Voice Match, and determines whether the user’s speech was meant to be an Assistant request.
This stage has two components: a text-analysis model that assesses whether the transcript reads like an Assistant request, and a model that examines non-lexical information in the audio to determine whether the utterance sounds like an Assistant query. Together, these filter out speech that is not directed at Assistant. Contextual visual cues are additionally used to assess the likelihood that the interaction was intended for Assistant.
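One common way to combine such per-utterance signals is a weighted late fusion of the individual model scores against a single threshold. The function below is an assumed sketch: the weights, threshold, and even the choice of linear fusion are hypothetical and stand in for whatever learned combination Google actually uses.

```python
def is_assistant_query(lexical_score: float,
                       acoustic_score: float,
                       visual_score: float,
                       weights=(0.5, 0.3, 0.2),
                       threshold: float = 0.65) -> bool:
    """Hypothetical late fusion of the three signals described above:
    transcript analysis, non-lexical audio cues, and visual context.
    Weights and threshold are illustrative, not Google's values."""
    scores = (lexical_score, acoustic_score, visual_score)
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused >= threshold
```

In a learned system the fusion would itself be trained, but even this linear form shows the key property: a query-like transcript alone is not enough if the audio and visual context disagree.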
Once the first two phases are complete, the third phase, fulfillment, involves communicating with the Assistant server to get a response to the user's intent and query. The feature underwent a variety of tests, including:
To evaluate the feature across demographic subgroups, Google built a diverse dataset with more than 3,000 participants. Modeling improvements driven by the diversity of the training data enhanced performance for all subgroups.
Look and Talk is a key milestone in Google's mission to make interacting with Assistant as natural as possible. Also coming to the Nest Hub Max are Quick Phrases, which skip the hotword for predetermined actions (such as setting an alarm or turning the lights on/off).
FTC: We use income earning auto affiliate links.