I’m struggling with social situations. What if you could just send an AI agent to meet your future in-laws? Here's what it would look like.
🏠 The Dinner
It’s the first time you're meeting Sam, your partner's father. Sam is the CEO of TokenAI, one of the larger AI companies in your area. Since you spent your college years studying machine learning instead of meeting potential in-laws, you built an AI voice agent to help you tonight.
👂 Listening
At the dinner table, Sam is rambling about the new multimodal model they're about to launch. You've done your research and know that Sam is not a patient man, so you designed your AI system to be blazingly fast. You connected your smartphone to a streaming transcription model that continuously listens and sends you Sam's transcribed speech. A few minutes in, Sam seems to be taking a break while forking a bite of his plant-based protein bowl, which he got from TokenAI's employee-benefit health food station. Intimidated by his authority and unsure when to speak up, you're running a voice activity detection model on your phone, which at this very moment signals that it's time!
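For the curious, here is a rough sketch of the "is Sam done talking?" logic. It is not a real voice activity detection model: it assumes a plain energy threshold as a stand-in, and the names `frame_energy`, `detect_end_of_turn`, `energy_threshold`, and `min_silence_frames` are made up for illustration. The idea is simply to wait until speech has been heard and is then followed by a short run of quiet frames.

```python
from typing import Iterable, Optional, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return sum(sample * sample for sample in frame) / len(frame)


def detect_end_of_turn(
    frames: Iterable[Sequence[float]],
    energy_threshold: float = 0.01,  # assumed value, tune per microphone
    min_silence_frames: int = 3,     # assumed value, trades latency for patience
) -> Optional[int]:
    """Return the index of the frame at which the speaker seems to have paused.

    A pause is signaled once speech has been heard and is followed by
    `min_silence_frames` consecutive frames below `energy_threshold`.
    Returns None if no such pause occurs in the given frames.
    """
    heard_speech = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if frame_energy(frame) >= energy_threshold:
            heard_speech = True
            silent_run = 0  # speech resumed, so restart the silence count
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silence_frames:
                return index
    return None


# Usage: two loud frames (Sam talking), then three quiet ones (Sam chewing).
loud = [0.5] * 160
quiet = [0.0] * 160
pause_at = detect_end_of_turn([quiet, loud, loud, quiet, quiet, quiet, loud])
```

A larger `min_silence_frames` makes the agent more polite but slower to respond, which is exactly the trade-off you face with an impatient CEO.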
🧠 Cognition
While Sam's grinding his spoon in the protein bowl (wait... did you just spot a cricket leg in there?), you send the transcribed text to the remote LLM system you prepared before dinner. Your LLM is based on a custom parents-in-law retrieval model fine-tuned on all three seasons of MTV’s "Date My Mom" show, allowing the LLM to deliver state-of-the-art reasoning capabilities in this domain. You were smart enough to host your model with a fast cloud-based model inference provider, so you see the first response tokens streaming to your phone in the blink of an eye.
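What "first tokens in the blink of an eye" means in practice is time to first token. The sketch below shows one way to measure it while consuming a token stream. The `fake_llm_stream` generator is a hypothetical stand-in for whatever streaming endpoint your inference provider offers, and `consume_stream` is likewise an illustrative name, not a real API.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def fake_llm_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM endpoint.

    Ignores the prompt and yields a canned reply token by token.
    """
    for token in ["That", " sounds", " like", " an", " impressive", " launch", "."]:
        yield token


def consume_stream(tokens: Iterable[str]) -> Tuple[str, Optional[float]]:
    """Collect streamed tokens and record the time to the first one.

    Returns the full text and the first-token latency in seconds
    (None if the stream produced no tokens).
    """
    start = time.perf_counter()
    first_token_latency: Optional[float] = None
    parts = []
    for token in tokens:
        if first_token_latency is None:
            first_token_latency = time.perf_counter() - start
        parts.append(token)
    return "".join(parts), first_token_latency


# Usage: feed Sam's transcribed monologue in, get the reply and the latency out.
reply, latency = consume_stream(fake_llm_stream("Sam: ...our new multimodal model..."))
```

In a live conversation the first-token latency matters more than the total generation time, because everything downstream can start working as soon as the first words arrive.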
🗣️ Speaking
Sam seems distracted and furiously fidgets with an acai berry in his bowl. “Why is my dearest child dating this idiot?” he thinks. You have to be fast now; don't ruin your first impression! While your LLM is streaming the text, you instantly forward each piece to a cloud-based voice generator, which sends back audio chunks you can play to Sam. In preparation for this important dinner, you skipped today's yoga class and instead cloned your voice. Playing the synthetic voice, you realize you generated an intriguing, realistic-sounding response. While playing it from your phone's speakers, you struggle to keep your lips in sync, but Sam, checking the TokenAI Slack channel while listening, doesn’t notice. A minute into your speech, he starts coughing wildly before cutting you off (which your voice activity detection model noticed even before you did). Sam is incredibly impressed by your brilliant response and offers you a leading position at TokenAI. You graciously acknowledge the offer but politely decline, explaining that you're fully committed to working on your latest startup idea, "Airbnb for Goldfish Enthusiasts".
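If you want to see the speaking step without the in-laws, here is a minimal sketch of the two ideas in it: grouping streamed tokens into sentence-sized chunks before sending them to a voice generator, and stopping playback as soon as the listener cuts in. Everything here is an assumption for illustration: `chunk_for_tts`, `play_until_interrupted`, and the `synthesize`, `play`, and `is_interrupted` callables are hypothetical placeholders, not any real text-to-speech or audio API.

```python
from typing import Callable, Iterable, Iterator, List

SENTENCE_ENDINGS = (".", "!", "?")


def chunk_for_tts(tokens: Iterable[str], max_chars: int = 200) -> Iterator[str]:
    """Group streamed tokens into chunks that end at a sentence boundary.

    A chunk is also emitted once it reaches `max_chars`, so a long
    sentence does not delay the first audio indefinitely.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.strip()
        if stripped and (stripped.endswith(SENTENCE_ENDINGS) or len(buffer) >= max_chars):
            yield stripped
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


def play_until_interrupted(
    chunks: Iterable[str],
    synthesize: Callable[[str], bytes],  # hypothetical: text in, audio bytes out
    play: Callable[[bytes], None],       # hypothetical: plays audio on the speaker
    is_interrupted: Callable[[], bool],  # hypothetical: True once the listener cuts in
) -> List[str]:
    """Speak chunks in order and stop as soon as an interruption is detected.

    Returns the chunks that were actually spoken.
    """
    spoken: List[str] = []
    for chunk in chunks:
        if is_interrupted():
            break
        play(synthesize(chunk))
        spoken.append(chunk)
    return spoken


# Usage with toy stand-ins: Sam starts coughing after the first sentence.
tokens = ["Hello", " Sam", ".", " Nice", " bowl", "!", " Is", " that", " a", " cricket", "?"]
played: List[bytes] = []
spoken = play_until_interrupted(
    chunk_for_tts(tokens),
    synthesize=lambda text: text.encode("utf-8"),
    play=played.append,
    is_interrupted=lambda: len(played) >= 1,
)
```

Chunking at sentence boundaries is a design choice: smaller chunks start playing sooner, while whole sentences tend to give a voice generator enough context to sound natural.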
⏭️ Jokes Aside
Don't miss part 2 of this post, where I will detail how we built our voice AI at Sonia (YC W24).