Building a real-time AI assistant (with voice and vision)
I rebuilt my voice assistant using the same platform that powers ChatGPT's voice interface.
This story started a few weeks ago, when I built an assistant in Python that used my webcam to see the world. It was a ton of fun.
Then, something happened:
Have you seen ChatGPT's voice interface? The team behind the platform powering it saw my demo and told me, "Cool, bro, but here is how you actually build something like this."
Welcome to the big leagues of real-time voice and video applications!
I rebuilt my agent using LiveKit. They collaborated with me on this post and helped me build this example.
This new version is pretty awesome!
The assistant has access to my microphone and webcam, so it hears and sees everything around me. I can also interrupt it mid-sentence whenever I want.
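If you're curious what the skeleton of an agent like this looks like, here is a minimal sketch using LiveKit's Python agents framework. Treat it as a sketch, not the final app: the module paths and the plugin choices (Silero for voice activity detection, Deepgram for speech-to-text, OpenAI for the LLM and speech) follow the version of livekit-agents I used and may differ in yours.

```python
import asyncio

from livekit.agents import JobContext, WorkerOptions, cli, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: JobContext):
    # Join the LiveKit room carrying my microphone and webcam tracks.
    await ctx.connect()

    # The VoiceAssistant wires together voice activity detection,
    # speech-to-text, the LLM, and text-to-speech. Interruptions work
    # because the VAD keeps listening even while the agent is speaking.
    assistant = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o"),
        tts=openai.TTS(),
        chat_ctx=llm.ChatContext(),
    )
    assistant.start(ctx.room)

    await asyncio.sleep(1)
    await assistant.say("Hey, how can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```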
The main improvements in this new version are speed and reliability.
We are doing something clever to avoid sending a webcam image with every request. Images are expensive in tokens, so skipping them makes the agent respond faster and cost much less to run.
Here is the trick:
• By default, we only send text to the LLM.
• When the agent needs vision, it asks for it through a function call.
• We then send the text plus a single frame from the webcam (sketched below).
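Here is roughly what that flow looks like in code. This is a sketch built on LiveKit's function-calling API; `answer_with_vision` and `latest_frame` are glue names I made up for this example, and in the real app the frame comes from an `rtc.VideoStream` reading the webcam track.

```python
from typing import Annotated

from livekit.agents import llm
from livekit.agents.llm import ChatImage, ChatMessage


class AssistantFunction(llm.FunctionContext):
    # The LLM never receives an image by default. Instead, it can call
    # this function whenever the conversation actually requires vision.
    @llm.ai_callable(
        description=(
            "Called when the user asks about something that requires vision, "
            "like what's in front of the webcam."
        )
    )
    async def image(
        self,
        user_msg: Annotated[
            str, llm.TypeInfo(description="The user message that triggered vision")
        ],
    ):
        return user_msg


async def answer_with_vision(
    gpt: llm.LLM, chat_ctx: llm.ChatContext, user_msg: str, latest_frame
):
    # Only now do we pay for an image: the text plus a single webcam
    # frame go into the chat context for one multimodal request.
    chat_ctx.messages.append(
        ChatMessage(role="user", content=[user_msg, ChatImage(image=latest_frame)])
    )
    return gpt.chat(chat_ctx=chat_ctx)
```

In the full agent, you pass `AssistantFunction()` to the `VoiceAssistant` as its `fnc_ctx` and listen for its `function_calls_finished` event; that is the moment to grab the latest frame and stream the vision answer back through `assistant.say(...)`.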
You gotta watch the video to see it in action. This is as good as it gets!
And here is a YouTube video where I explain step-by-step how everything works: