Forward-looking: OpenAI just introduced GPT-4o (GPT-4 Omni, or "O" for short). The model is no "smarter" than GPT-4, but several remarkable innovations set it apart: the ability to process text, visual, and audio data simultaneously; almost no latency between asking and answering; and an unbelievably human-sounding voice.
While today's chatbots are some of the most advanced ever created, they all suffer from high latency. Depending on the query, response times can range from a second to several seconds. Some companies, like Apple, want to resolve this with on-device AI processing. OpenAI took a different approach with Omni.
Most of Omni's replies were quick during the Monday demonstration, making the conversation more fluid than a typical chatbot session. It also accepted interruptions gracefully. If the presenter started talking over GPT-4o's reply, it would pause rather than finish its response.
OpenAI credits O's low latency to the model's ability to process all three forms of input (text, visual, and audio) natively. Previously, ChatGPT handled mixed input through a pipeline of separate models. Omni processes everything itself, correlating it into a cohesive response without waiting on another model's output. It still has the GPT-4 "brain" but accepts additional input modes directly, an approach OpenAI CTO Mira Murati says should become the norm.
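To illustrate why that matters for latency, here is a minimal, purely conceptual sketch in Python. The function names and stub implementations are hypothetical placeholders, not OpenAI's actual architecture or API; the point is simply that a pipeline waits on each model in turn, while a unified model answers in a single pass.

```python
# Conceptual sketch only: the stubs below stand in for real models.
# Names and behavior are hypothetical, not OpenAI's actual design.

def speech_to_text(audio: bytes) -> str:
    return "transcribed text"            # stage 1: transcription model

def language_model(text: str) -> str:
    return f"reply to: {text}"           # stage 2: text-only reasoning model

def text_to_speech(text: str) -> bytes:
    return text.encode()                 # stage 3: speech synthesis model

def pipelined_reply(audio: bytes) -> bytes:
    # Old approach: three hand-offs, each adding latency, and the
    # reasoning model never "hears" tone, pauses, or background sound.
    return text_to_speech(language_model(speech_to_text(audio)))

def multimodal_model(audio: bytes = b"", image: bytes = b"", text: str = "") -> bytes:
    return b"unified reply"              # one model consumes every modality

def omni_reply(audio: bytes, image: bytes = b"", text: str = "") -> bytes:
    # GPT-4o-style approach: one pass over all inputs, with no
    # waiting on another model's output between stages.
    return multimodal_model(audio=audio, image=image, text=text)
```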
"GPT-4o provides GPT-4 level intelligence but is much faster," said Murati. "We think GPT-4o is really shifting that paradigm into the future of collaboration, where this interaction becomes much more natural and far easier."
Omni's voice (or voices) stood out the most in the demo. When the presenter spoke to the bot, it responded with casual language interspersed with natural-sounding pauses. It even chuckled, giving it a human quality that made me wonder whether the voice was genuinely computer-generated or faked.
Real and armchair experts will undoubtedly scrutinize the footage to validate or debunk it. We saw the same thing happen when Google unveiled Duplex. Google's digital helper was eventually validated, so we can expect the same from Omni, even though its voice puts Duplex to shame.
However, we might not need the extra scrutiny. OpenAI had GPT-4o talk to itself on two phones. Having two versions of the bot converse with each other broke the human-like illusion somewhat. While the male and female voices still sounded human, the conversation felt less organic and more mechanical, which makes sense given that the exchange contained no human voice at all.
At the end of the demo, the presenter asked the bots to sing. It was another awkward moment as he struggled to coordinate the two bots in a duet, again breaking the illusion. Omni's ultra-enthusiastic tone could use some tuning as well.
OpenAI also announced today that it's releasing a ChatGPT desktop app for macOS, with a Windows version coming later this year. Paid ChatGPT users can already access the app, and a free version will follow at an unspecified date. The web version of ChatGPT is already running GPT-4o, and the model is also expected to become available, with some limitations, to free users.