If you were eagerly awaiting OpenAI's latest Spring Update for ChatGPT in the hope that the company would finally release GPT-5, you may be disappointed on that front. But what OpenAI has released instead more than makes up for it.
The company recently unveiled its newest flagship model – GPT-4o – and it's a masterpiece of human innovation. The 'o' in GPT-4o stands for "omni", an apt nod to the model's new all-in-one multimodal capabilities. While it offers only modest gains in intelligence and reasoning over GPT-4, the new model brings drastic improvements in speed and multimodality.
What does that mean in practice? GPT-4o has improved capabilities across text, voice, and vision. It can understand and discuss images better. But the most exciting part of the update is its ability to converse with you in real time over audio and video, ushering us into the future of human-machine interaction. Most of us imagined this kind of sci-fi interaction with an AI arriving far down the road. But it's here, and it's thrilling.
Mira Murati, CTO of OpenAI, along with two research leads, showcased the new capabilities of GPT-4o.
The voice model has incredible personality and tonality, capable of making you forget (for a while) that you are interacting with an AI. It's scarily exciting. The responses are much more natural and it even laughs and pretends to blush like a human.
The demo also highlighted the range of emotions ChatGPT can display when explicitly asked: while narrating a story, ChatGPT imbued its voice with more emotion and drama, switched to a robotic voice, and even sang as if it were in a musical, and it did it all seamlessly.
Many users say the voice reminds them of Scarlett Johansson's AI from the movie "Her", but notably, it's the same voice ChatGPT has had in the past. All the difference comes from the changes in tonality and some well-placed laughs.
When you pair that with its ability to see and respond to what's on your screen or camera, it's downright mind-blowing. With its new vision capabilities, ChatGPT could not only work through things like linear equations, it also did a pretty bang-on job of interpreting its surroundings and reading the emotions on a person's face shown to it through the camera. You can now play rock-paper-scissors and ask ChatGPT to referee, or take interview prep with ChatGPT one step further by asking it to critique your outfit, and it won't gloss over any bad choices you make.
Overall, the effect is remarkable and almost makes you believe you're interacting with a real person over a video call (if the other person kept their camera off at all times, that is).
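For developers, the same vision capability is exposed through OpenAI's Chat Completions API. Here is a minimal sketch of sending an image to GPT-4o for analysis; the prompt and image URL are placeholders, and the ChatGPT app's live camera mode builds on this same underlying ability rather than this exact call.

```python
# Minimal sketch: asking GPT-4o to describe an image via the Chat Completions API.
# The image URL and prompt are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image, and what mood does the person's face convey?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```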
The new Voice Mode is also a clear improvement over the one currently available. The dialogue flows more like a natural conversation: you can interrupt it mid-sentence, and it can distinguish multiple voices, handle background noise, and pick up on the tone of your voice.
On a technical level, this is because GPT-4o natively handles what previously required a pipeline of three separate models: transcription, intelligence (the language model itself), and text-to-speech. Collapsing that pipeline into a single model removes much of the latency of the old approach and gives the user a far more immersive, collaborative experience.
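To make that contrast concrete, here is a rough sketch of what the old three-hop voice pipeline looks like when stitched together from OpenAI's public API. This is an illustration, not OpenAI's actual implementation; the file names, prompt, and voice choice are placeholders, and GPT-4o's real-time audio path is not what this code shows: it replaces all three hops with a single audio-native model, which is where the speed gains come from.

```python
# Rough sketch of the legacy three-model voice pipeline:
# speech -> text (transcription), text -> text (intelligence), text -> speech (TTS).
# Each hop adds latency and drops information like tone and laughter.
from openai import OpenAI

client = OpenAI()

# 1. Transcription: convert the user's recorded question into text (Whisper).
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Intelligence: the language model reasons over the transcribed text.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Text-to-speech: turn the written reply back into audio.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.stream_to_file("reply.mp3")
```

Because the text-only middle step never hears the original audio, everything about how something was said is lost; an omni model that takes audio in and produces audio out can respond to that nuance directly.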
While access to GPT-4o is already rolling out to free as well as Plus users in the web app, the new Voice Mode powered by GPT-4o will launch in alpha for ChatGPT Plus users in the coming weeks. A new macOS ChatGPT app is also being released, with access rolling out iteratively, starting with ChatGPT Plus users.
While the demo was quite impressive, we'll have to wait to see if the real-world application will be as smooth when the model is finally released.