OpenAI just demoed its new flagship foundational model, GPT-4o, with incredible speech recognition and translation skills.
CEO Sam Altman had already stated that OpenAI’s latest “Spring update” was unrelated to GPT-5 or AI search.
But at 10 a.m. PT today, hundreds of thousands joined the live-streamed presentation of the new model as Chief Technology Officer (CTO) Mira Murati demonstrated its benefits over its predecessor, GPT-4.
Key announcements from the demo session include:
- GPT-4o (the o stands for omni) is intended to supersede GPT-4, with OpenAI calling it its new flagship foundational model.
- While broadly similar to GPT-4, GPT-4o offers superior multilingual and audiovisual processing. It can process and translate audio in near real-time.
- OpenAI is making GPT-4o freely available, with limits. Paying ChatGPT Plus users still get priority and a higher message cap.
- OpenAI is also releasing a desktop version of ChatGPT, initially for Mac only, which is rolling out immediately.
- Custom GPTs will become accessible to free users, too.
- GPT-4o and its voice features will roll out slowly over the coming weeks and months.
GPT-4o’s real-time audio translation
The headline that’s got everyone talking is GPT-4o’s impressive audio processing and translation, which operate in near real-time.
Demonstrations showed the AI engaged in remarkably natural voice conversations, offering immediate translations, telling stories, and providing coding advice.
For example, the model can analyze an image of a foreign language menu, translate it, and provide cultural insights and recommendations.
OpenAI has just demonstrated its new GPT-4o model doing real-time translations 🤯 pic.twitter.com/Cl0gp9v3kN
— Tom Warren (@tomwarren) May 13, 2024
It can also recognize emotion through breathing, facial expressions, and other audio and visual cues.
Clip of real time conversation with GPT4-o running on ChatGPT app
NEW: Instead of just turning SPEECH to text, GPT-4o can also understand and label other features of audio, like BREATHING and EMOTION. Not sure how this is expressed in the model response.#openai https://t.co/CpvCkjI0iA pic.twitter.com/24C8rhMFAw
— Andrew Gao (@itsandrewgao) May 13, 2024
GPT-4o’s emotional recognition skills will probably attract controversy once the dust settles.
Emotionally cognizant AI could enable nefarious use cases that rely on human mimicry, such as deep fakes and social engineering.
Another impressive skill demoed by the team is real-time coding assistance provided via voice.
With the GPT-4o/ChatGPT desktop app, you can have a coding buddy (black circle) that talks to you and sees what you see!#openai announcements thread! https://t.co/CpvCkjI0iA pic.twitter.com/Tfh81mBHCv
— Andrew Gao (@itsandrewgao) May 13, 2024
One demo even saw two instances of the model singing to each other.
This demo of two GPT-4o’s singing to each other is one of the craziest things I’ve ever seen. pic.twitter.com/UXFfbIpuF6
— Matt Shumer (@mattshumer_) May 13, 2024
The general gist of OpenAI’s demos is that the company aims to make AI multimodality genuinely useful in everyday scenarios, challenging tools like Google Translate in the process.
Another key point is that these demos are true to life. OpenAI pointed out, “All videos on this page are at 1x real time,” possibly alluding to Google, which heavily edited its Gemini demo video to exaggerate its multimodal skills.
With GPT-4o, multimodal AI applications might move from a novelty buried deep inside AI interfaces to something average users can interact with daily.
Aside from real-time voice processing and translation, which is soaking up the limelight, the fact that OpenAI is making this new model free to use, albeit with usage limits, is massive.
While GPT-4o is *just* a slightly better GPT-4, it will equip anyone with a top-quality AI model, leveling the playing field for millions worldwide.
Everything we know about GPT-4o
Here’s a rundown of everything we know about GPT-4o thus far:
- Multimodal integration: GPT-4o rapidly processes and generates text, audio, and image data, enabling dynamic interactions across different formats.
- Real-time responses: The model boasts impressive response times, comparable to human reaction speeds in conversation, with audio responses starting in as little as 232 milliseconds.
- Language and coding capabilities: GPT-4o matches the performance of GPT-4 Turbo in English and coding tasks and surpasses it in non-English text processing.
- Audio-visual improvements: Compared to previous models, GPT-4o shows a superior understanding of vision and audio tasks, enhancing its ability to interact with multimedia content.
- Natural interactions: Demonstrations included two GPT-4os engaging in a song, helping with interview preparation, playing games like rock paper scissors, and even creating humor with dad jokes.
- Reduced costs for developers: OpenAI has cut the price developers pay to use GPT-4o by 50% and doubled its processing speed, both relative to GPT-4 Turbo.
- Benchmark performance: GPT-4o excels on multilingual, audio, and vision benchmarks.
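For developers, the cost and speed figures above translate into simple arithmetic. The sketch below is only an illustration: the `Pricing` class, the baseline per-million-token prices, and the baseline throughput are placeholder assumptions chosen for the example, not figures from the announcement. Only the 50% price cut and the doubled speed come from the rundown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in US dollars."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the total cost in dollars for one workload."""
        return (input_tokens * self.input_per_million
                + output_tokens * self.output_per_million) / 1_000_000


def halved(pricing: Pricing) -> Pricing:
    """Apply the 50% price reduction to both input and output rates."""
    return Pricing(pricing.input_per_million * 0.5,
                   pricing.output_per_million * 0.5)


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Estimate how long it takes to generate a given number of tokens."""
    return output_tokens / tokens_per_second


# Placeholder baseline values (assumptions for illustration only).
BASELINE = Pricing(input_per_million=10.0, output_per_million=30.0)
BASELINE_TOKENS_PER_SECOND = 50.0

DISCOUNTED = halved(BASELINE)
DOUBLED_TOKENS_PER_SECOND = BASELINE_TOKENS_PER_SECOND * 2

if __name__ == "__main__":
    workload = (2_000_000, 500_000)  # (input tokens, output tokens)
    print("baseline cost:  ", BASELINE.cost(*workload))
    print("discounted cost:", DISCOUNTED.cost(*workload))
    print("baseline time:  ", generation_seconds(1_000, BASELINE_TOKENS_PER_SECOND))
    print("doubled time:   ", generation_seconds(1_000, DOUBLED_TOKENS_PER_SECOND))
```

Under these assumed numbers, the same workload costs half as much and a 1,000-token reply finishes in half the time; real savings depend on actual prices and traffic.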
GPT-4o is a meaningful announcement for OpenAI, particularly as it’ll be the most powerful free model by a sizeable margin.
It might signal an era of practical, useful AI multimodality that people begin to engage with en masse.
That would be a massive milestone both for the company and the generative AI industry as a whole.