GPT-4V(ision): ChatGPT can now see, hear, and speak
OpenAI is introducing voice and image capabilities in ChatGPT to provide users with a more intuitive interface and expanded functionality.
Users will be able to engage in voice conversations with ChatGPT and choose from different voices, while image capabilities allow them to show ChatGPT pictures and receive assistance with troubleshooting, meal planning, or data analysis.
These features are being rolled out gradually to ensure safety and refine risk mitigations.
The voice capability is powered by a new text-to-speech model and collaboration with voice actors, while image understanding is facilitated by multimodal models.
OpenAI highlights the potential risks of voice technology, such as impersonation, and the challenges of vision-based models, including misinterpretation.
Plus and Enterprise users will be the first to access these new capabilities, with plans for further expansion in the future.
GPT-4 with vision (GPT-4V)
GPT-4 with vision (GPT-4V) is a new capability that enables users to supply image inputs for analysis alongside the traditional language capabilities of GPT-4.
OpenAI's integration of multimodal inputs, such as images, into large language models is considered a significant frontier in artificial intelligence research and development.
The introduction of image inputs expands the impact of language-only systems, providing novel interfaces and capabilities to solve new tasks and offer unique user experiences.
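For developers, here is a minimal sketch of what sending an image alongside a text prompt might look like with the openai Python SDK. The model name (`gpt-4-vision-preview`) and the image URL are assumptions for illustration; check the current API documentation for the exact request shape.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model name; may have changed
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```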
Multimodal models like GPT-4V
Multimodal models like GPT-4V represent a significant advancement in AI, offering new interfaces and capabilities. The training process for GPT-4V mirrored that of GPT-4: pre-training on a large dataset, followed by fine-tuning with reinforcement learning from human feedback (RLHF; a conceptual sketch follows below).
However, GPT-4V introduces unique challenges compared to text-only models, such as an expanded risk surface.
OpenAI conducted a comprehensive safety analysis, including red teaming and other evaluations, to prepare GPT-4V for deployment.
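Since RLHF is mentioned only in passing, a toy sketch may help make the idea concrete. The code below is a heavily simplified, purely illustrative REINFORCE-style loop with a frozen toy reward model standing in for one trained on human preference comparisons; it is not OpenAI's actual training pipeline, and every model here is a placeholder.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB = 16  # toy vocabulary of token ids

# Toy "policy": maps a one-token prompt to a distribution over
# one-token responses. A real policy would be a full language model.
policy = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))

# Frozen toy "reward model": stands in for a network fine-tuned on
# human preference comparisons. Here it is random and fixed.
reward_model = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, 1))
for p in reward_model.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(200):
    prompts = torch.randint(0, VOCAB, (8,))        # batch of toy prompts
    logits = policy(prompts)                       # (8, VOCAB)
    dist = torch.distributions.Categorical(logits=logits)
    responses = dist.sample()                      # sampled toy "responses"
    rewards = reward_model(responses).squeeze(-1)  # scalar score per sample
    # REINFORCE-style update: increase the log-probability of
    # responses the reward model scores highly.
    loss = -(dist.log_prob(responses) * rewards).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```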
Be My Eyes
Starting in March 2023, OpenAI and Be My Eyes joined forces to create Be My AI, a revolutionary tool designed to assist the visually impaired by describing their surroundings.
Utilizing GPT-4V, this feature was integrated into the existing Be My Eyes app, which allows blind or low-vision users to get descriptions of images captured on their smartphones. After initial testing with nearly 200 beta testers, the program expanded to 16,000 users by September, averaging 25,000 daily description requests.
The service aims to meet a wide range of needs, from informational and cultural to employment.
The pilot phase served to evaluate the safe and responsible deployment of GPT-4V. Users reported issues such as hallucinations and errors, although there was a noticeable improvement over time, particularly in optical character recognition. Despite its advancements, users are cautioned not to rely on Be My AI for critical tasks like reading prescriptions or navigating streets, as the technology is not a substitute for trained guide dogs or white canes. An option to switch to human assistance is also available within the app for verification or in cases where the AI falls short.
One significant challenge raised by users is the desire to use Be My AI for facial recognition, a complex issue due to privacy laws and potential biases. Despite these concerns, the impact of even limited visual description capabilities has been profound for the community, allowing them to understand pictures, logos, and even descriptions of family members in a new way.
In response, Be My Eyes is working on developing features that can describe faces without identifying individuals by name, aiming for a more equitable user experience while considering privacy and bias issues.
Navigating the Future: GPT-4V's Capabilities, Challenges, and Next Steps
The development and deployment of GPT-4V, a multimodal language model capable of processing both text and images, have opened new doors in the field of artificial intelligence.
While the possibilities are exhilarating, they also come with their own set of unique challenges.
The Preparatory Phase
Before deploying GPT-4V, OpenAI focused on assessing and mitigating various risks associated with the model. These risks include person identification, biased outputs, and the model's proficiency in high-risk domains such as medicine and science.
The goal is to ensure the responsible and ethical use of AI technology.
Key Questions Moving Forward
As OpenAI plans for the future, it will focus on addressing several fundamental questions:
Model behaviour: what behaviours should be permitted or restricted for these models? For instance, should they identify public figures like Alan Turing based on images? Should they infer attributes like race, gender, or emotion from photos?
Global relevance: as the technology gains traction globally, enhancing its performance for multiple languages and diverse cultural contexts becomes imperative.
These questions intersect with broader issues of privacy, fairness, and the role that AI can or should play in society.
Enhancing Global Usability
Given the global adoption of such models, OpenAI aims to improve performance across various languages and cultures.
This is crucial for making the technology relevant to a global audience.
By doing so, OpenAI can help ensure that the advantages of AI are accessible to people from different parts of the world.
Precision and Sensitivity
OpenAI is also investing in research to handle image uploads with higher precision.
The current system has broad but imperfect guidelines for refusing certain kinds of sensitive information.
The goal is to advance the model's capabilities in handling sensitive aspects of images, such as a person’s identity or protected characteristics.
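As a purely hypothetical illustration of what such a refusal layer could look like at the application level, the sketch below pre-screens an image-plus-question request and declines identity-related asks before the image ever reaches a model. Both `detect_faces` and the keyword list are invented stand-ins, not real APIs, and a production system would use far more robust checks.

```python
from typing import Optional

REFUSAL = ("Sorry, I can't help identify people or infer protected "
           "characteristics from images.")

# Invented stand-in for any face-detection step (a vision library or
# classifier); hardcoded to True so the demo below triggers a refusal.
def detect_faces(image_bytes: bytes) -> bool:
    return True

# Naive keyword list, purely illustrative; real guidelines are far broader.
IDENTITY_TERMS = ("who is", "identify", "name of", "race", "gender", "age")

def screen_request(image_bytes: bytes, question: str) -> Optional[str]:
    """Return a refusal message, or None if the request may proceed."""
    asks_identity = any(t in question.lower() for t in IDENTITY_TERMS)
    if asks_identity and detect_faces(image_bytes):
        return REFUSAL
    return None

# A question about a person's identity is declined up front.
print(screen_request(b"<image bytes>", "Who is the person in this photo?"))
```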
The journey ahead for GPT-4V is both thrilling and challenging. By investing in research and public engagement, OpenAI aims to address the ethical and technical complexities associated with this cutting-edge technology. From global usability to ethical considerations and technological refinements, the focus now is on creating an AI model that is not just advanced, but also responsible and equitable.
Stay tuned for more updates as I continue to refine and expand this overview of GPT-4V!