Conversational AI with GPT-4 Vision, OpenAI Whisper, and TTS

Overview

This project integrates GPT-4 Vision, OpenAI Whisper, and OpenAI Text-to-Speech (TTS) to create an interactive AI system for conversations. It combines visual and audio inputs for a seamless user experience.

Demo Video:

https://twitter.com/ayushspai/status/1726222586380557647

Components

GPT-4 Vision: Analyzes visual input and generates contextual responses.
OpenAI Whisper: Converts spoken language into text.
OpenAI TTS: Transforms text responses into spoken language.

Main Files

main.py: Manages audio processing, image encoding, AI interactions, and text-to-speech output.
capture.py: Captures and processes video frames for visual analysis.

Installation

Prerequisites

Python 3.x
An OpenAI API key (set as an environment variable OPENAI_API_KEY)

Libraries

Install the necessary libraries with the requirements.txt file.

pip install -r requirements.txt

Usage

Running the Scripts

Start capture.py: Captures video frames and saves them for AI analysis.
- Reads a video file, displays the video, and saves the current frame as frame.jpg.
- Execute with python capture.py.
Run main.py concurrently: Orchestrates the conversational AI.
- Continuously listens for user audio input.
- Transcribes speech to text, captures the current video frame, and sends both to GPT-4 for analysis.
- Converts the AI's response to speech and plays it back.
- Execute with python main.py.

Workflow

main.py listens for audio input and transcribes it using OpenAI Whisper.
Meanwhile, capture.py captures a video frame.
Both the audio transcription and the encoded image are sent to GPT-4 Vision.
GPT-4 Vision responds, considering the visual and textual context.
The response is vocalized using OpenAI TTS and played to the user.

Notes

Ensure both main.py and capture.py are active for the system to function.
The video file in capture.py can be customized.
Adequate hardware is recommended for smooth audio and video processing.

Conclusion

This project demonstrates a novel approach to combining various AI technologies, creating a dynamic and interactive conversational AI experience. It harnesses the capabilities of GPT-4 Vision, Whisper, and TTS for a comprehensive audio-visual interaction.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.idea		.idea
frames		frames
LICENSE		LICENSE
README.md		README.md
capture.py		capture.py
main.py		main.py
requirements.txt		requirements.txt
singlestore.py		singlestore.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Conversational AI with GPT-4 Vision, OpenAI Whisper, and TTS

Overview

Demo Video:

Components

Main Files

Installation

Prerequisites

Libraries

Usage

Running the Scripts

Workflow

Notes

Conclusion

About

Releases

Packages

Languages

License

ayushpai/Sports-Buddy

Folders and files

Latest commit

History

Repository files navigation

Conversational AI with GPT-4 Vision, OpenAI Whisper, and TTS

Overview

Demo Video:

Components

Main Files

Installation

Prerequisites

Libraries

Usage

Running the Scripts

Workflow

Notes

Conclusion

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages