# AI Voicebot

AI Voicebot is an agentic AI application that communicates via ICE and TURN through a coturn server.

coturn provides ICE and related specs:

* RFC 5245 - ICE
* RFC 5768 - ICE-SIP
* RFC 6336 - ICE-IANA Registry
* RFC 6544 - ICE-TCP
* RFC 5928 - TURN Resolution Mechanism

## To use

Set the environment variable COTURN_SERVER to point to the URL of the
coturn server by modifying the .env file:

```.env
COTURN_SERVER="turns:ketrenos.com:5349"
```
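
This value is consumed wherever the application creates a WebRTC peer connection. As a minimal sketch of the idea on the Python side, assuming python-dotenv and aiortc are in use (neither is mandated by this README) and that any TURN credentials live in hypothetical TURN_USERNAME / TURN_PASSWORD variables:

```python
import os

from aiortc import RTCConfiguration, RTCIceServer, RTCPeerConnection  # assumption: aiortc
from dotenv import load_dotenv  # assumption: python-dotenv

load_dotenv()  # read COTURN_SERVER (and any other settings) from .env


def build_peer_connection() -> RTCPeerConnection:
    """Create a peer connection that negotiates ICE/TURN through the coturn server."""
    ice_server = RTCIceServer(
        urls=[os.environ["COTURN_SERVER"]],          # e.g. "turns:ketrenos.com:5349"
        username=os.environ.get("TURN_USERNAME"),    # hypothetical credential variables;
        credential=os.environ.get("TURN_PASSWORD"),  # omit if the server allows anonymous use
    )
    return RTCPeerConnection(configuration=RTCConfiguration(iceServers=[ice_server]))
```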

You can then launch the application.

## Architecture

The system is broken into two major components: the client and the server.

### client

The frontend client is written in React and is served as a static build
through the server's static file endpoint.

Implementation of the client is in the `client` subdirectory.

The client provides a Web UI for starting a human chat session. A lobby is created based on the URL, and any user with that URL can join that lobby.

The client uses RTCPeerConnection, RTCSessionDescription, RTCIceCandidate, MediaStream, navigator.getUserMedia, navigator.mediaDevices, and associated APIs to render audio (via the audio tag) and video (via the video tag) media elements in the Web UI.

The client also exposes the ability to add new AI "users" to the lobby. When creating a user, you can provide a brief description of the user. The server
will use that description to generate an AI persona, including a profile picture, a voice signature used for text-to-speech, and so on.
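
The exact route and payload for this live in the server code rather than in this README, so the following is only a hypothetical illustration of adding an AI user from a script (the endpoint path, port, and field names are invented for the example):

```python
import requests  # assumption: the requests library is installed

# Hypothetical endpoint and payload; consult the `server` subdirectory for the real API.
response = requests.post(
    "http://localhost:8000/api/lobby/my-lobby/ai-agents",
    json={"description": "A cheerful retired astronaut who loves trivia"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. the generated persona: name, profile picture, voice signature
```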

### server

The backend server is written in Python using the OpenAI Agentic AI SDK, and connects to an OpenAI-compatible server running at OPENAI_BASE_URL.

Implementation of the server is in the `server` subdirectory.

The model used by the server for LLM communication is set via OPENAI_MODEL. For example:

```.env
OPENAI_BASE_URL=http://192.168.1.198:8000/v3
OPENAI_MODEL=Qwen/Qwen3-8B
OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

If you want to use OpenAI itself instead of a self-hosted service, leave OPENAI_BASE_URL unset and set OPENAI_API_KEY accordingly.
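
How the server consumes these variables is internal to the `server` code; as a minimal sketch of the idea, using the plain openai Python client for illustration:

```python
import os

from openai import OpenAI  # assumption: the official openai package is installed

# When OPENAI_BASE_URL is unset, the client talks to OpenAI's hosted API.
client = OpenAI(
    base_url=os.environ.get("OPENAI_BASE_URL"),  # e.g. http://192.168.1.198:8000/v3
    api_key=os.environ["OPENAI_API_KEY"],
)

reply = client.chat.completions.create(
    model=os.environ.get("OPENAI_MODEL", "Qwen/Qwen3-8B"),
    messages=[{"role": "user", "content": "Introduce yourself to the lobby."}],
)
print(reply.choices[0].message.content)
```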

The server provides the AI chatbot and hosts the static files for the client frontend.

### Speech-to-Text and Text-to-Speech Configuration

The server supports pluggable speech-to-text (STT) and text-to-speech (TTS) backends. To configure these, set the following environment variables in your `.env` file:

```.env
STT_MODEL=your-speech-to-text-model
TTS_MODEL=your-text-to-speech-model
```

These models are used to transcribe incoming audio and synthesize AI responses, respectively. (See the roadmap below for planned model support.)
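
Which concrete engines back these settings is not fixed by this README; conceptually, the two variables select backends that satisfy interfaces along these lines (the protocol names below are invented for illustration):

```python
import os
from typing import Protocol


class SpeechToText(Protocol):
    """Interface the configured STT backend is assumed to satisfy."""

    def transcribe(self, pcm_audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    """Interface the configured TTS backend is assumed to satisfy."""

    def synthesize(self, text: str) -> bytes: ...


# The variables only name the models; the server maps them onto concrete backends.
STT_MODEL = os.environ.get("STT_MODEL", "your-speech-to-text-model")
TTS_MODEL = os.environ.get("TTS_MODEL", "your-text-to-speech-model")
```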

The server communicates with the coturn server in the same manner as the client, but from Python rather than from the browser.
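
As a sketch of what that can look like with aiortc (an assumption; the README does not name the Python WebRTC library), the server subscribes to incoming audio tracks on its peer connection and pumps frames into the voice pipeline:

```python
import asyncio

from aiortc import RTCPeerConnection  # assumption: aiortc on the Python side
from aiortc.mediastreams import MediaStreamError


def attach_audio_handler(pc: RTCPeerConnection, handle_audio_frame) -> None:
    """Forward every incoming audio frame to a caller-supplied pipeline hook."""

    @pc.on("track")
    def on_track(track):
        if track.kind != "audio":
            return  # ignore video tracks in this sketch

        async def pump() -> None:
            while True:
                try:
                    frame = await track.recv()  # raw audio from the coturn-relayed stream
                except MediaStreamError:
                    break  # the remote participant left the lobby
                handle_audio_frame(frame)  # e.g. hand off to the speech-to-text processor

        asyncio.ensure_future(pump())
```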

The server exposes an HTTP API via FastAPI with the following capabilities (a sketch follows this list):

1. Lobby creation
2. User management within a lobby
3. AI agent creation for a lobby
4. Connection details for the voice system to attach to and detach from coturn audio streams as users join and leave
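
The concrete routes live in the `server` subdirectory; purely as an illustrative sketch (paths, models, and the static directory below are hypothetical), the FastAPI surface for those capabilities might look like this:

```python
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel

app = FastAPI()


class AIAgentRequest(BaseModel):
    description: str  # brief description used to generate the AI persona


@app.post("/api/lobby/{lobby_id}")
async def create_lobby(lobby_id: str):
    """Capability 1: create (or look up) the lobby named in the URL."""
    return {"lobby": lobby_id}


@app.post("/api/lobby/{lobby_id}/ai-agents")
async def add_ai_agent(lobby_id: str, request: AIAgentRequest):
    """Capability 3: generate an AI persona from a description and add it to the lobby."""
    return {"lobby": lobby_id, "description": request.description}


# The React client's static build is served from the same process (directory is hypothetical).
app.mount("/", StaticFiles(directory="client/build", html=True), name="client")
```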

Once an AI agent is added to a lobby, it joins the audio stream(s) for that lobby.

Audio input is then passed to the speech-to-text processor to produce a stream of text with time markers.

That text is then passed to the language processing layer of the AI agent, which passes it to the LLM for a response.

The response is then passed through the text-to-speech processor, with the output stream routed back to the coturn server for dispatch to the human viewers in the Web UI.
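
Conceptually, the per-utterance loop ties those steps together roughly like this (all five parameters are stand-ins for whatever backends are configured; none of the names come from the actual code):

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    start_time: float  # time markers supplied by the speech-to-text processor
    end_time: float


async def voicebot_loop(incoming_audio, stt, agent, tts, outgoing_audio) -> None:
    """Conceptual per-lobby loop: audio in -> STT -> LLM -> TTS -> audio out."""
    async for chunk in incoming_audio:                       # frames from the coturn-relayed stream
        utterance: Utterance = await stt.transcribe(chunk)   # speech-to-text with time markers
        if not utterance.text.strip():
            continue                                         # silence or noise: nothing to answer
        reply_text = await agent.respond(utterance.text)     # language layer hands the text to the LLM
        reply_audio = await tts.synthesize(reply_text)       # text-to-speech
        await outgoing_audio.send(reply_audio)               # routed back through coturn to the humans
```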

## Lobby Features

- **Player Management:** Players can join/leave lobbies, and their status is tracked in real time.
- **AI and Human Users:** Both AI and human users can participate in lobbies. AI users are generated with custom profiles and voices.

### Media and Peer Connection Handling

- **WebRTC Integration:** The client uses WebRTC APIs (RTCPeerConnection, RTCSessionDescription, RTCIceCandidate, MediaStream, etc.) to manage real-time audio/video streams between users and AI agents.
- **Dynamic Peer Management:** Peers are dynamically added/removed as users join or leave lobbies. The system handles ICE candidate negotiation, connection state changes, and media stream routing.
- **Audio/Video UI:** Audio and video streams are rendered in the browser using standard HTML media elements.

### Extensibility and Planned Enhancements

- **Pluggable STT/TTS Backends:** Support for additional speech-to-text and text-to-speech providers is planned.
- **Custom AI Agent Personalities:** Future versions will allow more detailed customization of AI agent behavior, voice, and appearance.
- **Improved Moderation and Controls:** Features for lobby moderation, user muting, and reporting are under consideration.
- **Mobile and Accessibility Improvements:** Enhanced support for mobile devices and accessibility features is on the roadmap.

---

## Roadmap

- [ ] Add support for multiple STT/TTS providers
- [ ] Expand game logic and add new game types
- [ ] Improve AI agent customization options
- [ ] Add lobby moderation and user controls
- [ ] Enhance mobile and accessibility support

Contributions and feature requests are welcome!