# AI Voicebot

AI Voicebot is an AI agent that communicates via ICE and TURN running on a coturn server.

coturn supports ICE and related specs:

* RFC 5245 - ICE
* RFC 5768 - ICE-SIP
* RFC 6336 - ICE-IANA Registry
* RFC 6544 - ICE-TCP
* RFC 5928 - TURN Resolution Mechanism

## To use

Set the environment variable COTURN_SERVER to the URL of the coturn server by modifying the .env file:

```.env
COTURN_SERVER="turns:ketrenos.com:5349"
```

You then launch the application, providing

## Architecture

The system is broken into two major components: client and server.

### client

The frontend client is written in React and served as a static build through the server's static file endpoint. Implementation of the client is in the `client` subdirectory.

It provides a Web UI for starting a human chat session. A lobby is created based on the URL, and any user with that URL can join that lobby.

The client uses `RTCPeerConnection`, `RTCSessionDescription`, `RTCIceCandidate`, `MediaStream`, `navigator.getUserMedia`, `navigator.mediaDevices`, and associated APIs to create audio (via the audio tag) and video (via the video tag) media elements in the Web UI client.

The client also exposes the ability to add new AI "users" to the lobby. When creating a user, you can provide a brief description of the user. The server uses that description to generate an AI person, including a profile picture, a voice signature used for text-to-speech, etc.

### server

The backend server is written in Python using the OpenAI Agentic AI SDK, and connects to an OpenAI-compatible server running at OPENAI_BASE_URL. Implementation of the server is in the `server` subdirectory.

The model used by the server for LLM communication is set via OPENAI_MODEL.
For example:

```.env
OPENAI_BASE_URL=http://192.168.1.198:8000/v3
OPENAI_MODEL=Qwen/Qwen3-8B
OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

If you want to use OpenAI instead of a self-hosted service, do not set OPENAI_BASE_URL and set OPENAI_API_KEY accordingly.

The server provides the AI chatbot and hosts the static files for the client frontend.

### Speech-to-Text and Text-to-Speech Configuration

The server supports pluggable speech-to-text (STT) and text-to-speech (TTS) backends. To configure these, set the following environment variables in your `.env` file:

```
STT_MODEL=your-speech-to-text-model
TTS_MODEL=your-text-to-speech-model
```

These models are used to transcribe incoming audio and synthesize AI responses, respectively. (See the Roadmap for planned model support.)

The server communicates with the coturn server in the same manner as the client, but from Python.

### shared

The `shared/` directory contains shared Pydantic models used for API communication between the server and voicebot components. This ensures type safety and consistency across the entire application.

Key benefits:

- **Type Safety**: All API communications are validated using Pydantic models
- **Consistency**: Both components use identical data structures
- **Maintainability**: Changes to data models only need to be made in one place
- **Documentation**: Models serve as living documentation of the API

The shared models include:

- Core data models (lobbies, sessions, participants)
- HTTP API request/response models
- WebSocket message models
- WebRTC signaling models

See `shared/README.md` for detailed documentation.

### API Communication

The server exposes an HTTP endpoint via FastAPI. This endpoint exposes the following capabilities:

1. Lobby creation
2. User management within a lobby
3. AI agent creation for a lobby
4. Connection details that let the voice system attach to and detach from coturn audio streams as users join and leave.
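
The four capabilities listed above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `LobbyRegistry`, `Lobby`, `Participant`, and every method name here are hypothetical stand-ins, not the project's actual FastAPI routes or `shared` Pydantic models, and plain dataclasses are used so the example runs with the standard library alone.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Participant:
    """A human or AI user in a lobby (hypothetical shape)."""
    name: str
    is_ai: bool = False
    description: str = ""  # brief description used to generate an AI persona


@dataclass
class Lobby:
    lobby_id: str
    name: str
    participants: dict[str, Participant] = field(default_factory=dict)


class LobbyRegistry:
    """In-memory stand-in for the lobby-related HTTP capabilities."""

    def __init__(self, turn_url: str = "") -> None:
        # turn_url plays the role of COTURN_SERVER; passing it in is an assumption.
        self._turn_url = turn_url
        self._lobbies: dict[str, Lobby] = {}

    def create_lobby(self, name: str) -> Lobby:
        # Capability 1: lobby creation.
        lobby = Lobby(lobby_id=uuid4().hex, name=name)
        self._lobbies[lobby.lobby_id] = lobby
        return lobby

    def add_user(self, lobby_id: str, name: str) -> Participant:
        # Capability 2: user management (join).
        user = Participant(name=name)
        self._lobbies[lobby_id].participants[name] = user
        return user

    def remove_user(self, lobby_id: str, name: str) -> None:
        # Capability 2: user management (leave); unknown names are ignored.
        self._lobbies[lobby_id].participants.pop(name, None)

    def add_ai_agent(self, lobby_id: str, name: str, description: str) -> Participant:
        # Capability 3: AI agent creation from a brief description.
        agent = Participant(name=name, is_ai=True, description=description)
        self._lobbies[lobby_id].participants[name] = agent
        return agent

    def connection_details(self, lobby_id: str) -> dict:
        # Capability 4: what a voice client would need to attach to the lobby.
        lobby = self._lobbies[lobby_id]
        return {
            "lobby_id": lobby.lobby_id,
            "turn_server": self._turn_url,
            "participants": sorted(lobby.participants),
        }
```

A real implementation would expose these operations as HTTP routes and validate payloads with the shared models; the sketch only shows how the four capabilities relate to one piece of lobby state.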

Once an AI agent is added to a lobby, it joins the audio stream(s) for that lobby. Audio input is passed to the speech-to-text processor, which produces a stream of text with time markers. That text is passed to the language processing layer of the AI agent, which sends it to the LLM for a response. The response is then passed through the text-to-speech processor, and the output stream is routed back to the coturn server for dispatch to the human UI viewers.

## Lobby Features

- **Player Management:** Players can join/leave lobbies, and their status is tracked in real time.
- **AI and Human Users:** Both AI and human users can participate in lobbies. AI users are generated with custom profiles and voices.

### Media and Peer Connection Handling

- **WebRTC Integration:** The client uses WebRTC APIs (RTCPeerConnection, RTCSessionDescription, RTCIceCandidate, MediaStream, etc.) to manage real-time audio/video streams between users and AI agents.
- **Dynamic Peer Management:** Peers are dynamically added/removed as users join or leave lobbies. The system handles ICE candidate negotiation, connection state changes, and media stream routing.
- **Audio/Video UI:** Audio and video streams are rendered in the browser using standard HTML media elements.

### Extensibility and Planned Enhancements

- **Pluggable STT/TTS Backends:** Support for additional speech-to-text and text-to-speech providers is planned.
- **Custom AI Agent Personalities:** Future versions will allow more detailed customization of AI agent behavior, voice, and appearance.
- **Improved Moderation and Controls:** Features for lobby moderation, user muting, and reporting are under consideration.
- **Mobile and Accessibility Improvements:** Enhanced support for mobile devices and accessibility features is on the roadmap.
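
To make the pluggable STT/TTS idea concrete, the speech-to-text, LLM, text-to-speech flow described earlier can be sketched with two small interfaces. Everything named here (`SpeechToText`, `TextToSpeech`, `EchoSTT`, `BytesTTS`, `fake_llm`, `run_turn`) is hypothetical and exists only for illustration; the toy backends do no real audio processing, and the sketch omits the streaming and time markers mentioned above.

```python
from typing import Callable, Protocol


class SpeechToText(Protocol):
    """Anything that can turn audio bytes into text."""
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    """Anything that can turn text into audio bytes."""
    def synthesize(self, text: str) -> bytes: ...


class EchoSTT:
    """Toy backend: treats the audio bytes as UTF-8 text."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class BytesTTS:
    """Toy backend: 'synthesizes' speech by encoding the text."""
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM call; a real agent would query OPENAI_MODEL."""
    return f"You said: {prompt}"


def run_turn(
    audio: bytes,
    stt: SpeechToText,
    llm: Callable[[str], str],
    tts: TextToSpeech,
) -> bytes:
    """One conversational turn: audio in, audio out."""
    text = stt.transcribe(audio)   # speech-to-text
    reply = llm(text)              # language model response
    return tts.synthesize(reply)   # text-to-speech


if __name__ == "__main__":
    print(run_turn(b"hello", EchoSTT(), fake_llm, BytesTTS()))
```

Because `run_turn` depends only on the two protocols, swapping in a different provider would mean writing a new class with the same method, which is one plausible way to read "pluggable" here.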

---

## Roadmap

- [ ] Add support for multiple STT/TTS providers
- [ ] Expand game logic and add new game types
- [ ] Improve AI agent customization options
- [ ] Add lobby moderation and user controls
- [ ] Enhance mobile and accessibility support

Contributions and feature requests are welcome!

# Message sequence for WebRTC application

This application provides session management, lobby management, and WebRTC signaling.

## Phase 1: Initial Connection & Session Management

```
Frontend                   Backend
|                          |
|----- HTTP Request ------>| (Initial page load)
|                          | Check session cookie
|                          | If no cookie -> create new session
|                          | If cookie exists -> validate session
|<---- HTTP Response ------| Set/update session cookie
|                          |
|----- WebSocket Conn ---->| Upgrade to WebSocket
|                          | Associate WebSocket with session
|<---- session_established-| { sessionId }
```

## Phase 2: Lobby Management

### Creating a Lobby:

```
Frontend A                 Backend
|                          |
|----- create_lobby ------>| { lobbyName, settings }
|                          | Create lobby instance
|                          | Add user to lobby
|<---- lobby_created ------| { lobbyId, lobbyInfo }
```

### Joining a Lobby:

```
Frontend B                 Backend                    Frontend A
|----- ws:join_lobby ----->| { lobbyId }              |
|                          | Add user to lobby        |
|<---- ws:lobby_joined ----| { lobbyInfo }            |
|<---- ws:lobby_state -----| { participants: [...] }  |
|                          |                          |
|                          |---- ws:user_joined ----->| { newUser }
```

## Phase 3: WebRTC Signaling Initiation

When all required participants are in the lobby, the backend initiates WebRTC negotiation:

```
Frontend A                 Backend                    Frontend B
|                          |                          |
|                          | Check if conditions      |
|                          | are met for WebRTC       |
|<--- start_webrtc_nego ---| { participants }         |
|                          |--- start_webrtc_nego --->| { participants }
|                          |                          |
| Create RTCPeerConnection |                          | Create RTCPeerConnection
| Set up local media       |                          | Set up local media
|                          |                          |
|<-- negotiation_needed ---|                          |
|                          |--- negotiation_needed -->|
```

## Phase 4: WebRTC Offer/Answer Exchange

```
Frontend A (Initiator)     Backend                    Frontend B (Receiver)
|                          |                          |
| createOffer()            |                          |
| setLocalDescription()    |                          |
|                          |                          |
|----- webrtc_offer ------>| { offer, targetUser }    |
|                          |------ webrtc_offer ----->|
|                          |                          | setRemoteDescription()
|                          |                          | createAnswer()
|                          |                          | setLocalDescription()
|                          |                          |
|                          |<----- webrtc_answer -----| { answer, targetUser }
|<----- webrtc_answer -----|                          |
| setRemoteDescription()   |                          |
```

## Phase 5: ICE Candidate Exchange

```
Frontend A                 Backend                    Frontend B
|                          |                          |
| ICE gathering starts     |                          | ICE gathering starts
|                          |                          |
|------ ice_candidate ---->| { candidate, target }    |
|                          |----- ice_candidate ----->| addIceCandidate()
|                          |                          |
|                          |<---- ice_candidate ------| { candidate, target }
|<----- ice_candidate -----|                          |
| addIceCandidate()        |                          |
|                          |                          |
|      (Repeat for all ICE candidates collected)      |
```

## Phase 6: Connection Establishment & State Management

```
Frontend A                 Backend                    Frontend B
|                          |                          |
| onconnectionstatechange  |                          | onconnectionstatechange
|                          |                          |
|--- webrtc_state_change ->| { state: "connecting" }  |
|                          |-- webrtc_state_change -->| { state: "connecting" }
|                          |                          |
|     P2P Connection Established (WebRTC direct)      |
|<===================== Direct Media Flow ===========>|
|                          |                          |
|-- webrtc_state_change -->| { state: "connected" }   |
|                          |-- webrtc_state_change -->| { state: "connected" }
|                          |                          |
|<---- connection_ready ---|                          |
|                          |----- connection_ready -->|
```

## Key Message Types

### Session Management:

- `session_established` - Confirms session creation/restoration
- `session_expired` - Session timeout notification

### Lobby Management:

- `create_lobby` / `lobby_created`
- `join_lobby` / `lobby_joined`
- `leave_lobby` / `user_left`
- `lobby_state` - Current lobby participants and settings
- `lobby_destroyed` - Lobby cleanup

### WebRTC Signaling:

- `start_webrtc_negotiation` - Triggers WebRTC setup
- `webrtc_offer` - SDP offer
- `webrtc_answer` - SDP answer
- `ice_candidate` - ICE candidate exchange
- `webrtc_state_change` - Connection state updates
- `connection_ready` - P2P connection established

### Error Handling:

- `error` - Generic error message
- `lobby_full` - Lobby at capacity
- `webrtc_failed` - WebRTC negotiation failure
- `session_invalid` - Session validation failed

## Implementation Considerations:

1. **Session Persistence**: Store session data in Redis/database for horizontal scaling
2. **Lobby State**: Maintain lobby state in memory with periodic persistence
3. **WebSocket Management**: Handle reconnections and cleanup properly
4. **WebRTC Timeout**: Implement timeouts for offer/answer and ICE gathering
5. **Error Recovery**: Graceful fallbacks when WebRTC negotiation fails
6. **Security**: Validate session cookies and sanitize all incoming messages

The backend acts as the signaling server, routing WebRTC negotiation messages between peers while managing application state. Once the P2P connection is established, media flows directly between clients, but the WebSocket connection remains for application-level messaging.
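
The relay role described above can be sketched as a small router that forwards `webrtc_offer`, `webrtc_answer`, and `ice_candidate` messages to their target and answers anything else with an `error` message. This is an illustration under stated assumptions, not the project's implementation: `SignalingRouter`, the in-memory `outbox` (standing in for WebSocket sends), the `fromUser` field, and the use of `targetUser` as the routing key for all three message types are hypothetical.

```python
import json
from collections import defaultdict

# Message types the backend relays between peers without interpreting them.
RELAYED_TYPES = {"webrtc_offer", "webrtc_answer", "ice_candidate"}


class SignalingRouter:
    """Relays signaling messages between connected users (hypothetical)."""

    def __init__(self) -> None:
        self.connected: set[str] = set()
        # user id -> JSON strings that would be written to that user's WebSocket
        self.outbox: dict[str, list[str]] = defaultdict(list)

    def connect(self, user_id: str) -> None:
        self.connected.add(user_id)

    def _send(self, user_id: str, message: dict) -> None:
        self.outbox[user_id].append(json.dumps(message))

    def _error(self, user_id: str, reason: str) -> None:
        self._send(user_id, {"type": "error", "message": reason})

    def handle(self, sender: str, raw: str) -> None:
        """Validate one incoming message and forward it to its target."""
        try:
            msg = json.loads(raw)
        except json.JSONDecodeError:
            self._error(sender, "invalid JSON")
            return
        if not isinstance(msg, dict):
            self._error(sender, "message must be a JSON object")
            return
        if msg.get("type") not in RELAYED_TYPES:
            self._error(sender, "unsupported message type")
            return
        target = msg.get("targetUser")
        if not isinstance(target, str) or target not in self.connected:
            self._error(sender, "unknown target user")
            return
        forwarded = dict(msg)
        forwarded["fromUser"] = sender  # tell the receiver who sent it
        self._send(target, forwarded)


if __name__ == "__main__":
    router = SignalingRouter()
    router.connect("a")
    router.connect("b")
    offer = {"type": "webrtc_offer", "targetUser": "b", "offer": "sdp"}
    router.handle("a", json.dumps(offer))
    print(router.outbox["b"])
```

Rejecting malformed or unroutable messages before forwarding is one way to approach the "sanitize all incoming messages" item; timeouts and reconnection handling from the same list are left out of the sketch.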