Backend Restart Issue Fix
Problem Description
When backend services (server or voicebot) restart, active frontend UIs become unable to add bots, resulting in:
POST https://ketrenos.com/ai-voicebot/api/bots/ai_chatbot/join 404 (Not Found)
Root Cause Analysis
The issue was caused by three main problems:
- Incorrect Provider Registration Check: The voicebot service was checking provider registration using the wrong API endpoint (/api/bots instead of /api/bots/providers)
- No Persistence for Bot Providers: Bot providers were stored only in memory and lost on server restart, requiring re-registration
- AsyncIO Task Initialization Issue: The cleanup task was being created during __init__ when no event loop was running, causing FastAPI route registration failures
Fixes Implemented
1. Fixed Provider Registration Check Endpoint
File: voicebot/bot_orchestrator.py
Problem: The check_provider_registration function was calling /api/bots (which returns available bots) instead of /api/bots/providers (which returns registered providers).
Fix: Updated the function to use the correct endpoint and parse the response properly:
async def check_provider_registration(server_url: str, provider_id: str, insecure: bool = False) -> bool:
    """Check if the bot provider is still registered with the server."""
    try:
        import httpx

        verify = not insecure
        async with httpx.AsyncClient(verify=verify) as client:
            # Check if our provider is still in the provider list
            response = await client.get(f"{server_url}/api/bots/providers", timeout=5.0)
            if response.status_code == 200:
                data = response.json()
                providers = data.get("providers", [])
                # providers is a list of BotProviderModel objects, check if our provider_id is in the list
                is_registered = any(provider.get("provider_id") == provider_id for provider in providers)
                logger.debug(f"Registration check: provider_id={provider_id}, found_providers={len(providers)}, is_registered={is_registered}")
                return is_registered
            else:
                logger.warning(f"Registration check failed: HTTP {response.status_code}")
                return False
    except Exception as e:
        logger.debug(f"Provider registration check failed: {e}")
        return False
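To show how a check like this might drive re-registration after a server restart, here is a minimal sketch. The helper name `reregister_if_dropped`, its parameters, and the two callables are hypothetical and do not come from the voicebot code; the sketch only assumes that some coroutine reports registration status and another performs registration.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def reregister_if_dropped(
    is_registered: Callable[[], Awaitable[bool]],  # e.g. wraps check_provider_registration
    register: Callable[[], Awaitable[str]],        # hypothetical registration coroutine
    interval: float = 30.0,                        # assumed polling interval in seconds
    max_checks: Optional[int] = None,              # None means poll forever
) -> int:
    """Poll the registration check and re-register whenever it reports False.

    Returns the number of re-registrations performed.
    """
    reregistrations = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if not await is_registered():
            await register()
            reregistrations += 1
        await asyncio.sleep(interval)
    return reregistrations
```

Because the check returns False on any error, a loop like this would also attempt to re-register while the server is unreachable, so the registration call itself should tolerate failures.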
2. Added Bot Provider Persistence
File: server/core/bot_manager.py
Problem: Bot providers were stored only in memory and lost on server restart.
Fix: Added persistence functionality to save/load bot providers to/from bot_providers.json:
def _save_bot_providers(self):
    """Save bot providers to disk"""
    try:
        with self.lock:
            providers_data = {}
            for provider_id, provider in self.bot_providers.items():
                providers_data[provider_id] = provider.model_dump()
        with open(self.bot_providers_file, 'w') as f:
            json.dump(providers_data, f, indent=2)
        logger.debug(f"Saved {len(providers_data)} bot providers to {self.bot_providers_file}")
    except Exception as e:
        logger.error(f"Failed to save bot providers: {e}")

def _load_bot_providers(self):
    """Load bot providers from disk"""
    try:
        if not os.path.exists(self.bot_providers_file):
            logger.debug(f"No bot providers file found at {self.bot_providers_file}")
            return
        with open(self.bot_providers_file, 'r') as f:
            providers_data = json.load(f)
        with self.lock:
            for provider_id, provider_dict in providers_data.items():
                try:
                    provider = BotProviderModel.model_validate(provider_dict)
                    self.bot_providers[provider_id] = provider
                except Exception as e:
                    logger.warning(f"Failed to load bot provider {provider_id}: {e}")
        logger.info(f"Loaded {len(self.bot_providers)} bot providers from {self.bot_providers_file}")
    except Exception as e:
        logger.error(f"Failed to load bot providers: {e}")
Integration: The persistence functions are called automatically:
- _load_bot_providers() during BotManager.__init__()
- _save_bot_providers() when registering new providers or removing stale ones
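The save/load round trip can be tried in isolation with a small sketch. The `Provider` dataclass below is a hypothetical stand-in for BotProviderModel (its fields are assumed), and `save_providers` / `load_providers` are illustrative free functions rather than the BotManager methods themselves.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class Provider:  # hypothetical stand-in for BotProviderModel; fields are assumed
    provider_id: str
    name: str
    last_seen: float


def save_providers(path: str, providers: Dict[str, Provider]) -> None:
    """Write all providers to a JSON file keyed by provider ID."""
    data = {pid: asdict(p) for pid, p in providers.items()}
    with open(path, "w") as f:
        json.dump(data, f, indent=2)


def load_providers(path: str) -> Dict[str, Provider]:
    """Read providers back, skipping entries that do not match the model."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        raw = json.load(f)
    loaded: Dict[str, Provider] = {}
    for pid, fields in raw.items():
        try:
            loaded[pid] = Provider(**fields)
        except TypeError:
            continue  # malformed entry: missing or unexpected fields
    return loaded


# Simulate a restart: save, then load into a fresh dictionary.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bot_providers.json")
    before = {"p1": Provider("p1", "voicebot", 1700000000.0)}
    save_providers(path, before)
    after = load_providers(path)
```

As in the fix above, a single malformed entry is skipped rather than aborting the whole load, so one bad record does not discard every other persisted provider.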
3. Fixed AsyncIO Task Initialization Issue
File: server/core/bot_manager.py
Problem: The cleanup task was being created during BotManager.__init__() when no event loop was running, causing the FastAPI application to fail to register routes properly.
Fix: Deferred the cleanup task creation until it's actually needed:
def __init__(self):
    # ... other initialization ...
    # Load persisted bot providers
    self._load_bot_providers()
    # Note: Don't start cleanup task here - will be started when needed

def start_cleanup(self):
    """Start the cleanup task"""
    try:
        if self.cleanup_task is None:
            self.cleanup_task = asyncio.create_task(self._periodic_cleanup())
            logger.debug("Bot provider cleanup task started")
    except RuntimeError:
        # No event loop running yet, cleanup will be started later
        logger.debug("No event loop available for bot provider cleanup task")

async def register_provider(self, request: BotProviderRegisterRequest) -> BotProviderRegisterResponse:
    # ... registration logic ...
    # Start cleanup task if not already running
    self.start_cleanup()
    return BotProviderRegisterResponse(provider_id=provider_id)
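The behaviour this fix relies on can be reproduced with the standard library alone: a task cannot be created unless an event loop is running. The sketch below uses a hypothetical `Manager` class, not the real BotManager, and checks for a running loop with `asyncio.get_running_loop()` before creating the task, which is a slight variation on the try/except around `asyncio.create_task()` shown above.

```python
import asyncio
from typing import Optional


class Manager:  # hypothetical stand-in for BotManager
    def __init__(self) -> None:
        # Safe to construct at import time: no task is created here.
        self.cleanup_task: Optional[asyncio.Task] = None

    async def _periodic_cleanup(self) -> None:
        await asyncio.sleep(3600)  # placeholder for the real cleanup loop

    def start_cleanup(self) -> bool:
        """Start the cleanup task if a loop is running; report whether it is started."""
        if self.cleanup_task is not None:
            return True
        try:
            asyncio.get_running_loop()  # raises RuntimeError outside a running loop
        except RuntimeError:
            return False
        self.cleanup_task = asyncio.create_task(self._periodic_cleanup())
        return True


manager = Manager()
started_without_loop = manager.start_cleanup()  # called from synchronous code


async def main() -> bool:
    started = manager.start_cleanup()  # called from inside a running loop
    manager.cleanup_task.cancel()      # tidy up the placeholder task
    try:
        await manager.cleanup_task
    except asyncio.CancelledError:
        pass
    return started


started_with_loop = asyncio.run(main())
```

Checking for the loop first avoids building a coroutine object that is never awaited, which Python would otherwise report as a RuntimeWarning when the task cannot be created.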
4. Added Periodic Cleanup for Stale Providers
File: server/core/bot_manager.py
Enhancement: Added a background task that periodically removes providers that haven't been seen in 15 minutes:
async def _periodic_cleanup(self):
    """Periodically clean up stale bot providers"""
    cleanup_interval = 300  # 5 minutes
    stale_threshold = 900  # 15 minutes
    while not self._shutdown_event.is_set():
        try:
            await asyncio.sleep(cleanup_interval)
            now = time.time()
            providers_to_remove = []
            with self.lock:
                for provider_id, provider in self.bot_providers.items():
                    if now - provider.last_seen > stale_threshold:
                        providers_to_remove.append(provider_id)
                        logger.info(f"Marking stale bot provider for removal: {provider.name} (ID: {provider_id}, last_seen: {now - provider.last_seen:.1f}s ago)")
            if providers_to_remove:
                with self.lock:
                    for provider_id in providers_to_remove:
                        if provider_id in self.bot_providers:
                            del self.bot_providers[provider_id]
                # Save after releasing the lock, since _save_bot_providers() acquires it itself
                self._save_bot_providers()
                logger.info(f"Cleaned up {len(providers_to_remove)} stale bot providers")
        except asyncio.CancelledError:
            break
        except Exception as e:
            logger.error(f"Error in bot provider cleanup: {e}")
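The staleness rule is easier to test when isolated as a pure function. The sketch below uses a hypothetical helper, `find_stale`, that takes a plain mapping of provider ID to last-seen timestamp instead of BotProviderModel objects. Note that the comparison is strict, and that because the sweep runs every 300 seconds, a provider can remain listed for up to roughly 1200 seconds after it was last seen.

```python
from typing import Dict, List

STALE_THRESHOLD = 900.0  # seconds; mirrors stale_threshold in _periodic_cleanup


def find_stale(
    last_seen: Dict[str, float],
    now: float,
    threshold: float = STALE_THRESHOLD,
) -> List[str]:
    """Return IDs whose last_seen is strictly more than `threshold` seconds old."""
    return [pid for pid, seen in last_seen.items() if now - seen > threshold]


# A provider seen exactly 900 seconds ago is kept; 901 seconds ago is removed.
example = {"fresh": 9900.0, "edge": 9100.0, "stale": 9099.0}
stale_ids = find_stale(example, now=10000.0)
```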
5. Added Client-Side Retry Logic
File: client/src/BotManager.tsx
Enhancement: Added retry logic to handle temporary 404s during service restarts:
// Retry logic for handling service restart scenarios
let retries = 3;
let response;
while (retries > 0) {
  try {
    response = await botsApi.requestJoinLobby(selectedBot, request);
    break; // Success, exit retry loop
  } catch (err: any) {
    retries--;
    // If it's a 404 error and we have retries left, wait and retry
    if (err?.status === 404 && retries > 0) {
      console.log(`Bot join failed with 404, retrying... (${retries} attempts left)`);
      await new Promise(resolve => setTimeout(resolve, 1000)); // Wait 1 second
      continue;
    }
    // If it's not a 404 or we're out of retries, throw the error
    throw err;
  }
}
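For scripts or tests that call the join endpoint from Python rather than from the browser, the same retry pattern might look like the sketch below. `HttpError` and `retry_on_404` are hypothetical names introduced for illustration; they are not part of the client or server code.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class HttpError(Exception):  # hypothetical error type carrying an HTTP status
    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


async def retry_on_404(
    call: Callable[[], Awaitable[T]],
    retries: int = 3,    # total attempts, matching the client code above
    delay: float = 1.0,  # seconds to wait between attempts
) -> T:
    """Retry `call` on 404 responses; re-raise any other error immediately."""
    while True:
        try:
            return await call()
        except HttpError as err:
            retries -= 1
            if err.status == 404 and retries > 0:
                await asyncio.sleep(delay)
                continue
            raise
```

With three attempts and a one-second delay, this covers a restart window of about two seconds; a longer restart would need more attempts or a longer delay.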
Benefits
- Persistence: Bot providers now survive server restarts and don't need to re-register immediately
- Correct Registration Checks: Provider registration checks use the correct API endpoint
- Proper AsyncIO Task Management: Cleanup tasks are started only when an event loop is available
- Automatic Cleanup: Stale providers are automatically removed to prevent accumulation of dead entries
- Client Resilience: Frontend can handle temporary 404s during service restarts with automatic retries
- Reduced Downtime: Users experience fewer failed bot additions during service restarts
Testing
After implementing these fixes:
- Bot providers are correctly persisted in bot_providers.json
- Server restarts load existing providers from disk
- Provider registration checks use the correct /api/bots/providers endpoint
- AsyncIO cleanup tasks start properly without interfering with route registration
- Client retries failed requests with 404 errors
- Periodic cleanup prevents accumulation of stale providers
- Bot join requests work correctly: POST /api/bots/{bot_name}/join returns 200 OK
Verification Commands
Test the fix with these commands:
# Check available lobbies
curl -k https://ketrenos.com/ai-voicebot/api/lobby
# Test bot join (replace lobby_id and provider_id with actual values)
curl -k -X POST https://ketrenos.com/ai-voicebot/api/bots/ai_chatbot/join \
  -H "Content-Type: application/json" \
  -d '{"lobby_id":"<lobby_id>","nick":"test-bot","provider_id":"<provider_id>"}'
# Check bot providers
curl -k https://ketrenos.com/ai-voicebot/api/bots/providers
# Check available bots
curl -k https://ketrenos.com/ai-voicebot/api/bots
Files Modified
- voicebot/bot_orchestrator.py: Fixed registration check endpoint
- server/core/bot_manager.py: Added persistence and cleanup
- client/src/BotManager.tsx: Added retry logic
Configuration
No additional configuration is required. The fixes work with existing environment variables and settings.