Indonesian Learning App with AI Speech Integration

Setup Instructions

1. Prerequisites

  • Python 3.11+
  • Node.js 16+
  • Google Cloud Account
  • OpenAI API Key

2. Google Cloud Setup

  1. Create a new Google Cloud project or use an existing one
  2. Enable the following APIs:
    • Cloud Speech-to-Text API
    • Cloud Text-to-Speech API
  3. Create a service account with the following roles:
    • Speech Client
    • Text-to-Speech Client
  4. Download the service account key JSON file
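Before pointing the backend at the downloaded key, it can help to confirm the file is a well-formed service account key. The sketch below is not part of the app; the function name `check_service_account_key` is hypothetical, and it only checks a few fields that standard Google Cloud key files contain. It does not verify roles or contact Google Cloud.

```python
import json
from pathlib import Path

# A few fields found in a standard Google Cloud service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path):
    """Return a list of problems found in the key file (empty if it looks fine)."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"file not found: {key_path}"]
    try:
        data = json.loads(key_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    # Report every missing field rather than stopping at the first one.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]
    if data.get("type") != "service_account":
        problems.append('"type" is not "service_account"')
    return problems
```

Calling `check_service_account_key("path/to/your/service-account-key.json")` returns an empty list when the file passes these basic checks.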

3. Environment Configuration

Backend Configuration

  1. Copy the environment template:

    cd backend
    cp .env.example .env
    
  2. Edit backend/.env with your credentials:

    # Required
    GOOGLE_APPLICATION_CREDENTIALS=path/to/your/service-account-key.json
    OPENAI_API_KEY=your-openai-api-key-here
    
    # Optional - customize as needed
    OPENAI_MODEL=gpt-4o-mini
    GOOGLE_CLOUD_PROJECT=your-project-id
    SPEECH_LANGUAGE_CODE=id-ID
    TTS_VOICE_NAME=id-ID-Standard-A
    TTS_VOICE_GENDER=FEMALE
    HOST=0.0.0.0
    PORT=8000
    CORS_ORIGINS=http://localhost:3000,http://localhost:5173
    

Frontend Configuration

  1. From the project root, copy the environment template:

    cp .env.example .env
    
  2. Edit .env if needed (defaults should work):

    VITE_API_BASE_URL=http://localhost:8000
    VITE_WS_BASE_URL=ws://localhost:8000
    VITE_ENABLE_SPEECH_FEATURES=true
    VITE_ENABLE_AI_CHAT=true
    

4. Backend Setup

cd backend
pip install uv  # if not already installed
uv sync

5. Frontend Setup

npm install  # run from the project root, not backend/

6. Running the Application

Start the backend:

cd backend
uv run python main.py

The backend listens on http://localhost:8000 (controlled by HOST and PORT in backend/.env).

Start the frontend from the project root, in a second terminal:

npm run dev

The frontend runs on http://localhost:5173 (the Vite dev server default).

7. Using the App

  1. Traditional Mode: The original structured learning experience
  2. AI Chat Mode: New conversational AI with speech-to-text and text-to-speech

AI Chat Features:

  • Speech Input: Click "🎤 Speak" to record your voice in Indonesian
  • Text Input: Type messages in Indonesian
  • AI Response: The model set in OPENAI_MODEL (gpt-4o-mini in the template) responds in Indonesian with educational guidance
  • Speech Output: AI responses are automatically converted to speech
  • Real-time: WebSocket streaming for low-latency conversation

8. Environment Variables Summary

Backend (.env file):

# Required
GOOGLE_APPLICATION_CREDENTIALS=path/to/your/service-account-key.json
OPENAI_API_KEY=your-openai-api-key

# Optional Configuration
OPENAI_MODEL=gpt-4o-mini
GOOGLE_CLOUD_PROJECT=your-project-id
SPEECH_LANGUAGE_CODE=id-ID
SPEECH_SAMPLE_RATE=48000
SPEECH_ENCODING=WEBM_OPUS
TTS_LANGUAGE_CODE=id-ID
TTS_VOICE_NAME=id-ID-Standard-A
TTS_VOICE_GENDER=FEMALE
TTS_SPEAKING_RATE=1.0
TTS_PITCH=0.0
HOST=0.0.0.0
PORT=8000
DEBUG=false
CORS_ORIGINS=http://localhost:3000,http://localhost:5173
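As a rough illustration of how these variables could be consumed, the sketch below reads a subset of them into a typed settings object, using the values listed above as fallbacks. The `Settings` class and `load_settings` function are hypothetical names, not the app's actual loader, and treating the listed values as defaults is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    google_credentials: str
    openai_api_key: str
    openai_model: str
    speech_language_code: str
    speech_sample_rate: int
    tts_voice_name: str
    host: str
    port: int
    debug: bool
    cors_origins: tuple


REQUIRED = ("GOOGLE_APPLICATION_CREDENTIALS", "OPENAI_API_KEY")


def load_settings(env=None):
    """Build Settings from a mapping of environment variables (os.environ by default)."""
    env = os.environ if env is None else env
    # Fail early with one message naming every missing required variable.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    origins = env.get("CORS_ORIGINS", "http://localhost:3000,http://localhost:5173")
    return Settings(
        google_credentials=env["GOOGLE_APPLICATION_CREDENTIALS"],
        openai_api_key=env["OPENAI_API_KEY"],
        openai_model=env.get("OPENAI_MODEL", "gpt-4o-mini"),
        speech_language_code=env.get("SPEECH_LANGUAGE_CODE", "id-ID"),
        speech_sample_rate=int(env.get("SPEECH_SAMPLE_RATE", "48000")),
        tts_voice_name=env.get("TTS_VOICE_NAME", "id-ID-Standard-A"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        debug=env.get("DEBUG", "false").strip().lower() in ("1", "true", "yes"),
        # CORS_ORIGINS is a comma-separated list; drop empty entries.
        cors_origins=tuple(o.strip() for o in origins.split(",") if o.strip()),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise without touching the real environment.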

Frontend (.env file):

VITE_API_BASE_URL=http://localhost:8000
VITE_WS_BASE_URL=ws://localhost:8000
VITE_DEV_MODE=true
VITE_LOG_LEVEL=info
VITE_ENABLE_SPEECH_FEATURES=true
VITE_ENABLE_AI_CHAT=true
VITE_ENABLE_TRADITIONAL_MODE=true

9. Testing

  • Visit any scenario (warung, ojek, alfamart)
  • Toggle between "📝 Traditional" and "🗣️ AI Chat" modes
  • Test speech input (requires microphone permission)
  • Verify audio output plays automatically

10. Troubleshooting

Common Issues:

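  1. Microphone not working: Check the browser's microphone permission for the site; browsers only allow microphone access on HTTPS pages or localhost
  2. Audio not playing: Check browser audio settings and whether the tab is muted; some browsers block autoplay until you interact with the page
  3. Google Cloud errors: Verify the service account's roles and that GOOGLE_APPLICATION_CREDENTIALS points to the downloaded key file
  4. OpenAI errors: Check that OPENAI_API_KEY is set and that the account has not hit its usage limits
  5. WebSocket connection issues: Check that the backend is running on port 8000 and that VITE_WS_BASE_URL points to it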
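For the WebSocket case, a quick way to tell "backend not running" apart from other problems is to test whether anything accepts TCP connections on the backend port. This is a minimal sketch using only the standard library; `backend_is_listening` is a hypothetical helper, and a successful connection only shows that the port is open, not that the WebSocket endpoint works.

```python
import socket


def backend_is_listening(host="localhost", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts.
        return False
```

Running `backend_is_listening()` while `uv run python main.py` is active should return True if the backend uses the default port.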

Browser Compatibility:

  • Chrome/Edge: Full support
  • Firefox: Limited WebRTC support
  • Safari: May require additional permissions

11. Architecture

User speaks → Browser captures audio → WebSocket → 
Google Cloud Speech-to-Text → OpenAI GPT-4o-mini → 
Google Cloud Text-to-Speech → WebSocket → Browser plays audio
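The flow above can be sketched as three stages chained per conversation turn. This is an illustrative outline, not the app's code: `handle_turn` and the `fake_*` functions are hypothetical, and the stand-in stages replace the real Speech-to-Text, OpenAI, and Text-to-Speech calls so the example runs without credentials.

```python
def handle_turn(audio_bytes, transcribe, generate_reply, synthesize):
    """Run one turn: audio in, transcript, reply text, reply audio out."""
    transcript = transcribe(audio_bytes)
    if not transcript.strip():
        # Nothing recognizable was said, so skip the model and TTS calls.
        return None
    reply_text = generate_reply(transcript)
    reply_audio = synthesize(reply_text)
    return {"transcript": transcript, "reply_text": reply_text, "reply_audio": reply_audio}


# Stand-in stages (assumptions for illustration only).
def fake_transcribe(audio_bytes):
    return audio_bytes.decode("utf-8")


def fake_reply(transcript):
    return f"Anda bilang: {transcript}"


def fake_synthesize(text):
    return text.encode("utf-8")


result = handle_turn(b"Selamat pagi", fake_transcribe, fake_reply, fake_synthesize)
```

Passing the stages in as functions keeps each service swappable, which also makes the turn logic testable offline.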

12. Cost Considerations

  • Google Cloud Speech-to-Text: ~$0.006 per 15-second chunk
  • Google Cloud Text-to-Speech: ~$0.000004 per character
  • OpenAI GPT-4o-mini: ~$0.150 per 1M input tokens, ~$0.600 per 1M output tokens

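For typical usage (5-10 minutes of conversation), Speech-to-Text dominates: 10 minutes of audio is 40 chunks of 15 seconds, or about $0.24. Text-to-Speech and GPT-4o-mini add only a few cents, so costs should be under $0.50 per session.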
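The per-session figure can be checked with a short calculation using the prices listed above. The usage numbers in the example call (10 minutes of audio, 5,000 synthesized characters, 20,000 input and 5,000 output tokens) are assumptions chosen for illustration, not measurements from the app.

```python
import math

# Prices as listed in this section (approximate).
STT_PRICE_PER_CHUNK = 0.006
STT_CHUNK_SECONDS = 15
TTS_PRICE_PER_CHAR = 0.000004
LLM_INPUT_PRICE_PER_TOKEN = 0.150 / 1_000_000
LLM_OUTPUT_PRICE_PER_TOKEN = 0.600 / 1_000_000


def estimate_session_cost(audio_seconds, tts_characters, input_tokens, output_tokens):
    """Return a per-service cost breakdown in USD for one session."""
    # Audio is billed in whole 15-second chunks, so round up.
    stt = math.ceil(audio_seconds / STT_CHUNK_SECONDS) * STT_PRICE_PER_CHUNK
    tts = tts_characters * TTS_PRICE_PER_CHAR
    llm = input_tokens * LLM_INPUT_PRICE_PER_TOKEN + output_tokens * LLM_OUTPUT_PRICE_PER_TOKEN
    return {"stt": stt, "tts": tts, "llm": llm, "total": stt + tts + llm}


# Hypothetical session: 10 minutes of speech, 5,000 TTS characters,
# 20,000 input tokens and 5,000 output tokens.
cost = estimate_session_cost(600, 5_000, 20_000, 5_000)
```

Under these assumptions the total comes to roughly $0.27, with Speech-to-Text accounting for about $0.24 of it.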