# Deployment Configuration Guide

## Critical Issues and Solutions

### 1. Cache Directory Permissions

**Problem**: `PermissionError: [Errno 13] Permission denied: '/.cache'`

**Solution**: The code now automatically detects Docker and uses `/tmp/huggingface_cache`. However, ensure the Dockerfile sets the proper permissions.

**Dockerfile Fix**:
```dockerfile
# Create cache directory with proper permissions
RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
```

### 2. User ID Issues

**Problem**: `KeyError: 'getpwuid(): uid not found: 1000'`

**Solution**: Run the container as a user that actually exists, or create that user in the image.

**Option A - Use root (simplest for HF Spaces)**:
```dockerfile
# Already running as root in HF Spaces - this is fine
# Just ensure cache directories are writable
```

**Option B - Create user in Dockerfile**:
```dockerfile
RUN useradd -m -u 1000 -s /bin/bash appuser && \
    mkdir -p /tmp/huggingface_cache && \
    chown -R appuser:appuser /tmp/huggingface_cache /app
USER appuser
```

**For Hugging Face Spaces**: Spaces typically run as root, so Option A is fine.

### 3. HuggingFace Token Configuration

**Problem**: Gated repository access errors

**Solution**: Set `HF_TOKEN` in Hugging Face Spaces secrets.

**Steps**:
1. Go to your Space → Settings → Repository secrets
2. Add `HF_TOKEN` with your Hugging Face access token
3. The token must have read access to gated models

**Verify Token**:
```bash
# Test token access
curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
```

### 4. GPU Tensor Device Placement

**Problem**: `Tensor on device cuda:0 is not on the expected device meta!`

**Solution**: Use explicit device placement instead of `device_map="auto"` for non-quantized models.

**Code Fix**: Already implemented in `src/local_model_loader.py` - it uses `device_map="auto"` only with quantization, and explicit placement otherwise.
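For reference, the pattern described in sections 1 and 4 looks roughly like the sketch below. This is a hedged reconstruction, not the actual contents of `src/local_model_loader.py`: the helper names (`resolve_cache_dir`, `load_model`), the `/.dockerenv` check, and the 4-bit config are illustrative assumptions.

```python
# Sketch only - names and the Docker check are illustrative assumptions,
# not the actual implementation in src/local_model_loader.py.
import os

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig


def resolve_cache_dir():
    """Use /tmp/huggingface_cache inside Docker, the default HF cache otherwise."""
    if os.path.exists("/.dockerenv"):  # common (not guaranteed) Docker marker
        cache_dir = "/tmp/huggingface_cache"
    else:
        cache_dir = os.getenv("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


def load_model(model_id, quantize=True):
    cache_dir = resolve_cache_dir()
    if quantize:
        # Quantized path: let accelerate shard the model, so device_map="auto" is safe
        return AutoModelForCausalLM.from_pretrained(
            model_id,
            cache_dir=cache_dir,
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
            device_map="auto",
        )
    # Non-quantized path: explicit placement avoids the meta-device mismatch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_dir)
    return model.to(device)
```

The key point is the branch: `device_map="auto"` appears only on the quantized path, while the plain load is moved to the target device explicitly.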
### 5. Model Selection for Testing

**Current Models**:
- Primary: `Qwen/Qwen2.5-7B-Instruct` (gated - requires access)
- Fallback: `microsoft/Phi-3-mini-4k-instruct` (non-gated, verified)

**For Testing Without Gated Models**: Update `src/models_config.py` to use non-gated models:
```python
"reasoning_primary": {
    "model_id": "microsoft/Phi-3-mini-4k-instruct",  # Non-gated
    ...
}
```

## Recommended Dockerfile Updates

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    cmake \
    libopenblas-dev \
    libomp-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create cache directories with proper permissions
RUN mkdir -p /tmp/huggingface_cache && \
    chmod 777 /tmp/huggingface_cache && \
    mkdir -p /tmp/logs && \
    chmod 777 /tmp/logs

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PORT=7860
ENV OMP_NUM_THREADS=4
ENV MKL_NUM_THREADS=4
ENV DB_PATH=/tmp/sessions.db
ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
ENV LOG_DIR=/tmp/logs
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
ENV RATE_LIMIT_ENABLED=true

# Expose port
EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:7860/api/health || exit 1

# Run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]
```

## Hugging Face Spaces Configuration

### Required Secrets:
1. `HF_TOKEN` - Your Hugging Face access token (for gated models)

### Environment Variables (Optional):
- `HF_HOME` - Auto-detected to `/tmp/huggingface_cache` in Docker
- `TRANSFORMERS_CACHE` - Auto-detected to `/tmp/huggingface_cache` in Docker

### Hardware Requirements:
- GPU: NVIDIA T4 (16GB VRAM) - ✅ detected in logs
- Memory: at least 8GB RAM
- Disk: 20GB+ for the model cache

## Verification Steps

1. **Check Cache Directory**:
```bash
ls -la /tmp/huggingface_cache
# Should show a writable directory
```

2. **Check HF Token**:
```python
import os
print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
```

3. **Check GPU**:
```python
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
```

4. **Test Model Loading**:
   - Check logs for: `✓ Cache directory verified: /tmp/huggingface_cache`
   - Check logs for: `✓ HF_TOKEN authenticated for gated model access` (if token set)
   - Check logs for: `✓ Model loaded successfully`

## Troubleshooting

### Issue: Still getting permission errors
**Fix**: Ensure the Dockerfile creates the cache directory with 777 permissions.

### Issue: Gated repository errors persist
**Fix**:
1. Verify `HF_TOKEN` is set in Spaces secrets
2. Visit the model page and request access
3. Wait for approval (usually instant)
4. Use the fallback model (Phi-3-mini) until access is granted

### Issue: Tensor device errors
**Fix**: The code now handles this - if quantization fails, it loads without quantization and uses explicit device placement.

### Issue: Model too large for GPU
**Fix**:
- The code automatically falls back to no quantization if bitsandbytes fails
- Consider using a smaller model (Phi-3-mini) for testing
- Check GPU memory: `nvidia-smi`

## Quick Start Checklist

- [ ] HF_TOKEN set in Spaces secrets
- [ ] Dockerfile creates cache directory with proper permissions
- [ ] GPU detected (check logs)
- [ ] Cache directory writable (check logs)
- [ ] Model access granted (or using the non-gated fallback)
- [ ] No tensor device errors (check logs)

## Next Steps

1. Update the Dockerfile with cache directory creation
2. Set `HF_TOKEN` in Spaces secrets
3. Request access to gated models (Qwen)
4. Test with the fallback model first (Phi-3-mini) - see the smoke-test sketch below
5. Monitor logs for successful model loading
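To support step 4, a minimal generation smoke test against the non-gated fallback might look like the sketch below. The prompt and generation settings are placeholders, and older `transformers` versions without native Phi-3 support may additionally need `trust_remote_code=True`.

```python
# Minimal smoke test for the fallback model - prompt/settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

inputs = tokenizer("Reply with the single word: ready", return_tensors="pt").to(device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

If this prints a coherent completion without permission, token, or device errors, the environment is ready for the full application.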