# Troubleshooting Guide

Common issues and solutions for Hugging Face Jobs.

## Authentication Issues

### Error: 401 Unauthorized

**Symptoms:**
```
401 Client Error: Unauthorized for url: https://huggingface.co/api/...
```

**Causes:**
- Token missing from job
- Token invalid or expired
- Token not passed correctly

**Solutions:**
1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
2. Verify `hf_whoami()` works locally
3. Re-login: `hf auth login`
4. Check token hasn't expired

**Verification:**
```python
# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
```
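A bare `assert` is skipped when Python runs with `-O` and does not say how to fix the problem. The sketch below shows a slightly more defensive variant that also rejects an empty value. `require_hf_token` is a hypothetical helper name used for illustration; it is not part of `huggingface_hub`.

```python
import os


def require_hf_token(env=None):
    """Return the HF token from the environment or fail with a clear hint.

    Hypothetical helper for illustration; not a library function.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            'HF_TOKEN is missing or empty. Pass secrets={"HF_TOKEN": "$HF_TOKEN"} '
            "in the job config."
        )
    return token
```

Calling `require_hf_token()` at the top of a script makes a missing token fail in the first second of the job rather than at the first Hub request.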
### Error: 403 Forbidden

**Symptoms:**
```
403 Client Error: Forbidden for url: https://huggingface.co/api/...
```

**Causes:**
- Token lacks required permissions
- No access to private repository
- Organization permissions insufficient

**Solutions:**
1. Ensure token has write permissions
2. Check token type at https://huggingface.co/settings/tokens
3. Verify access to target repository
4. Use organization token if needed
### Error: Token not found in environment

**Symptoms:**
```
KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found
```

**Causes:**
- `secrets` not passed in job config
- Wrong key name (should be `HF_TOKEN`)
- Using `env` instead of `secrets`

**Solutions:**
1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
2. Verify key name is exactly `HF_TOKEN`
3. Check job config syntax
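The three causes above can be checked before a job is submitted. The following is a minimal sketch, assuming the job config is a plain dictionary with optional `secrets` and `env` entries as in this guide's examples; `check_token_config` is a hypothetical helper, not an existing API.

```python
def check_token_config(job_config):
    """Return a list of problems with how HF_TOKEN is passed in a job config.

    Hypothetical pre-submission check for illustration only.
    """
    problems = []
    secrets = job_config.get("secrets") or {}
    env = job_config.get("env") or {}

    if "HF_TOKEN" in env:
        problems.append("HF_TOKEN is set in 'env'; pass it through 'secrets' instead.")
    if "HF_TOKEN" not in secrets:
        # Look for near-misses such as 'hf_token' or 'HF-TOKEN'.
        lookalikes = [k for k in secrets if k.upper().replace("-", "_") == "HF_TOKEN"]
        if lookalikes:
            problems.append(f"Secret key {lookalikes[0]!r} should be exactly 'HF_TOKEN'.")
        else:
            problems.append("No 'HF_TOKEN' entry in 'secrets'.")
    return problems


for problem in check_token_config({"script": "...", "env": {"HF_TOKEN": "$HF_TOKEN"}}):
    print(problem)
```

An empty list means the token is wired through `secrets` under the expected key.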
## Job Execution Issues

### Error: Job Timeout

**Symptoms:**
- Job stops unexpectedly
- Status shows "TIMEOUT"
- Partial results only

**Causes:**
- Default 30min timeout exceeded
- Job takes longer than expected
- No timeout specified

**Solutions:**
1. Check logs for actual runtime
2. Increase timeout with buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks
5. Add 20-30% buffer to estimated time

**Example:**
```python
hf_jobs("uv", {
    "script": "...",
    "timeout": "2h"  # Set appropriate timeout
})
```
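The 20-30% buffer is simple arithmetic. The sketch below pads an estimated runtime and rounds up to whole hours, so the result uses the same `"<n>h"` form as the examples above. `timeout_with_buffer` is a hypothetical helper name.

```python
import math


def timeout_with_buffer(estimated_minutes, buffer_percent=25):
    """Pad an estimated runtime and return it as a whole-hour timeout string."""
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    # Integer-friendly arithmetic avoids float surprises such as 60 * 1.3.
    padded_minutes = math.ceil(estimated_minutes * (100 + buffer_percent) / 100)
    hours = math.ceil(padded_minutes / 60)
    return f"{hours}h"


print(timeout_with_buffer(100))  # 100 min + 25% = 125 min, rounded up to "3h"
```

Rounding up to hours is deliberately coarse: it costs nothing when the job finishes early and avoids losing work to a timeout a few minutes short.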
### Error: Out of Memory (OOM)

**Symptoms:**
```
RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array
```

**Causes:**
- Batch size too large
- Model too large for hardware
- Insufficient GPU memory

**Solutions:**
1. Reduce batch size
2. Process data in smaller chunks
3. Upgrade hardware: cpu → t4 → a10g → a100
4. Use smaller models or quantization
5. Enable gradient checkpointing (for training)

**Example:**
```python
# Reduce batch size
batch_size = 1

# Process in chunks (`chunks` and `process` stand in for your own data and function)
for chunk in chunks:
    process(chunk)
```
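When the right batch size is not known in advance, one option is to start large and shrink on failure. The sketch below halves the batch size whenever a `MemoryError` is raised and retries from the same position. `process_with_backoff` and `process_batch` are hypothetical names; for GPU work the exception to catch would be the `RuntimeError` shown in the symptoms above.

```python
def process_with_backoff(items, process_batch, batch_size=32, min_batch_size=1):
    """Process items in batches, halving the batch size after a MemoryError."""
    results = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(process_batch(batch))
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; a larger flavor is needed
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results


def demo_batch(batch):
    # Stand-in workload that "runs out of memory" above 4 items.
    if len(batch) > 4:
        raise MemoryError("simulated OOM")
    return [x * 2 for x in batch]


print(process_with_backoff(list(range(10)), demo_batch, batch_size=16))
```

If the minimum batch size still fails, the error is re-raised, which is the signal to move to larger hardware or a smaller model.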
### Error: Missing Dependencies

**Symptoms:**
```
ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'
```

**Causes:**
- Package not in dependencies
- Wrong package name
- Version mismatch

**Solutions:**
1. Add to PEP 723 header:
   ```python
   # /// script
   # dependencies = ["package-name>=1.0.0"]
   # ///
   ```
2. Check package name spelling
3. Specify version if needed
4. Check package availability
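A quick way to catch missing dependencies before the real work starts is to check that each module can be located. The sketch below uses the standard library's `importlib.util.find_spec`; `missing_modules` is a hypothetical helper name, and the module list is only an example.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names, not package names: for example, the package "Pillow" is imported as "PIL".
required = ["json", "csv"]
missing = missing_modules(required)
if missing:
    raise SystemExit(f"Missing modules: {missing}. Add them to the PEP 723 header.")
print("All required modules found")
```

Note that the check uses import names, which can differ from the package names listed under `dependencies`.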
### Error: Script Not Found

**Symptoms:**
```
FileNotFoundError: script.py not found
```

**Causes:**
- Local file path used (not supported)
- URL incorrect
- Script not accessible

**Solutions:**
1. Use inline script (recommended)
2. Use publicly accessible URL
3. Upload script to Hub first
4. Check URL is correct

**Correct approaches:**
```python
# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})

# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
```
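Since local paths are not supported, a script kept in a local file can be read and passed inline instead. A minimal sketch, assuming the file is UTF-8 text; `load_inline_script` is a hypothetical helper name.

```python
from pathlib import Path


def load_inline_script(path):
    """Read a local script so its contents can be passed as the "script" value."""
    source = Path(path).read_text(encoding="utf-8")
    if "# /// script" not in source:
        # The PEP 723 metadata block is where dependencies are declared.
        print(f"warning: {path} has no PEP 723 '# /// script' header")
    return source


# Usage, with hf_jobs called as elsewhere in this guide:
# hf_jobs("uv", {"script": load_inline_script("script.py")})
```

The warning is only a reminder: a script without the header still runs, but any third-party imports will fail with the dependency errors described earlier.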
## Hub Push Issues

### Error: Push Failed

**Symptoms:**
```
Error pushing to Hub
Upload failed
```

**Causes:**
- Network issues
- Token missing or invalid
- Repository access denied
- File too large

**Solutions:**
1. Check token: `assert "HF_TOKEN" in os.environ`
2. Verify repository exists or can be created
3. Check network connectivity in logs
4. Retry push operation
5. Split large files into chunks
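Retrying helps with transient network failures, though not with token or permission errors. The sketch below wraps any callable in a retry loop with exponential backoff; `retry` is a hypothetical helper, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def retry(operation, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Usage with a push call such as the one used in this guide:
# retry(lambda: dataset.push_to_hub("username/dataset"))
```

Because the last failure is re-raised, a push that never succeeds still fails the job visibly instead of passing silently.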
### Error: Repository Not Found

**Symptoms:**
```
404 Client Error: Not Found
Repository not found
```

**Causes:**
- Repository doesn't exist
- Wrong repository name
- No access to private repo

**Solutions:**
1. Create repository first:
   ```python
   from huggingface_hub import HfApi

   api = HfApi()
   api.create_repo("username/repo-name", repo_type="dataset")
   ```
2. Check repository name format
3. Verify namespace exists
4. Check repository visibility
### Error: Results Not Saved

**Symptoms:**
- Job completes successfully
- No results visible on Hub
- Files not persisted

**Causes:**
- No persistence code in script
- Push code not executed
- Push failed silently

**Solutions:**
1. Add persistence code to script
2. Verify push executes successfully
3. Check logs for push errors
4. Add error handling around push

**Example:**
```python
try:
    dataset.push_to_hub("username/dataset")
    print("✅ Push successful")
except Exception as e:
    print(f"❌ Push failed: {e}")
    raise
```
## Hardware Issues

### Error: GPU Not Available

**Symptoms:**
```
CUDA not available
No GPU found
```

**Causes:**
- CPU flavor used instead of GPU
- GPU not requested
- CUDA not installed in image

**Solutions:**
1. Use GPU flavor: `"flavor": "a10g-large"`
2. Check image has CUDA support
3. Verify GPU availability in logs
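A missing GPU is cheapest to detect at the very start of a script, before any model or data is loaded. A minimal sketch, assuming PyTorch is the framework in use; `cuda_available` is a hypothetical helper that also treats a missing `torch` install as "no GPU".

```python
def cuda_available():
    """Return True only if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if cuda_available():
    print("GPU detected")
else:
    # In a GPU job, raising here instead would stop the run before it wastes time on CPU.
    print("No GPU detected: check the job flavor and the image's CUDA support")
```

In a job that requires a GPU, replacing the second `print` with `raise SystemExit(...)` turns the check into a fail-fast guard.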
### Error: Slow Performance

**Symptoms:**
- Job takes longer than expected
- Low GPU utilization
- CPU bottleneck

**Causes:**
- Wrong hardware selected
- Inefficient code
- Data loading bottleneck

**Solutions:**
1. Upgrade hardware
2. Optimize code
3. Use batch processing
4. Profile code to find bottlenecks
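Before upgrading hardware, it helps to know which stage is slow. The sketch below times named stages with a small context manager from the standard library; `timed` and the stage names are hypothetical, and the two workloads are stand-ins for real data loading and compute.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Accumulate wall-clock seconds spent in a named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


with timed("load"):
    data = list(range(100_000))  # stand-in for data loading
with timed("compute"):
    total = sum(x * x for x in data)  # stand-in for the real work

# Print the slowest stage first.
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {seconds:.3f}s")
```

If the loading stage dominates, a faster GPU will not help; the data pipeline is the bottleneck.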
## General Issues

### Error: Job Status Unknown

**Symptoms:**
- Can't check job status
- Status API returns error

**Solutions:**
1. Use job URL: `https://huggingface.co/jobs/username/job-id`
2. Check logs: `hf_jobs("logs", {"job_id": "..."})`
3. Inspect job: `hf_jobs("inspect", {"job_id": "..."})`

### Error: Logs Not Available

**Symptoms:**
- No logs visible
- Logs delayed

**Causes:**
- Job just started (logs delayed 30-60s)
- Job failed before logging
- Logs not yet generated

**Solutions:**
1. Wait 30-60 seconds after job start
2. Check job status first
3. Use job URL for web interface
### Error: Cost Unexpectedly High

**Symptoms:**
- Job costs more than expected
- Longer runtime than estimated

**Causes:**
- No timeout set, or timeout set far above the expected runtime
- Wrong hardware selected
- Inefficient code

**Solutions:**
1. Monitor job runtime
2. Set appropriate timeout
3. Optimize code
4. Choose right hardware
5. Check cost estimates before running
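A rough estimate before running is just rate times runtime, and the timeout gives an upper bound. The sketch below assumes cost scales linearly with runtime; the hourly rate is a placeholder, not a real price, so substitute the current rate for the chosen flavor. `estimate_cost` is a hypothetical helper name.

```python
def estimate_cost(hourly_rate_usd, runtime_minutes):
    """Estimate job cost, assuming cost scales linearly with runtime."""
    return hourly_rate_usd * runtime_minutes / 60


PLACEHOLDER_RATE_USD = 1.00  # hypothetical rate for illustration only

expected = estimate_cost(PLACEHOLDER_RATE_USD, runtime_minutes=90)
worst_case = estimate_cost(PLACEHOLDER_RATE_USD, runtime_minutes=120)  # a "2h" timeout

print(f"Expected:   ${expected:.2f}")
print(f"Worst case: ${worst_case:.2f} (if the job runs until the timeout)")
```

Comparing the two numbers shows how much a generous timeout can add if the job hangs instead of finishing.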
## Debugging Tips

### 1. Add Logging

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Starting processing...")
count = 0  # placeholder: replace with the number of items processed
logger.info(f"Processed {count} items")
```
### 2. Verify Environment

```python
import os
import sys

import torch  # must be listed in the script's dependencies

print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
```
### 3. Test Locally First

Run the script locally before submitting to catch errors early:

```bash
python script.py
```

### 4. Check Job Logs

```python
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Or use job URL
# https://huggingface.co/jobs/username/job-id
```
### 5. Add Error Handling

```python
import traceback

try:
    # Your code
    process_data()
except Exception as e:
    print(f"Error: {e}")
    traceback.print_exc()  # full stack trace ends up in the job logs
    raise
```
## Quick Reference

### Common Error Codes

| Code | Meaning | Solution |
|------|---------|----------|
| 401 | Unauthorized | Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` |
| 403 | Forbidden | Check token permissions |
| 404 | Not Found | Verify repository exists |
| 500 | Server Error | Retry or contact support |

### Checklist Before Submitting

- [ ] Token configured: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
- [ ] Timeout set appropriately
- [ ] Hardware selected correctly
- [ ] Dependencies listed in PEP 723 header
- [ ] Persistence code included
- [ ] Error handling added
- [ ] Logging added for debugging
## Getting Help

If issues persist:

1. **Check logs** - Most errors include detailed messages
2. **Review documentation** - See main SKILL.md
3. **Check Hub status** - https://status.huggingface.co
4. **Community forums** - https://discuss.huggingface.co
5. **GitHub issues** - For bugs in huggingface_hub

## Key Takeaways

1. **Always include token** - `secrets={"HF_TOKEN": "$HF_TOKEN"}`
2. **Set appropriate timeout** - Default 30min may be insufficient
3. **Verify persistence** - Results won't persist without code
4. **Check logs** - Most issues visible in job logs
5. **Test locally** - Catch errors before submitting
6. **Add error handling** - Better debugging information
7. **Monitor costs** - Set timeouts to avoid unexpected charges