Troubleshooting Guide
Common issues and solutions for Hugging Face Jobs.
Authentication Issues
Error: 401 Unauthorized
Symptoms:
401 Client Error: Unauthorized for url: https://huggingface.co/api/...
Causes:
- Token missing from job
- Token invalid or expired
- Token not passed correctly
Solutions:
- Add secrets={"HF_TOKEN": "$HF_TOKEN"} to job config
- Verify hf_whoami() works locally
- Re-login: hf auth login
- Check token hasn't expired
Verification:
# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
Error: 403 Forbidden
Symptoms:
403 Client Error: Forbidden for url: https://huggingface.co/api/...
Causes:
- Token lacks required permissions
- No access to private repository
- Organization permissions insufficient
Solutions:
- Ensure token has write permissions
- Check token type at https://huggingface.co/settings/tokens
- Verify access to target repository
- Use organization token if needed
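To see which kind of token a job is running with, the role can be read from the dict returned by HfApi().whoami(). The sketch below uses a hypothetical helper, token_role, on a hand-written sample. The nested auth / accessToken / role layout is an assumption about the response shape and should be checked against a real response before relying on it.

```python
from typing import Optional


def token_role(whoami_info: dict) -> Optional[str]:
    """Return the token role from a whoami-style dict, or None if absent.

    ASSUMPTION: the role is stored at info["auth"]["accessToken"]["role"].
    """
    access_token = whoami_info.get("auth", {}).get("accessToken") or {}
    return access_token.get("role")


# Hand-written sample for illustration; this is not real API output.
sample = {"name": "username", "auth": {"accessToken": {"role": "read"}}}

role = token_role(sample)
if role != "write":
    print(f"Token role is {role!r}; pushing to the Hub may fail with 403")
```

In a job script the sample dict would be replaced by the result of HfApi().whoami(), which needs network access and a valid token.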
Error: Token not found in environment
Symptoms:
KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found
Causes:
- secrets not passed in job config
- Wrong key name (should be HF_TOKEN)
- Using env instead of secrets
Solutions:
- Use secrets={"HF_TOKEN": "$HF_TOKEN"} (not env)
- Verify key name is exactly HF_TOKEN
- Check job config syntax
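A bare os.environ["HF_TOKEN"] lookup produces the KeyError shown above without saying how to fix it. A minimal sketch of a friendlier check follows; require_token is a hypothetical helper name, not part of any library.

```python
import os
from typing import Mapping


def require_token(env: Mapping[str, str] = os.environ, key: str = "HF_TOKEN") -> str:
    """Return the token from the environment or fail with an actionable message."""
    token = env.get(key, "").strip()
    if not token:
        raise RuntimeError(
            f'{key} is not set. Pass it through "secrets" (not "env") in the job config.'
        )
    return token
```

Calling require_token() at the top of the script makes the job fail immediately, before any expensive work starts.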
Job Execution Issues
Error: Job Timeout
Symptoms:
- Job stops unexpectedly
- Status shows "TIMEOUT"
- Partial results only
Causes:
- Default 30min timeout exceeded
- Job takes longer than expected
- No timeout specified
Solutions:
- Check logs for actual runtime
- Increase timeout with buffer: "timeout": "3h"
- Optimize code for faster execution
- Process data in chunks
- Add 20-30% buffer to estimated time
Example:
hf_jobs("uv", {
    "script": "...",
    "timeout": "2h"  # Set appropriate timeout
})
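The 20-30% buffer rule can be turned into a small calculation. The sketch below uses a hypothetical helper, timeout_with_buffer, and returns the value in minutes. It assumes the job config accepts an "m" suffix in the same way as the "h" suffix shown above.

```python
import math


def timeout_with_buffer(estimated_minutes: float, buffer_percent: int = 30) -> str:
    """Return a timeout string with headroom added, rounded up to whole minutes.

    ASSUMPTION: the "m" unit is accepted like the "h" unit used in this guide.
    """
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    total = math.ceil(estimated_minutes * (100 + buffer_percent) / 100)
    return f"{total}m"


print(timeout_with_buffer(60))       # prints 78m
print(timeout_with_buffer(90, 20))   # prints 108m
```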
Error: Out of Memory (OOM)
Symptoms:
RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array
Causes:
- Batch size too large
- Model too large for hardware
- Insufficient GPU memory
Solutions:
- Reduce batch size
- Process data in smaller chunks
- Upgrade hardware: cpu → t4 → a10g → a100
- Use smaller models or quantization
- Enable gradient checkpointing (for training)
Example:
# Reduce batch size
batch_size = 1
# Process in chunks
for chunk in chunks:
    process(chunk)
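The example above assumes a chunks iterable already exists. One way to produce it from an in-memory list is sketched below; iter_chunks is a hypothetical helper name.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items with at most chunk_size elements each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


for chunk in iter_chunks(list(range(10)), chunk_size=4):
    print(len(chunk))  # the last chunk may be shorter than chunk_size
```

Because only one chunk is materialized at a time, peak memory is bounded by chunk_size rather than by the full input.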
Error: Missing Dependencies
Symptoms:
ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'
Causes:
- Package not in dependencies
- Wrong package name
- Version mismatch
Solutions:
- Add to PEP 723 header:
# /// script
# dependencies = ["package-name>=1.0.0"]
# ///
- Check package name spelling
- Specify version if needed
- Check package availability
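A malformed header is easy to miss because it is just a comment. The sketch below extracts the metadata block locally before submission, using a regular expression modeled on the reference implementation in PEP 723. The helper name read_script_metadata is hypothetical, and the sketch returns the raw TOML text rather than parsing it.

```python
import re
from typing import Optional

# Pattern modeled on the PEP 723 reference implementation.
PEP723_BLOCK = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_script_metadata(script: str) -> Optional[str]:
    """Return the TOML text of the 'script' metadata block, or None if missing."""
    for match in PEP723_BLOCK.finditer(script):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Drop the leading "# " (or a bare "#") from each comment line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return None


script = '# /// script\n# dependencies = ["datasets>=2.0.0"]\n# ///\n\nprint("hello")\n'
print(read_script_metadata(script))
```

A result of None means the header was not recognized, which would explain a ModuleNotFoundError at runtime even though the dependency looks listed.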
Error: Script Not Found
Symptoms:
FileNotFoundError: script.py not found
Causes:
- Local file path used (not supported)
- URL incorrect
- Script not accessible
Solutions:
- Use inline script (recommended)
- Use publicly accessible URL
- Upload script to Hub first
- Check URL is correct
Correct approaches:
# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})
# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
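Since local paths are not supported, a script kept on disk has to be sent as inline text. The sketch below reads a file and builds the config dict; build_uv_job_config is a hypothetical helper, and the temporary file stands in for a real script.

```python
import tempfile
from pathlib import Path


def build_uv_job_config(script_path, timeout: str = "1h") -> dict:
    """Read a local script and return a config dict with its source inlined."""
    source = Path(script_path).read_text(encoding="utf-8")
    return {
        "script": source,  # file contents, not the path
        "timeout": timeout,
        "secrets": {"HF_TOKEN": "$HF_TOKEN"},
    }


# Demonstration with a throwaway file standing in for script.py.
with tempfile.TemporaryDirectory() as tmp:
    local_script = Path(tmp) / "script.py"
    local_script.write_text('print("hello from a job")\n', encoding="utf-8")
    config = build_uv_job_config(local_script)

print(config["script"])
# The resulting dict would then be passed as: hf_jobs("uv", config)
```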
Hub Push Issues
Error: Push Failed
Symptoms:
Error pushing to Hub
Upload failed
Causes:
- Network issues
- Token missing or invalid
- Repository access denied
- File too large
Solutions:
- Check token: assert "HF_TOKEN" in os.environ
- Verify repository exists or can be created
- Check network connectivity in logs
- Retry push operation
- Split large files into chunks
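Transient network failures are usually worth retrying with a growing delay. A minimal sketch of such a wrapper follows; retry is a hypothetical helper, and the attempt count and delays are illustrative defaults rather than recommended values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, base_delay: float = 2.0) -> T:
    """Call operation, retrying with exponential backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            if attempt == attempts:
                raise  # out of attempts: surface the last error in the job logs
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")


# Usage inside a job script (dataset is whatever object you are pushing):
# retry(lambda: dataset.push_to_hub("username/dataset"))
```

Re-raising after the final attempt keeps the job from finishing "successfully" with nothing saved.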
Error: Repository Not Found
Symptoms:
404 Client Error: Not Found
Repository not found
Causes:
- Repository doesn't exist
- Wrong repository name
- No access to private repo
Solutions:
- Create repository first:
from huggingface_hub import HfApi
api = HfApi()
api.create_repo("username/repo-name", repo_type="dataset")
- Check repository name format
- Verify namespace exists
- Check repository visibility
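Many 404s come from a repository ID that is missing its namespace. The sketch below checks only the namespace/name shape; split_repo_id is a hypothetical helper and deliberately ignores the Hub's full naming rules (allowed characters, length limits).

```python
from typing import Tuple


def split_repo_id(repo_id: str) -> Tuple[str, str]:
    """Split 'namespace/repo-name' into its two parts or raise ValueError.

    Simplified check: exactly one slash with non-empty text on both sides.
    """
    parts = repo_id.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"Expected 'namespace/repo-name', got {repo_id!r}")
    return parts[0], parts[1]


namespace, name = split_repo_id("username/repo-name")
print(namespace, name)
# create_repo also accepts exist_ok=True, which avoids an error
# when the repository is already there.
```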
Error: Results Not Saved
Symptoms:
- Job completes successfully
- No results visible on Hub
- Files not persisted
Causes:
- No persistence code in script
- Push code not executed
- Push failed silently
Solutions:
- Add persistence code to script
- Verify push executes successfully
- Check logs for push errors
- Add error handling around push
Example:
try:
    dataset.push_to_hub("username/dataset")
    print("✅ Push successful")
except Exception as e:
    print(f"❌ Push failed: {e}")
    raise
Hardware Issues
Error: GPU Not Available
Symptoms:
CUDA not available
No GPU found
Causes:
- CPU flavor used instead of GPU
- GPU not requested
- CUDA not installed in image
Solutions:
- Use GPU flavor: "flavor": "a10g-large"
- Check image has CUDA support
- Verify GPU availability in logs
Error: Slow Performance
Symptoms:
- Job takes longer than expected
- Low GPU utilization
- CPU bottleneck
Causes:
- Wrong hardware selected
- Inefficient code
- Data loading bottleneck
Solutions:
- Upgrade hardware
- Optimize code
- Use batch processing
- Profile code to find bottlenecks
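Before upgrading hardware, it helps to know which stage is slow. A lightweight sketch using only the standard library is shown below; the timed context manager and the stage names are illustrative, and the workload is a stand-in for real loading and processing code.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

timings: Dict[str, float] = {}


@contextmanager
def timed(label: str) -> Iterator[None]:
    """Accumulate wall-clock seconds spent inside the with-block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


with timed("load"):
    data = list(range(100_000))  # stand-in for data loading
with timed("process"):
    total = sum(x * x for x in data)  # stand-in for the real computation

# Slowest stage first, so the bottleneck is at the top of the job logs.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds:.4f}s")
```

If loading dominates, a bigger GPU is unlikely to help; the standard library's cProfile gives a finer breakdown when one stage needs closer inspection.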
General Issues
Error: Job Status Unknown
Symptoms:
- Can't check job status
- Status API returns error
Solutions:
- Use job URL: https://huggingface.co/jobs/username/job-id
- Check logs: hf_jobs("logs", {"job_id": "..."})
- Inspect job: hf_jobs("inspect", {"job_id": "..."})
Error: Logs Not Available
Symptoms:
- No logs visible
- Logs delayed
Causes:
- Job just started (logs delayed 30-60s)
- Job failed before logging
- Logs not yet generated
Solutions:
- Wait 30-60 seconds after job start
- Check job status first
- Use job URL for web interface
Error: Cost Unexpectedly High
Symptoms:
- Job costs more than expected
- Longer runtime than estimated
Causes:
- Job ran longer than estimated (up to the timeout)
- Wrong hardware selected
- Inefficient code
Solutions:
- Monitor job runtime
- Set appropriate timeout
- Optimize code
- Choose right hardware
- Check cost estimates before running
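Because a job is stopped at its timeout, the timeout also caps the bill: worst-case cost is roughly timeout hours times the hourly rate. The sketch below shows that arithmetic. The helper names are hypothetical and the rate is a placeholder, not a real price; look up the current rate for your hardware flavor.

```python
UNIT_MINUTES = {"m": 1, "h": 60}


def timeout_hours(timeout: str) -> float:
    """Convert a timeout string such as '2h' or '30m' into hours."""
    unit = timeout[-1]
    if unit not in UNIT_MINUTES:
        raise ValueError(f"Unsupported timeout unit in {timeout!r}")
    return float(timeout[:-1]) * UNIT_MINUTES[unit] / 60


def max_cost_usd(timeout: str, hourly_rate_usd: float) -> float:
    """Upper bound on cost if the job runs until the timeout."""
    return round(timeout_hours(timeout) * hourly_rate_usd, 2)


# 1.50 is a placeholder rate for illustration only.
print(max_cost_usd("2h", 1.50))  # prints 3.0
```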
Debugging Tips
1. Add Logging
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("Starting processing...")
count = 100  # placeholder: number of items your script processed
logger.info(f"Processed {count} items")
2. Verify Environment
import os
import sys
import torch  # assumes torch is listed in the script's dependencies
print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
3. Test Locally First
Run script locally before submitting to catch errors early:
python script.py
4. Check Job Logs
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})
# Or use job URL
# https://huggingface.co/jobs/username/job-id
5. Add Error Handling
try:
    # Your code
    process_data()
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()
    raise
Quick Reference
Common Error Codes
| Code | Meaning | Solution |
|---|---|---|
| 401 | Unauthorized | Add secrets={"HF_TOKEN": "$HF_TOKEN"} |
| 403 | Forbidden | Check token permissions |
| 404 | Not Found | Verify repository exists |
| 500 | Server Error | Retry or contact support |
Checklist Before Submitting
- Token configured: secrets={"HF_TOKEN": "$HF_TOKEN"}
- Script checks for token: assert "HF_TOKEN" in os.environ
- Timeout set appropriately
- Hardware selected correctly
- Dependencies listed in PEP 723 header
- Persistence code included
- Error handling added
- Logging added for debugging
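Part of this checklist can be checked mechanically before submitting. The sketch below inspects a config dict using the keys that appear in this guide (script, timeout, flavor, secrets, env); check_job_config is a hypothetical helper and does not cover persistence, error handling, or logging, which need a look at the script itself.

```python
from typing import List


def check_job_config(config: dict) -> List[str]:
    """Return a list of checklist problems found in a job config dict."""
    problems = []
    if "HF_TOKEN" not in config.get("secrets", {}):
        problems.append('HF_TOKEN missing from "secrets"')
    if "HF_TOKEN" in config.get("env", {}):
        problems.append('HF_TOKEN passed via "env"; use "secrets" instead')
    if "timeout" not in config:
        problems.append("no timeout set; the 30min default applies")
    if "flavor" not in config:
        problems.append("no hardware flavor selected")
    script = config.get("script", "")
    # URLs are fetched remotely, so only inline scripts can be inspected here.
    if not script.startswith("http") and "# /// script" not in script:
        problems.append("inline script has no PEP 723 header")
    return problems


config = {
    "script": '# /// script\n# dependencies = ["datasets"]\n# ///\nprint("ok")\n',
    "timeout": "2h",
    "flavor": "a10g-large",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
}
print(check_job_config(config))  # prints []
```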
Getting Help
If issues persist:
- Check logs - Most errors include detailed messages
- Review documentation - See main SKILL.md
- Check Hub status - https://status.huggingface.co
- Community forums - https://discuss.huggingface.co
- GitHub issues - For bugs in huggingface_hub
Key Takeaways
- Always include token - secrets={"HF_TOKEN": "$HF_TOKEN"}
- Set appropriate timeout - Default 30min may be insufficient
- Verify persistence - Results won't persist without code
- Check logs - Most issues visible in job logs
- Test locally - Catch errors before submitting
- Add error handling - Better debugging information
- Monitor costs - Set timeouts to avoid unexpected charges