# Troubleshooting Guide
Common issues and solutions for Hugging Face Jobs.
## Authentication Issues
### Error: 401 Unauthorized
**Symptoms:**
```
401 Client Error: Unauthorized for url: https://huggingface.co/api/...
```
**Causes:**
- Token missing from job
- Token invalid or expired
- Token not passed correctly
**Solutions:**
1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
2. Verify `hf_whoami()` works locally
3. Re-login: `hf auth login`
4. Check token hasn't expired
**Verification:**
```python
# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
```
### Error: 403 Forbidden
**Symptoms:**
```
403 Client Error: Forbidden for url: https://huggingface.co/api/...
```
**Causes:**
- Token lacks required permissions
- No access to private repository
- Organization permissions insufficient
**Solutions:**
1. Ensure token has write permissions
2. Check token type at https://huggingface.co/settings/tokens
3. Verify access to target repository
4. Use organization token if needed
### Error: Token not found in environment
**Symptoms:**
```
KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found
```
**Causes:**
- `secrets` not passed in job config
- Wrong key name (should be `HF_TOKEN`)
- Using `env` instead of `secrets`
**Solutions:**
1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
2. Verify key name is exactly `HF_TOKEN`
3. Check job config syntax
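A bare `KeyError` says little about what went wrong. As a minimal sketch, a small helper can read the token and fail with a message that points back to the job config. The name `require_hf_token` is hypothetical and not part of any Hugging Face library.

```python
import os


def require_hf_token(env=None):
    """Return the token stored under HF_TOKEN, or raise a descriptive error.

    `require_hf_token` is a hypothetical helper written for this guide.
    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Pass it through `secrets` in the job config "
            "(not `env`) and check that the key is spelled exactly HF_TOKEN."
        )
    return token
```

Calling `require_hf_token()` at the top of the script makes a missing token fail immediately, before any expensive work starts.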
## Job Execution Issues
### Error: Job Timeout
**Symptoms:**
- Job stops unexpectedly
- Status shows "TIMEOUT"
- Partial results only
**Causes:**
- Default 30min timeout exceeded
- Job takes longer than expected
- No timeout specified
**Solutions:**
1. Check logs for actual runtime
2. Increase timeout with buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks
5. Add 20-30% buffer to estimated time
**Example:**
```python
hf_jobs("uv", {
    "script": "...",
    "timeout": "2h"  # Set appropriate timeout
})
```
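The buffer rule from the solutions above can be turned into a small calculation. This is a sketch only: `timeout_with_buffer` is a hypothetical helper, and the `m` suffix for minutes is an assumption, since this guide only shows hour values such as `"2h"`.

```python
import math


def timeout_with_buffer(estimated_minutes, buffer_percent=30):
    """Return a timeout string padded by `buffer_percent` (hypothetical helper).

    Values under an hour are returned in minutes (assumed "m" suffix);
    longer values are rounded up to whole hours so rounding never cuts a job short.
    """
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    padded = math.ceil(estimated_minutes * (100 + buffer_percent) / 100)
    if padded < 60:
        return f"{padded}m"
    return f"{math.ceil(padded / 60)}h"
```

For example, a job estimated at 90 minutes is padded to 117 minutes and rounded up to `"2h"`.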
### Error: Out of Memory (OOM)
**Symptoms:**
```
RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array
```
**Causes:**
- Batch size too large
- Model too large for hardware
- Insufficient GPU memory
**Solutions:**
1. Reduce batch size
2. Process data in smaller chunks
3. Upgrade hardware: cpu → t4 → a10g → a100
4. Use smaller models or quantization
5. Enable gradient checkpointing (for training)
**Example:**
```python
# Reduce batch size
batch_size = 1
# Process in chunks (`chunks` and `process` stand in for your own data and logic)
for chunk in chunks:
    process(chunk)
```
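The example above assumes the data is already split into chunks. One possible way to do the splitting, without loading everything into memory at once, is a generator like the following sketch. `iter_chunks` is a hypothetical helper name.

```python
def iter_chunks(items, chunk_size):
    """Yield successive lists of at most `chunk_size` items (hypothetical helper).

    Works with any iterable, so the full dataset never has to sit in memory.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk
```

If a chunk still triggers an OOM error, lowering `chunk_size` is usually cheaper than moving to larger hardware.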
### Error: Missing Dependencies
**Symptoms:**
```
ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'
```
**Causes:**
- Package not in dependencies
- Wrong package name
- Version mismatch
**Solutions:**
1. Add to PEP 723 header:
```python
# /// script
# dependencies = ["package-name>=1.0.0"]
# ///
```
2. Check package name spelling
3. Specify version if needed
4. Check package availability
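To catch a missing or misnamed dependency before the real work starts, the script can check its imports up front. The sketch below uses the standard library's `importlib.util.find_spec`; `missing_modules` is a hypothetical helper, and it expects top-level import names, which can differ from the package name in the PEP 723 header (for example, `pillow` is imported as `PIL`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported (hypothetical helper)."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

A script could call `missing_modules(["datasets", "torch"])` as its first step and exit with a clear message if the returned list is not empty.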
### Error: Script Not Found
**Symptoms:**
```
FileNotFoundError: script.py not found
```
**Causes:**
- Local file path used (not supported)
- URL incorrect
- Script not accessible
**Solutions:**
1. Use inline script (recommended)
2. Use publicly accessible URL
3. Upload script to Hub first
4. Check URL is correct
**Correct approaches:**
```python
# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})
# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
```
## Hub Push Issues
### Error: Push Failed
**Symptoms:**
```
Error pushing to Hub
Upload failed
```
**Causes:**
- Network issues
- Token missing or invalid
- Repository access denied
- File too large
**Solutions:**
1. Check token: `assert "HF_TOKEN" in os.environ`
2. Verify repository exists or can be created
3. Check network connectivity in logs
4. Retry push operation
5. Split large files into chunks
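Retrying helps mainly with transient network failures. A minimal sketch of a retry loop with exponential backoff is shown here; `retry` is a hypothetical helper, and the attempt count and delays are illustrative defaults, not recommended values.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is used up (hypothetical helper).

    The wait doubles after every failure: base_delay, 2 * base_delay, and so on.
    The last error is re-raised so the job still fails visibly.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)
```

A push could then be wrapped as `retry(lambda: dataset.push_to_hub("username/dataset"))`. Retrying does not fix a missing token or denied access, so check those causes first.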
### Error: Repository Not Found
**Symptoms:**
```
404 Client Error: Not Found
Repository not found
```
**Causes:**
- Repository doesn't exist
- Wrong repository name
- No access to private repo
**Solutions:**
1. Create repository first:
```python
from huggingface_hub import HfApi
api = HfApi()
# exist_ok=True avoids an error if the repository already exists
api.create_repo("username/repo-name", repo_type="dataset", exist_ok=True)
```
2. Check repository name format
3. Verify namespace exists
4. Check repository visibility
### Error: Results Not Saved
**Symptoms:**
- Job completes successfully
- No results visible on Hub
- Files not persisted
**Causes:**
- No persistence code in script
- Push code not executed
- Push failed silently
**Solutions:**
1. Add persistence code to script
2. Verify push executes successfully
3. Check logs for push errors
4. Add error handling around push
**Example:**
```python
try:
    dataset.push_to_hub("username/dataset")
    print("✅ Push successful")
except Exception as e:
    print(f"❌ Push failed: {e}")
    raise
```
## Hardware Issues
### Error: GPU Not Available
**Symptoms:**
```
CUDA not available
No GPU found
```
**Causes:**
- CPU flavor used instead of GPU
- GPU not requested
- CUDA not installed in image
**Solutions:**
1. Use GPU flavor: `"flavor": "a10g-large"`
2. Check image has CUDA support
3. Verify GPU availability in logs
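One way to make GPU availability visible in the logs is to print the selected device at the start of the script. This sketch assumes a PyTorch workload; `pick_device` is a hypothetical helper that falls back to CPU when PyTorch is missing or sees no GPU.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu" (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so no CUDA device is usable
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

If the log line reads `Running on: cpu` for a job that needs a GPU, check the flavor and image before looking at the code.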
### Error: Slow Performance
**Symptoms:**
- Job takes longer than expected
- Low GPU utilization
- CPU bottleneck
**Causes:**
- Wrong hardware selected
- Inefficient code
- Data loading bottleneck
**Solutions:**
1. Upgrade hardware
2. Optimize code
3. Use batch processing
4. Profile code to find bottlenecks
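Before upgrading hardware, it helps to know which stage is slow. A lightweight option, sketched here with only the standard library, is to time each stage and print the result to the job logs. `timed` is a hypothetical helper; a full profiler such as `cProfile` gives more detail when needed.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the wrapped block took (hypothetical helper).

    If `results` is a dict, the elapsed seconds are also stored under `label`.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f}s")
```

Wrapping data loading and model inference in separate `with timed("load"):` and `with timed("inference"):` blocks shows whether the bottleneck is data loading or compute.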
## General Issues
### Error: Job Status Unknown
**Symptoms:**
- Can't check job status
- Status API returns error
**Solutions:**
1. Use job URL: `https://huggingface.co/jobs/username/job-id`
2. Check logs: `hf_jobs("logs", {"job_id": "..."})`
3. Inspect job: `hf_jobs("inspect", {"job_id": "..."})`
### Error: Logs Not Available
**Symptoms:**
- No logs visible
- Logs delayed
**Causes:**
- Job just started (logs delayed 30-60s)
- Job failed before logging
- Logs not yet generated
**Solutions:**
1. Wait 30-60 seconds after job start
2. Check job status first
3. Use job URL for web interface
### Error: Cost Unexpectedly High
**Symptoms:**
- Job costs more than expected
- Longer runtime than estimated
**Causes:**
- Timeout missing or too generous, so the job kept running
- Wrong hardware selected
- Inefficient code
**Solutions:**
1. Monitor job runtime
2. Set appropriate timeout
3. Optimize code
4. Choose right hardware
5. Check cost estimates before running
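A rough estimate can be computed before submitting. The sketch below assumes cost scales linearly with runtime; `estimate_cost` is a hypothetical helper, and the hourly rate must come from the current pricing for the chosen hardware, since no rates are given in this guide.

```python
def estimate_cost(runtime_minutes, hourly_rate_usd):
    """Estimate cost in USD, assuming linear billing by runtime (hypothetical helper)."""
    if runtime_minutes < 0 or hourly_rate_usd < 0:
        raise ValueError("runtime and rate must be non-negative")
    return round(runtime_minutes / 60 * hourly_rate_usd, 2)
```

Passing the timeout instead of the expected runtime gives the worst-case cost under the same assumption, which is a useful upper bound when choosing a timeout.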
## Debugging Tips
### 1. Add Logging
```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Starting processing...")
count = 0  # placeholder: replace with the real number of processed items
logger.info("Processed %d items", count)
```
### 2. Verify Environment
```python
import os
import sys

import torch  # requires PyTorch in the script's dependencies

print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
```
### 3. Test Locally First
Run the script locally before submitting so that errors surface early:
```bash
python script.py
# or, to install the PEP 723 dependencies automatically:
uv run script.py
```
### 4. Check Job Logs
```python
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})
# Or use job URL
# https://huggingface.co/jobs/username/job-id
```
### 5. Add Error Handling
```python
import traceback

try:
    # Your code (`process_data` is a placeholder)
    process_data()
except Exception as e:
    print(f"Error: {e}")
    traceback.print_exc()
    raise
```
## Quick Reference
### Common Error Codes
| Code | Meaning | Solution |
|------|---------|----------|
| 401 | Unauthorized | Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` |
| 403 | Forbidden | Check token permissions |
| 404 | Not Found | Verify repository exists |
| 500 | Server Error | Retry or contact support |
### Checklist Before Submitting
- [ ] Token configured: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
- [ ] Timeout set appropriately
- [ ] Hardware selected correctly
- [ ] Dependencies listed in PEP 723 header
- [ ] Persistence code included
- [ ] Error handling added
- [ ] Logging added for debugging
## Getting Help
If issues persist:
1. **Check logs** - Most errors include detailed messages
2. **Review documentation** - See main SKILL.md
3. **Check Hub status** - https://status.huggingface.co
4. **Community forums** - https://discuss.huggingface.co
5. **GitHub issues** - For bugs in huggingface_hub
## Key Takeaways
1. **Always include token** - `secrets={"HF_TOKEN": "$HF_TOKEN"}`
2. **Set appropriate timeout** - Default 30min may be insufficient
3. **Verify persistence** - Results are lost unless the script pushes them to the Hub
4. **Check logs** - Most issues visible in job logs
5. **Test locally** - Catch errors before submitting
6. **Add error handling** - Better debugging information
7. **Monitor costs** - Set timeouts to avoid unexpected charges