hf-jobs / references /troubleshooting.md

burtenshaw HF Staff

Upload folder using huggingface_hub

7200e76 verified 5 days ago


	# Troubleshooting Guide

	Common issues and solutions for Hugging Face Jobs.

	## Authentication Issues

	### Error: 401 Unauthorized

	Symptoms:
	```
	401 Client Error: Unauthorized for url: https://huggingface.co/api/...
	```

	Causes:
	- Token missing from job
	- Token invalid or expired
	- Token not passed correctly

	Solutions:
	1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
	2. Verify `hf_whoami()` works locally
	3. Re-login: `hf auth login`
	4. Check token hasn't expired

	Verification:
	```python
	# In your script
	import os
	assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
	```

	### Error: 403 Forbidden

	Symptoms:
	```
	403 Client Error: Forbidden for url: https://huggingface.co/api/...
	```

	Causes:
	- Token lacks required permissions
	- No access to private repository
	- Organization permissions insufficient

	Solutions:
	1. Ensure token has write permissions
	2. Check token type at https://huggingface.co/settings/tokens
	3. Verify access to target repository
	4. Use organization token if needed

	### Error: Token not found in environment

	Symptoms:
	```
	KeyError: 'HF_TOKEN'
	ValueError: HF_TOKEN not found
	```

	Causes:
	- `secrets` not passed in job config
	- Wrong key name (should be `HF_TOKEN`)
	- Using `env` instead of `secrets`

	Solutions:
	1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
	2. Verify key name is exactly `HF_TOKEN`
	3. Check job config syntax

	## Job Execution Issues

	### Error: Job Timeout

	Symptoms:
	- Job stops unexpectedly
	- Status shows "TIMEOUT"
	- Partial results only

	Causes:
	- Default 30min timeout exceeded
	- Job takes longer than expected
	- No timeout specified

	Solutions:
	1. Check logs for actual runtime
	2. Increase timeout with buffer: `"timeout": "3h"`
	3. Optimize code for faster execution
	4. Process data in chunks
	5. Add 20-30% buffer to estimated time

	Example:
	```python
	hf_jobs("uv", {
	    "script": "...",
	    "timeout": "2h",  # Set appropriate timeout
	})
	```
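
The buffer advice above can be sketched as a small calculation. The helper name `timeout_with_buffer` is hypothetical, and the sketch assumes the `timeout` field accepts minute strings such as `"75m"` in addition to the hour strings shown in this guide.

```python
import math


def timeout_with_buffer(estimated_minutes, buffer=0.25):
    """Return a timeout string in minutes with a safety margin added."""
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    total = math.ceil(estimated_minutes * (1 + buffer))
    return f"{total}m"


print(timeout_with_buffer(60))  # 60 min estimate with the default 25% buffer
```

Rounding up keeps the margin from being lost on short jobs.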

	### Error: Out of Memory (OOM)

	Symptoms:
	```
	RuntimeError: CUDA out of memory
	MemoryError: Unable to allocate array
	```

	Causes:
	- Batch size too large
	- Model too large for hardware
	- Insufficient GPU memory

	Solutions:
	1. Reduce batch size
	2. Process data in smaller chunks
	3. Upgrade hardware: cpu → t4 → a10g → a100
	4. Use smaller models or quantization
	5. Enable gradient checkpointing (for training)

	Example:
	```python
	# Reduce batch size
	batch_size = 1

	# Process in chunks
	for chunk in chunks:
	    process(chunk)
	```
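
A slightly fuller sketch of the same idea: split the input into chunks and halve the batch size when memory runs out. Both helper names are hypothetical. The example catches only `MemoryError`; for CUDA OOM in PyTorch you would likely catch `torch.cuda.OutOfMemoryError` instead.

```python
def iter_chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_with_backoff(items, process, batch_size=32):
    """Process all items, halving the batch size whenever memory runs out."""
    results = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(process(batch))
        except MemoryError:
            if batch_size == 1:
                raise  # cannot shrink further
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results
```

Here `process` is any callable that takes a list and returns a list of results. The smaller batch size is kept for the rest of the run rather than reset.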

	### Error: Missing Dependencies

	Symptoms:
	```
	ModuleNotFoundError: No module named 'package_name'
	ImportError: cannot import name 'X'
	```

	Causes:
	- Package not in dependencies
	- Wrong package name
	- Version mismatch

	Solutions:
	1. Add to PEP 723 header:
	```python
	# /// script
	# dependencies = ["package-name>=1.0.0"]
	# ///
	```
	2. Check package name spelling
	3. Specify version if needed
	4. Check package availability
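
To catch a missing dependency before any expensive work starts, a script can check its imports up front. This is a sketch using the standard library. The helper name `missing_modules` is hypothetical, and it handles top-level module names only.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["json", "csv"])
if missing:
    raise SystemExit(f"Add these to the PEP 723 header: {missing}")
print("All required modules are importable")
```

Note that the import name and the package name can differ (for example, the `PIL` module is installed by the `pillow` package), which is a common source of the "wrong package name" cause above.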

	### Error: Script Not Found

	Symptoms:
	```
	FileNotFoundError: script.py not found
	```

	Causes:
	- Local file path used (not supported)
	- URL incorrect
	- Script not accessible

	Solutions:
	1. Use inline script (recommended)
	2. Use publicly accessible URL
	3. Upload script to Hub first
	4. Check URL is correct

	Correct approaches:
	```python
	# ✅ Inline code
	hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})

	# ✅ From URL
	hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
	```
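
Before submitting, a quick heuristic can flag a `script` value that looks like a local path. The function below is a hypothetical sketch, not part of any Hugging Face API, and its rules are deliberately simple.

```python
def script_source_kind(script):
    """Classify a `script` value as 'url', 'local-path', or 'inline'."""
    text = script.strip()
    if text.startswith(("http://", "https://")):
        return "url"
    if "\n" not in text and text.endswith(".py"):
        return "local-path"  # single line ending in .py: probably a file name
    return "inline"


print(script_source_kind("train.py"))
```

A result of `local-path` suggests inlining the file contents or uploading the script to the Hub first.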

	## Hub Push Issues

	### Error: Push Failed

	Symptoms:
	```
	Error pushing to Hub
	Upload failed
	```

	Causes:
	- Network issues
	- Token missing or invalid
	- Repository access denied
	- File too large

	Solutions:
	1. Check token: `assert "HF_TOKEN" in os.environ`
	2. Verify repository exists or can be created
	3. Check network connectivity in logs
	4. Retry push operation
	5. Split large files into chunks
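
For transient network failures, the retry step can be sketched as a generic wrapper with exponential backoff. The `retry` helper is hypothetical; it retries on any exception, so it will not help with authentication or permission errors.

```python
import time


def retry(operation, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)
```

In a job script this could wrap the push, for example `retry(lambda: dataset.push_to_hub("username/dataset"))`.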

	### Error: Repository Not Found

	Symptoms:
	```
	404 Client Error: Not Found
	Repository not found
	```

	Causes:
	- Repository doesn't exist
	- Wrong repository name
	- No access to private repo

	Solutions:
	1. Create repository first:
	```python
	from huggingface_hub import HfApi
	api = HfApi()
	api.create_repo("username/repo-name", repo_type="dataset", exist_ok=True)  # no error if it already exists
	```
	2. Check repository name format
	3. Verify namespace exists
	4. Check repository visibility

	### Error: Results Not Saved

	Symptoms:
	- Job completes successfully
	- No results visible on Hub
	- Files not persisted

	Causes:
	- No persistence code in script
	- Push code not executed
	- Push failed silently

	Solutions:
	1. Add persistence code to script
	2. Verify push executes successfully
	3. Check logs for push errors
	4. Add error handling around push

	Example:
	```python
	try:
	    dataset.push_to_hub("username/dataset")
	    print("✅ Push successful")
	except Exception as e:
	    print(f"❌ Push failed: {e}")
	    raise
	```

	## Hardware Issues

	### Error: GPU Not Available

	Symptoms:
	```
	CUDA not available
	No GPU found
	```

	Causes:
	- CPU flavor used instead of GPU
	- GPU not requested
	- CUDA not installed in image

	Solutions:
	1. Use GPU flavor: `"flavor": "a10g-large"`
	2. Check image has CUDA support
	3. Verify GPU availability in logs

	### Error: Slow Performance

	Symptoms:
	- Job takes longer than expected
	- Low GPU utilization
	- CPU bottleneck

	Causes:
	- Wrong hardware selected
	- Inefficient code
	- Data loading bottleneck

	Solutions:
	1. Upgrade hardware
	2. Optimize code
	3. Use batch processing
	4. Profile code to find bottlenecks
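
One lightweight way to find the bottleneck is to time each stage of the script and print the totals. This is a sketch using only the standard library; the stage names and workloads are placeholders.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record the wall-clock duration of a named stage in `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


with timed("load"):
    data = list(range(100_000))  # stand-in for data loading
with timed("compute"):
    total = sum(x * x for x in data)  # stand-in for the real workload

# Slowest stage first
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {seconds:.3f}s")
```

If loading dominates, faster hardware is unlikely to help; if compute dominates on a GPU flavor with low utilization, look at batch size and data transfer.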

	## General Issues

	### Error: Job Status Unknown

	Symptoms:
	- Can't check job status
	- Status API returns error

	Solutions:
	1. Use job URL: `https://huggingface.co/jobs/username/job-id`
	2. Check logs: `hf_jobs("logs", {"job_id": "..."})`
	3. Inspect job: `hf_jobs("inspect", {"job_id": "..."})`

	### Error: Logs Not Available

	Symptoms:
	- No logs visible
	- Logs delayed

	Causes:
	- Job just started (logs delayed 30-60s)
	- Job failed before logging
	- Logs not yet generated

	Solutions:
	1. Wait 30-60 seconds after job start
	2. Check job status first
	3. Use job URL for web interface

	### Error: Cost Unexpectedly High

	Symptoms:
	- Job costs more than expected
	- Longer runtime than estimated

	Causes:
	- Job ran longer than estimated (timeout missing or set too generously)
	- Wrong hardware selected
	- Inefficient code

	Solutions:
	1. Monitor job runtime
	2. Set appropriate timeout
	3. Optimize code
	4. Choose right hardware
	5. Check cost estimates before running
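
Assuming billing is proportional to runtime, the timeout gives an upper bound on what a single job can cost. The sketch below makes that arithmetic explicit. The helper name `max_cost` is hypothetical and the rate is a placeholder, not a real price.

```python
def max_cost(hourly_rate_usd, timeout_hours):
    """Upper bound on job cost if it runs until the timeout."""
    if hourly_rate_usd < 0 or timeout_hours < 0:
        raise ValueError("rate and timeout must be non-negative")
    return round(hourly_rate_usd * timeout_hours, 2)


# The rate below is a placeholder; look up the current price for your flavor.
print(max_cost(hourly_rate_usd=1.00, timeout_hours=3))
```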

	## Debugging Tips

	### 1. Add Logging

	```python
	import logging
	logging.basicConfig(level=logging.INFO)
	logger = logging.getLogger(__name__)

	logger.info("Starting processing...")
	count = 0  # placeholder: replace with the number of items your script processed
	logger.info("Processed %d items", count)
	```

	### 2. Verify Environment

	```python
	import os
	import sys
	import torch  # requires torch in the script's dependencies

	print(f"Python version: {sys.version}")
	print(f"CUDA available: {torch.cuda.is_available()}")
	print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
	```

	### 3. Test Locally First

	Run the script locally before submitting to catch errors early:
	```bash
	python script.py
	```

	### 4. Check Job Logs

	```python
	# View logs
	hf_jobs("logs", {"job_id": "your-job-id"})

	# Or use job URL
	# https://huggingface.co/jobs/username/job-id
	```

	### 5. Add Error Handling

	```python
	import traceback

	try:
	    process_data()  # your code
	except Exception as e:
	    print(f"Error: {e}")
	    traceback.print_exc()  # full stack trace goes to the job logs
	    raise
	```

	## Quick Reference

	### Common Error Codes

	| Code | Meaning | Solution |
	|------|---------|----------|
	| 401 | Unauthorized | Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` |
	| 403 | Forbidden | Check token permissions |
	| 404 | Not Found | Verify repository exists |
	| 500 | Server Error | Retry or contact support |

	### Checklist Before Submitting

	- [ ] Token configured: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
	- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
	- [ ] Timeout set appropriately
	- [ ] Hardware selected correctly
	- [ ] Dependencies listed in PEP 723 header
	- [ ] Persistence code included
	- [ ] Error handling added
	- [ ] Logging added for debugging
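
Parts of this checklist can be automated before submission. The sketch below assumes the job config is a plain dict with the `script`, `timeout`, `flavor`, and `secrets` keys used in this guide. The `preflight` function is hypothetical and covers only the items that can be checked statically.

```python
def preflight(config):
    """Return a list of checklist problems found in a job config dict."""
    problems = []
    if "HF_TOKEN" not in config.get("secrets", {}):
        problems.append("HF_TOKEN missing from `secrets`")
    if "timeout" not in config:
        problems.append("no `timeout` set (the default may be too short)")
    if "flavor" not in config:
        problems.append("no `flavor` set (hardware not chosen explicitly)")
    script = config.get("script", "")
    is_url = script.startswith(("http://", "https://"))
    if not is_url and "# /// script" not in script:
        problems.append("inline script has no PEP 723 header")
    return problems


config = {
    "script": "# /// script\n# dependencies = []\n# ///\nprint('hello')",
    "timeout": "1h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
}
for problem in preflight(config):
    print(f"- {problem}")
```

An empty list means the static checks passed. Persistence, error handling, and logging still need a manual review of the script itself.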

	## Getting Help

	If issues persist:

	1. Check logs - Most errors include detailed messages
	2. Review documentation - See main SKILL.md
	3. Check Hub status - https://status.huggingface.co
	4. Community forums - https://discuss.huggingface.co
	5. GitHub issues - For bugs in huggingface_hub

	## Key Takeaways

	1. Always include token - `secrets={"HF_TOKEN": "$HF_TOKEN"}`
	2. Set appropriate timeout - Default 30min may be insufficient
	3. Verify persistence - Results are lost unless the script saves them explicitly (e.g., pushes to the Hub)
	4. Check logs - Most issues visible in job logs
	5. Test locally - Catch errors before submitting
	6. Add error handling - Better debugging information
	7. Monitor costs - Set timeouts to avoid unexpected charges