Troubleshooting Guide

Common issues and solutions for Hugging Face Jobs.

Authentication Issues

Error: 401 Unauthorized

Symptoms:

401 Client Error: Unauthorized for url: https://huggingface.co/api/...

Causes:

  • Token missing from job
  • Token invalid or expired
  • Token not passed correctly

Solutions:

  1. Add secrets={"HF_TOKEN": "$HF_TOKEN"} to job config
  2. Verify hf_whoami() works locally
  3. Re-login: hf auth login
  4. Check token hasn't expired

Verification:

# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"

Error: 403 Forbidden

Symptoms:

403 Client Error: Forbidden for url: https://huggingface.co/api/...

Causes:

  • Token lacks required permissions
  • No access to private repository
  • Organization permissions insufficient

Solutions:

  1. Ensure token has write permissions
  2. Check token type at https://huggingface.co/settings/tokens
  3. Verify access to target repository
  4. Use organization token if needed

Error: Token not found in environment

Symptoms:

KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found

Causes:

  • secrets not passed in job config
  • Wrong key name (should be HF_TOKEN)
  • Using env instead of secrets

Solutions:

  1. Use secrets={"HF_TOKEN": "$HF_TOKEN"} (not env)
  2. Verify key name is exactly HF_TOKEN
  3. Check job config syntax
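
A minimal sketch of a fail-fast check that turns the bare KeyError above into an actionable message. The helper name `require_hf_token` is hypothetical and not part of any Hugging Face library; it only inspects the environment mapping it is given.

```python
import os


def require_hf_token(environ=None):
    """Return the HF_TOKEN value or raise with a hint about the likely cause.

    Hypothetical helper for illustration; not part of huggingface_hub.
    """
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Pass it via secrets (not env) in the job "
            "config and check that the key name is exactly HF_TOKEN."
        )
    return token
```

Calling `require_hf_token()` as the first statement of a script should make a misconfigured job fail within seconds instead of at the first Hub request.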

Job Execution Issues

Error: Job Timeout

Symptoms:

  • Job stops unexpectedly
  • Status shows "TIMEOUT"
  • Partial results only

Causes:

  • Default 30min timeout exceeded
  • Job takes longer than expected
  • No timeout specified

Solutions:

  1. Check logs for actual runtime
  2. Increase timeout with buffer: "timeout": "3h"
  3. Optimize code for faster execution
  4. Process data in chunks
  5. Add 20-30% buffer to estimated time

Example:

hf_jobs("uv", {
    "script": "...",
    "timeout": "2h"  # Set appropriate timeout
})
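
The 20-30% buffer rule can be sketched as a small calculation. The function name `timeout_with_buffer` is hypothetical, and the "m" (minutes) suffix is an assumption: this guide only shows hour values such as "2h", so confirm the accepted units before relying on it.

```python
def timeout_with_buffer(estimated_minutes, buffer_percent=25):
    """Return a timeout string in minutes with a safety buffer added.

    Expects whole minutes (int). Integer arithmetic is used so the result
    always rounds up to the next whole minute.
    """
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    # Ceiling division: negate, floor-divide, negate again.
    total = -(-estimated_minutes * (100 + buffer_percent) // 100)
    return f"{total}m"  # assumption: the "m" suffix is accepted like "h"
```

For example, a run estimated at 120 minutes with the default 25% buffer yields "150m".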

Error: Out of Memory (OOM)

Symptoms:

RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array

Causes:

  • Batch size too large
  • Model too large for hardware
  • Insufficient GPU memory

Solutions:

  1. Reduce batch size
  2. Process data in smaller chunks
  3. Upgrade hardware: cpu → t4 → a10g → a100
  4. Use smaller models or quantization
  5. Enable gradient checkpointing (for training)

Example:

# Reduce batch size
batch_size = 1

# Process in chunks
for chunk in chunks:
    process(chunk)
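
The chunking loop above leaves `chunks` undefined. One possible way to produce it, as a sketch: the generator below (hypothetical name `iter_chunks`) works on any iterable and keeps only one chunk in memory at a time.

```python
def iter_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk
```

It can replace `chunks` directly: `for chunk in iter_chunks(rows, 100): process(chunk)`, where `rows` and `process` stand in for your own data and function.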

Error: Missing Dependencies

Symptoms:

ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'

Causes:

  • Package not in dependencies
  • Wrong package name
  • Version mismatch

Solutions:

  1. Add to PEP 723 header:
    # /// script
    # dependencies = ["package-name>=1.0.0"]
    # ///
    
  2. Check package name spelling
  3. Specify version if needed
  4. Check package availability
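
To surface missing packages at the start of a run rather than midway through it, a script could check its imports up front. This is a sketch using only the standard library; `missing_modules` is a hypothetical name. Note that it takes import names, which can differ from the package names used in the PEP 723 header (for example, `PIL` is installed by the `pillow` package).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

A script might call `missing_modules(["datasets", "torch"])` first and exit with a clear message if the returned list is not empty.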

Error: Script Not Found

Symptoms:

FileNotFoundError: script.py not found

Causes:

  • Local file path used (not supported)
  • URL incorrect
  • Script not accessible

Solutions:

  1. Use inline script (recommended)
  2. Use publicly accessible URL
  3. Upload script to Hub first
  4. Check URL is correct

Correct approaches:

# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})

# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
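
Writing the inline string by hand is error-prone because of the escaped newlines. A small builder can assemble it instead. This is a sketch: `build_inline_script` is a hypothetical helper, and it relies on JSON list syntax also being valid TOML for simple dependency strings.

```python
import json


def build_inline_script(dependencies, body):
    """Prefix Python source with a PEP 723 metadata block listing dependencies."""
    header = [
        "# /// script",
        f"# dependencies = {json.dumps(list(dependencies))}",
        "# ///",
    ]
    return "\n".join(header) + "\n\n" + body
```

The result can be passed as the "script" value in the inline form shown above.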

Hub Push Issues

Error: Push Failed

Symptoms:

Error pushing to Hub
Upload failed

Causes:

  • Network issues
  • Token missing or invalid
  • Repository access denied
  • File too large

Solutions:

  1. Check token: assert "HF_TOKEN" in os.environ
  2. Verify repository exists or can be created
  3. Check network connectivity in logs
  4. Retry push operation
  5. Split large files into chunks
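
Retrying the push can be sketched as a generic wrapper with exponential backoff. The `retry` helper below is hypothetical and uses only the standard library. It helps with transient network failures, not with invalid tokens or missing permissions.

```python
import time


def retry(operation, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries and
    re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

Usage might look like `retry(lambda: dataset.push_to_hub("username/dataset"))`, with `dataset` being whatever object your script pushes.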

Error: Repository Not Found

Symptoms:

404 Client Error: Not Found
Repository not found

Causes:

  • Repository doesn't exist
  • Wrong repository name
  • No access to private repo

Solutions:

  1. Create repository first:
    from huggingface_hub import HfApi
    api = HfApi()
    # exist_ok=True makes the call a no-op if the repository already exists
    api.create_repo("username/repo-name", repo_type="dataset", exist_ok=True)
    
  2. Check repository name format
  3. Verify namespace exists
  4. Check repository visibility

Error: Results Not Saved

Symptoms:

  • Job completes successfully
  • No results visible on Hub
  • Files not persisted

Causes:

  • No persistence code in script
  • Push code not executed
  • Push failed silently

Solutions:

  1. Add persistence code to script
  2. Verify push executes successfully
  3. Check logs for push errors
  4. Add error handling around push

Example:

try:
    dataset.push_to_hub("username/dataset")
    print("✅ Push successful")
except Exception as e:
    print(f"❌ Push failed: {e}")
    raise

Hardware Issues

Error: GPU Not Available

Symptoms:

CUDA not available
No GPU found

Causes:

  • CPU flavor used instead of GPU
  • GPU not requested
  • CUDA not installed in image

Solutions:

  1. Use GPU flavor: "flavor": "a10g-large"
  2. Check image has CUDA support
  3. Verify GPU availability in logs
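
To verify GPU availability in the logs, a script can print a one-line status at startup. This is a sketch that assumes PyTorch is the framework in use; `cuda_status` is a hypothetical helper, and it degrades gracefully when torch is not installed.

```python
def cuda_status():
    """Describe GPU visibility without crashing when torch is absent."""
    try:
        import torch  # assumed to be listed in the script's dependencies
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available (CPU flavor or image without CUDA?)"
    return f"CUDA available: {torch.cuda.get_device_name(0)}"


print(cuda_status())
```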

Error: Slow Performance

Symptoms:

  • Job takes longer than expected
  • Low GPU utilization
  • CPU bottleneck

Causes:

  • Wrong hardware selected
  • Inefficient code
  • Data loading bottleneck

Solutions:

  1. Upgrade hardware
  2. Optimize code
  3. Use batch processing
  4. Profile code to find bottlenecks
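
Profiling can be done with the standard library's `cProfile`. The wrapper below is a sketch (`profile_call` is a hypothetical name) that runs one function under the profiler and returns a text report of the most expensive calls, which can then be printed to the job logs.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=5, **kwargs):
    """Run fn under cProfile; return (result, report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Entries with high cumulative time in data loading rather than model code usually point to the data loading bottleneck listed above.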

General Issues

Error: Job Status Unknown

Symptoms:

  • Can't check job status
  • Status API returns error

Solutions:

  1. Use job URL: https://huggingface.co/jobs/username/job-id
  2. Check logs: hf_jobs("logs", {"job_id": "..."})
  3. Inspect job: hf_jobs("inspect", {"job_id": "..."})

Error: Logs Not Available

Symptoms:

  • No logs visible
  • Logs delayed

Causes:

  • Job just started (logs delayed 30-60s)
  • Job failed before logging
  • Logs not yet generated

Solutions:

  1. Wait 30-60 seconds after job start
  2. Check job status first
  3. Use job URL for web interface

Error: Cost Unexpectedly High

Symptoms:

  • Job costs more than expected
  • Longer runtime than estimated

Causes:

  • Job ran longer than estimated (timeout missing or too generous)
  • Wrong hardware selected
  • Inefficient code

Solutions:

  1. Monitor job runtime
  2. Set appropriate timeout
  3. Optimize code
  4. Choose right hardware
  5. Check cost estimates before running
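
A rough upper bound on spend follows from the timeout and the hourly price of the chosen hardware. The sketch below assumes billing is proportional to runtime at a fixed hourly rate; `worst_case_cost` is a hypothetical helper, and the rate must come from the current pricing page, since no real prices are encoded here.

```python
def worst_case_cost(timeout_hours, hourly_rate_usd):
    """Upper bound on spend if the job runs until its timeout.

    Assumes runtime-proportional billing at a fixed hourly rate.
    """
    if timeout_hours < 0 or hourly_rate_usd < 0:
        raise ValueError("inputs must be non-negative")
    return round(timeout_hours * hourly_rate_usd, 2)
```

With a placeholder rate of 1.50 per hour and a 3-hour timeout, the bound is 4.5.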

Debugging Tips

1. Add Logging

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Starting processing...")
count = 0  # placeholder: set this to the number of items actually processed
logger.info(f"Processed {count} items")

2. Verify Environment

import os
import sys

import torch  # assumes torch is listed in the script's dependencies

print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")

3. Test Locally First

Run script locally before submitting to catch errors early:

python script.py

4. Check Job Logs

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Or use job URL
# https://huggingface.co/jobs/username/job-id

5. Add Error Handling

try:
    # Your code
    process_data()
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()
    raise

Quick Reference

Common Error Codes

| Code | Meaning | Solution |
|------|---------|----------|
| 401 | Unauthorized | Add secrets={"HF_TOKEN": "$HF_TOKEN"} |
| 403 | Forbidden | Check token permissions |
| 404 | Not Found | Verify repository exists |
| 500 | Server Error | Retry or contact support |

Checklist Before Submitting

  • Token configured: secrets={"HF_TOKEN": "$HF_TOKEN"}
  • Script checks for token: assert "HF_TOKEN" in os.environ
  • Timeout set appropriately
  • Hardware selected correctly
  • Dependencies listed in PEP 723 header
  • Persistence code included
  • Error handling added
  • Logging added for debugging
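
Parts of this checklist can be checked mechanically before submitting. The sketch below is a hypothetical `preflight` helper. The key names ("secrets", "timeout", "flavor", "script") are assumed from the snippets in this guide and may not match the exact job config schema.

```python
def preflight(config):
    """Return a list of checklist items the job config appears to miss.

    Key names are assumptions taken from the snippets in this guide.
    """
    problems = []
    if "HF_TOKEN" not in config.get("secrets", {}):
        problems.append("HF_TOKEN missing from secrets")
    if not config.get("timeout"):
        problems.append("timeout not set (the default may be too short)")
    if not config.get("flavor"):
        problems.append("hardware flavor not selected")
    script = config.get("script", "")
    # Only inline scripts can be inspected; URLs are skipped.
    if not script.startswith("http") and "# /// script" not in script:
        problems.append("no PEP 723 header found in inline script")
    return problems
```

An empty list means the config passed these checks. It does not confirm that the script itself contains persistence code or error handling.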

Getting Help

If issues persist:

  1. Check logs - Most errors include detailed messages
  2. Review documentation - See main SKILL.md
  3. Check Hub status - https://status.huggingface.co
  4. Community forums - https://discuss.huggingface.co
  5. GitHub issues - For bugs in huggingface_hub

Key Takeaways

  1. Always include token - secrets={"HF_TOKEN": "$HF_TOKEN"}
  2. Set appropriate timeout - Default 30min may be insufficient
  3. Verify persistence - Results won't persist without code
  4. Check logs - Most issues visible in job logs
  5. Test locally - Catch errors before submitting
  6. Add error handling - Better debugging information
  7. Monitor costs - Set timeouts to avoid unexpected charges