burtenshaw HF Staff

Upload folder using huggingface_hub

7200e76 verified 3 days ago

Troubleshooting Guide

Common issues and solutions for Hugging Face Jobs.

Authentication Issues

Error: 401 Unauthorized

Symptoms:

401 Client Error: Unauthorized for url: https://huggingface.co/api/...

Causes:

Token missing from job
Token invalid or expired
Token not passed correctly

Solutions:

Add secrets={"HF_TOKEN": "$HF_TOKEN"} to job config
Verify hf_whoami() works locally
Re-login: hf auth login
Check token hasn't expired

Verification:

# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"

Error: 403 Forbidden

Symptoms:

403 Client Error: Forbidden for url: https://huggingface.co/api/...

Causes:

Token lacks required permissions
No access to private repository
Organization permissions insufficient

Solutions:

Ensure token has write permissions
Check token type at https://huggingface.co/settings/tokens
Verify access to target repository
For organization repositories, use a token with write access to that organization

Error: Token not found in environment

Symptoms:

KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found

Causes:

secrets not passed in job config
Wrong key name (should be HF_TOKEN)
Using env instead of secrets

Solutions:

Use secrets={"HF_TOKEN": "$HF_TOKEN"} (not env)
Verify key name is exactly HF_TOKEN
Check job config syntax
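
A script can fail fast with a readable message instead of a bare KeyError. The following is a minimal sketch; require_token is a hypothetical helper name chosen for illustration, not part of huggingface_hub.

```python
import os


def require_token(env=None, key="HF_TOKEN"):
    """Return the token stored under `key`, or raise a descriptive error.

    `require_token` is an illustrative helper, not a huggingface_hub API.
    """
    env = os.environ if env is None else env
    token = env.get(key, "").strip()
    if not token:
        raise RuntimeError(
            f"{key} is not set. Pass it to the job with "
            'secrets={"HF_TOKEN": "$HF_TOKEN"} (not env).'
        )
    return token
```

Calling require_token() as the first line of the script surfaces a missing secret before any expensive work starts.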

Job Execution Issues

Error: Job Timeout

Symptoms:

Job stops unexpectedly
Status shows "TIMEOUT"
Partial results only

Causes:

Default 30min timeout exceeded
Job takes longer than expected
No timeout specified

Solutions:

Check logs for actual runtime
Increase timeout with buffer: "timeout": "3h"
Optimize code for faster execution
Process data in chunks
Add 20-30% buffer to estimated time

Example:

hf_jobs("uv", {
    "script": "...",
    "timeout": "2h"  # Set appropriate timeout
})
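
The buffer rule above can be turned into a small calculation. This is a sketch only: timeout_with_buffer is a hypothetical helper, and it rounds up to whole hours so the result uses the same "2h" / "3h" format shown in this guide.

```python
import math


def timeout_with_buffer(estimated_minutes, buffer_percent=30):
    """Return a whole-hour timeout string such as "3h", rounded up."""
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    # Integer ceiling division: minutes * (100 + buffer)% / 60, rounded up.
    scaled = math.ceil(estimated_minutes) * (100 + int(buffer_percent))
    hours = -(-scaled // 6000)
    return f"{hours}h"


print(timeout_with_buffer(120))  # 120 min + 30% = 156 min, rounded up to "3h"
```

Rounding up errs on the side of a longer timeout, which costs nothing extra if the job finishes early.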

Error: Out of Memory (OOM)

Symptoms:

RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array

Causes:

Batch size too large
Model too large for hardware
Insufficient GPU memory

Solutions:

Reduce batch size
Process data in smaller chunks
Upgrade hardware: cpu → t4 → a10g → a100
Use smaller models or quantization
Enable gradient checkpointing (for training)

Example:

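# Reduce batch size (start at 1, then raise it until memory becomes the limit)
batch_size = 1

def process(chunk):
    print(f"Processing {len(chunk)} items")  # replace with your own logic

# Process in chunks instead of loading everything at once
items = list(range(10))  # placeholder data
for start in range(0, len(items), batch_size):
    process(items[start:start + batch_size])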
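
When the right batch size is not known in advance, one option is to halve it after each out-of-memory failure. The sketch below is illustrative: run_with_smaller_batches and step are hypothetical names, and it catches the built-in MemoryError so it runs without a GPU.

```python
def run_with_smaller_batches(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving the batch size after each MemoryError.

    Returns a tuple of (result of step, batch size that worked).
    """
    while True:
        try:
            return step(batch_size), batch_size
        except MemoryError:
            # For CUDA OOM in PyTorch, catch torch.cuda.OutOfMemoryError instead.
            if batch_size <= min_batch_size:
                raise  # cannot shrink further: surface the error
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"Out of memory, retrying with batch_size={batch_size}")
```

If the function still fails at the minimum batch size, the model itself likely does not fit and a larger flavor or quantization is the next step.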

Error: Missing Dependencies

Symptoms:

ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'

Causes:

Package not in dependencies
Wrong package name
Version mismatch

Solutions:

Add to PEP 723 header:

# /// script
# dependencies = ["package-name>=1.0.0"]
# ///

Check package name spelling
Specify version if needed
Check package availability
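
A quick check at the top of the script can report every missing module at once instead of failing on the first import. This is a sketch using only the standard library; missing_modules is a hypothetical helper and the module names are placeholders.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Use import names here; they can differ from the package names
# listed in the PEP 723 header.
required = ["json", "csv"]
missing = missing_modules(required)
if missing:
    raise SystemExit(f"Missing modules: {missing}. Add them to the PEP 723 header.")
```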

Error: Script Not Found

Symptoms:

FileNotFoundError: script.py not found

Causes:

Local file path used (not supported)
URL incorrect
Script not accessible

Solutions:

Use inline script (recommended)
Use publicly accessible URL
Upload script to Hub first
Check URL is correct

Correct approaches:

# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})

# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
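
If the script only exists as a local file, its contents can be read and passed inline. The following is a sketch under that assumption; inline_script_config is a hypothetical helper and script.py is a placeholder path.

```python
from pathlib import Path


def inline_script_config(path):
    """Read a local script and return a config dict with its source inline."""
    source = Path(path).read_text(encoding="utf-8")
    if "# /// script" not in source:
        print("Warning: no PEP 723 header found in", path)
    return {"script": source}
```

The result can then be submitted as hf_jobs("uv", inline_script_config("script.py")), so the job receives the code itself rather than a path it cannot resolve.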

Hub Push Issues

Error: Push Failed

Symptoms:

Error pushing to Hub
Upload failed

Causes:

Network issues
Token missing or invalid
Repository access denied
File too large

Solutions:

Check token: assert "HF_TOKEN" in os.environ
Verify repository exists or can be created
Check network connectivity in logs
Retry push operation
Split large files into chunks
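
Retrying a push is easier with a small wrapper. This is a minimal sketch of retry with exponential backoff; retry is a hypothetical helper, and the delays are arbitrary starting points.

```python
import time


def retry(operation, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Run operation(); on failure wait base_delay * 2**n seconds and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)
```

For example, retry(lambda: dataset.push_to_hub("username/dataset")). Retrying helps with transient network errors only; a missing or invalid token will fail on every attempt.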

Error: Repository Not Found

Symptoms:

404 Client Error: Not Found
Repository not found

Causes:

Repository doesn't exist
Wrong repository name
No access to private repo

Solutions:

Create repository first:

from huggingface_hub import HfApi
api = HfApi()
# exist_ok=True makes the call safe to repeat if the repo already exists
api.create_repo("username/repo-name", repo_type="dataset", exist_ok=True)

Check repository name format
Verify namespace exists
Check repository visibility
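
A rough format check can catch typos such as a missing namespace before the job runs. The pattern below is a deliberately simplified assumption, not the Hub's full naming rules, and looks_like_repo_id is a hypothetical helper; it does not check that the repository exists.

```python
import re

# Simplified: each part may contain letters, digits, "-", "_" and ".".
_PART = r"[A-Za-z0-9._-]+"
_REPO_ID = re.compile(rf"{_PART}/{_PART}")


def looks_like_repo_id(repo_id):
    """Return True if repo_id has the shape "namespace/repo-name"."""
    return _REPO_ID.fullmatch(repo_id) is not None


print(looks_like_repo_id("username/repo-name"))
print(looks_like_repo_id("repo-name"))
```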

Error: Results Not Saved

Symptoms:

Job completes successfully
No results visible on Hub
Files not persisted

Causes:

No persistence code in script
Push code not executed
Push failed silently

Solutions:

Add persistence code to script
Verify push executes successfully
Check logs for push errors
Add error handling around push

Example:

try:
    dataset.push_to_hub("username/dataset")
    print("✅ Push successful")
except Exception as e:
    print(f"❌ Push failed: {e}")
    raise

Hardware Issues

Error: GPU Not Available

Symptoms:

CUDA not available
No GPU found

Causes:

CPU flavor used instead of GPU
GPU not requested
CUDA not installed in image

Solutions:

Use GPU flavor: "flavor": "a10g-large"
Check image has CUDA support
Verify GPU availability in logs
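
Printing the detected device at startup makes a wrong flavor obvious in the first lines of the logs. This is a sketch that assumes the script uses PyTorch; describe_device is a hypothetical helper.

```python
def describe_device():
    """Return "cuda", "cpu", or "torch-missing" for the current environment."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        return "cuda"
    return "cpu"


device = describe_device()
print(f"Running on: {device}")
```

A script that requires a GPU can raise an error when the result is not "cuda", which stops the job early instead of running slowly on CPU.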

Error: Slow Performance

Symptoms:

Job takes longer than expected
Low GPU utilization
CPU bottleneck

Causes:

Wrong hardware selected
Inefficient code
Data loading bottleneck

Solutions:

Upgrade hardware
Optimize code
Use batch processing
Profile code to find bottlenecks
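
A lightweight way to find the bottleneck is to time each stage and print the result to the logs. The sketch below uses only the standard library; timed and the stage names are illustrative, and the workloads are stand-ins.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record the wall-clock duration of a named stage in `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start
        print(f"{stage}: {timings[stage]:.2f}s")


with timed("load"):
    data = list(range(100_000))       # stand-in for data loading
with timed("process"):
    total = sum(x * x for x in data)  # stand-in for the real work
```

If loading dominates, faster hardware is unlikely to help; if processing dominates, a larger flavor or bigger batches may.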

General Issues

Error: Job Status Unknown

Symptoms:

Can't check job status
Status API returns error

Solutions:

Use job URL: https://huggingface.co/jobs/username/job-id
Check logs: hf_jobs("logs", {"job_id": "..."})
Inspect job: hf_jobs("inspect", {"job_id": "..."})

Error: Logs Not Available

Symptoms:

No logs visible
Logs delayed

Causes:

Job just started (logs delayed 30-60s)
Job failed before logging
Logs not yet generated

Solutions:

Wait 30-60 seconds after job start
Check job status first
Use job URL for web interface

Error: Cost Unexpectedly High

Symptoms:

Job costs more than expected
Longer runtime than estimated

Causes:

Timeout set too generously, so a slow or stalled job kept running
Wrong hardware selected
Inefficient code

Solutions:

Monitor job runtime
Set appropriate timeout
Optimize code
Choose right hardware
Check cost estimates before running
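
The timeout also bounds the worst-case cost. The sketch below assumes billing is proportional to runtime; max_cost_usd is a hypothetical helper and the hourly rate is a placeholder, not a real price.

```python
def max_cost_usd(timeout_hours, hourly_rate_usd):
    """Upper bound on cost if the job runs until its timeout."""
    if timeout_hours < 0 or hourly_rate_usd < 0:
        raise ValueError("inputs must be non-negative")
    return round(timeout_hours * hourly_rate_usd, 2)


# Placeholder rate: look up the current price for your flavor before
# relying on the result.
print(max_cost_usd(timeout_hours=3, hourly_rate_usd=1.50))
```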

Debugging Tips

1. Add Logging

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Starting processing...")
count = 0  # replace with the number of items your script handled
logger.info("Processed %d items", count)

2. Verify Environment

import os
import sys

import torch  # only if your script uses PyTorch

print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")

3. Test Locally First

Run script locally before submitting to catch errors early:

python script.py

4. Check Job Logs

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Or use job URL
# https://huggingface.co/jobs/username/job-id

5. Add Error Handling

import traceback

try:
    process_data()  # your code
except Exception as e:
    print(f"Error: {e}")
    traceback.print_exc()  # full stack trace in the job logs
    raise  # re-raise so the failure is not swallowed

Quick Reference

Common Error Codes

Code	Meaning	Solution
401	Unauthorized	Add `secrets={"HF_TOKEN": "$HF_TOKEN"}`
403	Forbidden	Check token permissions
404	Not Found	Verify repository exists
500	Server Error	Retry or contact support

Checklist Before Submitting

Token configured: secrets={"HF_TOKEN": "$HF_TOKEN"}
Script checks for token: assert "HF_TOKEN" in os.environ
Timeout set appropriately
Hardware selected correctly
Dependencies listed in PEP 723 header
Persistence code included
Error handling added
Logging added for debugging
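
Part of this checklist can be automated before submitting. The sketch below assumes the job config is a plain dict that uses the "script", "secrets", "env", "timeout" and "flavor" keys mentioned in this guide; check_job_config is a hypothetical helper.

```python
def check_job_config(config):
    """Return a list of problems found in a job config dict (empty if none)."""
    problems = []
    if "HF_TOKEN" not in config.get("secrets", {}):
        problems.append('missing secrets={"HF_TOKEN": "$HF_TOKEN"}')
    if "HF_TOKEN" in config.get("env", {}):
        problems.append("HF_TOKEN is in env; move it to secrets")
    if "timeout" not in config:
        problems.append("no timeout set; the 30min default applies")
    if "flavor" not in config:
        problems.append("no flavor set; confirm the default hardware is enough")
    if not config.get("script"):
        problems.append("no script provided")
    return problems
```

Running it on the config before submission and printing the returned list covers the token, timeout and hardware items; persistence, error handling and logging still need a manual look at the script.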

Getting Help

If issues persist:

Check logs - Most errors include detailed messages
Review documentation - See main SKILL.md
Check Hub status - https://status.huggingface.co
Community forums - https://discuss.huggingface.co
GitHub issues - For bugs in huggingface_hub

Key Takeaways

Always include token - secrets={"HF_TOKEN": "$HF_TOKEN"}
Set appropriate timeout - Default 30min may be insufficient
Verify persistence - Results are lost when the job ends unless the script saves them, for example by pushing to the Hub
Check logs - Most issues visible in job logs
Test locally - Catch errors before submitting
Add error handling - Better debugging information
Monitor costs - Set timeouts to avoid unexpected charges