hf-jobs / SKILL.md

burtenshaw HF Staff

Upload folder using huggingface_hub

7200e76 verified 3 days ago

metadata

name: hf-jobs
description: >-
  This skill should be used when users want to run any workload on Hugging Face
  Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection,
  cost estimation, authentication with tokens, secrets management, timeout
  configuration, and result persistence. Designed for general-purpose compute
  workloads including data processing, inference, experiments, batch jobs, and
  any Python-based tasks. Should be invoked for tasks involving cloud compute,
  GPU workloads, or when users mention running jobs on Hugging Face
  infrastructure without local setup.
license: Complete terms in LICENSE.txt

Running Workloads on Hugging Face Jobs

Overview

Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.

Common use cases:

Data Processing - Transform, filter, or analyze large datasets
Batch Inference - Run inference on thousands of samples
Experiments & Benchmarks - Reproducible ML experiments
Model Training - Fine-tune models (see model-trainer skill for TRL-specific training)
Synthetic Data Generation - Generate datasets using LLMs
Development & Testing - Test code without local GPU setup
Scheduled Jobs - Automate recurring tasks

For model training specifically: See the model-trainer skill for TRL-based training workflows.

When to Use This Skill

Use this skill when users want to:

Run Python workloads on cloud infrastructure
Execute jobs without local GPU/TPU setup
Process data at scale
Run batch inference or experiments
Schedule recurring tasks
Use GPUs/TPUs for any workload
Persist results to the Hugging Face Hub

Key Directives

When assisting with jobs:

ALWAYS use hf_jobs() MCP tool - Submit jobs using hf_jobs("uv", {...}) or hf_jobs("run", {...}). The script parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to hf_jobs().
Always handle authentication - Jobs that interact with the Hub require HF_TOKEN via secrets. See Token Usage section below.
Provide job details after submission - After submitting, provide job ID, monitoring URL, estimated time, and note that the user can request status checks later.
Set appropriate timeouts - The default of 30 minutes may be insufficient for long-running tasks.

Prerequisites Checklist

Before starting any job, verify:

✅ Account & Authentication

Hugging Face account with a Pro, Team, or Enterprise plan (Jobs require a paid plan)
Authenticated login: Check with hf_whoami()
HF_TOKEN for Hub Access ⚠️ CRITICAL - Required for any Hub operations (push models/datasets, download private repos, etc.)
Token must have appropriate permissions (read for downloads, write for uploads)

✅ Token Usage (See Token Usage section for details)

When tokens are required:

Pushing models/datasets to Hub
Accessing private repositories
Using Hub APIs in scripts
Any authenticated Hub operations

How to provide tokens:

{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Recommended: automatic token
}

⚠️ CRITICAL: The $HF_TOKEN placeholder is automatically replaced with your logged-in token. Never hardcode tokens in scripts.

Token Usage Guide

Understanding Tokens

What are HF Tokens?

Authentication credentials for Hugging Face Hub
Required for authenticated operations (push, private repos, API access)
Stored securely on your machine after hf auth login

Token Types:

Read Token - Can download models/datasets, read private repos
Write Token - Can push models/datasets, create repos, modify content
Organization Token - Can act on behalf of an organization

When Tokens Are Required

Always Required:

Pushing models/datasets to Hub
Accessing private repositories
Creating new repositories
Modifying existing repositories
Using Hub APIs programmatically

Not Required:

Downloading public models/datasets
Running jobs that don't interact with Hub
Reading public repository information

How to Provide Tokens to Jobs

Method 1: Automatic Token (Recommended)

hf_jobs("uv", {
    "script": script,  # Inline source string or URL (see Working with Scripts)
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement
})

How it works:

$HF_TOKEN is a placeholder that gets replaced with your actual token
Uses the token from your logged-in session (hf auth login)
Most secure and convenient method
Token is encrypted server-side when passed as a secret

Benefits:

No token exposure in code
Uses your current login session
Automatically updated if you re-login
Works seamlessly with MCP tools

Method 2: Explicit Token (Not Recommended)

hf_jobs("uv", {
    "script": script,  # Inline source string or URL (see Working with Scripts)
    "secrets": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Hardcoded token
})

When to use:

Only if automatic token doesn't work
Testing with a specific token
Organization tokens (use with caution)

Security concerns:

Token visible in code/logs
Must manually update if token rotates
Risk of token exposure

Method 3: Environment Variable (Less Secure)

hf_jobs("uv", {
    "script": script,  # Inline source string or URL (see Working with Scripts)
    "env": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Less secure than secrets
})

Difference from secrets:

env variables are visible in job logs
secrets are encrypted server-side
Always prefer secrets for tokens
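
The difference between the three methods can be checked mechanically before a job is submitted. The sketch below is a hypothetical helper (`audit_token_config` is not part of any Hugging Face tooling) that inspects a job config dict and flags a literal token or a token passed through env. It assumes tokens start with `hf_`, as the verification snippet in this guide suggests.

```python
def audit_token_config(config: dict) -> list[str]:
    """Return warnings about how HF_TOKEN is passed in a job config.

    Hypothetical helper for illustration: it only inspects the dict and
    never contacts the Hub.
    """
    warnings = []
    secrets = config.get("secrets", {})
    env = config.get("env", {})

    # Method 3: token passed as a plain environment variable.
    if "HF_TOKEN" in env:
        warnings.append("HF_TOKEN is in 'env'; move it to 'secrets'.")

    # Method 2: a literal token instead of the $HF_TOKEN placeholder.
    for section_name, section in (("secrets", secrets), ("env", env)):
        value = str(section.get("HF_TOKEN", ""))
        if value.startswith("hf_"):
            warnings.append(
                f'Literal token in "{section_name}"; use the "$HF_TOKEN" placeholder.'
            )

    return warnings


recommended = {"secrets": {"HF_TOKEN": "$HF_TOKEN"}}
risky = {"env": {"HF_TOKEN": "hf_abc123"}}

print(audit_token_config(recommended))  # no warnings
print(audit_token_config(risky))        # flags both the env usage and the literal token
```

An empty list means the config follows Method 1.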

Using Tokens in Scripts

In your Python script, tokens are available as environment variables:

# /// script
# dependencies = ["huggingface-hub"]
# ///

import os
from huggingface_hub import HfApi

# Token is automatically available if passed via secrets
token = os.environ.get("HF_TOKEN")

# Use with Hub API
api = HfApi(token=token)

# Or let huggingface_hub auto-detect
api = HfApi()  # Automatically uses HF_TOKEN env var

Best practices:

Don't hardcode tokens in scripts
Use os.environ.get("HF_TOKEN") to access
Let huggingface_hub auto-detect when possible
Verify token exists before Hub operations

Token Verification

Check if you're logged in:

from huggingface_hub import whoami
user_info = whoami()  # Returns a dict of account details if authenticated
print(user_info["name"])  # Your username

Verify token in job:

import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
token = os.environ["HF_TOKEN"]
print(f"Token has expected prefix: {token.startswith('hf_')}")  # Avoid printing any part of the secret
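
For scripts that should fail with an actionable message rather than a bare assertion, the check can be wrapped in a small function. This is a minimal sketch; `require_hf_token` and `mask` are hypothetical names, and the optional `env` argument exists only so the function can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional


def require_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return HF_TOKEN from the environment or fail with a clear message."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            'HF_TOKEN not found. Add "secrets": {"HF_TOKEN": "$HF_TOKEN"} '
            "to the job config."
        )
    return token


def mask(token: str) -> str:
    """Keep only the first three characters so the secret stays out of logs."""
    return token[:3] + "..."


# Demonstration with a fake token supplied explicitly.
print(mask(require_hf_token({"HF_TOKEN": "hf_example"})))  # hf_...
```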

Common Token Issues

Error: 401 Unauthorized

Cause: Token missing or invalid
Fix: Add secrets={"HF_TOKEN": "$HF_TOKEN"} to job config
Verify: Check hf_whoami() works locally

Error: 403 Forbidden

Cause: Token lacks required permissions
Fix: Ensure token has write permissions for push operations
Check: Token type at https://huggingface.co/settings/tokens

Error: Token not found in environment

Cause: secrets not passed or wrong key name
Fix: Use secrets={"HF_TOKEN": "$HF_TOKEN"} (not env)
Verify: Script checks os.environ.get("HF_TOKEN")

Error: Repository access denied

Cause: Token doesn't have access to private repo
Fix: Use token from account with access
Check: Verify repo visibility and your permissions

Token Security Best Practices

Never commit tokens - Use the $HF_TOKEN placeholder instead of a literal token value
Use secrets, not env - Secrets are encrypted server-side
Rotate tokens regularly - Generate new tokens periodically
Use minimal permissions - Create tokens with only needed permissions
Don't share tokens - Each user should use their own token
Monitor token usage - Check token activity in Hub settings

Complete Token Example

# Example: Push results to Hub
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///

import os
from huggingface_hub import HfApi
from datasets import Dataset

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Use token for Hub operations
api = HfApi(token=os.environ["HF_TOKEN"])

# Create and push dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])

print("✅ Dataset pushed successfully!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Token provided securely
})

Quick Start: Two Approaches

Approach 1: UV Scripts (Recommended)

UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.

hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///

from transformers import pipeline
import torch

# Your workload here
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)
""",
    "flavor": "cpu-basic",
    "timeout": "30m"
})

Benefits: Direct MCP tool usage, clean code, dependencies declared inline, no file saving required

When to use: Default choice for all workloads, custom logic, any scenario requiring hf_jobs()

Working with Scripts

⚠️ Important: There are two “script path” stories depending on how you run Jobs:

Using the hf_jobs() MCP tool (recommended in this repo): the script value must be inline code (a string) or a URL. A local filesystem path (like "./scripts/foo.py") won’t exist inside the remote container.
Using the hf jobs uv run CLI: local file paths do work (the CLI uploads your script).

Common mistake with hf_jobs() MCP tool:

# ❌ Will fail (remote container can't see your local path)
hf_jobs("uv", {"script": "./scripts/foo.py"})

Correct patterns with hf_jobs() MCP tool:

# ✅ Inline: read the local script file and pass its *contents*
from pathlib import Path
script = Path("hf-jobs/scripts/foo.py").read_text()
hf_jobs("uv", {"script": script})

# ✅ URL: host the script somewhere reachable
hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})

CLI equivalent (local paths supported):

hf jobs uv run ./scripts/foo.py -- --your --args
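
The inline-or-URL rule for the MCP tool can be captured in one helper. The sketch below is an assumption-laden illustration (`resolve_script` is a hypothetical name, not part of the MCP tool): it passes URLs through, reads an existing local file into a string, and otherwise treats the value as inline code.

```python
import tempfile
from pathlib import Path


def resolve_script(value: str) -> str:
    """Turn a script reference into something the remote container can use.

    - URLs are passed through unchanged.
    - Existing local files are read and returned as inline source.
    - Anything else is assumed to already be inline code.
    """
    if value.startswith(("http://", "https://")):
        return value
    # Only single-line values can be paths; inline code usually has newlines.
    if "\n" not in value:
        try:
            if Path(value).is_file():
                return Path(value).read_text()
        except OSError:
            pass  # e.g. the string is too long to be a file name
    return value


# Demonstration with a temporary file standing in for ./scripts/foo.py.
with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "foo.py"
    local.write_text("print('hi')\n")
    print(resolve_script(str(local)))                    # file contents
    print(resolve_script("https://example.com/foo.py"))  # URL unchanged
```

The returned string is what would go in the "script" field of the job config.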

Approach 2: Docker-Based Jobs

Run jobs with custom Docker images and commands.

hf_jobs("run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Hello from HF Jobs!')"],
    "flavor": "cpu-basic",
    "timeout": "30m"
})

Benefits: Full Docker control, use pre-built images, run any command

When to use: Need specific Docker images, non-Python workloads, complex environments

Example with GPU:

hf_jobs("run", {
    "image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
    "command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
    "flavor": "a10g-small",
    "timeout": "1h"
})

Finding More UV Scripts on Hub

The uv-scripts organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:

# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})

# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)

Popular collections: OCR, classification, synthetic-data, vLLM, dataset-creation

Hardware Selection

Workload Type	Recommended Hardware	Cost (approx./hr)	Use Case
Data processing, testing	`cpu-basic`, `cpu-upgrade`	~$0.10-0.50	Lightweight tasks
Small models, demos	`t4-small`	~$0.75	<1B models, quick tests
Medium models	`t4-medium`, `l4x1`	~$1.50-2.50	1-7B models
Large models, production	`a10g-small`, `a10g-large`	~$3.50-5.00	7-13B models
Very large models	`a100-large`	~$8-12	13B+ models
Batch inference	`a10g-large`, `a100-large`	~$5-10	High-throughput
Data processing	`cpu-upgrade`, `l4x1`	~$0.50-2.50	Parallel workloads

CPU Flavors: cpu-basic, cpu-upgrade

GPU Flavors: t4-small, t4-medium, l4x1, l4x4, a10g-small, a10g-large, a10g-largex2, a10g-largex4, a100-large, h100, h100x8

TPU Flavors: v5e-1x1, v5e-2x2, v5e-2x4

Guidelines:

Start with smaller hardware for testing
Scale up based on actual needs
Use multi-GPU for parallel workloads
See references/hardware_guide.md for detailed specifications
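
As a rough illustration of the table above, model size can be mapped to a starting flavor. The function below is a hypothetical sketch: where the table lists two flavors for a row, it picks one of them, and the thresholds are starting points rather than guarantees that a model fits in memory.

```python
def suggest_flavor(params_billion: float) -> str:
    """Map a model size (billions of parameters) to a starting flavor.

    Thresholds follow the hardware table in this guide. Pass 0 for
    workloads that do not load a model (data processing, testing).
    """
    if params_billion < 0:
        raise ValueError("params_billion must be non-negative")
    if params_billion == 0:
        return "cpu-basic"
    if params_billion < 1:
        return "t4-small"
    if params_billion <= 7:
        return "l4x1"
    if params_billion <= 13:
        return "a10g-large"
    return "a100-large"


for size in (0, 0.5, 7, 13, 70):
    print(size, "->", suggest_flavor(size))
```

Start with the suggested flavor for a short test run, then scale up only if the logs show memory pressure.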

Critical: Saving Results

⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS

The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, ALL WORK IS LOST.

Persistence Options

1. Push to Hugging Face Hub (Recommended)

import os
from huggingface_hub import HfApi

token = os.environ["HF_TOKEN"]  # Provided via the job's secrets

# Push models (model is your trained or loaded model object)
model.push_to_hub("username/model-name", token=token)

# Push datasets (dataset is a datasets.Dataset)
dataset.push_to_hub("username/dataset-name", token=token)

# Push artifacts
api = HfApi(token=token)
api.upload_file(
    path_or_fileobj="results.json",
    path_in_repo="results.json",
    repo_id="username/results",
)

2. Use External Storage

# Upload to S3, GCS, etc.
import boto3
s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')

3. Send Results via API

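# POST results to your API (results is a JSON-serializable dict produced by your job)
import requests
response = requests.post("https://your-api.com/results", json=results, timeout=30)
response.raise_for_status()  # Fail the job loudly if the upload did not succeed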

Required Configuration for Hub Push

In job submission:

{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Enables authentication
}

In script:

import os
from huggingface_hub import HfApi

# Token automatically available from secrets
api = HfApi(token=os.environ.get("HF_TOKEN"))

# Push your results
api.upload_file(...)

Verification Checklist

Before submitting:

Results persistence method chosen
secrets={"HF_TOKEN": "$HF_TOKEN"} if using Hub
Script handles missing token gracefully
Test persistence path works

See: references/hub_saving.md for detailed Hub persistence guide
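
Part of the checklist above can be automated for inline scripts. The sketch below is a hypothetical preflight check (`preflight` and `HUB_WRITE_MARKERS` are illustrative names): it scans the script text for common Hub write calls and compares that against the job config. It assumes the "script" value is inline source; a URL cannot be inspected this way.

```python
# Method names from huggingface_hub / datasets that write to the Hub.
HUB_WRITE_MARKERS = ("push_to_hub", "upload_file", "upload_folder", "create_repo")


def preflight(config: dict) -> list[str]:
    """Return problems that could cause a job's results to be lost."""
    problems = []
    script = config.get("script", "")
    writes_to_hub = any(marker in script for marker in HUB_WRITE_MARKERS)
    has_token = "HF_TOKEN" in config.get("secrets", {})

    if writes_to_hub and not has_token:
        problems.append("Script writes to the Hub but no HF_TOKEN secret is set.")
    if not writes_to_hub:
        problems.append("No Hub write call found; confirm results are persisted elsewhere.")
    if "timeout" not in config:
        problems.append("No timeout set; the 30-minute default applies.")
    return problems


job = {"script": "dataset.push_to_hub('username/my-dataset')", "flavor": "cpu-basic"}
for problem in preflight(job):
    print("-", problem)
```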

Timeout Management

⚠️ DEFAULT: 30 MINUTES

Setting Timeouts

{
    "timeout": "2h"   # 2 hours (formats: "90m", "2h", "1.5h", or seconds as integer)
}

Timeout Guidelines

Scenario	Recommended	Notes
Quick test	10-30 min	Verify setup
Data processing	1-2 hours	Depends on data size
Batch inference	2-4 hours	Large batches
Experiments	4-8 hours	Multiple runs
Long-running	8-24 hours	Production workloads

Always add 20-30% buffer for setup, network delays, and cleanup.

On timeout: The job is killed immediately and all unsaved progress is lost.
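
The buffer rule is simple arithmetic. The sketch below (a hypothetical helper, `timeout_with_buffer`) adds a buffer to an estimated runtime and formats the result in the minutes form shown above.

```python
import math


def timeout_with_buffer(estimated_minutes: float, buffer: float = 0.25) -> str:
    """Return a timeout string in minutes with a safety buffer added.

    buffer is a fraction of the estimate (0.25 means 25% extra).
    """
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    total = math.ceil(estimated_minutes * (1 + buffer))
    return f"{total}m"


print(timeout_with_buffer(96))              # 96 min + 25% -> 120m
print(timeout_with_buffer(45, buffer=0.3))  # 45 min + 30% -> 59m
```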

Cost Estimation

General guidelines:

Total Cost = (Hours of runtime) × (Cost per hour)

Example calculations:

Quick test:

Hardware: cpu-basic ($0.10/hour)
Time: 15 minutes (0.25 hours)
Cost: ~$0.03 (0.25 × $0.10 = $0.025, rounded up)

Data processing:

Hardware: l4x1 ($2.50/hour)
Time: 2 hours
Cost: $5.00

Batch inference:

Hardware: a10g-large ($5/hour)
Time: 4 hours
Cost: $20.00
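
The same calculations can be reproduced in a few lines. This sketch uses the approximate hourly rates from the examples above; they are illustrative figures, not current pricing.

```python
def estimate_cost(hourly_rate_usd: float, runtime_minutes: float) -> float:
    """Total cost = hours of runtime x cost per hour."""
    return (runtime_minutes / 60) * hourly_rate_usd


# (flavor, approximate $/hour from this guide, runtime in minutes)
examples = [
    ("cpu-basic", 0.10, 15),
    ("l4x1", 2.50, 120),
    ("a10g-large", 5.00, 240),
]

for flavor, rate, minutes in examples:
    print(f"{flavor}: ${estimate_cost(rate, minutes):.3f}")
```

To budget for the worst case, pass the job's timeout instead of the expected runtime.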

Cost optimization tips:

Start small - Test on cpu-basic or t4-small
Monitor runtime - Set appropriate timeouts
Use checkpoints - Resume if job fails
Optimize code - Reduce unnecessary compute
Choose right hardware - Don't over-provision

Monitoring and Tracking

Check Job Status

# List all jobs
hf_jobs("ps")

# Inspect specific job
hf_jobs("inspect", {"job_id": "your-job-id"})

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Cancel a job
hf_jobs("cancel", {"job_id": "your-job-id"})

Remember: Wait for user to request status checks. Avoid polling repeatedly.

Job URLs

After submission, jobs have monitoring URLs:

https://huggingface.co/jobs/username/job-id

View logs, status, and details in the browser.

Scheduled Jobs

Run jobs on a schedule using CRON expressions or predefined schedules.

# Schedule a job that runs every hour
hf_jobs("scheduled uv", {
    "script": script,  # Inline source string or URL (see Working with Scripts)
    "schedule": "@hourly",
    "flavor": "cpu-basic"
})

# Use CRON syntax
hf_jobs("scheduled uv", {
    "script": script,  # Inline source string or URL (see Working with Scripts)
    "schedule": "0 9 * * 1",  # 9 AM every Monday
    "flavor": "cpu-basic"
})

Available schedules:

@annually, @yearly - Once per year
@monthly - Once per month
@weekly - Once per week
@daily - Once per day
@hourly - Once per hour
CRON expression - Custom schedule (e.g., "0 9 * * 1")
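
A schedule string can be sanity-checked before submission. The sketch below is a shallow, hypothetical check: it accepts the predefined aliases listed above and otherwise assumes a standard five-field CRON expression, like the "0 9 * * 1" example. It does not validate the individual field values.

```python
PREDEFINED_SCHEDULES = {
    "@annually", "@yearly", "@monthly", "@weekly", "@daily", "@hourly",
}


def looks_like_valid_schedule(schedule: str) -> bool:
    """Shallow check: a predefined alias or a five-field CRON expression."""
    schedule = schedule.strip()
    if schedule.startswith("@"):
        return schedule in PREDEFINED_SCHEDULES
    return len(schedule.split()) == 5


for candidate in ("@hourly", "0 9 * * 1", "@sometimes", "0 9 * *"):
    print(repr(candidate), looks_like_valid_schedule(candidate))
```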

Manage scheduled jobs:

hf_jobs("scheduled ps")  # List scheduled jobs
hf_jobs("scheduled suspend", {"job_id": "..."})  # Pause
hf_jobs("scheduled resume", {"job_id": "..."})  # Resume
hf_jobs("scheduled delete", {"job_id": "..."})  # Delete

Common Workload Patterns

This repository ships ready-to-run UV scripts in hf-jobs/scripts/. Prefer using them instead of inventing new templates.

Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`

What it does: loads a Hub dataset (chat messages or a prompt column), applies a model chat template, generates responses with vLLM, and pushes the output dataset + dataset card back to the Hub.

Requires: GPU + write token (it pushes a dataset).

from pathlib import Path

script = Path("hf-jobs/scripts/generate-responses.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "username/input-dataset",
        "username/output-dataset",
        "--messages-column", "messages",
        "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "--temperature", "0.7",
        "--top-p", "0.8",
        "--max-tokens", "2048",
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})

Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`

What it does: generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then pushes the generated dataset + dataset card to the Hub.

Requires: GPU + write token (it pushes a dataset).

from pathlib import Path

script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--seed-dataset", "davanstrien/s1k-reasoning",
        "--output-dataset", "username/synthetic-math",
        "--task-type", "reasoning",
        "--num-samples", "5000",
        "--filter-method", "answer-consistency",
    ],
    "flavor": "l4x4",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})

Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`

What it does: scans parquet directly from Hub (no 300GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.

Requires: CPU is often enough; token needed only if you pass --output-repo (upload).

from pathlib import Path

script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--limit", "10000",
        "--show-plan",
        "--output-repo", "username/finepdfs-temporal-stats",
    ],
    "flavor": "cpu-upgrade",
    "timeout": "2h",
    "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})

Common Failure Modes

Out of Memory (OOM)

Fix:

Reduce batch size or data chunk size
Process data in smaller batches
Upgrade hardware: cpu → t4 → a10g → a100
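
Processing in smaller batches usually means iterating over the input in fixed-size chunks so that only one chunk is held in memory at a time. A minimal, library-free sketch (the `batched` helper here is written out by hand):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items, one batch at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


for chunk in batched(range(10), batch_size=4):
    print(chunk)  # [0, 1, 2, 3], then [4, 5, 6, 7], then [8, 9]
```

If a job still runs out of memory, halve `batch_size` before moving to larger hardware.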

Job Timeout

Fix:

Check logs for actual runtime
Increase timeout with buffer: "timeout": "3h"
Optimize code for faster execution
Process data in chunks

Hub Push Failures

Fix:

Add to job: secrets={"HF_TOKEN": "$HF_TOKEN"}
Verify token in script: assert "HF_TOKEN" in os.environ
Check token permissions
Verify repo exists or can be created

Missing Dependencies

Fix: Add to PEP 723 header:

# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///
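
To confirm that an inline script actually carries a metadata header before submitting it, the block can be extracted with the regular expression from the PEP 723 reference implementation. The sketch below returns the raw TOML text; `read_script_metadata` is a hypothetical name, and on Python 3.11+ the result could be parsed further with the standard-library `tomllib`.

```python
import re
from typing import Optional

# Regular expression from the PEP 723 reference implementation.
PEP723_REGEX = (
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_script_metadata(script: str) -> Optional[str]:
    """Return the TOML text of the `script` metadata block, or None."""
    for match in re.finditer(PEP723_REGEX, script):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Strip the leading "# " (or a bare "#") from each comment line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return None


source = '''# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///

print("hello")
'''

print(read_script_metadata(source))
```

A result of None means the script has no header, so every third-party import will fail in the job.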

Authentication Errors

Fix:

Check hf_whoami() works locally
Verify secrets={"HF_TOKEN": "$HF_TOKEN"} in job config
Re-login: hf auth login
Check token has required permissions

Troubleshooting

Common issues:

Job times out → Increase timeout, optimize code
Results not saved → Check persistence method, verify HF_TOKEN
Out of Memory → Reduce batch size, upgrade hardware
Import errors → Add dependencies to PEP 723 header
Authentication errors → Check token, verify secrets parameter

See: references/troubleshooting.md for complete troubleshooting guide

Resources

References (In This Skill)

references/token_usage.md - Complete token usage guide
references/hardware_guide.md - Hardware specs and selection
references/hub_saving.md - Hub persistence guide
references/troubleshooting.md - Common issues and solutions

Scripts (In This Skill)

scripts/generate-responses.py - vLLM batch generation: dataset → responses → push to Hub
scripts/cot-self-instruct.py - CoT Self-Instruct synthetic data generation + filtering → push to Hub
scripts/finepdfs-stats.py - Polars streaming stats over finepdfs-edu parquet on Hub (optional push)

Key Takeaways

Submit scripts inline - The script parameter accepts Python code directly; no file saving required unless user requests
Jobs are asynchronous - Don't wait/poll; let user check when ready
Always set a timeout - The 30-minute default may be insufficient for long-running tasks
Always persist results - Environment is ephemeral; without persistence, all work is lost
Use tokens securely - Always use secrets={"HF_TOKEN": "$HF_TOKEN"} for Hub operations
Choose appropriate hardware - Start small, scale up based on needs
Use UV scripts - Default to hf_jobs("uv", {...}) with inline scripts for Python workloads
Handle authentication - Verify tokens are available before Hub operations
Monitor jobs - Provide job URLs and status check commands
Optimize costs - Choose right hardware, set appropriate timeouts
