burtenshaw HF Staff committed on
Commit 7200e76 · verified · 1 Parent(s): ab1dc3b

Upload folder using huggingface_hub

SKILL.md ADDED
@@ -0,0 +1,752 @@
1
+ ---
2
+ name: hf-jobs
3
+ description: This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup.
4
+ license: Complete terms in LICENSE.txt
5
+ ---
6
+
7
+ # Running Workloads on Hugging Face Jobs
8
+
9
+ ## Overview
10
+
11
+ Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.
12
+
13
+ **Common use cases:**
14
+ - **Data Processing** - Transform, filter, or analyze large datasets
15
+ - **Batch Inference** - Run inference on thousands of samples
16
+ - **Experiments & Benchmarks** - Reproducible ML experiments
17
+ - **Model Training** - Fine-tune models (see `model-trainer` skill for TRL-specific training)
18
+ - **Synthetic Data Generation** - Generate datasets using LLMs
19
+ - **Development & Testing** - Test code without local GPU setup
20
+ - **Scheduled Jobs** - Automate recurring tasks
21
+
22
+ **For model training specifically:** See the `model-trainer` skill for TRL-based training workflows.
23
+
24
+ ## When to Use This Skill
25
+
26
+ Use this skill when users want to:
27
+ - Run Python workloads on cloud infrastructure
28
+ - Execute jobs without local GPU/TPU setup
29
+ - Process data at scale
30
+ - Run batch inference or experiments
31
+ - Schedule recurring tasks
32
+ - Use GPUs/TPUs for any workload
33
+ - Persist results to the Hugging Face Hub
34
+
35
+ ## Key Directives
36
+
37
+ When assisting with jobs:
38
+
39
+ 1. **ALWAYS use `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})` or `hf_jobs("run", {...})`. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`.
40
+
41
+ 2. **Always handle authentication** - Jobs that interact with the Hub require `HF_TOKEN` via secrets. See Token Usage section below.
42
+
43
+ 3. **Provide job details after submission** - After submitting, provide job ID, monitoring URL, estimated time, and note that the user can request status checks later.
44
+
45
+ 4. **Set appropriate timeouts** - The default of 30 minutes may be insufficient for long-running tasks.
46
+
47
+ ## Prerequisites Checklist
48
+
49
+ Before starting any job, verify:
50
+
51
+ ### ✅ **Account & Authentication**
52
+ - Hugging Face Account with [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
53
+ - Authenticated login: Check with `hf_whoami()`
54
+ - **HF_TOKEN for Hub Access** ⚠️ CRITICAL - Required for any Hub operations (push models/datasets, download private repos, etc.)
55
+ - Token must have appropriate permissions (read for downloads, write for uploads)
56
+
57
+ ### ✅ **Token Usage** (See Token Usage section for details)
58
+
59
+ **When tokens are required:**
60
+ - Pushing models/datasets to Hub
61
+ - Accessing private repositories
62
+ - Using Hub APIs in scripts
63
+ - Any authenticated Hub operations
64
+
65
+ **How to provide tokens:**
66
+ ```python
67
+ {
68
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Recommended: automatic token
69
+ }
70
+ ```
71
+
72
+ **⚠️ CRITICAL:** The `$HF_TOKEN` placeholder is automatically replaced with your logged-in token. Never hardcode tokens in scripts.
73
+
74
+ ## Token Usage Guide
75
+
76
+ ### Understanding Tokens
77
+
78
+ **What are HF Tokens?**
79
+ - Authentication credentials for Hugging Face Hub
80
+ - Required for authenticated operations (push, private repos, API access)
81
+ - Stored securely on your machine after `hf auth login`
82
+
83
+ **Token Types:**
84
+ - **Read Token** - Can download models/datasets, read private repos
85
+ - **Write Token** - Can push models/datasets, create repos, modify content
86
+ - **Organization Token** - Can act on behalf of an organization
87
+
88
+ ### When Tokens Are Required
89
+
90
+ **Always Required:**
91
+ - Pushing models/datasets to Hub
92
+ - Accessing private repositories
93
+ - Creating new repositories
94
+ - Modifying existing repositories
95
+ - Using Hub APIs programmatically
96
+
97
+ **Not Required:**
98
+ - Downloading public models/datasets
99
+ - Running jobs that don't interact with Hub
100
+ - Reading public repository information
101
+
102
+ ### How to Provide Tokens to Jobs
103
+
104
+ #### Method 1: Automatic Token (Recommended)
105
+
106
+ ```python
107
+ hf_jobs("uv", {
108
+ "script": "your_script.py",
109
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Automatic replacement
110
+ })
111
+ ```
112
+
113
+ **How it works:**
114
+ - `$HF_TOKEN` is a placeholder that gets replaced with your actual token
115
+ - Uses the token from your logged-in session (`hf auth login`)
116
+ - Most secure and convenient method
117
+ - Token is encrypted server-side when passed as a secret
118
+
119
+ **Benefits:**
120
+ - No token exposure in code
121
+ - Uses your current login session
122
+ - Automatically updated if you re-login
123
+ - Works seamlessly with MCP tools
124
+
125
+ #### Method 2: Explicit Token (Not Recommended)
126
+
127
+ ```python
128
+ hf_jobs("uv", {
129
+ "script": "your_script.py",
130
+ "secrets": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Hardcoded token
131
+ })
132
+ ```
133
+
134
+ **When to use:**
135
+ - Only if automatic token doesn't work
136
+ - Testing with a specific token
137
+ - Organization tokens (use with caution)
138
+
139
+ **Security concerns:**
140
+ - Token visible in code/logs
141
+ - Must manually update if token rotates
142
+ - Risk of token exposure
143
+
144
+ #### Method 3: Environment Variable (Less Secure)
145
+
146
+ ```python
147
+ hf_jobs("uv", {
148
+ "script": "your_script.py",
149
+ "env": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Less secure than secrets
150
+ })
151
+ ```
152
+
153
+ **Difference from secrets:**
154
+ - `env` variables are visible in job logs
155
+ - `secrets` are encrypted server-side
156
+ - Always prefer `secrets` for tokens
157
+
158
+ ### Using Tokens in Scripts
159
+
160
+ **In your Python script, tokens are available as environment variables:**
161
+
162
+ ```python
163
+ # /// script
164
+ # dependencies = ["huggingface-hub"]
165
+ # ///
166
+
167
+ import os
168
+ from huggingface_hub import HfApi
169
+
170
+ # Token is automatically available if passed via secrets
171
+ token = os.environ.get("HF_TOKEN")
172
+
173
+ # Use with Hub API
174
+ api = HfApi(token=token)
175
+
176
+ # Or let huggingface_hub auto-detect
177
+ api = HfApi() # Automatically uses HF_TOKEN env var
178
+ ```
179
+
180
+ **Best practices:**
181
+ - Don't hardcode tokens in scripts
182
+ - Use `os.environ.get("HF_TOKEN")` to access
183
+ - Let `huggingface_hub` auto-detect when possible
184
+ - Verify token exists before Hub operations
185
+
186
+ ### Token Verification
187
+
188
+ **Check if you're logged in:**
189
+ ```python
190
+ from huggingface_hub import whoami
191
+ user_info = whoami()  # Returns a dict with your account info (raises if not logged in)
192
+ ```
193
+
194
+ **Verify token in job:**
195
+ ```python
196
+ import os
197
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
198
+ token = os.environ["HF_TOKEN"]
199
+ print(f"Token starts with: {token[:7]}...") # Should start with "hf_"
200
+ ```
201
+
202
+ ### Common Token Issues
203
+
204
+ **Error: 401 Unauthorized**
205
+ - **Cause:** Token missing or invalid
206
+ - **Fix:** Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
207
+ - **Verify:** Check `hf_whoami()` works locally
208
+
209
+ **Error: 403 Forbidden**
210
+ - **Cause:** Token lacks required permissions
211
+ - **Fix:** Ensure token has write permissions for push operations
212
+ - **Check:** Token type at https://huggingface.co/settings/tokens
213
+
214
+ **Error: Token not found in environment**
215
+ - **Cause:** `secrets` not passed or wrong key name
216
+ - **Fix:** Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
217
+ - **Verify:** Script checks `os.environ.get("HF_TOKEN")`
218
+
219
+ **Error: Repository access denied**
220
+ - **Cause:** Token doesn't have access to private repo
221
+ - **Fix:** Use token from account with access
222
+ - **Check:** Verify repo visibility and your permissions
223
+
224
+ ### Token Security Best Practices
225
+
226
+ 1. **Never commit tokens** - Use `$HF_TOKEN` placeholder or environment variables
227
+ 2. **Use secrets, not env** - Secrets are encrypted server-side
228
+ 3. **Rotate tokens regularly** - Generate new tokens periodically
229
+ 4. **Use minimal permissions** - Create tokens with only needed permissions
230
+ 5. **Don't share tokens** - Each user should use their own token
231
+ 6. **Monitor token usage** - Check token activity in Hub settings
232
+
233
+ ### Complete Token Example
234
+
235
+ ```python
236
+ # Example: Push results to Hub
237
+ hf_jobs("uv", {
238
+ "script": """
239
+ # /// script
240
+ # dependencies = ["huggingface-hub", "datasets"]
241
+ # ///
242
+
243
+ import os
244
+ from huggingface_hub import HfApi
245
+ from datasets import Dataset
246
+
247
+ # Verify token is available
248
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
249
+
250
+ # Use token for Hub operations
251
+ api = HfApi(token=os.environ["HF_TOKEN"])
252
+
253
+ # Create and push dataset
254
+ data = {"text": ["Hello", "World"]}
255
+ dataset = Dataset.from_dict(data)
256
+ dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])
257
+
258
+ print("✅ Dataset pushed successfully!")
259
+ """,
260
+ "flavor": "cpu-basic",
261
+ "timeout": "30m",
262
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided securely
263
+ })
264
+ ```
265
+
266
+ ## Quick Start: Two Approaches
267
+
268
+ ### Approach 1: UV Scripts (Recommended)
269
+
270
+ UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.
271
+
272
+ ```python
273
+ hf_jobs("uv", {
274
+ "script": """
275
+ # /// script
276
+ # dependencies = ["transformers", "torch"]
277
+ # ///
278
+
279
+ from transformers import pipeline
280
+ import torch
281
+
282
+ # Your workload here
283
+ classifier = pipeline("sentiment-analysis")
284
+ result = classifier("I love Hugging Face!")
285
+ print(result)
286
+ """,
287
+ "flavor": "cpu-basic",
288
+ "timeout": "30m"
289
+ })
290
+ ```
291
+
292
+ **Benefits:** Direct MCP tool usage, clean code, dependencies declared inline, no file saving required
293
+
294
+ **When to use:** Default choice for all workloads, custom logic, any scenario requiring `hf_jobs()`
295
+
296
+ #### Working with Scripts
297
+
298
+ ⚠️ **Important:** There are *two* “script path” stories depending on how you run Jobs:
299
+
300
+ - **Using the `hf_jobs()` MCP tool (recommended in this repo)**: the `script` value must be **inline code** (a string) or a **URL**. A local filesystem path (like `"./scripts/foo.py"`) won’t exist inside the remote container.
301
+ - **Using the `hf jobs uv run` CLI**: local file paths **do work** (the CLI uploads your script).
302
+
303
+ **Common mistake with `hf_jobs()` MCP tool:**
304
+
305
+ ```python
306
+ # ❌ Will fail (remote container can't see your local path)
307
+ hf_jobs("uv", {"script": "./scripts/foo.py"})
308
+ ```
309
+
310
+ **Correct patterns with `hf_jobs()` MCP tool:**
311
+
312
+ ```python
313
+ # ✅ Inline: read the local script file and pass its *contents*
314
+ from pathlib import Path
315
+ script = Path("hf-jobs/scripts/foo.py").read_text()
316
+ hf_jobs("uv", {"script": script})
317
+
318
+ # ✅ URL: host the script somewhere reachable
319
+ hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})
320
+ ```
321
+
322
+ **CLI equivalent (local paths supported):**
323
+
324
+ ```bash
325
+ hf jobs uv run ./scripts/foo.py -- --your --args
326
+ ```
327
+
328
+ ### Approach 2: Docker-Based Jobs
329
+
330
+ Run jobs with custom Docker images and commands.
331
+
332
+ ```python
333
+ hf_jobs("run", {
334
+ "image": "python:3.12",
335
+ "command": ["python", "-c", "print('Hello from HF Jobs!')"],
336
+ "flavor": "cpu-basic",
337
+ "timeout": "30m"
338
+ })
339
+ ```
340
+
341
+ **Benefits:** Full Docker control, use pre-built images, run any command
342
+ **When to use:** Need specific Docker images, non-Python workloads, complex environments
343
+
344
+ **Example with GPU:**
345
+ ```python
346
+ hf_jobs("run", {
347
+ "image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
348
+ "command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
349
+ "flavor": "a10g-small",
350
+ "timeout": "1h"
351
+ })
352
+ ```
353
+
354
+ ### Finding More UV Scripts on Hub
355
+
356
+ The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:
357
+
358
+ ```python
359
+ # Discover available UV script collections
360
+ dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})
361
+
362
+ # Explore a specific collection
363
+ hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
364
+ ```
365
+
366
+ **Popular collections:** OCR, classification, synthetic-data, vLLM, dataset-creation
367
+
368
+ ## Hardware Selection
369
+
370
+ | Workload Type | Recommended Hardware | Cost (approx./hr) | Use Case |
371
+ |---------------|---------------------|------------------|----------|
372
+ | Data processing, testing | `cpu-basic`, `cpu-upgrade` | ~$0.10-0.50 | Lightweight tasks |
373
+ | Small models, demos | `t4-small` | ~$0.75 | <1B models, quick tests |
374
+ | Medium models | `t4-medium`, `l4x1` | ~$1.50-2.50 | 1-7B models |
375
+ | Large models, production | `a10g-small`, `a10g-large` | ~$3.50-5.00 | 7-13B models |
376
+ | Very large models | `a100-large` | ~$8-12 | 13B+ models |
377
+ | Batch inference | `a10g-large`, `a100-large` | ~$5-10 | High-throughput |
378
+ | Data processing | `cpu-upgrade`, `l4x1` | ~$0.50-2.50 | Parallel workloads |
379
+
380
+ **CPU/GPU flavors:** cpu-basic/upgrade, t4-small/medium, l4x1/x4, a10g-small/large/largex2/largex4, a100-large, h100/h100x8
381
+
382
+ **TPU Flavors:** v5e-1x1, v5e-2x2, v5e-2x4
383
+
384
+ **Guidelines:**
385
+ - Start with smaller hardware for testing
386
+ - Scale up based on actual needs
387
+ - Use multi-GPU for parallel workloads
388
+ - See `references/hardware_guide.md` for detailed specifications
389
+
390
+ ## Critical: Saving Results
391
+
392
+ **⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS**
393
+
394
+ The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, **ALL WORK IS LOST**.
395
+
396
+ ### Persistence Options
397
+
398
+ **1. Push to Hugging Face Hub (Recommended)**
399
+
400
+ ```python
401
+ # Push models
402
+ model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])
403
+
404
+ # Push datasets
405
+ dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])
406
+
407
+ # Push artifacts
408
+ api.upload_file(
409
+ path_or_fileobj="results.json",
410
+ path_in_repo="results.json",
411
+ repo_id="username/results",
412
+ token=os.environ["HF_TOKEN"]
413
+ )
414
+ ```
415
+
416
+ **2. Use External Storage**
417
+
418
+ ```python
419
+ # Upload to S3, GCS, etc.
420
+ import boto3
421
+ s3 = boto3.client('s3')
422
+ s3.upload_file('results.json', 'my-bucket', 'results.json')
423
+ ```
424
+
425
+ **3. Send Results via API**
426
+
427
+ ```python
428
+ # POST results to your API
429
+ import requests
430
+ requests.post("https://your-api.com/results", json=results)
431
+ ```
432
+
433
+ ### Required Configuration for Hub Push
434
+
435
+ **In job submission:**
436
+ ```python
437
+ {
438
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Enables authentication
439
+ }
440
+ ```
441
+
442
+ **In script:**
443
+ ```python
444
+ import os
445
+ from huggingface_hub import HfApi
446
+
447
+ # Token automatically available from secrets
448
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
449
+
450
+ # Push your results
451
+ api.upload_file(...)
452
+ ```
453
+
454
+ ### Verification Checklist
455
+
456
+ Before submitting:
457
+ - [ ] Results persistence method chosen
458
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` if using Hub
459
+ - [ ] Script handles missing token gracefully (see the sketch below)
460
+ - [ ] Test persistence path works
461
+
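+ For the "handles missing token gracefully" item, a minimal pattern (illustrative sketch) is to fail fast with an actionable message before doing any expensive work:
+
+ ```python
+ import os
+ import sys
+
+ token = os.environ.get("HF_TOKEN")
+ if not token:
+     # Fail early with a clear hint instead of a 401 halfway through the job
+     sys.exit("HF_TOKEN is missing: submit the job with secrets={'HF_TOKEN': '$HF_TOKEN'}")
+ ```
+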
462
+ **See:** `references/hub_saving.md` for detailed Hub persistence guide
463
+
464
+ ## Timeout Management
465
+
466
+ **⚠️ DEFAULT: 30 MINUTES**
467
+
468
+ ### Setting Timeouts
469
+
470
+ ```python
471
+ {
472
+ "timeout": "2h" # 2 hours (formats: "90m", "2h", "1.5h", or seconds as integer)
473
+ }
474
+ ```
475
+
476
+ ### Timeout Guidelines
477
+
478
+ | Scenario | Recommended | Notes |
479
+ |----------|-------------|-------|
480
+ | Quick test | 10-30 min | Verify setup |
481
+ | Data processing | 1-2 hours | Depends on data size |
482
+ | Batch inference | 2-4 hours | Large batches |
483
+ | Experiments | 4-8 hours | Multiple runs |
484
+ | Long-running | 8-24 hours | Production workloads |
485
+
486
+ **Always add 20-30% buffer** for setup, network delays, and cleanup.
487
+
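+ A small helper for applying that buffer (an illustrative sketch; the 25% factor is simply the midpoint of the range above):
+
+ ```python
+ import math
+
+ def buffered_timeout(estimated_minutes: float, buffer: float = 0.25) -> str:
+     """Add a safety buffer to an estimated runtime and format it for the `timeout` parameter."""
+     minutes = math.ceil(estimated_minutes * (1 + buffer))
+     return f"{minutes}m"
+
+ print(buffered_timeout(90))   # "113m" for a 90-minute estimate
+ print(buffered_timeout(300))  # "375m" for a 5-hour estimate
+ ```
+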
488
+ **On timeout:** Job killed immediately, all unsaved progress lost
489
+
490
+ ## Cost Estimation
491
+
492
+ **General guidelines:**
493
+
494
+ ```
495
+ Total Cost = (Hours of runtime) × (Cost per hour)
496
+ ```
497
+
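+ The same formula as a tiny helper (a sketch; the per-hour prices are the approximate figures used in the examples below, not authoritative pricing):
+
+ ```python
+ # Approximate per-hour prices (assumed from the hardware table above)
+ PRICE_PER_HOUR = {"cpu-basic": 0.10, "l4x1": 2.50, "a10g-large": 5.00}
+
+ def estimate_cost(flavor: str, hours: float) -> float:
+     """Rough job cost in USD: runtime hours times the flavor's hourly price."""
+     return round(PRICE_PER_HOUR[flavor] * hours, 2)
+
+ print(estimate_cost("cpu-basic", 0.25))   # ≈ $0.03
+ print(estimate_cost("l4x1", 2))           # ≈ $5.00
+ print(estimate_cost("a10g-large", 4))     # ≈ $20.00
+ ```
+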
498
+ **Example calculations:**
499
+
500
+ **Quick test:**
501
+ - Hardware: cpu-basic ($0.10/hour)
502
+ - Time: 15 minutes (0.25 hours)
503
+ - Cost: $0.03
504
+
505
+ **Data processing:**
506
+ - Hardware: l4x1 ($2.50/hour)
507
+ - Time: 2 hours
508
+ - Cost: $5.00
509
+
510
+ **Batch inference:**
511
+ - Hardware: a10g-large ($5/hour)
512
+ - Time: 4 hours
513
+ - Cost: $20.00
514
+
515
+ **Cost optimization tips:**
516
+ 1. Start small - Test on cpu-basic or t4-small
517
+ 2. Monitor runtime - Set appropriate timeouts
518
+ 3. Use checkpoints - Resume if job fails
519
+ 4. Optimize code - Reduce unnecessary compute
520
+ 5. Choose right hardware - Don't over-provision
521
+
522
+ ## Monitoring and Tracking
523
+
524
+ ### Check Job Status
525
+
526
+ ```python
527
+ # List all jobs
528
+ hf_jobs("ps")
529
+
530
+ # Inspect specific job
531
+ hf_jobs("inspect", {"job_id": "your-job-id"})
532
+
533
+ # View logs
534
+ hf_jobs("logs", {"job_id": "your-job-id"})
535
+
536
+ # Cancel a job
537
+ hf_jobs("cancel", {"job_id": "your-job-id"})
538
+ ```
539
+
540
+ **Remember:** Wait for user to request status checks. Avoid polling repeatedly.
541
+
542
+ ### Job URLs
543
+
544
+ After submission, jobs have monitoring URLs:
545
+ ```
546
+ https://huggingface.co/jobs/username/job-id
547
+ ```
548
+
549
+ View logs, status, and details in the browser.
550
+
551
+ ## Scheduled Jobs
552
+
553
+ Run jobs on a schedule using CRON expressions or predefined schedules.
554
+
555
+ ```python
556
+ # Schedule a job that runs every hour
557
+ hf_jobs("scheduled uv", {
558
+ "script": "your_script.py",
559
+ "schedule": "@hourly",
560
+ "flavor": "cpu-basic"
561
+ })
562
+
563
+ # Use CRON syntax
564
+ hf_jobs("scheduled uv", {
565
+ "script": "your_script.py",
566
+ "schedule": "0 9 * * 1", # 9 AM every Monday
567
+ "flavor": "cpu-basic"
568
+ })
569
+ ```
570
+
571
+ **Available schedules:**
572
+ - `@annually`, `@yearly` - Once per year
573
+ - `@monthly` - Once per month
574
+ - `@weekly` - Once per week
575
+ - `@daily` - Once per day
576
+ - `@hourly` - Once per hour
577
+ - CRON expression - Custom schedule (e.g., `"0 9 * * 1"`; field breakdown below)
578
+
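+ For reference, the five fields of a standard CRON expression such as `"0 9 * * 1"` read:
+
+ ```
+ 0 9 * * 1
+ │ │ │ │ └─ day of week (0-6, 0 = Sunday; 1 = Monday)
+ │ │ │ └─── month (1-12)
+ │ │ └───── day of month (1-31)
+ │ └─────── hour (0-23)
+ └───────── minute (0-59)
+ ```
+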
579
+ **Manage scheduled jobs:**
580
+ ```python
581
+ hf_jobs("scheduled ps") # List scheduled jobs
582
+ hf_jobs("scheduled suspend", {"job_id": "..."}) # Pause
583
+ hf_jobs("scheduled resume", {"job_id": "..."}) # Resume
584
+ hf_jobs("scheduled delete", {"job_id": "..."}) # Delete
585
+ ```
586
+
587
+ ## Common Workload Patterns
588
+
589
+ This repository ships ready-to-run UV scripts in `hf-jobs/scripts/`. Prefer using them instead of inventing new templates.
590
+
591
+ ### Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`
592
+
593
+ **What it does:** loads a Hub dataset (chat `messages` or a `prompt` column), applies a model chat template, generates responses with vLLM, and **pushes** the output dataset + dataset card back to the Hub.
594
+
595
+ **Requires:** GPU + **write** token (it pushes a dataset).
596
+
597
+ ```python
598
+ from pathlib import Path
599
+
600
+ script = Path("hf-jobs/scripts/generate-responses.py").read_text()
601
+ hf_jobs("uv", {
602
+ "script": script,
603
+ "script_args": [
604
+ "username/input-dataset",
605
+ "username/output-dataset",
606
+ "--messages-column", "messages",
607
+ "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
608
+ "--temperature", "0.7",
609
+ "--top-p", "0.8",
610
+ "--max-tokens", "2048",
611
+ ],
612
+ "flavor": "a10g-large",
613
+ "timeout": "4h",
614
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
615
+ })
616
+ ```
617
+
618
+ ### Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`
619
+
620
+ **What it does:** generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then **pushes** the generated dataset + dataset card to the Hub.
621
+
622
+ **Requires:** GPU + **write** token (it pushes a dataset).
623
+
624
+ ```python
625
+ from pathlib import Path
626
+
627
+ script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
628
+ hf_jobs("uv", {
629
+ "script": script,
630
+ "script_args": [
631
+ "--seed-dataset", "davanstrien/s1k-reasoning",
632
+ "--output-dataset", "username/synthetic-math",
633
+ "--task-type", "reasoning",
634
+ "--num-samples", "5000",
635
+ "--filter-method", "answer-consistency",
636
+ ],
637
+ "flavor": "l4x4",
638
+ "timeout": "8h",
639
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
640
+ })
641
+ ```
642
+
643
+ ### Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`
644
+
645
+ **What it does:** scans parquet directly from Hub (no 300GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.
646
+
647
+ **Requires:** CPU is often enough; token needed **only** if you pass `--output-repo` (upload).
648
+
649
+ ```python
650
+ from pathlib import Path
651
+
652
+ script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
653
+ hf_jobs("uv", {
654
+ "script": script,
655
+ "script_args": [
656
+ "--limit", "10000",
657
+ "--show-plan",
658
+ "--output-repo", "username/finepdfs-temporal-stats",
659
+ ],
660
+ "flavor": "cpu-upgrade",
661
+ "timeout": "2h",
662
+ "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
663
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
664
+ })
665
+ ```
666
+
667
+ ## Common Failure Modes
668
+
669
+ ### Out of Memory (OOM)
670
+
671
+ **Fix:**
672
+ 1. Reduce batch size or data chunk size
673
+ 2. Process data in smaller batches (see the sketch below)
674
+ 3. Upgrade hardware: cpu → t4 → a10g → a100
675
+
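+ A minimal sketch of fix 2, processing a dataset batch by batch instead of holding everything in memory (the dataset name, column, and batch size are illustrative):
+
+ ```python
+ # /// script
+ # dependencies = ["datasets"]
+ # ///
+
+ from datasets import load_dataset
+
+ ds = load_dataset("username/input-dataset", split="train")  # hypothetical repo
+
+ def process(batch):
+     # Replace with the real per-batch logic (tokenization, inference, ...)
+     batch["n_chars"] = [len(t) for t in batch["text"]]
+     return batch
+
+ # Batched map only materializes one batch at a time
+ ds = ds.map(process, batched=True, batch_size=256)
+ ```
+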
676
+ ### Job Timeout
677
+
678
+ **Fix:**
679
+ 1. Check logs for actual runtime
680
+ 2. Increase timeout with buffer: `"timeout": "3h"`
681
+ 3. Optimize code for faster execution
682
+ 4. Process data in chunks
683
+
684
+ ### Hub Push Failures
685
+
686
+ **Fix:**
687
+ 1. Add to job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
688
+ 2. Verify token in script: `assert "HF_TOKEN" in os.environ`
689
+ 3. Check token permissions
690
+ 4. Verify repo exists or can be created
691
+
692
+ ### Missing Dependencies
693
+
694
+ **Fix:**
695
+ Add to PEP 723 header:
696
+ ```python
697
+ # /// script
698
+ # dependencies = ["package1", "package2>=1.0.0"]
699
+ # ///
700
+ ```
701
+
702
+ ### Authentication Errors
703
+
704
+ **Fix:**
705
+ 1. Check `hf_whoami()` works locally
706
+ 2. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
707
+ 3. Re-login: `hf auth login`
708
+ 4. Check token has required permissions
709
+
710
+ ## Troubleshooting
711
+
712
+ **Common issues:**
713
+ - Job times out → Increase timeout, optimize code
714
+ - Results not saved → Check persistence method, verify HF_TOKEN
715
+ - Out of Memory → Reduce batch size, upgrade hardware
716
+ - Import errors → Add dependencies to PEP 723 header
717
+ - Authentication errors → Check token, verify secrets parameter
718
+
719
+ **See:** `references/troubleshooting.md` for complete troubleshooting guide
720
+
721
+ ## Resources
722
+
723
+ ### References (In This Skill)
724
+ - `references/token_usage.md` - Complete token usage guide
725
+ - `references/hardware_guide.md` - Hardware specs and selection
726
+ - `references/hub_saving.md` - Hub persistence guide
727
+ - `references/troubleshooting.md` - Common issues and solutions
728
+
729
+ ### Scripts (In This Skill)
730
+ - `scripts/generate-responses.py` - vLLM batch generation: dataset → responses → push to Hub
731
+ - `scripts/cot-self-instruct.py` - CoT Self-Instruct synthetic data generation + filtering → push to Hub
732
+ - `scripts/finepdfs-stats.py` - Polars streaming stats over `finepdfs-edu` parquet on Hub (optional push)
733
+
734
+ ### External Links
735
+ - [HF Jobs Documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs)
736
+ - [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/)
737
+ - [UV Scripts Organization](https://huggingface.co/uv-scripts)
738
+ - [HF Hub Authentication](https://huggingface.co/docs/huggingface_hub/quick-start#authentication)
739
+
740
+ ## Key Takeaways
741
+
742
+ 1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless user requests
743
+ 2. **Jobs are asynchronous** - Don't wait/poll; let user check when ready
744
+ 3. **Always set timeout** - Default 30 min may be insufficient; set appropriate timeout
745
+ 4. **Always persist results** - Environment is ephemeral; without persistence, all work is lost
746
+ 5. **Use tokens securely** - Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}` for Hub operations
747
+ 6. **Choose appropriate hardware** - Start small, scale up based on needs
748
+ 7. **Use UV scripts** - Default to `hf_jobs("uv", {...})` with inline scripts for Python workloads
749
+ 8. **Handle authentication** - Verify tokens are available before Hub operations
750
+ 9. **Monitor jobs** - Provide job URLs and status check commands
751
+ 10. **Optimize costs** - Choose right hardware, set appropriate timeouts
752
+
index.html CHANGED
@@ -1,19 +1,214 @@
1
- <!doctype html>
2
- <html>
3
- <head>
4
- <meta charset="utf-8" />
5
- <meta name="viewport" content="width=device-width" />
6
- <title>My static Space</title>
7
- <link rel="stylesheet" href="style.css" />
8
- </head>
9
- <body>
10
- <div class="card">
11
- <h1>Welcome to your static Space!</h1>
12
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
13
- <p>
14
- Also don't forget to check the
15
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
16
- </p>
17
- </div>
18
- </body>
19
  </html>
 
 
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>hf-jobs - Run Workloads on Hugging Face Jobs</title>
7
+ <style>
8
+ * {
9
+ margin: 0;
10
+ padding: 0;
11
+ box-sizing: border-box;
12
+ }
13
+
14
+ body {
15
+ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
16
+ line-height: 1.6;
17
+ color: #333;
18
+ background: #f5f5f5;
19
+ padding: 20px;
20
+ }
21
+
22
+ .container {
23
+ max-width: 1200px;
24
+ margin: 0 auto;
25
+ background: white;
26
+ padding: 40px;
27
+ border-radius: 8px;
28
+ box-shadow: 0 2px 4px rgba(0,0,0,0.1);
29
+ }
30
+
31
+ h1 {
32
+ color: #ffd21e;
33
+ background: #000;
34
+ padding: 20px;
35
+ margin: -40px -40px 30px -40px;
36
+ border-radius: 8px 8px 0 0;
37
+ }
38
+
39
+ h2 {
40
+ color: #1e1e1e;
41
+ margin-top: 30px;
42
+ margin-bottom: 15px;
43
+ padding-bottom: 10px;
44
+ border-bottom: 2px solid #ffd21e;
45
+ }
46
+
47
+ h3 {
48
+ color: #555;
49
+ margin-top: 20px;
50
+ margin-bottom: 10px;
51
+ }
52
+
53
+ .description {
54
+ background: #f9f9f9;
55
+ padding: 20px;
56
+ border-left: 4px solid #ffd21e;
57
+ margin-bottom: 30px;
58
+ border-radius: 4px;
59
+ }
60
+
61
+ .file-list {
62
+ list-style: none;
63
+ padding: 0;
64
+ }
65
+
66
+ .file-list li {
67
+ padding: 12px;
68
+ margin: 8px 0;
69
+ background: #f9f9f9;
70
+ border-radius: 4px;
71
+ border-left: 3px solid #ffd21e;
72
+ transition: background 0.2s;
73
+ }
74
+
75
+ .file-list li:hover {
76
+ background: #f0f0f0;
77
+ }
78
+
79
+ .file-list a {
80
+ color: #0066cc;
81
+ text-decoration: none;
82
+ font-weight: 500;
83
+ display: block;
84
+ }
85
+
86
+ .file-list a:hover {
87
+ text-decoration: underline;
88
+ }
89
+
90
+ .file-path {
91
+ color: #666;
92
+ font-size: 0.9em;
93
+ font-family: 'Monaco', 'Courier New', monospace;
94
+ margin-top: 4px;
95
+ }
96
+
97
+ .file-description {
98
+ color: #777;
99
+ font-size: 0.9em;
100
+ margin-top: 4px;
101
+ font-style: italic;
102
+ }
103
+
104
+ .metadata {
105
+ background: #f0f0f0;
106
+ padding: 15px;
107
+ border-radius: 4px;
108
+ margin-bottom: 30px;
109
+ }
110
+
111
+ .metadata p {
112
+ margin: 5px 0;
113
+ }
114
+
115
+ .metadata strong {
116
+ color: #333;
117
+ }
118
+
119
+ .section {
120
+ margin-bottom: 40px;
121
+ }
122
+
123
+ code {
124
+ background: #f4f4f4;
125
+ padding: 2px 6px;
126
+ border-radius: 3px;
127
+ font-family: 'Monaco', 'Courier New', monospace;
128
+ font-size: 0.9em;
129
+ }
130
+ </style>
131
+ </head>
132
+ <body>
133
+ <div class="container">
134
+ <h1>Agent Skill : hf-jobs</h1>
135
+
136
+ <div class="description">
137
+ <p><strong>Run any workload on Hugging Face Jobs.</strong></p>
138
+ <p>Use this skill when you want to run GPU/CPU workloads (batch inference, synthetic data generation, dataset stats, experiments) on Hugging Face Jobs, with correct token handling and result persistence back to the Hub.</p>
139
+ </div>
140
+
141
+ <div class="metadata">
142
+ <p><strong>Skill Name:</strong> hf-jobs</p>
143
+ <p><strong>Main Documentation:</strong> <a href="hf-jobs/SKILL.md">hf-jobs/SKILL.md</a></p>
144
+ <p><strong>Scripts Directory:</strong> <code>hf-jobs/scripts/</code></p>
145
+ <p><strong>References Directory:</strong> <code>hf-jobs/references/</code></p>
146
+ </div>
147
+
148
+ <div class="section">
149
+ <h2>Overview</h2>
150
+ <p>This skill focuses on running real workloads via Hugging Face Jobs. It includes ready-to-run UV scripts and guides for authentication (HF tokens), secrets vs env vars, timeouts, hardware selection, and pushing results to the Hub.</p>
151
+ </div>
152
+
153
+ <div class="section">
154
+ <h2>Core Documentation</h2>
155
+ <ul class="file-list">
156
+ <li>
157
+ <a href="hf-jobs/SKILL.md">SKILL.md</a>
158
+ <div class="file-path">hf-jobs/SKILL.md</div>
159
+ <div class="file-description">Complete skill documentation (how to submit jobs, tokens/secrets, timeouts, persistence, and how to use the bundled scripts)</div>
160
+ </li>
161
+ </ul>
162
+ </div>
163
+
164
+ <div class="section">
165
+ <h2>References</h2>
166
+ <ul class="file-list">
167
+ <li>
168
+ <a href="hf-jobs/references/token_usage.md">token_usage.md</a>
169
+ <div class="file-path">hf-jobs/references/token_usage.md</div>
170
+ <div class="file-description">Token best practices: secrets vs env, permissions, common errors (401/403), and secure patterns</div>
171
+ </li>
172
+ <li>
173
+ <a href="hf-jobs/references/hub_saving.md">hub_saving.md</a>
174
+ <div class="file-path">hf-jobs/references/hub_saving.md</div>
175
+ <div class="file-description">How to persist results: push datasets/models/files to the Hub (ephemeral job filesystem)</div>
176
+ </li>
177
+ <li>
178
+ <a href="hf-jobs/references/hardware_guide.md">hardware_guide.md</a>
179
+ <div class="file-path">hf-jobs/references/hardware_guide.md</div>
180
+ <div class="file-description">Flavor selection guidance for CPU/GPU/TPU workloads</div>
181
+ </li>
182
+ <li>
183
+ <a href="hf-jobs/references/troubleshooting.md">troubleshooting.md</a>
184
+ <div class="file-path">hf-jobs/references/troubleshooting.md</div>
185
+ <div class="file-description">Common failure modes (timeouts, missing deps, OOM, auth) and fixes</div>
186
+ </li>
187
+ </ul>
188
+ </div>
189
+
190
+ <div class="section">
191
+ <h2>Scripts</h2>
192
+ <ul class="file-list">
193
+ <li>
194
+ <a href="hf-jobs/scripts/generate-responses.py">generate-responses.py</a>
195
+ <div class="file-path">hf-jobs/scripts/generate-responses.py</div>
196
+ <div class="file-description">vLLM batch generation: load prompts/messages from a dataset, generate responses, push dataset + card to Hub</div>
197
+ </li>
198
+ <li>
199
+ <a href="hf-jobs/scripts/cot-self-instruct.py">cot-self-instruct.py</a>
200
+ <div class="file-path">hf-jobs/scripts/cot-self-instruct.py</div>
201
+ <div class="file-description">CoT Self-Instruct synthetic data generation (reasoning/instruction) + optional filtering, pushes dataset + card</div>
202
+ </li>
203
+ <li>
204
+ <a href="hf-jobs/scripts/finepdfs-stats.py">finepdfs-stats.py</a>
205
+ <div class="file-path">hf-jobs/scripts/finepdfs-stats.py</div>
206
+ <div class="file-description">Polars streaming stats over Hub parquet (finepdfs-edu); optional upload of computed stats to a dataset repo</div>
207
+ </li>
208
+ </ul>
209
+ </div>
210
+ </div>
211
+ </body>
212
  </html>
213
+
214
+
references/hardware_guide.md ADDED
@@ -0,0 +1,266 @@
1
+ # Hardware Selection Guide
2
+
3
+ Choosing the right hardware (flavor) is critical for cost-effective workloads.
4
+
5
+ ## Available Hardware
6
+
7
+ ### CPU
8
+ - `cpu-basic` - Basic CPU, testing only
9
+ - `cpu-upgrade` - Enhanced CPU
10
+
11
+ **Use cases:** Data processing, testing scripts, lightweight workloads
12
+ **Not recommended for:** Model training, GPU-accelerated workloads
13
+
14
+ ### GPU Options
15
+
16
+ | Flavor | GPU | Memory | Use Case | Cost/hour |
17
+ |--------|-----|--------|----------|-----------|
18
+ | `t4-small` | NVIDIA T4 | 16GB | <1B models, demos, batch inference | ~$0.50-1 |
19
+ | `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
20
+ | `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient workloads | ~$2-3 |
21
+ | `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU workloads | ~$8-12 |
22
+ | `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
23
+ | `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
24
+ | `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
25
+ | `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
26
+ | `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast workloads | ~$8-12 |
27
+
28
+ ## Selection Guidelines
29
+
30
+ ### By Workload Type
31
+
32
+ **Data Processing**
33
+ - **Recommended:** `cpu-upgrade` or `l4x1`
34
+ - **Use case:** Transform, filter, analyze datasets
35
+ - **Batch size:** Depends on data size
36
+ - **Time:** Varies by dataset size
37
+
38
+ **Batch Inference**
39
+ - **Recommended:** `a10g-large` or `a100-large`
40
+ - **Use case:** Run inference on thousands of samples
41
+ - **Batch size:** 8-32 depending on model
42
+ - **Time:** Depends on number of samples
43
+
44
+ **Experiments & Benchmarks**
45
+ - **Recommended:** `a10g-small` or `a10g-large`
46
+ - **Use case:** Reproducible ML experiments
47
+ - **Batch size:** Varies
48
+ - **Time:** Depends on experiment complexity
49
+
50
+ **Model Training** (see `model-trainer` skill for details)
51
+ - **Recommended:** See model-trainer skill
52
+ - **Use case:** Fine-tuning models
53
+ - **Batch size:** Depends on model size
54
+ - **Time:** Hours to days
55
+
56
+ **Synthetic Data Generation**
57
+ - **Recommended:** `a10g-large` or `a100-large`
58
+ - **Use case:** Generate datasets using LLMs
59
+ - **Batch size:** Depends on generation method
60
+ - **Time:** Hours for large datasets
61
+
62
+ ### By Budget
63
+
64
+ **Minimal Budget (<$5 total)**
65
+ - Use `cpu-basic` or `t4-small`
66
+ - Process small datasets
67
+ - Quick tests and demos
68
+
69
+ **Small Budget ($5-20)**
70
+ - Use `t4-medium` or `a10g-small`
71
+ - Process medium datasets
72
+ - Run experiments
73
+
74
+ **Medium Budget ($20-50)**
75
+ - Use `a10g-small` or `a10g-large`
76
+ - Process large datasets
77
+ - Production workloads
78
+
79
+ **Large Budget ($50-200)**
80
+ - Use `a10g-large` or `a100-large`
81
+ - Large-scale processing
82
+ - Multiple experiments
83
+
84
+ ### By Model Size (for inference/processing)
85
+
86
+ **Tiny Models (<1B parameters)**
87
+ - **Recommended:** `t4-small`
88
+ - **Example:** Qwen2.5-0.5B, TinyLlama
89
+ - **Batch size:** 8-16
90
+
91
+ **Small Models (1-3B parameters)**
92
+ - **Recommended:** `t4-medium` or `a10g-small`
93
+ - **Example:** Qwen2.5-1.5B, Phi-2
94
+ - **Batch size:** 4-8
95
+
96
+ **Medium Models (3-7B parameters)**
97
+ - **Recommended:** `a10g-small` or `a10g-large`
98
+ - **Example:** Qwen2.5-7B, Mistral-7B
99
+ - **Batch size:** 2-4
100
+
101
+ **Large Models (7-13B parameters)**
102
+ - **Recommended:** `a10g-large` or `a100-large`
103
+ - **Example:** Llama-3-8B
104
+ - **Batch size:** 1-2
105
+
106
+ **Very Large Models (13B+ parameters)**
107
+ - **Recommended:** `a100-large`
108
+ - **Example:** Llama-3-13B, Llama-3-70B
109
+ - **Batch size:** 1
110
+
111
+ ## Memory Considerations
112
+
113
+ ### Estimating Memory Requirements
114
+
115
+ **For inference:**
116
+ ```
117
+ Memory (GB) ≈ (Model params in billions) × 2-4
118
+ ```
119
+
120
+ **For training:**
121
+ ```
122
+ Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)
123
+ ```
124
+
125
+ **Examples:**
126
+ - Qwen2.5-0.5B inference: ~1-2GB ✅ fits t4-small
127
+ - Qwen2.5-7B inference: ~14-28GB ✅ fits a10g-large
128
+ - Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA
129
+
130
+ ### Memory Optimization
131
+
132
+ If hitting memory limits:
133
+
134
+ 1. **Reduce batch size**
135
+ ```python
136
+ batch_size = 1
137
+ ```
138
+
139
+ 2. **Process in chunks**
140
+ ```python
141
+ for chunk in chunks:
142
+ process(chunk)
143
+ ```
144
+
145
+ 3. **Use smaller models**
146
+ - Use quantized models
147
+ - Use LoRA adapters
148
+
149
+ 4. **Upgrade hardware**
150
+ - cpu → t4 → a10g → a100
151
+
152
+ ## Cost Estimation
153
+
154
+ ### Formula
155
+
156
+ ```
157
+ Total Cost = (Hours of runtime) × (Cost per hour)
158
+ ```
159
+
160
+ ### Example Calculations
161
+
162
+ **Data processing:**
163
+ - Hardware: cpu-upgrade ($0.50/hour)
164
+ - Time: 1 hour
165
+ - Cost: $0.50
166
+
167
+ **Batch inference:**
168
+ - Hardware: a10g-large ($5/hour)
169
+ - Time: 2 hours
170
+ - Cost: $10.00
171
+
172
+ **Experiments:**
173
+ - Hardware: a10g-small ($3.50/hour)
174
+ - Time: 4 hours
175
+ - Cost: $14.00
176
+
177
+ ### Cost Optimization Tips
178
+
179
+ 1. **Start small:** Test on cpu-basic or t4-small
180
+ 2. **Monitor runtime:** Set appropriate timeouts
181
+ 3. **Optimize code:** Reduce unnecessary compute
182
+ 4. **Choose right hardware:** Don't over-provision
183
+ 5. **Use checkpoints:** Resume if job fails
184
+ 6. **Monitor costs:** Check running jobs regularly
185
+
186
+ ## Multi-GPU Workloads
187
+
188
+ Multi-GPU flavors automatically distribute workloads:
189
+
190
+ **Multi-GPU flavors:**
191
+ - `l4x4` - 4x L4 GPUs
192
+ - `a10g-largex2` - 2x A10G GPUs
193
+ - `a10g-largex4` - 4x A10G GPUs
194
+
195
+ **When to use:**
196
+ - Large models (>13B parameters)
197
+ - Need faster processing (near-linear speedup for well-parallelized workloads)
198
+ - Large datasets (>100K samples)
199
+ - Parallel workloads
200
+
201
+ **Example:**
202
+ ```python
203
+ hf_jobs("uv", {
204
+ "script": "process.py",
205
+ "flavor": "a10g-largex2", # 2 GPUs
206
+ "timeout": "4h",
207
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
208
+ })
209
+ ```
210
+
211
+ ## Choosing Between Options
212
+
213
+ ### CPU vs GPU
214
+
215
+ **Choose CPU when:**
216
+ - No GPU acceleration needed
217
+ - Data processing only
218
+ - Budget constrained
219
+ - Simple workloads
220
+
221
+ **Choose GPU when:**
222
+ - Model inference/training
223
+ - GPU-accelerated libraries
224
+ - Need faster processing
225
+ - Large models
226
+
227
+ ### a10g vs a100
228
+
229
+ **Choose a10g when:**
230
+ - Model <13B parameters
231
+ - Budget conscious
232
+ - Processing time not critical
233
+
234
+ **Choose a100 when:**
235
+ - Model 13B+ parameters
236
+ - Need fastest processing
237
+ - Memory requirements high
238
+ - Budget allows
239
+
240
+ ### Single vs Multi-GPU
241
+
242
+ **Choose single GPU when:**
243
+ - Model <7B parameters
244
+ - Budget constrained
245
+ - Simpler debugging
246
+
247
+ **Choose multi-GPU when:**
248
+ - Model >13B parameters
249
+ - Need faster processing
250
+ - Large batch sizes required
251
+ - Cost-effective for large jobs
252
+
253
+ ## Quick Reference
254
+
255
+ ```python
256
+ # Workload type → Hardware selection
257
+ HARDWARE_MAP = {
258
+ "data_processing": "cpu-upgrade",
259
+ "batch_inference_small": "t4-small",
260
+ "batch_inference_medium": "a10g-large",
261
+ "batch_inference_large": "a100-large",
262
+ "experiments": "a10g-small",
263
+ "training": "see model-trainer skill"
264
+ }
265
+ ```
266
+
references/hub_saving.md ADDED
@@ -0,0 +1,339 @@
1
+ # Saving Results to Hugging Face Hub
2
+
3
+ **⚠️ CRITICAL:** Job environments are ephemeral. ALL results are lost when a job completes unless persisted to the Hub or external storage.
4
+
5
+ ## Why Persistence is Required
6
+
7
+ When running on Hugging Face Jobs:
8
+ - Environment is temporary
9
+ - All files deleted on job completion
10
+ - No local disk persistence
11
+ - Cannot access results after job ends
12
+
13
+ **Without persistence, all work is permanently lost.**
14
+
15
+ ## Persistence Options
16
+
17
+ ### Option 1: Push to Hugging Face Hub (Recommended)
18
+
19
+ **For models:**
20
+ ```python
21
+ from transformers import AutoModel
22
+ model.push_to_hub("username/model-name", token=os.environ.get("HF_TOKEN"))
23
+ ```
24
+
25
+ **For datasets:**
26
+ ```python
27
+ from datasets import Dataset
28
+ dataset.push_to_hub("username/dataset-name", token=os.environ.get("HF_TOKEN"))
29
+ ```
30
+
31
+ **For files/artifacts:**
32
+ ```python
33
+ from huggingface_hub import HfApi
34
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
35
+ api.upload_file(
36
+ path_or_fileobj="results.json",
37
+ path_in_repo="results.json",
38
+ repo_id="username/results",
39
+ repo_type="dataset"
40
+ )
41
+ ```
42
+
43
+ ### Option 2: External Storage
44
+
45
+ **S3:**
46
+ ```python
47
+ import boto3
48
+ s3 = boto3.client('s3')
49
+ s3.upload_file('results.json', 'my-bucket', 'results.json')
50
+ ```
51
+
52
+ **Google Cloud Storage:**
53
+ ```python
54
+ from google.cloud import storage
55
+ client = storage.Client()
56
+ bucket = client.bucket('my-bucket')
57
+ blob = bucket.blob('results.json')
58
+ blob.upload_from_filename('results.json')
59
+ ```
60
+
61
+ ### Option 3: API Endpoint
62
+
63
+ ```python
64
+ import requests
65
+ requests.post("https://your-api.com/results", json=results)
66
+ ```
67
+
68
+ ## Required Configuration for Hub Push
69
+
70
+ ### Job Configuration
71
+
72
+ **Always include HF_TOKEN:**
73
+ ```python
74
+ hf_jobs("uv", {
75
+ "script": "your_script.py",
76
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Required for Hub operations
77
+ })
78
+ ```
79
+
80
+ ### Script Configuration
81
+
82
+ **Verify token exists:**
83
+ ```python
84
+ import os
85
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required for Hub operations!"
86
+ ```
87
+
88
+ **Use token for Hub operations:**
89
+ ```python
90
+ from huggingface_hub import HfApi
91
+
92
+ # Auto-detects HF_TOKEN from environment
93
+ api = HfApi()
94
+
95
+ # Or explicitly pass token
96
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
97
+ ```
98
+
99
+ ## Complete Examples
100
+
101
+ ### Example 1: Push Dataset
102
+
103
+ ```python
104
+ hf_jobs("uv", {
105
+ "script": """
106
+ # /// script
107
+ # dependencies = ["datasets", "huggingface-hub"]
108
+ # ///
109
+
110
+ import os
111
+ from datasets import Dataset
112
+ from huggingface_hub import HfApi
113
+
114
+ # Verify token
115
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
116
+
117
+ # Process data
118
+ data = {"text": ["Sample 1", "Sample 2"]}
119
+ dataset = Dataset.from_dict(data)
120
+
121
+ # Push to Hub
122
+ dataset.push_to_hub("username/my-dataset")
123
+ print("✅ Dataset pushed!")
124
+ """,
125
+ "flavor": "cpu-basic",
126
+ "timeout": "30m",
127
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
128
+ })
129
+ ```
130
+
131
+ ### Example 2: Push Model
132
+
133
+ ```python
134
+ hf_jobs("uv", {
135
+ "script": """
136
+ # /// script
137
+ # dependencies = ["transformers"]
138
+ # ///
139
+
140
+ import os
141
+ from transformers import AutoModel, AutoTokenizer
142
+
143
+ # Verify token
144
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
145
+
146
+ # Load and process model
147
+ model = AutoModel.from_pretrained("base-model")
148
+ tokenizer = AutoTokenizer.from_pretrained("base-model")
149
+ # ... process model ...
150
+
151
+ # Push to Hub
152
+ model.push_to_hub("username/my-model")
153
+ tokenizer.push_to_hub("username/my-model")
154
+ print("✅ Model pushed!")
155
+ """,
156
+ "flavor": "a10g-large",
157
+ "timeout": "2h",
158
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
159
+ })
160
+ ```
161
+
162
+ ### Example 3: Push Artifacts
163
+
164
+ ```python
165
+ hf_jobs("uv", {
166
+ "script": """
167
+ # /// script
168
+ # dependencies = ["huggingface-hub", "pandas"]
169
+ # ///
170
+
171
+ import os
172
+ import json
173
+ import pandas as pd
174
+ from huggingface_hub import HfApi
175
+
176
+ # Verify token
177
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
178
+
179
+ # Generate results
180
+ results = {"accuracy": 0.95, "loss": 0.05}
181
+ df = pd.DataFrame([results])
182
+
183
+ # Save files
184
+ with open("results.json", "w") as f:
185
+ json.dump(results, f)
186
+ df.to_csv("results.csv", index=False)
187
+
188
+ # Push to Hub
189
+ api = HfApi()
190
+ api.upload_file("results.json", "results.json", "username/results", repo_type="dataset")
191
+ api.upload_file("results.csv", "results.csv", "username/results", repo_type="dataset")
192
+ print("✅ Results pushed!")
193
+ """,
194
+ "flavor": "cpu-basic",
195
+ "timeout": "30m",
196
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
197
+ })
198
+ ```
199
+
200
+ ## Authentication Methods
201
+
202
+ ### Method 1: Automatic Token (Recommended)
203
+
204
+ ```python
205
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
206
+ ```
207
+
208
+ Uses your logged-in Hugging Face token automatically.
209
+
210
+ ### Method 2: Explicit Token
211
+
212
+ ```python
213
+ "secrets": {"HF_TOKEN": "hf_abc123..."}
214
+ ```
215
+
216
+ Provide token explicitly (not recommended for security).
217
+
218
+ ### Method 3: Environment Variable
219
+
220
+ ```python
221
+ "env": {"HF_TOKEN": "hf_abc123..."}
222
+ ```
223
+
224
+ Pass as regular environment variable (less secure than secrets).
225
+
226
+ **Always prefer Method 1** for security and convenience.
227
+
228
+ ## Verification Checklist
229
+
230
+ Before submitting any job that saves to Hub, verify:
231
+
232
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
233
+ - [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
234
+ - [ ] Hub push code included in script
235
+ - [ ] Repository name doesn't conflict with existing repos
236
+ - [ ] You have write access to the target namespace
237
+
238
+ ## Repository Setup
239
+
240
+ ### Automatic Creation
241
+
242
+ If repository doesn't exist, it's created automatically when first pushing (if token has write permissions).
243
+
244
+ ### Manual Creation
245
+
246
+ Create repository before pushing:
247
+
248
+ ```python
249
+ from huggingface_hub import HfApi
250
+
251
+ api = HfApi()
252
+ api.create_repo(
253
+ repo_id="username/repo-name",
254
+ repo_type="model", # or "dataset"
255
+ private=False, # or True for private repo
256
+ )
257
+ ```
258
+
259
+ ### Repository Naming
260
+
261
+ **Valid names:**
262
+ - `username/my-model`
263
+ - `username/model-name`
264
+ - `organization/model-name`
265
+
266
+ **Invalid names:**
267
+ - `model-name` (missing username)
268
+ - `username/model name` (spaces not allowed)
269
+ - `username/MODEL` (uppercase is allowed but discouraged; prefer lowercase)
270
+
271
+ ## Troubleshooting
272
+
273
+ ### Error: 401 Unauthorized
274
+
275
+ **Cause:** HF_TOKEN not provided or invalid
276
+
277
+ **Solutions:**
278
+ 1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
279
+ 2. Check you're logged in: `hf_whoami()`
280
+ 3. Re-login: `hf auth login`
281
+
282
+ ### Error: 403 Forbidden
283
+
284
+ **Cause:** No write access to repository
285
+
286
+ **Solutions:**
287
+ 1. Check repository namespace matches your username
288
+ 2. Verify you're a member of organization (if using org namespace)
289
+ 3. Check token has write permissions
290
+
291
+ ### Error: Repository not found
292
+
293
+ **Cause:** Repository doesn't exist and auto-creation failed
294
+
295
+ **Solutions:**
296
+ 1. Manually create repository first
297
+ 2. Check repository name format
298
+ 3. Verify namespace exists
299
+
300
+ ### Error: Push failed
301
+
302
+ **Cause:** Network issues or Hub unavailable
303
+
304
+ **Solutions:**
305
+ 1. Check logs for specific error
306
+ 2. Verify token is valid
307
+ 3. Retry push operation
308
+
309
+ ## Best Practices
310
+
311
+ 1. **Always verify token exists** before Hub operations
312
+ 2. **Use descriptive repo names** (e.g., `my-experiment-results` not `results`)
313
+ 3. **Push incrementally** for large results (use checkpoints)
314
+ 4. **Verify push success** in logs before job completes
315
+ 5. **Use appropriate repo types** (model vs dataset)
316
+ 6. **Add README** with result descriptions
317
+ 7. **Tag repos** with relevant tags
318
+
319
+ ## Monitoring Push Progress
320
+
321
+ Check logs for push progress:
322
+
323
+ ```python
324
+ hf_jobs("logs", {"job_id": "your-job-id"})
325
+ ```
326
+
327
+ **Look for:**
328
+ ```
329
+ Pushing to username/repo-name...
330
+ Upload file results.json: 100%
331
+ ✅ Push successful
332
+ ```
333
+
334
+ ## Key Takeaway
335
+
336
+ **Without `secrets={"HF_TOKEN": "$HF_TOKEN"}` and persistence code, all results are permanently lost.**
337
+
338
+ Always verify both are configured before submitting any job that produces results.
339
+
references/token_usage.md ADDED
@@ -0,0 +1,546 @@
1
+ # Token Usage Guide for Hugging Face Jobs
2
+
3
+ **⚠️ CRITICAL:** Proper token usage is essential for any job that interacts with the Hugging Face Hub.
4
+
5
+ ## Overview
6
+
7
+ Hugging Face tokens are authentication credentials that allow your jobs to interact with the Hub. They're required for:
8
+ - Pushing models/datasets to Hub
9
+ - Accessing private repositories
10
+ - Creating new repositories
11
+ - Using Hub APIs programmatically
12
+ - Any authenticated Hub operations
13
+
14
+ ## Token Types
15
+
16
+ ### Read Token
17
+ - **Permissions:** Download models/datasets, read private repos
18
+ - **Use case:** Jobs that only need to download/read content
19
+ - **Creation:** https://huggingface.co/settings/tokens
20
+
21
+ ### Write Token
22
+ - **Permissions:** Push models/datasets, create repos, modify content
23
+ - **Use case:** Jobs that need to upload results (most common)
24
+ - **Creation:** https://huggingface.co/settings/tokens
25
+ - **⚠️ Required for:** Pushing models, datasets, or any uploads
26
+
27
+ ### Organization Token
28
+ - **Permissions:** Act on behalf of an organization
29
+ - **Use case:** Jobs running under organization namespace
30
+ - **Creation:** Organization settings → Tokens
31
+
32
+ ## Providing Tokens to Jobs
33
+
34
+ ### Method 1: Automatic Token (Recommended) ⭐
35
+
36
+ ```python
37
+ hf_jobs("uv", {
38
+ "script": "your_script.py",
39
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Automatic replacement
40
+ })
41
+ ```
42
+
43
+ **How it works:**
44
+ 1. `$HF_TOKEN` is a placeholder that gets replaced with your actual token
45
+ 2. Uses the token from your logged-in session (`hf auth login`)
46
+ 3. Token is encrypted server-side when passed as a secret
47
+ 4. Most secure and convenient method
48
+
49
+ **Benefits:**
50
+ - ✅ No token exposure in code
51
+ - ✅ Uses your current login session
52
+ - ✅ Automatically updated if you re-login
53
+ - ✅ Works seamlessly with MCP tools
54
+ - ✅ Token encrypted server-side
55
+
56
+ **Requirements:**
57
+ - Must be logged in: run `hf auth login` and confirm `hf_whoami()` works
58
+ - Token must have required permissions
59
+
60
+ ### Method 2: Explicit Token (Not Recommended)
61
+
62
+ ```python
63
+ hf_jobs("uv", {
64
+ "script": "your_script.py",
65
+ "secrets": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Hardcoded token
66
+ })
67
+ ```
68
+
69
+ **When to use:**
70
+ - Only if automatic token doesn't work
71
+ - Testing with a specific token
72
+ - Organization tokens (use with caution)
73
+
74
+ **Security concerns:**
75
+ - ❌ Token visible in code/logs
76
+ - ❌ Must manually update if token rotates
77
+ - ❌ Risk of token exposure
78
+ - ❌ Not recommended for production
79
+
80
+ ### Method 3: Environment Variable (Less Secure)
81
+
82
+ ```python
83
+ hf_jobs("uv", {
84
+ "script": "your_script.py",
85
+ "env": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Less secure than secrets
86
+ })
87
+ ```
88
+
89
+ **Difference from secrets:**
90
+ - `env` variables are visible in job logs
91
+ - `secrets` are encrypted server-side
92
+ - Always prefer `secrets` for tokens
93
+
94
+ **When to use:**
95
+ - Only for non-sensitive configuration
96
+ - Never use for tokens (use `secrets` instead)
97
+
98
+ ## Using Tokens in Scripts
99
+
100
+ ### Accessing Tokens
101
+
102
+ Tokens passed via `secrets` are available as environment variables in your script:
103
+
104
+ ```python
105
+ import os
106
+
107
+ # Get token from environment
108
+ token = os.environ.get("HF_TOKEN")
109
+
110
+ # Verify token exists
111
+ if not token:
112
+ raise ValueError("HF_TOKEN not found in environment!")
113
+ ```
114
+
115
+ ### Using with Hugging Face Hub
116
+
117
+ **Option 1: Explicit token parameter**
118
+ ```python
119
+ from huggingface_hub import HfApi
120
+
121
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
122
+ api.upload_file(...)
123
+ ```
124
+
125
+ **Option 2: Auto-detection (Recommended)**
126
+ ```python
127
+ from huggingface_hub import HfApi
128
+
129
+ # Automatically uses HF_TOKEN env var
130
+ api = HfApi() # ✅ Simpler, uses token from environment
131
+ api.upload_file(...)
132
+ ```
133
+
134
+ **Option 3: With transformers/datasets**
135
+ ```python
136
+ from transformers import AutoModel
137
+ from datasets import load_dataset
138
+
139
+ # Auto-detects HF_TOKEN from environment
140
+ model = AutoModel.from_pretrained("username/model")
141
+ dataset = load_dataset("username/dataset")
142
+
143
+ # For push operations, token is auto-detected
144
+ model.push_to_hub("username/new-model")
145
+ dataset.push_to_hub("username/new-dataset")
146
+ ```
147
+
148
+ ### Complete Example
149
+
150
+ ```python
151
+ # /// script
152
+ # dependencies = ["huggingface-hub", "datasets"]
153
+ # ///
154
+
155
+ import os
156
+ from huggingface_hub import HfApi
157
+ from datasets import Dataset
158
+
159
+ # Verify token is available
160
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required for Hub operations!"
161
+
162
+ # Use token for Hub operations
163
+ api = HfApi() # Auto-detects HF_TOKEN
164
+
165
+ # Create and push dataset
166
+ data = {"text": ["Hello", "World"]}
167
+ dataset = Dataset.from_dict(data)
168
+
169
+ # Push to Hub (token auto-detected)
170
+ dataset.push_to_hub("username/my-dataset")
171
+
172
+ print("✅ Dataset pushed successfully!")
173
+ ```
174
+
175
+ ## Token Verification
176
+
177
+ ### Check Authentication Locally
178
+
179
+ ```python
180
+ from huggingface_hub import whoami
181
+
182
+ try:
183
+ user_info = whoami()
184
+ print(f"✅ Logged in as: {user_info['name']}")
185
+ except Exception as e:
186
+ print(f"❌ Not authenticated: {e}")
187
+ ```
188
+
189
+ ### Verify Token in Job
190
+
191
+ ```python
192
+ import os
193
+
194
+ # Check token exists
195
+ if "HF_TOKEN" not in os.environ:
196
+ raise ValueError("HF_TOKEN not found in environment!")
197
+
198
+ token = os.environ["HF_TOKEN"]
199
+
200
+ # Verify token format (should start with "hf_")
201
+ if not token.startswith("hf_"):
202
+ raise ValueError(f"Invalid token format: {token[:10]}...")
203
+
204
+ # Test token works
205
+ from huggingface_hub import whoami
206
+ try:
207
+ user_info = whoami(token=token)
208
+ print(f"✅ Token valid for user: {user_info['name']}")
209
+ except Exception as e:
210
+ raise ValueError(f"Token validation failed: {e}")
211
+ ```
212
+
213
+ ## Common Token Issues
214
+
215
+ ### Error: 401 Unauthorized
216
+
217
+ **Symptoms:**
218
+ ```
219
+ 401 Client Error: Unauthorized for url: https://huggingface.co/api/...
220
+ ```
221
+
222
+ **Causes:**
223
+ 1. Token missing from job
224
+ 2. Token invalid or expired
225
+ 3. Token not passed correctly
226
+
227
+ **Solutions:**
228
+ 1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
229
+ 2. Verify `hf_whoami()` works locally
230
+ 3. Re-login: `hf auth login`
231
+ 4. Check token hasn't expired
232
+
233
+ **Verification:**
234
+ ```python
235
+ # In your script
236
+ import os
237
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
238
+ ```
239
+
240
+ ### Error: 403 Forbidden
241
+
242
+ **Symptoms:**
243
+ ```
244
+ 403 Client Error: Forbidden for url: https://huggingface.co/api/...
245
+ ```
246
+
247
+ **Causes:**
248
+ 1. Token lacks required permissions (read-only token used for write)
249
+ 2. No access to private repository
250
+ 3. Organization permissions insufficient
251
+
252
+ **Solutions:**
253
+ 1. Ensure token has write permissions
254
+ 2. Check token type at https://huggingface.co/settings/tokens
255
+ 3. Verify access to target repository
256
+ 4. Use organization token if needed
257
+
258
+ **Check token permissions:**
259
+ ```python
260
+ from huggingface_hub import whoami
261
+
262
+ user_info = whoami()
263
+ print(f"User: {user_info['name']}")
264
+ print(f"Type: {user_info.get('type', 'user')}")
265
+ ```
266
+
267
+ ### Error: Token not found in environment
268
+
269
+ **Symptoms:**
270
+ ```
271
+ KeyError: 'HF_TOKEN'
272
+ ValueError: HF_TOKEN not found
273
+ ```
274
+
275
+ **Causes:**
276
+ 1. `secrets` not passed in job config
277
+ 2. Wrong key name (should be `HF_TOKEN`)
278
+ 3. Using `env` instead of `secrets`
279
+
280
+ **Solutions:**
281
+ 1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
282
+ 2. Verify key name is exactly `HF_TOKEN`
283
+ 3. Check job config syntax
284
+
285
+ **Correct configuration:**
286
+ ```python
287
+ # ✅ Correct
288
+ hf_jobs("uv", {
289
+ "script": "...",
290
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
291
+ })
292
+
293
+ # ❌ Wrong - using env instead of secrets
294
+ hf_jobs("uv", {
295
+ "script": "...",
296
+ "env": {"HF_TOKEN": "$HF_TOKEN"} # Less secure
297
+ })
298
+
299
+ # ❌ Wrong - wrong key name
300
+ hf_jobs("uv", {
301
+ "script": "...",
302
+ "secrets": {"TOKEN": "$HF_TOKEN"} # Wrong key
303
+ })
304
+ ```
305
+
306
+ ### Error: Repository access denied
307
+
308
+ **Symptoms:**
309
+ ```
310
+ 403 Client Error: Forbidden
311
+ Repository not found or access denied
312
+ ```
313
+
314
+ **Causes:**
315
+ 1. Token doesn't have access to private repo
316
+ 2. Repository doesn't exist and can't be created
317
+ 3. Wrong namespace
318
+
319
+ **Solutions:**
320
+ 1. Use token from account with access
321
+ 2. Verify repo visibility (public vs private)
322
+ 3. Check namespace matches token owner
323
+ 4. Create repo first if needed
324
+
325
+ **Check repository access:**
326
+ ```python
327
+ from huggingface_hub import HfApi
328
+
329
+ api = HfApi()
330
+ try:
331
+ repo_info = api.repo_info("username/repo-name")
332
+ print(f"✅ Access granted: {repo_info.id}")
333
+ except Exception as e:
334
+ print(f"❌ Access denied: {e}")
335
+ ```
336
+
337
+ ## Token Security Best Practices
338
+
339
+ ### 1. Never Commit Tokens
340
+
341
+ **❌ Bad:**
342
+ ```python
343
+ # Never do this!
344
+ token = "hf_abc123xyz..."
345
+ api = HfApi(token=token)
346
+ ```
347
+
348
+ **✅ Good:**
349
+ ```python
350
+ # Use environment variable
351
+ token = os.environ.get("HF_TOKEN")
352
+ api = HfApi(token=token)
353
+ ```
354
+
355
+ ### 2. Use Secrets, Not Environment Variables
356
+
357
+ **❌ Bad:**
358
+ ```python
359
+ hf_jobs("uv", {
360
+ "script": "...",
361
+ "env": {"HF_TOKEN": "$HF_TOKEN"} # Visible in logs
362
+ })
363
+ ```
364
+
365
+ **✅ Good:**
366
+ ```python
367
+ hf_jobs("uv", {
368
+ "script": "...",
369
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Encrypted server-side
370
+ })
371
+ ```
372
+
373
+ ### 3. Use Automatic Token Replacement
374
+
375
+ **❌ Bad:**
376
+ ```python
377
+ hf_jobs("uv", {
378
+ "script": "...",
379
+ "secrets": {"HF_TOKEN": "hf_abc123..."} # Hardcoded
380
+ })
381
+ ```
382
+
383
+ **✅ Good:**
384
+ ```python
385
+ hf_jobs("uv", {
386
+ "script": "...",
387
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Automatic
388
+ })
389
+ ```
390
+
391
+ ### 4. Rotate Tokens Regularly
392
+
393
+ - Generate new tokens periodically
394
+ - Revoke old tokens
395
+ - Update job configurations
396
+ - Monitor token usage
397
+
398
+ ### 5. Use Minimal Permissions
399
+
400
+ - Create tokens with only needed permissions
401
+ - Use read tokens when write isn't needed
402
+ - Don't use admin tokens for regular jobs
403
+
404
+ ### 6. Don't Share Tokens
405
+
406
+ - Each user should use their own token
407
+ - Don't commit tokens to repositories
408
+ - Don't share tokens in logs or messages
409
+
410
+ ### 7. Monitor Token Usage
411
+
412
+ - Check token activity in Hub settings
413
+ - Review job logs for token issues
414
+ - Set up alerts for unauthorized access
415
+
416
+ ## Token Workflow Examples
417
+
418
+ ### Example 1: Push Model to Hub
419
+
420
+ ```python
421
+ hf_jobs("uv", {
422
+ "script": """
423
+ # /// script
424
+ # dependencies = ["transformers"]
425
+ # ///
426
+
427
+ import os
428
+ from transformers import AutoModel, AutoTokenizer
429
+
430
+ # Verify token
431
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
432
+
433
+ # Load and process model
434
+ model = AutoModel.from_pretrained("base-model")
435
+ # ... process model ...
436
+
437
+ # Push to Hub (token auto-detected)
438
+ model.push_to_hub("username/my-model")
439
+ print("✅ Model pushed!")
440
+ """,
441
+ "flavor": "a10g-large",
442
+ "timeout": "2h",
443
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided
444
+ })
445
+ ```
446
+
447
+ ### Example 2: Access Private Dataset
448
+
449
+ ```python
450
+ hf_jobs("uv", {
451
+ "script": """
452
+ # /// script
453
+ # dependencies = ["datasets"]
454
+ # ///
455
+
456
+ import os
457
+ from datasets import load_dataset
458
+
459
+ # Verify token
460
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
461
+
462
+ # Load private dataset (token auto-detected)
463
+ dataset = load_dataset("private-org/private-dataset")
464
+ print(f"✅ Loaded {len(dataset)} examples")
465
+ """,
466
+ "flavor": "cpu-basic",
467
+ "timeout": "30m",
468
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided
469
+ })
470
+ ```
471
+
472
+ ### Example 3: Create and Push Dataset
473
+
474
+ ```python
475
+ hf_jobs("uv", {
476
+ "script": """
477
+ # /// script
478
+ # dependencies = ["datasets", "huggingface-hub"]
479
+ # ///
480
+
481
+ import os
482
+ from datasets import Dataset
483
+ from huggingface_hub import HfApi
484
+
485
+ # Verify token
486
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
487
+
488
+ # Create dataset
489
+ data = {"text": ["Sample 1", "Sample 2"]}
490
+ dataset = Dataset.from_dict(data)
491
+
492
+ # Push to Hub
493
+ api = HfApi() # Auto-detects HF_TOKEN
494
+ dataset.push_to_hub("username/my-dataset")
495
+ print("✅ Dataset pushed!")
496
+ """,
497
+ "flavor": "cpu-basic",
498
+ "timeout": "30m",
499
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided
500
+ })
501
+ ```
502
+
503
+ ## Quick Reference
504
+
505
+ ### Token Checklist
506
+
507
+ Before submitting a job that uses Hub:
508
+
509
+ - [ ] Job includes `secrets={"HF_TOKEN": "$HF_TOKEN"}`
510
+ - [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
511
+ - [ ] Token has required permissions (read/write)
512
+ - [ ] User is logged in: `hf_whoami()` works
513
+ - [ ] Token not hardcoded in script
514
+ - [ ] Using `secrets` not `env` for token
515
+
516
+ ### Common Patterns
517
+
518
+ **Pattern 1: Auto-detect token**
519
+ ```python
520
+ from huggingface_hub import HfApi
521
+ api = HfApi() # Uses HF_TOKEN from environment
522
+ ```
523
+
524
+ **Pattern 2: Explicit token**
525
+ ```python
526
+ import os
527
+ from huggingface_hub import HfApi
528
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
529
+ ```
530
+
531
+ **Pattern 3: Verify token**
532
+ ```python
533
+ import os
534
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
535
+ ```
536
+
537
+ ## Key Takeaways
538
+
539
+ 1. **Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}`** for Hub operations
540
+ 2. **Never hardcode tokens** in scripts or job configs
541
+ 3. **Verify token exists** in script before Hub operations
542
+ 4. **Use auto-detection** when possible (`HfApi()` without token parameter)
543
+ 5. **Check permissions** - ensure token has required access
544
+ 6. **Monitor token usage** - review activity regularly
545
+ 7. **Rotate tokens** - generate new tokens periodically
546
+
references/troubleshooting.md ADDED
@@ -0,0 +1,431 @@
1
+ # Troubleshooting Guide
2
+
3
+ Common issues and solutions for Hugging Face Jobs.
4
+
5
+ ## Authentication Issues
6
+
7
+ ### Error: 401 Unauthorized
8
+
9
+ **Symptoms:**
10
+ ```
11
+ 401 Client Error: Unauthorized for url: https://huggingface.co/api/...
12
+ ```
13
+
14
+ **Causes:**
15
+ - Token missing from job
16
+ - Token invalid or expired
17
+ - Token not passed correctly
18
+
19
+ **Solutions:**
20
+ 1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
21
+ 2. Verify `hf_whoami()` works locally
22
+ 3. Re-login: `hf auth login`
23
+ 4. Check token hasn't expired
24
+
25
+ **Verification:**
26
+ ```python
27
+ # In your script
28
+ import os
29
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
30
+ ```
31
+
32
+ ### Error: 403 Forbidden
33
+
34
+ **Symptoms:**
35
+ ```
36
+ 403 Client Error: Forbidden for url: https://huggingface.co/api/...
37
+ ```
38
+
39
+ **Causes:**
40
+ - Token lacks required permissions
41
+ - No access to private repository
42
+ - Organization permissions insufficient
43
+
44
+ **Solutions:**
45
+ 1. Ensure token has write permissions
46
+ 2. Check token type at https://huggingface.co/settings/tokens
47
+ 3. Verify access to target repository
48
+ 4. Use organization token if needed
49
+
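+ **Quick check** (a minimal sketch; `username/target-repo` is a placeholder):
+
+ ```python
+ from huggingface_hub import HfApi, whoami
+
+ print(f"Authenticated as: {whoami()['name']}")  # uses HF_TOKEN from the environment
+
+ # Confirm the token can actually see the target repo before pushing
+ try:
+     HfApi().repo_info("username/target-repo", repo_type="dataset")
+     print("✅ Repository is accessible")
+ except Exception as e:
+     print(f"❌ Cannot access repository: {e}")
+ ```
+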
50
+ ### Error: Token not found in environment
51
+
52
+ **Symptoms:**
53
+ ```
54
+ KeyError: 'HF_TOKEN'
55
+ ValueError: HF_TOKEN not found
56
+ ```
57
+
58
+ **Causes:**
59
+ - `secrets` not passed in job config
60
+ - Wrong key name (should be `HF_TOKEN`)
61
+ - Using `env` instead of `secrets`
62
+
63
+ **Solutions:**
64
+ 1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
65
+ 2. Verify key name is exactly `HF_TOKEN`
66
+ 3. Check job config syntax
67
+
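+ **Correct configuration:**
+
+ ```python
+ # ✅ Use `secrets` (encrypted server-side), key name exactly HF_TOKEN
+ hf_jobs("uv", {
+     "script": "your_script.py",
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"}
+ })
+ ```
+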
68
+ ## Job Execution Issues
69
+
70
+ ### Error: Job Timeout
71
+
72
+ **Symptoms:**
73
+ - Job stops unexpectedly
74
+ - Status shows "TIMEOUT"
75
+ - Partial results only
76
+
77
+ **Causes:**
78
+ - Default 30min timeout exceeded
79
+ - Job takes longer than expected
80
+ - No timeout specified
81
+
82
+ **Solutions:**
83
+ 1. Check logs for actual runtime
84
+ 2. Increase timeout with buffer: `"timeout": "3h"`
85
+ 3. Optimize code for faster execution
86
+ 4. Process data in chunks
87
+ 5. Add 20-30% buffer to estimated time
88
+
89
+ **Example:**
90
+ ```python
91
+ hf_jobs("uv", {
92
+ "script": "...",
93
+ "timeout": "2h" # Set appropriate timeout
94
+ })
95
+ ```
96
+
97
+ ### Error: Out of Memory (OOM)
98
+
99
+ **Symptoms:**
100
+ ```
101
+ RuntimeError: CUDA out of memory
102
+ MemoryError: Unable to allocate array
103
+ ```
104
+
105
+ **Causes:**
106
+ - Batch size too large
107
+ - Model too large for hardware
108
+ - Insufficient GPU memory
109
+
110
+ **Solutions:**
111
+ 1. Reduce batch size
112
+ 2. Process data in smaller chunks
113
+ 3. Upgrade hardware: cpu → t4 → a10g → a100
114
+ 4. Use smaller models or quantization
115
+ 5. Enable gradient checkpointing (for training)
116
+
117
+ **Example:**
118
+ ```python
119
+ # Reduce batch size
120
+ batch_size = 1
121
+
122
+ # Process in chunks
123
+ for chunk in chunks:
124
+ process(chunk)
125
+ ```
126
+
127
+ ### Error: Missing Dependencies
128
+
129
+ **Symptoms:**
130
+ ```
131
+ ModuleNotFoundError: No module named 'package_name'
132
+ ImportError: cannot import name 'X'
133
+ ```
134
+
135
+ **Causes:**
136
+ - Package not in dependencies
137
+ - Wrong package name
138
+ - Version mismatch
139
+
140
+ **Solutions:**
141
+ 1. Add to PEP 723 header:
142
+ ```python
143
+ # /// script
144
+ # dependencies = ["package-name>=1.0.0"]
145
+ # ///
146
+ ```
147
+ 2. Check package name spelling
148
+ 3. Specify version if needed
149
+ 4. Check package availability
150
+
151
+ ### Error: Script Not Found
152
+
153
+ **Symptoms:**
154
+ ```
155
+ FileNotFoundError: script.py not found
156
+ ```
157
+
158
+ **Causes:**
159
+ - Local file path used (not supported)
160
+ - URL incorrect
161
+ - Script not accessible
162
+
163
+ **Solutions:**
164
+ 1. Use inline script (recommended)
165
+ 2. Use publicly accessible URL
166
+ 3. Upload script to Hub first
167
+ 4. Check URL is correct
168
+
169
+ **Correct approaches:**
170
+ ```python
171
+ # ✅ Inline code
172
+ hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})
173
+
174
+ # ✅ From URL
175
+ hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
176
+ ```
177
+
178
+ ## Hub Push Issues
179
+
180
+ ### Error: Push Failed
181
+
182
+ **Symptoms:**
183
+ ```
184
+ Error pushing to Hub
185
+ Upload failed
186
+ ```
187
+
188
+ **Causes:**
189
+ - Network issues
190
+ - Token missing or invalid
191
+ - Repository access denied
192
+ - File too large
193
+
194
+ **Solutions:**
195
+ 1. Check token: `assert "HF_TOKEN" in os.environ`
196
+ 2. Verify repository exists or can be created
197
+ 3. Check network connectivity in logs
198
+ 4. Retry push operation (see the retry sketch below)
199
+ 5. Split large files into chunks
200
+
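+ **Retry sketch** (simple backoff around the push; the retry count and the commented repo name are illustrative):
+
+ ```python
+ import time
+
+ def push_with_retries(dataset, repo_id, max_retries=3):
+     """Retry push_to_hub a few times before giving up on transient errors."""
+     for attempt in range(1, max_retries + 1):
+         try:
+             dataset.push_to_hub(repo_id)
+             print(f"✅ Push succeeded on attempt {attempt}")
+             return
+         except Exception as e:
+             print(f"⚠️ Push attempt {attempt} failed: {e}")
+             if attempt == max_retries:
+                 raise
+             time.sleep(2 ** attempt)  # back off: 2s, 4s, 8s...
+
+ # push_with_retries(dataset, "username/my-dataset")
+ ```
+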
201
+ ### Error: Repository Not Found
202
+
203
+ **Symptoms:**
204
+ ```
205
+ 404 Client Error: Not Found
206
+ Repository not found
207
+ ```
208
+
209
+ **Causes:**
210
+ - Repository doesn't exist
211
+ - Wrong repository name
212
+ - No access to private repo
213
+
214
+ **Solutions:**
215
+ 1. Create repository first:
216
+ ```python
217
+ from huggingface_hub import HfApi
218
+ api = HfApi()
219
+ api.create_repo("username/repo-name", repo_type="dataset")
220
+ ```
221
+ 2. Check repository name format
222
+ 3. Verify namespace exists
223
+ 4. Check repository visibility
224
+
225
+ ### Error: Results Not Saved
226
+
227
+ **Symptoms:**
228
+ - Job completes successfully
229
+ - No results visible on Hub
230
+ - Files not persisted
231
+
232
+ **Causes:**
233
+ - No persistence code in script
234
+ - Push code not executed
235
+ - Push failed silently
236
+
237
+ **Solutions:**
238
+ 1. Add persistence code to script
239
+ 2. Verify push executes successfully
240
+ 3. Check logs for push errors
241
+ 4. Add error handling around push
242
+
243
+ **Example:**
244
+ ```python
245
+ try:
246
+ dataset.push_to_hub("username/dataset")
247
+ print("✅ Push successful")
248
+ except Exception as e:
249
+ print(f"❌ Push failed: {e}")
250
+ raise
251
+ ```
252
+
253
+ ## Hardware Issues
254
+
255
+ ### Error: GPU Not Available
256
+
257
+ **Symptoms:**
258
+ ```
259
+ CUDA not available
260
+ No GPU found
261
+ ```
262
+
263
+ **Causes:**
264
+ - CPU flavor used instead of GPU
265
+ - GPU not requested
266
+ - CUDA not installed in image
267
+
268
+ **Solutions:**
269
+ 1. Use GPU flavor: `"flavor": "a10g-large"`
270
+ 2. Check image has CUDA support
271
+ 3. Verify GPU availability in logs
272
+
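+ **Verification sketch** (add near the top of the script to fail fast; assumes `torch` is among your dependencies):
+
+ ```python
+ import torch
+
+ if not torch.cuda.is_available():
+     raise RuntimeError("No GPU found - request a GPU flavor such as 'a10g-large'")
+ print(f"✅ Using GPU: {torch.cuda.get_device_name(0)}")
+ ```
+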
273
+ ### Error: Slow Performance
274
+
275
+ **Symptoms:**
276
+ - Job takes longer than expected
277
+ - Low GPU utilization
278
+ - CPU bottleneck
279
+
280
+ **Causes:**
281
+ - Wrong hardware selected
282
+ - Inefficient code
283
+ - Data loading bottleneck
284
+
285
+ **Solutions:**
286
+ 1. Upgrade hardware
287
+ 2. Optimize code
288
+ 3. Use batch processing
289
+ 4. Profile code to find bottlenecks
290
+
291
+ ## General Issues
292
+
293
+ ### Error: Job Status Unknown
294
+
295
+ **Symptoms:**
296
+ - Can't check job status
297
+ - Status API returns error
298
+
299
+ **Solutions:**
300
+ 1. Use job URL: `https://huggingface.co/jobs/username/job-id`
301
+ 2. Check logs: `hf_jobs("logs", {"job_id": "..."})`
302
+ 3. Inspect job: `hf_jobs("inspect", {"job_id": "..."})`
303
+
304
+ ### Error: Logs Not Available
305
+
306
+ **Symptoms:**
307
+ - No logs visible
308
+ - Logs delayed
309
+
310
+ **Causes:**
311
+ - Job just started (logs delayed 30-60s)
312
+ - Job failed before logging
313
+ - Logs not yet generated
314
+
315
+ **Solutions:**
316
+ 1. Wait 30-60 seconds after job start
317
+ 2. Check job status first
318
+ 3. Use job URL for web interface
319
+
320
+ ### Error: Cost Unexpectedly High
321
+
322
+ **Symptoms:**
323
+ - Job costs more than expected
324
+ - Longer runtime than estimated
325
+
326
+ **Causes:**
327
+ - Job ran longer than timeout
328
+ - Wrong hardware selected
329
+ - Inefficient code
330
+
331
+ **Solutions:**
332
+ 1. Monitor job runtime
333
+ 2. Set appropriate timeout
334
+ 3. Optimize code
335
+ 4. Choose right hardware
336
+ 5. Check cost estimates before running
337
+
338
+ ## Debugging Tips
339
+
340
+ ### 1. Add Logging
341
+
342
+ ```python
343
+ import logging
344
+ logging.basicConfig(level=logging.INFO)
345
+ logger = logging.getLogger(__name__)
346
+
347
+ logger.info("Starting processing...")
348
+ logger.info(f"Processed {count} items")
349
+ ```
350
+
351
+ ### 2. Verify Environment
352
+
353
+ ```python
354
+ import os
+ import sys
+ import torch
+
+ print(f"Python version: {sys.version}")
+ print(f"CUDA available: {torch.cuda.is_available()}")
357
+ print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
358
+ ```
359
+
360
+ ### 3. Test Locally First
361
+
362
+ Run script locally before submitting to catch errors early:
363
+ ```bash
364
+ python script.py
365
+ ```
366
+
367
+ ### 4. Check Job Logs
368
+
369
+ ```python
370
+ # View logs
371
+ hf_jobs("logs", {"job_id": "your-job-id"})
372
+
373
+ # Or use job URL
374
+ # https://huggingface.co/jobs/username/job-id
375
+ ```
376
+
377
+ ### 5. Add Error Handling
378
+
379
+ ```python
380
+ try:
381
+ # Your code
382
+ process_data()
383
+ except Exception as e:
384
+ print(f"Error: {e}")
385
+ import traceback
386
+ traceback.print_exc()
387
+ raise
388
+ ```
389
+
390
+ ## Quick Reference
391
+
392
+ ### Common Error Codes
393
+
394
+ | Code | Meaning | Solution |
395
+ |------|---------|----------|
396
+ | 401 | Unauthorized | Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` |
397
+ | 403 | Forbidden | Check token permissions |
398
+ | 404 | Not Found | Verify repository exists |
399
+ | 500 | Server Error | Retry or contact support |
400
+
401
+ ### Checklist Before Submitting
402
+
403
+ - [ ] Token configured: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
404
+ - [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
405
+ - [ ] Timeout set appropriately
406
+ - [ ] Hardware selected correctly
407
+ - [ ] Dependencies listed in PEP 723 header
408
+ - [ ] Persistence code included
409
+ - [ ] Error handling added
410
+ - [ ] Logging added for debugging
411
+
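+ A minimal job submission that ticks the boxes above (the script body, repo name, flavor, and timeout are illustrative):
+
+ ```python
+ hf_jobs("uv", {
+     "script": """
+ # /// script
+ # dependencies = ["datasets"]          # dependencies in PEP 723 header
+ # ///
+ import logging, os
+ from datasets import Dataset
+
+ logging.basicConfig(level=logging.INFO)                 # logging for debugging
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"    # token check
+
+ try:
+     Dataset.from_dict({"text": ["hello"]}).push_to_hub("username/results")  # persistence
+     print("✅ Push successful")
+ except Exception as e:
+     print(f"❌ Push failed: {e}")
+     raise
+ """,
+     "flavor": "cpu-basic",                 # hardware selected
+     "timeout": "30m",                      # timeout set
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # token configured
+ })
+ ```
+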
412
+ ## Getting Help
413
+
414
+ If issues persist:
415
+
416
+ 1. **Check logs** - Most errors include detailed messages
417
+ 2. **Review documentation** - See main SKILL.md
418
+ 3. **Check Hub status** - https://status.huggingface.co
419
+ 4. **Community forums** - https://discuss.huggingface.co
420
+ 5. **GitHub issues** - For bugs in huggingface_hub
421
+
422
+ ## Key Takeaways
423
+
424
+ 1. **Always include token** - `secrets={"HF_TOKEN": "$HF_TOKEN"}`
425
+ 2. **Set appropriate timeout** - Default 30min may be insufficient
426
+ 3. **Verify persistence** - Results won't persist without code
427
+ 4. **Check logs** - Most issues visible in job logs
428
+ 5. **Test locally** - Catch errors before submitting
429
+ 6. **Add error handling** - Better debugging information
430
+ 7. **Monitor costs** - Set timeouts to avoid unexpected charges
431
+
scripts/cot-self-instruct.py ADDED
@@ -0,0 +1,718 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "datasets",
5
+ # "transformers",
6
+ # "vllm>=0.6.5",
7
+ # "huggingface-hub[hf_transfer]",
8
+ # "torch",
9
+ # "numpy",
10
+ # "tqdm",
11
+ # "scikit-learn",
12
+ # ]
13
+ # ///
14
+ """
15
+ Generate high-quality synthetic data using Chain-of-Thought Self-Instruct methodology.
16
+
17
+ This script implements the CoT-Self-Instruct approach from the paper "CoT-Self-Instruct:
18
+ Building high-quality synthetic prompts for reasoning and non-reasoning tasks" (2025).
19
+
20
+ It supports two modes:
21
+ 1. Reasoning tasks: Generates both questions and answers with Chain-of-Thought
22
+ 2. Instruction tasks: Generates diverse prompts for general instruction following
23
+
24
+ Example usage:
25
+ # Reasoning tasks with Answer-Consistency filtering
26
+ uv run cot-self-instruct.py \\
27
+ --seed-dataset davanstrien/s1k-reasoning \\
28
+ --output-dataset username/synthetic-math \\
29
+ --task-type reasoning \\
30
+ --num-samples 5000 \\
31
+ --filter-method answer-consistency
32
+
33
+ # Instruction tasks with RIP filtering
34
+ uv run cot-self-instruct.py \\
35
+ --seed-dataset wildchat-filtered \\
36
+ --output-dataset username/synthetic-prompts \\
37
+ --task-type instruction \\
38
+ --filter-method rip \\
39
+ --reward-model Nexusflow/Athene-RM-8B
40
+
41
+ # HF Jobs execution
42
+ hf jobs uv run --flavor l4x4 \\
43
+ --image vllm/vllm-openai \\
44
+ -e HF_TOKEN=$(python3 -c "from huggingface_hub import get_token; print(get_token())") \\
45
+ https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
46
+ [args...]
47
+ """
48
+
49
+ import argparse
50
+ import json
51
+ import logging
52
+ import os
53
+ import random
54
+ import re
55
+ import sys
56
+ from collections import Counter
57
+ from datetime import datetime
58
+ from typing import Dict, List, Optional, Tuple, Union
59
+
60
+ import numpy as np
61
+ import torch
62
+ from datasets import Dataset, load_dataset
63
+ from huggingface_hub import DatasetCard, login
64
+ from sklearn.cluster import KMeans
65
+ from tqdm.auto import tqdm
66
+ from transformers import AutoTokenizer
67
+ from vllm import LLM, SamplingParams
68
+
69
+ # Enable HF Transfer for faster downloads
70
+ os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
71
+
72
+ logging.basicConfig(
73
+ level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
74
+ )
75
+ logger = logging.getLogger(__name__)
76
+
77
+ # Prompt templates from the paper
78
+ REASONING_PROMPT_TEMPLATE = """You are a reasoning question generator assistant. Your goal is to create a novel, and challenging reasoning question. You are provided the following seed questions:
79
+ Seed Question 1: {seed1}
80
+ Seed Question 2: {seed2}
81
+ Your task is to:
82
+ 1. Write a brand-new, self-contained reasoning question that meets the following requirements:
83
+ (a) The question draws inspiration from the seed question without copying it verbatim, remaining novel and of comparable difficulty.
84
+ (b) The question's final answer should be a single, unambiguous scalar value (e.g., an integer, reduced fraction, exact radical), or another answer type that can be verified in one step (e.g., 'yes/no,' a choice from A to D).
85
+ 2. Then reason step by step, solve the new question and format your output as follows:
86
+ [New Question Begin]{{your_generated_question}}[New Question End]
87
+ [Final Answer to New Question Begin]\\boxed{{your_final_answer}}[Final Answer to New Question End]"""
88
+
89
+ INSTRUCTION_PROMPT_TEMPLATE = """You are a prompt generator assistant. Your goal is to create diverse and creative synthetic prompts.
90
+ Please follow the steps below to create synthetic prompts.
91
+ Step 1: Carefully read #Prompt 1# and #Prompt 2#. Identify and list all the common elements between these two prompts. If no common elements are found, list the main elements from each prompt.
92
+ Step 2: Develop a comprehensive plan based on the #Common Elements List# or #Main Elements List# from Step 1. This plan will guide the generation of new synthetic prompts that are similar to the original prompts.
93
+ Step 3: Execute the plan step by step and provide one #Synthetic Prompt#.
94
+ Please reply strictly in the following format:
95
+ - Step 1 #Common Elements List# or #Main Elements List#:
96
+ - Step 2 #Plan#:
97
+ - Step 3 #Synthetic Prompt#:
98
+ #Prompt 1#:
99
+ {prompt1}
100
+ #Prompt 2#:
101
+ {prompt2}"""
102
+
103
+
104
+ def check_gpu_availability() -> int:
105
+ """Check if CUDA is available and return the number of GPUs."""
106
+ if not torch.cuda.is_available():
107
+ logger.error("CUDA is not available. This script requires a GPU.")
108
+ logger.error(
109
+ "Please run on a machine with NVIDIA GPU or use HF Jobs with GPU flavor."
110
+ )
111
+ sys.exit(1)
112
+
113
+ num_gpus = torch.cuda.device_count()
114
+ for i in range(num_gpus):
115
+ gpu_name = torch.cuda.get_device_name(i)
116
+ gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
117
+ logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.1f} GB memory")
118
+
119
+ return num_gpus
120
+
121
+
122
+ def parse_thinking_output(text: str) -> str:
123
+ """Remove thinking tokens from model output."""
124
+ # Remove <think>...</think> blocks
125
+ text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
126
+ return text.strip()
127
+
128
+
129
+ def extract_reasoning_output(text: str) -> Tuple[Optional[str], Optional[str]]:
130
+ """Extract question and answer from reasoning task output."""
131
+ text = parse_thinking_output(text)
132
+
133
+ # Extract question
134
+ question_match = re.search(r'\[New Question Begin\](.*?)\[New Question End\]', text, re.DOTALL)
135
+ if not question_match:
136
+ return None, None
137
+ question = question_match.group(1).strip()
138
+
139
+ # Extract answer
140
+ answer_match = re.search(r'\[Final Answer to New Question Begin\]\\?boxed\{(.*?)\}\[Final Answer to New Question End\]', text, re.DOTALL)
141
+ if not answer_match:
142
+ # Try without \boxed
143
+ answer_match = re.search(r'\[Final Answer to New Question Begin\](.*?)\[Final Answer to New Question End\]', text, re.DOTALL)
144
+
145
+ if not answer_match:
146
+ return question, None
147
+
148
+ answer = answer_match.group(1).strip()
149
+ return question, answer
150
+
151
+
152
+ def extract_instruction_output(text: str) -> Optional[str]:
153
+ """Extract synthetic prompt from instruction task output."""
154
+ text = parse_thinking_output(text)
155
+
156
+ # Look for the synthetic prompt after "Step 3 #Synthetic Prompt#:"
157
+ match = re.search(r'Step 3 #Synthetic Prompt#:\s*(.+)', text, re.DOTALL)
158
+ if match:
159
+ return match.group(1).strip()
160
+ return None
161
+
162
+
163
+ def categorize_prompts(prompts: List[str], num_categories: int = 8) -> Dict[int, List[int]]:
164
+ """Categorize prompts using clustering for instruction tasks."""
165
+ from transformers import AutoModel
166
+
167
+ logger.info(f"Categorizing {len(prompts)} prompts into {num_categories} categories...")
168
+
169
+ # Use a small model for embeddings
170
+ tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
171
+ model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
172
+
173
+ # Get embeddings
174
+ embeddings = []
175
+ for prompt in tqdm(prompts, desc="Computing embeddings"):
176
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
177
+ with torch.no_grad():
178
+ outputs = model(**inputs)
179
+ embedding = outputs.last_hidden_state.mean(dim=1).numpy()
180
+ embeddings.append(embedding[0])
181
+
182
+ # Cluster
183
+ kmeans = KMeans(n_clusters=num_categories, random_state=42)
184
+ labels = kmeans.fit_predict(embeddings)
185
+
186
+ # Group by category
187
+ categories = {}
188
+ for idx, label in enumerate(labels):
189
+ if label not in categories:
190
+ categories[label] = []
191
+ categories[label].append(idx)
192
+
193
+ return categories
194
+
195
+
196
+ def generate_synthetic_data(
197
+ llm: LLM,
198
+ seed_data: List[Dict],
199
+ task_type: str,
200
+ num_samples: int,
201
+ categories: Optional[Dict[int, List[int]]] = None,
202
+ ) -> List[Dict]:
203
+ """Generate synthetic data using CoT-Self-Instruct."""
204
+ synthetic_data = []
205
+
206
+ # Set up progress bar
207
+ pbar = tqdm(total=num_samples, desc="Generating synthetic data")
208
+
209
+ while len(synthetic_data) < num_samples:
210
+ # Sample seed data
211
+ if task_type == "reasoning":
212
+ # Random sampling for reasoning tasks
213
+ seeds = random.sample(seed_data, min(2, len(seed_data)))
214
+ prompt = REASONING_PROMPT_TEMPLATE.format(
215
+ seed1=seeds[0].get("question", seeds[0].get("prompt", "")),
216
+ seed2=seeds[1].get("question", seeds[1].get("prompt", "")) if len(seeds) > 1 else seeds[0].get("question", seeds[0].get("prompt", ""))
217
+ )
218
+ else:
219
+ # Category-aware sampling for instruction tasks
220
+ if categories:
221
+ # Pick a random category
222
+ category = random.choice(list(categories.keys()))
223
+ category_indices = categories[category]
224
+ indices = random.sample(category_indices, min(2, len(category_indices)))
225
+ seeds = [seed_data[i] for i in indices]
226
+ else:
227
+ seeds = random.sample(seed_data, min(2, len(seed_data)))
228
+
229
+ prompt = INSTRUCTION_PROMPT_TEMPLATE.format(
230
+ prompt1=seeds[0].get("prompt", seeds[0].get("question", "")),
231
+ prompt2=seeds[1].get("prompt", seeds[1].get("question", "")) if len(seeds) > 1 else seeds[0].get("prompt", seeds[0].get("question", ""))
232
+ )
233
+
234
+ # Generate
235
+ sampling_params = SamplingParams(
236
+ temperature=0.7 if task_type == "reasoning" else 0.8,
237
+ top_p=0.95 if task_type == "reasoning" else 0.9,
238
+ max_tokens=2048,
239
+ )
240
+
241
+ outputs = llm.generate([prompt], sampling_params)
242
+ output_text = outputs[0].outputs[0].text
243
+
244
+ # Parse output
245
+ if task_type == "reasoning":
246
+ question, answer = extract_reasoning_output(output_text)
247
+ if question and answer:
248
+ synthetic_data.append({
249
+ "question": question,
250
+ "answer": answer,
251
+ "seed_indices": [seed_data.index(s) for s in seeds],
252
+ })
253
+ pbar.update(1)
254
+ else:
255
+ synthetic_prompt = extract_instruction_output(output_text)
256
+ if synthetic_prompt:
257
+ synthetic_data.append({
258
+ "prompt": synthetic_prompt,
259
+ "seed_indices": [seed_data.index(s) for s in seeds],
260
+ })
261
+ pbar.update(1)
262
+
263
+ pbar.close()
264
+ return synthetic_data
265
+
266
+
267
+ def answer_consistency_filter(
268
+ llm: LLM,
269
+ synthetic_data: List[Dict],
270
+ k_responses: int = 16,
271
+ threshold: float = 0.5,
272
+ ) -> List[Dict]:
273
+ """Filter reasoning tasks using Answer-Consistency."""
274
+ logger.info(f"Applying Answer-Consistency filter with K={k_responses}")
275
+
276
+ filtered_data = []
277
+
278
+ for item in tqdm(synthetic_data, desc="Answer-Consistency filtering"):
279
+ question = item["question"]
280
+ original_answer = item["answer"]
281
+
282
+ # Generate K responses
283
+ prompts = [question] * k_responses
284
+ sampling_params = SamplingParams(
285
+ temperature=0.6,
286
+ top_p=0.95,
287
+ max_tokens=1024,
288
+ )
289
+
290
+ outputs = llm.generate(prompts, sampling_params)
291
+
292
+ # Extract answers
293
+ answers = []
294
+ for output in outputs:
295
+ text = output.outputs[0].text
296
+ # Try to extract boxed answer
297
+ match = re.search(r'\\boxed\{(.*?)\}', text)
298
+ if match:
299
+ answers.append(match.group(1).strip())
300
+
301
+ if not answers:
302
+ continue
303
+
304
+ # Get majority answer
305
+ answer_counts = Counter(answers)
306
+ if answer_counts:
307
+ majority_answer, count = answer_counts.most_common(1)[0]
308
+
309
+ # Check if majority answer matches original and meets threshold
310
+ if (majority_answer == original_answer and
311
+ count / len(answers) >= threshold):
312
+ item["consistency_score"] = count / len(answers)
313
+ filtered_data.append(item)
314
+
315
+ logger.info(f"Answer-Consistency: kept {len(filtered_data)}/{len(synthetic_data)} examples")
316
+ return filtered_data
317
+
318
+
319
+ def rip_filter(
320
+ llm: LLM,
321
+ synthetic_data: List[Dict],
322
+ reward_model_id: str,
323
+ k_responses: int = 32,
324
+ threshold: float = 0.5,
325
+ ) -> List[Dict]:
326
+ """Filter using Rejecting Instruction Preferences (RIP)."""
327
+ logger.info(f"Applying RIP filter with K={k_responses} and reward model {reward_model_id}")
328
+
329
+ # Note: In a full implementation, you would load and use the actual reward model
330
+ # For this example, we'll use a placeholder scoring mechanism
331
+ logger.warning("RIP filtering requires a reward model implementation - using placeholder")
332
+
333
+ filtered_data = []
334
+
335
+ for item in tqdm(synthetic_data, desc="RIP filtering"):
336
+ prompt = item.get("prompt", item.get("question", ""))
337
+
338
+ # Generate K responses
339
+ prompts = [prompt] * k_responses
340
+ sampling_params = SamplingParams(
341
+ temperature=1.0,
342
+ top_p=1.0,
343
+ max_tokens=1024,
344
+ )
345
+
346
+ outputs = llm.generate(prompts, sampling_params)
347
+
348
+ # In real implementation: score each response with reward model
349
+ # For now, use length as a proxy (longer responses often score higher)
350
+ scores = [len(output.outputs[0].text) for output in outputs]
351
+
352
+ # Use minimum score as quality indicator
353
+ min_score = min(scores) if scores else 0
354
+ normalized_score = min_score / 1000 # Normalize to 0-1 range
355
+
356
+ if normalized_score >= threshold:
357
+ item["rip_score"] = normalized_score
358
+ filtered_data.append(item)
359
+
360
+ logger.info(f"RIP filter: kept {len(filtered_data)}/{len(synthetic_data)} examples")
361
+ return filtered_data
362
+
363
+
364
+ def create_dataset_card(
365
+ task_type: str,
366
+ source_dataset: str,
367
+ generation_model: str,
368
+ filter_method: str,
369
+ num_generated: int,
370
+ num_filtered: int,
371
+ generation_time: str,
372
+ additional_info: Dict = None,
373
+ ) -> str:
374
+ """Create a comprehensive dataset card."""
375
+ filter_info = ""
376
+ if filter_method == "answer-consistency":
377
+ filter_info = """
378
+ ### Answer-Consistency Filtering
379
+
380
+ This dataset was filtered using Answer-Consistency:
381
+ - Generated K responses for each synthetic question
382
+ - Kept only examples where majority answer matched the generated answer
383
+ - Ensures high-quality, correctly solved problems"""
384
+ elif filter_method == "rip":
385
+ filter_info = """
386
+ ### RIP (Rejecting Instruction Preferences) Filtering
387
+
388
+ This dataset was filtered using RIP:
389
+ - Generated K responses for each synthetic prompt
390
+ - Scored responses using a reward model
391
+ - Kept only prompts with high minimum scores"""
392
+
393
+ return f"""---
394
+ tags:
395
+ - synthetic-data
396
+ - cot-self-instruct
397
+ - {task_type}
398
+ - uv-script
399
+ ---
400
+
401
+ # CoT-Self-Instruct Synthetic Data
402
+
403
+ This dataset contains synthetic {task_type} data generated using the Chain-of-Thought Self-Instruct methodology.
404
+
405
+ ## Generation Details
406
+
407
+ - **Source Dataset**: [{source_dataset}](https://huggingface.co/datasets/{source_dataset})
408
+ - **Generation Model**: [{generation_model}](https://huggingface.co/{generation_model})
409
+ - **Task Type**: {task_type}
410
+ - **Filter Method**: {filter_method}
411
+ - **Generated Examples**: {num_generated:,}
412
+ - **After Filtering**: {num_filtered:,} ({(num_filtered/num_generated)*100:.1f}% acceptance rate)
413
+ - **Generation Date**: {generation_time}
414
+ {filter_info}
415
+
416
+ ## Methodology
417
+
418
+ Generated using CoT-Self-Instruct, which:
419
+ 1. Uses Chain-of-Thought reasoning to analyze seed examples
420
+ 2. Generates new synthetic examples of similar quality and complexity
421
+ 3. Applies quality filtering to ensure high-quality outputs
422
+
423
+ Based on the paper: "CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks" (2025)
424
+
425
+ ## Generation Script
426
+
427
+ Generated using the CoT-Self-Instruct script from [uv-scripts/synthetic-data](https://huggingface.co/datasets/uv-scripts/synthetic-data).
428
+
429
+ To reproduce:
430
+ ```bash
431
+ uv run https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
432
+ --seed-dataset {source_dataset} \\
433
+ --output-dataset <your-dataset> \\
434
+ --task-type {task_type} \\
435
+ --generation-model {generation_model} \\
436
+ --filter-method {filter_method}
437
+ ```
438
+ """
439
+
440
+
441
+ def main():
442
+ parser = argparse.ArgumentParser(
443
+ description="Generate synthetic data using CoT-Self-Instruct",
444
+ formatter_class=argparse.RawDescriptionHelpFormatter,
445
+ epilog=__doc__,
446
+ )
447
+
448
+ # Dataset arguments
449
+ parser.add_argument(
450
+ "--seed-dataset",
451
+ type=str,
452
+ required=True,
453
+ help="HuggingFace dataset ID containing seed examples",
454
+ )
455
+ parser.add_argument(
456
+ "--output-dataset",
457
+ type=str,
458
+ required=True,
459
+ help="HuggingFace dataset ID for output",
460
+ )
461
+
462
+ # Task configuration
463
+ parser.add_argument(
464
+ "--task-type",
465
+ type=str,
466
+ choices=["reasoning", "instruction", "auto"],
467
+ default="auto",
468
+ help="Type of task (reasoning generates Q&A, instruction generates prompts)",
469
+ )
470
+ parser.add_argument(
471
+ "--task-column",
472
+ type=str,
473
+ default=None,
474
+ help="Column name containing tasks (auto-detected if not specified)",
475
+ )
476
+
477
+ # Model configuration
478
+ parser.add_argument(
479
+ "--generation-model",
480
+ type=str,
481
+ default="Qwen/Qwen3-30B-A3B-Thinking-2507",
482
+ help="Model for synthetic data generation",
483
+ )
484
+ parser.add_argument(
485
+ "--filter-model",
486
+ type=str,
487
+ default=None,
488
+ help="Model for filtering (defaults to generation model)",
489
+ )
490
+ parser.add_argument(
491
+ "--reward-model",
492
+ type=str,
493
+ default="Nexusflow/Athene-RM-8B",
494
+ help="Reward model for RIP filtering",
495
+ )
496
+
497
+ # Generation parameters
498
+ parser.add_argument(
499
+ "--num-samples",
500
+ type=int,
501
+ default=5000,
502
+ help="Number of synthetic examples to generate",
503
+ )
504
+ parser.add_argument(
505
+ "--batch-size",
506
+ type=int,
507
+ default=1,
508
+ help="Batch size for generation",
509
+ )
510
+
511
+ # Filtering parameters
512
+ parser.add_argument(
513
+ "--filter-method",
514
+ type=str,
515
+ choices=["answer-consistency", "rip", "both", "none"],
516
+ default="answer-consistency",
517
+ help="Quality filtering method",
518
+ )
519
+ parser.add_argument(
520
+ "--k-responses",
521
+ type=int,
522
+ default=16,
523
+ help="Number of responses for filtering",
524
+ )
525
+ parser.add_argument(
526
+ "--quality-threshold",
527
+ type=float,
528
+ default=0.5,
529
+ help="Minimum quality threshold for filtering",
530
+ )
531
+
532
+ # GPU configuration
533
+ parser.add_argument(
534
+ "--tensor-parallel-size",
535
+ type=int,
536
+ default=None,
537
+ help="Number of GPUs for tensor parallelism (auto-detected if not set)",
538
+ )
539
+ parser.add_argument(
540
+ "--gpu-memory-utilization",
541
+ type=float,
542
+ default=0.9,
543
+ help="GPU memory utilization",
544
+ )
545
+
546
+ # Other arguments
547
+ parser.add_argument(
548
+ "--hf-token",
549
+ type=str,
550
+ default=None,
551
+ help="HuggingFace API token",
552
+ )
553
+ parser.add_argument(
554
+ "--seed",
555
+ type=int,
556
+ default=42,
557
+ help="Random seed",
558
+ )
559
+
560
+ args = parser.parse_args()
561
+
562
+ # Set random seeds
563
+ random.seed(args.seed)
564
+ np.random.seed(args.seed)
565
+ torch.manual_seed(args.seed)
566
+
567
+ # Check GPU
568
+ num_gpus = check_gpu_availability()
569
+ tensor_parallel_size = args.tensor_parallel_size or num_gpus
570
+
571
+ # Authentication
572
+ hf_token = args.hf_token or os.environ.get("HF_TOKEN")
573
+ if hf_token:
574
+ login(token=hf_token)
575
+
576
+ # Load seed dataset
577
+ logger.info(f"Loading seed dataset: {args.seed_dataset}")
578
+ seed_dataset = load_dataset(args.seed_dataset, split="train")
579
+
580
+ # Auto-detect task type and column if needed
581
+ if args.task_type == "auto":
582
+ columns = seed_dataset.column_names
583
+ if "question" in columns and "answer" in columns:
584
+ args.task_type = "reasoning"
585
+ logger.info("Auto-detected task type: reasoning")
586
+ else:
587
+ args.task_type = "instruction"
588
+ logger.info("Auto-detected task type: instruction")
589
+
590
+ if not args.task_column:
591
+ if args.task_type == "reasoning":
592
+ args.task_column = "question"
593
+ else:
594
+ # Try to find prompt column
595
+ for col in ["prompt", "instruction", "text", "input"]:
596
+ if col in seed_dataset.column_names:
597
+ args.task_column = col
598
+ break
599
+
600
+ logger.info(f"Using task column: {args.task_column}")
601
+
602
+ # Convert to list of dicts
603
+ seed_data = seed_dataset.to_list()
604
+
605
+ # Categorize prompts for instruction tasks
606
+ categories = None
607
+ if args.task_type == "instruction" and len(seed_data) > 100:
608
+ prompts = [item.get(args.task_column, "") for item in seed_data]
609
+ categories = categorize_prompts(prompts)
610
+
611
+ # Initialize generation model
612
+ logger.info(f"Loading generation model: {args.generation_model}")
613
+ generation_llm = LLM(
614
+ model=args.generation_model,
615
+ tensor_parallel_size=tensor_parallel_size,
616
+ gpu_memory_utilization=args.gpu_memory_utilization,
617
+ )
618
+
619
+ # Generate synthetic data
620
+ start_time = datetime.now()
621
+ synthetic_data = generate_synthetic_data(
622
+ generation_llm,
623
+ seed_data,
624
+ args.task_type,
625
+ args.num_samples,
626
+ categories,
627
+ )
628
+
629
+ # Apply filtering
630
+ filter_llm = generation_llm
631
+ if args.filter_model and args.filter_model != args.generation_model:
632
+ logger.info(f"Loading filter model: {args.filter_model}")
633
+ # Clean up generation model
634
+ del generation_llm
635
+ torch.cuda.empty_cache()
636
+
637
+ filter_llm = LLM(
638
+ model=args.filter_model,
639
+ tensor_parallel_size=tensor_parallel_size,
640
+ gpu_memory_utilization=args.gpu_memory_utilization,
641
+ )
642
+
643
+ filtered_data = synthetic_data
644
+ if args.filter_method != "none":
645
+ if args.filter_method == "answer-consistency" and args.task_type == "reasoning":
646
+ filtered_data = answer_consistency_filter(
647
+ filter_llm,
648
+ synthetic_data,
649
+ args.k_responses,
650
+ args.quality_threshold,
651
+ )
652
+ elif args.filter_method == "rip":
653
+ filtered_data = rip_filter(
654
+ filter_llm,
655
+ synthetic_data,
656
+ args.reward_model,
657
+ args.k_responses,
658
+ args.quality_threshold,
659
+ )
660
+ elif args.filter_method == "both":
661
+ if args.task_type == "reasoning":
662
+ filtered_data = answer_consistency_filter(
663
+ filter_llm,
664
+ synthetic_data,
665
+ args.k_responses,
666
+ args.quality_threshold,
667
+ )
668
+ filtered_data = rip_filter(
669
+ filter_llm,
670
+ filtered_data,
671
+ args.reward_model,
672
+ args.k_responses,
673
+ args.quality_threshold,
674
+ )
675
+
676
+ # Create HuggingFace dataset
677
+ logger.info(f"Creating dataset with {len(filtered_data)} examples")
678
+ dataset = Dataset.from_list(filtered_data)
679
+
680
+ # Create dataset card
681
+ generation_time = start_time.strftime("%Y-%m-%d %H:%M:%S UTC")
682
+ dataset_card = create_dataset_card(
683
+ args.task_type,
684
+ args.seed_dataset,
685
+ args.generation_model,
686
+ args.filter_method,
687
+ len(synthetic_data),
688
+ len(filtered_data),
689
+ generation_time,
690
+ )
691
+
692
+ # Push to hub
693
+ logger.info(f"Pushing dataset to: {args.output_dataset}")
694
+ # Create dataset card
695
+ card = DatasetCard(dataset_card)
696
+ dataset.push_to_hub(args.output_dataset)
697
+ # Push card separately
698
+ card.push_to_hub(args.output_dataset)
699
+
700
+ logger.info("Done! Dataset available at: https://huggingface.co/datasets/" + args.output_dataset)
701
+
702
+ # Print example HF Jobs command if running locally
703
+ if len(sys.argv) > 1:
704
+ print("\nTo run on HF Jobs:")
705
+ print(f"""hf jobs uv run --flavor l4x4 \\
706
+ --image vllm/vllm-openai \\
707
+ -e HF_TOKEN=$(python3 -c "from huggingface_hub import get_token; print(get_token())") \\
708
+ https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
709
+ --seed-dataset {args.seed_dataset} \\
710
+ --output-dataset {args.output_dataset} \\
711
+ --task-type {args.task_type} \\
712
+ --generation-model {args.generation_model} \\
713
+ --filter-method {args.filter_method} \\
714
+ --num-samples {args.num_samples}""")
715
+
716
+
717
+ if __name__ == "__main__":
718
+ main()
scripts/finepdfs-stats.py ADDED
@@ -0,0 +1,546 @@
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "polars>=1.31.0",
5
+ # "huggingface-hub",
6
+ # "datasets",
7
+ # "ascii-graph",
8
+ # ]
9
+ # ///
10
+ """
11
+ Analyze educational quality trends across CommonCrawl dumps using Polars streaming.
12
+
13
+ Answers: "Is the web getting more educational over time?"
14
+
15
+ Demonstrates Polars HF Hub integration - process 50M+ docs without downloading 300GB+.
16
+
17
+ Example usage:
18
+ # Analyze English PDFs (default)
19
+ uv run finepdfs-stats.py
20
+
21
+ # Analyze all 70+ languages
22
+ uv run finepdfs-stats.py --all-languages
23
+
24
+ # Quick test
25
+ uv run finepdfs-stats.py --limit 10000 --show-plan
26
+
27
+ # Save results to HF Hub
28
+ uv run finepdfs-stats.py --output-repo username/finepdfs-temporal-stats
29
+
30
+ # Run on HF Jobs
31
+ hf jobs uv run \\
32
+ -s HF_TOKEN \\
33
+ -e HF_XET_HIGH_PERFORMANCE=1 \\
34
+ https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\
35
+ -- --output-repo username/stats
36
+ """
37
+
38
+ import argparse
39
+ import logging
40
+ import os
41
+ import sys
42
+ import time
43
+ from pathlib import Path
44
+
45
+ import polars as pl
46
+ from ascii_graph import Pyasciigraph
47
+ from datasets import Dataset
48
+ from huggingface_hub import HfApi, create_repo, list_repo_tree, login
49
+
50
+ logging.basicConfig(
51
+ level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
52
+ )
53
+ logger = logging.getLogger(__name__)
54
+
55
+ # Common language+script codes for finepdfs-edu
56
+ COMMON_LANGUAGES = {
57
+ "eng_Latn": "English (Latin script)",
58
+ "fra_Latn": "French (Latin script)",
59
+ "deu_Latn": "German (Latin script)",
60
+ "spa_Latn": "Spanish (Latin script)",
61
+ "por_Latn": "Portuguese (Latin script)",
62
+ "ita_Latn": "Italian (Latin script)",
63
+ "nld_Latn": "Dutch (Latin script)",
64
+ "pol_Latn": "Polish (Latin script)",
65
+ "rus_Cyrl": "Russian (Cyrillic script)",
66
+ "zho_Hans": "Chinese (Simplified)",
67
+ "zho_Hant": "Chinese (Traditional)",
68
+ "jpn_Jpan": "Japanese",
69
+ "kor_Hang": "Korean",
70
+ "ara_Arab": "Arabic",
71
+ "hin_Deva": "Hindi (Devanagari)",
72
+ }
73
+
74
+
75
+ def list_available_languages(dataset_id: str) -> list[str]:
76
+ """List available language subsets in the dataset."""
77
+ try:
78
+ tree = list_repo_tree(dataset_id, path_in_repo="data", repo_type="dataset")
79
+ languages = [
80
+ item.path.replace("data/", "")
81
+ for item in tree
82
+ if item.path.startswith("data/")
83
+ and "/" not in item.path.replace("data/", "")
84
+ ]
85
+ return sorted(languages)
86
+ except Exception as e:
87
+ logger.warning(f"Could not list languages: {e}")
88
+ return list(COMMON_LANGUAGES.keys())
89
+
90
+
91
+ def compute_temporal_stats(df: pl.LazyFrame, output_path: Path) -> pl.DataFrame:
92
+ """Single scan: compute stats grouped by dump for temporal analysis."""
93
+ query = df.group_by("dump").agg(
94
+ pl.len().alias("doc_count"),
95
+ pl.col("token_count").sum().alias("total_tokens"),
96
+ pl.col("fw_edu_scores").list.mean().mean().alias("avg_edu_score"),
97
+ (pl.col("fw_edu_scores").list.mean() >= 3).sum().alias("high_edu_count"),
98
+ )
99
+ query.sink_parquet(output_path, engine="streaming")
100
+ return pl.read_parquet(output_path)
101
+
102
+
103
+ def compute_global_stats(temporal: pl.DataFrame) -> pl.DataFrame:
104
+ """Compute global stats from temporal breakdown."""
105
+ total = temporal["doc_count"].sum()
106
+ return pl.DataFrame(
107
+ {
108
+ "total_docs": [total],
109
+ "total_tokens": [temporal["total_tokens"].sum()],
110
+ "avg_edu_score": [
111
+ (temporal["avg_edu_score"] * temporal["doc_count"]).sum() / total
112
+ ],
113
+ "high_edu_rate": [temporal["high_edu_count"].sum() / total],
114
+ "num_dumps": [len(temporal)],
115
+ }
116
+ )
117
+
118
+
119
+ def format_temporal_stats(temporal: pl.DataFrame) -> pl.DataFrame:
120
+ """Format temporal stats with high_edu_rate, sorted chronologically."""
121
+ return (
122
+ temporal.with_columns(
123
+ (pl.col("high_edu_count") / pl.col("doc_count")).alias("high_edu_rate")
124
+ )
125
+ .select(["dump", "doc_count", "avg_edu_score", "high_edu_rate"])
126
+ .sort(
127
+ "dump"
128
+ ) # Chronological order (CC-MAIN-2017-xx comes before CC-MAIN-2024-xx)
129
+ )
130
+
131
+
132
+ def create_ascii_charts(temporal_stats: pl.DataFrame) -> str:
133
+ """Create ASCII bar charts showing temporal trends."""
134
+ # Extract year from dump name (CC-MAIN-2024-42 -> 2024)
135
+ # Group by year and average the values for cleaner display
136
+ yearly = (
137
+ temporal_stats.with_columns(
138
+ pl.col("dump").str.extract(r"CC-MAIN-(\d{4})", 1).alias("year")
139
+ )
140
+ .group_by("year")
141
+ .agg(
142
+ pl.col("doc_count").sum(),
143
+ pl.col("avg_edu_score").mean(),
144
+ pl.col("high_edu_rate").mean(),
145
+ )
146
+ .sort("year")
147
+ )
148
+
149
+ lines = []
150
+
151
+ # High edu rate chart (more dramatic differences)
152
+ data_rate = [
153
+ (row["year"], row["high_edu_rate"] * 100)
154
+ for row in yearly.iter_rows(named=True)
155
+ ]
156
+ graph = Pyasciigraph(line_length=60, float_format="{0:.1f}%")
157
+ lines.extend(graph.graph("High Educational Content (edu >= 3)", data_rate))
158
+
159
+ lines.append("")
160
+
161
+ # Avg edu score chart
162
+ data_score = [
163
+ (row["year"], row["avg_edu_score"]) for row in yearly.iter_rows(named=True)
164
+ ]
165
+ graph2 = Pyasciigraph(line_length=60, float_format="{0:.2f}")
166
+ lines.extend(graph2.graph("Average Educational Score", data_score))
167
+
168
+ return "\n".join(lines)
169
+
170
+
171
+ def create_readme(
172
+ args,
173
+ global_stats: pl.DataFrame,
174
+ temporal_stats: pl.DataFrame,
175
+ scan_time: float,
176
+ ascii_charts: str,
177
+ ) -> str:
178
+ """Create README content for the stats dataset."""
179
+ stats = global_stats.to_dicts()[0]
180
+ total_docs = stats.get("total_docs", 0)
181
+ docs_per_sec = total_docs / scan_time if scan_time > 0 else 0
182
+
183
+ # Get first and last year averages for trend (more representative than single dumps)
184
+ yearly = (
185
+ temporal_stats.with_columns(
186
+ pl.col("dump").str.extract(r"CC-MAIN-(\d{4})", 1).alias("year")
187
+ )
188
+ .group_by("year")
189
+ .agg(
190
+ pl.col("doc_count").sum(),
191
+ pl.col("avg_edu_score").mean(),
192
+ pl.col("high_edu_rate").mean(),
193
+ )
194
+ .sort("year")
195
+ )
196
+ first_year = yearly.head(1).to_dicts()[0]
197
+ last_year = yearly.tail(1).to_dicts()[0]
198
+
199
+ scope = (
200
+ "all languages"
201
+ if args.all_languages
202
+ else COMMON_LANGUAGES.get(args.lang, args.lang)
203
+ )
204
+
205
+ return f"""---
206
+ tags:
207
+ - uv-script
208
+ - statistics
209
+ - polars
210
+ - finepdfs-edu
211
+ - temporal-analysis
212
+ license: odc-by
213
+ configs:
214
+ - config_name: global_stats
215
+ data_files: global_stats/train-*.parquet
216
+ - config_name: temporal_stats
217
+ data_files: temporal_stats/train-*.parquet
218
+ default_viewer_config: temporal_stats
219
+ ---
220
+
221
+ # Is the Web Getting More Educational?
222
+
223
+ Temporal analysis of educational quality in **{scope}** across {stats.get("num_dumps", 0)} CommonCrawl dumps.
224
+
225
+ ## Trend
226
+
227
+ ```
228
+ {ascii_charts}
229
+ ```
230
+
231
+ ## Key Finding
232
+
233
+ | Year | Avg Edu Score | High Edu Rate |
234
+ |------|---------------|---------------|
235
+ | {first_year["year"]} | {first_year["avg_edu_score"]:.2f} | {first_year["high_edu_rate"] * 100:.1f}% |
236
+ | {last_year["year"]} | {last_year["avg_edu_score"]:.2f} | {last_year["high_edu_rate"] * 100:.1f}% |
237
+
238
+ ## Performance
239
+
240
+ - **{total_docs:,} documents** processed in **{scan_time:.0f} seconds**
241
+ - **{docs_per_sec:,.0f} docs/sec** using Polars streaming
242
+ - Single scan, no full dataset download required
243
+
244
+ ## Summary
245
+
246
+ | Metric | Value |
247
+ |--------|-------|
248
+ | Scope | {scope} |
249
+ | Total Documents | {total_docs:,} |
250
+ | Total Tokens | {stats.get("total_tokens", 0):,} |
251
+ | Avg Edu Score | {stats.get("avg_edu_score", 0):.3f} |
252
+ | High Edu Rate | {stats.get("high_edu_rate", 0) * 100:.1f}% |
253
+ | CommonCrawl Dumps | {stats.get("num_dumps", 0)} |
254
+
255
+ ## Files
256
+
257
+ - `global_stats` - Overall summary
258
+ - `temporal_stats` - Per-dump breakdown (sorted chronologically)
259
+
260
+ ## Reproduce
261
+
262
+ ```bash
263
+ uv run https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\
264
+ {"--all-languages" if args.all_languages else f"--lang {args.lang}"} --output-repo your-username/stats
265
+ ```
266
+
267
+ ## Source
268
+
269
+ - **Dataset**: [{args.source_dataset}](https://huggingface.co/datasets/{args.source_dataset})
270
+ - **Script**: [uv-scripts/dataset-stats](https://huggingface.co/datasets/uv-scripts/dataset-stats)
271
+ """
272
+
273
+
274
+ def main():
275
+ parser = argparse.ArgumentParser(
276
+ description="Analyze educational quality trends across CommonCrawl dumps",
277
+ formatter_class=argparse.RawDescriptionHelpFormatter,
278
+ epilog=__doc__,
279
+ )
280
+
281
+ parser.add_argument(
282
+ "--source-dataset",
283
+ type=str,
284
+ default="HuggingFaceFW/finepdfs-edu",
285
+ help="Source dataset (default: HuggingFaceFW/finepdfs-edu)",
286
+ )
287
+
288
+ parser.add_argument(
289
+ "--lang",
290
+ type=str,
291
+ default="eng_Latn",
292
+ help="Language+script code (default: eng_Latn)",
293
+ )
294
+
295
+ parser.add_argument(
296
+ "--all-languages",
297
+ action="store_true",
298
+ help="Analyze all languages (70+) instead of single language",
299
+ )
300
+
301
+ parser.add_argument(
302
+ "--show-plan",
303
+ action="store_true",
304
+ help="Show Polars query plan (demonstrates optimization)",
305
+ )
306
+
307
+ parser.add_argument(
308
+ "--list-languages",
309
+ action="store_true",
310
+ help="List available languages and exit",
311
+ )
312
+
313
+ parser.add_argument(
314
+ "--limit",
315
+ type=int,
316
+ help="Limit to first N rows (for testing)",
317
+ )
318
+
319
+ parser.add_argument(
320
+ "--output-repo",
321
+ type=str,
322
+ help="HuggingFace dataset repository to upload results",
323
+ )
324
+
325
+ parser.add_argument(
326
+ "--output-dir",
327
+ type=str,
328
+ default="./stats_output",
329
+ help="Local directory for output files",
330
+ )
331
+
332
+ parser.add_argument(
333
+ "--hf-token",
334
+ type=str,
335
+ help="HuggingFace API token (or set HF_TOKEN env var)",
336
+ )
337
+
338
+ parser.add_argument(
339
+ "--private",
340
+ action="store_true",
341
+ help="Make the output dataset private",
342
+ )
343
+
344
+ args = parser.parse_args()
345
+
346
+ # Check for high-performance mode
347
+ if os.environ.get("HF_XET_HIGH_PERFORMANCE"):
348
+ logger.info("High-performance mode enabled (HF_XET_HIGH_PERFORMANCE=1)")
349
+
350
+ # List languages mode
351
+ if args.list_languages:
352
+ print(f"Available language+script codes for {args.source_dataset}:\n")
353
+ print("Common languages:")
354
+ for code, name in COMMON_LANGUAGES.items():
355
+ print(f" {code:12} - {name}")
356
+ print("\nFetching full list from HF Hub...")
357
+ all_langs = list_available_languages(args.source_dataset)
358
+ print(f"\nAll available ({len(all_langs)} total):")
359
+ for lang in all_langs[:30]: # Show first 30
360
+ name = COMMON_LANGUAGES.get(lang, "")
361
+ print(f" {lang:12} {name}")
362
+ if len(all_langs) > 30:
363
+ print(f" ... and {len(all_langs) - 30} more")
364
+ sys.exit(0)
365
+
366
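+     # Polars scans hf:// parquet paths lazily; only the columns the query touches are downloaded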
+ # Build the parquet path
367
+ if args.all_languages:
368
+ source_path = f"hf://datasets/{args.source_dataset}/data/*/train/*.parquet"
369
+ scope_desc = "all languages"
370
+ else:
371
+ source_path = (
372
+ f"hf://datasets/{args.source_dataset}/data/{args.lang}/train/*.parquet"
373
+ )
374
+ scope_desc = f"{args.lang} ({COMMON_LANGUAGES.get(args.lang, 'unknown')})"
375
+
376
+ logger.info(f"Scanning: {source_path}")
377
+ logger.info(f"Scope: {scope_desc}")
378
+
379
+ # Create lazy frame - this doesn't load any data yet!
380
+ logger.info("Creating lazy query plan...")
381
+ df = pl.scan_parquet(source_path)
382
+
383
+ # Apply limit if specified
384
+ if args.limit:
385
+ logger.info(f"Limiting to first {args.limit:,} rows")
386
+ df = df.head(args.limit)
387
+
388
+ # Show query plan if requested
389
+ if args.show_plan:
390
+ # Build a sample query to show the plan
391
+ sample_query = df.select(
392
+ pl.len(),
393
+ pl.col("token_count").sum(),
394
+ pl.col("language").n_unique(),
395
+ )
396
+ print("\nQuery Plan (showing Polars optimization):")
397
+ print("=" * 60)
398
+ print(sample_query.explain())
399
+ print("=" * 60)
400
+ print("\nNote: Polars uses projection pushdown - only reads columns needed!")
401
+ print("The 'text' column is never loaded, making this very fast.\n")
402
+
403
+ # Create output directory
404
+ output_dir = Path(args.output_dir)
405
+ output_dir.mkdir(parents=True, exist_ok=True)
406
+
407
+ # Single scan: compute temporal stats
408
+ logger.info("Computing temporal stats (single scan)...")
409
+ start = time.perf_counter()
410
+ temporal_path = output_dir / "temporal_stats.parquet"
411
+ temporal_raw = compute_temporal_stats(df, temporal_path)
412
+ scan_time = time.perf_counter() - start
413
+ logger.info(f"Scan complete in {scan_time:.2f}s - {len(temporal_raw)} dumps")
414
+
415
+ # Compute stats
416
+ global_stats = compute_global_stats(temporal_raw)
417
+ temporal_stats = format_temporal_stats(temporal_raw)
418
+
419
+ # Save
420
+ global_stats.write_parquet(output_dir / "global_stats.parquet")
421
+ temporal_stats.write_parquet(output_dir / "temporal_stats.parquet")
422
+
423
+ # Print results
424
+ total_docs = global_stats["total_docs"][0]
425
+ docs_per_sec = total_docs / scan_time if scan_time > 0 else 0
426
+
427
+ print("\n" + "=" * 70)
428
+ print("IS THE WEB GETTING MORE EDUCATIONAL?")
429
+ print("=" * 70)
430
+
431
+ print(f"\nScope: {scope_desc}")
432
+ print(f"Dataset: {args.source_dataset}")
433
+
434
+ print("\n" + "-" * 70)
435
+ print("GLOBAL STATS")
436
+ print("-" * 70)
437
+ print(global_stats)
438
+
439
+ print("\n" + "-" * 70)
440
+ print(f"TEMPORAL TREND ({len(temporal_stats)} CommonCrawl dumps)")
441
+ print("-" * 70)
442
+ # Show first 5 and last 5
443
+ if len(temporal_stats) > 10:
444
+ print("Earliest dumps:")
445
+ print(temporal_stats.head(5))
446
+ print("\n...")
447
+ print("\nLatest dumps:")
448
+ print(temporal_stats.tail(5))
449
+ else:
450
+ print(temporal_stats)
451
+
452
+ # Create ASCII charts
453
+ ascii_charts = create_ascii_charts(temporal_stats)
454
+ print("\n" + "-" * 70)
455
+ print("TREND VISUALIZATION")
456
+ print("-" * 70)
457
+ print(ascii_charts)
458
+
459
+ print("\n" + "-" * 70)
460
+ print("PERFORMANCE")
461
+ print("-" * 70)
462
+ print(f"Scan time: {scan_time:.2f}s")
463
+ print(f"Documents: {total_docs:,}")
464
+ print(f"Throughput: {docs_per_sec:,.0f} docs/sec")
465
+
466
+ logger.info(f"Results saved to: {output_dir}")
467
+
468
+ # Upload to HF Hub if requested
469
+ if args.output_repo:
470
+ hf_token = args.hf_token or os.environ.get("HF_TOKEN")
471
+ if hf_token:
472
+ login(token=hf_token)
473
+
474
+ api = HfApi(token=hf_token)
475
+
476
+ logger.info(f"Creating/updating dataset repository: {args.output_repo}")
477
+ create_repo(
478
+ args.output_repo,
479
+ repo_type="dataset",
480
+ private=args.private,
481
+ token=hf_token,
482
+ exist_ok=True,
483
+ )
484
+
485
+ # Upload each as a dataset config
486
+ configs = [
487
+ ("global_stats", global_stats),
488
+ ("temporal_stats", temporal_stats),
489
+ ]
490
+
491
+ for config_name, stats_df in configs:
492
+ logger.info(f"Uploading {config_name}...")
493
+ ds = Dataset.from_polars(stats_df)
494
+ ds.push_to_hub(
495
+ args.output_repo,
496
+ config_name=config_name,
497
+ token=hf_token,
498
+ private=args.private,
499
+ )
500
+ time.sleep(1) # Avoid 409 conflicts
501
+
502
+ # Upload README
503
+ readme_content = create_readme(
504
+ args, global_stats, temporal_stats, scan_time, ascii_charts
505
+ )
506
+ api.upload_file(
507
+ path_or_fileobj=readme_content.encode(),
508
+ path_in_repo="README.md",
509
+ repo_id=args.output_repo,
510
+ repo_type="dataset",
511
+ token=hf_token,
512
+ )
513
+
514
+ dataset_url = f"https://huggingface.co/datasets/{args.output_repo}"
515
+ logger.info(f"Dataset uploaded: {dataset_url}")
516
+ print(f"\nResults uploaded to: {dataset_url}")
517
+
518
+
519
+ if __name__ == "__main__":
520
+ if len(sys.argv) == 1:
521
+ print("Is the Web Getting More Educational?")
522
+ print("=" * 40)
523
+ print("\nAnalyze educational quality trends across CommonCrawl dumps")
524
+ print("using Polars streaming - no download needed!\n")
525
+ print("Example commands:\n")
526
+ print("# Quick test:")
527
+ print("uv run finepdfs-stats.py --limit 10000\n")
528
+ print("# Analyze English PDFs:")
529
+ print("uv run finepdfs-stats.py\n")
530
+ print("# Analyze ALL 70+ languages:")
531
+ print("uv run finepdfs-stats.py --all-languages\n")
532
+ print("# Show query plan (see Polars optimization):")
533
+ print("uv run finepdfs-stats.py --show-plan --limit 1000\n")
534
+ print("# Save results to HF Hub:")
535
+ print("uv run finepdfs-stats.py --output-repo username/temporal-stats\n")
536
+ print("# Run on HF Jobs:")
537
+ print("hf jobs uv run \\")
538
+ print(" -s HF_TOKEN \\")
539
+ print(" -e HF_XET_HIGH_PERFORMANCE=1 \\")
540
+ print(
541
+ " https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\"
542
+ )
543
+ print(" -- --output-repo username/stats")
544
+ sys.exit(0)
545
+
546
+ main()
scripts/generate-responses.py ADDED
@@ -0,0 +1,587 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "datasets",
5
+ # "flashinfer-python",
6
+ # "huggingface-hub[hf_transfer]",
7
+ #     "hf-xet>=1.1.7",
8
+ # "torch",
9
+ # "transformers",
10
+ # "vllm>=0.8.5",
11
+ # ]
12
+ #
13
+ # ///
14
+ """
15
+ Generate responses for prompts in a dataset using vLLM for efficient GPU inference.
16
+
17
+ This script loads a dataset from Hugging Face Hub containing chat-formatted messages,
18
+ applies the model's chat template, generates responses using vLLM, and saves the
19
+ results back to the Hub with a comprehensive dataset card.
20
+
21
+ Example usage:
22
+ # Local execution with auto GPU detection
23
+ uv run generate-responses.py \\
24
+ username/input-dataset \\
25
+ username/output-dataset \\
26
+ --messages-column messages
27
+
28
+ # With custom model and sampling parameters
29
+ uv run generate-responses.py \\
30
+ username/input-dataset \\
31
+ username/output-dataset \\
32
+ --model-id meta-llama/Llama-3.1-8B-Instruct \\
33
+ --temperature 0.9 \\
34
+ --top-p 0.95 \\
35
+ --max-tokens 2048
36
+
37
+ # HF Jobs execution (see script output for full command)
38
+ hf jobs uv run --flavor a100x4 ...
39
+ """
40
+
41
+ import argparse
42
+ import logging
43
+ import os
44
+ import sys
45
+ from datetime import datetime
46
+ from typing import Optional
47
+ 
+ # Enable HF Transfer for faster downloads (must be set before huggingface_hub is imported)
+ os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
+ 
48
+ from datasets import load_dataset
49
+ from huggingface_hub import DatasetCard, get_token, login
50
+ from torch import cuda
51
+ from tqdm.auto import tqdm
52
+ from transformers import AutoTokenizer
53
+ from vllm import LLM, SamplingParams
54
+
55
57
+
58
+ logging.basicConfig(
59
+ level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
60
+ )
61
+ logger = logging.getLogger(__name__)
62
+
63
+
64
+ def check_gpu_availability() -> int:
65
+ """Check if CUDA is available and return the number of GPUs."""
66
+ if not cuda.is_available():
67
+ logger.error("CUDA is not available. This script requires a GPU.")
68
+ logger.error(
69
+ "Please run on a machine with NVIDIA GPU or use HF Jobs with GPU flavor."
70
+ )
71
+ sys.exit(1)
72
+
73
+ num_gpus = cuda.device_count()
74
+ for i in range(num_gpus):
75
+ gpu_name = cuda.get_device_name(i)
76
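+         # total_memory is reported in bytes; convert to gibibytes for display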
+ gpu_memory = cuda.get_device_properties(i).total_memory / 1024**3
77
+ logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.1f} GB memory")
78
+
79
+ return num_gpus
80
+
81
+
82
+ def create_dataset_card(
83
+ source_dataset: str,
84
+ model_id: str,
85
+ messages_column: str,
86
+ prompt_column: Optional[str],
87
+ sampling_params: SamplingParams,
88
+ tensor_parallel_size: int,
89
+ num_examples: int,
90
+ generation_time: str,
91
+ num_skipped: int = 0,
92
+ max_model_len_used: Optional[int] = None,
93
+ ) -> str:
94
+ """Create a comprehensive dataset card documenting the generation process."""
95
+ filtering_section = ""
96
+ if num_skipped > 0:
97
+ skip_percentage = (num_skipped / num_examples) * 100
98
+ processed = num_examples - num_skipped
99
+ filtering_section = f"""
100
+
101
+ ### Filtering Statistics
102
+
103
+ - **Total Examples**: {num_examples:,}
104
+ - **Processed**: {processed:,} ({100 - skip_percentage:.1f}%)
105
+ - **Skipped (too long)**: {num_skipped:,} ({skip_percentage:.1f}%)
106
+ - **Max Model Length Used**: {max_model_len_used:,} tokens
107
+
108
+ Note: Prompts exceeding the maximum model length were skipped and have empty responses."""
109
+
110
+ return f"""---
111
+ tags:
112
+ - generated
113
+ - vllm
114
+ - uv-script
115
+ ---
116
+
117
+ # Generated Responses Dataset
118
+
119
+ This dataset contains generated responses for prompts from [{source_dataset}](https://huggingface.co/datasets/{source_dataset}).
120
+
121
+ ## Generation Details
122
+
123
+ - **Source Dataset**: [{source_dataset}](https://huggingface.co/datasets/{source_dataset})
124
+ - **Input Column**: `{prompt_column if prompt_column else messages_column}` ({"plain text prompts" if prompt_column else "chat messages"})
125
+ - **Model**: [{model_id}](https://huggingface.co/{model_id})
126
+ - **Number of Examples**: {num_examples:,}
127
+ - **Generation Date**: {generation_time}{filtering_section}
128
+
129
+ ### Sampling Parameters
130
+
131
+ - **Temperature**: {sampling_params.temperature}
132
+ - **Top P**: {sampling_params.top_p}
133
+ - **Top K**: {sampling_params.top_k}
134
+ - **Min P**: {sampling_params.min_p}
135
+ - **Max Tokens**: {sampling_params.max_tokens}
136
+ - **Repetition Penalty**: {sampling_params.repetition_penalty}
137
+
138
+ ### Hardware Configuration
139
+
140
+ - **Tensor Parallel Size**: {tensor_parallel_size}
141
+ - **GPU Configuration**: {tensor_parallel_size} GPU(s)
142
+
143
+ ## Dataset Structure
144
+
145
+ The dataset contains all columns from the source dataset plus:
146
+ - `response`: The generated response from the model
147
+
148
+ ## Generation Script
149
+
150
+ Generated using the vLLM inference script from [uv-scripts/vllm](https://huggingface.co/datasets/uv-scripts/vllm).
151
+
152
+ To reproduce this generation:
153
+
154
+ ```bash
155
+ uv run https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
156
+ {source_dataset} \\
157
+ <output-dataset> \\
158
+ --model-id {model_id} \\
159
+ {"--prompt-column " + prompt_column if prompt_column else "--messages-column " + messages_column} \\
160
+ --temperature {sampling_params.temperature} \\
161
+ --top-p {sampling_params.top_p} \\
162
+ --top-k {sampling_params.top_k} \\
163
+ --max-tokens {sampling_params.max_tokens}{f" \\\\\\n --max-model-len {max_model_len_used}" if max_model_len_used else ""}
164
+ ```
165
+ """
166
+
167
+
168
+ def main(
169
+ src_dataset_hub_id: str,
170
+ output_dataset_hub_id: str,
171
+ model_id: str = "Qwen/Qwen3-30B-A3B-Instruct-2507",
172
+ messages_column: str = "messages",
173
+ prompt_column: Optional[str] = None,
174
+ output_column: str = "response",
175
+ temperature: float = 0.7,
176
+ top_p: float = 0.8,
177
+ top_k: int = 20,
178
+ min_p: float = 0.0,
179
+ max_tokens: int = 16384,
180
+ repetition_penalty: float = 1.0,
181
+ gpu_memory_utilization: float = 0.90,
182
+ max_model_len: Optional[int] = None,
183
+ tensor_parallel_size: Optional[int] = None,
184
+ skip_long_prompts: bool = True,
185
+ max_samples: Optional[int] = None,
186
+ hf_token: Optional[str] = None,
187
+ ):
188
+ """
189
+ Main generation pipeline.
190
+
191
+ Args:
192
+ src_dataset_hub_id: Input dataset on Hugging Face Hub
193
+ output_dataset_hub_id: Where to save results on Hugging Face Hub
194
+ model_id: Hugging Face model ID for generation
195
+ messages_column: Column name containing chat messages
196
+ prompt_column: Column name containing plain text prompts (alternative to messages_column)
197
+ output_column: Column name for generated responses
198
+ temperature: Sampling temperature
199
+ top_p: Top-p sampling parameter
200
+ top_k: Top-k sampling parameter
201
+ min_p: Minimum probability threshold
202
+ max_tokens: Maximum tokens to generate
203
+ repetition_penalty: Repetition penalty parameter
204
+ gpu_memory_utilization: GPU memory utilization factor
205
+ max_model_len: Maximum model context length (None uses model default)
206
+ tensor_parallel_size: Number of GPUs to use (auto-detect if None)
207
+ skip_long_prompts: Skip prompts exceeding max_model_len instead of failing
208
+ max_samples: Maximum number of samples to process (None for all)
209
+ hf_token: Hugging Face authentication token
210
+ """
211
+ generation_start_time = datetime.now().isoformat()
212
+
213
+ # GPU check and configuration
214
+ num_gpus = check_gpu_availability()
215
+ if tensor_parallel_size is None:
216
+ tensor_parallel_size = num_gpus
217
+ logger.info(
218
+ f"Auto-detected {num_gpus} GPU(s), using tensor_parallel_size={tensor_parallel_size}"
219
+ )
220
+ else:
221
+ logger.info(f"Using specified tensor_parallel_size={tensor_parallel_size}")
222
+ if tensor_parallel_size > num_gpus:
223
+ logger.warning(
224
+ f"Requested {tensor_parallel_size} GPUs but only {num_gpus} available"
225
+ )
226
+
227
+ # Authentication - try multiple methods
228
+ HF_TOKEN = hf_token or os.environ.get("HF_TOKEN") or get_token()
229
+
230
+ if not HF_TOKEN:
231
+ logger.error("No HuggingFace token found. Please provide token via:")
232
+ logger.error(" 1. --hf-token argument")
233
+ logger.error(" 2. HF_TOKEN environment variable")
234
+ logger.error(" 3. Run 'huggingface-cli login' or use login() in Python")
235
+ sys.exit(1)
236
+
237
+ logger.info("HuggingFace token found, authenticating...")
238
+ login(token=HF_TOKEN)
239
+
240
+ # Initialize vLLM
241
+ logger.info(f"Loading model: {model_id}")
242
+ vllm_kwargs = {
243
+ "model": model_id,
244
+ "tensor_parallel_size": tensor_parallel_size,
245
+ "gpu_memory_utilization": gpu_memory_utilization,
246
+ }
247
+ if max_model_len is not None:
248
+ vllm_kwargs["max_model_len"] = max_model_len
249
+ logger.info(f"Using max_model_len={max_model_len}")
250
+
251
+ llm = LLM(**vllm_kwargs)
252
+
253
+ # Load tokenizer for chat template
254
+ logger.info("Loading tokenizer...")
255
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
256
+
257
+ # Create sampling parameters
258
+ sampling_params = SamplingParams(
259
+ temperature=temperature,
260
+ top_p=top_p,
261
+ top_k=top_k,
262
+ min_p=min_p,
263
+ max_tokens=max_tokens,
264
+ repetition_penalty=repetition_penalty,
265
+ )
266
+
267
+ # Load dataset
268
+ logger.info(f"Loading dataset: {src_dataset_hub_id}")
269
+ dataset = load_dataset(src_dataset_hub_id, split="train")
270
+
271
+ # Apply max_samples if specified
272
+ if max_samples is not None and max_samples < len(dataset):
273
+ logger.info(f"Limiting dataset to {max_samples} samples")
274
+ dataset = dataset.select(range(max_samples))
275
+
276
+ total_examples = len(dataset)
277
+ logger.info(f"Dataset loaded with {total_examples:,} examples")
278
+
279
+ # Determine which column to use and validate
280
+ if prompt_column:
281
+ # Use prompt column mode
282
+ if prompt_column not in dataset.column_names:
283
+ logger.error(
284
+ f"Column '{prompt_column}' not found. Available columns: {dataset.column_names}"
285
+ )
286
+ sys.exit(1)
287
+ logger.info(f"Using prompt column mode with column: '{prompt_column}'")
288
+ use_messages = False
289
+ else:
290
+ # Use messages column mode
291
+ if messages_column not in dataset.column_names:
292
+ logger.error(
293
+ f"Column '{messages_column}' not found. Available columns: {dataset.column_names}"
294
+ )
295
+ sys.exit(1)
296
+ logger.info(f"Using messages column mode with column: '{messages_column}'")
297
+ use_messages = True
298
+
299
+ # Get effective max length for filtering
300
+ if max_model_len is not None:
301
+ effective_max_len = max_model_len
302
+ else:
303
+ # Get model's default max length
304
+ effective_max_len = llm.llm_engine.model_config.max_model_len
305
+ logger.info(f"Using effective max model length: {effective_max_len}")
306
+
307
+ # Process messages and apply chat template
308
+ logger.info("Preparing prompts...")
309
+ all_prompts = []
310
+ valid_prompts = []
311
+ valid_indices = []
312
+ skipped_info = []
313
+
314
+ for i, example in enumerate(tqdm(dataset, desc="Processing prompts")):
315
+ if use_messages:
316
+ # Messages mode: use existing chat messages
317
+ messages = example[messages_column]
318
+ # Apply chat template
319
+ prompt = tokenizer.apply_chat_template(
320
+ messages, tokenize=False, add_generation_prompt=True
321
+ )
322
+ else:
323
+ # Prompt mode: convert plain text to messages format
324
+ user_prompt = example[prompt_column]
325
+ messages = [{"role": "user", "content": user_prompt}]
326
+ # Apply chat template
327
+ prompt = tokenizer.apply_chat_template(
328
+ messages, tokenize=False, add_generation_prompt=True
329
+ )
330
+
331
+ all_prompts.append(prompt)
332
+
333
+ # Count tokens if filtering is enabled
334
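+         # Note: only the prompt length is checked against the context limit; room for generated tokens is not reserved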
+ if skip_long_prompts:
335
+ tokens = tokenizer.encode(prompt)
336
+ if len(tokens) <= effective_max_len:
337
+ valid_prompts.append(prompt)
338
+ valid_indices.append(i)
339
+ else:
340
+ skipped_info.append((i, len(tokens)))
341
+ else:
342
+ valid_prompts.append(prompt)
343
+ valid_indices.append(i)
344
+
345
+ # Log filtering results
346
+ if skip_long_prompts and skipped_info:
347
+ logger.warning(
348
+ f"Skipped {len(skipped_info)} prompts that exceed max_model_len ({effective_max_len} tokens)"
349
+ )
350
+ logger.info("Skipped prompt details (first 10):")
351
+ for idx, (prompt_idx, token_count) in enumerate(skipped_info[:10]):
352
+ logger.info(
353
+ f" - Example {prompt_idx}: {token_count} tokens (exceeds by {token_count - effective_max_len})"
354
+ )
355
+ if len(skipped_info) > 10:
356
+ logger.info(f" ... and {len(skipped_info) - 10} more")
357
+
358
+ skip_percentage = (len(skipped_info) / total_examples) * 100
359
+ if skip_percentage > 10:
360
+ logger.warning(f"WARNING: {skip_percentage:.1f}% of prompts were skipped!")
361
+
362
+ if not valid_prompts:
363
+ logger.error("No valid prompts to process after filtering!")
364
+ sys.exit(1)
365
+
366
+ # Generate responses - vLLM handles batching internally
367
+ logger.info(f"Starting generation for {len(valid_prompts):,} valid prompts...")
368
+ logger.info("vLLM will handle batching and scheduling automatically")
369
+
370
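+     # generate() returns outputs in the same order as valid_prompts, so valid_indices maps each one back to its row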
+ outputs = llm.generate(valid_prompts, sampling_params)
371
+
372
+ # Extract generated text and create full response list
373
+ logger.info("Extracting generated responses...")
374
+ responses = [""] * total_examples # Initialize with empty strings
375
+
376
+ for idx, output in enumerate(outputs):
377
+ original_idx = valid_indices[idx]
378
+ response = output.outputs[0].text.strip()
379
+ responses[original_idx] = response
380
+
381
+ # Add responses to dataset
382
+ logger.info("Adding responses to dataset...")
383
+ dataset = dataset.add_column(output_column, responses)
384
+
385
+ # Create dataset card
386
+ logger.info("Creating dataset card...")
387
+ card_content = create_dataset_card(
388
+ source_dataset=src_dataset_hub_id,
389
+ model_id=model_id,
390
+ messages_column=messages_column,
391
+ prompt_column=prompt_column,
392
+ sampling_params=sampling_params,
393
+ tensor_parallel_size=tensor_parallel_size,
394
+ num_examples=total_examples,
395
+ generation_time=generation_start_time,
396
+ num_skipped=len(skipped_info) if skip_long_prompts else 0,
397
+ max_model_len_used=effective_max_len if skip_long_prompts else None,
398
+ )
399
+
400
+ # Push dataset to hub
401
+ logger.info(f"Pushing dataset to: {output_dataset_hub_id}")
402
+ dataset.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)
403
+
404
+ # Push dataset card
405
+ card = DatasetCard(card_content)
406
+ card.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)
407
+
408
+ logger.info("✅ Generation complete!")
409
+ logger.info(
410
+ f"Dataset available at: https://huggingface.co/datasets/{output_dataset_hub_id}"
411
+ )
412
+
413
+
414
+ if __name__ == "__main__":
415
+ if len(sys.argv) > 1:
416
+ parser = argparse.ArgumentParser(
417
+ description="Generate responses for dataset prompts using vLLM",
418
+ formatter_class=argparse.RawDescriptionHelpFormatter,
419
+ epilog="""
420
+ Examples:
421
+ # Basic usage with default Qwen model
422
+ uv run generate-responses.py input-dataset output-dataset
423
+
424
+ # With custom model and parameters
425
+ uv run generate-responses.py input-dataset output-dataset \\
426
+ --model-id meta-llama/Llama-3.1-8B-Instruct \\
427
+ --temperature 0.9 \\
428
+ --max-tokens 2048
429
+
430
+ # Force specific GPU configuration
431
+ uv run generate-responses.py input-dataset output-dataset \\
432
+ --tensor-parallel-size 2 \\
433
+ --gpu-memory-utilization 0.95
434
+
435
+ # Using environment variable for token
436
+ HF_TOKEN=hf_xxx uv run generate-responses.py input-dataset output-dataset
437
+ """,
438
+ )
439
+
440
+ parser.add_argument(
441
+ "src_dataset_hub_id",
442
+ help="Input dataset on Hugging Face Hub (e.g., username/dataset-name)",
443
+ )
444
+ parser.add_argument(
445
+ "output_dataset_hub_id", help="Output dataset name on Hugging Face Hub"
446
+ )
447
+ parser.add_argument(
448
+ "--model-id",
449
+ type=str,
450
+ default="Qwen/Qwen3-30B-A3B-Instruct-2507",
451
+ help="Model to use for generation (default: Qwen3-30B-A3B-Instruct-2507)",
452
+ )
453
+ parser.add_argument(
454
+ "--messages-column",
455
+ type=str,
456
+ default="messages",
457
+ help="Column containing chat messages (default: messages)",
458
+ )
459
+ parser.add_argument(
460
+ "--prompt-column",
461
+ type=str,
462
+ help="Column containing plain text prompts (alternative to --messages-column)",
463
+ )
464
+ parser.add_argument(
465
+ "--output-column",
466
+ type=str,
467
+ default="response",
468
+ help="Column name for generated responses (default: response)",
469
+ )
470
+ parser.add_argument(
471
+ "--max-samples",
472
+ type=int,
473
+ help="Maximum number of samples to process (default: all)",
474
+ )
475
+ parser.add_argument(
476
+ "--temperature",
477
+ type=float,
478
+ default=0.7,
479
+ help="Sampling temperature (default: 0.7)",
480
+ )
481
+ parser.add_argument(
482
+ "--top-p",
483
+ type=float,
484
+ default=0.8,
485
+ help="Top-p sampling parameter (default: 0.8)",
486
+ )
487
+ parser.add_argument(
488
+ "--top-k",
489
+ type=int,
490
+ default=20,
491
+ help="Top-k sampling parameter (default: 20)",
492
+ )
493
+ parser.add_argument(
494
+ "--min-p",
495
+ type=float,
496
+ default=0.0,
497
+ help="Minimum probability threshold (default: 0.0)",
498
+ )
499
+ parser.add_argument(
500
+ "--max-tokens",
501
+ type=int,
502
+ default=16384,
503
+ help="Maximum tokens to generate (default: 16384)",
504
+ )
505
+ parser.add_argument(
506
+ "--repetition-penalty",
507
+ type=float,
508
+ default=1.0,
509
+ help="Repetition penalty (default: 1.0)",
510
+ )
511
+ parser.add_argument(
512
+ "--gpu-memory-utilization",
513
+ type=float,
514
+ default=0.90,
515
+ help="GPU memory utilization factor (default: 0.90)",
516
+ )
517
+ parser.add_argument(
518
+ "--max-model-len",
519
+ type=int,
520
+ help="Maximum model context length (default: model's default)",
521
+ )
522
+ parser.add_argument(
523
+ "--tensor-parallel-size",
524
+ type=int,
525
+ help="Number of GPUs to use (default: auto-detect)",
526
+ )
527
+ parser.add_argument(
528
+ "--hf-token",
529
+ type=str,
530
+ help="Hugging Face token (can also use HF_TOKEN env var)",
531
+ )
532
+ parser.add_argument(
533
+ "--skip-long-prompts",
534
+ action="store_true",
535
+ default=True,
536
+ help="Skip prompts that exceed max_model_len instead of failing (default: True)",
537
+ )
538
+ parser.add_argument(
539
+ "--no-skip-long-prompts",
540
+ dest="skip_long_prompts",
541
+ action="store_false",
542
+ help="Fail on prompts that exceed max_model_len",
543
+ )
544
+
545
+ args = parser.parse_args()
546
+
547
+ main(
548
+ src_dataset_hub_id=args.src_dataset_hub_id,
549
+ output_dataset_hub_id=args.output_dataset_hub_id,
550
+ model_id=args.model_id,
551
+ messages_column=args.messages_column,
552
+ prompt_column=args.prompt_column,
553
+ output_column=args.output_column,
554
+ temperature=args.temperature,
555
+ top_p=args.top_p,
556
+ top_k=args.top_k,
557
+ min_p=args.min_p,
558
+ max_tokens=args.max_tokens,
559
+ repetition_penalty=args.repetition_penalty,
560
+ gpu_memory_utilization=args.gpu_memory_utilization,
561
+ max_model_len=args.max_model_len,
562
+ tensor_parallel_size=args.tensor_parallel_size,
563
+ skip_long_prompts=args.skip_long_prompts,
564
+ max_samples=args.max_samples,
565
+ hf_token=args.hf_token,
566
+ )
567
+ else:
568
+ # Show HF Jobs example when run without arguments
569
+ print("""
570
+ vLLM Response Generation Script
571
+ ==============================
572
+
573
+ This script requires arguments. For usage information:
574
+ uv run generate-responses.py --help
575
+
576
+ Example HF Jobs command with multi-GPU:
577
+ # If you're logged in with huggingface-cli, token will be auto-detected
578
+ hf jobs uv run \\
579
+ --flavor l4x4 \\
580
+ https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
581
+ username/input-dataset \\
582
+ username/output-dataset \\
583
+ --messages-column messages \\
584
+ --model-id Qwen/Qwen3-30B-A3B-Instruct-2507 \\
585
+ --temperature 0.7 \\
586
+ --max-tokens 16384
587
+ """)