Upload folder using huggingface_hub

- SKILL.md +752 -0
- index.html +213 -18
- references/hardware_guide.md +266 -0
- references/hub_saving.md +339 -0
- references/token_usage.md +546 -0
- references/troubleshooting.md +431 -0
- scripts/cot-self-instruct.py +718 -0
- scripts/finepdfs-stats.py +546 -0
- scripts/generate-responses.py +587 -0
SKILL.md
ADDED
@@ -0,0 +1,752 @@
---
name: hf-jobs
description: This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup.
license: Complete terms in LICENSE.txt
---

# Running Workloads on Hugging Face Jobs

## Overview

Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.

**Common use cases:**
- **Data Processing** - Transform, filter, or analyze large datasets
- **Batch Inference** - Run inference on thousands of samples
- **Experiments & Benchmarks** - Reproducible ML experiments
- **Model Training** - Fine-tune models (see `model-trainer` skill for TRL-specific training)
- **Synthetic Data Generation** - Generate datasets using LLMs
- **Development & Testing** - Test code without local GPU setup
- **Scheduled Jobs** - Automate recurring tasks

**For model training specifically:** See the `model-trainer` skill for TRL-based training workflows.

## When to Use This Skill

Use this skill when users want to:
- Run Python workloads on cloud infrastructure
- Execute jobs without local GPU/TPU setup
- Process data at scale
- Run batch inference or experiments
- Schedule recurring tasks
- Use GPUs/TPUs for any workload
- Persist results to the Hugging Face Hub

## Key Directives

When assisting with jobs:

1. **ALWAYS use `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})` or `hf_jobs("run", {...})`. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`.

2. **Always handle authentication** - Jobs that interact with the Hub require `HF_TOKEN` via secrets. See Token Usage section below.

3. **Provide job details after submission** - After submitting, provide job ID, monitoring URL, estimated time, and note that the user can request status checks later.

4. **Set appropriate timeouts** - The default 30 minutes may be insufficient for long-running tasks.

## Prerequisites Checklist

Before starting any job, verify:

### ✅ **Account & Authentication**
- Hugging Face Account with [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
- Authenticated login: Check with `hf_whoami()`
- **HF_TOKEN for Hub Access** ⚠️ CRITICAL - Required for any Hub operations (push models/datasets, download private repos, etc.)
- Token must have appropriate permissions (read for downloads, write for uploads)

### ✅ **Token Usage** (See Token Usage section for details)

**When tokens are required:**
- Pushing models/datasets to Hub
- Accessing private repositories
- Using Hub APIs in scripts
- Any authenticated Hub operations

**How to provide tokens:**
```python
{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Recommended: automatic token
}
```

**⚠️ CRITICAL:** The `$HF_TOKEN` placeholder is automatically replaced with your logged-in token. Never hardcode tokens in scripts.

## Token Usage Guide

### Understanding Tokens

**What are HF Tokens?**
- Authentication credentials for Hugging Face Hub
- Required for authenticated operations (push, private repos, API access)
- Stored securely on your machine after `hf auth login`

**Token Types:**
- **Read Token** - Can download models/datasets, read private repos
- **Write Token** - Can push models/datasets, create repos, modify content
- **Organization Token** - Can act on behalf of an organization

### When Tokens Are Required

**Always Required:**
- Pushing models/datasets to Hub
- Accessing private repositories
- Creating new repositories
- Modifying existing repositories
- Using Hub APIs programmatically

**Not Required:**
- Downloading public models/datasets
- Running jobs that don't interact with Hub
- Reading public repository information

### How to Provide Tokens to Jobs

#### Method 1: Automatic Token (Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement
})
```

**How it works:**
- `$HF_TOKEN` is a placeholder that gets replaced with your actual token
- Uses the token from your logged-in session (`hf auth login`)
- Most secure and convenient method
- Token is encrypted server-side when passed as a secret

**Benefits:**
- No token exposure in code
- Uses your current login session
- Automatically updated if you re-login
- Works seamlessly with MCP tools

#### Method 2: Explicit Token (Not Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Hardcoded token
})
```

**When to use:**
- Only if automatic token doesn't work
- Testing with a specific token
- Organization tokens (use with caution)

**Security concerns:**
- Token visible in code/logs
- Must manually update if token rotates
- Risk of token exposure

#### Method 3: Environment Variable (Less Secure)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "env": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Less secure than secrets
})
```

**Difference from secrets:**
- `env` variables are visible in job logs
- `secrets` are encrypted server-side
- Always prefer `secrets` for tokens

### Using Tokens in Scripts

**In your Python script, tokens are available as environment variables:**

```python
# /// script
# dependencies = ["huggingface-hub"]
# ///

import os
from huggingface_hub import HfApi

# Token is automatically available if passed via secrets
token = os.environ.get("HF_TOKEN")

# Use with Hub API
api = HfApi(token=token)

# Or let huggingface_hub auto-detect
api = HfApi()  # Automatically uses HF_TOKEN env var
```

**Best practices:**
- Don't hardcode tokens in scripts
- Use `os.environ.get("HF_TOKEN")` to access
- Let `huggingface_hub` auto-detect when possible
- Verify token exists before Hub operations

### Token Verification

**Check if you're logged in:**
```python
from huggingface_hub import whoami
user_info = whoami()  # Returns your username if authenticated
```

**Verify token in job:**
```python
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
token = os.environ["HF_TOKEN"]
print(f"Token starts with: {token[:7]}...")  # Should start with "hf_"
```

### Common Token Issues

**Error: 401 Unauthorized**
- **Cause:** Token missing or invalid
- **Fix:** Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
- **Verify:** Check `hf_whoami()` works locally

**Error: 403 Forbidden**
- **Cause:** Token lacks required permissions
- **Fix:** Ensure token has write permissions for push operations
- **Check:** Token type at https://huggingface.co/settings/tokens

**Error: Token not found in environment**
- **Cause:** `secrets` not passed or wrong key name
- **Fix:** Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
- **Verify:** Script checks `os.environ.get("HF_TOKEN")`

**Error: Repository access denied**
- **Cause:** Token doesn't have access to private repo
- **Fix:** Use token from account with access
- **Check:** Verify repo visibility and your permissions
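
To surface these cases early in a job, a script can wrap its first Hub call and map the status code back to the fixes above. A minimal sketch (assuming `huggingface_hub`'s `HfHubHTTPError`; the repo name is a placeholder):

```python
import os
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi(token=os.environ.get("HF_TOKEN"))
try:
    # First Hub write of the job; fails fast if the token is bad
    api.upload_file(
        path_or_fileobj="results.json",
        path_in_repo="results.json",
        repo_id="username/results",  # placeholder
        repo_type="dataset",
    )
except HfHubHTTPError as e:
    status = e.response.status_code if e.response is not None else None
    if status == 401:
        print("401 Unauthorized: token missing or invalid; pass secrets={'HF_TOKEN': '$HF_TOKEN'}")
    elif status == 403:
        print("403 Forbidden: token lacks write permission; check your token settings")
    raise
```
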
### Token Security Best Practices

1. **Never commit tokens** - Use `$HF_TOKEN` placeholder or environment variables
2. **Use secrets, not env** - Secrets are encrypted server-side
3. **Rotate tokens regularly** - Generate new tokens periodically
4. **Use minimal permissions** - Create tokens with only needed permissions
5. **Don't share tokens** - Each user should use their own token
6. **Monitor token usage** - Check token activity in Hub settings

### Complete Token Example

```python
# Example: Push results to Hub
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///

import os
from huggingface_hub import HfApi
from datasets import Dataset

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Use token for Hub operations
api = HfApi(token=os.environ["HF_TOKEN"])

# Create and push dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])

print("✅ Dataset pushed successfully!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Token provided securely
})
```

## Quick Start: Two Approaches

### Approach 1: UV Scripts (Recommended)

UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///

from transformers import pipeline
import torch

# Your workload here
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)
""",
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```

**Benefits:** Direct MCP tool usage, clean code, dependencies declared inline, no file saving required

**When to use:** Default choice for all workloads, custom logic, any scenario requiring `hf_jobs()`

#### Working with Scripts

⚠️ **Important:** There are *two* “script path” stories depending on how you run Jobs:

- **Using the `hf_jobs()` MCP tool (recommended in this repo)**: the `script` value must be **inline code** (a string) or a **URL**. A local filesystem path (like `"./scripts/foo.py"`) won’t exist inside the remote container.
- **Using the `hf jobs uv run` CLI**: local file paths **do work** (the CLI uploads your script).

**Common mistake with `hf_jobs()` MCP tool:**

```python
# ❌ Will fail (remote container can't see your local path)
hf_jobs("uv", {"script": "./scripts/foo.py"})
```

**Correct patterns with `hf_jobs()` MCP tool:**

```python
# ✅ Inline: read the local script file and pass its *contents*
from pathlib import Path
script = Path("hf-jobs/scripts/foo.py").read_text()
hf_jobs("uv", {"script": script})

# ✅ URL: host the script somewhere reachable
hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})
```

**CLI equivalent (local paths supported):**

```bash
hf jobs uv run ./scripts/foo.py -- --your --args
```

### Approach 2: Docker-Based Jobs

Run jobs with custom Docker images and commands.

```python
hf_jobs("run", {
    "image": "python:3.12",
    "command": ["python", "-c", "print('Hello from HF Jobs!')"],
    "flavor": "cpu-basic",
    "timeout": "30m"
})
```

**Benefits:** Full Docker control, use pre-built images, run any command
**When to use:** Need specific Docker images, non-Python workloads, complex environments

**Example with GPU:**
```python
hf_jobs("run", {
    "image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
    "command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
    "flavor": "a10g-small",
    "timeout": "1h"
})
```

### Finding More UV Scripts on Hub

The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:

```python
# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})

# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
```

**Popular collections:** OCR, classification, synthetic-data, vLLM, dataset-creation

## Hardware Selection

| Workload Type | Recommended Hardware | Cost (approx./hr) | Use Case |
|---------------|---------------------|------------------|----------|
| Data processing, testing | `cpu-basic`, `cpu-upgrade` | ~$0.10-0.50 | Lightweight tasks |
| Small models, demos | `t4-small` | ~$0.75 | <1B models, quick tests |
| Medium models | `t4-medium`, `l4x1` | ~$1.50-2.50 | 1-7B models |
| Large models, production | `a10g-small`, `a10g-large` | ~$3.50-5.00 | 7-13B models |
| Very large models | `a100-large` | ~$8-12 | 13B+ models |
| Batch inference | `a10g-large`, `a100-large` | ~$5-10 | High-throughput |
| Data processing | `cpu-upgrade`, `l4x1` | ~$0.50-2.50 | Parallel workloads |

**CPU flavors:** cpu-basic, cpu-upgrade

**GPU flavors:** t4-small/medium, l4x1/x4, a10g-small/large/largex2/largex4, a100-large, h100/h100x8

**TPU flavors:** v5e-1x1, v5e-2x2, v5e-2x4

**Guidelines:**
- Start with smaller hardware for testing
- Scale up based on actual needs
- Use multi-GPU for parallel workloads
- See `references/hardware_guide.md` for detailed specifications
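
As a rough illustration of these guidelines, a picker like the following can suggest a starting flavor from a model's parameter count (thresholds are read off the table above, not official limits; always validate with a short test job):

```python
def suggest_flavor(params_billion: float, gpu_needed: bool = True) -> str:
    """Starting-point flavor suggestion based on the table above."""
    if not gpu_needed:
        return "cpu-upgrade"
    if params_billion < 1:
        return "t4-small"
    if params_billion < 7:
        return "t4-medium"   # or "l4x1" for more memory headroom
    if params_billion < 13:
        return "a10g-small"  # "a10g-large" for production batch inference
    return "a100-large"

print(suggest_flavor(7))  # "a10g-small" per the 7-13B row
```
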
## Critical: Saving Results

**⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS**

The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, **ALL WORK IS LOST**.

### Persistence Options

**1. Push to Hugging Face Hub (Recommended)**

```python
# Push models
model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])

# Push datasets
dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])

# Push artifacts
api.upload_file(
    path_or_fileobj="results.json",
    path_in_repo="results.json",
    repo_id="username/results",
    token=os.environ["HF_TOKEN"]
)
```

**2. Use External Storage**

```python
# Upload to S3, GCS, etc.
import boto3
s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')
```

**3. Send Results via API**

```python
# POST results to your API
import requests
requests.post("https://your-api.com/results", json=results)
```

### Required Configuration for Hub Push

**In job submission:**
```python
{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Enables authentication
}
```

**In script:**
```python
import os
from huggingface_hub import HfApi

# Token automatically available from secrets
api = HfApi(token=os.environ.get("HF_TOKEN"))

# Push your results
api.upload_file(...)
```

### Verification Checklist

Before submitting:
- [ ] Results persistence method chosen
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` if using Hub
- [ ] Script handles missing token gracefully (see the sketch below)
- [ ] Test persistence path works
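
For the "handles missing token gracefully" item, failing fast with an actionable message beats crashing mid-run. A minimal sketch:

```python
import os
import sys

token = os.environ.get("HF_TOKEN")
if token is None:
    # Exit early with a clear fix instead of failing on the first Hub call
    sys.exit('HF_TOKEN not set. Submit the job with secrets={"HF_TOKEN": "$HF_TOKEN"}.')
# ... run the workload, then push results using `token` ...
```
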
**See:** `references/hub_saving.md` for detailed Hub persistence guide

## Timeout Management

**⚠️ DEFAULT: 30 MINUTES**

### Setting Timeouts

```python
{
    "timeout": "2h"  # 2 hours (formats: "90m", "2h", "1.5h", or seconds as integer)
}
```

### Timeout Guidelines

| Scenario | Recommended | Notes |
|----------|-------------|-------|
| Quick test | 10-30 min | Verify setup |
| Data processing | 1-2 hours | Depends on data size |
| Batch inference | 2-4 hours | Large batches |
| Experiments | 4-8 hours | Multiple runs |
| Long-running | 8-24 hours | Production workloads |

**Always add 20-30% buffer** for setup, network delays, and cleanup.

**On timeout:** Job killed immediately, all unsaved progress lost
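
To apply the buffer rule mechanically, a small helper (a sketch; the 25% default is the midpoint of the range above) can pad an estimated runtime and emit a timeout string:

```python
import math

def buffered_timeout(estimated_minutes: float, buffer: float = 0.25) -> str:
    """Pad an estimated runtime by ~20-30% and return a Jobs timeout string."""
    return f"{math.ceil(estimated_minutes * (1 + buffer))}m"

print(buffered_timeout(90))  # "113m" for a 90-minute estimate
```
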
## Cost Estimation

**General guidelines:**

```
Total Cost = (Hours of runtime) × (Cost per hour)
```

**Example calculations:**

**Quick test:**
- Hardware: cpu-basic ($0.10/hour)
- Time: 15 minutes (0.25 hours)
- Cost: $0.03

**Data processing:**
- Hardware: l4x1 ($2.50/hour)
- Time: 2 hours
- Cost: $5.00

**Batch inference:**
- Hardware: a10g-large ($5/hour)
- Time: 4 hours
- Cost: $20.00

**Cost optimization tips:**
1. Start small - Test on cpu-basic or t4-small
2. Monitor runtime - Set appropriate timeouts
3. Use checkpoints - Resume if job fails
4. Optimize code - Reduce unnecessary compute
5. Choose right hardware - Don't over-provision
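
The formula is simple enough to wrap in a helper when comparing flavors. A sketch (the rates are the approximate figures above and will drift; check current pricing before relying on them):

```python
APPROX_RATES = {  # approximate $/hour from the tables above; verify current pricing
    "cpu-basic": 0.10,
    "cpu-upgrade": 0.50,
    "l4x1": 2.50,
    "a10g-large": 5.00,
}

def estimate_cost(flavor: str, hours: float) -> float:
    return APPROX_RATES[flavor] * hours

print(f"${estimate_cost('a10g-large', 4):.2f}")  # $20.00, matching the batch inference example
```
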
## Monitoring and Tracking

### Check Job Status

```python
# List all jobs
hf_jobs("ps")

# Inspect specific job
hf_jobs("inspect", {"job_id": "your-job-id"})

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Cancel a job
hf_jobs("cancel", {"job_id": "your-job-id"})
```

**Remember:** Wait for user to request status checks. Avoid polling repeatedly.

### Job URLs

After submission, jobs have monitoring URLs:
```
https://huggingface.co/jobs/username/job-id
```

View logs, status, and details in the browser.

## Scheduled Jobs

Run jobs on a schedule using CRON expressions or predefined schedules.

```python
# Schedule a job that runs every hour
hf_jobs("scheduled uv", {
    "script": "your_script.py",  # placeholder: with the MCP tool, pass inline code or a URL (see Working with Scripts)
    "schedule": "@hourly",
    "flavor": "cpu-basic"
})

# Use CRON syntax
hf_jobs("scheduled uv", {
    "script": "your_script.py",
    "schedule": "0 9 * * 1",  # 9 AM every Monday
    "flavor": "cpu-basic"
})
```

**Available schedules:**
- `@annually`, `@yearly` - Once per year
- `@monthly` - Once per month
- `@weekly` - Once per week
- `@daily` - Once per day
- `@hourly` - Once per hour
- CRON expression - Custom schedule (e.g., `"0 9 * * 1"`)

**Manage scheduled jobs:**
```python
hf_jobs("scheduled ps")  # List scheduled jobs
hf_jobs("scheduled suspend", {"job_id": "..."})  # Pause
hf_jobs("scheduled resume", {"job_id": "..."})  # Resume
hf_jobs("scheduled delete", {"job_id": "..."})  # Delete
```

## Common Workload Patterns

This repository ships ready-to-run UV scripts in `hf-jobs/scripts/`. Prefer using them instead of inventing new templates.

### Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`

**What it does:** loads a Hub dataset (chat `messages` or a `prompt` column), applies a model chat template, generates responses with vLLM, and **pushes** the output dataset + dataset card back to the Hub.

**Requires:** GPU + **write** token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/generate-responses.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "username/input-dataset",
        "username/output-dataset",
        "--messages-column", "messages",
        "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "--temperature", "0.7",
        "--top-p", "0.8",
        "--max-tokens", "2048",
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

### Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`

**What it does:** generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then **pushes** the generated dataset + dataset card to the Hub.

**Requires:** GPU + **write** token (it pushes a dataset).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--seed-dataset", "davanstrien/s1k-reasoning",
        "--output-dataset", "username/synthetic-math",
        "--task-type", "reasoning",
        "--num-samples", "5000",
        "--filter-method", "answer-consistency",
    ],
    "flavor": "l4x4",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

### Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`

**What it does:** scans parquet directly from Hub (no 300GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.

**Requires:** CPU is often enough; token needed **only** if you pass `--output-repo` (upload).

```python
from pathlib import Path

script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
hf_jobs("uv", {
    "script": script,
    "script_args": [
        "--limit", "10000",
        "--show-plan",
        "--output-repo", "username/finepdfs-temporal-stats",
    ],
    "flavor": "cpu-upgrade",
    "timeout": "2h",
    "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```

## Common Failure Modes

### Out of Memory (OOM)

**Fix:**
1. Reduce batch size or data chunk size
2. Process data in smaller batches
3. Upgrade hardware: cpu → t4 → a10g → a100

### Job Timeout

**Fix:**
1. Check logs for actual runtime
2. Increase timeout with buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks (see the checkpointing sketch below)
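
One way to make long jobs timeout-tolerant is to process in chunks and push partial results as you go, so a timeout loses at most the current chunk. A sketch (dataset names, chunk size, and `process()` are placeholders for your workload):

```python
# /// script
# dependencies = ["datasets"]
# ///
import os
from datasets import Dataset, load_dataset

ds = load_dataset("username/input-dataset", split="train")  # placeholder
chunk_size = 1000
results = []

for start in range(0, len(ds), chunk_size):
    chunk = ds.select(range(start, min(start + chunk_size, len(ds))))
    results.extend(process(row) for row in chunk)  # process() returns a dict per row (your workload)
    # Push cumulative results after each chunk; a timeout now loses at most one chunk
    Dataset.from_list(results).push_to_hub("username/output-dataset", token=os.environ["HF_TOKEN"])
```
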
### Hub Push Failures

**Fix:**
1. Add to job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
2. Verify token in script: `assert "HF_TOKEN" in os.environ`
3. Check token permissions
4. Verify repo exists or can be created (see the sketch below)
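
For the last fix, creating the repo idempotently before uploading avoids a whole class of push failures:

```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
# exist_ok=True makes this safe to run on every job
api.create_repo("username/results", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="results.json",
    path_in_repo="results.json",
    repo_id="username/results",
    repo_type="dataset",
)
```
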
### Missing Dependencies

**Fix:**
Add to PEP 723 header:
```python
# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///
```

### Authentication Errors

**Fix:**
1. Check `hf_whoami()` works locally
2. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
3. Re-login: `hf auth login`
4. Check token has required permissions

## Troubleshooting

**Common issues:**
- Job times out → Increase timeout, optimize code
- Results not saved → Check persistence method, verify HF_TOKEN
- Out of Memory → Reduce batch size, upgrade hardware
- Import errors → Add dependencies to PEP 723 header
- Authentication errors → Check token, verify secrets parameter

**See:** `references/troubleshooting.md` for complete troubleshooting guide

## Resources

### References (In This Skill)
- `references/token_usage.md` - Complete token usage guide
- `references/hardware_guide.md` - Hardware specs and selection
- `references/hub_saving.md` - Hub persistence guide
- `references/troubleshooting.md` - Common issues and solutions

### Scripts (In This Skill)
- `scripts/generate-responses.py` - vLLM batch generation: dataset → responses → push to Hub
- `scripts/cot-self-instruct.py` - CoT Self-Instruct synthetic data generation + filtering → push to Hub
- `scripts/finepdfs-stats.py` - Polars streaming stats over `finepdfs-edu` parquet on Hub (optional push)

### External Links
- [HF Jobs Documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs)
- [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/)
- [UV Scripts Organization](https://huggingface.co/uv-scripts)
- [HF Hub Authentication](https://huggingface.co/docs/huggingface_hub/quick-start#authentication)

## Key Takeaways

1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless user requests
2. **Jobs are asynchronous** - Don't wait/poll; let user check when ready
3. **Always set timeout** - Default 30 min may be insufficient; set appropriate timeout
4. **Always persist results** - Environment is ephemeral; without persistence, all work is lost
5. **Use tokens securely** - Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}` for Hub operations
6. **Choose appropriate hardware** - Start small, scale up based on needs
7. **Use UV scripts** - Default to `hf_jobs("uv", {...})` with inline scripts for Python workloads
8. **Handle authentication** - Verify tokens are available before Hub operations
9. **Monitor jobs** - Provide job URLs and status check commands
10. **Optimize costs** - Choose right hardware, set appropriate timeouts

index.html
CHANGED
@@ -1,19 +1,214 @@
- <!
- <html>
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>hf-jobs - Run Workloads on Hugging Face Jobs</title>
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }

        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
            line-height: 1.6;
            color: #333;
            background: #f5f5f5;
            padding: 20px;
        }

        .container {
            max-width: 1200px;
            margin: 0 auto;
            background: white;
            padding: 40px;
            border-radius: 8px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }

        h1 {
            color: #ffd21e;
            background: #000;
            padding: 20px;
            margin: -40px -40px 30px -40px;
            border-radius: 8px 8px 0 0;
        }

        h2 {
            color: #1e1e1e;
            margin-top: 30px;
            margin-bottom: 15px;
            padding-bottom: 10px;
            border-bottom: 2px solid #ffd21e;
        }

        h3 {
            color: #555;
            margin-top: 20px;
            margin-bottom: 10px;
        }

        .description {
            background: #f9f9f9;
            padding: 20px;
            border-left: 4px solid #ffd21e;
            margin-bottom: 30px;
            border-radius: 4px;
        }

        .file-list {
            list-style: none;
            padding: 0;
        }

        .file-list li {
            padding: 12px;
            margin: 8px 0;
            background: #f9f9f9;
            border-radius: 4px;
            border-left: 3px solid #ffd21e;
            transition: background 0.2s;
        }

        .file-list li:hover {
            background: #f0f0f0;
        }

        .file-list a {
            color: #0066cc;
            text-decoration: none;
            font-weight: 500;
            display: block;
        }

        .file-list a:hover {
            text-decoration: underline;
        }

        .file-path {
            color: #666;
            font-size: 0.9em;
            font-family: 'Monaco', 'Courier New', monospace;
            margin-top: 4px;
        }

        .file-description {
            color: #777;
            font-size: 0.9em;
            margin-top: 4px;
            font-style: italic;
        }

        .metadata {
            background: #f0f0f0;
            padding: 15px;
            border-radius: 4px;
            margin-bottom: 30px;
        }

        .metadata p {
            margin: 5px 0;
        }

        .metadata strong {
            color: #333;
        }

        .section {
            margin-bottom: 40px;
        }

        code {
            background: #f4f4f4;
            padding: 2px 6px;
            border-radius: 3px;
            font-family: 'Monaco', 'Courier New', monospace;
            font-size: 0.9em;
        }
    </style>
</head>
<body>
    <div class="container">
|

        <div class="description">
            <p><strong>Run any workload on Hugging Face Jobs.</strong></p>
            <p>Use this skill when you want to run GPU/CPU workloads (batch inference, synthetic data generation, dataset stats, experiments) on Hugging Face Jobs, with correct token handling and result persistence back to the Hub.</p>
        </div>

        <div class="metadata">
            <p><strong>Skill Name:</strong> hf-jobs</p>
            <p><strong>Main Documentation:</strong> <a href="hf-jobs/SKILL.md">hf-jobs/SKILL.md</a></p>
            <p><strong>Scripts Directory:</strong> <code>hf-jobs/scripts/</code></p>
            <p><strong>References Directory:</strong> <code>hf-jobs/references/</code></p>
        </div>

        <div class="section">
            <h2>Overview</h2>
            <p>This skill focuses on running real workloads via Hugging Face Jobs. It includes ready-to-run UV scripts and guides for authentication (HF tokens), secrets vs env vars, timeouts, hardware selection, and pushing results to the Hub.</p>
        </div>

        <div class="section">
            <h2>Core Documentation</h2>
            <ul class="file-list">
                <li>
                    <a href="hf-jobs/SKILL.md">SKILL.md</a>
                    <div class="file-path">hf-jobs/SKILL.md</div>
                    <div class="file-description">Complete skill documentation (how to submit jobs, tokens/secrets, timeouts, persistence, and how to use the bundled scripts)</div>
                </li>
            </ul>
        </div>

        <div class="section">
            <h2>References</h2>
            <ul class="file-list">
                <li>
                    <a href="hf-jobs/references/token_usage.md">token_usage.md</a>
                    <div class="file-path">hf-jobs/references/token_usage.md</div>
                    <div class="file-description">Token best practices: secrets vs env, permissions, common errors (401/403), and secure patterns</div>
                </li>
                <li>
                    <a href="hf-jobs/references/hub_saving.md">hub_saving.md</a>
                    <div class="file-path">hf-jobs/references/hub_saving.md</div>
                    <div class="file-description">How to persist results: push datasets/models/files to the Hub (ephemeral job filesystem)</div>
                </li>
                <li>
                    <a href="hf-jobs/references/hardware_guide.md">hardware_guide.md</a>
                    <div class="file-path">hf-jobs/references/hardware_guide.md</div>
                    <div class="file-description">Flavor selection guidance for CPU/GPU/TPU workloads</div>
                </li>
                <li>
                    <a href="hf-jobs/references/troubleshooting.md">troubleshooting.md</a>
                    <div class="file-path">hf-jobs/references/troubleshooting.md</div>
                    <div class="file-description">Common failure modes (timeouts, missing deps, OOM, auth) and fixes</div>
                </li>
            </ul>
        </div>

        <div class="section">
            <h2>Scripts</h2>
            <ul class="file-list">
                <li>
                    <a href="hf-jobs/scripts/generate-responses.py">generate-responses.py</a>
                    <div class="file-path">hf-jobs/scripts/generate-responses.py</div>
                    <div class="file-description">vLLM batch generation: load prompts/messages from a dataset, generate responses, push dataset + card to Hub</div>
                </li>
                <li>
                    <a href="hf-jobs/scripts/cot-self-instruct.py">cot-self-instruct.py</a>
                    <div class="file-path">hf-jobs/scripts/cot-self-instruct.py</div>
                    <div class="file-description">CoT Self-Instruct synthetic data generation (reasoning/instruction) + optional filtering, pushes dataset + card</div>
                </li>
                <li>
                    <a href="hf-jobs/scripts/finepdfs-stats.py">finepdfs-stats.py</a>
                    <div class="file-path">hf-jobs/scripts/finepdfs-stats.py</div>
                    <div class="file-description">Polars streaming stats over Hub parquet (finepdfs-edu); optional upload of computed stats to a dataset repo</div>
                </li>
            </ul>
        </div>
    </div>
</body>
</html>

references/hardware_guide.md
ADDED
@@ -0,0 +1,266 @@
# Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective workloads.

## Available Hardware

### CPU
- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU

**Use cases:** Data processing, testing scripts, lightweight workloads
**Not recommended for:** Model training, GPU-accelerated workloads

### GPU Options

| Flavor | GPU | Memory | Use Case | Cost/hour |
|--------|-----|--------|----------|-----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos, batch inference | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient workloads | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU workloads | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast workloads | ~$8-12 |

## Selection Guidelines

### By Workload Type

**Data Processing**
- **Recommended:** `cpu-upgrade` or `l4x1`
- **Use case:** Transform, filter, analyze datasets
- **Batch size:** Depends on data size
- **Time:** Varies by dataset size

**Batch Inference**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Run inference on thousands of samples
- **Batch size:** 8-32 depending on model
- **Time:** Depends on number of samples

**Experiments & Benchmarks**
- **Recommended:** `a10g-small` or `a10g-large`
- **Use case:** Reproducible ML experiments
- **Batch size:** Varies
- **Time:** Depends on experiment complexity

**Model Training** (see `model-trainer` skill for details)
- **Recommended:** See model-trainer skill
- **Use case:** Fine-tuning models
- **Batch size:** Depends on model size
- **Time:** Hours to days

**Synthetic Data Generation**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Generate datasets using LLMs
- **Batch size:** Depends on generation method
- **Time:** Hours for large datasets

### By Budget

**Minimal Budget (<$5 total)**
- Use `cpu-basic` or `t4-small`
- Process small datasets
- Quick tests and demos

**Small Budget ($5-20)**
- Use `t4-medium` or `a10g-small`
- Process medium datasets
- Run experiments

**Medium Budget ($20-50)**
- Use `a10g-small` or `a10g-large`
- Process large datasets
- Production workloads

**Large Budget ($50-200)**
- Use `a10g-large` or `a100-large`
- Large-scale processing
- Multiple experiments

### By Model Size (for inference/processing)

**Tiny Models (<1B parameters)**
- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 8-16

**Small Models (1-3B parameters)**
- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 4-8

**Medium Models (3-7B parameters)**
- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 2-4

**Large Models (7-13B parameters)**
- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B
- **Batch size:** 1-2

**Very Large Models (13B+ parameters)**
- **Recommended:** `a100-large`
- **Example:** Llama-3-13B, Llama-3-70B
- **Batch size:** 1

## Memory Considerations

### Estimating Memory Requirements

**For inference:**
```
Memory (GB) ≈ (Model params in billions) × 2-4
```

**For training:**
```
Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)
```

**Examples:**
- Qwen2.5-0.5B inference: ~1-2GB ✅ fits t4-small
- Qwen2.5-7B inference: ~14-28GB ✅ fits a10g-large
- Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA
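
These rules of thumb translate directly into a quick estimator (a sketch; real usage also depends on sequence length, KV cache, and optimizer state):

```python
def estimate_memory_gb(params_billion: float, mode: str = "inference") -> float:
    """Rule-of-thumb GPU memory estimate from the formulas above."""
    multiplier = {"inference": 3, "training": 20, "lora": 4}[mode]  # 3 = midpoint of the 2-4 inference range
    return params_billion * multiplier

print(estimate_memory_gb(7))              # ~21 GB: fits a10g-large (24GB)
print(estimate_memory_gb(7, "training"))  # ~140 GB: matches the "not feasible" example
```
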
### Memory Optimization

If hitting memory limits:

1. **Reduce batch size**
   ```python
   batch_size = 1
   ```

2. **Process in chunks**
   ```python
   for chunk in chunks:
       process(chunk)
   ```

3. **Use smaller models**
   - Use quantized models
   - Use LoRA adapters

4. **Upgrade hardware**
   - cpu → t4 → a10g → a100

## Cost Estimation

### Formula

```
Total Cost = (Hours of runtime) × (Cost per hour)
```

### Example Calculations

**Data processing:**
- Hardware: cpu-upgrade ($0.50/hour)
- Time: 1 hour
- Cost: $0.50

**Batch inference:**
- Hardware: a10g-large ($5/hour)
- Time: 2 hours
- Cost: $10.00

**Experiments:**
- Hardware: a10g-small ($3.50/hour)
- Time: 4 hours
- Cost: $14.00

### Cost Optimization Tips

1. **Start small:** Test on cpu-basic or t4-small
2. **Monitor runtime:** Set appropriate timeouts
3. **Optimize code:** Reduce unnecessary compute
4. **Choose right hardware:** Don't over-provision
5. **Use checkpoints:** Resume if job fails
6. **Monitor costs:** Check running jobs regularly

## Multi-GPU Workloads

Multi-GPU flavors expose multiple GPUs to the job; frameworks such as vLLM or Accelerate can distribute work across them:

**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs

**When to use:**
- Large models (>13B parameters)
- Need faster processing (near-linear speedup for parallel workloads)
- Large datasets (>100K samples)
- Parallel workloads

**Example:**
```python
hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

## Choosing Between Options

### CPU vs GPU

**Choose CPU when:**
- No GPU acceleration needed
- Data processing only
- Budget constrained
- Simple workloads

**Choose GPU when:**
- Model inference/training
- GPU-accelerated libraries
- Need faster processing
- Large models

### a10g vs a100

**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Processing time not critical

**Choose a100 when:**
- Model 13B+ parameters
- Need fastest processing
- Memory requirements high
- Budget allows

### Single vs Multi-GPU

**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**
- Model >13B parameters
- Need faster processing
- Large batch sizes required
- Cost-effective for large jobs

## Quick Reference

```python
# Workload type → Hardware selection
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "training": "see model-trainer skill"
}
```

references/hub_saving.md
ADDED
@@ -0,0 +1,339 @@
# Saving Results to Hugging Face Hub

**⚠️ CRITICAL:** Job environments are ephemeral. ALL results are lost when a job completes unless persisted to the Hub or external storage.

## Why Persistence is Required

When running on Hugging Face Jobs:
- Environment is temporary
- All files deleted on job completion
- No local disk persistence
- Cannot access results after job ends

**Without persistence, all work is permanently lost.**

## Persistence Options

### Option 1: Push to Hugging Face Hub (Recommended)

**For models:**
```python
import os

# model is an already-loaded transformers model
model.push_to_hub("username/model-name", token=os.environ.get("HF_TOKEN"))
```

**For datasets:**
```python
import os

# dataset is an already-built datasets.Dataset
dataset.push_to_hub("username/dataset-name", token=os.environ.get("HF_TOKEN"))
```

**For files/artifacts:**
```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ.get("HF_TOKEN"))
api.upload_file(
    path_or_fileobj="results.json",
    path_in_repo="results.json",
    repo_id="username/results",
    repo_type="dataset"
)
```
### Option 2: External Storage

**S3:**
```python
import boto3

s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')
```

**Google Cloud Storage:**
```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('results.json')
blob.upload_from_filename('results.json')
```

### Option 3: API Endpoint

```python
import requests

# results: any JSON-serializable object produced by your job
requests.post("https://your-api.com/results", json=results)
```
## Required Configuration for Hub Push

### Job Configuration

**Always include HF_TOKEN:**
```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Required for Hub operations
})
```

### Script Configuration

**Verify token exists:**
```python
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN required for Hub operations!"
```

**Use token for Hub operations:**
```python
import os
from huggingface_hub import HfApi

# Auto-detects HF_TOKEN from environment
api = HfApi()

# Or explicitly pass token
api = HfApi(token=os.environ.get("HF_TOKEN"))
```
## Complete Examples

### Example 1: Push Dataset

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["datasets", "huggingface-hub"]
# ///

import os
from datasets import Dataset

# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Process data
data = {"text": ["Sample 1", "Sample 2"]}
dataset = Dataset.from_dict(data)

# Push to Hub
dataset.push_to_hub("username/my-dataset")
print("✅ Dataset pushed!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

### Example 2: Push Model

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///

import os
from transformers import AutoModel, AutoTokenizer

# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Load and process model ("base-model" is a placeholder repo id)
model = AutoModel.from_pretrained("base-model")
tokenizer = AutoTokenizer.from_pretrained("base-model")
# ... process model ...

# Push to Hub
model.push_to_hub("username/my-model")
tokenizer.push_to_hub("username/my-model")
print("✅ Model pushed!")
""",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

### Example 3: Push Artifacts

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["huggingface-hub", "pandas"]
# ///

import os
import json
import pandas as pd
from huggingface_hub import HfApi

# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Generate results
results = {"accuracy": 0.95, "loss": 0.05}
df = pd.DataFrame([results])

# Save files
with open("results.json", "w") as f:
    json.dump(results, f)
df.to_csv("results.csv", index=False)

# Push to Hub (upload_file takes keyword-only arguments)
api = HfApi()
api.upload_file(path_or_fileobj="results.json", path_in_repo="results.json",
                repo_id="username/results", repo_type="dataset")
api.upload_file(path_or_fileobj="results.csv", path_in_repo="results.csv",
                repo_id="username/results", repo_type="dataset")
print("✅ Results pushed!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
## Authentication Methods

### Method 1: Automatic Token (Recommended)

```python
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
```

Uses your logged-in Hugging Face token automatically.

### Method 2: Explicit Token

```python
"secrets": {"HF_TOKEN": "hf_abc123..."}
```

Provides the token explicitly (not recommended, for security reasons).

### Method 3: Environment Variable

```python
"env": {"HF_TOKEN": "hf_abc123..."}
```

Passes the token as a regular environment variable (less secure than secrets).

**Always prefer Method 1** for security and convenience.

## Verification Checklist

Before submitting any job that saves to the Hub, verify:

- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
- [ ] Hub push code included in script
- [ ] Repository name doesn't conflict with existing repos
- [ ] You have write access to the target namespace
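The checklist can be folded into a short pre-flight block at the top of a job script. A minimal sketch, assuming a dataset target; `username/results` is a placeholder repo id:

```python
import os

from huggingface_hub import HfApi

# 1. Token must be injected via secrets={"HF_TOKEN": "$HF_TOKEN"}
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

api = HfApi()  # auto-detects HF_TOKEN

# 2. Confirm the token authenticates, and under which account
user = api.whoami()
print(f"Authenticated as: {user['name']}")

# 3. Check whether the target repo is reachable (it may not exist yet,
#    in which case push_to_hub / upload_file will create it)
repo_id = "username/results"  # placeholder target
try:
    api.repo_info(repo_id, repo_type="dataset")
    print(f"Will push to existing repo: {repo_id}")
except Exception:
    print(f"Repo {repo_id} not found yet; it will be created on first push")
```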
## Repository Setup

### Automatic Creation

If the repository doesn't exist, it's created automatically on the first push (if the token has write permissions).

### Manual Creation

Create the repository before pushing:

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="username/repo-name",
    repo_type="model",  # or "dataset"
    private=False,  # or True for a private repo
)
```

### Repository Naming

**Valid names:**
- `username/my-model`
- `username/model-name`
- `organization/model-name`

**Invalid names:**
- `model-name` (missing username)
- `username/model name` (spaces not allowed)
- `username/MODEL` (uppercase discouraged)

## Troubleshooting

### Error: 401 Unauthorized

**Cause:** HF_TOKEN not provided or invalid

**Solutions:**
1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
2. Check you're logged in: `hf_whoami()`
3. Re-login: `hf auth login`

### Error: 403 Forbidden

**Cause:** No write access to repository

**Solutions:**
1. Check repository namespace matches your username
2. Verify you're a member of the organization (if using an org namespace)
3. Check the token has write permissions

### Error: Repository not found

**Cause:** Repository doesn't exist and auto-creation failed

**Solutions:**
1. Manually create the repository first
2. Check the repository name format
3. Verify the namespace exists

### Error: Push failed

**Cause:** Network issues or Hub unavailable

**Solutions:**
1. Check logs for the specific error
2. Verify the token is valid
3. Retry the push operation (see the retry sketch below)
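A simple retry-with-backoff wrapper for flaky pushes; this is a sketch, not part of the huggingface_hub API:

```python
import time

def push_with_retry(push_fn, attempts: int = 3, base_delay: float = 5.0):
    """Call push_fn(), retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            push_fn()
            print("✅ Push successful")
            return
        except Exception as e:  # network errors, transient Hub errors
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Push failed ({e}); retrying in {delay:.0f}s...")
            time.sleep(delay)

# Usage (dataset is an already-built datasets.Dataset):
# push_with_retry(lambda: dataset.push_to_hub("username/my-dataset"))
```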
## Best Practices

1. **Always verify token exists** before Hub operations
2. **Use descriptive repo names** (e.g., `my-experiment-results`, not `results`)
3. **Push incrementally** for large results (use checkpoints)
4. **Verify push success** in logs before the job completes
5. **Use appropriate repo types** (model vs dataset)
6. **Add a README** with result descriptions
7. **Tag repos** with relevant tags

## Monitoring Push Progress

Check the logs for push progress:

```python
hf_jobs("logs", {"job_id": "your-job-id"})
```

**Look for:**
```
Pushing to username/repo-name...
Upload file results.json: 100%
✅ Push successful
```

## Key Takeaway

**Without `secrets={"HF_TOKEN": "$HF_TOKEN"}` and persistence code, all results are permanently lost.**

Always verify both are configured before submitting any job that produces results.
references/token_usage.md
ADDED
@@ -0,0 +1,546 @@
# Token Usage Guide for Hugging Face Jobs

**⚠️ CRITICAL:** Proper token usage is essential for any job that interacts with the Hugging Face Hub.

## Overview

Hugging Face tokens are authentication credentials that allow your jobs to interact with the Hub. They're required for:
- Pushing models/datasets to Hub
- Accessing private repositories
- Creating new repositories
- Using Hub APIs programmatically
- Any authenticated Hub operations

## Token Types

### Read Token
- **Permissions:** Download models/datasets, read private repos
- **Use case:** Jobs that only need to download/read content
- **Creation:** https://huggingface.co/settings/tokens

### Write Token
- **Permissions:** Push models/datasets, create repos, modify content
- **Use case:** Jobs that need to upload results (most common)
- **Creation:** https://huggingface.co/settings/tokens
- **⚠️ Required for:** Pushing models, datasets, or any uploads

### Organization Token
- **Permissions:** Act on behalf of an organization
- **Use case:** Jobs running under an organization namespace
- **Creation:** Organization settings → Tokens
## Providing Tokens to Jobs

### Method 1: Automatic Token (Recommended) ⭐

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement
})
```

**How it works:**
1. `$HF_TOKEN` is a placeholder that gets replaced with your actual token
2. Uses the token from your logged-in session (`hf auth login`)
3. Token is encrypted server-side when passed as a secret
4. Most secure and convenient method

**Benefits:**
- ✅ No token exposure in code
- ✅ Uses your current login session
- ✅ Automatically updated if you re-login
- ✅ Works seamlessly with MCP tools
- ✅ Token encrypted server-side

**Requirements:**
- Must be logged in: `hf auth login` or `hf_whoami()` works
- Token must have the required permissions

### Method 2: Explicit Token (Not Recommended)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Hardcoded token
})
```

**When to use:**
- Only if the automatic token doesn't work
- Testing with a specific token
- Organization tokens (use with caution)

**Security concerns:**
- ❌ Token visible in code/logs
- ❌ Must manually update if the token rotates
- ❌ Risk of token exposure
- ❌ Not recommended for production

### Method 3: Environment Variable (Less Secure)

```python
hf_jobs("uv", {
    "script": "your_script.py",
    "env": {"HF_TOKEN": "hf_abc123..."}  # ⚠️ Less secure than secrets
})
```

**Difference from secrets:**
- `env` variables are visible in job logs
- `secrets` are encrypted server-side
- Always prefer `secrets` for tokens

**When to use:**
- Only for non-sensitive configuration
- Never use for tokens (use `secrets` instead)
## Using Tokens in Scripts

### Accessing Tokens

Tokens passed via `secrets` are available as environment variables in your script:

```python
import os

# Get token from environment
token = os.environ.get("HF_TOKEN")

# Verify token exists
if not token:
    raise ValueError("HF_TOKEN not found in environment!")
```

### Using with Hugging Face Hub

**Option 1: Explicit token parameter**
```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ.get("HF_TOKEN"))
api.upload_file(...)
```

**Option 2: Auto-detection (Recommended)**
```python
from huggingface_hub import HfApi

# Automatically uses HF_TOKEN env var
api = HfApi()  # ✅ Simpler, uses token from environment
api.upload_file(...)
```

**Option 3: With transformers/datasets**
```python
from transformers import AutoModel
from datasets import load_dataset

# Auto-detects HF_TOKEN from environment
model = AutoModel.from_pretrained("username/model")
dataset = load_dataset("username/dataset")

# For push operations, token is auto-detected
model.push_to_hub("username/new-model")
dataset.push_to_hub("username/new-dataset")
```

### Complete Example

```python
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///

import os
from datasets import Dataset

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required for Hub operations!"

# Create and push dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)

# Push to Hub (token auto-detected)
dataset.push_to_hub("username/my-dataset")

print("✅ Dataset pushed successfully!")
```
## Token Verification

### Check Authentication Locally

```python
from huggingface_hub import whoami

try:
    user_info = whoami()
    print(f"✅ Logged in as: {user_info['name']}")
except Exception as e:
    print(f"❌ Not authenticated: {e}")
```

### Verify Token in Job

```python
import os
from huggingface_hub import whoami

# Check token exists
if "HF_TOKEN" not in os.environ:
    raise ValueError("HF_TOKEN not found in environment!")

token = os.environ["HF_TOKEN"]

# Verify token format (user access tokens start with "hf_")
if not token.startswith("hf_"):
    raise ValueError(f"Invalid token format: {token[:10]}...")

# Test the token works
try:
    user_info = whoami(token=token)
    print(f"✅ Token valid for user: {user_info['name']}")
except Exception as e:
    raise ValueError(f"Token validation failed: {e}")
```
## Common Token Issues

### Error: 401 Unauthorized

**Symptoms:**
```
401 Client Error: Unauthorized for url: https://huggingface.co/api/...
```

**Causes:**
1. Token missing from job
2. Token invalid or expired
3. Token not passed correctly

**Solutions:**
1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
2. Verify `hf_whoami()` works locally
3. Re-login: `hf auth login`
4. Check token hasn't expired

**Verification:**
```python
# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
```

### Error: 403 Forbidden

**Symptoms:**
```
403 Client Error: Forbidden for url: https://huggingface.co/api/...
```

**Causes:**
1. Token lacks required permissions (read-only token used for write)
2. No access to private repository
3. Organization permissions insufficient

**Solutions:**
1. Ensure token has write permissions
2. Check token type at https://huggingface.co/settings/tokens
3. Verify access to target repository
4. Use organization token if needed

**Check token permissions:**
```python
from huggingface_hub import whoami

user_info = whoami()
print(f"User: {user_info['name']}")
print(f"Type: {user_info.get('type', 'user')}")
```

### Error: Token not found in environment

**Symptoms:**
```
KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found
```

**Causes:**
1. `secrets` not passed in job config
2. Wrong key name (should be `HF_TOKEN`)
3. Using `env` instead of `secrets`

**Solutions:**
1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
2. Verify key name is exactly `HF_TOKEN`
3. Check job config syntax

**Correct configuration:**
```python
# ✅ Correct
hf_jobs("uv", {
    "script": "...",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})

# ❌ Wrong - using env instead of secrets
hf_jobs("uv", {
    "script": "...",
    "env": {"HF_TOKEN": "$HF_TOKEN"}  # Less secure
})

# ❌ Wrong - wrong key name
hf_jobs("uv", {
    "script": "...",
    "secrets": {"TOKEN": "$HF_TOKEN"}  # Wrong key
})
```

### Error: Repository access denied

**Symptoms:**
```
403 Client Error: Forbidden
Repository not found or access denied
```

**Causes:**
1. Token doesn't have access to private repo
2. Repository doesn't exist and can't be created
3. Wrong namespace

**Solutions:**
1. Use a token from an account with access
2. Verify repo visibility (public vs private)
3. Check namespace matches token owner
4. Create repo first if needed

**Check repository access:**
```python
from huggingface_hub import HfApi

api = HfApi()
try:
    repo_info = api.repo_info("username/repo-name")
    print(f"✅ Access granted: {repo_info.id}")
except Exception as e:
    print(f"❌ Access denied: {e}")
```
## Token Security Best Practices

### 1. Never Commit Tokens

**❌ Bad:**
```python
# Never do this!
token = "hf_abc123xyz..."
api = HfApi(token=token)
```

**✅ Good:**
```python
# Use environment variable
token = os.environ.get("HF_TOKEN")
api = HfApi(token=token)
```

### 2. Use Secrets, Not Environment Variables

**❌ Bad:**
```python
hf_jobs("uv", {
    "script": "...",
    "env": {"HF_TOKEN": "$HF_TOKEN"}  # Visible in logs
})
```

**✅ Good:**
```python
hf_jobs("uv", {
    "script": "...",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Encrypted server-side
})
```

### 3. Use Automatic Token Replacement

**❌ Bad:**
```python
hf_jobs("uv", {
    "script": "...",
    "secrets": {"HF_TOKEN": "hf_abc123..."}  # Hardcoded
})
```

**✅ Good:**
```python
hf_jobs("uv", {
    "script": "...",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Automatic
})
```

### 4. Rotate Tokens Regularly

- Generate new tokens periodically
- Revoke old tokens
- Update job configurations
- Monitor token usage

### 5. Use Minimal Permissions

- Create tokens with only the needed permissions
- Use read tokens when write isn't needed
- Don't use admin tokens for regular jobs

### 6. Don't Share Tokens

- Each user should use their own token
- Don't commit tokens to repositories
- Don't share tokens in logs or messages

### 7. Monitor Token Usage

- Check token activity in Hub settings
- Review job logs for token issues
- Set up alerts for unauthorized access
## Token Workflow Examples

### Example 1: Push Model to Hub

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///

import os
from transformers import AutoModel

# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Load and process model ("base-model" is a placeholder repo id)
model = AutoModel.from_pretrained("base-model")
# ... process model ...

# Push to Hub (token auto-detected)
model.push_to_hub("username/my-model")
print("✅ Model pushed!")
""",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Token provided
})
```

### Example 2: Access Private Dataset

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["datasets"]
# ///

import os
from datasets import load_dataset

# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Load private dataset (token auto-detected)
dataset = load_dataset("private-org/private-dataset")
print(f"✅ Loaded {len(dataset)} examples")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Token provided
})
```

### Example 3: Create and Push Dataset

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["datasets", "huggingface-hub"]
# ///

import os
from datasets import Dataset

# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Create dataset
data = {"text": ["Sample 1", "Sample 2"]}
dataset = Dataset.from_dict(data)

# Push to Hub (token auto-detected)
dataset.push_to_hub("username/my-dataset")
print("✅ Dataset pushed!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Token provided
})
```

## Quick Reference

### Token Checklist

Before submitting a job that uses the Hub:

- [ ] Job includes `secrets={"HF_TOKEN": "$HF_TOKEN"}`
- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
- [ ] Token has required permissions (read/write)
- [ ] User is logged in: `hf_whoami()` works
- [ ] Token not hardcoded in script
- [ ] Using `secrets`, not `env`, for the token

### Common Patterns

**Pattern 1: Auto-detect token**
```python
from huggingface_hub import HfApi
api = HfApi()  # Uses HF_TOKEN from environment
```

**Pattern 2: Explicit token**
```python
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ.get("HF_TOKEN"))
```

**Pattern 3: Verify token**
```python
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
```

## Key Takeaways

1. **Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}`** for Hub operations
2. **Never hardcode tokens** in scripts or job configs
3. **Verify the token exists** in the script before Hub operations
4. **Use auto-detection** when possible (`HfApi()` without a token parameter)
5. **Check permissions** - ensure the token has the required access
6. **Monitor token usage** - review activity regularly
7. **Rotate tokens** - generate new tokens periodically
references/troubleshooting.md
ADDED
@@ -0,0 +1,431 @@
# Troubleshooting Guide

Common issues and solutions for Hugging Face Jobs.

## Authentication Issues

### Error: 401 Unauthorized

**Symptoms:**
```
401 Client Error: Unauthorized for url: https://huggingface.co/api/...
```

**Causes:**
- Token missing from job
- Token invalid or expired
- Token not passed correctly

**Solutions:**
1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
2. Verify `hf_whoami()` works locally
3. Re-login: `hf auth login`
4. Check token hasn't expired

**Verification:**
```python
# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
```

### Error: 403 Forbidden

**Symptoms:**
```
403 Client Error: Forbidden for url: https://huggingface.co/api/...
```

**Causes:**
- Token lacks required permissions
- No access to private repository
- Organization permissions insufficient

**Solutions:**
1. Ensure token has write permissions
2. Check token type at https://huggingface.co/settings/tokens
3. Verify access to target repository
4. Use organization token if needed

### Error: Token not found in environment

**Symptoms:**
```
KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found
```

**Causes:**
- `secrets` not passed in job config
- Wrong key name (should be `HF_TOKEN`)
- Using `env` instead of `secrets`

**Solutions:**
1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
2. Verify key name is exactly `HF_TOKEN`
3. Check job config syntax
## Job Execution Issues

### Error: Job Timeout

**Symptoms:**
- Job stops unexpectedly
- Status shows "TIMEOUT"
- Partial results only

**Causes:**
- Default 30min timeout exceeded
- Job takes longer than expected
- No timeout specified

**Solutions:**
1. Check logs for actual runtime
2. Increase timeout with buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks
5. Add a 20-30% buffer to the estimated time

**Example:**
```python
hf_jobs("uv", {
    "script": "...",
    "timeout": "2h"  # Set appropriate timeout
})
```

### Error: Out of Memory (OOM)

**Symptoms:**
```
RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array
```

**Causes:**
- Batch size too large
- Model too large for hardware
- Insufficient GPU memory

**Solutions:**
1. Reduce batch size
2. Process data in smaller chunks
3. Upgrade hardware: cpu → t4 → a10g → a100
4. Use smaller models or quantization
5. Enable gradient checkpointing (for training)

**Example:**
```python
# Reduce batch size
batch_size = 1

# Process data in smaller chunks instead of all at once
# (data: your full input sequence)
chunk_size = 1000
for start in range(0, len(data), chunk_size):
    process(data[start:start + chunk_size])
```

### Error: Missing Dependencies

**Symptoms:**
```
ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'
```

**Causes:**
- Package not in dependencies
- Wrong package name
- Version mismatch

**Solutions:**
1. Add to PEP 723 header:
```python
# /// script
# dependencies = ["package-name>=1.0.0"]
# ///
```
2. Check package name spelling
3. Specify version if needed
4. Check package availability

### Error: Script Not Found

**Symptoms:**
```
FileNotFoundError: script.py not found
```

**Causes:**
- Local file path used (not supported)
- URL incorrect
- Script not accessible

**Solutions:**
1. Use inline script (recommended)
2. Use publicly accessible URL
3. Upload script to Hub first
4. Check URL is correct

**Correct approaches:**
```python
# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})

# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
```
## Hub Push Issues

### Error: Push Failed

**Symptoms:**
```
Error pushing to Hub
Upload failed
```

**Causes:**
- Network issues
- Token missing or invalid
- Repository access denied
- File too large

**Solutions:**
1. Check token: `assert "HF_TOKEN" in os.environ`
2. Verify repository exists or can be created
3. Check network connectivity in logs
4. Retry push operation
5. Split large files into chunks

### Error: Repository Not Found

**Symptoms:**
```
404 Client Error: Not Found
Repository not found
```

**Causes:**
- Repository doesn't exist
- Wrong repository name
- No access to private repo

**Solutions:**
1. Create repository first:
```python
from huggingface_hub import HfApi
api = HfApi()
api.create_repo("username/repo-name", repo_type="dataset")
```
2. Check repository name format
3. Verify namespace exists
4. Check repository visibility

### Error: Results Not Saved

**Symptoms:**
- Job completes successfully
- No results visible on Hub
- Files not persisted

**Causes:**
- No persistence code in script
- Push code not executed
- Push failed silently

**Solutions:**
1. Add persistence code to script
2. Verify push executes successfully
3. Check logs for push errors
4. Add error handling around push

**Example:**
```python
try:
    dataset.push_to_hub("username/dataset")
    print("✅ Push successful")
except Exception as e:
    print(f"❌ Push failed: {e}")
    raise
```
## Hardware Issues

### Error: GPU Not Available

**Symptoms:**
```
CUDA not available
No GPU found
```

**Causes:**
- CPU flavor used instead of GPU
- GPU not requested
- CUDA not installed in image

**Solutions:**
1. Use a GPU flavor: `"flavor": "a10g-large"`
2. Check the image has CUDA support
3. Verify GPU availability in logs

### Error: Slow Performance

**Symptoms:**
- Job takes longer than expected
- Low GPU utilization
- CPU bottleneck

**Causes:**
- Wrong hardware selected
- Inefficient code
- Data loading bottleneck

**Solutions:**
1. Upgrade hardware
2. Optimize code
3. Use batch processing
4. Profile code to find bottlenecks (see the sketch below)
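A minimal timing sketch for solution 4; `load_batch` and `run_model` are stand-ins for your own pipeline stages:

```python
import time

def timed(label, fn, *args):
    """Run fn(*args) and print how long it took."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Time each stage separately to see whether data loading or compute
# dominates (uncomment with your own functions):
# batch = timed("load_batch", load_batch, 0)
# preds = timed("run_model", run_model, batch)
```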
## General Issues

### Error: Job Status Unknown

**Symptoms:**
- Can't check job status
- Status API returns error

**Solutions:**
1. Use the job URL: `https://huggingface.co/jobs/username/job-id`
2. Check logs: `hf_jobs("logs", {"job_id": "..."})`
3. Inspect the job: `hf_jobs("inspect", {"job_id": "..."})`

### Error: Logs Not Available

**Symptoms:**
- No logs visible
- Logs delayed

**Causes:**
- Job just started (logs delayed 30-60s)
- Job failed before logging
- Logs not yet generated

**Solutions:**
1. Wait 30-60 seconds after job start (see the sketch below)
2. Check job status first
3. Use the job URL for the web interface
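A small wait-then-fetch sketch for solution 1, using the same `hf_jobs` calls as above; the job id is a placeholder:

```python
import time

# Logs usually start streaming 30-60s after submission,
# so wait before the first fetch
time.sleep(60)
hf_jobs("inspect", {"job_id": "your-job-id"})  # confirm the job is running
hf_jobs("logs", {"job_id": "your-job-id"})     # then fetch the logs
```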
### Error: Cost Unexpectedly High

**Symptoms:**
- Job costs more than expected
- Longer runtime than estimated

**Causes:**
- Job ran much longer than estimated (up to the timeout)
- Wrong hardware selected
- Inefficient code

**Solutions:**
1. Monitor job runtime
2. Set an appropriate timeout
3. Optimize code
4. Choose the right hardware
5. Check cost estimates before running

## Debugging Tips

### 1. Add Logging

```python
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Starting processing...")
logger.info(f"Processed {count} items")
```

### 2. Verify Environment

```python
import os
import sys
import torch

print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
```

### 3. Test Locally First

Run the script locally before submitting to catch errors early:
```bash
uv run script.py  # resolves the PEP 723 dependencies automatically
```

### 4. Check Job Logs

```python
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

# Or use the job URL
# https://huggingface.co/jobs/username/job-id
```

### 5. Add Error Handling

```python
try:
    # Your code
    process_data()
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()
    raise
```

## Quick Reference

### Common Error Codes

| Code | Meaning | Solution |
|------|---------|----------|
| 401 | Unauthorized | Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` |
| 403 | Forbidden | Check token permissions |
| 404 | Not Found | Verify repository exists |
| 500 | Server Error | Retry or contact support |

### Checklist Before Submitting

- [ ] Token configured: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
- [ ] Timeout set appropriately
- [ ] Hardware selected correctly
- [ ] Dependencies listed in PEP 723 header
- [ ] Persistence code included
- [ ] Error handling added
- [ ] Logging added for debugging

## Getting Help

If issues persist:

1. **Check logs** - Most errors include detailed messages
2. **Review documentation** - See main SKILL.md
3. **Check Hub status** - https://status.huggingface.co
4. **Community forums** - https://discuss.huggingface.co
5. **GitHub issues** - For bugs in huggingface_hub

## Key Takeaways

1. **Always include the token** - `secrets={"HF_TOKEN": "$HF_TOKEN"}`
2. **Set an appropriate timeout** - The default 30min may be insufficient
3. **Verify persistence** - Results won't persist without push code
4. **Check logs** - Most issues are visible in job logs
5. **Test locally** - Catch errors before submitting
6. **Add error handling** - Better debugging information
7. **Monitor costs** - Set timeouts to avoid unexpected charges
scripts/cot-self-instruct.py
ADDED
@@ -0,0 +1,718 @@
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "datasets",
#     "transformers",
#     "vllm>=0.6.5",
#     "huggingface-hub[hf_transfer]",
#     "torch",
#     "numpy",
#     "tqdm",
#     "scikit-learn",
# ]
# ///
"""
Generate high-quality synthetic data using Chain-of-Thought Self-Instruct methodology.

This script implements the CoT-Self-Instruct approach from the paper "CoT-Self-Instruct:
Building high-quality synthetic prompts for reasoning and non-reasoning tasks" (2025).

It supports two modes:
1. Reasoning tasks: Generates both questions and answers with Chain-of-Thought
2. Instruction tasks: Generates diverse prompts for general instruction following

Example usage:
    # Reasoning tasks with Answer-Consistency filtering
    uv run cot-self-instruct.py \\
        --seed-dataset davanstrien/s1k-reasoning \\
        --output-dataset username/synthetic-math \\
        --task-type reasoning \\
        --num-samples 5000 \\
        --filter-method answer-consistency

    # Instruction tasks with RIP filtering
    uv run cot-self-instruct.py \\
        --seed-dataset wildchat-filtered \\
        --output-dataset username/synthetic-prompts \\
        --task-type instruction \\
        --filter-method rip \\
        --reward-model Nexusflow/Athene-RM-8B

    # HF Jobs execution
    hf jobs uv run --flavor l4x4 \\
        --image vllm/vllm-openai \\
        -e HF_TOKEN=$(python3 -c "from huggingface_hub import get_token; print(get_token())") \\
        https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
        [args...]
"""

import argparse
import logging
import os
import random
import re
import sys
from collections import Counter
from datetime import datetime
from typing import Dict, List, Optional, Tuple

import numpy as np
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import DatasetCard, login
from sklearn.cluster import KMeans
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Enable HF Transfer for faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Prompt templates from the paper
REASONING_PROMPT_TEMPLATE = """You are a reasoning question generator assistant. Your goal is to create a novel, and challenging reasoning question. You are provided the following seed questions:
Seed Question 1: {seed1}
Seed Question 2: {seed2}
Your task is to:
1. Write a brand-new, self-contained reasoning question that meets the following requirements:
(a) The question draws inspiration from the seed question without copying it verbatim, remaining novel and of comparable difficulty.
(b) The question's final answer should be a single, unambiguous scalar value (e.g., an integer, reduced fraction, exact radical), or another answer type that can be verified in one step (e.g., 'yes/no,' a choice from A to D).
2. Then reason step by step, solve the new question and format your output as follows:
[New Question Begin]{{your_generated_question}}[New Question End]
[Final Answer to New Question Begin]\\boxed{{your_final_answer}}[Final Answer to New Question End]"""

INSTRUCTION_PROMPT_TEMPLATE = """You are a prompt generator assistant. Your goal is to create diverse and creative synthetic prompts.
Please follow the steps below to create synthetic prompts.
Step 1: Carefully read #Prompt 1# and #Prompt 2#. Identify and list all the common elements between these two prompts. If no common elements are found, list the main elements from each prompt.
Step 2: Develop a comprehensive plan based on the #Common Elements List# or #Main Elements List# from Step 1. This plan will guide the generation of new synthetic prompts that are similar to the original prompts.
Step 3: Execute the plan step by step and provide one #Synthetic Prompt#.
Please reply strictly in the following format:
- Step 1 #Common Elements List# or #Main Elements List#:
- Step 2 #Plan#:
- Step 3 #Synthetic Prompt#:
#Prompt 1#:
{prompt1}
#Prompt 2#:
{prompt2}"""


def check_gpu_availability() -> int:
    """Check if CUDA is available and return the number of GPUs."""
    if not torch.cuda.is_available():
        logger.error("CUDA is not available. This script requires a GPU.")
        logger.error(
            "Please run on a machine with NVIDIA GPU or use HF Jobs with GPU flavor."
        )
        sys.exit(1)

    num_gpus = torch.cuda.device_count()
    for i in range(num_gpus):
        gpu_name = torch.cuda.get_device_name(i)
        gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
        logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.1f} GB memory")

    return num_gpus


def parse_thinking_output(text: str) -> str:
    """Remove thinking tokens from model output."""
    # Remove <think>...</think> blocks
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    return text.strip()


def extract_reasoning_output(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Extract question and answer from reasoning task output."""
    text = parse_thinking_output(text)

    # Extract question
    question_match = re.search(r'\[New Question Begin\](.*?)\[New Question End\]', text, re.DOTALL)
    if not question_match:
        return None, None
    question = question_match.group(1).strip()

    # Extract answer
    answer_match = re.search(r'\[Final Answer to New Question Begin\]\\?boxed\{(.*?)\}\[Final Answer to New Question End\]', text, re.DOTALL)
    if not answer_match:
        # Try without \boxed
        answer_match = re.search(r'\[Final Answer to New Question Begin\](.*?)\[Final Answer to New Question End\]', text, re.DOTALL)

    if not answer_match:
        return question, None

    answer = answer_match.group(1).strip()
    return question, answer


def extract_instruction_output(text: str) -> Optional[str]:
    """Extract synthetic prompt from instruction task output."""
    text = parse_thinking_output(text)

    # Look for the synthetic prompt after "Step 3 #Synthetic Prompt#:"
    match = re.search(r'Step 3 #Synthetic Prompt#:\s*(.+)', text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None


def categorize_prompts(prompts: List[str], num_categories: int = 8) -> Dict[int, List[int]]:
    """Categorize prompts using clustering for instruction tasks."""
    from transformers import AutoModel

    logger.info(f"Categorizing {len(prompts)} prompts into {num_categories} categories...")

    # Use a small model for embeddings
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    # Get embeddings
    embeddings = []
    for prompt in tqdm(prompts, desc="Computing embeddings"):
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        embedding = outputs.last_hidden_state.mean(dim=1).numpy()
        embeddings.append(embedding[0])

    # Cluster
    kmeans = KMeans(n_clusters=num_categories, random_state=42)
    labels = kmeans.fit_predict(embeddings)

    # Group by category
    categories = {}
    for idx, label in enumerate(labels):
        if label not in categories:
            categories[label] = []
        categories[label].append(idx)

    return categories


def generate_synthetic_data(
    llm: LLM,
    seed_data: List[Dict],
    task_type: str,
    num_samples: int,
    categories: Optional[Dict[int, List[int]]] = None,
) -> List[Dict]:
    """Generate synthetic data using CoT-Self-Instruct."""
    synthetic_data = []

    # Set up progress bar
    pbar = tqdm(total=num_samples, desc="Generating synthetic data")

    while len(synthetic_data) < num_samples:
        # Sample seed data
        if task_type == "reasoning":
            # Random sampling for reasoning tasks
            seeds = random.sample(seed_data, min(2, len(seed_data)))
            prompt = REASONING_PROMPT_TEMPLATE.format(
                seed1=seeds[0].get("question", seeds[0].get("prompt", "")),
                seed2=seeds[1].get("question", seeds[1].get("prompt", "")) if len(seeds) > 1 else seeds[0].get("question", seeds[0].get("prompt", ""))
            )
        else:
            # Category-aware sampling for instruction tasks
            if categories:
                # Pick a random category
                category = random.choice(list(categories.keys()))
                category_indices = categories[category]
                indices = random.sample(category_indices, min(2, len(category_indices)))
                seeds = [seed_data[i] for i in indices]
            else:
                seeds = random.sample(seed_data, min(2, len(seed_data)))

            prompt = INSTRUCTION_PROMPT_TEMPLATE.format(
                prompt1=seeds[0].get("prompt", seeds[0].get("question", "")),
                prompt2=seeds[1].get("prompt", seeds[1].get("question", "")) if len(seeds) > 1 else seeds[0].get("prompt", seeds[0].get("question", ""))
            )

        # Generate
        sampling_params = SamplingParams(
            temperature=0.7 if task_type == "reasoning" else 0.8,
            top_p=0.95 if task_type == "reasoning" else 0.9,
            max_tokens=2048,
        )

        outputs = llm.generate([prompt], sampling_params)
        output_text = outputs[0].outputs[0].text

        # Parse output
        if task_type == "reasoning":
            question, answer = extract_reasoning_output(output_text)
            if question and answer:
                synthetic_data.append({
                    "question": question,
                    "answer": answer,
                    "seed_indices": [seed_data.index(s) for s in seeds],
                })
                pbar.update(1)
        else:
            synthetic_prompt = extract_instruction_output(output_text)
            if synthetic_prompt:
                synthetic_data.append({
                    "prompt": synthetic_prompt,
                    "seed_indices": [seed_data.index(s) for s in seeds],
                })
                pbar.update(1)

    pbar.close()
    return synthetic_data


def answer_consistency_filter(
    llm: LLM,
    synthetic_data: List[Dict],
    k_responses: int = 16,
    threshold: float = 0.5,
) -> List[Dict]:
    """Filter reasoning tasks using Answer-Consistency."""
    logger.info(f"Applying Answer-Consistency filter with K={k_responses}")

    filtered_data = []

    for item in tqdm(synthetic_data, desc="Answer-Consistency filtering"):
        question = item["question"]
        original_answer = item["answer"]

        # Generate K responses
        prompts = [question] * k_responses
        sampling_params = SamplingParams(
            temperature=0.6,
            top_p=0.95,
            max_tokens=1024,
        )

        outputs = llm.generate(prompts, sampling_params)

        # Extract answers
        answers = []
        for output in outputs:
            text = output.outputs[0].text
            # Try to extract boxed answer
            match = re.search(r'\\boxed\{(.*?)\}', text)
            if match:
                answers.append(match.group(1).strip())

        if not answers:
            continue

        # Get majority answer
        answer_counts = Counter(answers)
        if answer_counts:
            majority_answer, count = answer_counts.most_common(1)[0]

            # Check if majority answer matches original and meets threshold
            if (majority_answer == original_answer and
                    count / len(answers) >= threshold):
                item["consistency_score"] = count / len(answers)
                filtered_data.append(item)

    logger.info(f"Answer-Consistency: kept {len(filtered_data)}/{len(synthetic_data)} examples")
    return filtered_data


def rip_filter(
    llm: LLM,
    synthetic_data: List[Dict],
    reward_model_id: str,
    k_responses: int = 32,
    threshold: float = 0.5,
) -> List[Dict]:
    """Filter using Rejecting Instruction Preferences (RIP)."""
    logger.info(f"Applying RIP filter with K={k_responses} and reward model {reward_model_id}")

    # Note: In a full implementation, you would load and use the actual reward model
    # For this example, we'll use a placeholder scoring mechanism
    logger.warning("RIP filtering requires a reward model implementation - using placeholder")

    filtered_data = []

    for item in tqdm(synthetic_data, desc="RIP filtering"):
        prompt = item.get("prompt", item.get("question", ""))

        # Generate K responses
        prompts = [prompt] * k_responses
        sampling_params = SamplingParams(
            temperature=1.0,
            top_p=1.0,
            max_tokens=1024,
        )

        outputs = llm.generate(prompts, sampling_params)

        # In real implementation: score each response with reward model
        # For now, use length as a proxy (longer responses often score higher)
        scores = [len(output.outputs[0].text) for output in outputs]

        # Use minimum score as quality indicator
        min_score = min(scores) if scores else 0
        normalized_score = min_score / 1000  # Normalize to 0-1 range

        if normalized_score >= threshold:
            item["rip_score"] = normalized_score
            filtered_data.append(item)

    logger.info(f"RIP filter: kept {len(filtered_data)}/{len(synthetic_data)} examples")
    return filtered_data


def create_dataset_card(
    task_type: str,
    source_dataset: str,
    generation_model: str,
    filter_method: str,
    num_generated: int,
    num_filtered: int,
    generation_time: str,
    additional_info: Optional[Dict] = None,
) -> str:
    """Create a comprehensive dataset card."""
    filter_info = ""
    if filter_method == "answer-consistency":
        filter_info = """
### Answer-Consistency Filtering

This dataset was filtered using Answer-Consistency:
- Generated K responses for each synthetic question
- Kept only examples where majority answer matched the generated answer
- Ensures high-quality, correctly solved problems"""
    elif filter_method == "rip":
        filter_info = """
### RIP (Rejecting Instruction Preferences) Filtering

This dataset was filtered using RIP:
- Generated K responses for each synthetic prompt
- Scored responses using a reward model
- Kept only prompts with high minimum scores"""

    return f"""---
tags:
- synthetic-data
- cot-self-instruct
- {task_type}
- uv-script
---

# CoT-Self-Instruct Synthetic Data

This dataset contains synthetic {task_type} data generated using the Chain-of-Thought Self-Instruct methodology.

## Generation Details

- **Source Dataset**: [{source_dataset}](https://huggingface.co/datasets/{source_dataset})
- **Generation Model**: [{generation_model}](https://huggingface.co/{generation_model})
- **Task Type**: {task_type}
- **Filter Method**: {filter_method}
- **Generated Examples**: {num_generated:,}
- **After Filtering**: {num_filtered:,} ({(num_filtered/num_generated)*100:.1f}% acceptance rate)
- **Generation Date**: {generation_time}
{filter_info}

## Methodology

Generated using CoT-Self-Instruct, which:
1. Uses Chain-of-Thought reasoning to analyze seed examples
2. Generates new synthetic examples of similar quality and complexity
3. Applies quality filtering to ensure high-quality outputs

Based on the paper: "CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks" (2025)

## Generation Script

Generated using the CoT-Self-Instruct script from [uv-scripts/synthetic-data](https://huggingface.co/datasets/uv-scripts/synthetic-data).

To reproduce:
```bash
uv run https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
    --seed-dataset {source_dataset} \\
    --output-dataset <your-dataset> \\
    --task-type {task_type} \\
    --generation-model {generation_model} \\
    --filter-method {filter_method}
```
"""


def main():
    parser = argparse.ArgumentParser(
        description="Generate synthetic data using CoT-Self-Instruct",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    # Dataset arguments
    parser.add_argument(
        "--seed-dataset",
        type=str,
        required=True,
        help="HuggingFace dataset ID containing seed examples",
    )
    parser.add_argument(
        "--output-dataset",
        type=str,
        required=True,
        help="HuggingFace dataset ID for output",
    )

    # Task configuration
    parser.add_argument(
        "--task-type",
        type=str,
        choices=["reasoning", "instruction", "auto"],
        default="auto",
        help="Type of task (reasoning generates Q&A, instruction generates prompts)",
    )
    parser.add_argument(
        "--task-column",
        type=str,
        default=None,
        help="Column name containing tasks (auto-detected if not specified)",
    )

    # Model configuration
    parser.add_argument(
        "--generation-model",
        type=str,
        default="Qwen/Qwen3-30B-A3B-Thinking-2507",
        help="Model for synthetic data generation",
    )
    parser.add_argument(
        "--filter-model",
        type=str,
        default=None,
        help="Model for filtering (defaults to generation model)",
    )
    parser.add_argument(
        "--reward-model",
        type=str,
        default="Nexusflow/Athene-RM-8B",
        help="Reward model for RIP filtering",
    )

    # Generation parameters
    parser.add_argument(
        "--num-samples",
        type=int,
        default=5000,
        help="Number of synthetic examples to generate",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=1,
        help="Batch size for generation",
    )

    # Filtering parameters
    parser.add_argument(
        "--filter-method",
        type=str,
        choices=["answer-consistency", "rip", "both", "none"],
        default="answer-consistency",
        help="Quality filtering method",
    )
    parser.add_argument(
        "--k-responses",
        type=int,
        default=16,
        help="Number of responses for filtering",
    )
    parser.add_argument(
        "--quality-threshold",
        type=float,
        default=0.5,
        help="Minimum quality threshold for filtering",
    )

    # GPU configuration
    parser.add_argument(
        "--tensor-parallel-size",
        type=int,
        default=None,
        help="Number of GPUs for tensor parallelism (auto-detected if not set)",
    )
    parser.add_argument(
        "--gpu-memory-utilization",
        type=float,
        default=0.9,
        help="GPU memory utilization",
    )

    # Other arguments
    parser.add_argument(
        "--hf-token",
        type=str,
        default=None,
        help="HuggingFace API token",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Random seed",
    )

    args = parser.parse_args()

    # Set random seeds
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)

    # Check GPU
    num_gpus = check_gpu_availability()
    tensor_parallel_size = args.tensor_parallel_size or num_gpus

    # Authentication
    hf_token = args.hf_token or os.environ.get("HF_TOKEN")
    if hf_token:
        login(token=hf_token)

    # Load seed dataset
    logger.info(f"Loading seed dataset: {args.seed_dataset}")
    seed_dataset = load_dataset(args.seed_dataset, split="train")

    # Auto-detect task type and column if needed
    if args.task_type == "auto":
        columns = seed_dataset.column_names
        if "question" in columns and "answer" in columns:
            args.task_type = "reasoning"
            logger.info("Auto-detected task type: reasoning")
        else:
            args.task_type = "instruction"
            logger.info("Auto-detected task type: instruction")

    if not args.task_column:
        if args.task_type == "reasoning":
            args.task_column = "question"
        else:
            # Try to find prompt column
            for col in ["prompt", "instruction", "text", "input"]:
                if col in seed_dataset.column_names:
                    args.task_column = col
                    break

    logger.info(f"Using task column: {args.task_column}")

    # Convert to list of dicts
    seed_data = seed_dataset.to_list()

    # Categorize prompts for instruction tasks
    categories = None
    if args.task_type == "instruction" and len(seed_data) > 100:
        prompts = [item.get(args.task_column, "") for item in seed_data]
        categories = categorize_prompts(prompts)

    # Initialize generation model
    logger.info(f"Loading generation model: {args.generation_model}")
    generation_llm = LLM(
        model=args.generation_model,
        tensor_parallel_size=tensor_parallel_size,
        gpu_memory_utilization=args.gpu_memory_utilization,
    )

    # Generate synthetic data
    start_time = datetime.now()
    synthetic_data = generate_synthetic_data(
        generation_llm,
        seed_data,
        args.task_type,
        args.num_samples,
        categories,
    )

    # Apply filtering
    filter_llm = generation_llm
    if args.filter_model and args.filter_model != args.generation_model:
        logger.info(f"Loading filter model: {args.filter_model}")
        # Clean up generation model
        del generation_llm
        torch.cuda.empty_cache()

        filter_llm = LLM(
            model=args.filter_model,
            tensor_parallel_size=tensor_parallel_size,
            gpu_memory_utilization=args.gpu_memory_utilization,
        )

    filtered_data = synthetic_data
    if args.filter_method != "none":
        if args.filter_method == "answer-consistency" and args.task_type == "reasoning":
            filtered_data = answer_consistency_filter(
                filter_llm,
                synthetic_data,
                args.k_responses,
                args.quality_threshold,
            )
        elif args.filter_method == "rip":
            filtered_data = rip_filter(
                filter_llm,
                synthetic_data,
                args.reward_model,
                args.k_responses,
                args.quality_threshold,
            )
        elif args.filter_method == "both":
            if args.task_type == "reasoning":
                filtered_data = answer_consistency_filter(
                    filter_llm,
                    synthetic_data,
                    args.k_responses,
                    args.quality_threshold,
                )
            filtered_data = rip_filter(
                filter_llm,
                filtered_data,
                args.reward_model,
                args.k_responses,
                args.quality_threshold,
            )

    # Create HuggingFace dataset
    logger.info(f"Creating dataset with {len(filtered_data)} examples")
    dataset = Dataset.from_list(filtered_data)

    # Create dataset card
    generation_time = start_time.strftime("%Y-%m-%d %H:%M:%S UTC")
    dataset_card = create_dataset_card(
        args.task_type,
        args.seed_dataset,
        args.generation_model,
        args.filter_method,
        len(synthetic_data),
        len(filtered_data),
        generation_time,
    )

    # Push to hub
    logger.info(f"Pushing dataset to: {args.output_dataset}")
    card = DatasetCard(dataset_card)
    dataset.push_to_hub(args.output_dataset)
    # Push card separately
    card.push_to_hub(args.output_dataset)

    logger.info("Done! Dataset available at: https://huggingface.co/datasets/" + args.output_dataset)

    # Print example HF Jobs command if running locally
    if len(sys.argv) > 1:
        print("\nTo run on HF Jobs:")
        print(f"""hf jobs uv run --flavor l4x4 \\
    --image vllm/vllm-openai \\
    -e HF_TOKEN=$(python3 -c "from huggingface_hub import get_token; print(get_token())") \\
    https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
    --seed-dataset {args.seed_dataset} \\
    --output-dataset {args.output_dataset} \\
    --task-type {args.task_type} \\
    --generation-model {args.generation_model} \\
    --filter-method {args.filter_method} \\
    --num-samples {args.num_samples}""")


if __name__ == "__main__":
    main()
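The Answer-Consistency filter in the script above reduces to a majority vote over the K sampled solutions. As a rough illustration, the core check can be pulled out into a standalone, vLLM-free function; the function name and the candidate answers below are hypothetical, not part of the uploaded script:

```python
# Sketch only: the same majority-vote rule as answer_consistency_filter,
# isolated so it can be tested without a GPU. Names here are illustrative.
from collections import Counter


def passes_consistency(original: str, candidates: list[str], threshold: float = 0.5) -> bool:
    """True if the most common candidate equals the original answer and
    wins at least `threshold` of the votes."""
    if not candidates:
        return False
    majority, count = Counter(candidates).most_common(1)[0]
    return majority == original and count / len(candidates) >= threshold


assert passes_consistency("42", ["42", "42", "41", "42"])      # 3/4 agree with "42"
assert not passes_consistency("42", ["41", "41", "42", "40"])  # majority says "41"
```

Keeping the vote separate from generation makes the threshold behaviour easy to verify before spending GPU hours on a full K=16 pass.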
scripts/finepdfs-stats.py
ADDED
@@ -0,0 +1,546 @@
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "polars>=1.31.0",
#     "huggingface-hub",
#     "datasets",
#     "ascii-graph",
# ]
# ///
"""
Analyze educational quality trends across CommonCrawl dumps using Polars streaming.

Answers: "Is the web getting more educational over time?"

Demonstrates Polars HF Hub integration - process 50M+ docs without downloading 300GB+.

Example usage:
    # Analyze English PDFs (default)
    uv run finepdfs-stats.py

    # Analyze all 70+ languages
    uv run finepdfs-stats.py --all-languages

    # Quick test
    uv run finepdfs-stats.py --limit 10000 --show-plan

    # Save results to HF Hub
    uv run finepdfs-stats.py --output-repo username/finepdfs-temporal-stats

    # Run on HF Jobs
    hf jobs uv run \\
        -s HF_TOKEN \\
        -e HF_XET_HIGH_PERFORMANCE=1 \\
        https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\
        -- --output-repo username/stats
"""

import argparse
import logging
import os
import sys
import time
from pathlib import Path

import polars as pl
from ascii_graph import Pyasciigraph
from datasets import Dataset
from huggingface_hub import HfApi, create_repo, list_repo_tree, login

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Common language+script codes for finepdfs-edu
COMMON_LANGUAGES = {
    "eng_Latn": "English (Latin script)",
    "fra_Latn": "French (Latin script)",
    "deu_Latn": "German (Latin script)",
    "spa_Latn": "Spanish (Latin script)",
    "por_Latn": "Portuguese (Latin script)",
    "ita_Latn": "Italian (Latin script)",
    "nld_Latn": "Dutch (Latin script)",
    "pol_Latn": "Polish (Latin script)",
    "rus_Cyrl": "Russian (Cyrillic script)",
    "zho_Hans": "Chinese (Simplified)",
    "zho_Hant": "Chinese (Traditional)",
    "jpn_Jpan": "Japanese",
    "kor_Hang": "Korean",
    "ara_Arab": "Arabic",
    "hin_Deva": "Hindi (Devanagari)",
}


def list_available_languages(dataset_id: str) -> list[str]:
    """List available language subsets in the dataset."""
    try:
        tree = list_repo_tree(dataset_id, path_in_repo="data", repo_type="dataset")
        languages = [
            item.path.replace("data/", "")
            for item in tree
            if item.path.startswith("data/")
            and "/" not in item.path.replace("data/", "")
        ]
        return sorted(languages)
    except Exception as e:
        logger.warning(f"Could not list languages: {e}")
        return list(COMMON_LANGUAGES.keys())


def compute_temporal_stats(df: pl.LazyFrame, output_path: Path) -> pl.DataFrame:
    """Single scan: compute stats grouped by dump for temporal analysis."""
    query = df.group_by("dump").agg(
        pl.len().alias("doc_count"),
        pl.col("token_count").sum().alias("total_tokens"),
        pl.col("fw_edu_scores").list.mean().mean().alias("avg_edu_score"),
        (pl.col("fw_edu_scores").list.mean() >= 3).sum().alias("high_edu_count"),
    )
    query.sink_parquet(output_path, engine="streaming")
    return pl.read_parquet(output_path)


def compute_global_stats(temporal: pl.DataFrame) -> pl.DataFrame:
    """Compute global stats from temporal breakdown."""
    total = temporal["doc_count"].sum()
    return pl.DataFrame(
        {
            "total_docs": [total],
            "total_tokens": [temporal["total_tokens"].sum()],
            "avg_edu_score": [
                (temporal["avg_edu_score"] * temporal["doc_count"]).sum() / total
            ],
            "high_edu_rate": [temporal["high_edu_count"].sum() / total],
            "num_dumps": [len(temporal)],
        }
    )


def format_temporal_stats(temporal: pl.DataFrame) -> pl.DataFrame:
    """Format temporal stats with high_edu_rate, sorted chronologically."""
    return (
        temporal.with_columns(
            (pl.col("high_edu_count") / pl.col("doc_count")).alias("high_edu_rate")
        )
        .select(["dump", "doc_count", "avg_edu_score", "high_edu_rate"])
        .sort(
            "dump"
        )  # Chronological order (CC-MAIN-2017-xx comes before CC-MAIN-2024-xx)
    )


def create_ascii_charts(temporal_stats: pl.DataFrame) -> str:
    """Create ASCII bar charts showing temporal trends."""
    # Extract year from dump name (CC-MAIN-2024-42 -> 2024)
    # Group by year and average the values for cleaner display
    yearly = (
        temporal_stats.with_columns(
            pl.col("dump").str.extract(r"CC-MAIN-(\d{4})", 1).alias("year")
        )
        .group_by("year")
        .agg(
            pl.col("doc_count").sum(),
            pl.col("avg_edu_score").mean(),
            pl.col("high_edu_rate").mean(),
        )
        .sort("year")
    )

    lines = []

    # High edu rate chart (more dramatic differences)
    data_rate = [
        (row["year"], row["high_edu_rate"] * 100)
        for row in yearly.iter_rows(named=True)
    ]
    graph = Pyasciigraph(line_length=60, float_format="{0:.1f}%")
    lines.extend(graph.graph("High Educational Content (edu >= 3)", data_rate))

    lines.append("")

    # Avg edu score chart
    data_score = [
        (row["year"], row["avg_edu_score"]) for row in yearly.iter_rows(named=True)
    ]
    graph2 = Pyasciigraph(line_length=60, float_format="{0:.2f}")
    lines.extend(graph2.graph("Average Educational Score", data_score))

    return "\n".join(lines)


def create_readme(
    args,
    global_stats: pl.DataFrame,
    temporal_stats: pl.DataFrame,
    scan_time: float,
    ascii_charts: str,
) -> str:
    """Create README content for the stats dataset."""
    stats = global_stats.to_dicts()[0]
    total_docs = stats.get("total_docs", 0)
    docs_per_sec = total_docs / scan_time if scan_time > 0 else 0

    # Get first and last year averages for trend (more representative than single dumps)
    yearly = (
        temporal_stats.with_columns(
            pl.col("dump").str.extract(r"CC-MAIN-(\d{4})", 1).alias("year")
        )
        .group_by("year")
        .agg(
            pl.col("doc_count").sum(),
            pl.col("avg_edu_score").mean(),
            pl.col("high_edu_rate").mean(),
        )
        .sort("year")
    )
    first_year = yearly.head(1).to_dicts()[0]
    last_year = yearly.tail(1).to_dicts()[0]

    scope = (
        "all languages"
        if args.all_languages
        else COMMON_LANGUAGES.get(args.lang, args.lang)
    )

    return f"""---
tags:
- uv-script
- statistics
- polars
- finepdfs-edu
- temporal-analysis
license: odc-by
configs:
- config_name: global_stats
  data_files: global_stats/train-*.parquet
- config_name: temporal_stats
  data_files: temporal_stats/train-*.parquet
default_viewer_config: temporal_stats
---

# Is the Web Getting More Educational?

Temporal analysis of educational quality in **{scope}** across {stats.get("num_dumps", 0)} CommonCrawl dumps.

## Trend

```
{ascii_charts}
```

## Key Finding

| Year | Avg Edu Score | High Edu Rate |
|------|---------------|---------------|
| {first_year["year"]} | {first_year["avg_edu_score"]:.2f} | {first_year["high_edu_rate"] * 100:.1f}% |
| {last_year["year"]} | {last_year["avg_edu_score"]:.2f} | {last_year["high_edu_rate"] * 100:.1f}% |

## Performance

- **{total_docs:,} documents** processed in **{scan_time:.0f} seconds**
- **{docs_per_sec:,.0f} docs/sec** using Polars streaming
- Single scan, no full dataset download required

## Summary

| Metric | Value |
|--------|-------|
| Scope | {scope} |
| Total Documents | {total_docs:,} |
| Total Tokens | {stats.get("total_tokens", 0):,} |
| Avg Edu Score | {stats.get("avg_edu_score", 0):.3f} |
| High Edu Rate | {stats.get("high_edu_rate", 0) * 100:.1f}% |
| CommonCrawl Dumps | {stats.get("num_dumps", 0)} |

## Files

- `global_stats` - Overall summary
- `temporal_stats` - Per-dump breakdown (sorted chronologically)

## Reproduce

```bash
uv run https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\
    {"--all-languages" if args.all_languages else f"--lang {args.lang}"} --output-repo your-username/stats
```

## Source

- **Dataset**: [{args.source_dataset}](https://huggingface.co/datasets/{args.source_dataset})
- **Script**: [uv-scripts/dataset-stats](https://huggingface.co/datasets/uv-scripts/dataset-stats)
"""


def main():
    parser = argparse.ArgumentParser(
        description="Analyze educational quality trends across CommonCrawl dumps",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    parser.add_argument(
        "--source-dataset",
        type=str,
        default="HuggingFaceFW/finepdfs-edu",
        help="Source dataset (default: HuggingFaceFW/finepdfs-edu)",
    )

    parser.add_argument(
        "--lang",
        type=str,
        default="eng_Latn",
        help="Language+script code (default: eng_Latn)",
    )

    parser.add_argument(
        "--all-languages",
        action="store_true",
        help="Analyze all languages (70+) instead of single language",
    )

    parser.add_argument(
        "--show-plan",
        action="store_true",
        help="Show Polars query plan (demonstrates optimization)",
    )

    parser.add_argument(
        "--list-languages",
        action="store_true",
        help="List available languages and exit",
    )

    parser.add_argument(
        "--limit",
        type=int,
        help="Limit to first N rows (for testing)",
    )

    parser.add_argument(
        "--output-repo",
        type=str,
        help="HuggingFace dataset repository to upload results",
    )

    parser.add_argument(
        "--output-dir",
        type=str,
        default="./stats_output",
        help="Local directory for output files",
    )

    parser.add_argument(
        "--hf-token",
        type=str,
        help="HuggingFace API token (or set HF_TOKEN env var)",
    )

    parser.add_argument(
        "--private",
        action="store_true",
        help="Make the output dataset private",
    )

    args = parser.parse_args()

    # Check for high-performance mode
    if os.environ.get("HF_XET_HIGH_PERFORMANCE"):
        logger.info("High-performance mode enabled (HF_XET_HIGH_PERFORMANCE=1)")

    # List languages mode
    if args.list_languages:
        print(f"Available language+script codes for {args.source_dataset}:\n")
        print("Common languages:")
        for code, name in COMMON_LANGUAGES.items():
            print(f"  {code:12} - {name}")
        print("\nFetching full list from HF Hub...")
        all_langs = list_available_languages(args.source_dataset)
        print(f"\nAll available ({len(all_langs)} total):")
        for lang in all_langs[:30]:  # Show first 30
            name = COMMON_LANGUAGES.get(lang, "")
            print(f"  {lang:12} {name}")
        if len(all_langs) > 30:
            print(f"  ... and {len(all_langs) - 30} more")
        sys.exit(0)

    # Build the parquet path
    if args.all_languages:
        source_path = f"hf://datasets/{args.source_dataset}/data/*/train/*.parquet"
        scope_desc = "all languages"
    else:
        source_path = (
            f"hf://datasets/{args.source_dataset}/data/{args.lang}/train/*.parquet"
        )
        scope_desc = f"{args.lang} ({COMMON_LANGUAGES.get(args.lang, 'unknown')})"

    logger.info(f"Scanning: {source_path}")
    logger.info(f"Scope: {scope_desc}")

    # Create lazy frame - this doesn't load any data yet!
    logger.info("Creating lazy query plan...")
    df = pl.scan_parquet(source_path)

    # Apply limit if specified
    if args.limit:
        logger.info(f"Limiting to first {args.limit:,} rows")
        df = df.head(args.limit)

    # Show query plan if requested
    if args.show_plan:
        # Build a sample query to show the plan
        sample_query = df.select(
            pl.len(),
            pl.col("token_count").sum(),
            pl.col("language").n_unique(),
        )
        print("\nQuery Plan (showing Polars optimization):")
        print("=" * 60)
        print(sample_query.explain())
        print("=" * 60)
        print("\nNote: Polars uses projection pushdown - only reads columns needed!")
        print("The 'text' column is never loaded, making this very fast.\n")

    # Create output directory
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Single scan: compute temporal stats
    logger.info("Computing temporal stats (single scan)...")
    start = time.perf_counter()
    temporal_path = output_dir / "temporal_stats.parquet"
    temporal_raw = compute_temporal_stats(df, temporal_path)
    scan_time = time.perf_counter() - start
    logger.info(f"Scan complete in {scan_time:.2f}s - {len(temporal_raw)} dumps")

    # Compute stats
    global_stats = compute_global_stats(temporal_raw)
    temporal_stats = format_temporal_stats(temporal_raw)

    # Save
    global_stats.write_parquet(output_dir / "global_stats.parquet")
    temporal_stats.write_parquet(output_dir / "temporal_stats.parquet")

    # Print results
    total_docs = global_stats["total_docs"][0]
    docs_per_sec = total_docs / scan_time if scan_time > 0 else 0

    print("\n" + "=" * 70)
    print("IS THE WEB GETTING MORE EDUCATIONAL?")
    print("=" * 70)

    print(f"\nScope: {scope_desc}")
    print(f"Dataset: {args.source_dataset}")

    print("\n" + "-" * 70)
    print("GLOBAL STATS")
    print("-" * 70)
    print(global_stats)

    print("\n" + "-" * 70)
    print(f"TEMPORAL TREND ({len(temporal_stats)} CommonCrawl dumps)")
    print("-" * 70)
    # Show first 5 and last 5
    if len(temporal_stats) > 10:
        print("Earliest dumps:")
        print(temporal_stats.head(5))
        print("\n...")
        print("\nLatest dumps:")
        print(temporal_stats.tail(5))
    else:
        print(temporal_stats)

    # Create ASCII charts
    ascii_charts = create_ascii_charts(temporal_stats)
    print("\n" + "-" * 70)
    print("TREND VISUALIZATION")
    print("-" * 70)
    print(ascii_charts)

    print("\n" + "-" * 70)
    print("PERFORMANCE")
    print("-" * 70)
    print(f"Scan time: {scan_time:.2f}s")
    print(f"Documents: {total_docs:,}")
    print(f"Throughput: {docs_per_sec:,.0f} docs/sec")

    logger.info(f"Results saved to: {output_dir}")

    # Upload to HF Hub if requested
    if args.output_repo:
        hf_token = args.hf_token or os.environ.get("HF_TOKEN")
        if hf_token:
            login(token=hf_token)

        api = HfApi(token=hf_token)

        logger.info(f"Creating/updating dataset repository: {args.output_repo}")
        create_repo(
            args.output_repo,
            repo_type="dataset",
            private=args.private,
            token=hf_token,
            exist_ok=True,
        )

        # Upload each as a dataset config
        configs = [
            ("global_stats", global_stats),
            ("temporal_stats", temporal_stats),
        ]

        for config_name, stats_df in configs:
            logger.info(f"Uploading {config_name}...")
            ds = Dataset.from_polars(stats_df)
            ds.push_to_hub(
                args.output_repo,
                config_name=config_name,
                token=hf_token,
                private=args.private,
            )
            time.sleep(1)  # Avoid 409 conflicts

        # Upload README
        readme_content = create_readme(
            args, global_stats, temporal_stats, scan_time, ascii_charts
        )
        api.upload_file(
            path_or_fileobj=readme_content.encode(),
            path_in_repo="README.md",
            repo_id=args.output_repo,
            repo_type="dataset",
            token=hf_token,
        )

        dataset_url = f"https://huggingface.co/datasets/{args.output_repo}"
        logger.info(f"Dataset uploaded: {dataset_url}")
        print(f"\nResults uploaded to: {dataset_url}")


if __name__ == "__main__":
    if len(sys.argv) == 1:
        print("Is the Web Getting More Educational?")
        print("=" * 40)
        print("\nAnalyze educational quality trends across CommonCrawl dumps")
        print("using Polars streaming - no download needed!\n")
        print("Example commands:\n")
        print("# Quick test:")
        print("uv run finepdfs-stats.py --limit 10000\n")
        print("# Analyze English PDFs:")
        print("uv run finepdfs-stats.py\n")
        print("# Analyze ALL 70+ languages:")
        print("uv run finepdfs-stats.py --all-languages\n")
        print("# Show query plan (see Polars optimization):")
        print("uv run finepdfs-stats.py --show-plan --limit 1000\n")
        print("# Save results to HF Hub:")
        print("uv run finepdfs-stats.py --output-repo username/temporal-stats\n")
        print("# Run on HF Jobs:")
        print("hf jobs uv run \\")
        print("  -s HF_TOKEN \\")
        print("  -e HF_XET_HIGH_PERFORMANCE=1 \\")
        print(
            "  https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\"
        )
        print("  -- --output-repo username/stats")
        sys.exit(0)

    main()
|
scripts/generate-responses.py
ADDED
|
@@ -0,0 +1,587 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "datasets",
#     "flashinfer-python",
#     "huggingface-hub[hf_transfer]",
#     "hf-xet>=1.1.7",
#     "torch",
#     "transformers",
#     "vllm>=0.8.5",
# ]
#
# ///
"""
Generate responses for prompts in a dataset using vLLM for efficient GPU inference.

This script loads a dataset from Hugging Face Hub containing chat-formatted messages,
applies the model's chat template, generates responses using vLLM, and saves the
results back to the Hub with a comprehensive dataset card.

Example usage:
    # Local execution with auto GPU detection
    uv run generate-responses.py \\
        username/input-dataset \\
        username/output-dataset \\
        --messages-column messages

    # With custom model and sampling parameters
    uv run generate-responses.py \\
        username/input-dataset \\
        username/output-dataset \\
        --model-id meta-llama/Llama-3.1-8B-Instruct \\
        --temperature 0.9 \\
        --top-p 0.95 \\
        --max-tokens 2048

    # HF Jobs execution (see script output for full command)
    hf jobs uv run --flavor a100x4 ...
"""

import argparse
import logging
import os
import sys
from datetime import datetime
from typing import Optional

from datasets import load_dataset
from huggingface_hub import DatasetCard, get_token, login
from torch import cuda
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Enable HF Transfer for faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
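# Note: this variable has to be set before any Hub download is triggered; the
# fast-transfer path relies on the hf_transfer backend pulled in above via the
# huggingface-hub[hf_transfer] extra.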

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def check_gpu_availability() -> int:
    """Check if CUDA is available and return the number of GPUs."""
    if not cuda.is_available():
        logger.error("CUDA is not available. This script requires a GPU.")
        logger.error(
            "Please run on a machine with NVIDIA GPU or use HF Jobs with GPU flavor."
        )
        sys.exit(1)

    num_gpus = cuda.device_count()
    for i in range(num_gpus):
        gpu_name = cuda.get_device_name(i)
        gpu_memory = cuda.get_device_properties(i).total_memory / 1024**3
        logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.1f} GB memory")

    return num_gpus


def create_dataset_card(
    source_dataset: str,
    model_id: str,
    messages_column: str,
    prompt_column: Optional[str],
    sampling_params: SamplingParams,
    tensor_parallel_size: int,
    num_examples: int,
    generation_time: str,
    num_skipped: int = 0,
    max_model_len_used: Optional[int] = None,
) -> str:
    """Create a comprehensive dataset card documenting the generation process."""
    filtering_section = ""
    if num_skipped > 0:
        skip_percentage = (num_skipped / num_examples) * 100
        processed = num_examples - num_skipped
        filtering_section = f"""

### Filtering Statistics

- **Total Examples**: {num_examples:,}
- **Processed**: {processed:,} ({100 - skip_percentage:.1f}%)
- **Skipped (too long)**: {num_skipped:,} ({skip_percentage:.1f}%)
- **Max Model Length Used**: {max_model_len_used:,} tokens

Note: Prompts exceeding the maximum model length were skipped and have empty responses."""

    return f"""---
tags:
- generated
- vllm
- uv-script
---

# Generated Responses Dataset

This dataset contains generated responses for prompts from [{source_dataset}](https://huggingface.co/datasets/{source_dataset}).

## Generation Details

- **Source Dataset**: [{source_dataset}](https://huggingface.co/datasets/{source_dataset})
- **Input Column**: `{prompt_column if prompt_column else messages_column}` ({"plain text prompts" if prompt_column else "chat messages"})
- **Model**: [{model_id}](https://huggingface.co/{model_id})
- **Number of Examples**: {num_examples:,}
- **Generation Date**: {generation_time}{filtering_section}

### Sampling Parameters

- **Temperature**: {sampling_params.temperature}
- **Top P**: {sampling_params.top_p}
- **Top K**: {sampling_params.top_k}
- **Min P**: {sampling_params.min_p}
- **Max Tokens**: {sampling_params.max_tokens}
- **Repetition Penalty**: {sampling_params.repetition_penalty}

### Hardware Configuration

- **Tensor Parallel Size**: {tensor_parallel_size}
- **GPU Configuration**: {tensor_parallel_size} GPU(s)

## Dataset Structure

The dataset contains all columns from the source dataset plus:
- `response`: The generated response from the model

## Generation Script

Generated using the vLLM inference script from [uv-scripts/vllm](https://huggingface.co/datasets/uv-scripts/vllm).

To reproduce this generation:

```bash
uv run https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
    {source_dataset} \\
    <output-dataset> \\
    --model-id {model_id} \\
    {"--prompt-column " + prompt_column if prompt_column else "--messages-column " + messages_column} \\
    --temperature {sampling_params.temperature} \\
    --top-p {sampling_params.top_p} \\
    --top-k {sampling_params.top_k} \\
    --max-tokens {sampling_params.max_tokens}{f" \\\\\\n    --max-model-len {max_model_len_used}" if max_model_len_used else ""}
```
"""


def main(
    src_dataset_hub_id: str,
    output_dataset_hub_id: str,
    model_id: str = "Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages_column: str = "messages",
    prompt_column: Optional[str] = None,
    output_column: str = "response",
    temperature: float = 0.7,
    top_p: float = 0.8,
    top_k: int = 20,
    min_p: float = 0.0,
    max_tokens: int = 16384,
    repetition_penalty: float = 1.0,
    gpu_memory_utilization: float = 0.90,
    max_model_len: Optional[int] = None,
    tensor_parallel_size: Optional[int] = None,
    skip_long_prompts: bool = True,
    max_samples: Optional[int] = None,
    hf_token: Optional[str] = None,
):
    """
    Main generation pipeline.

    Args:
        src_dataset_hub_id: Input dataset on Hugging Face Hub
        output_dataset_hub_id: Where to save results on Hugging Face Hub
        model_id: Hugging Face model ID for generation
        messages_column: Column name containing chat messages
        prompt_column: Column name containing plain text prompts (alternative to messages_column)
        output_column: Column name for generated responses
        temperature: Sampling temperature
        top_p: Top-p sampling parameter
        top_k: Top-k sampling parameter
        min_p: Minimum probability threshold
        max_tokens: Maximum tokens to generate
        repetition_penalty: Repetition penalty parameter
        gpu_memory_utilization: GPU memory utilization factor
        max_model_len: Maximum model context length (None uses model default)
        tensor_parallel_size: Number of GPUs to use (auto-detect if None)
        skip_long_prompts: Skip prompts exceeding max_model_len instead of failing
        max_samples: Maximum number of samples to process (None for all)
        hf_token: Hugging Face authentication token
    """
    generation_start_time = datetime.now().isoformat()

    # GPU check and configuration
    num_gpus = check_gpu_availability()
    if tensor_parallel_size is None:
        tensor_parallel_size = num_gpus
        logger.info(
            f"Auto-detected {num_gpus} GPU(s), using tensor_parallel_size={tensor_parallel_size}"
        )
    else:
        logger.info(f"Using specified tensor_parallel_size={tensor_parallel_size}")
        if tensor_parallel_size > num_gpus:
            logger.warning(
                f"Requested {tensor_parallel_size} GPUs but only {num_gpus} available"
            )

    # Authentication - try multiple methods
    HF_TOKEN = hf_token or os.environ.get("HF_TOKEN") or get_token()
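    # Precedence note: an explicit --hf-token wins, then the HF_TOKEN env var,
    # then get_token(), which returns the token cached locally by a previous
    # `huggingface-cli login`.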

    if not HF_TOKEN:
        logger.error("No HuggingFace token found. Please provide token via:")
        logger.error(" 1. --hf-token argument")
        logger.error(" 2. HF_TOKEN environment variable")
        logger.error(" 3. Run 'huggingface-cli login' or use login() in Python")
        sys.exit(1)

    logger.info("HuggingFace token found, authenticating...")
    login(token=HF_TOKEN)

    # Initialize vLLM
    logger.info(f"Loading model: {model_id}")
    vllm_kwargs = {
        "model": model_id,
        "tensor_parallel_size": tensor_parallel_size,
        "gpu_memory_utilization": gpu_memory_utilization,
    }
    if max_model_len is not None:
        vllm_kwargs["max_model_len"] = max_model_len
        logger.info(f"Using max_model_len={max_model_len}")

    llm = LLM(**vllm_kwargs)

    # Load tokenizer for chat template
    logger.info("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
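    # For context: apply_chat_template(..., tokenize=False,
    # add_generation_prompt=True) renders a message list into the single prompt
    # string the model was trained on, ending with the assistant header so the
    # model continues as the assistant. A rough sketch of the shape for
    # ChatML-style models (the exact markup is template-specific):
    #   [{"role": "user", "content": "Hi"}]
    #   -> "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"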

    # Create sampling parameters
    sampling_params = SamplingParams(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        min_p=min_p,
        max_tokens=max_tokens,
        repetition_penalty=repetition_penalty,
    )
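    # The defaults (temperature=0.7, top_p=0.8, top_k=20, min_p=0.0) follow the
    # sampling settings recommended on the Qwen3 instruct model cards for the
    # default model above; consider adjusting them when swapping in a
    # different model family.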

    # Load dataset
    logger.info(f"Loading dataset: {src_dataset_hub_id}")
    dataset = load_dataset(src_dataset_hub_id, split="train")

    # Apply max_samples if specified
    if max_samples is not None and max_samples < len(dataset):
        logger.info(f"Limiting dataset to {max_samples} samples")
        dataset = dataset.select(range(max_samples))

    total_examples = len(dataset)
    logger.info(f"Dataset loaded with {total_examples:,} examples")

    # Determine which column to use and validate
    if prompt_column:
        # Use prompt column mode
        if prompt_column not in dataset.column_names:
            logger.error(
                f"Column '{prompt_column}' not found. Available columns: {dataset.column_names}"
            )
            sys.exit(1)
        logger.info(f"Using prompt column mode with column: '{prompt_column}'")
        use_messages = False
    else:
        # Use messages column mode
        if messages_column not in dataset.column_names:
            logger.error(
                f"Column '{messages_column}' not found. Available columns: {dataset.column_names}"
            )
            sys.exit(1)
        logger.info(f"Using messages column mode with column: '{messages_column}'")
        use_messages = True

    # Get effective max length for filtering
    if max_model_len is not None:
        effective_max_len = max_model_len
    else:
        # Get model's default max length
        effective_max_len = llm.llm_engine.model_config.max_model_len
    logger.info(f"Using effective max model length: {effective_max_len}")
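    # Caveat: llm.llm_engine.model_config reaches into vLLM internals rather
    # than a stable public API, so this lookup may need updating across vLLM
    # releases.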

    # Process messages and apply chat template
    logger.info("Preparing prompts...")
    all_prompts = []
    valid_prompts = []
    valid_indices = []
    skipped_info = []

    for i, example in enumerate(tqdm(dataset, desc="Processing prompts")):
        if use_messages:
            # Messages mode: use existing chat messages
            messages = example[messages_column]
            # Apply chat template
            prompt = tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
        else:
            # Prompt mode: convert plain text to messages format
            user_prompt = example[prompt_column]
            messages = [{"role": "user", "content": user_prompt}]
            # Apply chat template
            prompt = tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )

        all_prompts.append(prompt)

        # Count tokens if filtering is enabled
        if skip_long_prompts:
            tokens = tokenizer.encode(prompt)
            if len(tokens) <= effective_max_len:
                valid_prompts.append(prompt)
                valid_indices.append(i)
            else:
                skipped_info.append((i, len(tokens)))
        else:
            valid_prompts.append(prompt)
            valid_indices.append(i)
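    # Note: this filter only checks that the prompt itself fits in the context
    # window; it does not reserve headroom for generation. A prompt just under
    # effective_max_len can still have its completion truncated well short of
    # max_tokens, since prompt and output together must fit in the window.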

    # Log filtering results
    if skip_long_prompts and skipped_info:
        logger.warning(
            f"Skipped {len(skipped_info)} prompts that exceed max_model_len ({effective_max_len} tokens)"
        )
        logger.info("Skipped prompt details (first 10):")
        for idx, (prompt_idx, token_count) in enumerate(skipped_info[:10]):
            logger.info(
                f" - Example {prompt_idx}: {token_count} tokens (exceeds by {token_count - effective_max_len})"
            )
        if len(skipped_info) > 10:
            logger.info(f" ... and {len(skipped_info) - 10} more")

        skip_percentage = (len(skipped_info) / total_examples) * 100
        if skip_percentage > 10:
            logger.warning(f"WARNING: {skip_percentage:.1f}% of prompts were skipped!")

    if not valid_prompts:
        logger.error("No valid prompts to process after filtering!")
        sys.exit(1)

    # Generate responses - vLLM handles batching internally
    logger.info(f"Starting generation for {len(valid_prompts):,} valid prompts...")
    logger.info("vLLM will handle batching and scheduling automatically")

    outputs = llm.generate(valid_prompts, sampling_params)
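    # llm.generate() returns one RequestOutput per prompt, in input order, so
    # positions in `outputs` line up with `valid_indices` for mapping results
    # back to the original rows below.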

    # Extract generated text and create full response list
    logger.info("Extracting generated responses...")
    responses = [""] * total_examples  # Initialize with empty strings

    for idx, output in enumerate(outputs):
        original_idx = valid_indices[idx]
        response = output.outputs[0].text.strip()
        responses[original_idx] = response

    # Add responses to dataset
    logger.info("Adding responses to dataset...")
    dataset = dataset.add_column(output_column, responses)

    # Create dataset card
    logger.info("Creating dataset card...")
    card_content = create_dataset_card(
        source_dataset=src_dataset_hub_id,
        model_id=model_id,
        messages_column=messages_column,
        prompt_column=prompt_column,
        sampling_params=sampling_params,
        tensor_parallel_size=tensor_parallel_size,
        num_examples=total_examples,
        generation_time=generation_start_time,
        num_skipped=len(skipped_info) if skip_long_prompts else 0,
        max_model_len_used=effective_max_len if skip_long_prompts else None,
    )

    # Push dataset to hub
    logger.info(f"Pushing dataset to: {output_dataset_hub_id}")
    dataset.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)

    # Push dataset card
    card = DatasetCard(card_content)
    card.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)

    logger.info("✅ Generation complete!")
    logger.info(
        f"Dataset available at: https://huggingface.co/datasets/{output_dataset_hub_id}"
    )


if __name__ == "__main__":
    if len(sys.argv) > 1:
        parser = argparse.ArgumentParser(
            description="Generate responses for dataset prompts using vLLM",
            formatter_class=argparse.RawDescriptionHelpFormatter,
            epilog="""
Examples:
  # Basic usage with default Qwen model
  uv run generate-responses.py input-dataset output-dataset

  # With custom model and parameters
  uv run generate-responses.py input-dataset output-dataset \\
      --model-id meta-llama/Llama-3.1-8B-Instruct \\
      --temperature 0.9 \\
      --max-tokens 2048

  # Force specific GPU configuration
  uv run generate-responses.py input-dataset output-dataset \\
      --tensor-parallel-size 2 \\
      --gpu-memory-utilization 0.95

  # Using environment variable for token
  HF_TOKEN=hf_xxx uv run generate-responses.py input-dataset output-dataset
""",
        )

        parser.add_argument(
            "src_dataset_hub_id",
            help="Input dataset on Hugging Face Hub (e.g., username/dataset-name)",
        )
        parser.add_argument(
            "output_dataset_hub_id", help="Output dataset name on Hugging Face Hub"
        )
        parser.add_argument(
            "--model-id",
            type=str,
            default="Qwen/Qwen3-30B-A3B-Instruct-2507",
            help="Model to use for generation (default: Qwen3-30B-A3B-Instruct-2507)",
        )
        parser.add_argument(
            "--messages-column",
            type=str,
            default="messages",
            help="Column containing chat messages (default: messages)",
        )
        parser.add_argument(
            "--prompt-column",
            type=str,
            help="Column containing plain text prompts (alternative to --messages-column)",
        )
        parser.add_argument(
            "--output-column",
            type=str,
            default="response",
            help="Column name for generated responses (default: response)",
        )
        parser.add_argument(
            "--max-samples",
            type=int,
            help="Maximum number of samples to process (default: all)",
        )
        parser.add_argument(
            "--temperature",
            type=float,
            default=0.7,
            help="Sampling temperature (default: 0.7)",
        )
        parser.add_argument(
            "--top-p",
            type=float,
            default=0.8,
            help="Top-p sampling parameter (default: 0.8)",
        )
        parser.add_argument(
            "--top-k",
            type=int,
            default=20,
            help="Top-k sampling parameter (default: 20)",
        )
        parser.add_argument(
            "--min-p",
            type=float,
            default=0.0,
            help="Minimum probability threshold (default: 0.0)",
        )
        parser.add_argument(
            "--max-tokens",
            type=int,
            default=16384,
            help="Maximum tokens to generate (default: 16384)",
        )
        parser.add_argument(
            "--repetition-penalty",
            type=float,
            default=1.0,
            help="Repetition penalty (default: 1.0)",
        )
        parser.add_argument(
            "--gpu-memory-utilization",
            type=float,
            default=0.90,
            help="GPU memory utilization factor (default: 0.90)",
        )
        parser.add_argument(
            "--max-model-len",
            type=int,
            help="Maximum model context length (default: model's default)",
        )
        parser.add_argument(
            "--tensor-parallel-size",
            type=int,
            help="Number of GPUs to use (default: auto-detect)",
        )
        parser.add_argument(
            "--hf-token",
            type=str,
            help="Hugging Face token (can also use HF_TOKEN env var)",
        )
        parser.add_argument(
            "--skip-long-prompts",
            action="store_true",
            default=True,
            help="Skip prompts that exceed max_model_len instead of failing (default: True)",
        )
        parser.add_argument(
            "--no-skip-long-prompts",
            dest="skip_long_prompts",
            action="store_false",
            help="Fail on prompts that exceed max_model_len",
        )
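        # The store_true/store_false pair above emulates a --flag/--no-flag
        # toggle; argparse.BooleanOptionalAction (Python 3.9+) is the built-in
        # alternative if this ever gets refactored.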

        args = parser.parse_args()

        main(
            src_dataset_hub_id=args.src_dataset_hub_id,
            output_dataset_hub_id=args.output_dataset_hub_id,
            model_id=args.model_id,
            messages_column=args.messages_column,
            prompt_column=args.prompt_column,
            output_column=args.output_column,
            temperature=args.temperature,
            top_p=args.top_p,
            top_k=args.top_k,
            min_p=args.min_p,
            max_tokens=args.max_tokens,
            repetition_penalty=args.repetition_penalty,
            gpu_memory_utilization=args.gpu_memory_utilization,
            max_model_len=args.max_model_len,
            tensor_parallel_size=args.tensor_parallel_size,
            skip_long_prompts=args.skip_long_prompts,
            max_samples=args.max_samples,
            hf_token=args.hf_token,
        )
    else:
        # Show HF Jobs example when run without arguments
        print("""
vLLM Response Generation Script
===============================

This script requires arguments. For usage information:
  uv run generate-responses.py --help

Example HF Jobs command with multi-GPU:
  # If you're logged in with huggingface-cli, token will be auto-detected
  hf jobs uv run \\
      --flavor l4x4 \\
      https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
      username/input-dataset \\
      username/output-dataset \\
      --messages-column messages \\
      --model-id Qwen/Qwen3-30B-A3B-Instruct-2507 \\
      --temperature 0.7 \\
      --max-tokens 16384
""")