burtenshaw HF Staff committed on
Commit 7200e76 · verified · 1 Parent(s): ab1dc3b

Upload folder using huggingface_hub

SKILL.md ADDED
@@ -0,0 +1,752 @@
1
+ ---
2
+ name: hf-jobs
3
+ description: This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup.
4
+ license: Complete terms in LICENSE.txt
5
+ ---
6
+
7
+ # Running Workloads on Hugging Face Jobs
8
+
9
+ ## Overview
10
+
11
+ Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.
12
+
13
+ **Common use cases:**
14
+ - **Data Processing** - Transform, filter, or analyze large datasets
15
+ - **Batch Inference** - Run inference on thousands of samples
16
+ - **Experiments & Benchmarks** - Reproducible ML experiments
17
+ - **Model Training** - Fine-tune models (see `model-trainer` skill for TRL-specific training)
18
+ - **Synthetic Data Generation** - Generate datasets using LLMs
19
+ - **Development & Testing** - Test code without local GPU setup
20
+ - **Scheduled Jobs** - Automate recurring tasks
21
+
22
+ **For model training specifically:** See the `model-trainer` skill for TRL-based training workflows.
23
+
24
+ ## When to Use This Skill
25
+
26
+ Use this skill when users want to:
27
+ - Run Python workloads on cloud infrastructure
28
+ - Execute jobs without local GPU/TPU setup
29
+ - Process data at scale
30
+ - Run batch inference or experiments
31
+ - Schedule recurring tasks
32
+ - Use GPUs/TPUs for any workload
33
+ - Persist results to the Hugging Face Hub
34
+
35
+ ## Key Directives
36
+
37
+ When assisting with jobs:
38
+
39
+ 1. **ALWAYS use `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})` or `hf_jobs("run", {...})`. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`.
40
+
41
+ 2. **Always handle authentication** - Jobs that interact with the Hub require `HF_TOKEN` via secrets. See Token Usage section below.
42
+
43
+ 3. **Provide job details after submission** - After submitting, provide job ID, monitoring URL, estimated time, and note that the user can request status checks later.
44
+
45
+ 4. **Set appropriate timeouts** - The default of 30 minutes may be insufficient for long-running tasks.
46
+
47
+ ## Prerequisites Checklist
48
+
49
+ Before starting any job, verify:
50
+
51
+ ### ✅ **Account & Authentication**
52
+ - Hugging Face Account with [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
53
+ - Authenticated login: Check with `hf_whoami()`
54
+ - **HF_TOKEN for Hub Access** ⚠️ CRITICAL - Required for any Hub operations (push models/datasets, download private repos, etc.)
55
+ - Token must have appropriate permissions (read for downloads, write for uploads)
56
+
57
+ ### ✅ **Token Usage** (See Token Usage section for details)
58
+
59
+ **When tokens are required:**
60
+ - Pushing models/datasets to Hub
61
+ - Accessing private repositories
62
+ - Using Hub APIs in scripts
63
+ - Any authenticated Hub operations
64
+
65
+ **How to provide tokens:**
66
+ ```python
67
+ {
68
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Recommended: automatic token
69
+ }
70
+ ```
71
+
72
+ **⚠️ CRITICAL:** The `$HF_TOKEN` placeholder is automatically replaced with your logged-in token. Never hardcode tokens in scripts.
73
+
74
+ ## Token Usage Guide
75
+
76
+ ### Understanding Tokens
77
+
78
+ **What are HF Tokens?**
79
+ - Authentication credentials for Hugging Face Hub
80
+ - Required for authenticated operations (push, private repos, API access)
81
+ - Stored securely on your machine after `hf auth login`
82
+
83
+ **Token Types:**
84
+ - **Read Token** - Can download models/datasets, read private repos
85
+ - **Write Token** - Can push models/datasets, create repos, modify content
86
+ - **Organization Token** - Can act on behalf of an organization
87
+
88
+ ### When Tokens Are Required
89
+
90
+ **Always Required:**
91
+ - Pushing models/datasets to Hub
92
+ - Accessing private repositories
93
+ - Creating new repositories
94
+ - Modifying existing repositories
95
+ - Using Hub APIs programmatically
96
+
97
+ **Not Required:**
98
+ - Downloading public models/datasets
99
+ - Running jobs that don't interact with Hub
100
+ - Reading public repository information
101
+
102
+ ### How to Provide Tokens to Jobs
103
+
104
+ #### Method 1: Automatic Token (Recommended)
105
+
106
+ ```python
107
+ hf_jobs("uv", {
108
+ "script": "your_script.py",
109
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Automatic replacement
110
+ })
111
+ ```
112
+
113
+ **How it works:**
114
+ - `$HF_TOKEN` is a placeholder that gets replaced with your actual token
115
+ - Uses the token from your logged-in session (`hf auth login`)
116
+ - Most secure and convenient method
117
+ - Token is encrypted server-side when passed as a secret
118
+
119
+ **Benefits:**
120
+ - No token exposure in code
121
+ - Uses your current login session
122
+ - Automatically updated if you re-login
123
+ - Works seamlessly with MCP tools
124
+
125
+ #### Method 2: Explicit Token (Not Recommended)
126
+
127
+ ```python
128
+ hf_jobs("uv", {
129
+ "script": "your_script.py",
130
+ "secrets": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Hardcoded token
131
+ })
132
+ ```
133
+
134
+ **When to use:**
135
+ - Only if automatic token doesn't work
136
+ - Testing with a specific token
137
+ - Organization tokens (use with caution)
138
+
139
+ **Security concerns:**
140
+ - Token visible in code/logs
141
+ - Must manually update if token rotates
142
+ - Risk of token exposure
143
+
144
+ #### Method 3: Environment Variable (Less Secure)
145
+
146
+ ```python
147
+ hf_jobs("uv", {
148
+ "script": "your_script.py",
149
+ "env": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Less secure than secrets
150
+ })
151
+ ```
152
+
153
+ **Difference from secrets:**
154
+ - `env` variables are visible in job logs
155
+ - `secrets` are encrypted server-side
156
+ - Always prefer `secrets` for tokens
157
+
158
+ ### Using Tokens in Scripts
159
+
160
+ **In your Python script, tokens are available as environment variables:**
161
+
162
+ ```python
163
+ # /// script
164
+ # dependencies = ["huggingface-hub"]
165
+ # ///
166
+
167
+ import os
168
+ from huggingface_hub import HfApi
169
+
170
+ # Token is automatically available if passed via secrets
171
+ token = os.environ.get("HF_TOKEN")
172
+
173
+ # Use with Hub API
174
+ api = HfApi(token=token)
175
+
176
+ # Or let huggingface_hub auto-detect
177
+ api = HfApi() # Automatically uses HF_TOKEN env var
178
+ ```
179
+
180
+ **Best practices:**
181
+ - Don't hardcode tokens in scripts
182
+ - Use `os.environ.get("HF_TOKEN")` to access
183
+ - Let `huggingface_hub` auto-detect when possible
184
+ - Verify token exists before Hub operations
185
+
186
+ ### Token Verification
187
+
188
+ **Check if you're logged in:**
189
+ ```python
190
+ from huggingface_hub import whoami
191
+ user_info = whoami()  # Returns a dict with your account info (raises if not logged in)
192
+ ```
193
+
194
+ **Verify token in job:**
195
+ ```python
196
+ import os
197
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
198
+ token = os.environ["HF_TOKEN"]
199
+ print(f"Token starts with: {token[:7]}...") # Should start with "hf_"
200
+ ```
201
+
202
+ ### Common Token Issues
203
+
204
+ **Error: 401 Unauthorized**
205
+ - **Cause:** Token missing or invalid
206
+ - **Fix:** Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
207
+ - **Verify:** Check `hf_whoami()` works locally
208
+
209
+ **Error: 403 Forbidden**
210
+ - **Cause:** Token lacks required permissions
211
+ - **Fix:** Ensure token has write permissions for push operations
212
+ - **Check:** Token type at https://huggingface.co/settings/tokens
213
+
214
+ **Error: Token not found in environment**
215
+ - **Cause:** `secrets` not passed or wrong key name
216
+ - **Fix:** Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
217
+ - **Verify:** Script checks `os.environ.get("HF_TOKEN")`
218
+
219
+ **Error: Repository access denied**
220
+ - **Cause:** Token doesn't have access to private repo
221
+ - **Fix:** Use token from account with access
222
+ - **Check:** Verify repo visibility and your permissions
223
+
224
+ ### Token Security Best Practices
225
+
226
+ 1. **Never commit tokens** - Use `$HF_TOKEN` placeholder or environment variables
227
+ 2. **Use secrets, not env** - Secrets are encrypted server-side
228
+ 3. **Rotate tokens regularly** - Generate new tokens periodically
229
+ 4. **Use minimal permissions** - Create tokens with only needed permissions
230
+ 5. **Don't share tokens** - Each user should use their own token
231
+ 6. **Monitor token usage** - Check token activity in Hub settings
232
+
233
+ ### Complete Token Example
234
+
235
+ ```python
236
+ # Example: Push results to Hub
237
+ hf_jobs("uv", {
238
+ "script": """
239
+ # /// script
240
+ # dependencies = ["huggingface-hub", "datasets"]
241
+ # ///
242
+
243
+ import os
244
+ from huggingface_hub import HfApi
245
+ from datasets import Dataset
246
+
247
+ # Verify token is available
248
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
249
+
250
+ # Use token for Hub operations
251
+ api = HfApi(token=os.environ["HF_TOKEN"])
252
+
253
+ # Create and push dataset
254
+ data = {"text": ["Hello", "World"]}
255
+ dataset = Dataset.from_dict(data)
256
+ dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])
257
+
258
+ print("✅ Dataset pushed successfully!")
259
+ """,
260
+ "flavor": "cpu-basic",
261
+ "timeout": "30m",
262
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided securely
263
+ })
264
+ ```
265
+
266
+ ## Quick Start: Two Approaches
267
+
268
+ ### Approach 1: UV Scripts (Recommended)
269
+
270
+ UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.
271
+
272
+ ```python
273
+ hf_jobs("uv", {
274
+ "script": """
275
+ # /// script
276
+ # dependencies = ["transformers", "torch"]
277
+ # ///
278
+
279
+ from transformers import pipeline
280
+ import torch
281
+
282
+ # Your workload here
283
+ classifier = pipeline("sentiment-analysis")
284
+ result = classifier("I love Hugging Face!")
285
+ print(result)
286
+ """,
287
+ "flavor": "cpu-basic",
288
+ "timeout": "30m"
289
+ })
290
+ ```
291
+
292
+ **Benefits:** Direct MCP tool usage, clean code, dependencies declared inline, no file saving required
293
+
294
+ **When to use:** Default choice for all workloads, custom logic, any scenario requiring `hf_jobs()`
295
+
296
+ #### Working with Scripts
297
+
298
+ ⚠️ **Important:** There are *two* “script path” stories depending on how you run Jobs:
299
+
300
+ - **Using the `hf_jobs()` MCP tool (recommended in this repo)**: the `script` value must be **inline code** (a string) or a **URL**. A local filesystem path (like `"./scripts/foo.py"`) won’t exist inside the remote container.
301
+ - **Using the `hf jobs uv run` CLI**: local file paths **do work** (the CLI uploads your script).
302
+
303
+ **Common mistake with `hf_jobs()` MCP tool:**
304
+
305
+ ```python
306
+ # ❌ Will fail (remote container can't see your local path)
307
+ hf_jobs("uv", {"script": "./scripts/foo.py"})
308
+ ```
309
+
310
+ **Correct patterns with `hf_jobs()` MCP tool:**
311
+
312
+ ```python
313
+ # ✅ Inline: read the local script file and pass its *contents*
314
+ from pathlib import Path
315
+ script = Path("hf-jobs/scripts/foo.py").read_text()
316
+ hf_jobs("uv", {"script": script})
317
+
318
+ # ✅ URL: host the script somewhere reachable
319
+ hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})
320
+ ```
321
+
322
+ **CLI equivalent (local paths supported):**
323
+
324
+ ```bash
325
+ hf jobs uv run ./scripts/foo.py -- --your --args
326
+ ```
327
+
328
+ ### Approach 2: Docker-Based Jobs
329
+
330
+ Run jobs with custom Docker images and commands.
331
+
332
+ ```python
333
+ hf_jobs("run", {
334
+ "image": "python:3.12",
335
+ "command": ["python", "-c", "print('Hello from HF Jobs!')"],
336
+ "flavor": "cpu-basic",
337
+ "timeout": "30m"
338
+ })
339
+ ```
340
+
341
+ **Benefits:** Full Docker control, use pre-built images, run any command
342
+ **When to use:** Need specific Docker images, non-Python workloads, complex environments
343
+
344
+ **Example with GPU:**
345
+ ```python
346
+ hf_jobs("run", {
347
+ "image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
348
+ "command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
349
+ "flavor": "a10g-small",
350
+ "timeout": "1h"
351
+ })
352
+ ```
353
+
354
+ ### Finding More UV Scripts on Hub
355
+
356
+ The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:
357
+
358
+ ```python
359
+ # Discover available UV script collections
360
+ dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})
361
+
362
+ # Explore a specific collection
363
+ hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
364
+ ```
365
+
366
+ **Popular collections:** OCR, classification, synthetic-data, vLLM, dataset-creation
367
+
368
+ ## Hardware Selection
369
+
370
+ | Workload Type | Recommended Hardware | Cost (approx./hr) | Use Case |
371
+ |---------------|---------------------|------------------|----------|
372
+ | Data processing, testing | `cpu-basic`, `cpu-upgrade` | ~$0.10-0.50 | Lightweight tasks |
373
+ | Small models, demos | `t4-small` | ~$0.75 | <1B models, quick tests |
374
+ | Medium models | `t4-medium`, `l4x1` | ~$1.50-2.50 | 1-7B models |
375
+ | Large models, production | `a10g-small`, `a10g-large` | ~$3.50-5.00 | 7-13B models |
376
+ | Very large models | `a100-large` | ~$8-12 | 13B+ models |
377
+ | Batch inference | `a10g-large`, `a100-large` | ~$5-10 | High-throughput |
378
+ | Data processing | `cpu-upgrade`, `l4x1` | ~$0.50-2.50 | Parallel workloads |
379
+
380
+ **CPU/GPU flavors:** cpu-basic/upgrade, t4-small/medium, l4x1/x4, a10g-small/large/largex2/largex4, a100-large, h100/h100x8
381
+
382
+ **TPU Flavors:** v5e-1x1, v5e-2x2, v5e-2x4
383
+
384
+ **Guidelines:**
385
+ - Start with smaller hardware for testing
386
+ - Scale up based on actual needs
387
+ - Use multi-GPU for parallel workloads
388
+ - See `references/hardware_guide.md` for detailed specifications
389
+
390
+ ## Critical: Saving Results
391
+
392
+ **⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS**
393
+
394
+ The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, **ALL WORK IS LOST**.
395
+
396
+ ### Persistence Options
397
+
398
+ **1. Push to Hugging Face Hub (Recommended)**
399
+
400
+ ```python
401
+ # Push models
402
+ model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])
403
+
404
+ # Push datasets
405
+ dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])
406
+
407
+ # Push artifacts
408
+ api.upload_file(
409
+ path_or_fileobj="results.json",
410
+ path_in_repo="results.json",
411
+ repo_id="username/results",
412
+ token=os.environ["HF_TOKEN"]
413
+ )
414
+ ```
415
+
416
+ **2. Use External Storage**
417
+
418
+ ```python
419
+ # Upload to S3, GCS, etc.
420
+ import boto3
421
+ s3 = boto3.client('s3')
422
+ s3.upload_file('results.json', 'my-bucket', 'results.json')
423
+ ```
424
+
425
+ **3. Send Results via API**
426
+
427
+ ```python
428
+ # POST results to your API
429
+ import requests
430
+ requests.post("https://your-api.com/results", json=results)
431
+ ```
432
+
433
+ ### Required Configuration for Hub Push
434
+
435
+ **In job submission:**
436
+ ```python
437
+ {
438
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Enables authentication
439
+ }
440
+ ```
441
+
442
+ **In script:**
443
+ ```python
444
+ import os
445
+ from huggingface_hub import HfApi
446
+
447
+ # Token automatically available from secrets
448
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
449
+
450
+ # Push your results
451
+ api.upload_file(...)
452
+ ```
453
+
454
+ ### Verification Checklist
455
+
456
+ Before submitting:
457
+ - [ ] Results persistence method chosen
458
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` if using Hub
459
+ - [ ] Script handles missing token gracefully (see the sketch below)
460
+ - [ ] Test persistence path works
461
+
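+ For the "handles missing token gracefully" item, a minimal pattern (illustrative sketch) is to fail fast with an actionable message before doing any expensive work:
+
+ ```python
+ import os
+ import sys
+
+ token = os.environ.get("HF_TOKEN")
+ if not token:
+     # Fail early with a clear hint instead of a 401 halfway through the job
+     sys.exit("HF_TOKEN is missing: submit the job with secrets={'HF_TOKEN': '$HF_TOKEN'}")
+ ```
+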
462
+ **See:** `references/hub_saving.md` for detailed Hub persistence guide
463
+
464
+ ## Timeout Management
465
+
466
+ **⚠️ DEFAULT: 30 MINUTES**
467
+
468
+ ### Setting Timeouts
469
+
470
+ ```python
471
+ {
472
+ "timeout": "2h" # 2 hours (formats: "90m", "2h", "1.5h", or seconds as integer)
473
+ }
474
+ ```
475
+
476
+ ### Timeout Guidelines
477
+
478
+ | Scenario | Recommended | Notes |
479
+ |----------|-------------|-------|
480
+ | Quick test | 10-30 min | Verify setup |
481
+ | Data processing | 1-2 hours | Depends on data size |
482
+ | Batch inference | 2-4 hours | Large batches |
483
+ | Experiments | 4-8 hours | Multiple runs |
484
+ | Long-running | 8-24 hours | Production workloads |
485
+
486
+ **Always add 20-30% buffer** for setup, network delays, and cleanup.
487
+
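+ A small helper for applying that buffer (an illustrative sketch; the 25% factor is simply the midpoint of the range above):
+
+ ```python
+ import math
+
+ def buffered_timeout(estimated_minutes: float, buffer: float = 0.25) -> str:
+     """Add a safety buffer to an estimated runtime and format it for the `timeout` parameter."""
+     minutes = math.ceil(estimated_minutes * (1 + buffer))
+     return f"{minutes}m"
+
+ print(buffered_timeout(90))   # "113m" for a 90-minute estimate
+ print(buffered_timeout(300))  # "375m" for a 5-hour estimate
+ ```
+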
488
+ **On timeout:** Job killed immediately, all unsaved progress lost
489
+
490
+ ## Cost Estimation
491
+
492
+ **General guidelines:**
493
+
494
+ ```
495
+ Total Cost = (Hours of runtime) × (Cost per hour)
496
+ ```
497
+
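+ The same formula as a tiny helper (a sketch; the per-hour prices are the approximate figures used in the examples below, not authoritative pricing):
+
+ ```python
+ # Approximate per-hour prices (assumed from the hardware table above)
+ PRICE_PER_HOUR = {"cpu-basic": 0.10, "l4x1": 2.50, "a10g-large": 5.00}
+
+ def estimate_cost(flavor: str, hours: float) -> float:
+     """Rough job cost in USD: runtime hours times the flavor's hourly price."""
+     return round(PRICE_PER_HOUR[flavor] * hours, 2)
+
+ print(estimate_cost("cpu-basic", 0.25))   # ≈ $0.03
+ print(estimate_cost("l4x1", 2))           # ≈ $5.00
+ print(estimate_cost("a10g-large", 4))     # ≈ $20.00
+ ```
+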
498
+ **Example calculations:**
499
+
500
+ **Quick test:**
501
+ - Hardware: cpu-basic ($0.10/hour)
502
+ - Time: 15 minutes (0.25 hours)
503
+ - Cost: $0.03
504
+
505
+ **Data processing:**
506
+ - Hardware: l4x1 ($2.50/hour)
507
+ - Time: 2 hours
508
+ - Cost: $5.00
509
+
510
+ **Batch inference:**
511
+ - Hardware: a10g-large ($5/hour)
512
+ - Time: 4 hours
513
+ - Cost: $20.00
514
+
515
+ **Cost optimization tips:**
516
+ 1. Start small - Test on cpu-basic or t4-small
517
+ 2. Monitor runtime - Set appropriate timeouts
518
+ 3. Use checkpoints - Resume if job fails
519
+ 4. Optimize code - Reduce unnecessary compute
520
+ 5. Choose right hardware - Don't over-provision
521
+
522
+ ## Monitoring and Tracking
523
+
524
+ ### Check Job Status
525
+
526
+ ```python
527
+ # List all jobs
528
+ hf_jobs("ps")
529
+
530
+ # Inspect specific job
531
+ hf_jobs("inspect", {"job_id": "your-job-id"})
532
+
533
+ # View logs
534
+ hf_jobs("logs", {"job_id": "your-job-id"})
535
+
536
+ # Cancel a job
537
+ hf_jobs("cancel", {"job_id": "your-job-id"})
538
+ ```
539
+
540
+ **Remember:** Wait for user to request status checks. Avoid polling repeatedly.
541
+
542
+ ### Job URLs
543
+
544
+ After submission, jobs have monitoring URLs:
545
+ ```
546
+ https://huggingface.co/jobs/username/job-id
547
+ ```
548
+
549
+ View logs, status, and details in the browser.
550
+
551
+ ## Scheduled Jobs
552
+
553
+ Run jobs on a schedule using CRON expressions or predefined schedules.
554
+
555
+ ```python
556
+ # Schedule a job that runs every hour
557
+ hf_jobs("scheduled uv", {
558
+ "script": "your_script.py",
559
+ "schedule": "@hourly",
560
+ "flavor": "cpu-basic"
561
+ })
562
+
563
+ # Use CRON syntax
564
+ hf_jobs("scheduled uv", {
565
+ "script": "your_script.py",
566
+ "schedule": "0 9 * * 1", # 9 AM every Monday
567
+ "flavor": "cpu-basic"
568
+ })
569
+ ```
570
+
571
+ **Available schedules:**
572
+ - `@annually`, `@yearly` - Once per year
573
+ - `@monthly` - Once per month
574
+ - `@weekly` - Once per week
575
+ - `@daily` - Once per day
576
+ - `@hourly` - Once per hour
577
+ - CRON expression - Custom schedule (e.g., `"0 9 * * 1"`; field breakdown below)
578
+
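+ For reference, the five fields of a standard CRON expression such as `"0 9 * * 1"` read:
+
+ ```
+ 0 9 * * 1
+ │ │ │ │ └─ day of week (0-6, 0 = Sunday; 1 = Monday)
+ │ │ │ └─── month (1-12)
+ │ │ └───── day of month (1-31)
+ │ └─────── hour (0-23)
+ └───────── minute (0-59)
+ ```
+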
579
+ **Manage scheduled jobs:**
580
+ ```python
581
+ hf_jobs("scheduled ps") # List scheduled jobs
582
+ hf_jobs("scheduled suspend", {"job_id": "..."}) # Pause
583
+ hf_jobs("scheduled resume", {"job_id": "..."}) # Resume
584
+ hf_jobs("scheduled delete", {"job_id": "..."}) # Delete
585
+ ```
586
+
587
+ ## Common Workload Patterns
588
+
589
+ This repository ships ready-to-run UV scripts in `hf-jobs/scripts/`. Prefer using them instead of inventing new templates.
590
+
591
+ ### Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`
592
+
593
+ **What it does:** loads a Hub dataset (chat `messages` or a `prompt` column), applies a model chat template, generates responses with vLLM, and **pushes** the output dataset + dataset card back to the Hub.
594
+
595
+ **Requires:** GPU + **write** token (it pushes a dataset).
596
+
597
+ ```python
598
+ from pathlib import Path
599
+
600
+ script = Path("hf-jobs/scripts/generate-responses.py").read_text()
601
+ hf_jobs("uv", {
602
+ "script": script,
603
+ "script_args": [
604
+ "username/input-dataset",
605
+ "username/output-dataset",
606
+ "--messages-column", "messages",
607
+ "--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
608
+ "--temperature", "0.7",
609
+ "--top-p", "0.8",
610
+ "--max-tokens", "2048",
611
+ ],
612
+ "flavor": "a10g-large",
613
+ "timeout": "4h",
614
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
615
+ })
616
+ ```
617
+
618
+ ### Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`
619
+
620
+ **What it does:** generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then **pushes** the generated dataset + dataset card to the Hub.
621
+
622
+ **Requires:** GPU + **write** token (it pushes a dataset).
623
+
624
+ ```python
625
+ from pathlib import Path
626
+
627
+ script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
628
+ hf_jobs("uv", {
629
+ "script": script,
630
+ "script_args": [
631
+ "--seed-dataset", "davanstrien/s1k-reasoning",
632
+ "--output-dataset", "username/synthetic-math",
633
+ "--task-type", "reasoning",
634
+ "--num-samples", "5000",
635
+ "--filter-method", "answer-consistency",
636
+ ],
637
+ "flavor": "l4x4",
638
+ "timeout": "8h",
639
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
640
+ })
641
+ ```
642
+
643
+ ### Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`
644
+
645
+ **What it does:** scans parquet directly from Hub (no 300GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.
646
+
647
+ **Requires:** CPU is often enough; token needed **only** if you pass `--output-repo` (upload).
648
+
649
+ ```python
650
+ from pathlib import Path
651
+
652
+ script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
653
+ hf_jobs("uv", {
654
+ "script": script,
655
+ "script_args": [
656
+ "--limit", "10000",
657
+ "--show-plan",
658
+ "--output-repo", "username/finepdfs-temporal-stats",
659
+ ],
660
+ "flavor": "cpu-upgrade",
661
+ "timeout": "2h",
662
+ "env": {"HF_XET_HIGH_PERFORMANCE": "1"},
663
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
664
+ })
665
+ ```
666
+
667
+ ## Common Failure Modes
668
+
669
+ ### Out of Memory (OOM)
670
+
671
+ **Fix:**
672
+ 1. Reduce batch size or data chunk size
673
+ 2. Process data in smaller batches (see the sketch below)
674
+ 3. Upgrade hardware: cpu → t4 → a10g → a100
675
+
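+ A minimal sketch of fix 2, processing a dataset batch by batch instead of holding everything in memory (the dataset name, column, and batch size are illustrative):
+
+ ```python
+ # /// script
+ # dependencies = ["datasets"]
+ # ///
+
+ from datasets import load_dataset
+
+ ds = load_dataset("username/input-dataset", split="train")  # hypothetical repo
+
+ def process(batch):
+     # Replace with the real per-batch logic (tokenization, inference, ...)
+     batch["n_chars"] = [len(t) for t in batch["text"]]
+     return batch
+
+ # Batched map only materializes one batch at a time
+ ds = ds.map(process, batched=True, batch_size=256)
+ ```
+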
676
+ ### Job Timeout
677
+
678
+ **Fix:**
679
+ 1. Check logs for actual runtime
680
+ 2. Increase timeout with buffer: `"timeout": "3h"`
681
+ 3. Optimize code for faster execution
682
+ 4. Process data in chunks
683
+
684
+ ### Hub Push Failures
685
+
686
+ **Fix:**
687
+ 1. Add to job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
688
+ 2. Verify token in script: `assert "HF_TOKEN" in os.environ`
689
+ 3. Check token permissions
690
+ 4. Verify repo exists or can be created
691
+
692
+ ### Missing Dependencies
693
+
694
+ **Fix:**
695
+ Add to PEP 723 header:
696
+ ```python
697
+ # /// script
698
+ # dependencies = ["package1", "package2>=1.0.0"]
699
+ # ///
700
+ ```
701
+
702
+ ### Authentication Errors
703
+
704
+ **Fix:**
705
+ 1. Check `hf_whoami()` works locally
706
+ 2. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
707
+ 3. Re-login: `hf auth login`
708
+ 4. Check token has required permissions
709
+
710
+ ## Troubleshooting
711
+
712
+ **Common issues:**
713
+ - Job times out → Increase timeout, optimize code
714
+ - Results not saved → Check persistence method, verify HF_TOKEN
715
+ - Out of Memory → Reduce batch size, upgrade hardware
716
+ - Import errors → Add dependencies to PEP 723 header
717
+ - Authentication errors → Check token, verify secrets parameter
718
+
719
+ **See:** `references/troubleshooting.md` for complete troubleshooting guide
720
+
721
+ ## Resources
722
+
723
+ ### References (In This Skill)
724
+ - `references/token_usage.md` - Complete token usage guide
725
+ - `references/hardware_guide.md` - Hardware specs and selection
726
+ - `references/hub_saving.md` - Hub persistence guide
727
+ - `references/troubleshooting.md` - Common issues and solutions
728
+
729
+ ### Scripts (In This Skill)
730
+ - `scripts/generate-responses.py` - vLLM batch generation: dataset → responses → push to Hub
731
+ - `scripts/cot-self-instruct.py` - CoT Self-Instruct synthetic data generation + filtering → push to Hub
732
+ - `scripts/finepdfs-stats.py` - Polars streaming stats over `finepdfs-edu` parquet on Hub (optional push)
733
+
734
+ ### External Links
735
+ - [HF Jobs Documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs)
736
+ - [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/)
737
+ - [UV Scripts Organization](https://huggingface.co/uv-scripts)
738
+ - [HF Hub Authentication](https://huggingface.co/docs/huggingface_hub/quick-start#authentication)
739
+
740
+ ## Key Takeaways
741
+
742
+ 1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless user requests
743
+ 2. **Jobs are asynchronous** - Don't wait/poll; let user check when ready
744
+ 3. **Always set timeout** - Default 30 min may be insufficient; set appropriate timeout
745
+ 4. **Always persist results** - Environment is ephemeral; without persistence, all work is lost
746
+ 5. **Use tokens securely** - Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}` for Hub operations
747
+ 6. **Choose appropriate hardware** - Start small, scale up based on needs
748
+ 7. **Use UV scripts** - Default to `hf_jobs("uv", {...})` with inline scripts for Python workloads
749
+ 8. **Handle authentication** - Verify tokens are available before Hub operations
750
+ 9. **Monitor jobs** - Provide job URLs and status check commands
751
+ 10. **Optimize costs** - Choose right hardware, set appropriate timeouts
752
+
index.html CHANGED
@@ -1,19 +1,214 @@
1
- <!doctype html>
2
- <html>
3
- <head>
4
- <meta charset="utf-8" />
5
- <meta name="viewport" content="width=device-width" />
6
- <title>My static Space</title>
7
- <link rel="stylesheet" href="style.css" />
8
- </head>
9
- <body>
10
- <div class="card">
11
- <h1>Welcome to your static Space!</h1>
12
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
13
- <p>
14
- Also don't forget to check the
15
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
16
- </p>
17
- </div>
18
- </body>
19
  </html>
 
 
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>hf-jobs - Run Workloads on Hugging Face Jobs</title>
7
+ <style>
8
+ * {
9
+ margin: 0;
10
+ padding: 0;
11
+ box-sizing: border-box;
12
+ }
13
+
14
+ body {
15
+ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
16
+ line-height: 1.6;
17
+ color: #333;
18
+ background: #f5f5f5;
19
+ padding: 20px;
20
+ }
21
+
22
+ .container {
23
+ max-width: 1200px;
24
+ margin: 0 auto;
25
+ background: white;
26
+ padding: 40px;
27
+ border-radius: 8px;
28
+ box-shadow: 0 2px 4px rgba(0,0,0,0.1);
29
+ }
30
+
31
+ h1 {
32
+ color: #ffd21e;
33
+ background: #000;
34
+ padding: 20px;
35
+ margin: -40px -40px 30px -40px;
36
+ border-radius: 8px 8px 0 0;
37
+ }
38
+
39
+ h2 {
40
+ color: #1e1e1e;
41
+ margin-top: 30px;
42
+ margin-bottom: 15px;
43
+ padding-bottom: 10px;
44
+ border-bottom: 2px solid #ffd21e;
45
+ }
46
+
47
+ h3 {
48
+ color: #555;
49
+ margin-top: 20px;
50
+ margin-bottom: 10px;
51
+ }
52
+
53
+ .description {
54
+ background: #f9f9f9;
55
+ padding: 20px;
56
+ border-left: 4px solid #ffd21e;
57
+ margin-bottom: 30px;
58
+ border-radius: 4px;
59
+ }
60
+
61
+ .file-list {
62
+ list-style: none;
63
+ padding: 0;
64
+ }
65
+
66
+ .file-list li {
67
+ padding: 12px;
68
+ margin: 8px 0;
69
+ background: #f9f9f9;
70
+ border-radius: 4px;
71
+ border-left: 3px solid #ffd21e;
72
+ transition: background 0.2s;
73
+ }
74
+
75
+ .file-list li:hover {
76
+ background: #f0f0f0;
77
+ }
78
+
79
+ .file-list a {
80
+ color: #0066cc;
81
+ text-decoration: none;
82
+ font-weight: 500;
83
+ display: block;
84
+ }
85
+
86
+ .file-list a:hover {
87
+ text-decoration: underline;
88
+ }
89
+
90
+ .file-path {
91
+ color: #666;
92
+ font-size: 0.9em;
93
+ font-family: 'Monaco', 'Courier New', monospace;
94
+ margin-top: 4px;
95
+ }
96
+
97
+ .file-description {
98
+ color: #777;
99
+ font-size: 0.9em;
100
+ margin-top: 4px;
101
+ font-style: italic;
102
+ }
103
+
104
+ .metadata {
105
+ background: #f0f0f0;
106
+ padding: 15px;
107
+ border-radius: 4px;
108
+ margin-bottom: 30px;
109
+ }
110
+
111
+ .metadata p {
112
+ margin: 5px 0;
113
+ }
114
+
115
+ .metadata strong {
116
+ color: #333;
117
+ }
118
+
119
+ .section {
120
+ margin-bottom: 40px;
121
+ }
122
+
123
+ code {
124
+ background: #f4f4f4;
125
+ padding: 2px 6px;
126
+ border-radius: 3px;
127
+ font-family: 'Monaco', 'Courier New', monospace;
128
+ font-size: 0.9em;
129
+ }
130
+ </style>
131
+ </head>
132
+ <body>
133
+ <div class="container">
134
+ <h1>Agent Skill : hf-jobs</h1>
135
+
136
+ <div class="description">
137
+ <p><strong>Run any workload on Hugging Face Jobs.</strong></p>
138
+ <p>Use this skill when you want to run GPU/CPU workloads (batch inference, synthetic data generation, dataset stats, experiments) on Hugging Face Jobs, with correct token handling and result persistence back to the Hub.</p>
139
+ </div>
140
+
141
+ <div class="metadata">
142
+ <p><strong>Skill Name:</strong> hf-jobs</p>
143
+ <p><strong>Main Documentation:</strong> <a href="hf-jobs/SKILL.md">hf-jobs/SKILL.md</a></p>
144
+ <p><strong>Scripts Directory:</strong> <code>hf-jobs/scripts/</code></p>
145
+ <p><strong>References Directory:</strong> <code>hf-jobs/references/</code></p>
146
+ </div>
147
+
148
+ <div class="section">
149
+ <h2>Overview</h2>
150
+ <p>This skill focuses on running real workloads via Hugging Face Jobs. It includes ready-to-run UV scripts and guides for authentication (HF tokens), secrets vs env vars, timeouts, hardware selection, and pushing results to the Hub.</p>
151
+ </div>
152
+
153
+ <div class="section">
154
+ <h2>Core Documentation</h2>
155
+ <ul class="file-list">
156
+ <li>
157
+ <a href="hf-jobs/SKILL.md">SKILL.md</a>
158
+ <div class="file-path">hf-jobs/SKILL.md</div>
159
+ <div class="file-description">Complete skill documentation (how to submit jobs, tokens/secrets, timeouts, persistence, and how to use the bundled scripts)</div>
160
+ </li>
161
+ </ul>
162
+ </div>
163
+
164
+ <div class="section">
165
+ <h2>References</h2>
166
+ <ul class="file-list">
167
+ <li>
168
+ <a href="hf-jobs/references/token_usage.md">token_usage.md</a>
169
+ <div class="file-path">hf-jobs/references/token_usage.md</div>
170
+ <div class="file-description">Token best practices: secrets vs env, permissions, common errors (401/403), and secure patterns</div>
171
+ </li>
172
+ <li>
173
+ <a href="hf-jobs/references/hub_saving.md">hub_saving.md</a>
174
+ <div class="file-path">hf-jobs/references/hub_saving.md</div>
175
+ <div class="file-description">How to persist results: push datasets/models/files to the Hub (ephemeral job filesystem)</div>
176
+ </li>
177
+ <li>
178
+ <a href="hf-jobs/references/hardware_guide.md">hardware_guide.md</a>
179
+ <div class="file-path">hf-jobs/references/hardware_guide.md</div>
180
+ <div class="file-description">Flavor selection guidance for CPU/GPU/TPU workloads</div>
181
+ </li>
182
+ <li>
183
+ <a href="hf-jobs/references/troubleshooting.md">troubleshooting.md</a>
184
+ <div class="file-path">hf-jobs/references/troubleshooting.md</div>
185
+ <div class="file-description">Common failure modes (timeouts, missing deps, OOM, auth) and fixes</div>
186
+ </li>
187
+ </ul>
188
+ </div>
189
+
190
+ <div class="section">
191
+ <h2>Scripts</h2>
192
+ <ul class="file-list">
193
+ <li>
194
+ <a href="hf-jobs/scripts/generate-responses.py">generate-responses.py</a>
195
+ <div class="file-path">hf-jobs/scripts/generate-responses.py</div>
196
+ <div class="file-description">vLLM batch generation: load prompts/messages from a dataset, generate responses, push dataset + card to Hub</div>
197
+ </li>
198
+ <li>
199
+ <a href="hf-jobs/scripts/cot-self-instruct.py">cot-self-instruct.py</a>
200
+ <div class="file-path">hf-jobs/scripts/cot-self-instruct.py</div>
201
+ <div class="file-description">CoT Self-Instruct synthetic data generation (reasoning/instruction) + optional filtering, pushes dataset + card</div>
202
+ </li>
203
+ <li>
204
+ <a href="hf-jobs/scripts/finepdfs-stats.py">finepdfs-stats.py</a>
205
+ <div class="file-path">hf-jobs/scripts/finepdfs-stats.py</div>
206
+ <div class="file-description">Polars streaming stats over Hub parquet (finepdfs-edu); optional upload of computed stats to a dataset repo</div>
207
+ </li>
208
+ </ul>
209
+ </div>
210
+ </div>
211
+ </body>
212
  </html>
213
+
214
+
references/hardware_guide.md ADDED
@@ -0,0 +1,266 @@
1
+ # Hardware Selection Guide
2
+
3
+ Choosing the right hardware (flavor) is critical for cost-effective workloads.
4
+
5
+ ## Available Hardware
6
+
7
+ ### CPU
8
+ - `cpu-basic` - Basic CPU, testing only
9
+ - `cpu-upgrade` - Enhanced CPU
10
+
11
+ **Use cases:** Data processing, testing scripts, lightweight workloads
12
+ **Not recommended for:** Model training, GPU-accelerated workloads
13
+
14
+ ### GPU Options
15
+
16
+ | Flavor | GPU | Memory | Use Case | Cost/hour |
17
+ |--------|-----|--------|----------|-----------|
18
+ | `t4-small` | NVIDIA T4 | 16GB | <1B models, demos, batch inference | ~$0.50-1 |
19
+ | `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
20
+ | `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient workloads | ~$2-3 |
21
+ | `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU workloads | ~$8-12 |
22
+ | `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
23
+ | `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
24
+ | `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
25
+ | `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
26
+ | `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast workloads | ~$8-12 |
27
+
28
+ ## Selection Guidelines
29
+
30
+ ### By Workload Type
31
+
32
+ **Data Processing**
33
+ - **Recommended:** `cpu-upgrade` or `l4x1`
34
+ - **Use case:** Transform, filter, analyze datasets
35
+ - **Batch size:** Depends on data size
36
+ - **Time:** Varies by dataset size
37
+
38
+ **Batch Inference**
39
+ - **Recommended:** `a10g-large` or `a100-large`
40
+ - **Use case:** Run inference on thousands of samples
41
+ - **Batch size:** 8-32 depending on model
42
+ - **Time:** Depends on number of samples
43
+
44
+ **Experiments & Benchmarks**
45
+ - **Recommended:** `a10g-small` or `a10g-large`
46
+ - **Use case:** Reproducible ML experiments
47
+ - **Batch size:** Varies
48
+ - **Time:** Depends on experiment complexity
49
+
50
+ **Model Training** (see `model-trainer` skill for details)
51
+ - **Recommended:** See model-trainer skill
52
+ - **Use case:** Fine-tuning models
53
+ - **Batch size:** Depends on model size
54
+ - **Time:** Hours to days
55
+
56
+ **Synthetic Data Generation**
57
+ - **Recommended:** `a10g-large` or `a100-large`
58
+ - **Use case:** Generate datasets using LLMs
59
+ - **Batch size:** Depends on generation method
60
+ - **Time:** Hours for large datasets
61
+
62
+ ### By Budget
63
+
64
+ **Minimal Budget (<$5 total)**
65
+ - Use `cpu-basic` or `t4-small`
66
+ - Process small datasets
67
+ - Quick tests and demos
68
+
69
+ **Small Budget ($5-20)**
70
+ - Use `t4-medium` or `a10g-small`
71
+ - Process medium datasets
72
+ - Run experiments
73
+
74
+ **Medium Budget ($20-50)**
75
+ - Use `a10g-small` or `a10g-large`
76
+ - Process large datasets
77
+ - Production workloads
78
+
79
+ **Large Budget ($50-200)**
80
+ - Use `a10g-large` or `a100-large`
81
+ - Large-scale processing
82
+ - Multiple experiments
83
+
84
+ ### By Model Size (for inference/processing)
85
+
86
+ **Tiny Models (<1B parameters)**
87
+ - **Recommended:** `t4-small`
88
+ - **Example:** Qwen2.5-0.5B, TinyLlama
89
+ - **Batch size:** 8-16
90
+
91
+ **Small Models (1-3B parameters)**
92
+ - **Recommended:** `t4-medium` or `a10g-small`
93
+ - **Example:** Qwen2.5-1.5B, Phi-2
94
+ - **Batch size:** 4-8
95
+
96
+ **Medium Models (3-7B parameters)**
97
+ - **Recommended:** `a10g-small` or `a10g-large`
98
+ - **Example:** Qwen2.5-7B, Mistral-7B
99
+ - **Batch size:** 2-4
100
+
101
+ **Large Models (7-13B parameters)**
102
+ - **Recommended:** `a10g-large` or `a100-large`
103
+ - **Example:** Llama-3-8B
104
+ - **Batch size:** 1-2
105
+
106
+ **Very Large Models (13B+ parameters)**
107
+ - **Recommended:** `a100-large`
108
+ - **Example:** Llama-3-13B, Llama-3-70B
109
+ - **Batch size:** 1
110
+
111
+ ## Memory Considerations
112
+
113
+ ### Estimating Memory Requirements
114
+
115
+ **For inference:**
116
+ ```
117
+ Memory (GB) ≈ (Model params in billions) × 2-4
118
+ ```
119
+
120
+ **For training:**
121
+ ```
122
+ Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)
123
+ ```
124
+
125
+ **Examples:**
126
+ - Qwen2.5-0.5B inference: ~1-2GB ✅ fits t4-small
127
+ - Qwen2.5-7B inference: ~14-28GB ✅ fits a10g-large
128
+ - Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA
129
+
130
+ ### Memory Optimization
131
+
132
+ If hitting memory limits:
133
+
134
+ 1. **Reduce batch size**
135
+ ```python
136
+ batch_size = 1
137
+ ```
138
+
139
+ 2. **Process in chunks**
140
+ ```python
141
+ for chunk in chunks:
142
+ process(chunk)
143
+ ```
144
+
145
+ 3. **Use smaller models**
146
+ - Use quantized models
147
+ - Use LoRA adapters
148
+
149
+ 4. **Upgrade hardware**
150
+ - cpu → t4 → a10g → a100
151
+
152
+ ## Cost Estimation
153
+
154
+ ### Formula
155
+
156
+ ```
157
+ Total Cost = (Hours of runtime) × (Cost per hour)
158
+ ```
159
+
160
+ ### Example Calculations
161
+
162
+ **Data processing:**
163
+ - Hardware: cpu-upgrade ($0.50/hour)
164
+ - Time: 1 hour
165
+ - Cost: $0.50
166
+
167
+ **Batch inference:**
168
+ - Hardware: a10g-large ($5/hour)
169
+ - Time: 2 hours
170
+ - Cost: $10.00
171
+
172
+ **Experiments:**
173
+ - Hardware: a10g-small ($3.50/hour)
174
+ - Time: 4 hours
175
+ - Cost: $14.00
176
+
177
+ ### Cost Optimization Tips
178
+
179
+ 1. **Start small:** Test on cpu-basic or t4-small
180
+ 2. **Monitor runtime:** Set appropriate timeouts
181
+ 3. **Optimize code:** Reduce unnecessary compute
182
+ 4. **Choose right hardware:** Don't over-provision
183
+ 5. **Use checkpoints:** Resume if job fails
184
+ 6. **Monitor costs:** Check running jobs regularly
185
+
186
+ ## Multi-GPU Workloads
187
+
188
+ Multi-GPU flavors automatically distribute workloads:
189
+
190
+ **Multi-GPU flavors:**
191
+ - `l4x4` - 4x L4 GPUs
192
+ - `a10g-largex2` - 2x A10G GPUs
193
+ - `a10g-largex4` - 4x A10G GPUs
194
+
195
+ **When to use:**
196
+ - Large models (>13B parameters)
197
+ - Need faster processing (near-linear speedup for well-parallelized workloads)
198
+ - Large datasets (>100K samples)
199
+ - Parallel workloads
200
+
201
+ **Example:**
202
+ ```python
203
+ hf_jobs("uv", {
204
+ "script": "process.py",
205
+ "flavor": "a10g-largex2", # 2 GPUs
206
+ "timeout": "4h",
207
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
208
+ })
209
+ ```
210
+
211
+ ## Choosing Between Options
212
+
213
+ ### CPU vs GPU
214
+
215
+ **Choose CPU when:**
216
+ - No GPU acceleration needed
217
+ - Data processing only
218
+ - Budget constrained
219
+ - Simple workloads
220
+
221
+ **Choose GPU when:**
222
+ - Model inference/training
223
+ - GPU-accelerated libraries
224
+ - Need faster processing
225
+ - Large models
226
+
227
+ ### a10g vs a100
228
+
229
+ **Choose a10g when:**
230
+ - Model <13B parameters
231
+ - Budget conscious
232
+ - Processing time not critical
233
+
234
+ **Choose a100 when:**
235
+ - Model 13B+ parameters
236
+ - Need fastest processing
237
+ - Memory requirements high
238
+ - Budget allows
239
+
240
+ ### Single vs Multi-GPU
241
+
242
+ **Choose single GPU when:**
243
+ - Model <7B parameters
244
+ - Budget constrained
245
+ - Simpler debugging
246
+
247
+ **Choose multi-GPU when:**
248
+ - Model >13B parameters
249
+ - Need faster processing
250
+ - Large batch sizes required
251
+ - Cost-effective for large jobs
252
+
253
+ ## Quick Reference
254
+
255
+ ```python
256
+ # Workload type → Hardware selection
257
+ HARDWARE_MAP = {
258
+ "data_processing": "cpu-upgrade",
259
+ "batch_inference_small": "t4-small",
260
+ "batch_inference_medium": "a10g-large",
261
+ "batch_inference_large": "a100-large",
262
+ "experiments": "a10g-small",
263
+ "training": "see model-trainer skill"
264
+ }
265
+ ```
266
+
references/hub_saving.md ADDED
@@ -0,0 +1,339 @@
1
+ # Saving Results to Hugging Face Hub
2
+
3
+ **⚠️ CRITICAL:** Job environments are ephemeral. ALL results are lost when a job completes unless persisted to the Hub or external storage.
4
+
5
+ ## Why Persistence is Required
6
+
7
+ When running on Hugging Face Jobs:
8
+ - Environment is temporary
9
+ - All files deleted on job completion
10
+ - No local disk persistence
11
+ - Cannot access results after job ends
12
+
13
+ **Without persistence, all work is permanently lost.**
14
+
15
+ ## Persistence Options
16
+
17
+ ### Option 1: Push to Hugging Face Hub (Recommended)
18
+
19
+ **For models:**
20
+ ```python
21
+ from transformers import AutoModel
22
+ model.push_to_hub("username/model-name", token=os.environ.get("HF_TOKEN"))
23
+ ```
24
+
25
+ **For datasets:**
26
+ ```python
27
+ from datasets import Dataset
28
+ dataset.push_to_hub("username/dataset-name", token=os.environ.get("HF_TOKEN"))
29
+ ```
30
+
31
+ **For files/artifacts:**
32
+ ```python
33
+ from huggingface_hub import HfApi
34
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
35
+ api.upload_file(
36
+ path_or_fileobj="results.json",
37
+ path_in_repo="results.json",
38
+ repo_id="username/results",
39
+ repo_type="dataset"
40
+ )
41
+ ```
42
+
43
+ ### Option 2: External Storage
44
+
45
+ **S3:**
46
+ ```python
47
+ import boto3
48
+ s3 = boto3.client('s3')
49
+ s3.upload_file('results.json', 'my-bucket', 'results.json')
50
+ ```
51
+
52
+ **Google Cloud Storage:**
53
+ ```python
54
+ from google.cloud import storage
55
+ client = storage.Client()
56
+ bucket = client.bucket('my-bucket')
57
+ blob = bucket.blob('results.json')
58
+ blob.upload_from_filename('results.json')
59
+ ```
60
+
61
+ ### Option 3: API Endpoint
62
+
63
+ ```python
64
+ import requests
65
+ requests.post("https://your-api.com/results", json=results)
66
+ ```
67
+
68
+ ## Required Configuration for Hub Push
69
+
70
+ ### Job Configuration
71
+
72
+ **Always include HF_TOKEN:**
73
+ ```python
74
+ hf_jobs("uv", {
75
+ "script": "your_script.py",
76
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Required for Hub operations
77
+ })
78
+ ```
79
+
80
+ ### Script Configuration
81
+
82
+ **Verify token exists:**
83
+ ```python
84
+ import os
85
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required for Hub operations!"
86
+ ```
87
+
88
+ **Use token for Hub operations:**
89
+ ```python
90
+ from huggingface_hub import HfApi
91
+
92
+ # Auto-detects HF_TOKEN from environment
93
+ api = HfApi()
94
+
95
+ # Or explicitly pass token
96
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
97
+ ```
98
+
99
+ ## Complete Examples
100
+
101
+ ### Example 1: Push Dataset
102
+
103
+ ```python
104
+ hf_jobs("uv", {
105
+ "script": """
106
+ # /// script
107
+ # dependencies = ["datasets", "huggingface-hub"]
108
+ # ///
109
+
110
+ import os
111
+ from datasets import Dataset
112
+ from huggingface_hub import HfApi
113
+
114
+ # Verify token
115
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
116
+
117
+ # Process data
118
+ data = {"text": ["Sample 1", "Sample 2"]}
119
+ dataset = Dataset.from_dict(data)
120
+
121
+ # Push to Hub
122
+ dataset.push_to_hub("username/my-dataset")
123
+ print("✅ Dataset pushed!")
124
+ """,
125
+ "flavor": "cpu-basic",
126
+ "timeout": "30m",
127
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
128
+ })
129
+ ```
130
+
131
+ ### Example 2: Push Model
132
+
133
+ ```python
134
+ hf_jobs("uv", {
135
+ "script": """
136
+ # /// script
137
+ # dependencies = ["transformers"]
138
+ # ///
139
+
140
+ import os
141
+ from transformers import AutoModel, AutoTokenizer
142
+
143
+ # Verify token
144
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
145
+
146
+ # Load and process model
147
+ model = AutoModel.from_pretrained("base-model")
148
+ tokenizer = AutoTokenizer.from_pretrained("base-model")
149
+ # ... process model ...
150
+
151
+ # Push to Hub
152
+ model.push_to_hub("username/my-model")
153
+ tokenizer.push_to_hub("username/my-model")
154
+ print("✅ Model pushed!")
155
+ """,
156
+ "flavor": "a10g-large",
157
+ "timeout": "2h",
158
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
159
+ })
160
+ ```
161
+
162
+ ### Example 3: Push Artifacts
163
+
164
+ ```python
165
+ hf_jobs("uv", {
166
+ "script": """
167
+ # /// script
168
+ # dependencies = ["huggingface-hub", "pandas"]
169
+ # ///
170
+
171
+ import os
172
+ import json
173
+ import pandas as pd
174
+ from huggingface_hub import HfApi
175
+
176
+ # Verify token
177
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
178
+
179
+ # Generate results
180
+ results = {"accuracy": 0.95, "loss": 0.05}
181
+ df = pd.DataFrame([results])
182
+
183
+ # Save files
184
+ with open("results.json", "w") as f:
185
+ json.dump(results, f)
186
+ df.to_csv("results.csv", index=False)
187
+
188
+ # Push to Hub
189
+ api = HfApi()
190
+ api.upload_file("results.json", "results.json", "username/results", repo_type="dataset")
191
+ api.upload_file("results.csv", "results.csv", "username/results", repo_type="dataset")
192
+ print("✅ Results pushed!")
193
+ """,
194
+ "flavor": "cpu-basic",
195
+ "timeout": "30m",
196
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
197
+ })
198
+ ```
199
+
200
+ ## Authentication Methods
201
+
202
+ ### Method 1: Automatic Token (Recommended)
203
+
204
+ ```python
205
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
206
+ ```
207
+
208
+ Uses your logged-in Hugging Face token automatically.
209
+
210
+ ### Method 2: Explicit Token
211
+
212
+ ```python
213
+ "secrets": {"HF_TOKEN": "hf_abc123..."}
214
+ ```
215
+
216
+ Provide token explicitly (not recommended for security).
217
+
218
+ ### Method 3: Environment Variable
219
+
220
+ ```python
221
+ "env": {"HF_TOKEN": "hf_abc123..."}
222
+ ```
223
+
224
+ Pass as regular environment variable (less secure than secrets).
225
+
226
+ **Always prefer Method 1** for security and convenience.
227
+
228
+ ## Verification Checklist
229
+
230
+ Before submitting any job that saves to Hub, verify:
231
+
232
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
233
+ - [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
234
+ - [ ] Hub push code included in script
235
+ - [ ] Repository name doesn't conflict with existing repos
236
+ - [ ] You have write access to the target namespace
237
+
238
+ ## Repository Setup
239
+
240
+ ### Automatic Creation
241
+
242
+ If repository doesn't exist, it's created automatically when first pushing (if token has write permissions).
243
+
244
+ ### Manual Creation
245
+
246
+ Create repository before pushing:
247
+
248
+ ```python
249
+ from huggingface_hub import HfApi
250
+
251
+ api = HfApi()
252
+ api.create_repo(
253
+ repo_id="username/repo-name",
254
+ repo_type="model", # or "dataset"
255
+ private=False, # or True for private repo
256
+ )
257
+ ```
258
+
259
+ ### Repository Naming
260
+
261
+ **Valid names:**
262
+ - `username/my-model`
263
+ - `username/model-name`
264
+ - `organization/model-name`
265
+
266
+ **Invalid names:**
267
+ - `model-name` (missing username)
268
+ - `username/model name` (spaces not allowed)
269
+ - `username/MODEL` (uppercase is allowed but discouraged; prefer lowercase)
270
+
271
+ ## Troubleshooting
272
+
273
+ ### Error: 401 Unauthorized
274
+
275
+ **Cause:** HF_TOKEN not provided or invalid
276
+
277
+ **Solutions:**
278
+ 1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
279
+ 2. Check you're logged in: `hf_whoami()`
280
+ 3. Re-login: `hf auth login`
281
+
282
+ ### Error: 403 Forbidden
283
+
284
+ **Cause:** No write access to repository
285
+
286
+ **Solutions:**
287
+ 1. Check repository namespace matches your username
288
+ 2. Verify you're a member of organization (if using org namespace)
289
+ 3. Check token has write permissions
290
+
291
+ ### Error: Repository not found
292
+
293
+ **Cause:** Repository doesn't exist and auto-creation failed
294
+
295
+ **Solutions:**
296
+ 1. Manually create repository first
297
+ 2. Check repository name format
298
+ 3. Verify namespace exists
299
+
300
+ ### Error: Push failed
301
+
302
+ **Cause:** Network issues or Hub unavailable
303
+
304
+ **Solutions:**
305
+ 1. Check logs for specific error
306
+ 2. Verify token is valid
307
+ 3. Retry push operation
308
+
309
+ ## Best Practices
310
+
311
+ 1. **Always verify token exists** before Hub operations
312
+ 2. **Use descriptive repo names** (e.g., `my-experiment-results` not `results`)
313
+ 3. **Push incrementally** for large results (use checkpoints)
314
+ 4. **Verify push success** in logs before job completes
315
+ 5. **Use appropriate repo types** (model vs dataset)
316
+ 6. **Add README** with result descriptions
317
+ 7. **Tag repos** with relevant tags
318
+
319
+ ## Monitoring Push Progress
320
+
321
+ Check logs for push progress:
322
+
323
+ ```python
324
+ hf_jobs("logs", {"job_id": "your-job-id"})
325
+ ```
326
+
327
+ **Look for:**
328
+ ```
329
+ Pushing to username/repo-name...
330
+ Upload file results.json: 100%
331
+ ✅ Push successful
332
+ ```
333
+
334
+ ## Key Takeaway
335
+
336
+ **Without `secrets={"HF_TOKEN": "$HF_TOKEN"}` and persistence code, all results are permanently lost.**
337
+
338
+ Always verify both are configured before submitting any job that produces results.
339
+
references/token_usage.md ADDED
@@ -0,0 +1,546 @@
1
+ # Token Usage Guide for Hugging Face Jobs
2
+
3
+ **⚠️ CRITICAL:** Proper token usage is essential for any job that interacts with the Hugging Face Hub.
4
+
5
+ ## Overview
6
+
7
+ Hugging Face tokens are authentication credentials that allow your jobs to interact with the Hub. They're required for:
8
+ - Pushing models/datasets to Hub
9
+ - Accessing private repositories
10
+ - Creating new repositories
11
+ - Using Hub APIs programmatically
12
+ - Any authenticated Hub operations
13
+
14
+ ## Token Types
15
+
16
+ ### Read Token
17
+ - **Permissions:** Download models/datasets, read private repos
18
+ - **Use case:** Jobs that only need to download/read content
19
+ - **Creation:** https://huggingface.co/settings/tokens
20
+
21
+ ### Write Token
22
+ - **Permissions:** Push models/datasets, create repos, modify content
23
+ - **Use case:** Jobs that need to upload results (most common)
24
+ - **Creation:** https://huggingface.co/settings/tokens
25
+ - **⚠️ Required for:** Pushing models, datasets, or any uploads
26
+
27
+ ### Organization Token
28
+ - **Permissions:** Act on behalf of an organization
29
+ - **Use case:** Jobs running under organization namespace
30
+ - **Creation:** Organization settings → Tokens
31
+
32
+ ## Providing Tokens to Jobs
33
+
34
+ ### Method 1: Automatic Token (Recommended) ⭐
35
+
36
+ ```python
37
+ hf_jobs("uv", {
38
+ "script": "your_script.py",
39
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Automatic replacement
40
+ })
41
+ ```
42
+
43
+ **How it works:**
44
+ 1. `$HF_TOKEN` is a placeholder that gets replaced with your actual token
45
+ 2. Uses the token from your logged-in session (`hf auth login`)
46
+ 3. Token is encrypted server-side when passed as a secret
47
+ 4. Most secure and convenient method
48
+
49
+ **Benefits:**
50
+ - ✅ No token exposure in code
51
+ - ✅ Uses your current login session
52
+ - ✅ Automatically updated if you re-login
53
+ - ✅ Works seamlessly with MCP tools
54
+ - ✅ Token encrypted server-side
55
+
56
+ **Requirements:**
57
+ - Must be logged in: run `hf auth login` and confirm `hf_whoami()` works
58
+ - Token must have required permissions
59
+
60
+ ### Method 2: Explicit Token (Not Recommended)
61
+
62
+ ```python
63
+ hf_jobs("uv", {
64
+ "script": "your_script.py",
65
+ "secrets": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Hardcoded token
66
+ })
67
+ ```
68
+
69
+ **When to use:**
70
+ - Only if automatic token doesn't work
71
+ - Testing with a specific token
72
+ - Organization tokens (use with caution)
73
+
74
+ **Security concerns:**
75
+ - ❌ Token visible in code/logs
76
+ - ❌ Must manually update if token rotates
77
+ - ❌ Risk of token exposure
78
+ - ❌ Not recommended for production
79
+
80
+ ### Method 3: Environment Variable (Less Secure)
81
+
82
+ ```python
83
+ hf_jobs("uv", {
84
+ "script": "your_script.py",
85
+ "env": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Less secure than secrets
86
+ })
87
+ ```
88
+
89
+ **Difference from secrets:**
90
+ - `env` variables are visible in job logs
91
+ - `secrets` are encrypted server-side
92
+ - Always prefer `secrets` for tokens
93
+
94
+ **When to use:**
95
+ - Only for non-sensitive configuration
96
+ - Never use for tokens (use `secrets` instead)
97
+
98
+ ## Using Tokens in Scripts
99
+
100
+ ### Accessing Tokens
101
+
102
+ Tokens passed via `secrets` are available as environment variables in your script:
103
+
104
+ ```python
105
+ import os
106
+
107
+ # Get token from environment
108
+ token = os.environ.get("HF_TOKEN")
109
+
110
+ # Verify token exists
111
+ if not token:
112
+ raise ValueError("HF_TOKEN not found in environment!")
113
+ ```
114
+
115
+ ### Using with Hugging Face Hub
116
+
117
+ **Option 1: Explicit token parameter**
118
+ ```python
119
+ from huggingface_hub import HfApi
120
+
121
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
122
+ api.upload_file(...)
123
+ ```
124
+
125
+ **Option 2: Auto-detection (Recommended)**
126
+ ```python
127
+ from huggingface_hub import HfApi
128
+
129
+ # Automatically uses HF_TOKEN env var
130
+ api = HfApi() # ✅ Simpler, uses token from environment
131
+ api.upload_file(...)
132
+ ```
133
+
134
+ **Option 3: With transformers/datasets**
135
+ ```python
136
+ from transformers import AutoModel
137
+ from datasets import load_dataset
138
+
139
+ # Auto-detects HF_TOKEN from environment
140
+ model = AutoModel.from_pretrained("username/model")
141
+ dataset = load_dataset("username/dataset")
142
+
143
+ # For push operations, token is auto-detected
144
+ model.push_to_hub("username/new-model")
145
+ dataset.push_to_hub("username/new-dataset")
146
+ ```
147
+
148
+ ### Complete Example
149
+
150
+ ```python
151
+ # /// script
152
+ # dependencies = ["huggingface-hub", "datasets"]
153
+ # ///
154
+
155
+ import os
156
+ from huggingface_hub import HfApi
157
+ from datasets import Dataset
158
+
159
+ # Verify token is available
160
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required for Hub operations!"
161
+
162
+ # Use token for Hub operations
163
+ api = HfApi() # Auto-detects HF_TOKEN
164
+
165
+ # Create and push dataset
166
+ data = {"text": ["Hello", "World"]}
167
+ dataset = Dataset.from_dict(data)
168
+
169
+ # Push to Hub (token auto-detected)
170
+ dataset.push_to_hub("username/my-dataset")
171
+
172
+ print("✅ Dataset pushed successfully!")
173
+ ```
174
+
175
+ ## Token Verification
176
+
177
+ ### Check Authentication Locally
178
+
179
+ ```python
180
+ from huggingface_hub import whoami
181
+
182
+ try:
183
+ user_info = whoami()
184
+ print(f"✅ Logged in as: {user_info['name']}")
185
+ except Exception as e:
186
+ print(f"❌ Not authenticated: {e}")
187
+ ```
188
+
189
+ ### Verify Token in Job
190
+
191
+ ```python
192
+ import os
193
+
194
+ # Check token exists
195
+ if "HF_TOKEN" not in os.environ:
196
+ raise ValueError("HF_TOKEN not found in environment!")
197
+
198
+ token = os.environ["HF_TOKEN"]
199
+
200
+ # Verify token format (should start with "hf_")
201
+ if not token.startswith("hf_"):
202
+ raise ValueError(f"Invalid token format: {token[:10]}...")
203
+
204
+ # Test token works
205
+ from huggingface_hub import whoami
206
+ try:
207
+ user_info = whoami(token=token)
208
+ print(f"✅ Token valid for user: {user_info['name']}")
209
+ except Exception as e:
210
+ raise ValueError(f"Token validation failed: {e}")
211
+ ```
212
+
213
+ ## Common Token Issues
214
+
215
+ ### Error: 401 Unauthorized
216
+
217
+ **Symptoms:**
218
+ ```
219
+ 401 Client Error: Unauthorized for url: https://huggingface.co/api/...
220
+ ```
221
+
222
+ **Causes:**
223
+ 1. Token missing from job
224
+ 2. Token invalid or expired
225
+ 3. Token not passed correctly
226
+
227
+ **Solutions:**
228
+ 1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
229
+ 2. Verify `hf_whoami()` works locally
230
+ 3. Re-login: `hf auth login`
231
+ 4. Check token hasn't expired
232
+
233
+ **Verification:**
234
+ ```python
235
+ # In your script
236
+ import os
237
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
238
+ ```
239
+
240
+ ### Error: 403 Forbidden
241
+
242
+ **Symptoms:**
243
+ ```
244
+ 403 Client Error: Forbidden for url: https://huggingface.co/api/...
245
+ ```
246
+
247
+ **Causes:**
248
+ 1. Token lacks required permissions (read-only token used for write)
249
+ 2. No access to private repository
250
+ 3. Organization permissions insufficient
251
+
252
+ **Solutions:**
253
+ 1. Ensure token has write permissions
254
+ 2. Check token type at https://huggingface.co/settings/tokens
255
+ 3. Verify access to target repository
256
+ 4. Use organization token if needed
257
+
258
+ **Check token permissions:**
259
+ ```python
260
+ from huggingface_hub import whoami
261
+
262
+ user_info = whoami()
263
+ print(f"User: {user_info['name']}")
264
+ print(f"Type: {user_info.get('type', 'user')}")
265
+ ```
266
+
267
+ ### Error: Token not found in environment
268
+
269
+ **Symptoms:**
270
+ ```
271
+ KeyError: 'HF_TOKEN'
272
+ ValueError: HF_TOKEN not found
273
+ ```
274
+
275
+ **Causes:**
276
+ 1. `secrets` not passed in job config
277
+ 2. Wrong key name (should be `HF_TOKEN`)
278
+ 3. Using `env` instead of `secrets`
279
+
280
+ **Solutions:**
281
+ 1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
282
+ 2. Verify key name is exactly `HF_TOKEN`
283
+ 3. Check job config syntax
284
+
285
+ **Correct configuration:**
286
+ ```python
287
+ # ✅ Correct
288
+ hf_jobs("uv", {
289
+ "script": "...",
290
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
291
+ })
292
+
293
+ # ❌ Wrong - using env instead of secrets
294
+ hf_jobs("uv", {
295
+ "script": "...",
296
+ "env": {"HF_TOKEN": "$HF_TOKEN"} # Less secure
297
+ })
298
+
299
+ # ❌ Wrong - wrong key name
300
+ hf_jobs("uv", {
301
+ "script": "...",
302
+ "secrets": {"TOKEN": "$HF_TOKEN"} # Wrong key
303
+ })
304
+ ```
305
+
306
+ ### Error: Repository access denied
307
+
308
+ **Symptoms:**
309
+ ```
310
+ 403 Client Error: Forbidden
311
+ Repository not found or access denied
312
+ ```
313
+
314
+ **Causes:**
315
+ 1. Token doesn't have access to private repo
316
+ 2. Repository doesn't exist and can't be created
317
+ 3. Wrong namespace
318
+
319
+ **Solutions:**
320
+ 1. Use token from account with access
321
+ 2. Verify repo visibility (public vs private)
322
+ 3. Check namespace matches token owner
323
+ 4. Create repo first if needed
324
+
325
+ **Check repository access:**
326
+ ```python
327
+ from huggingface_hub import HfApi
328
+
329
+ api = HfApi()
330
+ try:
331
+ repo_info = api.repo_info("username/repo-name")
332
+ print(f"✅ Access granted: {repo_info.id}")
333
+ except Exception as e:
334
+ print(f"❌ Access denied: {e}")
335
+ ```
336
+
337
+ ## Token Security Best Practices
338
+
339
+ ### 1. Never Commit Tokens
340
+
341
+ **❌ Bad:**
342
+ ```python
343
+ # Never do this!
344
+ token = "hf_abc123xyz..."
345
+ api = HfApi(token=token)
346
+ ```
347
+
348
+ **✅ Good:**
349
+ ```python
350
+ # Use environment variable
351
+ token = os.environ.get("HF_TOKEN")
352
+ api = HfApi(token=token)
353
+ ```
354
+
355
+ ### 2. Use Secrets, Not Environment Variables
356
+
357
+ **❌ Bad:**
358
+ ```python
359
+ hf_jobs("uv", {
360
+ "script": "...",
361
+ "env": {"HF_TOKEN": "$HF_TOKEN"} # Visible in logs
362
+ })
363
+ ```
364
+
365
+ **✅ Good:**
366
+ ```python
367
+ hf_jobs("uv", {
368
+ "script": "...",
369
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Encrypted server-side
370
+ })
371
+ ```
372
+
373
+ ### 3. Use Automatic Token Replacement
374
+
375
+ **❌ Bad:**
376
+ ```python
377
+ hf_jobs("uv", {
378
+ "script": "...",
379
+ "secrets": {"HF_TOKEN": "hf_abc123..."} # Hardcoded
380
+ })
381
+ ```
382
+
383
+ **✅ Good:**
384
+ ```python
385
+ hf_jobs("uv", {
386
+ "script": "...",
387
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Automatic
388
+ })
389
+ ```
390
+
391
+ ### 4. Rotate Tokens Regularly
392
+
393
+ - Generate new tokens periodically
394
+ - Revoke old tokens
395
+ - Update job configurations
396
+ - Monitor token usage
397
+
398
+ ### 5. Use Minimal Permissions
399
+
400
+ - Create tokens with only needed permissions
401
+ - Use read tokens when write isn't needed
402
+ - Don't use admin tokens for regular jobs
403
+
404
+ ### 6. Don't Share Tokens
405
+
406
+ - Each user should use their own token
407
+ - Don't commit tokens to repositories
408
+ - Don't share tokens in logs or messages
409
+
410
+ ### 7. Monitor Token Usage
411
+
412
+ - Check token activity in Hub settings
413
+ - Review job logs for token issues
414
+ - Set up alerts for unauthorized access
415
+
416
+ ## Token Workflow Examples
417
+
418
+ ### Example 1: Push Model to Hub
419
+
420
+ ```python
421
+ hf_jobs("uv", {
422
+ "script": """
423
+ # /// script
424
+ # dependencies = ["transformers"]
425
+ # ///
426
+
427
+ import os
428
+ from transformers import AutoModel, AutoTokenizer
429
+
430
+ # Verify token
431
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
432
+
433
+ # Load and process model
434
+ model = AutoModel.from_pretrained("base-model")
435
+ # ... process model ...
436
+
437
+ # Push to Hub (token auto-detected)
438
+ model.push_to_hub("username/my-model")
439
+ print("✅ Model pushed!")
440
+ """,
441
+ "flavor": "a10g-large",
442
+ "timeout": "2h",
443
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided
444
+ })
445
+ ```
446
+
447
+ ### Example 2: Access Private Dataset
448
+
449
+ ```python
450
+ hf_jobs("uv", {
451
+ "script": """
452
+ # /// script
453
+ # dependencies = ["datasets"]
454
+ # ///
455
+
456
+ import os
457
+ from datasets import load_dataset
458
+
459
+ # Verify token
460
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
461
+
462
+ # Load private dataset (token auto-detected)
463
+ dataset = load_dataset("private-org/private-dataset")
464
+ print(f"✅ Loaded {len(dataset)} examples")
465
+ """,
466
+ "flavor": "cpu-basic",
467
+ "timeout": "30m",
468
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided
469
+ })
470
+ ```
471
+
472
+ ### Example 3: Create and Push Dataset
473
+
474
+ ```python
475
+ hf_jobs("uv", {
476
+ "script": """
477
+ # /// script
478
+ # dependencies = ["datasets", "huggingface-hub"]
479
+ # ///
480
+
481
+ import os
482
+ from datasets import Dataset
483
+ from huggingface_hub import HfApi
484
+
485
+ # Verify token
486
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
487
+
488
+ # Create dataset
489
+ data = {"text": ["Sample 1", "Sample 2"]}
490
+ dataset = Dataset.from_dict(data)
491
+
492
+ # Push to Hub
493
+ api = HfApi() # Auto-detects HF_TOKEN
494
+ dataset.push_to_hub("username/my-dataset")
495
+ print("✅ Dataset pushed!")
496
+ """,
497
+ "flavor": "cpu-basic",
498
+ "timeout": "30m",
499
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided
500
+ })
501
+ ```
502
+
503
+ ## Quick Reference
504
+
505
+ ### Token Checklist
506
+
507
+ Before submitting a job that uses Hub:
508
+
509
+ - [ ] Job includes `secrets={"HF_TOKEN": "$HF_TOKEN"}`
510
+ - [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
511
+ - [ ] Token has required permissions (read/write)
512
+ - [ ] User is logged in: `hf_whoami()` works
513
+ - [ ] Token not hardcoded in script
514
+ - [ ] Using `secrets` not `env` for token
515
+
516
+ ### Common Patterns
517
+
518
+ **Pattern 1: Auto-detect token**
519
+ ```python
520
+ from huggingface_hub import HfApi
521
+ api = HfApi() # Uses HF_TOKEN from environment
522
+ ```
523
+
524
+ **Pattern 2: Explicit token**
525
+ ```python
526
+ import os
527
+ from huggingface_hub import HfApi
528
+ api = HfApi(token=os.environ.get("HF_TOKEN"))
529
+ ```
530
+
531
+ **Pattern 3: Verify token**
532
+ ```python
533
+ import os
534
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
535
+ ```
536
+
537
+ ## Key Takeaways
538
+
539
+ 1. **Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}`** for Hub operations
540
+ 2. **Never hardcode tokens** in scripts or job configs
541
+ 3. **Verify token exists** in script before Hub operations
542
+ 4. **Use auto-detection** when possible (`HfApi()` without token parameter)
543
+ 5. **Check permissions** - ensure token has required access
544
+ 6. **Monitor token usage** - review activity regularly
545
+ 7. **Rotate tokens** - generate new tokens periodically
546
+
references/troubleshooting.md ADDED
@@ -0,0 +1,431 @@
1
+ # Troubleshooting Guide
2
+
3
+ Common issues and solutions for Hugging Face Jobs.
4
+
5
+ ## Authentication Issues
6
+
7
+ ### Error: 401 Unauthorized
8
+
9
+ **Symptoms:**
10
+ ```
11
+ 401 Client Error: Unauthorized for url: https://huggingface.co/api/...
12
+ ```
13
+
14
+ **Causes:**
15
+ - Token missing from job
16
+ - Token invalid or expired
17
+ - Token not passed correctly
18
+
19
+ **Solutions:**
20
+ 1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
21
+ 2. Verify `hf_whoami()` works locally
22
+ 3. Re-login: `hf auth login`
23
+ 4. Check token hasn't expired
24
+
25
+ **Verification:**
26
+ ```python
27
+ # In your script
28
+ import os
29
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
30
+ ```
31
+
32
+ ### Error: 403 Forbidden
33
+
34
+ **Symptoms:**
35
+ ```
36
+ 403 Client Error: Forbidden for url: https://huggingface.co/api/...
37
+ ```
38
+
39
+ **Causes:**
40
+ - Token lacks required permissions
41
+ - No access to private repository
42
+ - Organization permissions insufficient
43
+
44
+ **Solutions:**
45
+ 1. Ensure token has write permissions
46
+ 2. Check token type at https://huggingface.co/settings/tokens
47
+ 3. Verify access to target repository
48
+ 4. Use organization token if needed
49
+
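+ **Quick check** (a minimal sketch; `username/target-repo` is a placeholder):
+
+ ```python
+ from huggingface_hub import HfApi, whoami
+
+ print(f"Authenticated as: {whoami()['name']}")  # uses HF_TOKEN from the environment
+
+ # Confirm the token can actually see the target repo before pushing
+ try:
+     HfApi().repo_info("username/target-repo", repo_type="dataset")
+     print("✅ Repository is accessible")
+ except Exception as e:
+     print(f"❌ Cannot access repository: {e}")
+ ```
+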
50
+ ### Error: Token not found in environment
51
+
52
+ **Symptoms:**
53
+ ```
54
+ KeyError: 'HF_TOKEN'
55
+ ValueError: HF_TOKEN not found
56
+ ```
57
+
58
+ **Causes:**
59
+ - `secrets` not passed in job config
60
+ - Wrong key name (should be `HF_TOKEN`)
61
+ - Using `env` instead of `secrets`
62
+
63
+ **Solutions:**
64
+ 1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
65
+ 2. Verify key name is exactly `HF_TOKEN`
66
+ 3. Check job config syntax
67
+
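+ **Correct configuration:**
+
+ ```python
+ # ✅ Use `secrets` (encrypted server-side), key name exactly HF_TOKEN
+ hf_jobs("uv", {
+     "script": "your_script.py",
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"}
+ })
+ ```
+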
68
+ ## Job Execution Issues
69
+
70
+ ### Error: Job Timeout
71
+
72
+ **Symptoms:**
73
+ - Job stops unexpectedly
74
+ - Status shows "TIMEOUT"
75
+ - Partial results only
76
+
77
+ **Causes:**
78
+ - Default 30min timeout exceeded
79
+ - Job takes longer than expected
80
+ - No timeout specified
81
+
82
+ **Solutions:**
83
+ 1. Check logs for actual runtime
84
+ 2. Increase timeout with buffer: `"timeout": "3h"`
85
+ 3. Optimize code for faster execution
86
+ 4. Process data in chunks
87
+ 5. Add 20-30% buffer to estimated time
88
+
89
+ **Example:**
90
+ ```python
91
+ hf_jobs("uv", {
92
+ "script": "...",
93
+ "timeout": "2h" # Set appropriate timeout
94
+ })
95
+ ```
96
+
97
+ ### Error: Out of Memory (OOM)
98
+
99
+ **Symptoms:**
100
+ ```
101
+ RuntimeError: CUDA out of memory
102
+ MemoryError: Unable to allocate array
103
+ ```
104
+
105
+ **Causes:**
106
+ - Batch size too large
107
+ - Model too large for hardware
108
+ - Insufficient GPU memory
109
+
110
+ **Solutions:**
111
+ 1. Reduce batch size
112
+ 2. Process data in smaller chunks
113
+ 3. Upgrade hardware: cpu → t4 → a10g → a100
114
+ 4. Use smaller models or quantization
115
+ 5. Enable gradient checkpointing (for training)
116
+
117
+ **Example:**
118
+ ```python
119
+ # Reduce batch size
120
+ batch_size = 1
121
+
122
+ # Process in chunks
123
+ for chunk in chunks:
124
+ process(chunk)
125
+ ```
126
+
127
+ ### Error: Missing Dependencies
128
+
129
+ **Symptoms:**
130
+ ```
131
+ ModuleNotFoundError: No module named 'package_name'
132
+ ImportError: cannot import name 'X'
133
+ ```
134
+
135
+ **Causes:**
136
+ - Package not in dependencies
137
+ - Wrong package name
138
+ - Version mismatch
139
+
140
+ **Solutions:**
141
+ 1. Add to PEP 723 header:
142
+ ```python
143
+ # /// script
144
+ # dependencies = ["package-name>=1.0.0"]
145
+ # ///
146
+ ```
147
+ 2. Check package name spelling
148
+ 3. Specify version if needed
149
+ 4. Check package availability
150
+
151
+ ### Error: Script Not Found
152
+
153
+ **Symptoms:**
154
+ ```
155
+ FileNotFoundError: script.py not found
156
+ ```
157
+
158
+ **Causes:**
159
+ - Local file path used (not supported)
160
+ - URL incorrect
161
+ - Script not accessible
162
+
163
+ **Solutions:**
164
+ 1. Use inline script (recommended)
165
+ 2. Use publicly accessible URL
166
+ 3. Upload script to Hub first
167
+ 4. Check URL is correct
168
+
169
+ **Correct approaches:**
170
+ ```python
171
+ # ✅ Inline code
172
+ hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})
173
+
174
+ # ✅ From URL
175
+ hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
176
+ ```
177
+
178
+ ## Hub Push Issues
179
+
180
+ ### Error: Push Failed
181
+
182
+ **Symptoms:**
183
+ ```
184
+ Error pushing to Hub
185
+ Upload failed
186
+ ```
187
+
188
+ **Causes:**
189
+ - Network issues
190
+ - Token missing or invalid
191
+ - Repository access denied
192
+ - File too large
193
+
194
+ **Solutions:**
195
+ 1. Check token: `assert "HF_TOKEN" in os.environ`
196
+ 2. Verify repository exists or can be created
197
+ 3. Check network connectivity in logs
198
+ 4. Retry push operation (see the retry sketch below)
199
+ 5. Split large files into chunks
200
+
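+ **Retry sketch** (simple backoff around the push; the retry count and the commented repo name are illustrative):
+
+ ```python
+ import time
+
+ def push_with_retries(dataset, repo_id, max_retries=3):
+     """Retry push_to_hub a few times before giving up on transient errors."""
+     for attempt in range(1, max_retries + 1):
+         try:
+             dataset.push_to_hub(repo_id)
+             print(f"✅ Push succeeded on attempt {attempt}")
+             return
+         except Exception as e:
+             print(f"⚠️ Push attempt {attempt} failed: {e}")
+             if attempt == max_retries:
+                 raise
+             time.sleep(2 ** attempt)  # back off: 2s, 4s, 8s...
+
+ # push_with_retries(dataset, "username/my-dataset")
+ ```
+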
201
+ ### Error: Repository Not Found
202
+
203
+ **Symptoms:**
204
+ ```
205
+ 404 Client Error: Not Found
206
+ Repository not found
207
+ ```
208
+
209
+ **Causes:**
210
+ - Repository doesn't exist
211
+ - Wrong repository name
212
+ - No access to private repo
213
+
214
+ **Solutions:**
215
+ 1. Create repository first:
216
+ ```python
217
+ from huggingface_hub import HfApi
218
+ api = HfApi()
219
+ api.create_repo("username/repo-name", repo_type="dataset")
220
+ ```
221
+ 2. Check repository name format
222
+ 3. Verify namespace exists
223
+ 4. Check repository visibility
224
+
225
+ ### Error: Results Not Saved
226
+
227
+ **Symptoms:**
228
+ - Job completes successfully
229
+ - No results visible on Hub
230
+ - Files not persisted
231
+
232
+ **Causes:**
233
+ - No persistence code in script
234
+ - Push code not executed
235
+ - Push failed silently
236
+
237
+ **Solutions:**
238
+ 1. Add persistence code to script
239
+ 2. Verify push executes successfully
240
+ 3. Check logs for push errors
241
+ 4. Add error handling around push
242
+
243
+ **Example:**
244
+ ```python
245
+ try:
246
+ dataset.push_to_hub("username/dataset")
247
+ print("✅ Push successful")
248
+ except Exception as e:
249
+ print(f"❌ Push failed: {e}")
250
+ raise
251
+ ```
252
+
253
+ ## Hardware Issues
254
+
255
+ ### Error: GPU Not Available
256
+
257
+ **Symptoms:**
258
+ ```
259
+ CUDA not available
260
+ No GPU found
261
+ ```
262
+
263
+ **Causes:**
264
+ - CPU flavor used instead of GPU
265
+ - GPU not requested
266
+ - CUDA not installed in image
267
+
268
+ **Solutions:**
269
+ 1. Use GPU flavor: `"flavor": "a10g-large"`
270
+ 2. Check image has CUDA support
271
+ 3. Verify GPU availability in logs
272
+
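+ **Verification sketch** (add near the top of the script to fail fast; assumes `torch` is among your dependencies):
+
+ ```python
+ import torch
+
+ if not torch.cuda.is_available():
+     raise RuntimeError("No GPU found - request a GPU flavor such as 'a10g-large'")
+ print(f"✅ Using GPU: {torch.cuda.get_device_name(0)}")
+ ```
+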
273
+ ### Error: Slow Performance
274
+
275
+ **Symptoms:**
276
+ - Job takes longer than expected
277
+ - Low GPU utilization
278
+ - CPU bottleneck
279
+
280
+ **Causes:**
281
+ - Wrong hardware selected
282
+ - Inefficient code
283
+ - Data loading bottleneck
284
+
285
+ **Solutions:**
286
+ 1. Upgrade hardware
287
+ 2. Optimize code
288
+ 3. Use batch processing
289
+ 4. Profile code to find bottlenecks
290
+
291
+ ## General Issues
292
+
293
+ ### Error: Job Status Unknown
294
+
295
+ **Symptoms:**
296
+ - Can't check job status
297
+ - Status API returns error
298
+
299
+ **Solutions:**
300
+ 1. Use job URL: `https://huggingface.co/jobs/username/job-id`
301
+ 2. Check logs: `hf_jobs("logs", {"job_id": "..."})`
302
+ 3. Inspect job: `hf_jobs("inspect", {"job_id": "..."})`
303
+
304
+ ### Error: Logs Not Available
305
+
306
+ **Symptoms:**
307
+ - No logs visible
308
+ - Logs delayed
309
+
310
+ **Causes:**
311
+ - Job just started (logs delayed 30-60s)
312
+ - Job failed before logging
313
+ - Logs not yet generated
314
+
315
+ **Solutions:**
316
+ 1. Wait 30-60 seconds after job start
317
+ 2. Check job status first
318
+ 3. Use job URL for web interface
319
+
320
+ ### Error: Cost Unexpectedly High
321
+
322
+ **Symptoms:**
323
+ - Job costs more than expected
324
+ - Longer runtime than estimated
325
+
326
+ **Causes:**
327
+ - Job ran longer than timeout
328
+ - Wrong hardware selected
329
+ - Inefficient code
330
+
331
+ **Solutions:**
332
+ 1. Monitor job runtime
333
+ 2. Set appropriate timeout
334
+ 3. Optimize code
335
+ 4. Choose right hardware
336
+ 5. Check cost estimates before running
337
+
338
+ ## Debugging Tips
339
+
340
+ ### 1. Add Logging
341
+
342
+ ```python
343
+ import logging
344
+ logging.basicConfig(level=logging.INFO)
345
+ logger = logging.getLogger(__name__)
346
+
347
+ logger.info("Starting processing...")
348
+ logger.info(f"Processed {count} items")
349
+ ```
350
+
351
+ ### 2. Verify Environment
352
+
353
+ ```python
354
+ import os
+ import sys
+ import torch
+
+ print(f"Python version: {sys.version}")
+ print(f"CUDA available: {torch.cuda.is_available()}")
357
+ print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
358
+ ```
359
+
360
+ ### 3. Test Locally First
361
+
362
+ Run script locally before submitting to catch errors early:
363
+ ```bash
364
+ python script.py
365
+ ```
366
+
367
+ ### 4. Check Job Logs
368
+
369
+ ```python
370
+ # View logs
371
+ hf_jobs("logs", {"job_id": "your-job-id"})
372
+
373
+ # Or use job URL
374
+ # https://huggingface.co/jobs/username/job-id
375
+ ```
376
+
377
+ ### 5. Add Error Handling
378
+
379
+ ```python
380
+ try:
381
+ # Your code
382
+ process_data()
383
+ except Exception as e:
384
+ print(f"Error: {e}")
385
+ import traceback
386
+ traceback.print_exc()
387
+ raise
388
+ ```
389
+
390
+ ## Quick Reference
391
+
392
+ ### Common Error Codes
393
+
394
+ | Code | Meaning | Solution |
395
+ |------|---------|----------|
396
+ | 401 | Unauthorized | Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` |
397
+ | 403 | Forbidden | Check token permissions |
398
+ | 404 | Not Found | Verify repository exists |
399
+ | 500 | Server Error | Retry or contact support |
400
+
401
+ ### Checklist Before Submitting
402
+
403
+ - [ ] Token configured: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
404
+ - [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
405
+ - [ ] Timeout set appropriately
406
+ - [ ] Hardware selected correctly
407
+ - [ ] Dependencies listed in PEP 723 header
408
+ - [ ] Persistence code included
409
+ - [ ] Error handling added
410
+ - [ ] Logging added for debugging
411
+
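+ A minimal job submission that ticks the boxes above (the script body, repo name, flavor, and timeout are illustrative):
+
+ ```python
+ hf_jobs("uv", {
+     "script": """
+ # /// script
+ # dependencies = ["datasets"]          # dependencies in PEP 723 header
+ # ///
+ import logging, os
+ from datasets import Dataset
+
+ logging.basicConfig(level=logging.INFO)                 # logging for debugging
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"    # token check
+
+ try:
+     Dataset.from_dict({"text": ["hello"]}).push_to_hub("username/results")  # persistence
+     print("✅ Push successful")
+ except Exception as e:
+     print(f"❌ Push failed: {e}")
+     raise
+ """,
+     "flavor": "cpu-basic",                 # hardware selected
+     "timeout": "30m",                      # timeout set
+     "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # token configured
+ })
+ ```
+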
412
+ ## Getting Help
413
+
414
+ If issues persist:
415
+
416
+ 1. **Check logs** - Most errors include detailed messages
417
+ 2. **Review documentation** - See main SKILL.md
418
+ 3. **Check Hub status** - https://status.huggingface.co
419
+ 4. **Community forums** - https://discuss.huggingface.co
420
+ 5. **GitHub issues** - For bugs in huggingface_hub
421
+
422
+ ## Key Takeaways
423
+
424
+ 1. **Always include token** - `secrets={"HF_TOKEN": "$HF_TOKEN"}`
425
+ 2. **Set appropriate timeout** - Default 30min may be insufficient
426
+ 3. **Verify persistence** - Results won't persist without code
427
+ 4. **Check logs** - Most issues visible in job logs
428
+ 5. **Test locally** - Catch errors before submitting
429
+ 6. **Add error handling** - Better debugging information
430
+ 7. **Monitor costs** - Set timeouts to avoid unexpected charges
431
+
scripts/cot-self-instruct.py ADDED
@@ -0,0 +1,718 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "datasets",
5
+ # "transformers",
6
+ # "vllm>=0.6.5",
7
+ # "huggingface-hub[hf_transfer]",
8
+ # "torch",
9
+ # "numpy",
10
+ # "tqdm",
11
+ # "scikit-learn",
12
+ # ]
13
+ # ///
14
+ """
15
+ Generate high-quality synthetic data using Chain-of-Thought Self-Instruct methodology.
16
+
17
+ This script implements the CoT-Self-Instruct approach from the paper "CoT-Self-Instruct:
18
+ Building high-quality synthetic prompts for reasoning and non-reasoning tasks" (2025).
19
+
20
+ It supports two modes:
21
+ 1. Reasoning tasks: Generates both questions and answers with Chain-of-Thought
22
+ 2. Instruction tasks: Generates diverse prompts for general instruction following
23
+
24
+ Example usage:
25
+ # Reasoning tasks with Answer-Consistency filtering
26
+ uv run cot-self-instruct.py \\
27
+ --seed-dataset davanstrien/s1k-reasoning \\
28
+ --output-dataset username/synthetic-math \\
29
+ --task-type reasoning \\
30
+ --num-samples 5000 \\
31
+ --filter-method answer-consistency
32
+
33
+ # Instruction tasks with RIP filtering
34
+ uv run cot-self-instruct.py \\
35
+ --seed-dataset wildchat-filtered \\
36
+ --output-dataset username/synthetic-prompts \\
37
+ --task-type instruction \\
38
+ --filter-method rip \\
39
+ --reward-model Nexusflow/Athene-RM-8B
40
+
41
+ # HF Jobs execution
42
+ hf jobs uv run --flavor l4x4 \\
43
+ --image vllm/vllm-openai \\
44
+ -e HF_TOKEN=$(python3 -c "from huggingface_hub import get_token; print(get_token())") \\
45
+ https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
46
+ [args...]
47
+ """
48
+
49
+ import argparse
50
+ import json
51
+ import logging
52
+ import os
53
+ import random
54
+ import re
55
+ import sys
56
+ from collections import Counter
57
+ from datetime import datetime
58
+ from typing import Dict, List, Optional, Tuple, Union
59
+
60
+ import numpy as np
61
+ import torch
62
+ from datasets import Dataset, load_dataset
63
+ from huggingface_hub import DatasetCard, login
64
+ from sklearn.cluster import KMeans
65
+ from tqdm.auto import tqdm
66
+ from transformers import AutoTokenizer
67
+ from vllm import LLM, SamplingParams
68
+
69
+ # Enable HF Transfer for faster downloads
70
+ os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
71
+
72
+ logging.basicConfig(
73
+ level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
74
+ )
75
+ logger = logging.getLogger(__name__)
76
+
77
+ # Prompt templates from the paper
78
+ REASONING_PROMPT_TEMPLATE = """You are a reasoning question generator assistant. Your goal is to create a novel, and challenging reasoning question. You are provided the following seed questions:
79
+ Seed Question 1: {seed1}
80
+ Seed Question 2: {seed2}
81
+ Your task is to:
82
+ 1. Write a brand-new, self-contained reasoning question that meets the following requirements:
83
+ (a) The question draws inspiration from the seed question without copying it verbatim, remaining novel and of comparable difficulty.
84
+ (b) The question's final answer should be a single, unambiguous scalar value (e.g., an integer, reduced fraction, exact radical), or another answer type that can be verified in one step (e.g., 'yes/no,' a choice from A to D).
85
+ 2. Then reason step by step, solve the new question and format your output as follows:
86
+ [New Question Begin]{{your_generated_question}}[New Question End]
87
+ [Final Answer to New Question Begin]\\boxed{{your_final_answer}}[Final Answer to New Question End]"""
88
+
89
+ INSTRUCTION_PROMPT_TEMPLATE = """You are a prompt generator assistant. Your goal is to create diverse and creative synthetic prompts.
90
+ Please follow the steps below to create synthetic prompts.
91
+ Step 1: Carefully read #Prompt 1# and #Prompt 2#. Identify and list all the common elements between these two prompts. If no common elements are found, list the main elements from each prompt.
92
+ Step 2: Develop a comprehensive plan based on the #Common Elements List# or #Main Elements List# from Step 1. This plan will guide the generation of new synthetic prompts that are similar to the original prompts.
93
+ Step 3: Execute the plan step by step and provide one #Synthetic Prompt#.
94
+ Please reply strictly in the following format:
95
+ - Step 1 #Common Elements List# or #Main Elements List#:
96
+ - Step 2 #Plan#:
97
+ - Step 3 #Synthetic Prompt#:
98
+ #Prompt 1#:
99
+ {prompt1}
100
+ #Prompt 2#:
101
+ {prompt2}"""
102
+
103
+
104
+ def check_gpu_availability() -> int:
105
+ """Check if CUDA is available and return the number of GPUs."""
106
+ if not torch.cuda.is_available():
107
+ logger.error("CUDA is not available. This script requires a GPU.")
108
+ logger.error(
109
+ "Please run on a machine with NVIDIA GPU or use HF Jobs with GPU flavor."
110
+ )
111
+ sys.exit(1)
112
+
113
+ num_gpus = torch.cuda.device_count()
114
+ for i in range(num_gpus):
115
+ gpu_name = torch.cuda.get_device_name(i)
116
+ gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
117
+ logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.1f} GB memory")
118
+
119
+ return num_gpus
120
+
121
+
122
+ def parse_thinking_output(text: str) -> str:
123
+ """Remove thinking tokens from model output."""
124
+ # Remove <think>...</think> blocks
125
+ text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
126
+ return text.strip()
127
+
128
+
129
+ def extract_reasoning_output(text: str) -> Tuple[Optional[str], Optional[str]]:
130
+ """Extract question and answer from reasoning task output."""
131
+ text = parse_thinking_output(text)
132
+
133
+ # Extract question
134
+ question_match = re.search(r'\[New Question Begin\](.*?)\[New Question End\]', text, re.DOTALL)
135
+ if not question_match:
136
+ return None, None
137
+ question = question_match.group(1).strip()
138
+
139
+ # Extract answer
140
+ answer_match = re.search(r'\[Final Answer to New Question Begin\]\\?boxed\{(.*?)\}\[Final Answer to New Question End\]', text, re.DOTALL)
141
+ if not answer_match:
142
+ # Try without \boxed
143
+ answer_match = re.search(r'\[Final Answer to New Question Begin\](.*?)\[Final Answer to New Question End\]', text, re.DOTALL)
144
+
145
+ if not answer_match:
146
+ return question, None
147
+
148
+ answer = answer_match.group(1).strip()
149
+ return question, answer
150
+
151
+
152
+ def extract_instruction_output(text: str) -> Optional[str]:
153
+ """Extract synthetic prompt from instruction task output."""
154
+ text = parse_thinking_output(text)
155
+
156
+ # Look for the synthetic prompt after "Step 3 #Synthetic Prompt#:"
157
+ match = re.search(r'Step 3 #Synthetic Prompt#:\s*(.+)', text, re.DOTALL)
158
+ if match:
159
+ return match.group(1).strip()
160
+ return None
161
+
162
+
163
+ def categorize_prompts(prompts: List[str], num_categories: int = 8) -> Dict[int, List[int]]:
164
+ """Categorize prompts using clustering for instruction tasks."""
165
+ from transformers import AutoModel
166
+
167
+ logger.info(f"Categorizing {len(prompts)} prompts into {num_categories} categories...")
168
+
169
+ # Use a small model for embeddings
170
+ tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
171
+ model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
172
+
173
+ # Get embeddings
174
+ embeddings = []
175
+ for prompt in tqdm(prompts, desc="Computing embeddings"):
176
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
177
+ with torch.no_grad():
178
+ outputs = model(**inputs)
179
+ embedding = outputs.last_hidden_state.mean(dim=1).numpy()
180
+ embeddings.append(embedding[0])
181
+
182
+ # Cluster
183
+ kmeans = KMeans(n_clusters=num_categories, random_state=42)
184
+ labels = kmeans.fit_predict(embeddings)
185
+
186
+ # Group by category
187
+ categories = {}
188
+ for idx, label in enumerate(labels):
189
+ if label not in categories:
190
+ categories[label] = []
191
+ categories[label].append(idx)
192
+
193
+ return categories
194
+
195
+
196
+ def generate_synthetic_data(
197
+ llm: LLM,
198
+ seed_data: List[Dict],
199
+ task_type: str,
200
+ num_samples: int,
201
+ categories: Optional[Dict[int, List[int]]] = None,
202
+ ) -> List[Dict]:
203
+ """Generate synthetic data using CoT-Self-Instruct."""
204
+ synthetic_data = []
205
+
206
+ # Set up progress bar
207
+ pbar = tqdm(total=num_samples, desc="Generating synthetic data")
208
+
209
+ while len(synthetic_data) < num_samples:
210
+ # Sample seed data
211
+ if task_type == "reasoning":
212
+ # Random sampling for reasoning tasks
213
+ seeds = random.sample(seed_data, min(2, len(seed_data)))
214
+ prompt = REASONING_PROMPT_TEMPLATE.format(
215
+ seed1=seeds[0].get("question", seeds[0].get("prompt", "")),
216
+ seed2=seeds[1].get("question", seeds[1].get("prompt", "")) if len(seeds) > 1 else seeds[0].get("question", seeds[0].get("prompt", ""))
217
+ )
218
+ else:
219
+ # Category-aware sampling for instruction tasks
220
+ if categories:
221
+ # Pick a random category
222
+ category = random.choice(list(categories.keys()))
223
+ category_indices = categories[category]
224
+ indices = random.sample(category_indices, min(2, len(category_indices)))
225
+ seeds = [seed_data[i] for i in indices]
226
+ else:
227
+ seeds = random.sample(seed_data, min(2, len(seed_data)))
228
+
229
+ prompt = INSTRUCTION_PROMPT_TEMPLATE.format(
230
+ prompt1=seeds[0].get("prompt", seeds[0].get("question", "")),
231
+ prompt2=seeds[1].get("prompt", seeds[1].get("question", "")) if len(seeds) > 1 else seeds[0].get("prompt", seeds[0].get("question", ""))
232
+ )
233
+
234
+ # Generate
235
+ sampling_params = SamplingParams(
236
+ temperature=0.7 if task_type == "reasoning" else 0.8,
237
+ top_p=0.95 if task_type == "reasoning" else 0.9,
238
+ max_tokens=2048,
239
+ )
240
+
241
+ outputs = llm.generate([prompt], sampling_params)
242
+ output_text = outputs[0].outputs[0].text
243
+
244
+ # Parse output
245
+ if task_type == "reasoning":
246
+ question, answer = extract_reasoning_output(output_text)
247
+ if question and answer:
248
+ synthetic_data.append({
249
+ "question": question,
250
+ "answer": answer,
251
+ "seed_indices": [seed_data.index(s) for s in seeds],
252
+ })
253
+ pbar.update(1)
254
+ else:
255
+ synthetic_prompt = extract_instruction_output(output_text)
256
+ if synthetic_prompt:
257
+ synthetic_data.append({
258
+ "prompt": synthetic_prompt,
259
+ "seed_indices": [seed_data.index(s) for s in seeds],
260
+ })
261
+ pbar.update(1)
262
+
263
+ pbar.close()
264
+ return synthetic_data
265
+
266
+
267
+ def answer_consistency_filter(
268
+ llm: LLM,
269
+ synthetic_data: List[Dict],
270
+ k_responses: int = 16,
271
+ threshold: float = 0.5,
272
+ ) -> List[Dict]:
273
+ """Filter reasoning tasks using Answer-Consistency."""
274
+ logger.info(f"Applying Answer-Consistency filter with K={k_responses}")
275
+
276
+ filtered_data = []
277
+
278
+ for item in tqdm(synthetic_data, desc="Answer-Consistency filtering"):
279
+ question = item["question"]
280
+ original_answer = item["answer"]
281
+
282
+ # Generate K responses
283
+ prompts = [question] * k_responses
284
+ sampling_params = SamplingParams(
285
+ temperature=0.6,
286
+ top_p=0.95,
287
+ max_tokens=1024,
288
+ )
289
+
290
+ outputs = llm.generate(prompts, sampling_params)
291
+
292
+ # Extract answers
293
+ answers = []
294
+ for output in outputs:
295
+ text = output.outputs[0].text
296
+ # Try to extract boxed answer
297
+ match = re.search(r'\\boxed\{(.*?)\}', text)
298
+ if match:
299
+ answers.append(match.group(1).strip())
300
+
301
+ if not answers:
302
+ continue
303
+
304
+ # Get majority answer
305
+ answer_counts = Counter(answers)
306
+ if answer_counts:
307
+ majority_answer, count = answer_counts.most_common(1)[0]
308
+
309
+ # Check if majority answer matches original and meets threshold
310
+ if (majority_answer == original_answer and
311
+ count / len(answers) >= threshold):
312
+ item["consistency_score"] = count / len(answers)
313
+ filtered_data.append(item)
314
+
315
+ logger.info(f"Answer-Consistency: kept {len(filtered_data)}/{len(synthetic_data)} examples")
316
+ return filtered_data
317
+
318
+
319
+ def rip_filter(
320
+ llm: LLM,
321
+ synthetic_data: List[Dict],
322
+ reward_model_id: str,
323
+ k_responses: int = 32,
324
+ threshold: float = 0.5,
325
+ ) -> List[Dict]:
326
+ """Filter using Rejecting Instruction Preferences (RIP)."""
327
+ logger.info(f"Applying RIP filter with K={k_responses} and reward model {reward_model_id}")
328
+
329
+ # Note: In a full implementation, you would load and use the actual reward model
330
+ # For this example, we'll use a placeholder scoring mechanism
331
+ logger.warning("RIP filtering requires a reward model implementation - using placeholder")
332
+
333
+ filtered_data = []
334
+
335
+ for item in tqdm(synthetic_data, desc="RIP filtering"):
336
+ prompt = item.get("prompt", item.get("question", ""))
337
+
338
+ # Generate K responses
339
+ prompts = [prompt] * k_responses
340
+ sampling_params = SamplingParams(
341
+ temperature=1.0,
342
+ top_p=1.0,
343
+ max_tokens=1024,
344
+ )
345
+
346
+ outputs = llm.generate(prompts, sampling_params)
347
+
348
+ # In real implementation: score each response with reward model
349
+ # For now, use length as a proxy (longer responses often score higher)
350
+ scores = [len(output.outputs[0].text) for output in outputs]
351
+
352
+ # Use minimum score as quality indicator
353
+ min_score = min(scores) if scores else 0
354
+ normalized_score = min_score / 1000 # Normalize to 0-1 range
355
+
356
+ if normalized_score >= threshold:
357
+ item["rip_score"] = normalized_score
358
+ filtered_data.append(item)
359
+
360
+ logger.info(f"RIP filter: kept {len(filtered_data)}/{len(synthetic_data)} examples")
361
+ return filtered_data
362
+
363
+
364
+ def create_dataset_card(
365
+ task_type: str,
366
+ source_dataset: str,
367
+ generation_model: str,
368
+ filter_method: str,
369
+ num_generated: int,
370
+ num_filtered: int,
371
+ generation_time: str,
372
+ additional_info: Dict = None,
373
+ ) -> str:
374
+ """Create a comprehensive dataset card."""
375
+ filter_info = ""
376
+ if filter_method == "answer-consistency":
377
+ filter_info = """
378
+ ### Answer-Consistency Filtering
379
+
380
+ This dataset was filtered using Answer-Consistency:
381
+ - Generated K responses for each synthetic question
382
+ - Kept only examples where majority answer matched the generated answer
383
+ - Ensures high-quality, correctly solved problems"""
384
+ elif filter_method == "rip":
385
+ filter_info = """
386
+ ### RIP (Rejecting Instruction Preferences) Filtering
387
+
388
+ This dataset was filtered using RIP:
389
+ - Generated K responses for each synthetic prompt
390
+ - Scored responses using a reward model
391
+ - Kept only prompts with high minimum scores"""
392
+
393
+ return f"""---
394
+ tags:
395
+ - synthetic-data
396
+ - cot-self-instruct
397
+ - {task_type}
398
+ - uv-script
399
+ ---
400
+
401
+ # CoT-Self-Instruct Synthetic Data
402
+
403
+ This dataset contains synthetic {task_type} data generated using the Chain-of-Thought Self-Instruct methodology.
404
+
405
+ ## Generation Details
406
+
407
+ - **Source Dataset**: [{source_dataset}](https://huggingface.co/datasets/{source_dataset})
408
+ - **Generation Model**: [{generation_model}](https://huggingface.co/{generation_model})
409
+ - **Task Type**: {task_type}
410
+ - **Filter Method**: {filter_method}
411
+ - **Generated Examples**: {num_generated:,}
412
+ - **After Filtering**: {num_filtered:,} ({(num_filtered/num_generated)*100:.1f}% acceptance rate)
413
+ - **Generation Date**: {generation_time}
414
+ {filter_info}
415
+
416
+ ## Methodology
417
+
418
+ Generated using CoT-Self-Instruct, which:
419
+ 1. Uses Chain-of-Thought reasoning to analyze seed examples
420
+ 2. Generates new synthetic examples of similar quality and complexity
421
+ 3. Applies quality filtering to ensure high-quality outputs
422
+
423
+ Based on the paper: "CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks" (2025)
424
+
425
+ ## Generation Script
426
+
427
+ Generated using the CoT-Self-Instruct script from [uv-scripts/synthetic-data](https://huggingface.co/datasets/uv-scripts/synthetic-data).
428
+
429
+ To reproduce:
430
+ ```bash
431
+ uv run https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
432
+ --seed-dataset {source_dataset} \\
433
+ --output-dataset <your-dataset> \\
434
+ --task-type {task_type} \\
435
+ --generation-model {generation_model} \\
436
+ --filter-method {filter_method}
437
+ ```
438
+ """
439
+
440
+
441
+ def main():
442
+ parser = argparse.ArgumentParser(
443
+ description="Generate synthetic data using CoT-Self-Instruct",
444
+ formatter_class=argparse.RawDescriptionHelpFormatter,
445
+ epilog=__doc__,
446
+ )
447
+
448
+ # Dataset arguments
449
+ parser.add_argument(
450
+ "--seed-dataset",
451
+ type=str,
452
+ required=True,
453
+ help="HuggingFace dataset ID containing seed examples",
454
+ )
455
+ parser.add_argument(
456
+ "--output-dataset",
457
+ type=str,
458
+ required=True,
459
+ help="HuggingFace dataset ID for output",
460
+ )
461
+
462
+ # Task configuration
463
+ parser.add_argument(
464
+ "--task-type",
465
+ type=str,
466
+ choices=["reasoning", "instruction", "auto"],
467
+ default="auto",
468
+ help="Type of task (reasoning generates Q&A, instruction generates prompts)",
469
+ )
470
+ parser.add_argument(
471
+ "--task-column",
472
+ type=str,
473
+ default=None,
474
+ help="Column name containing tasks (auto-detected if not specified)",
475
+ )
476
+
477
+ # Model configuration
478
+ parser.add_argument(
479
+ "--generation-model",
480
+ type=str,
481
+ default="Qwen/Qwen3-30B-A3B-Thinking-2507",
482
+ help="Model for synthetic data generation",
483
+ )
484
+ parser.add_argument(
485
+ "--filter-model",
486
+ type=str,
487
+ default=None,
488
+ help="Model for filtering (defaults to generation model)",
489
+ )
490
+ parser.add_argument(
491
+ "--reward-model",
492
+ type=str,
493
+ default="Nexusflow/Athene-RM-8B",
494
+ help="Reward model for RIP filtering",
495
+ )
496
+
497
+ # Generation parameters
498
+ parser.add_argument(
499
+ "--num-samples",
500
+ type=int,
501
+ default=5000,
502
+ help="Number of synthetic examples to generate",
503
+ )
504
+ parser.add_argument(
505
+ "--batch-size",
506
+ type=int,
507
+ default=1,
508
+ help="Batch size for generation",
509
+ )
510
+
511
+ # Filtering parameters
512
+ parser.add_argument(
513
+ "--filter-method",
514
+ type=str,
515
+ choices=["answer-consistency", "rip", "both", "none"],
516
+ default="answer-consistency",
517
+ help="Quality filtering method",
518
+ )
519
+ parser.add_argument(
520
+ "--k-responses",
521
+ type=int,
522
+ default=16,
523
+ help="Number of responses for filtering",
524
+ )
525
+ parser.add_argument(
526
+ "--quality-threshold",
527
+ type=float,
528
+ default=0.5,
529
+ help="Minimum quality threshold for filtering",
530
+ )
531
+
532
+ # GPU configuration
533
+ parser.add_argument(
534
+ "--tensor-parallel-size",
535
+ type=int,
536
+ default=None,
537
+ help="Number of GPUs for tensor parallelism (auto-detected if not set)",
538
+ )
539
+ parser.add_argument(
540
+ "--gpu-memory-utilization",
541
+ type=float,
542
+ default=0.9,
543
+ help="GPU memory utilization",
544
+ )
545
+
546
+ # Other arguments
547
+ parser.add_argument(
548
+ "--hf-token",
549
+ type=str,
550
+ default=None,
551
+ help="HuggingFace API token",
552
+ )
553
+ parser.add_argument(
554
+ "--seed",
555
+ type=int,
556
+ default=42,
557
+ help="Random seed",
558
+ )
559
+
560
+ args = parser.parse_args()
561
+
562
+ # Set random seeds
563
+ random.seed(args.seed)
564
+ np.random.seed(args.seed)
565
+ torch.manual_seed(args.seed)
566
+
567
+ # Check GPU
568
+ num_gpus = check_gpu_availability()
569
+ tensor_parallel_size = args.tensor_parallel_size or num_gpus
570
+
571
+ # Authentication
572
+ hf_token = args.hf_token or os.environ.get("HF_TOKEN")
573
+ if hf_token:
574
+ login(token=hf_token)
575
+
576
+ # Load seed dataset
577
+ logger.info(f"Loading seed dataset: {args.seed_dataset}")
578
+ seed_dataset = load_dataset(args.seed_dataset, split="train")
579
+
580
+ # Auto-detect task type and column if needed
581
+ if args.task_type == "auto":
582
+ columns = seed_dataset.column_names
583
+ if "question" in columns and "answer" in columns:
584
+ args.task_type = "reasoning"
585
+ logger.info("Auto-detected task type: reasoning")
586
+ else:
587
+ args.task_type = "instruction"
588
+ logger.info("Auto-detected task type: instruction")
589
+
590
+ if not args.task_column:
591
+ if args.task_type == "reasoning":
592
+ args.task_column = "question"
593
+ else:
594
+ # Try to find prompt column
595
+ for col in ["prompt", "instruction", "text", "input"]:
596
+ if col in seed_dataset.column_names:
597
+ args.task_column = col
598
+ break
599
+
600
+ logger.info(f"Using task column: {args.task_column}")
601
+
602
+ # Convert to list of dicts
603
+ seed_data = seed_dataset.to_list()
604
+
605
+ # Categorize prompts for instruction tasks
606
+ categories = None
607
+ if args.task_type == "instruction" and len(seed_data) > 100:
608
+ prompts = [item.get(args.task_column, "") for item in seed_data]
609
+ categories = categorize_prompts(prompts)
610
+
611
+ # Initialize generation model
612
+ logger.info(f"Loading generation model: {args.generation_model}")
613
+ generation_llm = LLM(
614
+ model=args.generation_model,
615
+ tensor_parallel_size=tensor_parallel_size,
616
+ gpu_memory_utilization=args.gpu_memory_utilization,
617
+ )
618
+
619
+ # Generate synthetic data
620
+ start_time = datetime.now()
621
+ synthetic_data = generate_synthetic_data(
622
+ generation_llm,
623
+ seed_data,
624
+ args.task_type,
625
+ args.num_samples,
626
+ categories,
627
+ )
628
+
629
+ # Apply filtering
630
+ filter_llm = generation_llm
631
+ if args.filter_model and args.filter_model != args.generation_model:
632
+ logger.info(f"Loading filter model: {args.filter_model}")
633
+ # Clean up generation model
634
+ del generation_llm
635
+ torch.cuda.empty_cache()
636
+
637
+ filter_llm = LLM(
638
+ model=args.filter_model,
639
+ tensor_parallel_size=tensor_parallel_size,
640
+ gpu_memory_utilization=args.gpu_memory_utilization,
641
+ )
642
+
643
+ filtered_data = synthetic_data
644
+ if args.filter_method != "none":
645
+ if args.filter_method == "answer-consistency" and args.task_type == "reasoning":
646
+ filtered_data = answer_consistency_filter(
647
+ filter_llm,
648
+ synthetic_data,
649
+ args.k_responses,
650
+ args.quality_threshold,
651
+ )
652
+ elif args.filter_method == "rip":
653
+ filtered_data = rip_filter(
654
+ filter_llm,
655
+ synthetic_data,
656
+ args.reward_model,
657
+ args.k_responses,
658
+ args.quality_threshold,
659
+ )
660
+ elif args.filter_method == "both":
661
+ if args.task_type == "reasoning":
662
+ filtered_data = answer_consistency_filter(
663
+ filter_llm,
664
+ synthetic_data,
665
+ args.k_responses,
666
+ args.quality_threshold,
667
+ )
668
+ filtered_data = rip_filter(
669
+ filter_llm,
670
+ filtered_data,
671
+ args.reward_model,
672
+ args.k_responses,
673
+ args.quality_threshold,
674
+ )
675
+
676
+ # Create HuggingFace dataset
677
+ logger.info(f"Creating dataset with {len(filtered_data)} examples")
678
+ dataset = Dataset.from_list(filtered_data)
679
+
680
+ # Create dataset card
681
+ generation_time = start_time.strftime("%Y-%m-%d %H:%M:%S UTC")
682
+ dataset_card = create_dataset_card(
683
+ args.task_type,
684
+ args.seed_dataset,
685
+ args.generation_model,
686
+ args.filter_method,
687
+ len(synthetic_data),
688
+ len(filtered_data),
689
+ generation_time,
690
+ )
691
+
692
+ # Push to hub
693
+ logger.info(f"Pushing dataset to: {args.output_dataset}")
694
+ # Create dataset card
695
+ card = DatasetCard(dataset_card)
696
+ dataset.push_to_hub(args.output_dataset)
697
+ # Push card separately
698
+ card.push_to_hub(args.output_dataset)
699
+
700
+ logger.info("Done! Dataset available at: https://huggingface.co/datasets/" + args.output_dataset)
701
+
702
+ # Print example HF Jobs command if running locally
703
+ if len(sys.argv) > 1:
704
+ print("\nTo run on HF Jobs:")
705
+ print(f"""hf jobs uv run --flavor l4x4 \\
706
+ --image vllm/vllm-openai \\
707
+ -e HF_TOKEN=$(python3 -c "from huggingface_hub import get_token; print(get_token())") \\
708
+ https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
709
+ --seed-dataset {args.seed_dataset} \\
710
+ --output-dataset {args.output_dataset} \\
711
+ --task-type {args.task_type} \\
712
+ --generation-model {args.generation_model} \\
713
+ --filter-method {args.filter_method} \\
714
+ --num-samples {args.num_samples}""")
715
+
716
+
717
+ if __name__ == "__main__":
718
+ main()
scripts/finepdfs-stats.py ADDED
@@ -0,0 +1,546 @@
1
+ # /// script
2
+ # requires-python = ">=3.12"
3
+ # dependencies = [
4
+ # "polars>=1.31.0",
5
+ # "huggingface-hub",
6
+ # "datasets",
7
+ # "ascii-graph",
8
+ # ]
9
+ # ///
10
+ """
11
+ Analyze educational quality trends across CommonCrawl dumps using Polars streaming.
12
+
13
+ Answers: "Is the web getting more educational over time?"
14
+
15
+ Demonstrates Polars HF Hub integration - process 50M+ docs without downloading 300GB+.
16
+
17
+ Example usage:
18
+ # Analyze English PDFs (default)
19
+ uv run finepdfs-stats.py
20
+
21
+ # Analyze all 70+ languages
22
+ uv run finepdfs-stats.py --all-languages
23
+
24
+ # Quick test
25
+ uv run finepdfs-stats.py --limit 10000 --show-plan
26
+
27
+ # Save results to HF Hub
28
+ uv run finepdfs-stats.py --output-repo username/finepdfs-temporal-stats
29
+
30
+ # Run on HF Jobs
31
+ hf jobs uv run \\
32
+ -s HF_TOKEN \\
33
+ -e HF_XET_HIGH_PERFORMANCE=1 \\
34
+ https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\
35
+ -- --output-repo username/stats
36
+ """
37
+
38
+ import argparse
39
+ import logging
40
+ import os
41
+ import sys
42
+ import time
43
+ from pathlib import Path
44
+
45
+ import polars as pl
46
+ from ascii_graph import Pyasciigraph
47
+ from datasets import Dataset
48
+ from huggingface_hub import HfApi, create_repo, list_repo_tree, login
49
+
50
+ logging.basicConfig(
51
+ level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
52
+ )
53
+ logger = logging.getLogger(__name__)
54
+
55
+ # Common language+script codes for finepdfs-edu
56
+ COMMON_LANGUAGES = {
57
+ "eng_Latn": "English (Latin script)",
58
+ "fra_Latn": "French (Latin script)",
59
+ "deu_Latn": "German (Latin script)",
60
+ "spa_Latn": "Spanish (Latin script)",
61
+ "por_Latn": "Portuguese (Latin script)",
62
+ "ita_Latn": "Italian (Latin script)",
63
+ "nld_Latn": "Dutch (Latin script)",
64
+ "pol_Latn": "Polish (Latin script)",
65
+ "rus_Cyrl": "Russian (Cyrillic script)",
66
+ "zho_Hans": "Chinese (Simplified)",
67
+ "zho_Hant": "Chinese (Traditional)",
68
+ "jpn_Jpan": "Japanese",
69
+ "kor_Hang": "Korean",
70
+ "ara_Arab": "Arabic",
71
+ "hin_Deva": "Hindi (Devanagari)",
72
+ }
73
+
74
+
75
+ def list_available_languages(dataset_id: str) -> list[str]:
76
+ """List available language subsets in the dataset."""
77
+ try:
78
+ tree = list_repo_tree(dataset_id, path_in_repo="data", repo_type="dataset")
79
+ languages = [
80
+ item.path.replace("data/", "")
81
+ for item in tree
82
+ if item.path.startswith("data/")
83
+ and "/" not in item.path.replace("data/", "")
84
+ ]
85
+ return sorted(languages)
86
+ except Exception as e:
87
+ logger.warning(f"Could not list languages: {e}")
88
+ return list(COMMON_LANGUAGES.keys())
89
+
90
+
91
+ def compute_temporal_stats(df: pl.LazyFrame, output_path: Path) -> pl.DataFrame:
92
+ """Single scan: compute stats grouped by dump for temporal analysis."""
93
+ query = df.group_by("dump").agg(
94
+ pl.len().alias("doc_count"),
95
+ pl.col("token_count").sum().alias("total_tokens"),
96
+ pl.col("fw_edu_scores").list.mean().mean().alias("avg_edu_score"),
97
+ (pl.col("fw_edu_scores").list.mean() >= 3).sum().alias("high_edu_count"),
98
+ )
99
+ query.sink_parquet(output_path, engine="streaming")
100
+ return pl.read_parquet(output_path)
101
+
102
+
103
+ def compute_global_stats(temporal: pl.DataFrame) -> pl.DataFrame:
104
+ """Compute global stats from temporal breakdown."""
105
+ total = temporal["doc_count"].sum()
106
+ return pl.DataFrame(
107
+ {
108
+ "total_docs": [total],
109
+ "total_tokens": [temporal["total_tokens"].sum()],
110
+ "avg_edu_score": [
111
+ (temporal["avg_edu_score"] * temporal["doc_count"]).sum() / total
112
+ ],
113
+ "high_edu_rate": [temporal["high_edu_count"].sum() / total],
114
+ "num_dumps": [len(temporal)],
115
+ }
116
+ )
117
+
118
+
119
+ def format_temporal_stats(temporal: pl.DataFrame) -> pl.DataFrame:
120
+ """Format temporal stats with high_edu_rate, sorted chronologically."""
121
+ return (
122
+ temporal.with_columns(
123
+ (pl.col("high_edu_count") / pl.col("doc_count")).alias("high_edu_rate")
124
+ )
125
+ .select(["dump", "doc_count", "avg_edu_score", "high_edu_rate"])
126
+ .sort(
127
+ "dump"
128
+ ) # Chronological order (CC-MAIN-2017-xx comes before CC-MAIN-2024-xx)
129
+ )
130
+
131
+
132
+ def create_ascii_charts(temporal_stats: pl.DataFrame) -> str:
133
+ """Create ASCII bar charts showing temporal trends."""
134
+ # Extract year from dump name (CC-MAIN-2024-42 -> 2024)
135
+ # Group by year and average the values for cleaner display
136
+ yearly = (
137
+ temporal_stats.with_columns(
138
+ pl.col("dump").str.extract(r"CC-MAIN-(\d{4})", 1).alias("year")
139
+ )
140
+ .group_by("year")
141
+ .agg(
142
+ pl.col("doc_count").sum(),
143
+ pl.col("avg_edu_score").mean(),
144
+ pl.col("high_edu_rate").mean(),
145
+ )
146
+ .sort("year")
147
+ )
148
+
149
+ lines = []
150
+
151
+ # High edu rate chart (more dramatic differences)
152
+ data_rate = [
153
+ (row["year"], row["high_edu_rate"] * 100)
154
+ for row in yearly.iter_rows(named=True)
155
+ ]
156
+ graph = Pyasciigraph(line_length=60, float_format="{0:.1f}%")
157
+ lines.extend(graph.graph("High Educational Content (edu >= 3)", data_rate))
158
+
159
+ lines.append("")
160
+
161
+ # Avg edu score chart
162
+ data_score = [
163
+ (row["year"], row["avg_edu_score"]) for row in yearly.iter_rows(named=True)
164
+ ]
165
+ graph2 = Pyasciigraph(line_length=60, float_format="{0:.2f}")
166
+ lines.extend(graph2.graph("Average Educational Score", data_score))
167
+
168
+ return "\n".join(lines)
169
+
170
+
171
+ def create_readme(
172
+ args,
173
+ global_stats: pl.DataFrame,
174
+ temporal_stats: pl.DataFrame,
175
+ scan_time: float,
176
+ ascii_charts: str,
177
+ ) -> str:
178
+ """Create README content for the stats dataset."""
179
+ stats = global_stats.to_dicts()[0]
180
+ total_docs = stats.get("total_docs", 0)
181
+ docs_per_sec = total_docs / scan_time if scan_time > 0 else 0
182
+
183
+ # Get first and last year averages for trend (more representative than single dumps)
184
+ yearly = (
185
+ temporal_stats.with_columns(
186
+ pl.col("dump").str.extract(r"CC-MAIN-(\d{4})", 1).alias("year")
187
+ )
188
+ .group_by("year")
189
+ .agg(
190
+ pl.col("doc_count").sum(),
191
+ pl.col("avg_edu_score").mean(),
192
+ pl.col("high_edu_rate").mean(),
193
+ )
194
+ .sort("year")
195
+ )
196
+ first_year = yearly.head(1).to_dicts()[0]
197
+ last_year = yearly.tail(1).to_dicts()[0]
198
+
199
+ scope = (
200
+ "all languages"
201
+ if args.all_languages
202
+ else COMMON_LANGUAGES.get(args.lang, args.lang)
203
+ )
204
+
205
+ return f"""---
206
+ tags:
207
+ - uv-script
208
+ - statistics
209
+ - polars
210
+ - finepdfs-edu
211
+ - temporal-analysis
212
+ license: odc-by
213
+ configs:
214
+ - config_name: global_stats
215
+ data_files: global_stats/train-*.parquet
216
+ - config_name: temporal_stats
217
+ data_files: temporal_stats/train-*.parquet
218
+ default_viewer_config: temporal_stats
219
+ ---
220
+
221
+ # Is the Web Getting More Educational?
222
+
223
+ Temporal analysis of educational quality in **{scope}** across {stats.get("num_dumps", 0)} CommonCrawl dumps.
224
+
225
+ ## Trend
226
+
227
+ ```
228
+ {ascii_charts}
229
+ ```
230
+
231
+ ## Key Finding
232
+
233
+ | Year | Avg Edu Score | High Edu Rate |
234
+ |------|---------------|---------------|
235
+ | {first_year["year"]} | {first_year["avg_edu_score"]:.2f} | {first_year["high_edu_rate"] * 100:.1f}% |
236
+ | {last_year["year"]} | {last_year["avg_edu_score"]:.2f} | {last_year["high_edu_rate"] * 100:.1f}% |
237
+
238
+ ## Performance
239
+
240
+ - **{total_docs:,} documents** processed in **{scan_time:.0f} seconds**
241
+ - **{docs_per_sec:,.0f} docs/sec** using Polars streaming
242
+ - Single scan, no full dataset download required
243
+
244
+ ## Summary
245
+
246
+ | Metric | Value |
247
+ |--------|-------|
248
+ | Scope | {scope} |
249
+ | Total Documents | {total_docs:,} |
250
+ | Total Tokens | {stats.get("total_tokens", 0):,} |
251
+ | Avg Edu Score | {stats.get("avg_edu_score", 0):.3f} |
252
+ | High Edu Rate | {stats.get("high_edu_rate", 0) * 100:.1f}% |
253
+ | CommonCrawl Dumps | {stats.get("num_dumps", 0)} |
254
+
255
+ ## Files
256
+
257
+ - `global_stats` - Overall summary
258
+ - `temporal_stats` - Per-dump breakdown (sorted chronologically)
259
+
260
+ ## Reproduce
261
+
262
+ ```bash
263
+ uv run https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\
264
+ {"--all-languages" if args.all_languages else f"--lang {args.lang}"} --output-repo your-username/stats
265
+ ```
266
+
267
+ ## Source
268
+
269
+ - **Dataset**: [{args.source_dataset}](https://huggingface.co/datasets/{args.source_dataset})
270
+ - **Script**: [uv-scripts/dataset-stats](https://huggingface.co/datasets/uv-scripts/dataset-stats)
271
+ """
272
+
273
+
274
+ def main():
275
+ parser = argparse.ArgumentParser(
276
+ description="Analyze educational quality trends across CommonCrawl dumps",
277
+ formatter_class=argparse.RawDescriptionHelpFormatter,
278
+ epilog=__doc__,
279
+ )
280
+
281
+ parser.add_argument(
282
+ "--source-dataset",
283
+ type=str,
284
+ default="HuggingFaceFW/finepdfs-edu",
285
+ help="Source dataset (default: HuggingFaceFW/finepdfs-edu)",
286
+ )
287
+
288
+ parser.add_argument(
289
+ "--lang",
290
+ type=str,
291
+ default="eng_Latn",
292
+ help="Language+script code (default: eng_Latn)",
293
+ )
294
+
295
+ parser.add_argument(
296
+ "--all-languages",
297
+ action="store_true",
298
+ help="Analyze all languages (70+) instead of single language",
299
+ )
300
+
301
+ parser.add_argument(
302
+ "--show-plan",
303
+ action="store_true",
304
+ help="Show Polars query plan (demonstrates optimization)",
305
+ )
306
+
307
+ parser.add_argument(
308
+ "--list-languages",
309
+ action="store_true",
310
+ help="List available languages and exit",
311
+ )
312
+
313
+ parser.add_argument(
314
+ "--limit",
315
+ type=int,
316
+ help="Limit to first N rows (for testing)",
317
+ )
318
+
319
+ parser.add_argument(
320
+ "--output-repo",
321
+ type=str,
322
+ help="HuggingFace dataset repository to upload results",
323
+ )
324
+
325
+ parser.add_argument(
326
+ "--output-dir",
327
+ type=str,
328
+ default="./stats_output",
329
+ help="Local directory for output files",
330
+ )
331
+
332
+ parser.add_argument(
333
+ "--hf-token",
334
+ type=str,
335
+ help="HuggingFace API token (or set HF_TOKEN env var)",
336
+ )
337
+
338
+ parser.add_argument(
339
+ "--private",
340
+ action="store_true",
341
+ help="Make the output dataset private",
342
+ )
343
+
344
+ args = parser.parse_args()
345
+
346
+ # Check for high-performance mode
347
+ if os.environ.get("HF_XET_HIGH_PERFORMANCE"):
348
+ logger.info("High-performance mode enabled (HF_XET_HIGH_PERFORMANCE=1)")
349
+
350
+ # List languages mode
351
+ if args.list_languages:
352
+ print(f"Available language+script codes for {args.source_dataset}:\n")
353
+ print("Common languages:")
354
+ for code, name in COMMON_LANGUAGES.items():
355
+ print(f" {code:12} - {name}")
356
+ print("\nFetching full list from HF Hub...")
357
+ all_langs = list_available_languages(args.source_dataset)
358
+ print(f"\nAll available ({len(all_langs)} total):")
359
+ for lang in all_langs[:30]: # Show first 30
360
+ name = COMMON_LANGUAGES.get(lang, "")
361
+ print(f" {lang:12} {name}")
362
+ if len(all_langs) > 30:
363
+ print(f" ... and {len(all_langs) - 30} more")
364
+ sys.exit(0)
365
+
366
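+     # Polars scans hf:// parquet paths lazily; only the columns the query touches are downloaded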
+ # Build the parquet path
367
+ if args.all_languages:
368
+ source_path = f"hf://datasets/{args.source_dataset}/data/*/train/*.parquet"
369
+ scope_desc = "all languages"
370
+ else:
371
+ source_path = (
372
+ f"hf://datasets/{args.source_dataset}/data/{args.lang}/train/*.parquet"
373
+ )
374
+ scope_desc = f"{args.lang} ({COMMON_LANGUAGES.get(args.lang, 'unknown')})"
375
+
376
+ logger.info(f"Scanning: {source_path}")
377
+ logger.info(f"Scope: {scope_desc}")
378
+
379
+ # Create lazy frame - this doesn't load any data yet!
380
+ logger.info("Creating lazy query plan...")
381
+ df = pl.scan_parquet(source_path)
382
+
383
+ # Apply limit if specified
384
+ if args.limit:
385
+ logger.info(f"Limiting to first {args.limit:,} rows")
386
+ df = df.head(args.limit)
387
+
388
+ # Show query plan if requested
389
+ if args.show_plan:
390
+ # Build a sample query to show the plan
391
+ sample_query = df.select(
392
+ pl.len(),
393
+ pl.col("token_count").sum(),
394
+ pl.col("language").n_unique(),
395
+ )
396
+ print("\nQuery Plan (showing Polars optimization):")
397
+ print("=" * 60)
398
+ print(sample_query.explain())
399
+ print("=" * 60)
400
+ print("\nNote: Polars uses projection pushdown - only reads columns needed!")
401
+ print("The 'text' column is never loaded, making this very fast.\n")
402
+
403
+ # Create output directory
404
+ output_dir = Path(args.output_dir)
405
+ output_dir.mkdir(parents=True, exist_ok=True)
406
+
407
+ # Single scan: compute temporal stats
408
+ logger.info("Computing temporal stats (single scan)...")
409
+ start = time.perf_counter()
410
+ temporal_path = output_dir / "temporal_stats.parquet"
411
+ temporal_raw = compute_temporal_stats(df, temporal_path)
412
+ scan_time = time.perf_counter() - start
413
+ logger.info(f"Scan complete in {scan_time:.2f}s - {len(temporal_raw)} dumps")
414
+
415
+ # Compute stats
416
+ global_stats = compute_global_stats(temporal_raw)
417
+ temporal_stats = format_temporal_stats(temporal_raw)
418
+
419
+ # Save
420
+ global_stats.write_parquet(output_dir / "global_stats.parquet")
421
+ temporal_stats.write_parquet(output_dir / "temporal_stats.parquet")
422
+
423
+ # Print results
424
+ total_docs = global_stats["total_docs"][0]
425
+ docs_per_sec = total_docs / scan_time if scan_time > 0 else 0
426
+
427
+ print("\n" + "=" * 70)
428
+ print("IS THE WEB GETTING MORE EDUCATIONAL?")
429
+ print("=" * 70)
430
+
431
+ print(f"\nScope: {scope_desc}")
432
+ print(f"Dataset: {args.source_dataset}")
433
+
434
+ print("\n" + "-" * 70)
435
+ print("GLOBAL STATS")
436
+ print("-" * 70)
437
+ print(global_stats)
438
+
439
+ print("\n" + "-" * 70)
440
+ print(f"TEMPORAL TREND ({len(temporal_stats)} CommonCrawl dumps)")
441
+ print("-" * 70)
442
+ # Show first 5 and last 5
443
+ if len(temporal_stats) > 10:
444
+ print("Earliest dumps:")
445
+ print(temporal_stats.head(5))
446
+ print("\n...")
447
+ print("\nLatest dumps:")
448
+ print(temporal_stats.tail(5))
449
+ else:
450
+ print(temporal_stats)
451
+
452
+ # Create ASCII charts
453
+ ascii_charts = create_ascii_charts(temporal_stats)
454
+ print("\n" + "-" * 70)
455
+ print("TREND VISUALIZATION")
456
+ print("-" * 70)
457
+ print(ascii_charts)
458
+
459
+ print("\n" + "-" * 70)
460
+ print("PERFORMANCE")
461
+ print("-" * 70)
462
+ print(f"Scan time: {scan_time:.2f}s")
463
+ print(f"Documents: {total_docs:,}")
464
+ print(f"Throughput: {docs_per_sec:,.0f} docs/sec")
465
+
466
+ logger.info(f"Results saved to: {output_dir}")
467
+
468
+ # Upload to HF Hub if requested
469
+ if args.output_repo:
470
+ hf_token = args.hf_token or os.environ.get("HF_TOKEN")
471
+ if hf_token:
472
+ login(token=hf_token)
473
+
474
+ api = HfApi(token=hf_token)
475
+
476
+ logger.info(f"Creating/updating dataset repository: {args.output_repo}")
477
+ create_repo(
478
+ args.output_repo,
479
+ repo_type="dataset",
480
+ private=args.private,
481
+ token=hf_token,
482
+ exist_ok=True,
483
+ )
484
+
485
+ # Upload each as a dataset config
486
+ configs = [
487
+ ("global_stats", global_stats),
488
+ ("temporal_stats", temporal_stats),
489
+ ]
490
+
491
+ for config_name, stats_df in configs:
492
+ logger.info(f"Uploading {config_name}...")
493
+ ds = Dataset.from_polars(stats_df)
494
+ ds.push_to_hub(
495
+ args.output_repo,
496
+ config_name=config_name,
497
+ token=hf_token,
498
+ private=args.private,
499
+ )
500
+ time.sleep(1) # Avoid 409 conflicts
501
+
502
+ # Upload README
503
+ readme_content = create_readme(
504
+ args, global_stats, temporal_stats, scan_time, ascii_charts
505
+ )
506
+ api.upload_file(
507
+ path_or_fileobj=readme_content.encode(),
508
+ path_in_repo="README.md",
509
+ repo_id=args.output_repo,
510
+ repo_type="dataset",
511
+ token=hf_token,
512
+ )
513
+
514
+ dataset_url = f"https://huggingface.co/datasets/{args.output_repo}"
515
+ logger.info(f"Dataset uploaded: {dataset_url}")
516
+ print(f"\nResults uploaded to: {dataset_url}")
517
+
518
+
519
+ if __name__ == "__main__":
520
+ if len(sys.argv) == 1:
521
+ print("Is the Web Getting More Educational?")
522
+ print("=" * 40)
523
+ print("\nAnalyze educational quality trends across CommonCrawl dumps")
524
+ print("using Polars streaming - no download needed!\n")
525
+ print("Example commands:\n")
526
+ print("# Quick test:")
527
+ print("uv run finepdfs-stats.py --limit 10000\n")
528
+ print("# Analyze English PDFs:")
529
+ print("uv run finepdfs-stats.py\n")
530
+ print("# Analyze ALL 70+ languages:")
531
+ print("uv run finepdfs-stats.py --all-languages\n")
532
+ print("# Show query plan (see Polars optimization):")
533
+ print("uv run finepdfs-stats.py --show-plan --limit 1000\n")
534
+ print("# Save results to HF Hub:")
535
+ print("uv run finepdfs-stats.py --output-repo username/temporal-stats\n")
536
+ print("# Run on HF Jobs:")
537
+ print("hf jobs uv run \\")
538
+ print(" -s HF_TOKEN \\")
539
+ print(" -e HF_XET_HIGH_PERFORMANCE=1 \\")
540
+ print(
541
+ " https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\"
542
+ )
543
+ print(" -- --output-repo username/stats")
544
+ sys.exit(0)
545
+
546
+ main()
scripts/generate-responses.py ADDED
@@ -0,0 +1,587 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "datasets",
5
+ # "flashinfer-python",
6
+ # "huggingface-hub[hf_transfer]",
7
+ #     "hf-xet>=1.1.7",
8
+ # "torch",
9
+ # "transformers",
10
+ # "vllm>=0.8.5",
11
+ # ]
12
+ #
13
+ # ///
14
+ """
15
+ Generate responses for prompts in a dataset using vLLM for efficient GPU inference.
16
+
17
+ This script loads a dataset from Hugging Face Hub containing chat-formatted messages,
18
+ applies the model's chat template, generates responses using vLLM, and saves the
19
+ results back to the Hub with a comprehensive dataset card.
20
+
21
+ Example usage:
22
+ # Local execution with auto GPU detection
23
+ uv run generate-responses.py \\
24
+ username/input-dataset \\
25
+ username/output-dataset \\
26
+ --messages-column messages
27
+
28
+ # With custom model and sampling parameters
29
+ uv run generate-responses.py \\
30
+ username/input-dataset \\
31
+ username/output-dataset \\
32
+ --model-id meta-llama/Llama-3.1-8B-Instruct \\
33
+ --temperature 0.9 \\
34
+ --top-p 0.95 \\
35
+ --max-tokens 2048
36
+
37
+ # HF Jobs execution (see script output for full command)
38
+ hf jobs uv run --flavor a100x4 ...
39
+ """
40
+
41
+ import argparse
42
+ import logging
43
+ import os
44
+ import sys
45
+ from datetime import datetime
46
+ from typing import Optional
47
+ 
+ # Enable HF Transfer for faster downloads (must be set before huggingface_hub is imported)
+ os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
+ 
48
+ from datasets import load_dataset
49
+ from huggingface_hub import DatasetCard, get_token, login
50
+ from torch import cuda
51
+ from tqdm.auto import tqdm
52
+ from transformers import AutoTokenizer
53
+ from vllm import LLM, SamplingParams
54
+
55
57
+
58
+ logging.basicConfig(
59
+ level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
60
+ )
61
+ logger = logging.getLogger(__name__)
62
+
63
+
64
+ def check_gpu_availability() -> int:
65
+ """Check if CUDA is available and return the number of GPUs."""
66
+ if not cuda.is_available():
67
+ logger.error("CUDA is not available. This script requires a GPU.")
68
+ logger.error(
69
+ "Please run on a machine with NVIDIA GPU or use HF Jobs with GPU flavor."
70
+ )
71
+ sys.exit(1)
72
+
73
+ num_gpus = cuda.device_count()
74
+ for i in range(num_gpus):
75
+ gpu_name = cuda.get_device_name(i)
76
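+         # total_memory is reported in bytes; convert to gibibytes for display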
+ gpu_memory = cuda.get_device_properties(i).total_memory / 1024**3
77
+ logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.1f} GB memory")
78
+
79
+ return num_gpus
80
+
81
+
82
+ def create_dataset_card(
83
+ source_dataset: str,
84
+ model_id: str,
85
+ messages_column: str,
86
+ prompt_column: Optional[str],
87
+ sampling_params: SamplingParams,
88
+ tensor_parallel_size: int,
89
+ num_examples: int,
90
+ generation_time: str,
91
+ num_skipped: int = 0,
92
+ max_model_len_used: Optional[int] = None,
93
+ ) -> str:
94
+ """Create a comprehensive dataset card documenting the generation process."""
95
+ filtering_section = ""
96
+ if num_skipped > 0:
97
+ skip_percentage = (num_skipped / num_examples) * 100
98
+ processed = num_examples - num_skipped
99
+ filtering_section = f"""
100
+
101
+ ### Filtering Statistics
102
+
103
+ - **Total Examples**: {num_examples:,}
104
+ - **Processed**: {processed:,} ({100 - skip_percentage:.1f}%)
105
+ - **Skipped (too long)**: {num_skipped:,} ({skip_percentage:.1f}%)
106
+ - **Max Model Length Used**: {max_model_len_used:,} tokens
107
+
108
+ Note: Prompts exceeding the maximum model length were skipped and have empty responses."""
109
+
110
+ return f"""---
111
+ tags:
112
+ - generated
113
+ - vllm
114
+ - uv-script
115
+ ---
116
+
117
+ # Generated Responses Dataset
118
+
119
+ This dataset contains generated responses for prompts from [{source_dataset}](https://huggingface.co/datasets/{source_dataset}).
120
+
121
+ ## Generation Details
122
+
123
+ - **Source Dataset**: [{source_dataset}](https://huggingface.co/datasets/{source_dataset})
124
+ - **Input Column**: `{prompt_column if prompt_column else messages_column}` ({"plain text prompts" if prompt_column else "chat messages"})
125
+ - **Model**: [{model_id}](https://huggingface.co/{model_id})
126
+ - **Number of Examples**: {num_examples:,}
127
+ - **Generation Date**: {generation_time}{filtering_section}
128
+
129
+ ### Sampling Parameters
130
+
131
+ - **Temperature**: {sampling_params.temperature}
132
+ - **Top P**: {sampling_params.top_p}
133
+ - **Top K**: {sampling_params.top_k}
134
+ - **Min P**: {sampling_params.min_p}
135
+ - **Max Tokens**: {sampling_params.max_tokens}
136
+ - **Repetition Penalty**: {sampling_params.repetition_penalty}
137
+
138
+ ### Hardware Configuration
139
+
140
+ - **Tensor Parallel Size**: {tensor_parallel_size}
141
+ - **GPU Configuration**: {tensor_parallel_size} GPU(s)
142
+
143
+ ## Dataset Structure
144
+
145
+ The dataset contains all columns from the source dataset plus:
146
+ - `response`: The generated response from the model
147
+
148
+ ## Generation Script
149
+
150
+ Generated using the vLLM inference script from [uv-scripts/vllm](https://huggingface.co/datasets/uv-scripts/vllm).
151
+
152
+ To reproduce this generation:
153
+
154
+ ```bash
155
+ uv run https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
156
+ {source_dataset} \\
157
+ <output-dataset> \\
158
+ --model-id {model_id} \\
159
+ {"--prompt-column " + prompt_column if prompt_column else "--messages-column " + messages_column} \\
160
+ --temperature {sampling_params.temperature} \\
161
+ --top-p {sampling_params.top_p} \\
162
+ --top-k {sampling_params.top_k} \\
163
+ --max-tokens {sampling_params.max_tokens}{f" \\\\\\n --max-model-len {max_model_len_used}" if max_model_len_used else ""}
164
+ ```
165
+ """
166
+
167
+
168
+ def main(
169
+ src_dataset_hub_id: str,
170
+ output_dataset_hub_id: str,
171
+ model_id: str = "Qwen/Qwen3-30B-A3B-Instruct-2507",
172
+ messages_column: str = "messages",
173
+ prompt_column: Optional[str] = None,
174
+ output_column: str = "response",
175
+ temperature: float = 0.7,
176
+ top_p: float = 0.8,
177
+ top_k: int = 20,
178
+ min_p: float = 0.0,
179
+ max_tokens: int = 16384,
180
+ repetition_penalty: float = 1.0,
181
+ gpu_memory_utilization: float = 0.90,
182
+ max_model_len: Optional[int] = None,
183
+ tensor_parallel_size: Optional[int] = None,
184
+ skip_long_prompts: bool = True,
185
+ max_samples: Optional[int] = None,
186
+ hf_token: Optional[str] = None,
187
+ ):
188
+ """
189
+ Main generation pipeline.
190
+
191
+ Args:
192
+ src_dataset_hub_id: Input dataset on Hugging Face Hub
193
+ output_dataset_hub_id: Where to save results on Hugging Face Hub
194
+ model_id: Hugging Face model ID for generation
195
+ messages_column: Column name containing chat messages
196
+ prompt_column: Column name containing plain text prompts (alternative to messages_column)
197
+ output_column: Column name for generated responses
198
+ temperature: Sampling temperature
199
+ top_p: Top-p sampling parameter
200
+ top_k: Top-k sampling parameter
201
+ min_p: Minimum probability threshold
202
+ max_tokens: Maximum tokens to generate
203
+ repetition_penalty: Repetition penalty parameter
204
+ gpu_memory_utilization: GPU memory utilization factor
205
+ max_model_len: Maximum model context length (None uses model default)
206
+ tensor_parallel_size: Number of GPUs to use (auto-detect if None)
207
+ skip_long_prompts: Skip prompts exceeding max_model_len instead of failing
208
+ max_samples: Maximum number of samples to process (None for all)
209
+ hf_token: Hugging Face authentication token
210
+ """
211
+ generation_start_time = datetime.now().isoformat()
212
+
213
+ # GPU check and configuration
214
+ num_gpus = check_gpu_availability()
215
+ if tensor_parallel_size is None:
216
+ tensor_parallel_size = num_gpus
217
+ logger.info(
218
+ f"Auto-detected {num_gpus} GPU(s), using tensor_parallel_size={tensor_parallel_size}"
219
+ )
220
+ else:
221
+ logger.info(f"Using specified tensor_parallel_size={tensor_parallel_size}")
222
+ if tensor_parallel_size > num_gpus:
223
+ logger.warning(
224
+ f"Requested {tensor_parallel_size} GPUs but only {num_gpus} available"
225
+ )
226
+
227
+ # Authentication - try multiple methods
228
+ HF_TOKEN = hf_token or os.environ.get("HF_TOKEN") or get_token()
229
+
230
+ if not HF_TOKEN:
231
+ logger.error("No HuggingFace token found. Please provide token via:")
232
+ logger.error(" 1. --hf-token argument")
233
+ logger.error(" 2. HF_TOKEN environment variable")
234
+ logger.error(" 3. Run 'huggingface-cli login' or use login() in Python")
235
+ sys.exit(1)
236
+
237
+ logger.info("HuggingFace token found, authenticating...")
238
+ login(token=HF_TOKEN)
239
+
240
+ # Initialize vLLM
241
+ logger.info(f"Loading model: {model_id}")
242
+ vllm_kwargs = {
243
+ "model": model_id,
244
+ "tensor_parallel_size": tensor_parallel_size,
245
+ "gpu_memory_utilization": gpu_memory_utilization,
246
+ }
247
+ if max_model_len is not None:
248
+ vllm_kwargs["max_model_len"] = max_model_len
249
+ logger.info(f"Using max_model_len={max_model_len}")
250
+
251
+ llm = LLM(**vllm_kwargs)
252
+
253
+ # Load tokenizer for chat template
254
+ logger.info("Loading tokenizer...")
255
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
256
+
257
+ # Create sampling parameters
258
+ sampling_params = SamplingParams(
259
+ temperature=temperature,
260
+ top_p=top_p,
261
+ top_k=top_k,
262
+ min_p=min_p,
263
+ max_tokens=max_tokens,
264
+ repetition_penalty=repetition_penalty,
265
+ )
266
+
267
+ # Load dataset
268
+ logger.info(f"Loading dataset: {src_dataset_hub_id}")
269
+ dataset = load_dataset(src_dataset_hub_id, split="train")
270
+
271
+ # Apply max_samples if specified
272
+ if max_samples is not None and max_samples < len(dataset):
273
+ logger.info(f"Limiting dataset to {max_samples} samples")
274
+ dataset = dataset.select(range(max_samples))
275
+
276
+ total_examples = len(dataset)
277
+ logger.info(f"Dataset loaded with {total_examples:,} examples")
278
+
279
+ # Determine which column to use and validate
280
+ if prompt_column:
281
+ # Use prompt column mode
282
+ if prompt_column not in dataset.column_names:
283
+ logger.error(
284
+ f"Column '{prompt_column}' not found. Available columns: {dataset.column_names}"
285
+ )
286
+ sys.exit(1)
287
+ logger.info(f"Using prompt column mode with column: '{prompt_column}'")
288
+ use_messages = False
289
+ else:
290
+ # Use messages column mode
291
+ if messages_column not in dataset.column_names:
292
+ logger.error(
293
+ f"Column '{messages_column}' not found. Available columns: {dataset.column_names}"
294
+ )
295
+ sys.exit(1)
296
+ logger.info(f"Using messages column mode with column: '{messages_column}'")
297
+ use_messages = True
298
+
299
+ # Get effective max length for filtering
300
+ if max_model_len is not None:
301
+ effective_max_len = max_model_len
302
+ else:
303
+ # Get model's default max length
304
+ effective_max_len = llm.llm_engine.model_config.max_model_len
305
+ logger.info(f"Using effective max model length: {effective_max_len}")
306
+
307
+ # Process messages and apply chat template
308
+ logger.info("Preparing prompts...")
309
+ all_prompts = []
310
+ valid_prompts = []
311
+ valid_indices = []
312
+ skipped_info = []
313
+
314
+ for i, example in enumerate(tqdm(dataset, desc="Processing prompts")):
315
+ if use_messages:
316
+ # Messages mode: use existing chat messages
317
+ messages = example[messages_column]
318
+ # Apply chat template
319
+ prompt = tokenizer.apply_chat_template(
320
+ messages, tokenize=False, add_generation_prompt=True
321
+ )
322
+ else:
323
+ # Prompt mode: convert plain text to messages format
324
+ user_prompt = example[prompt_column]
325
+ messages = [{"role": "user", "content": user_prompt}]
326
+ # Apply chat template
327
+ prompt = tokenizer.apply_chat_template(
328
+ messages, tokenize=False, add_generation_prompt=True
329
+ )
330
+
331
+ all_prompts.append(prompt)
332
+
333
+ # Count tokens if filtering is enabled
334
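+         # Note: only the prompt length is checked against the context limit; room for generated tokens is not reserved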
+ if skip_long_prompts:
335
+ tokens = tokenizer.encode(prompt)
336
+ if len(tokens) <= effective_max_len:
337
+ valid_prompts.append(prompt)
338
+ valid_indices.append(i)
339
+ else:
340
+ skipped_info.append((i, len(tokens)))
341
+ else:
342
+ valid_prompts.append(prompt)
343
+ valid_indices.append(i)
344
+
345
+ # Log filtering results
346
+ if skip_long_prompts and skipped_info:
347
+ logger.warning(
348
+ f"Skipped {len(skipped_info)} prompts that exceed max_model_len ({effective_max_len} tokens)"
349
+ )
350
+ logger.info("Skipped prompt details (first 10):")
351
+ for idx, (prompt_idx, token_count) in enumerate(skipped_info[:10]):
352
+ logger.info(
353
+ f" - Example {prompt_idx}: {token_count} tokens (exceeds by {token_count - effective_max_len})"
354
+ )
355
+ if len(skipped_info) > 10:
356
+ logger.info(f" ... and {len(skipped_info) - 10} more")
357
+
358
+ skip_percentage = (len(skipped_info) / total_examples) * 100
359
+ if skip_percentage > 10:
360
+ logger.warning(f"WARNING: {skip_percentage:.1f}% of prompts were skipped!")
361
+
362
+ if not valid_prompts:
363
+ logger.error("No valid prompts to process after filtering!")
364
+ sys.exit(1)
365
+
366
+ # Generate responses - vLLM handles batching internally
367
+ logger.info(f"Starting generation for {len(valid_prompts):,} valid prompts...")
368
+ logger.info("vLLM will handle batching and scheduling automatically")
369
+
370
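+     # generate() returns outputs in the same order as valid_prompts, so valid_indices maps each one back to its row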
+ outputs = llm.generate(valid_prompts, sampling_params)
371
+
372
+ # Extract generated text and create full response list
373
+ logger.info("Extracting generated responses...")
374
+ responses = [""] * total_examples # Initialize with empty strings
375
+
376
+ for idx, output in enumerate(outputs):
377
+ original_idx = valid_indices[idx]
378
+ response = output.outputs[0].text.strip()
379
+ responses[original_idx] = response
380
+
381
+ # Add responses to dataset
382
+ logger.info("Adding responses to dataset...")
383
+ dataset = dataset.add_column(output_column, responses)
384
+
385
+ # Create dataset card
386
+ logger.info("Creating dataset card...")
387
+ card_content = create_dataset_card(
388
+ source_dataset=src_dataset_hub_id,
389
+ model_id=model_id,
390
+ messages_column=messages_column,
391
+ prompt_column=prompt_column,
392
+ sampling_params=sampling_params,
393
+ tensor_parallel_size=tensor_parallel_size,
394
+ num_examples=total_examples,
395
+ generation_time=generation_start_time,
396
+ num_skipped=len(skipped_info) if skip_long_prompts else 0,
397
+ max_model_len_used=effective_max_len if skip_long_prompts else None,
398
+ )
399
+
400
+ # Push dataset to hub
401
+ logger.info(f"Pushing dataset to: {output_dataset_hub_id}")
402
+ dataset.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)
403
+
404
+ # Push dataset card
405
+ card = DatasetCard(card_content)
406
+ card.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)
407
+
408
+ logger.info("✅ Generation complete!")
409
+ logger.info(
410
+ f"Dataset available at: https://huggingface.co/datasets/{output_dataset_hub_id}"
411
+ )
412
+
413
+
414
+ if __name__ == "__main__":
415
+ if len(sys.argv) > 1:
416
+ parser = argparse.ArgumentParser(
417
+ description="Generate responses for dataset prompts using vLLM",
418
+ formatter_class=argparse.RawDescriptionHelpFormatter,
419
+ epilog="""
420
+ Examples:
421
+ # Basic usage with default Qwen model
422
+ uv run generate-responses.py input-dataset output-dataset
423
+
424
+ # With custom model and parameters
425
+ uv run generate-responses.py input-dataset output-dataset \\
426
+ --model-id meta-llama/Llama-3.1-8B-Instruct \\
427
+ --temperature 0.9 \\
428
+ --max-tokens 2048
429
+
430
+ # Force specific GPU configuration
431
+ uv run generate-responses.py input-dataset output-dataset \\
432
+ --tensor-parallel-size 2 \\
433
+ --gpu-memory-utilization 0.95
434
+
435
+ # Using environment variable for token
436
+ HF_TOKEN=hf_xxx uv run generate-responses.py input-dataset output-dataset
437
+ """,
438
+ )
439
+
440
+ parser.add_argument(
441
+ "src_dataset_hub_id",
442
+ help="Input dataset on Hugging Face Hub (e.g., username/dataset-name)",
443
+ )
444
+ parser.add_argument(
445
+ "output_dataset_hub_id", help="Output dataset name on Hugging Face Hub"
446
+ )
447
+ parser.add_argument(
448
+ "--model-id",
449
+ type=str,
450
+ default="Qwen/Qwen3-30B-A3B-Instruct-2507",
451
+ help="Model to use for generation (default: Qwen3-30B-A3B-Instruct-2507)",
452
+ )
453
+ parser.add_argument(
454
+ "--messages-column",
455
+ type=str,
456
+ default="messages",
457
+ help="Column containing chat messages (default: messages)",
458
+ )
459
+ parser.add_argument(
460
+ "--prompt-column",
461
+ type=str,
462
+ help="Column containing plain text prompts (alternative to --messages-column)",
463
+ )
464
+ parser.add_argument(
465
+ "--output-column",
466
+ type=str,
467
+ default="response",
468
+ help="Column name for generated responses (default: response)",
469
+ )
470
+ parser.add_argument(
471
+ "--max-samples",
472
+ type=int,
473
+ help="Maximum number of samples to process (default: all)",
474
+ )
475
+ parser.add_argument(
476
+ "--temperature",
477
+ type=float,
478
+ default=0.7,
479
+ help="Sampling temperature (default: 0.7)",
480
+ )
481
+ parser.add_argument(
482
+ "--top-p",
483
+ type=float,
484
+ default=0.8,
485
+ help="Top-p sampling parameter (default: 0.8)",
486
+ )
487
+ parser.add_argument(
488
+ "--top-k",
489
+ type=int,
490
+ default=20,
491
+ help="Top-k sampling parameter (default: 20)",
492
+ )
493
+ parser.add_argument(
494
+ "--min-p",
495
+ type=float,
496
+ default=0.0,
497
+ help="Minimum probability threshold (default: 0.0)",
498
+ )
499
+ parser.add_argument(
500
+ "--max-tokens",
501
+ type=int,
502
+ default=16384,
503
+ help="Maximum tokens to generate (default: 16384)",
504
+ )
505
+ parser.add_argument(
506
+ "--repetition-penalty",
507
+ type=float,
508
+ default=1.0,
509
+ help="Repetition penalty (default: 1.0)",
510
+ )
511
+ parser.add_argument(
512
+ "--gpu-memory-utilization",
513
+ type=float,
514
+ default=0.90,
515
+ help="GPU memory utilization factor (default: 0.90)",
516
+ )
517
+ parser.add_argument(
518
+ "--max-model-len",
519
+ type=int,
520
+ help="Maximum model context length (default: model's default)",
521
+ )
522
+ parser.add_argument(
523
+ "--tensor-parallel-size",
524
+ type=int,
525
+ help="Number of GPUs to use (default: auto-detect)",
526
+ )
527
+ parser.add_argument(
528
+ "--hf-token",
529
+ type=str,
530
+ help="Hugging Face token (can also use HF_TOKEN env var)",
531
+ )
532
+ parser.add_argument(
533
+ "--skip-long-prompts",
534
+ action="store_true",
535
+ default=True,
536
+ help="Skip prompts that exceed max_model_len instead of failing (default: True)",
537
+ )
538
+ parser.add_argument(
539
+ "--no-skip-long-prompts",
540
+ dest="skip_long_prompts",
541
+ action="store_false",
542
+ help="Fail on prompts that exceed max_model_len",
543
+ )
544
+
545
+ args = parser.parse_args()
546
+
547
+ main(
548
+ src_dataset_hub_id=args.src_dataset_hub_id,
549
+ output_dataset_hub_id=args.output_dataset_hub_id,
550
+ model_id=args.model_id,
551
+ messages_column=args.messages_column,
552
+ prompt_column=args.prompt_column,
553
+ output_column=args.output_column,
554
+ temperature=args.temperature,
555
+ top_p=args.top_p,
556
+ top_k=args.top_k,
557
+ min_p=args.min_p,
558
+ max_tokens=args.max_tokens,
559
+ repetition_penalty=args.repetition_penalty,
560
+ gpu_memory_utilization=args.gpu_memory_utilization,
561
+ max_model_len=args.max_model_len,
562
+ tensor_parallel_size=args.tensor_parallel_size,
563
+ skip_long_prompts=args.skip_long_prompts,
564
+ max_samples=args.max_samples,
565
+ hf_token=args.hf_token,
566
+ )
567
+ else:
568
+ # Show HF Jobs example when run without arguments
569
+ print("""
570
+ vLLM Response Generation Script
571
+ ==============================
572
+
573
+ This script requires arguments. For usage information:
574
+ uv run generate-responses.py --help
575
+
576
+ Example HF Jobs command with multi-GPU:
577
+ # If you're logged in with huggingface-cli, token will be auto-detected
578
+ hf jobs uv run \\
579
+ --flavor l4x4 \\
580
+ https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
581
+ username/input-dataset \\
582
+ username/output-dataset \\
583
+ --messages-column messages \\
584
+ --model-id Qwen/Qwen3-30B-A3B-Instruct-2507 \\
585
+ --temperature 0.7 \\
586
+ --max-tokens 16384
587
+ """)