
Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective workloads.

Available Hardware

CPU

  • cpu-basic - Basic CPU, testing only
  • cpu-upgrade - Enhanced CPU

Use cases: Data processing, testing scripts, lightweight workloads
Not recommended for: Model training, GPU-accelerated workloads

GPU Options

| Flavor | GPU | Memory | Use Case | Cost/hour |
|---|---|---|---|---|
| t4-small | NVIDIA T4 | 16GB | <1B models, demos, batch inference | ~$0.50-1 |
| t4-medium | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| l4x1 | NVIDIA L4 | 24GB | 3-7B models, efficient workloads | ~$2-3 |
| l4x4 | 4x NVIDIA L4 | 96GB | Multi-GPU workloads | ~$8-12 |
| a10g-small | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| a10g-large | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| a10g-largex2 | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| a10g-largex4 | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| a100-large | NVIDIA A100 | 40GB | 13B+ models, fast workloads | ~$8-12 |
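
To request a flavor, pass it with the job submission. A minimal sketch using the same hf_jobs call format as the multi-GPU example later in this guide (process.py is a placeholder for your own script):

```python
# Pin a job to a single L4 GPU; flavor names come from the table above.
hf_jobs("uv", {
    "script": "process.py",   # placeholder: your workload script
    "flavor": "l4x1",         # 24GB GPU, suited to 3-7B models
    "timeout": "2h",          # cap runtime to cap cost
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```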

Selection Guidelines

By Workload Type

Data Processing

  • Recommended: cpu-upgrade or l4x1
  • Use case: Transform, filter, analyze datasets
  • Batch size: Depends on data size
  • Time: Varies by dataset size

Batch Inference

  • Recommended: a10g-large or a100-large
  • Use case: Run inference on thousands of samples
  • Batch size: 8-32 depending on model (see the sketch below)
  • Time: Depends on number of samples
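
A minimal batched-inference sketch using the transformers pipeline; the model ID and batch size are illustrative placeholders, not recommendations:

```python
# Batched inference sketch (assumes transformers is installed and a GPU is present).
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # assumed model; any causal LM works
    device=0,                            # the single GPU on one-GPU flavors
)

prompts = [f"Summarize sample {i}:" for i in range(1_000)]

# batch_size sets how many prompts run per forward pass (8-32 per the guidance above)
for outputs in pipe(prompts, batch_size=16, max_new_tokens=64):
    print(outputs[0]["generated_text"])
```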

Experiments & Benchmarks

  • Recommended: a10g-small or a10g-large
  • Use case: Reproducible ML experiments
  • Batch size: Varies
  • Time: Depends on experiment complexity

Model Training

  • Recommended: See the model-trainer skill for details
  • Use case: Fine-tuning models
  • Batch size: Depends on model size
  • Time: Hours to days

Synthetic Data Generation

  • Recommended: a10g-large or a100-large
  • Use case: Generate datasets using LLMs
  • Batch size: Depends on generation method
  • Time: Hours for large datasets

By Budget

Minimal Budget (<$5 total)

  • Use cpu-basic or t4-small
  • Process small datasets
  • Quick tests and demos

Small Budget ($5-20)

  • Use t4-medium or a10g-small
  • Process medium datasets
  • Run experiments

Medium Budget ($20-50)

  • Use a10g-small or a10g-large
  • Process large datasets
  • Production workloads

Large Budget ($50-200)

  • Use a10g-large or a100-large
  • Large-scale processing
  • Multiple experiments

By Model Size (for inference/processing)

Tiny Models (<1B parameters)

  • Recommended: t4-small
  • Example: Qwen2.5-0.5B, TinyLlama
  • Batch size: 8-16

Small Models (1-3B parameters)

  • Recommended: t4-medium or a10g-small
  • Example: Qwen2.5-1.5B, Phi-2
  • Batch size: 4-8

Medium Models (3-7B parameters)

  • Recommended: a10g-small or a10g-large
  • Example: Qwen2.5-7B, Mistral-7B
  • Batch size: 2-4

Large Models (7-13B parameters)

  • Recommended: a10g-large or a100-large
  • Example: Llama-3-8B
  • Batch size: 1-2

Very Large Models (13B+ parameters)

  • Recommended: a100-large
  • Example: Llama-2-13B, Llama-3-70B
  • Batch size: 1
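
The size brackets above can be folded into a small helper. This is only a sketch of the guide's recommendations; the thresholds and function name are made up for illustration, not an official API:

```python
def flavor_for_model(params_b: float) -> str:
    """Map a parameter count in billions to a recommended flavor,
    following the size brackets above (illustrative only)."""
    if params_b < 1:
        return "t4-small"
    if params_b < 3:
        return "t4-medium"
    if params_b < 7:
        return "a10g-small"
    if params_b < 13:
        return "a10g-large"
    return "a100-large"

print(flavor_for_model(0.5))  # t4-small
print(flavor_for_model(8))    # a10g-large
```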

Memory Considerations

Estimating Memory Requirements

For inference:

Memory (GB) β‰ˆ (Model params in billions) Γ— 2-4

For training:

Memory (GB) β‰ˆ (Model params in billions) Γ— 20 (full) or Γ— 4 (LoRA)

Examples:

  • Qwen2.5-0.5B inference: ~1-2GB βœ… fits t4-small
  • Qwen2.5-7B inference: ~14-28GB βœ… fits a10g-large
  • Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA
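
The two formulas are easy to encode. A sketch using the guide's multipliers (these are heuristics, not exact measurements):

```python
def inference_memory_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
    """~2 bytes/param for fp16/bf16 weights; use up to 4 to budget for KV cache and activations."""
    return params_b * bytes_per_param

def training_memory_gb(params_b: float, lora: bool = False) -> float:
    """Heuristic: ~20x params for full fine-tuning (weights, gradients, optimizer
    states), ~4x with LoRA."""
    return params_b * (4 if lora else 20)

print(inference_memory_gb(7))            # 14.0 GB -> fits a10g-large (24GB)
print(training_memory_gb(7))             # 140.0 GB -> not feasible on one GPU
print(training_memory_gb(7, lora=True))  # 28.0 GB
```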

Memory Optimization

If hitting memory limits:

  1. Reduce batch size

    batch_size = 1
    
  2. Process in chunks (see the fuller sketch after this list)

    for chunk in chunks:
        process(chunk)
    
  3. Use smaller models

    • Use quantized models
    • Use LoRA adapters
  4. Upgrade hardware

    • cpu β†’ t4 β†’ a10g β†’ a100
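
Steps 1 and 2 combine naturally. A self-contained version of the chunking fragment above; the dataset and process function are placeholders for your own workload:

```python
def process(chunk):
    """Placeholder for your per-chunk workload."""
    return [len(item) for item in chunk]

records = [f"sample-{i}" for i in range(10_000)]  # stand-in dataset

def iter_chunks(items, chunk_size=1_000):
    """Yield successive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

results = []
for chunk in iter_chunks(records):
    results.extend(process(chunk))  # only one chunk is resident at a time
```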

Cost Estimation

Formula

Total Cost = (Hours of runtime) Γ— (Cost per hour)

Example Calculations

Data processing:

  • Hardware: cpu-upgrade ($0.50/hour)
  • Time: 1 hour
  • Cost: $0.50

Batch inference:

  • Hardware: a10g-large ($5/hour)
  • Time: 2 hours
  • Cost: $10.00

Experiments:

  • Hardware: a10g-small ($3.50/hour)
  • Time: 4 hours
  • Cost: $14.00
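
The same arithmetic as a tiny helper, with approximate rates taken from the examples above (estimates only; check current pricing before budgeting):

```python
# Approximate hourly rates from the examples above (estimates, not live prices).
RATES = {"cpu-upgrade": 0.50, "a10g-small": 3.50, "a10g-large": 5.00, "a100-large": 10.00}

def estimate_cost(flavor: str, hours: float) -> float:
    """Total cost = hours of runtime x cost per hour."""
    return RATES[flavor] * hours

print(estimate_cost("a10g-large", 2))  # 10.0, matching the batch-inference example
```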

Cost Optimization Tips

  1. Start small: Test on cpu-basic or t4-small
  2. Monitor runtime: Set appropriate timeouts
  3. Optimize code: Reduce unnecessary compute
  4. Choose right hardware: Don't over-provision
  5. Use checkpoints: Resume if job fails
  6. Monitor costs: Check running jobs regularly

Multi-GPU Workloads

Multi-GPU flavors give a single job access to several GPUs on one machine. Work is not distributed automatically: your code or framework (for example accelerate, or device_map="auto" in transformers) must spread it across the GPUs.

Multi-GPU flavors:

  • l4x4 - 4x L4 GPUs
  • a10g-largex2 - 2x A10G GPUs
  • a10g-largex4 - 4x A10G GPUs

When to use:

  • Large models (>13B parameters)
  • Need faster processing (near-linear speedup for well-parallelized workloads)
  • Large datasets (>100K samples)
  • Parallel workloads

Example:

hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
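
Inside the job, the GPUs still need to be used. One common pattern, sketched below assuming transformers, accelerate, and torch are installed (the model ID is just an illustrative example from the tables above), is to let device_map="auto" shard a large model across all visible GPUs:

```python
# Shard one large model across every GPU the flavor provides (requires accelerate).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative; pick a model that needs the memory
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # splits layers across all visible GPUs
)

inputs = tokenizer("Hello from a multi-GPU job:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```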

Choosing Between Options

CPU vs GPU

Choose CPU when:

  • No GPU acceleration needed
  • Data processing only
  • Budget constrained
  • Simple workloads

Choose GPU when:

  • Model inference/training
  • GPU-accelerated libraries
  • Need faster processing
  • Large models

a10g vs a100

Choose a10g when:

  • Model <13B parameters
  • Budget conscious
  • Processing time not critical

Choose a100 when:

  • Model 13B+ parameters
  • Need fastest processing
  • Memory requirements high
  • Budget allows

Single vs Multi-GPU

Choose single GPU when:

  • Model <7B parameters
  • Budget constrained
  • Simpler debugging

Choose multi-GPU when:

  • Model >13B parameters
  • Need faster processing
  • Large batch sizes required
  • Cost-effective for large jobs

Quick Reference

# Workload type β†’ Hardware selection
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "training": "see model-trainer skill"
}