# Hardware Selection Guide
Choosing the right hardware (flavor) is critical for cost-effective workloads.
## Available Hardware

### CPU

- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU

**Use cases:** Data processing, testing scripts, lightweight workloads
**Not recommended for:** Model training, GPU-accelerated workloads

### GPU Options
| Flavor | GPU | Memory | Use Case | Cost/hour |
|---|---|---|---|---|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos, batch inference | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient workloads | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU workloads | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast workloads | ~$8-12 |
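As a quick taste of how a flavor is specified, a minimal job submission might look like this (a sketch using the `hf_jobs` call shown later in this guide; the script name is a hypothetical placeholder):

```python
# Minimal sketch: run a small demo/batch-inference script on a single T4.
hf_jobs("uv", {
    "script": "demo.py",   # hypothetical script
    "flavor": "t4-small",  # <1B models, demos, batch inference
    "timeout": "1h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```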
## Selection Guidelines

### By Workload Type
#### Data Processing
- Recommended: `cpu-upgrade` or `l4x1`
- Use case: Transform, filter, analyze datasets
- Batch size: Depends on data size
- Time: Varies by dataset size
#### Batch Inference
- Recommended: `a10g-large` or `a100-large`
- Use case: Run inference on thousands of samples
- Batch size: 8-32 depending on model
- Time: Depends on number of samples
#### Experiments & Benchmarks
- Recommended: `a10g-small` or `a10g-large`
- Use case: Reproducible ML experiments
- Batch size: Varies
- Time: Depends on experiment complexity
#### Model Training
- Recommended: See model-trainer skill
- Use case: Fine-tuning models
- Batch size: Depends on model size
- Time: Hours to days
#### Synthetic Data Generation
- Recommended: `a10g-large` or `a100-large`
- Use case: Generate datasets using LLMs
- Batch size: Depends on generation method
- Time: Hours for large datasets
### By Budget

#### Minimal Budget (<$5 total)
- Use `cpu-basic` or `t4-small`
- Process small datasets
- Quick tests and demos
#### Small Budget ($5-20)
- Use `t4-medium` or `a10g-small`
- Process medium datasets
- Run experiments
#### Medium Budget ($20-50)
- Use `a10g-small` or `a10g-large`
- Process large datasets
- Production workloads
#### Large Budget ($50-200)
- Use `a10g-large` or `a100-large`
- Large-scale processing
- Multiple experiments
### By Model Size (for inference/processing)

#### Tiny Models (<1B parameters)
- Recommended: `t4-small`
- Example: Qwen2.5-0.5B, TinyLlama
- Batch size: 8-16
#### Small Models (1-3B parameters)
- Recommended: `t4-medium` or `a10g-small`
- Example: Qwen2.5-1.5B, Phi-2
- Batch size: 4-8
#### Medium Models (3-7B parameters)
- Recommended: `a10g-small` or `a10g-large`
- Example: Qwen2.5-7B, Mistral-7B
- Batch size: 2-4
#### Large Models (7-13B parameters)
- Recommended: `a10g-large` or `a100-large`
- Example: Llama-3-8B
- Batch size: 1-2
#### Very Large Models (13B+ parameters)
- Recommended: `a100-large`
- Example: Llama-3-13B, Llama-3-70B (70B-class models typically also need quantization or multi-GPU; see Memory Considerations below)
- Batch size: 1
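These pairings can be captured in a small lookup function; here's an illustrative sketch (the thresholds follow the buckets above, and the returned batch size is the conservative low end of each range):

```python
def pick_flavor_and_batch(params_billions: float) -> tuple[str, int]:
    """Suggest a flavor and a safe starting batch size by model size."""
    if params_billions < 1:
        return "t4-small", 8    # tiny models: batch 8-16
    if params_billions < 3:
        return "t4-medium", 4   # small models: batch 4-8
    if params_billions < 7:
        return "a10g-small", 2  # medium models: batch 2-4
    if params_billions < 13:
        return "a10g-large", 1  # large models: batch 1-2
    return "a100-large", 1      # very large models: batch 1

print(pick_flavor_and_batch(7))  # ('a10g-large', 1)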
## Memory Considerations

### Estimating Memory Requirements

For inference: `Memory (GB) ≈ (model params in billions) × 2-4`

For training: `Memory (GB) ≈ (model params in billions) × 20` (full fine-tuning) or `× 4` (LoRA)

Examples:
- Qwen2.5-0.5B inference: ~1-2GB → fits `t4-small`
- Qwen2.5-7B inference: ~14-28GB → fits `a10g-large`
- Qwen2.5-7B full training: ~140GB → not feasible without LoRA
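These rules of thumb are easy to encode; a minimal sketch (the multipliers are the rough factors above, not exact figures for any particular framework or dtype):

```python
def estimate_memory_gb(params_billions: float, mode: str = "inference") -> tuple[float, float]:
    """Rule-of-thumb (min, max) memory in GB, per the formulas above."""
    if mode == "inference":
        return (params_billions * 2, params_billions * 4)
    if mode == "full_training":
        return (params_billions * 20, params_billions * 20)
    if mode == "lora_training":
        return (params_billions * 4, params_billions * 4)
    raise ValueError(f"unknown mode: {mode!r}")

print(estimate_memory_gb(7))                   # (14, 28) -> a10g-large, per the examples above
print(estimate_memory_gb(7, "full_training"))  # (140, 140) -> needs LoRA
```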
### Memory Optimization

If hitting memory limits:

1. Reduce batch size: `batch_size = 1`
2. Process in chunks: `for chunk in chunks: process(chunk)` (see the sketch below)
3. Use smaller models
   - Use quantized models
   - Use LoRA adapters
4. Upgrade hardware
   - cpu → t4 → a10g → a100
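A minimal, runnable sketch of the chunking pattern (`samples` and `process` are hypothetical placeholders for your dataset and per-chunk work):

```python
def iter_chunks(items, chunk_size=1000):
    """Yield fixed-size slices so only one chunk sits in memory at a time."""
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]

def process(chunk):
    """Hypothetical placeholder for your per-chunk work."""
    print(f"processing {len(chunk)} items")

samples = list(range(3500))  # stand-in for your dataset
for chunk in iter_chunks(samples, chunk_size=1000):
    process(chunk)  # peak memory is bounded by one chunk
```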
## Cost Estimation

### Formula

Total Cost = (Hours of runtime) × (Cost per hour)
### Example Calculations

**Data processing:**
- Hardware: `cpu-upgrade` ($0.50/hour)
- Time: 1 hour
- Cost: $0.50

**Batch inference:**
- Hardware: `a10g-large` ($5/hour)
- Time: 2 hours
- Cost: $10.00

**Experiments:**
- Hardware: `a10g-small` ($3.50/hour)
- Time: 4 hours
- Cost: $14.00
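The same arithmetic as a one-line helper (the rates are the rough per-hour figures above, not exact billing):

```python
def job_cost(hours: float, cost_per_hour: float) -> float:
    """Total cost = runtime hours x hourly rate."""
    return hours * cost_per_hour

print(job_cost(1, 0.50))  # data processing on cpu-upgrade -> 0.5
print(job_cost(2, 5.00))  # batch inference on a10g-large  -> 10.0
print(job_cost(4, 3.50))  # experiments on a10g-small      -> 14.0
```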
### Cost Optimization Tips

- **Start small:** Test on `cpu-basic` or `t4-small`
- **Monitor runtime:** Set appropriate timeouts
- **Optimize code:** Reduce unnecessary compute
- **Choose the right hardware:** Don't over-provision
- **Use checkpoints:** Resume if a job fails
- **Monitor costs:** Check running jobs regularly
## Multi-GPU Workloads

Multi-GPU flavors give a single job access to several GPUs; your code is responsible for distributing work across them (for example via `accelerate` or transformers' `device_map="auto"`, as sketched later in this section).

Multi-GPU flavors:
- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs
When to use:
- Large models (>13B parameters)
- Need faster processing (near-linear speedup for data-parallel workloads)
- Large datasets (>100K samples)
- Parallel workloads
Example:

```python
hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
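Inside the script itself, one common way to spread a large model across all visible GPUs is sharding with `device_map="auto"`; a minimal sketch (the model ID is illustrative, taken from the model-size table above, and gated models require an HF_TOKEN with access):

```python
# Sketch for process.py: shard one model across both A10Gs.
# Assumes transformers and accelerate are installed in the job environment.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # place layers across all visible GPUs
    torch_dtype="auto",  # use the checkpoint's native precision
)
```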
## Choosing Between Options

### CPU vs GPU

**Choose CPU when:**
- No GPU acceleration needed
- Data processing only
- Budget constrained
- Simple workloads

**Choose GPU when:**
- Model inference/training
- GPU-accelerated libraries
- Need faster processing
- Large models
### a10g vs a100

**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Processing time not critical

**Choose a100 when:**
- Model 13B+ parameters
- Need fastest processing
- High memory requirements
- Budget allows
### Single vs Multi-GPU

**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**
- Model >13B parameters
- Need faster processing
- Large batch sizes required
- Cost-effective for large jobs
## Quick Reference

```python
# Workload type → Hardware selection
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "training": "see model-trainer skill",
}
```
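Usage is a plain dictionary lookup, which can feed straight into a job submission (the script name is a hypothetical placeholder):

```python
flavor = HARDWARE_MAP["batch_inference_medium"]  # -> "a10g-large"
hf_jobs("uv", {
    "script": "inference.py",  # hypothetical script
    "flavor": flavor,
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```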