Update README.md
README.md CHANGED

@@ -295,11 +295,8 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
| | Phi-4 mini-Ins | Phi-4-mini-instruct-FP8 |
| latency (batch_size=1) | 1.61s | 1.25s (1.29x speedup) |
| latency (batch_size=256) | 5.16s | 4.89s (1.05x speedup) |
- | serving (num_prompts=1) | 1.37 req/s | 1.66 req/s (1.21x speedup) |
- | serving (num_prompts=1000) | 62.55 req/s | 72.56 req/s (1.16x speedup) |

Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
- Note that these results do not use the fbgemm kernels (`fbgemm-gpu-genai` is not installed); the fbgemm kernels currently show less speedup when num_prompts is 1000.
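If you do want to compare against the fbgemm path, the note above names the `fbgemm-gpu-genai` package; a minimal sketch of the install step, assuming the package is pip-installable:

```Shell
# Assumed install step for the optional fbgemm kernels referenced in the note above.
# The package name comes from the note; treating it as a plain pip install is an assumption.
pip install fbgemm-gpu-genai
```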

<details>
<summary> Reproduce Model Performance Results </summary>

@@ -334,41 +331,6 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-FP8 --batch-size 1
```

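The results table above also reports latency at batch size 256. As a sketch (an assumption, not part of the original instructions), the same command with only the batch size changed should reproduce that row:

```Shell
# Assumed variant of the command above for the batch_size=256 row of the results table;
# only --batch-size changes, every other flag is kept as-is.
VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-FP8 --batch-size 256
```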
- ## benchmark_serving
-
- We benchmarked throughput in a serving environment.
-
- Download the ShareGPT dataset:
-
- ```Shell
- wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
- ```
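A quick, optional sanity check on the download (a hypothetical step, assuming the file parses as a single JSON array of records):

```Shell
# Hypothetical sanity check: confirm the file is present and count its records.
# Assumes ShareGPT_V3_unfiltered_cleaned_split.json is one JSON array.
ls -lh ShareGPT_V3_unfiltered_cleaned_split.json
python3 -c "import json; print(len(json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json'))), 'records')"
```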
-
- Other datasets can be found at https://github.com/vllm-project/vllm/tree/main/benchmarks
-
- Note: you can change the number of prompts to benchmark with the `--num-prompts` argument of the `benchmark_serving` script (an example follows the FP8 client command below).
- ### baseline
- Server:
- ```Shell
- vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
- ```
-
- Client:
- ```Shell
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
- ```
-
- ### FP8
- Server:
- ```Shell
- VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3
- ```
-
- Client:
- ```Shell
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-FP8 --num-prompts 1
- ```
-
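To reproduce the num_prompts=1000 serving row in the results table above, a sketch of the corresponding client invocation (an assumption: identical to the FP8 client command above except for `--num-prompts`):

```Shell
# Assumed rerun for the num_prompts=1000 serving measurement; only --num-prompts
# differs from the FP8 client command above.
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-FP8 --num-prompts 1000
```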
</details>

# Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization