From aeaafc0df85b6fae1f6301942f73991f15c2e10f Mon Sep 17 00:00:00 2001
From: Eduardo Bonet <ebonet@gitlab.com>
Date: Mon, 3 Feb 2025 20:24:03 +0000
Subject: [PATCH] Add TPS values for Mistral models

---
 .../supported_llm_serving_platforms.md        | 15 +----
 ...ported_models_and_hardware_requirements.md | 55 +++++++++++++++++--
 2 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/doc/administration/self_hosted_models/supported_llm_serving_platforms.md b/doc/administration/self_hosted_models/supported_llm_serving_platforms.md
index e5423bd335b55..047602496d221 100644
--- a/doc/administration/self_hosted_models/supported_llm_serving_platforms.md
+++ b/doc/administration/self_hosted_models/supported_llm_serving_platforms.md
@@ -31,18 +31,12 @@ For more information on:
 
 - vLLM supported models, see the [vLLM Supported Models documentation](https://docs.vllm.ai/en/latest/models/supported_models.html).
 - Available options when using vLLM to run a model, see the [vLLM documentation on engine arguments](https://docs.vllm.ai/en/stable/usage/engine_args.html).
+- The hardware needed for the models, see the [supported models and hardware requirements documentation](supported_models_and_hardware_requirements.md).
 
 Examples:
 
 #### Mistral-7B-Instruct-v0.2
 
-Mistral-7B-Instruct-v0.3 requires at least:
-
-- 55GB of disk memory for storage
-- 35 GB of GPU vRAM for serving.
-
-With a `a2-highgpu-2g` machine on GCP or equivalent (2x Nvidia A100 40GB - 150 GB vRAM), the model is expected to infer requests at the rate of 250 tokens per second.
-
 1. Download the model from HuggingFace:
 
    ```shell
@@ -63,13 +57,6 @@ With a `a2-highgpu-2g` machine on GCP or equivalent (2x Nvidia A100 40GB - 150 G
 
 #### Mixtral-8x7B-Instruct-v0.1
 
-Mistral-7B-Instruct-v0.3 requires at least:
-
-- 355 GB of disk memory for storage
-- 210 GB of GPU vRAM for serving.
-
-You should at least a `a2-highgpu-4g` machine on GCP or equivalent (4x Nvidia A100 40GB - 340 GB vRAM). With this configuration, the model is expected to infer requests at the rate of 25 tokens per second.
-
 1. Download the model from HuggingFace:
 
    ```shell
diff --git a/doc/administration/self_hosted_models/supported_models_and_hardware_requirements.md b/doc/administration/self_hosted_models/supported_models_and_hardware_requirements.md
index 0e5ef43373b4f..2cd331563b5ac 100644
--- a/doc/administration/self_hosted_models/supported_models_and_hardware_requirements.md
+++ b/doc/administration/self_hosted_models/supported_models_and_hardware_requirements.md
@@ -73,12 +73,55 @@ The following hardware specifications are the minimum requirements for running s
 
 ### GPU requirements by model size
 
-| Model size | Minimum GPU configuration | Minimum VRAM required |
-|------------|------------------------------|---------------------|
-| 7B models<br>(for example, Mistral 7B) | 1x NVIDIA A100 (40GB) | 24 GB |
-| 22B models<br>(for example, Codestral 22B) | 2x NVIDIA A100 (80GB) | 90 GB |
-| Mixtral 8x7B | 2x NVIDIA A100 (80GB) | 100 GB |
-| Mixtral 8x22B | 8x NVIDIA A100 (80GB) | 300 GB |
+| Model size                                 | Minimum GPU configuration | Minimum VRAM required |
+|--------------------------------------------|---------------------------|-----------------------|
+| 7B models<br>(for example, Mistral 7B)     | 1x NVIDIA A100 (40GB)     | 35 GB                 |
+| 22B models<br>(for example, Codestral 22B) | 2x NVIDIA A100 (80GB)     | 110 GB                |
+| Mixtral 8x7B                               | 2x NVIDIA A100 (80GB)     | 220 GB                |
+| Mixtral 8x22B                              | 8x NVIDIA A100 (80GB)     | 526 GB                |
+
+Use [Hugging Face's memory utility](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) to verify memory requirements.
+
+### Response time by model size and GPU
+
+#### Small machine
+
+With a `a2-highgpu-2g` (2x NVIDIA A100 40 GB - 150 GB vRAM) machine on GCP or equivalent:
+
+| Model name               | Number of requests | Average time per request (sec) | Average tokens in response | Average tokens per second per request | Total time for requests (sec) | Total TPS |
+|--------------------------|--------------------|--------------------------------|----------------------------|---------------------------------------|-------------------------------|-----------|
+| Mistral-7B-Instruct-v0.3 | 1                  | 7.09                           | 717.0                      | 101.19                                | 7.09                          | 101.17    |
+| Mistral-7B-Instruct-v0.3 | 10                 | 8.41                           | 764.2                      | 90.35                                 | 13.70                         | 557.80    |
+| Mistral-7B-Instruct-v0.3 | 100                | 13.97                          | 693.23                     | 49.17                                 | 20.81                         | 3331.59   |
+
+#### Medium machine
+
+With a `a2-ultragpu-4g` (4x NVIDIA A100 40 GB - 340 GB vRAM) machine on GCP or equivalent:
+
+| Model name                 | Number of requests | Average time per request (sec) | Average tokens in response | Average tokens per second per request | Total time for requests (sec) | Total TPS |
+|----------------------------|--------------------|--------------------------------|----------------------------|---------------------------------------|-------------------------------|-----------|
+| Mistral-7B-Instruct-v0.3   | 1                  | 3.80                           | 499.0                      | 131.25                                | 3.80                          | 131.23    |
+| Mistral-7B-Instruct-v0.3   | 10                 | 6.00                           | 740.6                      | 122.85                                | 8.19                          | 904.22    |
+| Mistral-7B-Instruct-v0.3   | 100                | 11.71                          | 695.71                     | 59.06                                 | 15.54                         | 4477.34   |
+| Mixtral-8x7B-Instruct-v0.1 | 1                  | 6.50                           | 400.0                      | 61.55                                 | 6.50                          | 61.53     |
+| Mixtral-8x7B-Instruct-v0.1 | 10                 | 16.58                          | 768.9                      | 40.33                                 | 32.56                         | 236.13    |
+| Mixtral-8x7B-Instruct-v0.1 | 100                | 25.90                          | 767.38                     | 26.87                                 | 55.57                         | 1380.68   |
+
+#### Large machine
+
+With a `a2-ultragpu-8g` (8x NVIDIA A100 80 GB - 1360 GB vRAM) machine on GCP or equivalent:
+
+| Model name                  | Number of requests | Average time per request (sec) | Average tokens in response | Average tokens per second per request | Total time for requests (sec) | Total TPS |
+|-----------------------------|--------------------|--------------------------------|----------------------------|---------------------------------------|-------------------------------|-----------|
+| Mistral-7B-Instruct-v0.3    | 1                  | 3.23                           | 479.0                      | 148.41                                | 3.22                          | 148.36    |
+| Mistral-7B-Instruct-v0.3    | 10                 | 4.95                           | 678.3                      | 135.98                                | 6.85                          | 989.11    |
+| Mistral-7B-Instruct-v0.3    | 100                | 10.14                          | 713.27                     | 69.63                                 | 13.96                         | 5108.75   |
+| Mixtral-8x7B-Instruct-v0.1  | 1                  | 6.08                           | 709.0                      | 116.69                                | 6.07                          | 116.64    |
+| Mixtral-8x7B-Instruct-v0.1  | 10                 | 9.95                           | 645.0                      | 63.68                                 | 13.40                         | 481.06    |
+| Mixtral-8x7B-Instruct-v0.1  | 100                | 13.83                          | 585.01                     | 41.80                                 | 20.38                         | 2869.12   |
+| Mixtral-8x22B-Instruct-v0.1 | 1                  | 14.39                          | 828.0                      | 57.56                                 | 14.38                         | 57.55     |
+| Mixtral-8x22B-Instruct-v0.1 | 10                 | 20.57                          | 629.7                      | 30.24                                 | 28.02                         | 224.71    |
+| Mixtral-8x22B-Instruct-v0.1 | 100                | 27.58                          | 592.49                     | 21.34                                 | 36.80                         | 1609.85   |
 
 ### AI Gateway Hardware Requirements
 
--
GitLab
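As a companion to the serving steps the first file references (the patch elides the shell blocks as diff context), here is a minimal sketch of downloading one of these models and launching it with vLLM's OpenAI-compatible server. The model choice, GPU count, and port are illustrative assumptions, not values taken from the patch:

```shell
# Sketch only: download and serve Mixtral-8x7B-Instruct-v0.1 with vLLM.
# Assumes vLLM and the Hugging Face CLI are installed; adjust the GPU count
# (--tensor-parallel-size) and the port to your hardware and deployment.
huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 4 \
  --port 8000
```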
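The benchmark columns are related by a simple identity: Total TPS is approximately (number of requests × average tokens in response) ÷ total time for requests. As a quick sanity check of the 100-request Mistral-7B row from the small-machine table:

```shell
# Verify Total TPS for the small-machine, 100-request Mistral-7B row:
# 100 requests * 693.23 average tokens / 20.81 s total time.
echo "scale=2; 100 * 693.23 / 20.81" | bc
# Prints 3331.28, close to the reported 3331.59 (the table inputs are rounded).
```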