From aeaafc0df85b6fae1f6301942f73991f15c2e10f Mon Sep 17 00:00:00 2001
From: Eduardo Bonet <ebonet@gitlab.com>
Date: Mon, 3 Feb 2025 20:24:03 +0000
Subject: [PATCH] Add TPS values for Mistral models

---
 .../supported_llm_serving_platforms.md        | 15 +----
 ...ported_models_and_hardware_requirements.md | 55 +++++++++++++++++--
 2 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/doc/administration/self_hosted_models/supported_llm_serving_platforms.md b/doc/administration/self_hosted_models/supported_llm_serving_platforms.md
index e5423bd335b55..047602496d221 100644
--- a/doc/administration/self_hosted_models/supported_llm_serving_platforms.md
+++ b/doc/administration/self_hosted_models/supported_llm_serving_platforms.md
@@ -31,18 +31,12 @@ For more information on:
 
 - vLLM supported models, see the [vLLM Supported Models documentation](https://docs.vllm.ai/en/latest/models/supported_models.html).
 - Available options when using vLLM to run a model, see the [vLLM documentation on engine arguments](https://docs.vllm.ai/en/stable/usage/engine_args.html).
+- The hardware needed for the models, see the [Supported models and hardware requirements documentation](supported_models_and_hardware_requirements.md).
 
 Examples:
 
 #### Mistral-7B-Instruct-v0.2
 
-Mistral-7B-Instruct-v0.3 requires at least:
-
-- 55GB of disk memory for storage
-- 35 GB of GPU vRAM for serving.
-
-With a `a2-highgpu-2g` machine on GCP or equivalent (2x Nvidia A100 40GB - 150 GB vRAM), the model is expected to infer requests at the rate of 250 tokens per second.
-
 1. Download the model from HuggingFace:
 
    ```shell
@@ -63,13 +57,6 @@ With a `a2-highgpu-2g` machine on GCP or equivalent (2x Nvidia A100 40GB - 150 G
 
 #### Mixtral-8x7B-Instruct-v0.1
 
-Mistral-7B-Instruct-v0.3 requires at least:
-
-- 355 GB of disk memory for storage
-- 210 GB of GPU vRAM for serving.
-
-You should at least a `a2-highgpu-4g` machine on GCP or equivalent (4x Nvidia A100 40GB - 340 GB vRAM). With this configuration, the model is expected to infer requests at the rate of 25 tokens per second.
-
 1. Download the model from HuggingFace:
 
    ```shell
diff --git a/doc/administration/self_hosted_models/supported_models_and_hardware_requirements.md b/doc/administration/self_hosted_models/supported_models_and_hardware_requirements.md
index 0e5ef43373b4f..2cd331563b5ac 100644
--- a/doc/administration/self_hosted_models/supported_models_and_hardware_requirements.md
+++ b/doc/administration/self_hosted_models/supported_models_and_hardware_requirements.md
@@ -73,12 +73,55 @@ The following hardware specifications are the minimum requirements for running s
 
 ### GPU requirements by model size
 
-| Model size | Minimum GPU configuration | Minimum VRAM required |
-|------------|------------------------------|---------------------|
-| 7B models<br>(for example, Mistral 7B) | 1x NVIDIA A100 (40GB) | 24 GB |
-| 22B models<br>(for example, Codestral 22B) | 2x NVIDIA A100 (80GB) | 90 GB |
-| Mixtral 8x7B | 2x NVIDIA A100 (80GB) | 100 GB |
-| Mixtral 8x22B | 8x NVIDIA A100 (80GB) | 300 GB |
+| Model size                                 | Minimum GPU configuration | Minimum VRAM required |
+|--------------------------------------------|---------------------------|-----------------------|
+| 7B models<br>(for example, Mistral 7B)     | 1x NVIDIA A100 (40GB)     | 35 GB                 |
+| 22B models<br>(for example, Codestral 22B) | 2x NVIDIA A100 (80GB)     | 110 GB                |
+| Mixtral 8x7B                               | 2x NVIDIA A100 (80GB)     | 220 GB                |
+| Mixtral 8x22B                              | 8x NVIDIA A100 (80GB)     | 526 GB                |
+
+Use [Hugging Face's memory utility](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) to verify memory requirements.
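
To get a feel for where these figures come from, you can estimate the memory needed for the weights alone from the parameter count and the precision. A minimal sketch, assuming half precision (2 bytes per parameter) and a rough 7.25B-parameter count for Mistral 7B; the helper name is illustrative, and real serving needs additional VRAM for the KV cache, activations, and engine overhead, which is why the table values are higher:

```python
# Rough lower bound on VRAM: memory to hold the model weights alone.
# Real serving needs extra room for the KV cache, activations, and the
# serving engine, so treat this only as a floor, not a sizing guide.

def estimate_weight_vram_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """VRAM needed just for the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

# Mistral 7B at ~7.25B parameters in fp16: ~14.5 GB for weights alone,
# well under the 35 GB minimum above once runtime overhead is included.
print(estimate_weight_vram_gb(7.25))  # 14.5
```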
+
+### Response time by model size and GPU
+
+#### Small machine
+
+With an `a2-highgpu-2g` (2x NVIDIA A100 40 GB - 150 GB vRAM) machine on GCP or equivalent:
+
+| Model name               | Number of requests | Average time per request (sec) | Average tokens in response | Average tokens per second per request | Total time for requests (sec) | Total TPS |
+|--------------------------|--------------------|--------------------------------|----------------------------|---------------------------------------|-------------------------------|-----------|
+| Mistral-7B-Instruct-v0.3 | 1                  | 7.09                         | 717.0                      | 101.19                                | 7.09                    | 101.17    |
+| Mistral-7B-Instruct-v0.3 | 10                 | 8.41                         | 764.2                      | 90.35                                 | 13.70                   | 557.80    |
+| Mistral-7B-Instruct-v0.3 | 100                | 13.97                        | 693.23                     | 49.17                                 | 20.81                   | 3331.59   |
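
The `Total TPS` column can be cross-checked from the other columns: it is approximately the total number of tokens generated (number of requests multiplied by average tokens in response) divided by the total time. A minimal sketch using the 10-request row above; small differences against the published figures are rounding:

```python
# Cross-check Total TPS: total tokens generated / total wall-clock time.

def total_tps(n_requests: int, avg_tokens_per_response: float, total_time_s: float) -> float:
    """Aggregate tokens per second across all concurrent requests."""
    return n_requests * avg_tokens_per_response / total_time_s

# 10 requests x 764.2 tokens each over 13.70 s -> ~557.8 TPS, matching the table.
print(round(total_tps(10, 764.2, 13.70), 1))  # 557.8
```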
+
+#### Medium machine
+
+With an `a2-ultragpu-4g` (4x NVIDIA A100 40 GB - 340 GB vRAM) machine on GCP or equivalent:
+
+| Model name                 | Number of requests | Average time per request (sec) | Average tokens in response | Average tokens per second per request | Total time for requests (sec) | Total TPS |
+|----------------------------|--------------------|--------------------------------|----------------------------|---------------------------------------|-------------------------------|-----------|
+| Mistral-7B-Instruct-v0.3   | 1                  | 3.80                         | 499.0                      | 131.25                                | 3.80                    | 131.23    |
+| Mistral-7B-Instruct-v0.3   | 10                 | 6.00                         | 740.6                      | 122.85                                | 8.19                    | 904.22    |
+| Mistral-7B-Instruct-v0.3   | 100                | 11.71                        | 695.71                     | 59.06                                 | 15.54                   | 4477.34   |
+| Mixtral-8x7B-Instruct-v0.1 | 1                  | 6.50                         | 400.0                      | 61.55                                 | 6.50                    | 61.53     |
+| Mixtral-8x7B-Instruct-v0.1 | 10                 | 16.58                        | 768.9                      | 40.33                                 | 32.56                   | 236.13    |
+| Mixtral-8x7B-Instruct-v0.1 | 100                | 25.90                        | 767.38                     | 26.87                                 | 55.57                   | 1380.68   |
+
+#### Large machine
+
+With an `a2-ultragpu-8g` (8x NVIDIA A100 80 GB - 1360 GB vRAM) machine on GCP or equivalent:
+
+| Model name                  | Number of requests | Average time per request (sec) | Average tokens in response | Average tokens per second per request | Total time for requests (sec) | Total TPS |
+|-----------------------------|--------------------|--------------------------------|----------------------------|---------------------------------------|-------------------------------|-----------|
+| Mistral-7B-Instruct-v0.3    | 1                  | 3.23                         | 479.0                      | 148.41                                | 3.22                        | 148.36    |
+| Mistral-7B-Instruct-v0.3    | 10                 | 4.95                         | 678.3                      | 135.98                                | 6.85                        | 989.11    |
+| Mistral-7B-Instruct-v0.3    | 100                | 10.14                        | 713.27                     | 69.63                                 | 13.96                       | 5108.75   |
+| Mixtral-8x7B-Instruct-v0.1  | 1                  | 6.08                         | 709.0                      | 116.69                                | 6.07                        | 116.64    |
+| Mixtral-8x7B-Instruct-v0.1  | 10                 | 9.95                         | 645.0                      | 63.68                                 | 13.40                       | 481.06    |
+| Mixtral-8x7B-Instruct-v0.1  | 100                | 13.83                        | 585.01                     | 41.80                                 | 20.38                       | 2869.12   |
+| Mixtral-8x22B-Instruct-v0.1 | 1                  | 14.39                        | 828.0                      | 57.56                                 | 14.38                       | 57.55     |
+| Mixtral-8x22B-Instruct-v0.1 | 10                 | 20.57                        | 629.7                      | 30.24                                 | 28.02                       | 224.71    |
+| Mixtral-8x22B-Instruct-v0.1 | 100                | 27.58                        | 592.49                     | 21.34                                 | 36.80                       | 1609.85   |
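
Taken together, the tables show throughput growing sub-linearly as the machine scales up. As an illustration, using the `Total TPS` figures for Mistral-7B-Instruct-v0.3 at 100 requests copied from the three tables above:

```python
# Relative throughput of the medium and large machines over the small one,
# for Mistral-7B-Instruct-v0.3 at 100 concurrent requests (Total TPS values
# copied from the tables above).
small, medium, large = 3331.59, 4477.34, 5108.75

print(round(medium / small, 2))  # 1.34 - medium machine vs small
print(round(large / small, 2))   # 1.53 - large machine vs small
```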
 
 ### AI Gateway Hardware Requirements
 
-- 
GitLab