[Docs] Benchmark docs (#5360)
* first doc version
* add benchmark docs
* fix typos
* improve README
* Update docs/source/benchmarks.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* fix naming and docs

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
committed by GitHub
parent 482c9178d3
commit 4bcc35cd69
@@ -289,7 +289,7 @@
|
||||
"\n",
|
||||
"Being able to accurately benchmark language models on both *speed* and *required memory* is therefore very important.\n",
|
||||
"\n",
|
||||
"HuggingFace's Transformer library allows users to benchmark models for both Tensorflow 2 and PyTorch using the `PyTorchBenchmark` and `TensorflowBenchmark` classes.\n",
|
||||
"HuggingFace's Transformer library allows users to benchmark models for both TensorFlow 2 and PyTorch using the `PyTorchBenchmark` and `TensorFlowBenchmark` classes.\n",
|
||||
"\n",
|
||||
"The currently available features for `PyTorchBenchmark` are summarized in the following table.\n",
|
||||
"\n",
|
||||
@@ -306,7 +306,7 @@
|
||||
"\n",
|
||||
"* *torchscript* corresponds to PyTorch's torchscript format, see [here](https://pytorch.org/docs/stable/jit.html).\n",
|
||||
"\n",
|
||||
"The currently available features for `TensorflowBenchmark` are summarized in the following table.\n",
|
||||
"The currently available features for `TensorFlowBenchmark` are summarized in the following table.\n",
|
||||
"\n",
|
||||
"| | CPU | CPU + eager execution | GPU | GPU + eager execution | GPU + XLA | GPU + FP16 | TPU |\n",
|
||||
":-- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |\n",
|
||||
@@ -315,16 +315,16 @@
|
||||
"**Speed - Train** | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |\n",
|
||||
"**Memory - Train** | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |\n",
|
||||
"\n",
|
||||
"* *eager execution* means that the function is run in the eager execution environment of Tensorflow 2, see [here](https://www.tensorflow.org/guide/eager).\n",
|
||||
"* *eager execution* means that the function is run in the eager execution environment of TensorFlow 2, see [here](https://www.tensorflow.org/guide/eager).\n",
|
||||
"\n",
|
||||
"* *XLA* stands for Tensorflow's Accelerated Linear Algebra (XLA) compiler, see [here](https://www.tensorflow.org/xla)\n",
|
||||
"* *XLA* stands for TensorFlow's Accelerated Linear Algebra (XLA) compiler, see [here](https://www.tensorflow.org/xla)\n",
|
||||
"\n",
|
||||
"* *FP16* stands for Tensorflow's mixed-precision package and is analogous to PyTorch's FP16 feature, see [here](https://www.tensorflow.org/guide/mixed_precision).\n",
|
||||
"* *FP16* stands for TensorFlow's mixed-precision package and is analogous to PyTorch's FP16 feature, see [here](https://www.tensorflow.org/guide/mixed_precision).\n",
|
||||
"\n",
|
||||
"***Note***: In ~1,2 weeks it will also be possible to benchmark training in Tensorflow.\n",
|
||||
"***Note***: In ~1-2 weeks it will also be possible to benchmark training in TensorFlow.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"This notebook will show the user how to use `PyTorchBenchmark` and `TensorflowBenchmark` for two different scenarios:\n",
|
||||
"This notebook will show the user how to use `PyTorchBenchmark` and `TensorFlowBenchmark` for two different scenarios:\n",
|
||||
"\n",
|
||||
"1. **Inference - Pre-trained Model Comparison** - *A user wants to implement a pre-trained model in production for inference. She wants to compare different models on speed and required memory.*\n",
|
||||
"\n",
|
||||
@@ -443,7 +443,7 @@
|
||||
"source": [
|
||||
"Looks good! Now we import `transformers` and download the scripts `run_benchmark.py`, `run_benchmark_tf.py`, and `plot_csv_file.py` which can be found under `transformers/examples/benchmarking`.\n",
|
||||
"\n",
|
||||
"`run_benchmark_tf.py` and `run_benchmark.py` are very simple scripts leveraging the `PyTorchBenchmark` and `TensorflowBenchmark` classes, respectively."
|
||||
"`run_benchmark.py` and `run_benchmark_tf.py` are very simple scripts leveraging the `PyTorchBenchmark` and `TensorFlowBenchmark` classes, respectively."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -482,7 +482,7 @@
|
||||
"colab_type": "text"
|
||||
},
|
||||
"source": [
|
||||
"Information about the input arguments to the *run_benchmark* scripts can be accessed by running `!python run_benchmark.py --help` for PyTorch and `!python run_benchmark_tf.py --help` for Tensorflow."
|
||||
"Information about the input arguments to the *run_benchmark* scripts can be accessed by running `!python run_benchmark.py --help` for PyTorch and `!python run_benchmark_tf.py --help` for TensorFlow."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1130,7 +1130,7 @@
|
||||
},
|
||||
"source": [
|
||||
"At this point, it is important to understand how the peak memory is measured. The benchmarking tools measure the peak memory usage the same way the command `nvidia-smi` does - see [here](https://developer.nvidia.com/nvidia-system-management-interface) for more information. \n",
|
||||
"In short, all memory that is allocated for a given *model identifier*, *batch size* and *sequence length* is measured in a separate process. This way it can be ensured that there is no previously unreleased memory falsely included in the measurement. One should also note that the measured memory even includes the memory allocated by the CUDA driver to load PyTorch and Tensorflow and is, therefore, higher than library-specific memory measurement function, *e.g.* this one for [PyTorch](https://pytorch.org/docs/stable/cuda.html#torch.cuda.max_memory_allocated).\n",
|
||||
"In short, all memory that is allocated for a given *model identifier*, *batch size* and *sequence length* is measured in a separate process. This ensures that no previously unreleased memory is falsely included in the measurement. Note that the measured memory also includes the memory allocated by the CUDA driver to load PyTorch and TensorFlow and is, therefore, higher than the value reported by library-specific memory measurement functions, *e.g.* this one for [PyTorch](https://pytorch.org/docs/stable/cuda.html#torch.cuda.max_memory_allocated).\n",
|
||||
"\n",
|
||||
"Alright, let's analyze the results. It can be noted that the models `aodiniz/bert_uncased_L-10_H-512_A-8_cord19-200616_squad2` and `deepset/roberta-base-squad2` require significantly less memory than the other three models. Besides `mrm8488/longformer-base-4096-finetuned-squadv2` all models more or less follow the same memory consumption pattern with `aodiniz/bert_uncased_L-10_H-512_A-8_cord19-200616_squad2` seemingly being able to better scale to larger sequence lengths. \n",
|
||||
"`mrm8488/longformer-base-4096-finetuned-squadv2` is a *Longformer* model, which makes use of *LocalAttention* (check this blog post to learn more about local attention) so that the model scales much better to longer input sequences.\n",
|
||||
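Note on the hunk above: the idea of measuring each configuration in a separate process can be illustrated with a minimal sketch. It is not the benchmarking tool's implementation. It assumes Python's `tracemalloc` as a stand-in for the `nvidia-smi`-style measurement, so it only sees Python-level allocations and none of the CUDA driver memory mentioned in the text. The helper name `peak_bytes_in_fresh_process` and the list allocation used as a workload are hypothetical.

```python
import subprocess
import sys

# Code executed in the child process. The list allocation is a hypothetical
# stand-in for loading a model and running one forward pass.
CHILD_SNIPPET = """
import tracemalloc
tracemalloc.start()
data = [0] * {n}
_, peak = tracemalloc.get_traced_memory()
print(peak)
"""


def peak_bytes_in_fresh_process(n_items):
    """Run the workload in a new interpreter and return its peak traced memory.

    Because every measurement starts in a fresh process, memory that an
    earlier run failed to release cannot leak into the result.
    """
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SNIPPET.format(n=n_items)],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


small = peak_bytes_in_fresh_process(10_000)
large = peak_bytes_in_fresh_process(1_000_000)
print(f"small workload peak: {small} bytes")
print(f"large workload peak: {large} bytes")
```

Each call pays the start-up cost of a new interpreter, which is the price of isolating the measurements from one another.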
@@ -1256,7 +1256,7 @@
|
||||
"source": [
|
||||
"Interesting! `aodiniz/bert_uncased_L-10_H-51` clearly scales better for higher batch sizes and does not even run out of memory for 512 tokens.\n",
|
||||
"\n",
|
||||
"For comparison, let's run the same benchmarking on Tensorflow."
|
||||
"For comparison, let's run the same benchmarking on TensorFlow."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1341,7 +1341,7 @@
|
||||
"colab_type": "text"
|
||||
},
|
||||
"source": [
|
||||
"Let's see the same plot for Tensorflow."
|
||||
"Let's see the same plot for TensorFlow."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1394,7 +1394,7 @@
|
||||
"colab_type": "text"
|
||||
},
|
||||
"source": [
|
||||
"The model implemented in Tensorflow requires more memory than the one implemented in PyTorch. Let's say for whatever reason we have decided to use Tensorflow instead of PyTorch. \n",
|
||||
"The model implemented in TensorFlow requires more memory than the one implemented in PyTorch. Let's say for whatever reason we have decided to use TensorFlow instead of PyTorch. \n",
|
||||
"\n",
|
||||
"The next step is to measure the inference time of these two models. Instead of disabling time measurement with `--no_speed`, we will now disable memory measurement with `--no_memory`."
|
||||
]
|
||||
@@ -1499,7 +1499,7 @@
|
||||
"source": [
|
||||
"Ok, this took some time... time measurements take much longer than memory measurements because the forward pass is called multiple times for stable results. Timing measurements leverage Python's [timeit module](https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat) and run 10 times the value given to the `--repeat` argument (defaults to 3), so in our case 30 times.\n",
|
||||
"\n",
|
||||
"Let's focus on the resulting plot. It becomes obvious that `aodiniz/bert_uncased_L-10_H-51` is around twice as fast as `deepset/roberta-base-squad2`. Given that the model is also more memory efficient and assuming that the model performs reasonably well, for the sake of this notebook we will settle on `aodiniz/bert_uncased_L-10_H-51`. Our model should be able to process input sequences of up to 512 tokens. Latency time of around 2 seconds might be too long though, so let's compare the time for different batch sizes and using Tensorflows XLA package for more speed."
|
||||
"Let's focus on the resulting plot. It becomes obvious that `aodiniz/bert_uncased_L-10_H-51` is around twice as fast as `deepset/roberta-base-squad2`. Given that the model is also more memory efficient and assuming that the model performs reasonably well, for the sake of this notebook we will settle on `aodiniz/bert_uncased_L-10_H-51`. Our model should be able to process input sequences of up to 512 tokens. A latency of around 2 seconds might be too long though, so let's compare the time for different batch sizes and with TensorFlow's XLA package for more speed."
|
||||
]
|
||||
},
|
||||
{
|
||||
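Note on the hunk above: the timing scheme it describes (the `--repeat` value, default 3, multiplied by 10 calls, giving 30 forward passes) can be sketched with `timeit.repeat` from the standard library. This is a minimal illustration, not the benchmark script's code. The function `forward_pass` is a hypothetical stand-in workload, and reporting the fastest repetition is an assumption that follows the general recommendation in the `timeit` documentation.

```python
import timeit


def forward_pass():
    # Hypothetical stand-in workload; a real benchmark would run the model here.
    return sum(i * i for i in range(10_000))


REPEAT = 3          # mirrors the default of the --repeat argument
CALLS_PER_RUN = 10  # each repetition calls the function 10 times

# timeit.repeat returns one total wall-clock time (in seconds) per repetition.
runtimes = timeit.repeat(forward_pass, repeat=REPEAT, number=CALLS_PER_RUN)

total_calls = REPEAT * CALLS_PER_RUN  # 3 * 10 = 30 forward passes overall

# Assumption: report the fastest repetition, since slower ones mostly
# reflect interference from other processes rather than the code itself.
seconds_per_call = min(runtimes) / CALLS_PER_RUN

print(f"{total_calls} calls in total, about {seconds_per_call:.6f} s per call")
```

Running the function many times is what makes time measurements slower than memory measurements, as the text above observes.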
@@ -1551,7 +1551,7 @@
|
||||
"colab_type": "text"
|
||||
},
|
||||
"source": [
|
||||
"First of all, it can be noted that XLA reduces latency time by a factor of ca. 1.3 (which is more than observed for other models by Tensorflow [here](https://www.tensorflow.org/xla)). A batch size of 64 looks like a good choice. More or less half a second for the forward pass is good enough.\n",
|
||||
"First of all, it can be noted that XLA reduces latency time by a factor of ca. 1.3 (which is more than observed for other models by TensorFlow [here](https://www.tensorflow.org/xla)). A batch size of 64 looks like a good choice. More or less half a second for the forward pass is good enough.\n",
|
||||
"\n",
|
||||
"Cool, now it should be straightforward to benchmark your favorite models. All the inference time measurements can also be done using the `run_benchmark.py` script for PyTorch."
|
||||
]
|
||||
@@ -2021,4 +2021,4 @@
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||