Replace unsupported facebookresearch/bitsandbytes (#20093)

With https://github.com/TimDettmers/bitsandbytes, which is by the same author and is still being updated
This commit is contained in:
Tom Aarsen
2022-11-07 14:52:03 +01:00
committed by GitHub
parent 4ab6e9e2f8
commit b8112eddec
2 changed files with 4 additions and 4 deletions

View File

@@ -217,7 +217,7 @@ Let's look at the details.
**Optimizer States:** **Optimizer States:**
- 8 bytes * number of parameters for normal AdamW (maintains 2 states) - 8 bytes * number of parameters for normal AdamW (maintains 2 states)
- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/facebookresearch/bitsandbytes) - 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state) - 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)
**Gradients** **Gradients**
@@ -425,7 +425,7 @@ $ python examples/pytorch/translation/run_translation.py -h | grep "\-optim"
For example, if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed `--optim adamw_apex_fused` will give you the fastest training experience among all supported AdamW optimizers. For example, if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed `--optim adamw_apex_fused` will give you the fastest training experience among all supported AdamW optimizers.
On the other hand [8bit BNB optimizer](https://github.com/facebookresearch/bitsandbytes) can save 3/4 of memory normally used by a typical AdamW optimizer if it is configured to quantize all optimizer states, but in some situations only some optimizer states are quintized and then more memory is used. On the other hand [8bit BNB optimizer](https://github.com/TimDettmers/bitsandbytes) can save 3/4 of memory normally used by a typical AdamW optimizer if it is configured to quantize all optimizer states, but in some situations only some optimizer states are quintized and then more memory is used.
Let's get a feel for the numbers and use for example use a 3B-parameter model, like `t5-3b`. Note that since a Gigabyte correpsonds to a billion bytes we can simply multiply the parameters (in billions) with the number of necessary bytes per parameter to get Gigabytes of GPU memory usage: Let's get a feel for the numbers and use for example use a 3B-parameter model, like `t5-3b`. Note that since a Gigabyte correpsonds to a billion bytes we can simply multiply the parameters (in billions) with the number of necessary bytes per parameter to get Gigabytes of GPU memory usage:
@@ -484,7 +484,7 @@ We went from 15 GB memory usage to 5 GB - a 3x improvement while maintaining the
Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the idea behind FP16 training where using variables with lower precision saves memory. Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the idea behind FP16 training where using variables with lower precision saves memory.
In contrast to the previous approaches is this one not integrated into the [`Trainer`] as a simple flag. We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the Github [repo](https://github.com/facebookresearch/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer. In contrast to the previous approaches is this one not integrated into the [`Trainer`] as a simple flag. We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the Github [repo](https://github.com/TimDettmers/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.
Once installed, we just need to initialize the the optimizer. Although this looks like a considerable amount of work it actually just involves two steps: first we need to group the model's parameters into two groups where to one group we apply weight decay and to the other we don't. Usually, biases and layer norm parameters are not weight decayed. Then in a second step we just do some argument housekeeping to use the same parameters as the previously used AdamW optimizer. Once installed, we just need to initialize the the optimizer. Although this looks like a considerable amount of work it actually just involves two steps: first we need to group the model's parameters into two groups where to one group we apply weight decay and to the other we don't. Usually, biases and layer norm parameters are not weight decayed. Then in a second step we just do some argument housekeeping to use the same parameters as the previously used AdamW optimizer.

View File

@@ -703,7 +703,7 @@ We are very excited to be hosting 2 days of talks from Kensho-Technologies, Mozi
- Memory efficient training: - Memory efficient training:
In case, you are getting out-of-memory errors on your GPU, we recommend to use In case, you are getting out-of-memory errors on your GPU, we recommend to use
[bitsandbytes](https://github.com/facebookresearch/bitsandbytes) to replace the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) to replace the
native memory-intensive Adam optimizer with the one of `bitsandbytes`. You native memory-intensive Adam optimizer with the one of `bitsandbytes`. You
can simply run the script `./run_speech_recognition_ctc_bnb.py` provided in this can simply run the script `./run_speech_recognition_ctc_bnb.py` provided in this
folder that makes use of `bitsandbytes` instead of the official one. folder that makes use of `bitsandbytes` instead of the official one.