Bloom Optimize operations (#17866)
* fix tolerance for a bloom slow test
* enhance alibi padding
  - get rid of for loops
  - deals better with padded batched input
  - avoid useless cpu/gpu communication when creating alibi
Co-authored-by: justheuristic <justheuristic@gmail.com>
* optimize attention mask
* fix scaled softmax limit values
* optimize building alibi tensor
Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
* fix attention_mask shape when it's None
* minor fixes
  - fix docstring + arg names
* remove colons in docstring
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* apply suggestion
* remove unused arg
* refactor a bit
  - use [:, None] for consistency
* refactor attention block
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
* quick fixes
* first attempt
* refactor attention block and fix all tests except "test_simple_generation"
  - added comments to better explain attention block
* remove debug lines and add TODO comment
* change `torch.bmm` to `torch.baddbmm`
  - fixes `test_simple_generation` but breaks `test_batch_generation_padd`
* styling
* all tests are passing now
  - use `bmm`
  - add explanation for `allow_fp16_reduced_precision_reduction`
Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
* styling
Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
* fix support for accelerate
Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* remove attn softmax in fp32
* refactor comments
* refactor a bit
  - remove warning message
  - remove print on test
* refer to pytorch t5
* change the slow tests
  - do the tests in fp32
  - remove some comments
  - keep large comments
* update expected output for `test_simple_generation`
  - we now test using fp32
* make style + change comments a bit
* fix dtype padd test

Co-authored-by: justheuristic <justheuristic@gmail.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
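Note on the "enhance alibi padding" items: the model-side change is not shown in this diff, so the following is only a minimal sketch of one loop-free way to derive ALiBi positions from a padded attention mask, namely `(cumsum(mask) - 1) * mask`. It uses the standard library so the arithmetic is easy to check; a tensor implementation would express the same thing with `cumsum` and broadcasting. The function names are hypothetical, and the slope formula `2^(-8*i/n_head)` is the one from the ALiBi paper for a power-of-two head count, which is an assumption here.

```python
from itertools import accumulate


def alibi_positions(attention_mask):
    """Token positions per row that ignore padding: (cumsum(mask) - 1) * mask."""
    return [
        [(running - 1) * keep for running, keep in zip(accumulate(row), row)]
        for row in attention_mask
    ]


def alibi_slopes(n_head):
    """Per-head slopes 2^(-8*i/n_head) for i = 1..n_head (assumes n_head is a power of two)."""
    return [2.0 ** (-8.0 * i / n_head) for i in range(1, n_head + 1)]


def alibi_bias(attention_mask, n_head):
    """Bias indexed as [batch][head][key_position] = slope * position."""
    slopes = alibi_slopes(n_head)
    return [
        [[slope * pos for pos in row] for slope in slopes]
        for row in alibi_positions(attention_mask)
    ]


# Left-padded batch: the first row has two padding tokens.
mask = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]
positions = alibi_positions(mask)
bias = alibi_bias(mask, n_head=4)
```

With left padding, the first real token of every row gets position 0, so the bias no longer depends on how much padding a row received.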
@@ -377,15 +377,34 @@ class BloomModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
     @slow
     @require_torch_gpu
     def test_simple_generation(self):
+        # This test is a bit flaky. For some GPU architectures, PyTorch sets allow_fp16_reduced_precision_reduction = True by default, and some operations
+        # do not give the same results under this configuration, especially torch.baddbmm and torch.bmm. https://pytorch.org/docs/stable/notes/numerical_accuracy.html#fp16-on-mi200
+        # We set allow_fp16_reduced_precision_reduction = True. Please see: https://pytorch.org/docs/stable/notes/cuda.html#reduced-precision-reduction-in-fp16-gemms
+        # This discrepancy is observed only with small models; results seem to be stable for larger models.
+        # Our conclusion is that these operations (`baddbmm` and `bmm`) are flaky for small inputs but seem to be stable for larger inputs, and therefore for larger models.
+
+        # Here is a summary of our ablation study:
+        # EXPECTED_OUTPUT = "I enjoy walking with my cute dog, and I love to watch the kids play. I am a very active person, and I am a very good listener. I am a very good person, and I am a very good person. I am a"
+        # 350m + allow_fp16_reduced_precision_reduction = False + torch.bmm ==> PASS
+        # 350m + allow_fp16_reduced_precision_reduction = False + torch.baddbmm ==> PASS
+        # 350m + allow_fp16_reduced_precision_reduction = True + torch.baddbmm ==> PASS
+        # 350m + allow_fp16_reduced_precision_reduction = True + torch.bmm ==> FAIL
+
+        # EXPECTED_OUTPUT = "I enjoy walking with my cute dog, but I also enjoy hiking, biking, and swimming. I love to cook and bake. I love to cook and bake. I love to cook and bake. I love to cook and bake. I love"
+        # >=760m + allow_fp16_reduced_precision_reduction = True + torch.baddbmm ==> PASS (for use_cache=True and use_cache=False)
+        # >=760m + allow_fp16_reduced_precision_reduction = True + torch.bmm ==> PASS
+        # >=760m + allow_fp16_reduced_precision_reduction = False + torch.bmm ==> PASS
+
         path_350m = "bigscience/bloom-350m"
-        model = BloomForCausalLM.from_pretrained(path_350m, torch_dtype="auto", use_cache=True).cuda()
+        model = BloomForCausalLM.from_pretrained(path_350m, use_cache=True).cuda()
         model = model.eval()
         tokenizer = BloomTokenizerFast.from_pretrained(path_350m)

         input_sentence = "I enjoy walking with my cute dog"
+        # This output has been obtained using fp32 model on the huggingface DGX workstation - NVIDIA A100 GPU
         EXPECTED_OUTPUT = (
-            "I enjoy walking with my cute dog, and I love to watch the kids play. I am a very active person, and I am"
-            " a very good listener. I am a very good person, and I am a very good person. I am a"
+            "I enjoy walking with my cute dog, and I love to watch the kids play with the kids. I am a very "
+            "active person, and I enjoy working out, and I am a very active person. I am a very active person, and I"
         )

         input_ids = tokenizer.encode(input_sentence, return_tensors="pt")
@@ -397,7 +416,7 @@ class BloomModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
     @require_torch_gpu
     def test_batch_generation(self):
         path_350m = "bigscience/bloom-350m"
-        model = BloomForCausalLM.from_pretrained(path_350m, torch_dtype="auto", use_cache=True).cuda()
+        model = BloomForCausalLM.from_pretrained(path_350m, use_cache=True).cuda()
         model = model.eval()
         tokenizer = BloomTokenizerFast.from_pretrained(path_350m, padding_side="left")

@@ -416,8 +435,9 @@ class BloomModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
     @slow
     @require_torch_gpu
     def test_batch_generation_padd(self):
+
         path_350m = "bigscience/bloom-350m"
-        model = BloomForCausalLM.from_pretrained(path_350m, torch_dtype="auto", use_cache=True).cuda()
+        model = BloomForCausalLM.from_pretrained(path_350m, use_cache=True).cuda()
         model = model.eval()
         tokenizer = BloomTokenizerFast.from_pretrained(path_350m, padding_side="left")

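Note on the comment block in `test_simple_generation`: the flakiness it describes comes from fp16 matrix multiplications whose intermediate sums may themselves be kept in fp16. The sketch below is only an illustration of that effect, not what the GPU kernels actually do: it emulates half precision with the standard library's `struct` format `"e"` and compares a dot product whose running sum is rounded to fp16 at every step against one that accumulates in higher precision and rounds once. The helper names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_fp16_accumulate(a, b):
    """Dot product whose running sum is rounded to fp16 after every addition."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x) * to_fp16(y))
    return acc


def dot_wide_accumulate(a, b):
    """Dot product of fp16 inputs, summed in double precision and rounded to fp16 once."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return to_fp16(acc)


ones = [1.0] * 4096
narrow = dot_fp16_accumulate(ones, ones)  # stalls once the fp16 spacing exceeds 1
wide = dot_wide_accumulate(ones, ones)
print(narrow, wide)  # 2048.0 4096.0
```

The fp16 accumulator stops growing at 2048 because consecutive half-precision values are 2 apart from there on, so adding 1 rounds back to the same number. In PyTorch the corresponding switch is `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction`; running the slow tests in fp32, as this commit does, sidesteps the question entirely.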