Fix many HPU failures in the CI (#39066)

* more torch.hpu patches

* increase top_k because the previous value results in flaky behavior when Temperature, TopP and TopK are used together, which ends up killing beams early.

* remove temporary fix

* fix scatter operation when input and src are the same

* trigger

* fix and reduce

* skip finding batch size as it makes the hpu go loco

* fix fsdp (yay all are passing)

* fix checking equal nan values

* style

* remove models list

* order

* rename to cuda_extensions

* Update src/transformers/trainer.py
Ilyas Moutawwakil
2025-07-03 11:17:27 +02:00
committed by GitHub
parent bff964c429
commit 18e0cae207
5 changed files with 71 additions and 54 deletions

@@ -62,4 +62,5 @@ if __name__ == "__main__":
start = end
end = start + num_jobs_per_splits + (1 if idx < num_jobs % args.num_splits else 0)
model_splits.append(d[start:end])
print(model_splits)
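
The hunk above shows only the body of the splitting loop, so the sketch below restates the same arithmetic as a standalone function to make the chunk sizes easier to follow. The function name `split_evenly` and its parameters are hypothetical and not part of the commit; the loop header and the integer division that defines `num_jobs_per_splits` are assumptions, since neither appears in the hunk.

```python
def split_evenly(items, num_splits):
    """Split items into num_splits contiguous chunks whose sizes differ by at most one.

    Hypothetical helper: mirrors the start/end arithmetic from the hunk,
    assuming num_jobs_per_splits is len(items) // num_splits.
    """
    num_jobs = len(items)
    num_jobs_per_splits = num_jobs // num_splits
    splits = []
    end = 0
    for idx in range(num_splits):
        start = end
        # The first (num_jobs % num_splits) chunks each take one extra item.
        end = start + num_jobs_per_splits + (1 if idx < num_jobs % num_splits else 0)
        splits.append(items[start:end])
    return splits


if __name__ == "__main__":
    # 7 items over 3 splits: the remainder of 1 goes to the first chunk.
    print(split_evenly(["a", "b", "c", "d", "e", "f", "g"], 3))
    # prints [['a', 'b', 'c'], ['d', 'e'], ['f', 'g']]
```

Handing the remainder to the leading chunks keeps every item in exactly one chunk and preserves the original order, at the cost of the first chunks being one item larger than the rest.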