* fixup mamba2 - caching and several other small fixes
* fixup cached forward
* correct fix this time
* fixup cache - we do not need to extend the attn mask it's handled by generate (gives total ids + mask at each step)
* remove unnecessary (un)squeeze
* fixup cache position
* simplify a few things
* [run-slow] mamba2
* multi gpu attempt two
* [run-slow] mamba2
* [run-slow] mamba2
* [run-slow] mamba2
* [run-slow] mamba2
* add newer slow path fix
* [run-slow] mamba2