Reformer enwik8 - Model card (#4286)
parent
b290c32e16
commit
336116d960
57
model_cards/google/reformer-enwik8/README.md
Normal file
@@ -0,0 +1,57 @@
## Reformer language model trained at the character level on enwik8

*enwik8* is a dataset based on Wikipedia that is often used to measure a model's ability to *compress* data, *e.g.* in the scope of the *Hutter Prize*: https://en.wikipedia.org/wiki/Hutter_Prize.

`reformer-enwik8` was pretrained on the first 90M chars of *enwik8*, with the text chunked into batches of 65536 chars (= 2^16).
The model's weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted to Hugging Face's PyTorch ReformerLM model `ReformerModelWithLMHead`.
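
The chunk sizes quoted above can be checked with a little arithmetic. The following is a minimal sketch, assuming that one "char" corresponds to one byte of *enwik8*; `chunk_text` is a hypothetical helper, not part of the training code, and how the original pipeline treated a trailing partial chunk is not stated here.

```python
def chunk_text(data: bytes, chunk_size: int = 2 ** 16) -> list:
    """Split data into consecutive chunks of at most chunk_size bytes (hypothetical helper)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


# Arithmetic for the sizes quoted above; no dataset is loaded here.
total_chars = 90_000_000
chunk_size = 2 ** 16  # 65536
full_chunks, remainder = divmod(total_chars, chunk_size)
print(full_chunks, remainder)  # 1373 19072

# Tiny demonstration of the helper on toy data.
print(chunk_text(b"abcdefghij", chunk_size=4))  # [b'abcd', b'efgh', b'ij']
```

Under that assumption, 90M chars give 1373 full chunks of 65536 chars, with 19072 chars left over.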

The model is a language model that operates on characters.
Therefore, this model does not need a tokenizer. The following functions can be used instead for **encoding** and **decoding**:

```python
import torch

# Encoding
def encode(list_of_strings, pad_to_max_length=True, pad_token_id=0):
    max_length = max([len(string) for string in list_of_strings])

    # create empty tensors
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # make sure string is in byte format
        if not isinstance(string, bytes):
            string = str.encode(string)

        # shift every byte value by 2; shorter strings stay padded with pad_token_id
        input_ids[idx, :len(string)] = torch.tensor([x + 2 for x in string])
        attention_masks[idx, :len(string)] = 1

    return input_ids, attention_masks

# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # transform ids back to chars; ids < 2 are simply transformed to ""
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
```
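
To make the id mapping concrete, here is a torch-free sketch of the same scheme on plain Python lists. The names `encode_batch` and `decode_batch` are hypothetical and not part of the model card or of `transformers`. One deliberate difference is labeled in the code: this `decode_batch` reassembles the bytes and decodes them as UTF-8, whereas the `decode` above maps each byte to a character on its own, so a multi-byte character such as "é" would come back as two separate characters there.

```python
def encode_batch(texts, pad_token_id=0):
    """Hypothetical list-based variant of encode: byte value + 2, right-padded."""
    byte_strings = [t if isinstance(t, bytes) else t.encode("utf-8") for t in texts]
    max_length = max(len(b) for b in byte_strings)

    input_ids, attention_masks = [], []
    for b in byte_strings:
        padding = max_length - len(b)
        # real positions get id = byte + 2 and mask 1; padded positions get mask 0
        input_ids.append([x + 2 for x in b] + [pad_token_id] * padding)
        attention_masks.append([1] * len(b) + [0] * padding)
    return input_ids, attention_masks


def decode_batch(batch_ids):
    """Drop ids < 2, undo the +2 shift, then decode the bytes as UTF-8.

    Assumption: UTF-8 decoding is swapped in here; the card's decode uses chr per byte.
    """
    return [
        bytes(x - 2 for x in ids if x > 1).decode("utf-8", errors="replace")
        for ids in batch_ids
    ]


ids, masks = encode_batch(["Hi", "café"])
print(ids)    # [[74, 107, 0, 0, 0], [101, 99, 104, 197, 171]]
print(masks)  # [[1, 1, 0, 0, 0], [1, 1, 1, 1, 1]]
print(decode_batch(ids))  # ['Hi', 'café']
```

Note that "café" occupies five positions, not four, because lengths are counted in bytes after UTF-8 encoding.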

Text can be generated as follows:

```python
from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))

# gives, for example (sampling is random, so the continuation varies between runs):
# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro
```

***Note***: Language generation using `ReformerModelWithLMHead` is not optimized yet and is rather slow.