add exemplary training data. update to nvidia apex. refactor 'item -> line in doc' mapping. add warning for unknown word.

tholor
2018-12-20 18:30:52 +01:00
parent 17595ef2de
commit e5fc98c542
2 changed files with 71 additions and 95 deletions


@@ -498,8 +498,8 @@ loss = 0.06423990014260186
#### LM Fine-tuning
The data should be a text file in the same format as [sample_text.txt](./samples/sample_text.txt) (one sentence per line, documents separated by an empty line).
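To make the expected layout concrete, here is a minimal sketch of how a file in this format could be parsed into documents. The helper name `read_corpus` and the sample string are hypothetical and only illustrate the format; they are not taken from the fine-tuning script in this repository.

```python
from typing import List


def read_corpus(text: str) -> List[List[str]]:
    """Split corpus text into documents, each a list of sentences.

    Hypothetical helper: assumes one sentence per line and one or more
    empty lines between documents, as described above.
    """
    docs: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        sentence = line.strip()
        if sentence:
            current.append(sentence)
        elif current:
            # An empty line closes the current document.
            docs.append(current)
            current = []
    if current:
        # The last document may not be followed by an empty line.
        docs.append(current)
    return docs


sample = (
    "First sentence of document one.\n"
    "Second sentence of document one.\n"
    "\n"
    "Only sentence of document two.\n"
)
docs = read_corpus(sample)
print(len(docs), [len(d) for d in docs])
```

Keeping the document boundaries matters if the training objective samples sentence pairs from within the same document, so a corpus with missing empty lines would most likely be read as one very long document.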
Training one epoch on a 500k sentence corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with `train_batch_size=200` and `max_seq_length=128`:
You can download an [example training corpus](https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt) generated from Wikipedia articles and split into ~500k sentences with spaCy.
Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with `train_batch_size=200` and `max_seq_length=128`:
```shell