add exemplary training data. update to nvidia apex. refactor 'item -> line in doc' mapping. add warning for unknown word.

2018-12-20 18:30:52 +01:00
parent 17595ef2de
commit e5fc98c542
2 changed files with 71 additions and 95 deletions
--- a/README.md
+++ b/README.md
@@ -498,8 +498,8 @@ loss = 0.06423990014260186
 #### LM Fine-tuning

 The data should be a text file in the same format as [sample_text.txt](./samples/sample_text.txt)  (one sentence per line, docs separated by empty line).
-
-Training one epoch on a 500k sentence corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with `train_batch_size=200` and `max_seq_length=128`:
+You can download an [example training corpus](https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt) generated from Wikipedia articles and split into ~500k sentences with spaCy.
+Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with `train_batch_size=200` and `max_seq_length=128`:


 ```shell