prepare_seq2seq_batch makes labels; decoder_input_ids are made later. (#6654)

* broken test
* batch parity
* tests pass
* boom boom
* boom boom
* split out bart tokenizer tests
* fix tests
* boom boom
* Fixed dataset bug
* Fix marian
* Undo extra
* Get marian working
* Fix t5 tok tests
* Test passing
* Cleanup
* better assert msg
* require torch
* Fix mbart tests
* undo extra decoder_attn_mask change
* Fix import
* pegasus tokenizer can ignore src_lang kwargs
* unused kwarg test cov
* boom boom
* add todo for pegasus issue
* cover one word translation edge case
* Cleanup
* doc
2020-08-28 11:15:17 -04:00
parent cb276b41de
commit 9336086ab5
20 changed files with 429 additions and 290 deletions
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -71,8 +71,8 @@ Summarization Tips:
 (It rarely makes sense to start from `bart-large` unless you are researching finetuning methods).

 **Update 2018-07-18**
-Datasets: `Seq2SeqDataset` should be used for all tokenizers without a `prepare_seq2seq_batch` method. For those who do (like Marian, MBart), `TranslationDataset` should be used.**
-A new dataset is needed to support multilingual tasks.
+Datasets: `LegacySeq2SeqDataset` will be used for all tokenizers without a `prepare_seq2seq_batch` method. Otherwise, `Seq2SeqDataset` will be used.
+Future work/help wanted: A new dataset to support multilingual tasks.


 ### Command Line Options
@@ -106,7 +106,7 @@ The following command should work on a 16GB GPU:
    --train_batch_size=1 \
    --eval_batch_size=1 \
    --output_dir=xsum_results \
-    --num_train_epochs 1 \
+    --num_train_epochs 6 \
    --model_name_or_path facebook/bart-large
 ```
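
The dataset rule stated in the README hunk above (tokenizers with a `prepare_seq2seq_batch` method get `Seq2SeqDataset`, all others get `LegacySeq2SeqDataset`) can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the two dataset classes are empty stand-ins that only reuse the README's names, and `pick_dataset_class`, `TokenizerWithBatchMethod`, and `PlainTokenizer` are hypothetical names introduced for the example.

```python
class LegacySeq2SeqDataset:
    """Empty stand-in that reuses the README's class name (not the real implementation)."""


class Seq2SeqDataset:
    """Empty stand-in that reuses the README's class name (not the real implementation)."""


def pick_dataset_class(tokenizer):
    """Hypothetical helper: choose a dataset class from the tokenizer's capabilities."""
    # Assumption: "has a prepare_seq2seq_batch method" is checked by attribute lookup.
    if callable(getattr(tokenizer, "prepare_seq2seq_batch", None)):
        return Seq2SeqDataset
    return LegacySeq2SeqDataset


class TokenizerWithBatchMethod:
    """Hypothetical tokenizer exposing prepare_seq2seq_batch."""

    def prepare_seq2seq_batch(self, src_texts, tgt_texts=None):
        # Placeholder body; a real tokenizer would return encoded tensors.
        return {"src_texts": src_texts, "tgt_texts": tgt_texts}


class PlainTokenizer:
    """Hypothetical tokenizer without prepare_seq2seq_batch."""


print(pick_dataset_class(TokenizerWithBatchMethod()).__name__)
print(pick_dataset_class(PlainTokenizer()).__name__)
```

Dispatching on the presence of the method, rather than on a list of tokenizer classes, means a tokenizer that later gains `prepare_seq2seq_batch` would switch datasets without any change to the selection logic.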