HuggingFace_transformer

Author	SHA1	Message	Date
Rémi Louf	a67413ccc8	extend works in-place	2019-10-28 10:49:49 +01:00
Rémi Louf	b915ba9dfe	pad sequence with 0, mask with -1	2019-10-28 10:49:49 +01:00
thomwolf	8cd56e3036	fix data processing in script	2019-10-17 16:33:26 +02:00
Rémi Louf	578d23e061	add training pipeline (formatting temporary)	2019-10-17 14:02:27 +02:00
Rémi Louf	47a06d88a0	use two different tokenizers for story and summary	2019-10-17 13:04:26 +02:00
Rémi Louf	bfb9b540d4	add Model2Model to __init__	2019-10-17 12:59:51 +02:00
Rémi Louf	c1bc709c35	correct the truncation and padding of dataset	2019-10-17 10:41:53 +02:00
Rémi Louf	e4e0ee14bd	add separator between data import and train	2019-10-16 20:05:32 +02:00
Rémi Louf	1aec940587	test the full story processing	2019-10-15 15:18:07 +02:00
Rémi Louf	22e1af6859	truncation function is fully tested	2019-10-15 14:43:50 +02:00
Rémi Louf	260ac7d9a8	wip commit, switching computers	2019-10-15 12:24:35 +02:00
Rémi Louf	412793275d	delegate the padding with special tokens to the tokenizer	2019-10-14 20:45:16 +02:00
Rémi Louf	447fffb21f	process the raw CNN/Daily Mail dataset. The data provided by Li Dong et al. were already tokenized, which means that they are not compatible with all the models in the library. We thus process the raw data directly and tokenize them using the models' tokenizers.	2019-10-14 18:12:20 +02:00
Rémi Louf	67d10960ae	load and prepare CNN/Daily Mail data. We write a function to load and preprocess the CNN/Daily Mail dataset as provided by Li Dong et al. The issue is that this dataset has already been tokenized by the authors, so we actually need to find the original, plain-text dataset if we want to apply it to all models.	2019-10-14 14:11:20 +02:00
Rémi Louf	b3261e7ace	read parameters from CLI, load model & tokenizer	2019-10-11 18:40:38 +02:00
Rémi Louf	d889e0b71b	add base for seq2seq finetuning	2019-10-11 17:36:12 +02:00

16 Commits