Below is the list of corpora used, along with the output of the wc command (line, word, and character counts). These corpora were concatenated and tokenized with the HuggingFace RoBERTa tokenizer.
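For readers who want to reproduce wc-style counts and the concatenation step on their own files, here is a minimal sketch using only the Python standard library. The function names `wc_counts` and `concatenate` are hypothetical helpers for illustration, not part of the original pipeline, and the tokenization step is not shown. Note that wc reports bytes in its third column by default (characters only with `-m`), which differ for Cyrillic text in UTF-8; the sketch assumes the byte count.

```python
def wc_counts(text: str) -> tuple[int, int, int]:
    """Return (lines, words, bytes), mirroring the default columns of wc."""
    lines = text.count("\n")            # wc counts newline characters
    words = len(text.split())           # whitespace-separated tokens
    size = len(text.encode("utf-8"))    # byte count, assuming UTF-8 files
    return lines, words, size


def concatenate(paths: list[str], out_path: str) -> None:
    """Write the files in `paths` to `out_path`, one after another."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:        # stream line by line to keep memory low
                    out.write(line)
```

Streaming line by line keeps memory use flat even for multi-gigabyte corpora; `wc_counts` as written takes a whole string, so for very large files the same counts would need to be accumulated per line.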
Vitalii Radchenko - contact me on Twitter @vitaliradchenko