chore: Fix multiple typos (#28574)

hugo-syn
2024-01-18 14:35:09 +01:00
committed by GitHub
parent 8189977885
commit 5d8eb93eee
5 changed files with 5 additions and 5 deletions

@@ -50,7 +50,7 @@ The raw dataset contains many duplicates. We deduplicated and filtered the datas
 - fraction of alphanumeric characters < 0.25
 - containing the word "auto-generated" or similar in the first 5 lines
 - filtering with a probability of 0.7 of files with a mention of "test file" or "configuration file" or similar in the first 5 lines
-- filtering with a probability of 0.7 of files with high occurence of the keywords "test " or "config"
+- filtering with a probability of 0.7 of files with high occurrence of the keywords "test " or "config"
 - filtering with a probability of 0.7 of files without a mention of the keywords `def` , `for`, `while` and `class`
 - filtering files that use the assignment operator `=` less than 5 times
 - filtering files with ratio between number of characters and number of tokens after tokenization < 1.5 (the average ratio is 3.6)
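
For readers of this hunk, the heuristics listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing code. The function names, the marker variants for "auto-generated", the `keyword_ratio` cutoff for "high occurrence", and the reading of "without a mention" as "none of the four keywords appear" are all assumptions, since the text does not specify them. The tokenizer is passed in as an optional callable so the sketch stays self-contained.

```python
import random

# Assumed spellings of "auto-generated or similar"; the source does not list them.
AUTOGEN_MARKERS = ("auto-generated", "autogenerated", "automatically generated")
TEST_CONFIG_MARKERS = ("test file", "configuration file")
CODE_KEYWORDS = ("def", "for", "while", "class")


def alnum_fraction(text):
    """Fraction of characters in `text` that are alphanumeric."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def first_lines(text, n=5):
    """Lower-cased concatenation of the first `n` lines."""
    return "\n".join(text.splitlines()[:n]).lower()


def should_drop(text, rng, keyword_ratio=0.05, drop_prob=0.7, tokenize=None):
    """Return True if a file would be filtered out under the listed heuristics.

    `keyword_ratio` (keyword hits per line) is a hypothetical threshold for
    "high occurrence"; `tokenize` is any callable returning a token sequence.
    """
    head = first_lines(text)

    # Deterministic filters.
    if alnum_fraction(text) < 0.25:
        return True
    if any(marker in head for marker in AUTOGEN_MARKERS):
        return True
    if text.count("=") < 5:
        return True
    if tokenize is not None:
        n_tokens = len(tokenize(text))
        if n_tokens == 0 or len(text) / n_tokens < 1.5:
            return True

    # Probabilistic filters: a matching file is dropped with probability drop_prob.
    n_lines = max(len(text.splitlines()), 1)
    keyword_hits = text.count("test ") + text.count("config")
    suspicious = (
        any(marker in head for marker in TEST_CONFIG_MARKERS)
        or keyword_hits / n_lines > keyword_ratio
        or not any(keyword in text for keyword in CODE_KEYWORDS)
    )
    return suspicious and rng.random() < drop_prob
```

Passing a seeded `random.Random` instance as `rng` keeps the probabilistic filters reproducible across runs, for example `should_drop(source_text, random.Random(0))`.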