Fix codeparrot deduplication - ignore whitespaces (#18023)

* ignore whitespaces for hash

* reformat code

* Update README.md
Loubna Ben Allal
2022-07-28 15:58:26 +02:00
committed by GitHub
parent 5d1fed0740
commit 286a18fa00
2 changed files with 6 additions and 2 deletions


@@ -39,7 +39,7 @@ The source of the dataset is the GitHub dump available on Google's [BigQuery](ht
 ### Preprocessing
 The raw dataset contains many duplicates. We deduplicated and filtered the dataset using the heuristics proposed in OpenAI's Codex [paper](https://arxiv.org/abs/2107.03374) and some new ones:
-- exact deduplication using each file's hash
+- exact deduplication using each file's hash after having removed whitespaces.
 - near deduplication using MinHash and Jaccard similarity. MinHash with a Jaccard threshold (default=0.85) is first used to create duplicate clusters. These clusters are then reduced to unique files based on the exact Jaccard similarity. See `deduplicate_dataset` in `minhash_deduplication.py` for a detailed description.
 - filtering files with max line length > 1000
 - filtering files with mean line length > 100
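
The two line-length filters at the end of this list can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names `line_length_stats` and `keep_file` are hypothetical, and only the thresholds (1000 and 100) come from the README text.

```python
def line_length_stats(content: str) -> dict:
    """Compute the max and mean line length of a file's content."""
    lengths = [len(line) for line in content.splitlines()]
    if not lengths:
        # Empty content: treat both statistics as zero.
        return {"line_max": 0, "line_mean": 0.0}
    return {"line_max": max(lengths), "line_mean": sum(lengths) / len(lengths)}


def keep_file(content: str, max_line_length: int = 1000, mean_line_length: int = 100) -> bool:
    """Hypothetical filter: drop files whose max or mean line length exceeds the thresholds."""
    stats = line_length_stats(content)
    return stats["line_max"] <= max_line_length and stats["line_mean"] <= mean_line_length


print(keep_file("x = 1\ny = 2\n"))
print(keep_file("a" * 1001))
print(keep_file("\n".join(["a" * 150] * 3)))
```

The first call keeps a short, ordinary file; the second rejects a file containing a single 1001-character line; the third rejects a file whose lines are all 150 characters long, because the mean exceeds 100 even though the maximum does not exceed 1000.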


@@ -3,6 +3,7 @@ import hashlib
 import json
 import multiprocessing
 import os
+import re
 import shutil
 import time
 from pathlib import Path
@@ -15,9 +16,12 @@ from minhash_deduplication import deduplicate_dataset
 from transformers import AutoTokenizer, HfArgumentParser
+PATTERN = re.compile(r"\s+")
 def get_hash(example):
     """Get hash of content field."""
-    return {"hash": hashlib.md5(example["content"].strip().encode("utf-8")).hexdigest()}
+    return {"hash": hashlib.md5(re.sub(PATTERN, "", example["content"]).encode("utf-8")).hexdigest()}
 def line_stats(example):
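
The effect of the change to `get_hash` can be seen in a small standalone comparison. The sketch below is illustrative only: `hash_stripped` and `hash_no_whitespace` are hypothetical names that mirror the old and new `return` expressions, and the two sample strings are made up.

```python
import hashlib
import re

PATTERN = re.compile(r"\s+")


def hash_stripped(content: str) -> str:
    """Old behavior: only leading and trailing whitespace is ignored."""
    return hashlib.md5(content.strip().encode("utf-8")).hexdigest()


def hash_no_whitespace(content: str) -> str:
    """New behavior: every run of whitespace is removed before hashing."""
    return hashlib.md5(re.sub(PATTERN, "", content).encode("utf-8")).hexdigest()


# Same code, different indentation, spacing, and trailing newlines.
a = "def f(x):\n    return x + 1\n"
b = "def f(x):\n\treturn x+1\n\n"

print(hash_stripped(a) == hash_stripped(b))            # False
print(hash_no_whitespace(a) == hash_no_whitespace(b))  # True
```

With `strip()`, the two files hash differently because their interior whitespace differs; once all whitespace is removed, they collapse to the same string and are treated as exact duplicates. The trade-off is that any two files differing only in whitespace now collide, including cases where the whitespace is significant (for example `"ab c"` and `"a bc"`).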