Fix codeparrot deduplication - ignore whitespaces (#18023)
* ignore whitespace for hash
* reformat code
* Update README.md
@@ -39,7 +39,7 @@ The source of the dataset is the GitHub dump available on Google's [BigQuery](ht
 ### Preprocessing
 The raw dataset contains many duplicates. We deduplicated and filtered the dataset using the heuristics proposed in OpenAI's Codex [paper](https://arxiv.org/abs/2107.03374) and some new ones:
 
-- exact deduplication using each file's hash
+- exact deduplication using each file's hash after removing all whitespace.
 - near deduplication using MinHash and Jaccard similarity. MinHash with a Jaccard threshold (default=0.85) is first used to create duplicate clusters. These clusters are then reduced to unique files based on the exact Jaccard similarity. See `deduplicate_dataset` in `minhash_deduplication.py` for a detailed description.
 - filtering files with max line length > 1000
 - filtering files with mean line length > 100
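The near-deduplication and line-length heuristics listed in the README hunk above can be illustrated with a minimal sketch. This is not the code in `minhash_deduplication.py`: the helper names `jaccard_similarity` and `passes_line_length_filters` are hypothetical, and splitting on whitespace to get tokens is an assumption made only for the example. The thresholds (0.85, 1000, 100) are the ones quoted in the README.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Exact Jaccard similarity of two token collections: |A & B| / |A | B|."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        # Two empty files are treated as identical (an assumption of this sketch).
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def passes_line_length_filters(content, max_line_length=1000, mean_line_length=100):
    """Return True if the file survives both line-length filters."""
    lengths = [len(line) for line in content.splitlines()]
    if not lengths:
        return True
    longest = max(lengths)
    mean = sum(lengths) / len(lengths)
    return longest <= max_line_length and mean <= mean_line_length


# Hypothetical inputs, tokenized by whitespace for illustration only.
similarity = jaccard_similarity("a b c d".split(), "a b c e".split())
is_near_duplicate = similarity >= 0.85
```

Here the two token sets share three of five distinct tokens, so the pair falls well below the 0.85 threshold and would not be clustered together.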
@@ -3,6 +3,7 @@ import hashlib
 import json
 import multiprocessing
 import os
+import re
 import shutil
 import time
 from pathlib import Path
@@ -15,9 +16,12 @@ from minhash_deduplication import deduplicate_dataset
 from transformers import AutoTokenizer, HfArgumentParser
 
 
+PATTERN = re.compile(r"\s+")
+
+
 def get_hash(example):
     """Get hash of content field."""
-    return {"hash": hashlib.md5(example["content"].strip().encode("utf-8")).hexdigest()}
+    return {"hash": hashlib.md5(re.sub(PATTERN, "", example["content"]).encode("utf-8")).hexdigest()}
 
 
 def line_stats(example):
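To see what the change to `get_hash` does in practice, the sketch below compares the old and new hashing on two files that differ only in indentation and trailing blank lines. The function names `hash_stripped` and `hash_no_whitespace` and the sample strings are hypothetical; only the two hashing expressions are taken from the diff.

```python
import hashlib
import re

PATTERN = re.compile(r"\s+")


def hash_stripped(content):
    """Old behaviour: only leading and trailing whitespace is ignored."""
    return hashlib.md5(content.strip().encode("utf-8")).hexdigest()


def hash_no_whitespace(content):
    """New behaviour: every run of whitespace is removed before hashing."""
    return hashlib.md5(re.sub(PATTERN, "", content).encode("utf-8")).hexdigest()


# Hypothetical sample files: same code, spaces vs. a tab, extra trailing newlines.
a = "def f(x):\n    return x + 1\n"
b = "def f(x):\n\treturn x + 1\n\n\n"

same_old = hash_stripped(a) == hash_stripped(b)            # indentation still differs
same_new = hash_no_whitespace(a) == hash_no_whitespace(b)  # whitespace no longer matters
```

With the old hash the two files count as distinct, while the new hash treats them as exact duplicates. The trade-off is that files differing only in whitespace inside string literals or between tokens (for example `"a b"` and `"ab"`) now also collide.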