[CodeParrot] Near-deduplication with jaccard similarity (#17054)

* deduplication draft

* update style

* update style test

* dummy test main

* rename modules

* rename functions

* return extremes in deduplicate_clusters

* update style

* cast str for gzip

* update doc string

* time processing

* use dataset map to compute minhash

* fill value for short token

* remove da map method

* update style

* use share object to multiprocess

* update style

* use f-string and minor fix

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>

* update style

* use module parameters

* change ds_dedup to ds_filter

* save ds_dedup

* mv test to script tests

* make jaccard threshold a parameter of deduplicate_dataset

* update style

* add doc strings

* update style

* add doc string for DuplicationIndex

* save files into data dir

* update readme

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>

* make near deduplication optional

* move near deduplication in README

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* use f string

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>
This commit is contained in:
Jia LI
2022-06-21 14:23:36 +02:00
committed by GitHub
parent eb16be415a
commit da2bd2ae96
7 changed files with 334 additions and 5 deletions

View File

@@ -40,6 +40,7 @@ The source of the dataset is the GitHub dump available on Google's [BigQuery](ht
The raw dataset contains many duplicates. We deduplicated and filtered the dataset using the heuristics proposed in OpenAI's Codex [paper](https://arxiv.org/abs/2107.03374) and some new ones:
- exact deduplication using each file's hash
- near deduplication using MinHash and Jaccard similarity. MinHash with a Jaccard threshold (default=0.85) is first used to create duplicate clusters. Then these clusters are then reduced to unique files based on the exact Jaccard similarity. See `deduplicate_dataset` in `minhash_deduplication.py` for a detailed description.
- filtering files with max line length > 1000
- filtering files with mean line length > 100
- fraction of alphanumeric characters < 0.25