[CodeParrot] Near-deduplication with jaccard similarity (#17054)

* deduplication draft

* update style

* update style test

* dummy test main

* rename modules

* rename functions

* return extremes in deduplicate_clusters

* update style

* cast str for gzip

* update doc string

* time processing

* use dataset map to compute minhash

* fill value for short token

* remove da map method

* update style

* use shared object for multiprocessing

* update style

* use f-string and minor fix

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>

* update style

* use module parameters

* change ds_dedup to ds_filter

* save ds_dedup

* mv test to script tests

* make jaccard threshold a parameter of deduplicate_dataset

* update style

* add doc strings

* update style

* add doc string for DuplicationIndex

* save files into data dir

* update readme

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>

* make near deduplication optional

* move near deduplication in README

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* use f string

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna Ben Allal <44069155+loubnabnl@users.noreply.github.com>
Jia LI
2022-06-21 14:23:36 +02:00
committed by GitHub
parent eb16be415a
commit da2bd2ae96
7 changed files with 334 additions and 5 deletions
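
For context on the `jaccard_threshold` parameter introduced in this commit, here is a minimal sketch of Jaccard similarity between two code samples. It is not the code from this commit: the function names, the whitespace tokenization, and the `>=` comparison are assumptions for illustration. The commit messages mention MinHash, which approximates this measure; the exact set-based version below only shows what the threshold is compared against.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity of two token collections, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Two empty samples: define the similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(set_a & set_b) / len(union)


def is_near_duplicate(code_a, code_b, jaccard_threshold=0.85):
    """Hypothetical helper: flag a pair whose similarity reaches the threshold."""
    # Whitespace tokenization is an assumption made for this sketch only.
    return jaccard_similarity(code_a.split(), code_b.split()) >= jaccard_threshold
```

With the default of 0.85 shown in the diff, two samples would be treated as near-duplicates only when they share the large majority of their distinct tokens.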

@@ -157,6 +157,12 @@ class PreprocessingArguments:
         default="lvwerra/codeparrot",
         metadata={"help": "Name or path to the tokenizer."},
     )
+    near_deduplication: Optional[bool] = field(
+        default=False, metadata={"help": "If True, near-duplicate samples are removed."}
+    )
+    jaccard_threshold: Optional[float] = field(
+        default=0.85, metadata={"help": "Jaccard threshold for near-duplicate samples."}
+    )
 
 
 @dataclass