From dc31a72f505bc115a2214a68c8ea7c956f98fd1b Mon Sep 17 00:00:00 2001
From: Kevin Canwen Xu
Date: Sat, 11 Jul 2020 21:37:30 +0800
Subject: [PATCH] Add Microsoft's CodeBERT (#5683)

* Add Microsoft's CodeBERT

* link style

* single modal

* unused import
---
 .../microsoft/codebert-base-mlm/README.md     | 46 +++++++++++++++++++
 model_cards/microsoft/codebert-base/README.md | 27 +++++++++++
 2 files changed, 73 insertions(+)
 create mode 100644 model_cards/microsoft/codebert-base-mlm/README.md
 create mode 100644 model_cards/microsoft/codebert-base/README.md

diff --git a/model_cards/microsoft/codebert-base-mlm/README.md b/model_cards/microsoft/codebert-base-mlm/README.md
new file mode 100644
index 0000000000..2801f42d23
--- /dev/null
+++ b/model_cards/microsoft/codebert-base-mlm/README.md
@@ -0,0 +1,46 @@
+## CodeBERT-base-mlm
+Pretrained weights for [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155).
+
+### Training Data
+The model is trained on the code corpus of [CodeSearchNet](https://github.com/github/CodeSearchNet).
+
+### Training Objective
+This model is initialized with RoBERTa-base and trained with a simple MLM (Masked Language Model) objective.
+
+### Usage
+```python
+from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
+
+model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm')
+tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')
+
+code_example = "if (x is not None) <mask> (x>1)"
+fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
+
+outputs = fill_mask(code_example)
+print(outputs)
+```
+Expected results:
+```
+{'sequence': ' if (x is not None) and (x>1)', 'score': 0.6049249172210693, 'token': 8}
+{'sequence': ' if (x is not None) or (x>1)', 'score': 0.30680200457572937, 'token': 50}
+{'sequence': ' if (x is not None) if (x>1)', 'score': 0.02133703976869583, 'token': 114}
+{'sequence': ' if (x is not None) then (x>1)', 'score': 0.018607674166560173, 'token': 172}
+{'sequence': ' if (x is not None) AND (x>1)', 'score': 0.007619690150022507, 'token': 4248}
+```
+
+### Reference
+1. [Bimodal CodeBERT trained with MLM+RTD objective](https://huggingface.co/microsoft/codebert-base) (suitable for code search and document generation)
+2. 🤗 [Hugging Face's CodeBERTa](https://huggingface.co/huggingface/CodeBERTa-small-v1) (small size, 6 layers)
+
+### Citation
+```bibtex
+@misc{feng2020codebert,
+    title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
+    author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
+    year={2020},
+    eprint={2002.08155},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```
diff --git a/model_cards/microsoft/codebert-base/README.md b/model_cards/microsoft/codebert-base/README.md
new file mode 100644
index 0000000000..4ae9f25fc2
--- /dev/null
+++ b/model_cards/microsoft/codebert-base/README.md
@@ -0,0 +1,27 @@
+## CodeBERT-base
+Pretrained weights for [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155).
+
+### Training Data
+The model is trained on the bi-modal data (documents & code) of [CodeSearchNet](https://github.com/github/CodeSearchNet).
+
+### Training Objective
+This model is initialized with RoBERTa-base and trained with the MLM+RTD objective (cf. the paper).
+
+### Usage
+Please see [the official repository](https://github.com/microsoft/CodeBERT) for scripts that support "code search" and "code-to-document generation".
+
+### Reference
+1. [CodeBERT trained with Masked LM objective](https://huggingface.co/microsoft/codebert-base-mlm) (suitable for code completion)
+2. 🤗 [Hugging Face's CodeBERTa](https://huggingface.co/huggingface/CodeBERTa-small-v1) (small size, 6 layers)
+
+### Citation
+```bibtex
+@misc{feng2020codebert,
+    title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
+    author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
+    year={2020},
+    eprint={2002.08155},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL}
+}
+```