Fix Colab links + install dependencies first.
Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
@@ -2,6 +2,12 @@
|
|||||||
"cells": [
|
"cells": [
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"is_executing": false,
|
||||||
|
"name": "#%% md\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"source": [
|
"source": [
|
||||||
"## Tokenization doesn't have to be slow !\n",
|
"## Tokenization doesn't have to be slow !\n",
|
||||||
"\n",
|
"\n",
|
||||||
@@ -81,34 +87,46 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"All of these building blocks can be combined to create working tokenization pipelines. \n",
|
"All of these building blocks can be combined to create working tokenization pipelines. \n",
|
||||||
"In the next section we will go over our first pipeline."
|
"In the next section we will go over our first pipeline."
|
||||||
],
|
]
|
||||||
"metadata": {
|
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
|
||||||
"name": "#%% md\n",
|
|
||||||
"is_executing": false
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%% md\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"source": [
|
"source": [
|
||||||
"Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
|
"Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
|
||||||
"\n",
|
"\n",
|
||||||
"For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a quite small input for the purpose of this notebook.\n",
|
"For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a quite small input for the purpose of this notebook.\n",
|
||||||
"We will work with [the file from peter Norvig](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
|
"We will work with [the file from Peter Norvig](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
|
||||||
"This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer."
|
"This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer.\n"
|
||||||
],
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
"pycharm": {
|
||||||
"name": "#%% md\n"
|
"is_executing": false,
|
||||||
}
|
"name": "#%% code\n"
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"!pip install tokenizers"
|
||||||
|
]
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 2,
|
"execution_count": 2,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"is_executing": false,
|
||||||
|
"name": "#%% code\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
|
"BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
|
||||||
@@ -122,33 +140,31 @@
|
|||||||
" big_f.write(response.content)\n",
|
" big_f.write(response.content)\n",
|
||||||
" else:\n",
|
" else:\n",
|
||||||
" print(\"Unable to get the file: {}\".format(response.reason))\n"
|
" print(\"Unable to get the file: {}\".format(response.reason))\n"
|
||||||
],
|
]
|
||||||
"metadata": {
|
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
|
||||||
"name": "#%% code\n",
|
|
||||||
"is_executing": false
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"is_executing": false,
|
||||||
|
"name": "#%% md\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"source": [
|
"source": [
|
||||||
" \n",
|
" \n",
|
||||||
"Now that we have our training data we need to create the overall pipeline for the tokenizer\n",
|
"Now that we have our training data we need to create the overall pipeline for the tokenizer\n",
|
||||||
" "
|
" "
|
||||||
],
|
]
|
||||||
"metadata": {
|
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
|
||||||
"name": "#%% md\n",
|
|
||||||
"is_executing": false
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 10,
|
"execution_count": 10,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"is_executing": false,
|
||||||
|
"name": "#%% code\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
|
"# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
|
||||||
@@ -165,49 +181,47 @@
|
|||||||
"tokenizer = Tokenizer(BPE.empty())\n",
|
"tokenizer = Tokenizer(BPE.empty())\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# Then we enable lower-casing and unicode-normalization\n",
|
"# Then we enable lower-casing and unicode-normalization\n",
|
||||||
"# The Sequence normalizer allows us to combine multiple Normalizer, that will be\n",
|
"# The Sequence normalizer allows us to combine multiple Normalizer that will be\n",
|
||||||
"# executed in sequence.\n",
|
"# executed in order.\n",
|
||||||
"tokenizer.normalizer = Sequence([\n",
|
"tokenizer.normalizer = Sequence([\n",
|
||||||
" NFKC(),\n",
|
" NFKC(),\n",
|
||||||
" Lowercase()\n",
|
" Lowercase()\n",
|
||||||
"])\n",
|
"])\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# Out tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
|
"# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
|
||||||
"tokenizer.pre_tokenizer = ByteLevel()\n",
|
"tokenizer.pre_tokenizer = ByteLevel()\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# And finally, let's plug a decoder so we can recover from a tokenized input to the original one\n",
|
"# And finally, let's plug a decoder so we can recover from a tokenized input to the original one\n",
|
||||||
"tokenizer.decoder = ByteLevelDecoder()"
|
"tokenizer.decoder = ByteLevelDecoder()"
|
||||||
],
|
]
|
||||||
"metadata": {
|
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
|
||||||
"name": "#%% code\n",
|
|
||||||
"is_executing": false
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"source": [
|
|
||||||
"The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
|
|
||||||
],
|
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
"pycharm": {
|
||||||
"name": "#%% md\n"
|
"name": "#%% md\n"
|
||||||
}
|
}
|
||||||
}
|
},
|
||||||
|
"source": [
|
||||||
|
"The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
|
||||||
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 11,
|
"execution_count": 11,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"is_executing": false,
|
||||||
|
"name": "#%% code\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"name": "stdout",
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
"text": [
|
"text": [
|
||||||
"Trained vocab size: 25000\n"
|
"Trained vocab size: 25000\n"
|
||||||
],
|
]
|
||||||
"output_type": "stream"
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"source": [
|
"source": [
|
||||||
@@ -218,79 +232,77 @@
|
|||||||
"tokenizer.train(trainer, [\"big.txt\"])\n",
|
"tokenizer.train(trainer, [\"big.txt\"])\n",
|
||||||
"\n",
|
"\n",
|
||||||
"print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
|
"print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
|
||||||
],
|
]
|
||||||
"metadata": {
|
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
|
||||||
"name": "#%% code\n",
|
|
||||||
"is_executing": false
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%% md\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"source": [
|
"source": [
|
||||||
"Et voilà ! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
|
"Et voilà ! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
|
||||||
"covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
|
"covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
|
||||||
"on the `Trainer` class, but the overall process should be very similar.\n",
|
"on the `Trainer` class, but the overall process should be very similar.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"We can save the content of the model to reuse it later."
|
"We can save the content of the model to reuse it later."
|
||||||
],
|
]
|
||||||
"metadata": {
|
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
|
||||||
"name": "#%% md\n"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 12,
|
"execution_count": 12,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"is_executing": false,
|
||||||
|
"name": "#%% code\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"data": {
|
"data": {
|
||||||
"text/plain": "['./vocab.json', './merges.txt']"
|
"text/plain": [
|
||||||
|
"['./vocab.json', './merges.txt']"
|
||||||
|
]
|
||||||
},
|
},
|
||||||
|
"execution_count": 12,
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"output_type": "execute_result",
|
"output_type": "execute_result"
|
||||||
"execution_count": 12
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"source": [
|
"source": [
|
||||||
"# You will see the generated files in the output.\n",
|
"# You will see the generated files in the output.\n",
|
||||||
"tokenizer.model.save('.')"
|
"tokenizer.model.save('.')"
|
||||||
],
|
]
|
||||||
"metadata": {
|
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
|
||||||
"name": "#%% code\n",
|
|
||||||
"is_executing": false
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"source": [
|
|
||||||
"Now, let's load the trained model and start using our newly trained tokenizer"
|
|
||||||
],
|
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
"pycharm": {
|
||||||
"name": "#%% md\n"
|
"name": "#%% md\n"
|
||||||
}
|
}
|
||||||
}
|
},
|
||||||
|
"source": [
|
||||||
|
"Now, let's load the trained model and start using our newly trained tokenizer"
|
||||||
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 13,
|
"execution_count": 13,
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"is_executing": false,
|
||||||
|
"name": "#%% code\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"outputs": [
|
"outputs": [
|
||||||
{
|
{
|
||||||
"name": "stdout",
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
"text": [
|
"text": [
|
||||||
"Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n",
|
"Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n",
|
||||||
"Decoded string: this is a simple input to be tokenized\n"
|
"Decoded string: this is a simple input to be tokenized\n"
|
||||||
],
|
]
|
||||||
"output_type": "stream"
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"source": [
|
"source": [
|
||||||
@@ -302,17 +314,15 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"decoded = tokenizer.decode(encoding.ids)\n",
|
"decoded = tokenizer.decode(encoding.ids)\n",
|
||||||
"print(\"Decoded string: {}\".format(decoded))"
|
"print(\"Decoded string: {}\".format(decoded))"
|
||||||
],
|
]
|
||||||
"metadata": {
|
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
|
||||||
"name": "#%% code\n",
|
|
||||||
"is_executing": false
|
|
||||||
}
|
|
||||||
}
|
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
|
"metadata": {
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%% md\n"
|
||||||
|
}
|
||||||
|
},
|
||||||
"source": [
|
"source": [
|
||||||
"The Encoding structure exposes multiple properties which are useful when working with transformers models\n",
|
"The Encoding structure exposes multiple properties which are useful when working with transformers models\n",
|
||||||
"\n",
|
"\n",
|
||||||
@@ -324,13 +334,7 @@
|
|||||||
"- special_token_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
|
"- special_token_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
|
||||||
"- type_ids: If your input was made of multiple \"parts\" such as (question, context), then this would be a vector giving, for each token, the segment it belongs to.\n",
|
"- type_ids: If your input was made of multiple \"parts\" such as (question, context), then this would be a vector giving, for each token, the segment it belongs to.\n",
|
||||||
"- overflowing: If your input has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts."
|
"- overflowing: If your input has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts."
|
||||||
],
|
]
|
||||||
"metadata": {
|
|
||||||
"collapsed": false,
|
|
||||||
"pycharm": {
|
|
||||||
"name": "#%% md\n"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"metadata": {
|
"metadata": {
|
||||||
@@ -342,25 +346,25 @@
|
|||||||
"language_info": {
|
"language_info": {
|
||||||
"codemirror_mode": {
|
"codemirror_mode": {
|
||||||
"name": "ipython",
|
"name": "ipython",
|
||||||
"version": 2
|
"version": 3
|
||||||
},
|
},
|
||||||
"file_extension": ".py",
|
"file_extension": ".py",
|
||||||
"mimetype": "text/x-python",
|
"mimetype": "text/x-python",
|
||||||
"name": "python",
|
"name": "python",
|
||||||
"nbconvert_exporter": "python",
|
"nbconvert_exporter": "python",
|
||||||
"pygments_lexer": "ipython2",
|
"pygments_lexer": "ipython3",
|
||||||
"version": "2.7.6"
|
"version": "3.7.6"
|
||||||
},
|
},
|
||||||
"pycharm": {
|
"pycharm": {
|
||||||
"stem_cell": {
|
"stem_cell": {
|
||||||
"cell_type": "raw",
|
"cell_type": "raw",
|
||||||
"source": [],
|
|
||||||
"metadata": {
|
"metadata": {
|
||||||
"collapsed": false
|
"collapsed": false
|
||||||
}
|
},
|
||||||
|
"source": []
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
"nbformat_minor": 0
|
"nbformat_minor": 1
|
||||||
}
|
}
|
||||||
@@ -75,6 +75,20 @@
|
|||||||
"in PyTorch and TensorFlow in a transparent and interchangeable way. "
|
"in PyTorch and TensorFlow in a transparent and interchangeable way. "
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"!pip install transformers"
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"collapsed": false,
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%% code\n"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 74,
|
"execution_count": 74,
|
||||||
|
|||||||
@@ -51,6 +51,20 @@
|
|||||||
"```"
|
"```"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"!pip install transformers"
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"collapsed": false,
|
||||||
|
"pycharm": {
|
||||||
|
"name": "#%% code\n"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"execution_count": 29,
|
"execution_count": 29,
|
||||||
|
|||||||
@@ -11,7 +11,7 @@ Pull Request and we'll review it so it can be included here.
|
|||||||
|
|
||||||
| Notebook | Description | |
|
| Notebook | Description | |
|
||||||
|:----------|:-------------:|------:|
|
|:----------|:-------------:|------:|
|
||||||
| [Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
|
| [Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
|
||||||
| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
|
| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
|
||||||
| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
|
| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
|
||||||
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
|
| [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
|
||||||
|
|||||||
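Review note: the README hunk above moves the Colab links from the `docker-notebooks` branch to `master`, but the rows for `02-transformers.ipynb` and `03-pipelines.ipynb` still end in `01-training-tokenizers.ipynb`. Below is a minimal sketch of a check that would flag such rows. The URL layout is taken from the links in the diff. The helper names `colab_url` and `link_matches_notebook` are hypothetical and not part of the repository.

```python
from urllib.parse import urlparse

# URL prefix as it appears in the README links of this diff.
COLAB_PREFIX = "https://colab.research.google.com/github"


def colab_url(org: str, repo: str, branch: str, path: str) -> str:
    """Build a Colab link for a notebook hosted on GitHub (hypothetical helper)."""
    return f"{COLAB_PREFIX}/{org}/{repo}/blob/{branch}/{path}"


def link_matches_notebook(notebook_file: str, colab_link: str) -> bool:
    """Return True when the last path segment of the link is the notebook file."""
    return urlparse(colab_link).path.rsplit("/", 1)[-1] == notebook_file


# Rows as they appear on the new side of the README hunk.
tokenizers_link = colab_url(
    "huggingface", "transformers", "master", "notebooks/01-training-tokenizers.ipynb"
)
readme_rows = {
    "01-training-tokenizers.ipynb": tokenizers_link,
    "02-transformers.ipynb": tokenizers_link,
    "03-pipelines.ipynb": tokenizers_link,
}

# Collect the rows whose link does not open the notebook named in the row.
mismatched = [
    name for name, link in readme_rows.items()
    if not link_matches_notebook(name, link)
]
print(mismatched)
```

Under these assumptions, a row passes only when its link ends with its own notebook file name, so the second and third rows would be reported.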