diff --git a/notebooks/01-training-tokenizers.ipynb b/notebooks/01-training-tokenizers.ipynb index 554d25d3ff..1a56594961 100644 --- a/notebooks/01-training-tokenizers.ipynb +++ b/notebooks/01-training-tokenizers.ipynb @@ -2,6 +2,12 @@ "cells": [ { "cell_type": "markdown", + "metadata": { + "pycharm": { + "is_executing": false, + "name": "#%% md\n" + } + }, "source": [ "## Tokenization doesn't have to be slow !\n", "\n", @@ -81,34 +87,46 @@ "\n", "All of these building blocks can be combined to create working tokenization pipelines. \n", "In the next section we will go over our first pipeline." - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n", - "is_executing": false - } - } + ] }, { "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, "source": [ "Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n", "\n", "For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a quite small input for the purpose of this notebook.\n", - "We will work with [the file from peter Norving](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n", - "This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer." 
- ], + "We will work with [the file from Peter Norvig](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n", + "This file contains around 130,000 lines of raw text that will be processed by the library to generate a working tokenizer.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { - "collapsed": false, "pycharm": { - "name": "#%% md\n" + "is_executing": false, + "name": "#%% code\n" } - } + }, + "outputs": [], + "source": [ + "!pip install tokenizers" + ] }, { "cell_type": "code", "execution_count": 2, + "metadata": { + "pycharm": { + "is_executing": false, + "name": "#%% code\n" + } + }, "outputs": [], "source": [ "BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n", @@ -122,33 +140,31 @@ " big_f.write(response.content)\n", " else:\n", " print(\"Unable to get the file: {}\".format(response.reason))\n" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% code\n", - "is_executing": false - } - } + ] }, { "cell_type": "markdown", + "metadata": { + "pycharm": { + "is_executing": false, + "name": "#%% md\n" + } + }, "source": [ " \n", "Now that we have our training data we need to create the overall pipeline for the tokenizer\n", " " - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n", - "is_executing": false - } - } + ] }, { "cell_type": "code", "execution_count": 10, + "metadata": { + "pycharm": { + "is_executing": false, + "name": "#%% code\n" + } + }, "outputs": [], "source": [ "# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n", @@ -165,49 +181,47 @@ "tokenizer = Tokenizer(BPE.empty())\n", "\n", "# Then we enable lower-casing and unicode-normalization\n", - "# The Sequence normalizer allows us to combine multiple Normalizer, that will be\n", -
"# executed in sequence.\n", + "# The Sequence normalizer allows us to combine multiple Normalizer that will be\n", + "# executed in order.\n", "tokenizer.normalizer = Sequence([\n", " NFKC(),\n", " Lowercase()\n", "])\n", "\n", - "# Out tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n", + "# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n", "tokenizer.pre_tokenizer = ByteLevel()\n", "\n", "# And finally, let's plug a decoder so we can recover from a tokenized input to the original one\n", "tokenizer.decoder = ByteLevelDecoder()" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% code\n", - "is_executing": false - } - } + ] }, { "cell_type": "markdown", - "source": [ - "The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook." - ], "metadata": { - "collapsed": false, "pycharm": { "name": "#%% md\n" } - } + }, + "source": [ + "The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook." + ] }, { "cell_type": "code", "execution_count": 11, + "metadata": { + "pycharm": { + "is_executing": false, + "name": "#%% code\n" + } + }, "outputs": [ { "name": "stdout", + "output_type": "stream", "text": [ "Trained vocab size: 25000\n" - ], - "output_type": "stream" + ] } ], "source": [ @@ -218,79 +232,77 @@ "tokenizer.train(trainer, [\"big.txt\"])\n", "\n", "print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% code\n", - "is_executing": false - } - } + ] }, { "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, "source": [ "Et voilà ! You trained your very first tokenizer from scratch using `tokenizers`. 
Of course, this \n", "covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n", "on the `Trainer` class, but the overall process should be very similar.\n", "\n", "We can save the content of the model to reuse it later." - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } + ] }, { "cell_type": "code", "execution_count": 12, + "metadata": { + "pycharm": { + "is_executing": false, + "name": "#%% code\n" + } + }, "outputs": [ { "data": { - "text/plain": "['./vocab.json', './merges.txt']" + "text/plain": [ + "['./vocab.json', './merges.txt']" + ] }, + "execution_count": 12, "metadata": {}, - "output_type": "execute_result", - "execution_count": 12 + "output_type": "execute_result" } ], "source": [ "# You will see the generated files in the output.\n", "tokenizer.model.save('.')" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% code\n", - "is_executing": false - } - } + ] }, { "cell_type": "markdown", - "source": [ - "Now, let load the trained model and start using out newly trained tokenizer" - ], "metadata": { - "collapsed": false, "pycharm": { "name": "#%% md\n" } - } + }, + "source": [ + "Now, let's load the trained model and start using our newly trained tokenizer" + ] }, { "cell_type": "code", "execution_count": 13, + "metadata": { + "pycharm": { + "is_executing": false, + "name": "#%% code\n" + } + }, "outputs": [ { "name": "stdout", + "output_type": "stream", "text": [ "Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n", "Decoded string: this is a simple input to be tokenized\n" - ], - "output_type": "stream" + ] } ], "source": [ @@ -302,17 +314,15 @@ "\n", "decoded = tokenizer.decode(encoding.ids)\n", "print(\"Decoded string: {}\".format(decoded))" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% code\n", - "is_executing": false - } - } + ] }, { "cell_type": "markdown", +
"metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, "source": [ "The Encoding structure exposes multiple properties which are useful when working with transformers models\n", "\n", @@ -324,13 +334,7 @@ "- special_token_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n", "- type_ids: If your was made of multiple \"parts\" such as (question, context), then this would be a vector with for each token the segment it belongs to.\n", "- overflowing: If your has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts." - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } + ] } ], "metadata": { @@ -342,25 +346,25 @@ "language_info": { "codemirror_mode": { "name": "ipython", - "version": 2 + "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" + "pygments_lexer": "ipython3", + "version": "3.7.6" }, "pycharm": { "stem_cell": { "cell_type": "raw", - "source": [], "metadata": { "collapsed": false - } + }, + "source": [] } } }, "nbformat": 4, - "nbformat_minor": 0 -} \ No newline at end of file + "nbformat_minor": 1 +} diff --git a/notebooks/02-transformers.ipynb b/notebooks/02-transformers.ipynb index 44655c1e4a..e02d19c5a6 100644 --- a/notebooks/02-transformers.ipynb +++ b/notebooks/02-transformers.ipynb @@ -75,6 +75,20 @@ "in PyTorch and TensorFlow in a transparent and interchangeable way. 
" ] }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "!pip install transformers" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% code\n" + } + } + }, { "cell_type": "code", "execution_count": 74, diff --git a/notebooks/03-pipelines.ipynb b/notebooks/03-pipelines.ipynb index 9a5b3f7c4f..ddaffcee06 100644 --- a/notebooks/03-pipelines.ipynb +++ b/notebooks/03-pipelines.ipynb @@ -51,6 +51,20 @@ "```" ] }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "!pip install transformers" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% code\n" + } + } + }, { "cell_type": "code", "execution_count": 29, diff --git a/notebooks/README.md b/notebooks/README.md index 9a7d3a4511..234a6cf8ed 100644 --- a/notebooks/README.md +++ b/notebooks/README.md @@ -11,7 +11,7 @@ Pull Request and we'll review it so it can be included here. | Notebook | Description | | |:----------|:-------------:|------:| -| [Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) | -| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) | -| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) | +| 
[Getting Started Tokenizers](01-training-tokenizers.ipynb) | How to train and use your very own tokenizer |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) | +| [Getting Started Transformers](02-transformers.ipynb) | How to easily start using transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb) | +| [How to use Pipelines](03-pipelines.ipynb) | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb) | | [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|
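
Review note: each row of the README table pairs a notebook link with an "Open In Colab" badge, and the two should name the same `.ipynb` file. A minimal sketch of a check for that follows, using only the standard library. The function name `find_mismatched_badges` and the row pattern are assumptions made for illustration; nothing like this ships with the repository.

```python
import re

# Assumed row layout: the first cell is "[title](notebook.ipynb)" and the badge
# is "[![...](badge.svg)](colab-url)". Anchoring on ")](" skips the badge image
# URL, which also lives on colab.research.google.com.
ROW_RE = re.compile(
    r"\|\s*\[[^\]]+\]\((?P<notebook>[^)\s]+\.ipynb)\)"
    r".*?\)\]\((?P<colab>https://colab\.research\.google\.com/[^)\s]+)\)"
)


def find_mismatched_badges(readme_text):
    """Return (notebook_link, colab_url) pairs whose file names differ."""
    mismatches = []
    for line in readme_text.splitlines():
        match = ROW_RE.search(line)
        if match is None:
            continue  # header, separator, or a row without a Colab badge
        notebook_name = match.group("notebook").rsplit("/", 1)[-1]
        colab_name = match.group("colab").rsplit("/", 1)[-1]
        if notebook_name != colab_name:
            mismatches.append((match.group("notebook"), match.group("colab")))
    return mismatches
```

Calling `find_mismatched_badges(open("notebooks/README.md").read())` would return an empty list when every badge targets the notebook named in its own row.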