Fix Colab links + install dependencies first.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-05 11:40:15 +01:00
parent ff9e79ba3a
commit 30624f7056
4 changed files with 138 additions and 106 deletions
--- a/notebooks/01-training-tokenizers.ipynb
+++ b/notebooks/01-training-tokenizers.ipynb
@@ -2,6 +2,12 @@
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Tokenization doesn't have to be slow !\n",
    "\n",
@@ -81,34 +87,46 @@
    "\n",
    "All of these building blocks can be combined to create working tokenization pipelines. \n",
    "In the next section we will go over our first pipeline."
-   ],
+   ]
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n",
     "is_executing": false
    }
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
    "\n",
    "For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a quite small input for the purpose of this notebook.\n",
-    "We will work with [the file from peter Norving](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
+    "We will work with [the file from Peter Norvig](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
-    "This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer."
+    "This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer.\n"
-   ],
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
-     "name": "#%% md\n"
+     "is_executing": false,
-    }
+     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "!pip install tokenizers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
@@ -122,33 +140,31 @@
    "        big_f.write(response.content)\n",
    "    else:\n",
    "        print(\"Unable to get the file: {}\".format(response.reason))\n"
-   ],
+   ]
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% code\n",
     "is_executing": false
    }
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    }
   },
   "source": [
    " \n",
    "Now that we have our training data we need to create the overall pipeline for the tokenizer\n",
    " "
-   ],
+   ]
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n",
     "is_executing": false
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [],
   "source": [
    "# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
@@ -165,49 +181,47 @@
    "tokenizer = Tokenizer(BPE.empty())\n",
    "\n",
    "# Then we enable lower-casing and unicode-normalization\n",
-    "# The Sequence normalizer allows us to combine multiple Normalizer, that will be\n",
+    "# The Sequence normalizer allows us to combine multiple Normalizers that will be\n",
-    "# executed in sequence.\n",
+    "# executed in order.\n",
    "tokenizer.normalizer = Sequence([\n",
    "    NFKC(),\n",
    "    Lowercase()\n",
    "])\n",
    "\n",
-    "# Out tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
+    "# Our tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
    "tokenizer.pre_tokenizer = ByteLevel()\n",
    "\n",
    "# And finally, let's plug in a decoder so we can recover the original input from a tokenized one\n",
    "tokenizer.decoder = ByteLevelDecoder()"
-   ],
+   ]
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% code\n",
     "is_executing": false
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
-   }
+   },
   "source": [
    "The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Trained vocab size: 25000\n"
-     ],
+     ]
     "output_type": "stream"
    }
   ],
   "source": [
@@ -218,79 +232,77 @@
    "tokenizer.train(trainer, [\"big.txt\"])\n",
    "\n",
    "print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
-   ],
+   ]
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% code\n",
     "is_executing": false
    }
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Et voilà ! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
    "covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
    "on the `Trainer` class, but the overall process should be very similar.\n",
    "\n",
    "We can save the content of the model to reuse it later."
-   ],
+   ]
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "data": {
-      "text/plain": "['./vocab.json', './merges.txt']"
+      "text/plain": [
       "['./vocab.json', './merges.txt']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
-     "output_type": "execute_result",
+     "output_type": "execute_result"
     "execution_count": 12
    }
   ],
   "source": [
    "# You will see the generated files in the output.\n",
    "tokenizer.model.save('.')"
-   ],
+   ]
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% code\n",
     "is_executing": false
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now, let's load the trained model and start using our newly trained tokenizer"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
-   }
+   },
   "source": [
    "Now, let's load the trained model and start using our newly trained tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "is_executing": false,
     "name": "#%% code\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n",
      "Decoded string:  this is a simple input to be tokenized\n"
-     ],
+     ]
     "output_type": "stream"
    }
   ],
   "source": [
@@ -302,17 +314,15 @@
    "\n",
    "decoded = tokenizer.decode(encoding.ids)\n",
    "print(\"Decoded string: {}\".format(decoded))"
-   ],
+   ]
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% code\n",
     "is_executing": false
    }
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The Encoding structure exposes multiple properties which are useful when working with transformers models\n",
    "\n",
@@ -324,13 +334,7 @@
    "- special_token_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
    "- type_ids: If your input was made of multiple \"parts\" such as (question, context), then this would be a vector giving, for each token, the segment it belongs to.\n",
    "- overflowing: If your input has been truncated into multiple subparts because of a length limit (for BERT, for example, the sequence length is limited to 512), this will contain all the remaining overflowing parts."
-   ],
+   ]
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  }
 ],
 "metadata": {
@@ -342,25 +346,25 @@
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
-    "version": 2
+    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython2",
+   "pygments_lexer": "ipython3",
-   "version": "2.7.6"
+   "version": "3.7.6"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "source": [],
    "metadata": {
     "collapsed": false
-    }
+    },
    "source": []
   }
  }
 },
 "nbformat": 4,
- "nbformat_minor": 0
+ "nbformat_minor": 1
 }
--- a/notebooks/02-transformers.ipynb
+++ b/notebooks/02-transformers.ipynb
@@ -75,6 +75,20 @@
    "in PyTorch and TensorFlow in a transparent and interchangeable way. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "!pip install transformers"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% code\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 74,
--- a/notebooks/03-pipelines.ipynb
+++ b/notebooks/03-pipelines.ipynb
@@ -51,6 +51,20 @@
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "!pip install transformers"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% code\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 29,
--- a/notebooks/README.md
+++ b/notebooks/README.md
@@ -11,7 +11,7 @@ Pull Request and we'll review it so it can be included here.
 | Notebook     |      Description      |   |
 |:----------|:-------------:|------:|
-| [Getting Started Tokenizers](01-training-tokenizers.ipynb)  | How to train and use your very own tokenizer  |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
+| [Getting Started Tokenizers](01-training-tokenizers.ipynb)  | How to train and use your very own tokenizer  |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
-| [Getting Started Transformers](02-transformers.ipynb)   | How to easily start using transformers  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
+| [Getting Started Transformers](02-transformers.ipynb)   | How to easily start using transformers  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
-| [How to use Pipelines](03-pipelines.ipynb)  | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/docker-notebooks/notebooks/01-training-tokenizers.ipynb) |
+| [How to use Pipelines](03-pipelines.ipynb)  | Simple and efficient way to use State-of-the-Art models on downstream tasks through transformers | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/transformers/blob/master/notebooks/01-training-tokenizers.ipynb) |
 | [How to train a language model](https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)| Highlight all the steps to effectively train Transformer model on custom data | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)|