Adding Docker images for transformers + notebooks (#3051)
* Added transformers-pytorch-cpu and gpu Docker images Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added automatic jupyter launch for Docker image. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Move image from alpine to Ubuntu to align with NVidia container images. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added TRANSFORMERS_VERSION argument to Dockerfile. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added Pytorch-GPU based Docker image Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added Tensorflow images. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Use python 3.7 as Tensorflow doesnt provide 3.8 compatible wheel. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Remove double FROM instructions on transformers-pytorch-cpu image. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added transformers-tensorflow-gpu Docker image. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * use the correct ubuntu version for tensorflow-gpu Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added pipelines example notebook Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added transformers-cpu and transformers-gpu (including both PyTorch and TensorFlow) images. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Docker images doesnt start jupyter notebook by default. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Tokenizers notebook Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Update images links Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Update Docker images to python 3.7.6 and transformers 2.5.1 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added 02-transformers notebook. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Trying to realign 02-transformers notebook ? Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added Transformer image schema * Some tweaks on tokenizers notebook * Removed old notebooks. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Attempt to provide table of content for each notebooks Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Second attempt. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Reintroduce transformer image. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Keep trying Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * It's going to fly ! Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Remaining of the Table of Content Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix inlined elements for the table of content Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Removed anaconda dependencies for Docker images. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Removing notebooks ToC Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added LABEL to each docker image. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Removed old Dockerfile Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Directly use the context and include transformers from here. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Reduce overall size of compiled Docker images. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Install jupyter by default and use CMD for easier launching of the images. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Reduce number of layers in the images. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added README.md for notebooks. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix notebooks link in README Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fix some wording issues. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added blog notebooks too. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Addressing spelling errors in review comments. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
This commit is contained in:
366
notebooks/01-training-tokenizers.ipynb
Normal file
366
notebooks/01-training-tokenizers.ipynb
Normal file
@@ -0,0 +1,366 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Tokenization doesn't have to be slow !\n",
|
||||
"\n",
|
||||
"### Introduction\n",
|
||||
"\n",
|
||||
"Before going deep into any Machine Learning or Deep Learning Natural Language Processing models, every practitioner\n",
|
||||
"should find a way to map raw input strings to a representation understandable by a trainable model.\n",
|
||||
"\n",
|
||||
"One very simple approach would be to split inputs over every space and assign an identifier to each word. This approach\n",
|
||||
"would look similar to the code below in python\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"s = \"very long corpus...\"\n",
|
||||
"words = s.split(\" \") # Split over space\n",
|
||||
"vocabulary = dict(enumerate(set(words))) # Map storing the word to it's corresponding id\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"This approach might work well if your vocabulary remains small as it would store every word (or **token**) present in your original\n",
|
||||
"input. Moreover, word variations like \"cat\" and \"cats\" would not share the same identifiers even if their meaning is \n",
|
||||
"quite close.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"### Subtoken Tokenization\n",
|
||||
"\n",
|
||||
"To overcome the issues described above, recent works have been done on tokenization, leveraging \"subtoken\" tokenization.\n",
|
||||
"**Subtokens** extends the previous splitting strategy to furthermore explode a word into grammatically logicial sub-components learned\n",
|
||||
"from the data.\n",
|
||||
"\n",
|
||||
"Taking our previous example of the words __cat__ and __cats__, a sub-tokenization of the word __cats__ would be [cat, ##s]. Where the prefix _\"##\"_ indicates a subtoken of the initial input. \n",
|
||||
"Such training algorithms might extract sub-tokens such as _\"##ing\"_, _\"##ed\"_ over English corpus.\n",
|
||||
"\n",
|
||||
"As you might think of, this kind of sub-tokens construction leveraging compositions of _\"pieces\"_ overall reduces the size\n",
|
||||
"of the vocabulary you have to carry to train a Machine Learning model. On the other side, as one token might be exploded\n",
|
||||
"into multiple subtokens, the input of your model might increase and become an issue on model with non-linear complexity over the input sequence's length. \n",
|
||||
" \n",
|
||||
"\n",
|
||||
" \n",
|
||||
"Among all the tokenization algorithms, we can highlight a few subtokens algorithms used in Transformers-based SoTA models : \n",
|
||||
"\n",
|
||||
"- [Byte Pair Encoding (BPE) - Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909)\n",
|
||||
"- [Word Piece - Japanese and Korean voice search (Schuster, M., and Nakajima, K., 2015)](https://research.google/pubs/pub37842/)\n",
|
||||
"- [Unigram Language Model - Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, T., 2018)](https://arxiv.org/abs/1804.10959)\n",
|
||||
"- [Sentence Piece - A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Taku Kudo and John Richardson, 2018)](https://arxiv.org/abs/1808.06226)\n",
|
||||
"\n",
|
||||
"Going through all of them is out of the scope of this notebook, so we will just highlight how you can use them.\n",
|
||||
"\n",
|
||||
"### @huggingface/tokenizers library \n",
|
||||
"Along with the transformers library, we @huggingface provide a blazing fast tokenization library\n",
|
||||
"able to train, tokenize and decode dozens of Gb/s of text on a common multi-core machine.\n",
|
||||
"\n",
|
||||
"The library is written in Rust allowing us to take full advantage of multi-core parallel computations in a native and memory-aware way, on-top of which \n",
|
||||
"we provide bindings for Python and NodeJS (more bindings may be added in the future). \n",
|
||||
"\n",
|
||||
"We designed the library so that it provides all the required blocks to create end-to-end tokenizers in an interchangeable way. In that sense, we provide\n",
|
||||
"these various components: \n",
|
||||
"\n",
|
||||
"- **Normalizer**: Executes all the initial transformations over the initial input string. For example when you need to\n",
|
||||
"lowercase some text, maybe strip it, or even apply one of the common unicode normalization process, you will add a Normalizer. \n",
|
||||
"- **PreTokenizer**: In charge of splitting the initial input string. That's the component that decides where and how to\n",
|
||||
"pre-segment the origin string. The simplest example would be like we saw before, to simply split on spaces.\n",
|
||||
"- **Model**: Handles all the sub-token discovery and generation, this part is trainable and really dependant\n",
|
||||
" of your input data.\n",
|
||||
"- **Post-Processor**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA\n",
|
||||
"models. For instance, for BERT it would wrap the tokenized sentence around [CLS] and [SEP] tokens.\n",
|
||||
"- **Decoder**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according\n",
|
||||
"to the `PreTokenizer` we used previously.\n",
|
||||
"- **Trainer**: Provides training capabilities to each model.\n",
|
||||
"\n",
|
||||
"For each of the components above we provide multiple implementations:\n",
|
||||
"\n",
|
||||
"- **Normalizer**: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...\n",
|
||||
"- **PreTokenizer**: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...\n",
|
||||
"- **Model**: WordLevel, BPE, WordPiece\n",
|
||||
"- **Post-Processor**: BertProcessor, ...\n",
|
||||
"- **Decoder**: WordLevel, BPE, WordPiece, ...\n",
|
||||
"\n",
|
||||
"All of these building blocks can be combined to create working tokenization pipelines. \n",
|
||||
"In the next section we will go over our first pipeline."
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n",
|
||||
"is_executing": false
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Alright, now we are ready to implement our first tokenization pipeline through `tokenizers`. \n",
|
||||
"\n",
|
||||
"For this, we will train a Byte-Pair Encoding (BPE) tokenizer on a quite small input for the purpose of this notebook.\n",
|
||||
"We will work with [the file from peter Norving](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwjYp9Ppru_nAhUBzIUKHfbUAG8QFjAAegQIBhAB&url=https%3A%2F%2Fnorvig.com%2Fbig.txt&usg=AOvVaw2ed9iwhcP1RKUiEROs15Dz).\n",
|
||||
"This file contains around 130.000 lines of raw text that will be processed by the library to generate a working tokenizer."
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"BIG_FILE_URL = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'\n",
|
||||
"\n",
|
||||
"# Let's download the file and save it somewhere\n",
|
||||
"from requests import get\n",
|
||||
"with open('big.txt', 'wb') as big_f:\n",
|
||||
" response = get(BIG_FILE_URL, )\n",
|
||||
" \n",
|
||||
" if response.status_code == 200:\n",
|
||||
" big_f.write(response.content)\n",
|
||||
" else:\n",
|
||||
" print(\"Unable to get the file: {}\".format(response.reason))\n"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% code\n",
|
||||
"is_executing": false
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
" \n",
|
||||
"Now that we have our training data we need to create the overall pipeline for the tokenizer\n",
|
||||
" "
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n",
|
||||
"is_executing": false
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# For the user's convenience `tokenizers` provides some very high-level classes encapsulating\n",
|
||||
"# the overall pipeline for various well-known tokenization algorithm. \n",
|
||||
"# Everything described below can be replaced by the ByteLevelBPETokenizer class. \n",
|
||||
"\n",
|
||||
"from tokenizers import Tokenizer\n",
|
||||
"from tokenizers.decoders import ByteLevel as ByteLevelDecoder\n",
|
||||
"from tokenizers.models import BPE\n",
|
||||
"from tokenizers.normalizers import Lowercase, NFKC, Sequence\n",
|
||||
"from tokenizers.pre_tokenizers import ByteLevel\n",
|
||||
"\n",
|
||||
"# First we create an empty Byte-Pair Encoding model (i.e. not trained model)\n",
|
||||
"tokenizer = Tokenizer(BPE.empty())\n",
|
||||
"\n",
|
||||
"# Then we enable lower-casing and unicode-normalization\n",
|
||||
"# The Sequence normalizer allows us to combine multiple Normalizer, that will be\n",
|
||||
"# executed in sequence.\n",
|
||||
"tokenizer.normalizer = Sequence([\n",
|
||||
" NFKC(),\n",
|
||||
" Lowercase()\n",
|
||||
"])\n",
|
||||
"\n",
|
||||
"# Out tokenizer also needs a pre-tokenizer responsible for converting the input to a ByteLevel representation.\n",
|
||||
"tokenizer.pre_tokenizer = ByteLevel()\n",
|
||||
"\n",
|
||||
"# And finally, let's plug a decoder so we can recover from a tokenized input to the original one\n",
|
||||
"tokenizer.decoder = ByteLevelDecoder()"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% code\n",
|
||||
"is_executing": false
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"The overall pipeline is now ready to be trained on the corpus we downloaded earlier in this notebook."
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"text": [
|
||||
"Trained vocab size: 25000\n"
|
||||
],
|
||||
"output_type": "stream"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from tokenizers.trainers import BpeTrainer\n",
|
||||
"\n",
|
||||
"# We initialize our trainer, giving him the details about the vocabulary we want to generate\n",
|
||||
"trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())\n",
|
||||
"tokenizer.train(trainer, [\"big.txt\"])\n",
|
||||
"\n",
|
||||
"print(\"Trained vocab size: {}\".format(tokenizer.get_vocab_size()))"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% code\n",
|
||||
"is_executing": false
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Et voilà ! You trained your very first tokenizer from scratch using `tokenizers`. Of course, this \n",
|
||||
"covers only the basics, and you may want to have a look at the `add_special_tokens` or `special_tokens` parameters\n",
|
||||
"on the `Trainer` class, but the overall process should be very similar.\n",
|
||||
"\n",
|
||||
"We can save the content of the model to reuse it later."
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": "['./vocab.json', './merges.txt']"
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "execute_result",
|
||||
"execution_count": 12
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# You will see the generated files in the output.\n",
|
||||
"tokenizer.model.save('.')"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% code\n",
|
||||
"is_executing": false
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Now, let load the trained model and start using out newly trained tokenizer"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"text": [
|
||||
"Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']\n",
|
||||
"Decoded string: this is a simple input to be tokenized\n"
|
||||
],
|
||||
"output_type": "stream"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Let's tokenizer a simple input\n",
|
||||
"tokenizer.model = BPE.from_files('vocab.json', 'merges.txt')\n",
|
||||
"encoding = tokenizer.encode(\"This is a simple input to be tokenized\")\n",
|
||||
"\n",
|
||||
"print(\"Encoded string: {}\".format(encoding.tokens))\n",
|
||||
"\n",
|
||||
"decoded = tokenizer.decode(encoding.ids)\n",
|
||||
"print(\"Decoded string: {}\".format(decoded))"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% code\n",
|
||||
"is_executing": false
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"The Encoding structure exposes multiple properties which are useful when working with transformers models\n",
|
||||
"\n",
|
||||
"- normalized_str: The input string after normalization (lower-casing, unicode, stripping, etc.)\n",
|
||||
"- original_str: The input string as it was provided\n",
|
||||
"- tokens: The generated tokens with their string representation\n",
|
||||
"- input_ids: The generated tokens with their integer representation\n",
|
||||
"- attention_mask: If your input has been padded by the tokenizer, then this would be a vector of 1 for any non padded token and 0 for padded ones.\n",
|
||||
"- special_token_mask: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], then this would be a vector with 1 in places where a special token has been added.\n",
|
||||
"- type_ids: If your was made of multiple \"parts\" such as (question, context), then this would be a vector with for each token the segment it belongs to.\n",
|
||||
"- overflowing: If your has been truncated into multiple subparts because of a length limit (for BERT for example the sequence length is limited to 512), this will contain all the remaining overflowing parts."
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"pycharm": {
|
||||
"name": "#%% md\n"
|
||||
}
|
||||
}
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 2
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython2",
|
||||
"version": "2.7.6"
|
||||
},
|
||||
"pycharm": {
|
||||
"stem_cell": {
|
||||
"cell_type": "raw",
|
||||
"source": [],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
Reference in New Issue
Block a user