From 95a26fcf2d8d7072e4e63129cea8605f756bba1d Mon Sep 17 00:00:00 2001
From: Alexander Measure <ameasure@gmail.com>
Date: Fri, 22 May 2020 15:17:09 -0400
Subject: [PATCH] Fix broken link to Reformer paper (#4526)

Changed the paper URL from https://https://arxiv.org/abs/2001.04451.pdf (duplicated scheme) to https://arxiv.org/abs/2001.04451.pdf
---
 docs/source/model_doc/reformer.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/model_doc/reformer.rst b/docs/source/model_doc/reformer.rst
index a0d00433df..781e203ce7 100644
--- a/docs/source/model_doc/reformer.rst
+++ b/docs/source/model_doc/reformer.rst
@@ -5,7 +5,7 @@ file a `Github Issue <https://github.com/huggingface/transformers/issues/new?ass
 
 Overview
 ~~~~~
-The Reformer model was presented in `Reformer: The Efficient Transformer <https://https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
+The Reformer model was presented in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451.pdf>`_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
 Here the abstract: 
 
 *Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(Llog(L)), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.*