From fa661ce749b0d14ae1999d1b097866248624a842 Mon Sep 17 00:00:00 2001
From: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Date: Fri, 5 Jun 2020 12:22:50 -0400
Subject: [PATCH] Add model summary (#4789)

* Add model summary

* Add link to pretrained models
---
 docs/source/imgs/local_attention_mask.png | Bin 0 -> 27493 bytes
 docs/source/index.rst                     |   1 +
 docs/source/summary.rst                   | 492 ++++++++++++++++++++++
 3 files changed, 493 insertions(+)
 create mode 100644 docs/source/imgs/local_attention_mask.png
 create mode 100644 docs/source/summary.rst

diff --git a/docs/source/imgs/local_attention_mask.png b/docs/source/imgs/local_attention_mask.png
new file mode 100644
index 0000000000000000000000000000000000000000..284e728820c8fb57e4d295b5c1e33653acae405c
GIT binary patch
literal 27493
[base85-encoded PNG data omitted]

literal 0
HcmV?d00001

diff --git a/docs/source/index.rst b/docs/source/index.rst
index 8288f49069..8529712f32 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -60,6 +60,7 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
     installation
     quickstart
     glossary
+    summary
     pretrained_models
     usage
     model_sharing
diff --git a/docs/source/summary.rst b/docs/source/summary.rst
new file mode 100644
index 0000000000..94c7752fbb
--- /dev/null
+++ b/docs/source/summary.rst
@@ -0,0 +1,492 @@
+Summary of the models
+================================================
+
+This is a summary of the models available in the transformers library. It assumes you’re familiar with the original
+`transformer model `_. For a gentle introduction, check the `annotated transformer
+`_. Here we focus on the high-level differences between the
+models. You can read about them in more detail in their respective documentation. Also check out the
+:doc:`pretrained model page ` to see the checkpoints available for each type of model.
+
+Each one of the models in the library falls into one of the following categories:
+
+    * :ref:`autoregressive-models`
+    * :ref:`autoencoding-models`
+    * :ref:`seq-to-seq-models`
+    * :ref:`multimodal-models`
+
+Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
+previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
+sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those
+models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation.
+A typical example of such models is GPT.
+
+Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original
+sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the
+full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can
+be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is
+sentence classification or token classification. A typical example of such models is BERT.
+
+Note that the only difference between autoregressive models and autoencoding models is in the way the model is
+pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given
+model has been used for both types of pretraining, we have put it in the category corresponding to the article where
+it was first introduced.
+
+Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation
+tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their
+most natural applications are translation, summarization and question answering. The original transformer model is an
+example of such a model (only for translation); T5 is an example that can be fine-tuned on other tasks.
+
+Multimodal models mix text inputs with other kinds (like images) and are more specific to a given task.
+
+.. _autoregressive-models:
+
+Autoregressive models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
+that at each position, the model can only look at the preceding tokens in the attention heads.
+
+Original GPT
+----------------------------------------------
+
+`Improving Language Understanding by Generative Pre-Training `_,
+Alec Radford et al.
+
+The first autoregressive model based on the transformer architecture, pretrained on the Book Corpus dataset.
+
+The library provides versions of the model for language modeling and multitask language modeling/multiple choice
+classification.
+
+More information in this :doc:`model documentation `.
+
+GPT-2
+----------------------------------------------
+
+`Language Models are Unsupervised Multitask Learners `_,
+Alec Radford et al.
+
+A bigger and better version of GPT, pretrained on WebText (web pages from outgoing links in Reddit with 3 karma or
+more).
+
+The library provides versions of the model for language modeling and multitask language modeling/multiple choice
+classification.
+
+More information in this :doc:`model documentation `.
+
+CTRL
+----------------------------------------------
+
+`CTRL: A Conditional Transformer Language Model for Controllable Generation `_,
+Nitish Shirish Keskar et al.
+
+Same as the GPT model but adds the idea of control codes. Text is generated from a prompt (can be empty) and one (or
+several) of those control codes, which are then used to influence the text generation: generate in the style of a
+Wikipedia article, a book or a movie review.
+
+The library provides a version of the model for language modeling only.
+
+More information in this :doc:`model documentation `.
+
+Transformer-XL
+----------------------------------------------
+
+`Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context `_,
+Zihang Dai et al.
+
+Same as a regular GPT model, but introduces a recurrence mechanism for two consecutive segments (similar to a regular
+RNN with two consecutive inputs). In this context, a segment is a number of consecutive tokens (for instance 512) that
+may span multiple documents, and segments are fed in order to the model.
+
+Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention
+scores. This allows the model to pay attention to information that was in the previous segment as well as the current
+one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments.
+
+This changes the positional embeddings to relative positional embeddings (as the regular positional embeddings would
+give the same results in the current input and the current hidden state at a given position) and requires some
+adjustments in the way attention scores are computed.
+
+The library provides a version of the model for language modeling only.
+
+More information in this :doc:`model documentation `.
+
+.. _reformer:
+
+Reformer
+----------------------------------------------
+
+`Reformer: The Efficient Transformer `_,
+Nikita Kitaev et al.
+
+An autoregressive transformer model with lots of tricks to reduce memory footprint and compute time. Those tricks
+include:
+
+    * Use :ref:`Axial position encoding ` (see below for more details). It’s a mechanism to avoid
+      having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller
+      matrices.
+    * Replace traditional attention by :ref:`LSH (locality-sensitive hashing) attention ` (see below for more
+      details). It's a technique to avoid computing the full query-key product in the attention layers.
+    * Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during
+      the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them
+      for results inside a given layer (less efficient than storing them but saves memory).
+    * Compute the feedforward operations by chunks and not on the whole batch.
+
+With those tricks, the model can be fed much longer sequences than traditional transformer autoregressive models.
+
+**Note:** This model could very well be used in an autoencoding setting; there is no checkpoint for such a
+pretraining yet, though.
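The chunked feed-forward trick listed for Reformer above can be illustrated with a minimal sketch. It is not the library's implementation: the function names, the toy weights and the choice to chunk along the sequence dimension are all assumptions made for illustration. The point it shows is that a position-wise feed-forward layer treats every token independently, so processing the sequence in chunks gives the same output while, in a real tensor implementation, bounding the size of the intermediate activation to `chunk_size x d_ff` instead of `seq_len x d_ff`.

```python
def feed_forward(vector, w1, w2):
    """Position-wise feed-forward for one token: ReLU(x W1) W2.

    w1 and w2 are stored as lists of columns (hypothetical layout).
    """
    hidden = [max(0.0, sum(x * w for x, w in zip(vector, col))) for col in w1]
    return [sum(h * w for h, w in zip(hidden, col)) for col in w2]


def feed_forward_full(sequence, w1, w2):
    """Apply the layer to every token of the sequence in one pass."""
    return [feed_forward(token, w1, w2) for token in sequence]


def feed_forward_chunked(sequence, w1, w2, chunk_size):
    """Apply the same layer chunk by chunk along the sequence."""
    outputs = []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        # Only this chunk's intermediate activations need to exist at once.
        outputs.extend(feed_forward(token, w1, w2) for token in chunk)
    return outputs


# Toy sizes: d_model = 2, d_ff = 3, sequence length = 5.
w1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
sequence = [[0.1, 0.2], [1.0, -1.0], [0.0, 3.0], [-2.0, 0.5], [0.7, 0.7]]

full = feed_forward_full(sequence, w1, w2)
chunked = feed_forward_chunked(sequence, w1, w2, chunk_size=2)
print(full == chunked)
```

Because no token's output depends on another token in this layer, the chunk size only trades memory for the number of passes.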
+
+The library provides a version of the model for language modeling only.
+
+More information in this :doc:`model documentation `.
+
+XLNet
+----------------------------------------------
+
+`XLNet: Generalized Autoregressive Pretraining for Language Understanding `_,
+Zhilin Yang et al.
+
+XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the
+tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done
+with a mask, the sentence is actually fed into the model in the right order, but instead of masking the first n tokens
+for n+1, XLNet uses a mask that hides the previous tokens in some given permutation of 1,...,sequence length.
+
+XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies.
+
+The library provides a version of the model for language modeling, token classification, sentence classification,
+multiple choice classification and question answering.
+
+More information in this :doc:`model documentation `.
+
+.. _autoencoding-models:
+
+Autoencoding models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
+look at all the tokens in the attention heads. For pretraining, inputs are a corrupted version of the sentence, usually
+obtained by masking tokens, and targets are the original sentences.
+
+BERT
+----------------------------------------------
+
+`BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding `_,
+Jacob Devlin et al.
+
+Corrupts the inputs by using random masking. More precisely, during pretraining, a given percentage of tokens (usually
+15%) is replaced by
+
+    * a special mask token with probability 0.8
+    * a random token different from the one masked with probability 0.1
+    * the same token with probability 0.1
+
+The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a
+separation token in between). With probability 50%, the sentences are consecutive in the corpus; in the remaining 50%
+they are not related. The model has to predict if the sentences are consecutive or not.
+
+The library provides a version of the model for language modeling (traditional or masked), next sentence prediction,
+token classification, sentence classification, multiple choice classification and question answering.
+
+More information in this :doc:`model documentation `.
+
+ALBERT
+----------------------------------------------
+
+`ALBERT: A Lite BERT for Self-supervised Learning of Language Representations `_,
+Zhenzhong Lan et al.
+
+Same as BERT but with a few tweaks:
+
+    * Embedding size E is different from hidden size H. This is justified because the embeddings are context independent
+      (one embedding vector represents one token) whereas hidden states are context dependent (one hidden state represents
+      a sequence of tokens) so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V
+      being the vocab size). If E < H, it has fewer parameters.
+    * Layers are split in groups that share parameters (to save memory).
+    * Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B
+      (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have
+      been swapped or not.
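The parameter saving from ALBERT's smaller embedding size, described in the list above, can be checked with a little arithmetic. The sketch below assumes a V x E lookup table followed by an E x H projection up to the hidden size; the projection and the concrete sizes are illustrative assumptions, not values taken from this summary.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count the parameters needed to map token ids to hidden-size vectors."""
    if embedding_size is None:
        # BERT-style: one V x H lookup table (embedding size tied to H).
        return vocab_size * hidden_size
    # ALBERT-style (assumed layout): V x E lookup table plus an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes only.
V, H, E = 30000, 768, 128
tied = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(tied, factorized)
```

With these sizes the tied table needs 23,040,000 parameters and the factorized version 3,938,304, so the saving grows with the gap between H and E.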
+
+The library provides a version of the model for masked language modeling, token classification, sentence
+classification, multiple choice classification and question answering.
+
+More information in this :doc:`model documentation `.
+
+RoBERTa
+----------------------------------------------
+
+`RoBERTa: A Robustly Optimized BERT Pretraining Approach `_,
+Yinhan Liu et al.
+
+Same as BERT with better pretraining tricks:
+
+    * dynamic masking: tokens are masked differently at each epoch whereas BERT does it once and for all
+    * no NSP (next sentence prediction) loss and instead of putting just two sentences together, put a chunk of
+      contiguous texts together to reach 512 tokens (so the sentences are in an order that may span several documents)
+    * train with larger batches
+    * use BPE with bytes as a subunit and not characters (because of unicode characters)
+
+The library provides a version of the model for masked language modeling, token classification, sentence
+classification, multiple choice classification and question answering.
+
+More information in this :doc:`model documentation `.
+
+DistilBERT
+----------------------------------------------
+
+`DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter `_,
+Victor Sanh et al.
+
+Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it's been trained to predict
+the same probabilities as the larger model. The actual objective is a combination of:
+
+    * finding the same probabilities as the teacher model
+    * predicting the masked tokens correctly (but no next-sentence objective)
+    * a cosine similarity between the hidden states of the student and the teacher model
+
+The library provides a version of the model for masked language modeling, token classification, sentence classification
+and question answering.
+
+More information in this :doc:`model documentation `.
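The three-part DistilBERT objective described above can be sketched for a single masked position. This is a rough illustration under stated assumptions, not the training code: the function names are hypothetical, the three terms are simply summed with weights that default to 1.0 (the actual weighting is not given in this summary), and details such as temperature scaling are left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def masked_lm_loss(student_logits, target_index):
    """Negative log-likelihood of the true token at a masked position."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_loss(student_hidden, teacher_hidden):
    """One minus the cosine similarity between the two hidden states."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def distillation_objective(student_logits, teacher_logits, target_index,
                           student_hidden, teacher_hidden,
                           weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are placeholders."""
    w_soft, w_mlm, w_cos = weights
    return (w_soft * soft_target_loss(student_logits, teacher_logits)
            + w_mlm * masked_lm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))


teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.8]
loss = distillation_objective(student_logits, teacher_logits, 0,
                              [0.2, 0.9], [0.1, 1.0])
print(loss)
```

The soft-target term is smallest when the student reproduces the teacher's probabilities exactly, which is the sense in which the student learns "the same probabilities as the larger model".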
+
+XLM
+----------------------------------------------
+
+`Cross-lingual Language Model Pretraining `_, Guillaume Lample and Alexis Conneau
+
+A transformer model trained on several languages. There are three different types of training for this model and the
+library provides checkpoints for all of them:
+
+    * Causal language modeling (CLM) which is the traditional autoregressive training (so this model could be in the
+      previous section as well). One of the languages is selected for each training sample, and the model input is a
+      sentence of 256 tokens that may span several documents in one of those languages.
+    * Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample,
+      and the model input is a sentence of 256 tokens that may span several documents in one of those languages, with
+      dynamic masking of the tokens.
+    * A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two
+      different languages, with random masking. To predict one of the masked tokens, the model can use both the
+      surrounding context in language 1 and the context given by language 2.
+
+Checkpoints refer to which method was used for pretraining by having `clm`, `mlm` or `mlm-tlm` in their names. On top
+of positional embeddings, the model has language embeddings. When training using MLM/CLM, this gives the model an
+indication of the language used, and when training using MLM+TLM, an indication of which part of the input is in which
+language.
+
+The library provides a version of the model for language modeling, token classification, sentence classification and
+question answering.
+
+More information in this :doc:`model documentation `.
+
+XLM-RoBERTa
+----------------------------------------------
+
+`Unsupervised Cross-lingual Representation Learning at Scale `_, Alexis Conneau et
+al.
+
+Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective, only using
+masked language modeling on sentences coming from one language. However, the model is trained on many more languages
+(100) and doesn't use the language embeddings, so it's capable of detecting the input language by itself.
+
+The library provides a version of the model for masked language modeling, token classification, sentence
+classification, multiple choice classification and question answering.
+
+More information in this :doc:`model documentation `.
+
+FlauBERT
+----------------------------------------------
+
+`FlauBERT: Unsupervised Language Model Pre-training for French `_, Hang Le et al.
+
+Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective).
+
+The library provides a version of the model for language modeling and sentence classification.
+
+More information in this :doc:`model documentation `.
+
+ELECTRA
+----------------------------------------------
+
+`ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators `_,
+Kevin Clark et al.
+
+ELECTRA is a transformer model pretrained with the use of another (small) masked language model. The inputs are
+corrupted by that language model, which takes an input text that is randomly masked and outputs a text in which ELECTRA
+has to predict which token is an original and which one has been replaced. As in GAN training, the small language
+model is trained for a few steps (but with the original texts as objective, not to fool the ELECTRA model like in a
+traditional GAN setting), then the ELECTRA model is trained for a few steps.
+
+The library provides a version of the model for masked language modeling, token classification and sentence
+classification.
+
+More information in this :doc:`model documentation `.
+
+.. _longformer:
+
+Longformer
+----------------------------------------------
+
+`Longformer: The Long-Document Transformer `_, Iz Beltagy et al.
+
+A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g.,
+what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are
+still given global attention, but the attention matrix has far fewer entries to compute, resulting in a speed-up. See
+the :ref:`local attention section ` for more information.
+
+It is otherwise pretrained the same way as RoBERTa.
+
+**Note:** This model could very well be used in an autoregressive setting; there is no checkpoint for such a
+pretraining yet, though.
+
+The library provides a version of the model for masked language modeling, token classification, sentence
+classification, multiple choice classification and question answering.
+
+More information in this :doc:`model documentation `.
+
+
+.. _seq-to-seq-models:
+
+Sequence-to-sequence models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned before, these models keep both the encoder and the decoder of the original transformer.
+
+BART
+----------------------------------------------
+
+`BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension `_,
+Mike Lewis et al.
+
+Sequence-to-sequence model with an encoder and a decoder. Encoder is fed a corrupted version of the tokens, decoder is
+fed the tokens (but has a mask to hide the future words like a regular transformer decoder).
+For the encoder, on the
+pretraining tasks, a composition of the following transformations is applied:
+
+    * mask random tokens (like in BERT)
+    * delete random tokens
+    * mask a span of k tokens with a single mask token (a span of 0 tokens is an insertion of a mask token)
+    * permute sentences
+    * rotate the document to make it start at a specific token
+
+The library provides a version of this model for conditional generation and sequence classification.
+
+More information in this :doc:`model documentation `.
+
+MarianMT
+----------------------------------------------
+
+`Marian: Fast Neural Machine Translation in C++ `_, Marcin Junczys-Dowmunt et al.
+
+A framework for translation models, using the same models as BART.
+
+The library provides a version of this model for conditional generation.
+
+More information in this :doc:`model documentation `.
+
+T5
+----------------------------------------------
+
+`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer `_,
+Colin Raffel et al.
+
+Uses the traditional transformer model (except a slight change with the positional embeddings, which are learned at
+each layer). To be able to operate on all NLP tasks, it transforms them into text-to-text problems by using certain
+prefixes: “summarize: …”, “question: …”, “translate English to German: …” and so forth.
+
+The pretraining includes both supervised and self-supervised training. Supervised training is conducted on downstream
+tasks provided by the GLUE and SuperGLUE benchmarks (changing them to text-to-text tasks as explained above).
+
+Self-supervised training uses corrupted tokens, which means randomly removing 15% of the tokens and
+replacing them by individual sentinel tokens (if several consecutive tokens are marked for removal, they are replaced
+by one single sentinel token).
+The input of the encoder is the corrupted sentence, and the
+target that the decoder learns to generate is the dropped out tokens delimited by their sentinel tokens.
+
+For instance, if we have the sentence “My dog is very cute .”, and we decide to remove the tokens dog, is and cute, the
+input becomes “My <x> very <y> .” and the target is “<x> dog is <y> cute <z>”.
+
+The library provides a version of this model for conditional generation.
+
+More information in this :doc:`model documentation `.
+
+.. _multimodal-models:
+
+Multimodal models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There is one multimodal model in the library which has not been pretrained in the self-supervised fashion like the
+others.
+
+MMBT
+----------------------------------------------
+
+`Supervised Multimodal Bitransformers for Classifying Images and Text `_, Douwe Kiela
+et al.
+
+A transformer model used in multimodal settings, combining a text and an image to make predictions. The transformer
+model takes as inputs the embeddings of the tokenized text and the final activations of a pretrained resnet on the
+images (after the pooling layer) that go through a linear layer (to go from number of features at the end of the
+resnet to the hidden state dimension of the transformer).
+
+The different inputs are concatenated, and on top of the positional embeddings, a segment embedding is added to let the
+model know which part of the input vector corresponds to the text or the image.
+
+The pretrained model only works for classification.
+
+..
+    More information in this :doc:`model documentation `.
+    TODO: write this page
+
+More technical aspects
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Full vs sparse attention
+----------------------------------------------
+
+Most transformer models use full attention in the sense that the attention matrix is square. It can be a big
+computational bottleneck when you have long texts.
+Longformer and Reformer are models that try to be more efficient and
+use a sparse version of the attention matrix to speed up training.
+
+.. _lsh-attention:
+
+**LSH attention**
+
+:ref:`Reformer ` uses LSH attention. In :math:`\text{softmax}(QK^T)`, only the biggest elements (in the softmax
+dimension) of the matrix :math:`QK^T` are going to give useful contributions. So for each query q in Q, we can consider
+only the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask
+is modified to mask the current token (except at the first position), because its query and key are equal (so very
+similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by
+an ``n_rounds`` parameter) and their results are averaged together.
+
+.. _local-attention:
+
+**Local attention**
+
+:ref:`Longformer ` uses local attention: often, the local context (e.g., what are the two tokens left and
+right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the
+last layer will have a receptive field of more than just the tokens in the window, allowing it to build a
+representation of the whole sentence.
+
+Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access
+all tokens and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in
+their local window). This is shown in Figure 2d of the paper, see below for a sample attention mask:
+
+.. image:: imgs/local_attention_mask.png
+    :scale: 50 %
+    :align: center
+
+Using those sparser attention matrices, which have far fewer entries to compute, then allows the model to take inputs
+with a longer sequence length.
+
+Other tricks
+----------------------------------------------
+
+.. _axial-pos-encoding:
+
+**Axial positional encodings**
+
+:ref:`Reformer ` uses axial positional encodings: in traditional transformer models, the positional encoding
+E is a matrix of size :math:`l` by :math:`d`, :math:`l` being the sequence length and :math:`d` the dimension of the
+hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU.
+
+To alleviate that, axial positional encodings consist in factorizing that big matrix E into two smaller matrices E1 and
+E2, with dimensions :math:`l_{1} \times d_{1}` and :math:`l_{2} \times d_{2}`, such that :math:`l_{1} \times l_{2} = l`
+and :math:`d_{1} + d_{2} = d` (with the product for the lengths, this ends up being way smaller). The embedding for
+time step :math:`j` in E is obtained by concatenating the embeddings for time step :math:`j \% l_{1}` in E1 and
+:math:`j // l_{1}` in E2.
+
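The axial lookup described above can be sketched in a few lines of plain Python. This only illustrates the indexing and the memory saving; it is not the Reformer implementation. The helper names (`make_table`, `axial_position_embedding`) and the sizes (64 x 128 positions, 32 + 96 dimensions) are assumptions made up for the example, and the tables are filled with random numbers rather than learned values.

```python
import random


def make_table(rows, cols, rng):
    """Stand-in for a learned embedding table: `rows` vectors of size `cols`."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def axial_position_embedding(j, e1, e2):
    """Embedding for time step j: row (j % l1) of E1 concatenated with row (j // l1) of E2."""
    l1 = len(e1)
    return e1[j % l1] + e2[j // l1]


# Hypothetical factorization: l = l1 * l2 positions, d = d1 + d2 hidden dimensions.
l1, l2 = 64, 128
d1, d2 = 32, 96
rng = random.Random(0)
e1 = make_table(l1, d1, rng)
e2 = make_table(l2, d2, rng)

seq_len = l1 * l2
hidden = d1 + d2

# Number of stored values with a full l x d matrix versus the two small tables.
full_params = seq_len * hidden
axial_params = l1 * d1 + l2 * d2

vector = axial_position_embedding(5000, e1, e2)
print(len(vector), full_params, axial_params)
```

Every position from 0 to `seq_len - 1` maps to a different pair of rows, so each position still gets its own vector while only `l1 * d1 + l2 * d2` values are stored.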