read().splitlines() -> readlines()

splitlines() does not behave as expected here for bert-base-chinese because the vocab file contains a '\u2028' (Unicode line separator) token. str.splitlines() treats '\u2028' as a line boundary, so the vocab line '\u2028\n' yields ['', ''] instead of the token.
Perhaps we should use readlines() instead.
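
To illustrate the difference, here is a minimal sketch. The three-entry vocab text is made up, and io.StringIO stands in for the opened vocab file (an assumption for the example, not part of this commit). Note that readlines() keeps the trailing "\n" on every line, so callers would likely still need to strip it.

```python
import io

# Hypothetical three-entry vocab; the middle entry is U+2028 (LINE SEPARATOR).
vocab_text = "[PAD]\n\u2028\n[UNK]\n"

# str.splitlines() treats U+2028 as a line boundary, so the token is lost
# and two empty strings appear in its place.
split_tokens = vocab_text.splitlines()

# readlines() on a text stream does not split on U+2028, so the token
# survives, but every line keeps its trailing "\n".
read_tokens = io.StringIO(vocab_text).readlines()

# Stripping only the trailing newline recovers the intended tokens.
stripped_tokens = [token.rstrip("\n") for token in read_tokens]

print(split_tokens)
print(read_tokens)
print(stripped_tokens)
```

With splitlines() the list has four entries instead of three, so every token after the '\u2028' line is shifted by one index.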
Yiqing-Zhou
2019-07-22 20:49:09 +08:00
committed by GitHub
parent 2f869dc665
commit 897d0841be

@@ -67,10 +67,9 @@ def load_vocab(vocab_file):
     """Loads a vocabulary file into a dictionary."""
     vocab = collections.OrderedDict()
     with open(vocab_file, "r", encoding="utf-8") as reader:
-        tokens = reader.read().splitlines()
+        tokens = reader.readlines()
     for index, token in enumerate(tokens):
         vocab[token] = index
-        index += 1
     return vocab
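
One possible follow-up, sketched here as a suggestion and not as part of this commit: since readlines() keeps the line terminator, the keys stored by the function would end in "\n" unless they are stripped. The version below adds an rstrip("\n") step. The temporary vocab file and its three entries are hypothetical and exist only to exercise the function.

```python
import collections
import os
import tempfile


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary (sketch with newline stripping)."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        # Assumption: remove only the trailing "\n" that readlines() keeps,
        # so a token such as U+2028 is preserved as-is.
        vocab[token.rstrip("\n")] = index
    return vocab


# Hypothetical three-line vocab file, written to a temporary location.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as handle:
    handle.write("[PAD]\n\u2028\n[UNK]\n")
    vocab_path = handle.name

try:
    vocab = load_vocab(vocab_path)
finally:
    os.remove(vocab_path)

print(list(vocab.items()))
```

Using rstrip("\n") instead of strip() is a deliberate choice in this sketch: strip() would also remove whitespace-like tokens, including '\u2028' itself.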