Ролан
05/07/2021, 8:15 AM
Ролан
05/07/2021, 8:17 AM
Hampus Londögård
05/07/2021, 8:25 AM
What is SentencePiece
SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo]. Here are the high-level differences from other implementations.
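For anyone who has not seen BPE before, here is a toy sketch of the merge-learning loop in the style of Sennrich et al. It is not SentencePiece's implementation (SentencePiece works on raw text and is far more optimized); the function names, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dict."""
    # Start from characters plus an end-of-word marker (an assumption of this sketch).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken alphabetically so runs are repeatable.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

After three merges the frequent suffix `est</w>` has become a single symbol, which is how rare words end up represented as a handful of known sub-word pieces instead of an unknown token.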
Ролан
05/07/2021, 8:27 AM
Ролан
05/07/2021, 8:29 AM
Hampus Londögård
05/07/2021, 8:30 AM
BPEmb
- https://bpemb.h-its.org/ provides embeddings based on subword units, which greatly reduces model size while maintaining strong performance. Really impressive, and this is also supported in the toolkit.
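A rough back-of-the-envelope sketch of why a subword vocabulary shrinks the embedding table so much: a dense table stores one vector per vocabulary entry, so its size scales linearly with vocabulary size. The vocabulary sizes, dimension, and float32 storage below are assumptions picked for illustration, not the actual configurations of BPEmb or fastText.

```python
def embedding_table_bytes(vocab_size, dim, bytes_per_value=4):
    """Size of a dense embedding matrix: one `dim`-vector per vocabulary entry."""
    return vocab_size * dim * bytes_per_value


# Illustrative numbers only (hypothetical vocabularies, float32 values).
subword_bytes = embedding_table_bytes(vocab_size=10_000, dim=300)
word_bytes = embedding_table_bytes(vocab_size=2_000_000, dim=300)

print(f"subword table: {subword_bytes / 1e6:.0f} MB")    # 12 MB
print(f"word table:    {word_bytes / 1e9:.1f} GB")       # 2.4 GB
print(f"ratio:         {word_bytes // subword_bytes}x")  # 200x
```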
They found that an 11 MB embedding had equivalent performance to 6 GB fastText embeddings! 😮 This was in English; I think morphologically rich languages like Finnish and Swedish could see even better results.
Hampus Londögård
05/07/2021, 8:31 AM
Hampus Londögård
05/11/2021, 5:17 AM
Hampus Londögård
05/11/2021, 5:24 AM
Ролан
05/11/2021, 6:03 AM
Ролан
05/11/2021, 6:34 AM
Hampus Londögård
05/11/2021, 6:36 AM
Ролан
05/11/2021, 7:31 AM
Ролан
05/11/2021, 7:34 AM
Hampus Londögård
05/11/2021, 8:12 AM