Hampus Londögård
03/11/2021, 6:48 PM
Hampus Londögård
03/11/2021, 6:48 PM
WordEmbeddings & LightWordEmbeddings)
✔️Stopwords
✔️WordFrequencies
✔️Tokenizer (CharTokenizer & SimpleTokenizer)
✔️Stemmer
✔️Basic Trie
✔️Sentence Embeddings (AvgSentenceEmbeddings & USifEmbeddings)
Hampus Londögård
03/11/2021, 6:52 PM
Ilya Muradyan
03/11/2021, 7:20 PM
Hampus Londögård
03/11/2021, 7:31 PM
Hampus Londögård
03/26/2021, 2:19 PM
README.ipynb which contains interactive examples.
Removed the examples from README.md so that nothing falls out of sync.
https://github.com/londogard/londogard-nlp-toolkit/blob/main/README.ipynb
Hampus Londögård
03/26/2021, 2:22 PM
SentencePieceTokenizer (including simple download for 275 languages with ~7 different vocab sizes to choose from)
BpeEmbeddings, which are byte-pair-encoded (BPE) subword embeddings that have been shown to be very effective for their size (11 MB performs approximately the same as 6 GB of fastText embeddings)
And now there's a helper to simply instantiate either WordEmbeddings or LightWordEmbeddings through LanguageSupport; it downloads the embeddings from fastText, meaning that 175 languages are supported from the get-go!
Ilya Muradyan
03/26/2021, 2:28 PM
Hampus Londögård
03/26/2021, 2:31 PM
Hampus Londögård
04/07/2021, 4:16 PM
Ilya Muradyan
04/07/2021, 4:59 PM