Hampus Londögård
03/11/2021, 6:48 PM
Hampus Londögård
03/11/2021, 6:48 PM
WordEmbeddings & LightWordEmbeddings)
✔️Stopwords
✔️WordFrequencies
✔️Tokenizer (CharTokenizer & SimpleTokenizer)
✔️Stemmer
✔️Basic Trie
✔️Sentence Embeddings (AvgSentenceEmbeddings & USifEmbeddings)
Hampus Londögård
03/11/2021, 6:52 PM
Ilya Muradyan
03/11/2021, 7:20 PM
Hampus Londögård
03/11/2021, 7:31 PM
Hampus Londögård
03/26/2021, 2:19 PM
README.ipynb which contains interactive examples.
Removed the examples from README.md so that nothing falls out of sync.
https://github.com/londogard/londogard-nlp-toolkit/blob/main/README.ipynb
Hampus Londögård
03/26/2021, 2:22 PM
SentencePieceTokenizer (including simple download for 275 languages, with ~7 different vocab sizes to choose from)
BpeEmbeddings, which are byte-pair-encoded (BPE) subword embeddings and have been shown to be very effective for their size (11 MB performs approximately as well as 6 GB of fastText embeddings)
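For anyone unfamiliar with byte-pair encoding, here is a rough sketch of the idea that subword vocabularies of this kind build on: repeatedly fuse the most frequent adjacent symbol pair, then segment new words by replaying those merges. This is a toy written in Python for brevity and works on characters; it is not the toolkit's Kotlin API and not how SentencePiece is implemented, and the names `learn_bpe_merges`, `merge_pair` and `segment` are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse each left-to-right occurrence of `pair` in `symbols` into one symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each distinct word becomes a tuple of single-character symbols with a count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split `word` into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

With merges learned from a tiny corpus such as `["low", "low", "lower", "lowest"]`, an unseen word like `"lowly"` is split into known subwords plus leftover characters, which is why a small subword vocabulary can still cover rare words.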
And now there's a helper to simply instantiate either WordEmbeddings or LightWordEmbeddings through LanguageSupport, which downloads the embeddings from fastText, meaning that 175 languages are supported from the get-go!
Ilya Muradyan
03/26/2021, 2:28 PM
Hampus Londögård
03/26/2021, 2:31 PM
Hampus Londögård
04/07/2021, 4:16 PM
Ilya Muradyan
04/07/2021, 4:59 PM