# datascience
h
Just did a 1.0-SNAPSHOT release of some NLP tools I've thrown together 🎉 https://github.com/londogard/londogard-nlp-toolkit 🧵 for details
👍 6
✔️ WordEmbeddings (WordEmbeddings & LightWordEmbeddings)
✔️ Stopwords
✔️ WordFrequencies
✔️ Tokenizer (CharTokenizer & SimpleTokenizer)
✔️ Stemmer
✔️ Basic Trie
✔️ Sentence Embeddings (AvgSentenceEmbeddings & USifEmbeddings)
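For anyone wondering what the averaged variant of sentence embeddings boils down to, here is a minimal sketch in plain Kotlin. It is not the toolkit's API: `simpleTokenize`, `avgSentenceEmbedding` and the toy embedding map are hypothetical names made up for illustration, and the library's own SimpleTokenizer and AvgSentenceEmbeddings may well behave differently.

```kotlin
// Hypothetical sketch, NOT the londogard-nlp-toolkit API.
// Lowercases and splits on whitespace; empty pieces are dropped.
fun simpleTokenize(text: String): List<String> =
    text.lowercase().split(Regex("\\s+")).filter { it.isNotEmpty() }

// Averages the vectors of all known tokens; unknown tokens are skipped.
// Returns a zero vector when no token has an embedding.
fun avgSentenceEmbedding(
    tokens: List<String>,
    embeddings: Map<String, FloatArray>,
    dimensions: Int
): FloatArray {
    val sum = FloatArray(dimensions)
    var known = 0
    for (token in tokens) {
        val vector = embeddings[token] ?: continue
        for (i in 0 until dimensions) sum[i] += vector[i]
        known++
    }
    if (known > 0) {
        for (i in 0 until dimensions) sum[i] /= known
    }
    return sum
}

fun main() {
    // Toy 2-dimensional embeddings, assumed purely for the example.
    val toyEmbeddings = mapOf(
        "hello" to floatArrayOf(1f, 0f),
        "world" to floatArrayOf(0f, 1f)
    )
    val tokens = simpleTokenize("Hello brave world")
    val sentence = avgSentenceEmbedding(tokens, toyEmbeddings, 2)
    println(sentence.joinToString()) // prints "0.5, 0.5" ("brave" is unknown and skipped)
}
```

Skipping unknown tokens is just one possible choice here; a fallback vector would be another.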
At the top of TODOs:
• SubWordTokenization (think SentencePiece, BPE, WordPiece & Unigram)
• Vectorization methods (TF-IDF, BM25, BagOfWords & so on)
• Classifiers (leaning towards adding another library to be used, e.g. smile or something like that)
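To make the vectorization TODO a bit more concrete, TF-IDF can be sketched roughly as below. This is not code from the toolkit: `tfIdf` is a hypothetical function, and the weighting (raw term frequency divided by document length, unsmoothed natural-log IDF) is only one of several common variants.

```kotlin
import kotlin.math.ln

// Hypothetical sketch of TF-IDF, not part of londogard-nlp-toolkit.
// tf  = term count in the document / document length
// idf = ln(number of documents / number of documents containing the term)
fun tfIdf(documents: List<List<String>>): List<Map<String, Double>> {
    // Count in how many documents each term appears at least once.
    val documentFrequency = mutableMapOf<String, Int>()
    for (doc in documents) {
        for (term in doc.toSet()) {
            documentFrequency[term] = (documentFrequency[term] ?: 0) + 1
        }
    }
    val n = documents.size.toDouble()
    // One sparse term -> score map per document.
    return documents.map { doc ->
        doc.groupingBy { it }.eachCount().mapValues { (term, count) ->
            val tf = count.toDouble() / doc.size
            val idf = ln(n / documentFrequency.getValue(term))
            tf * idf
        }
    }
}

fun main() {
    val docs = listOf(listOf("a", "b"), listOf("a", "c"))
    // "a" occurs in every document, so its score is 0.0 with this unsmoothed IDF.
    println(tfIdf(docs))
}
```

Returning a map per document keeps the sketch short; a real vectorizer would more likely map terms to fixed indices and emit sparse vectors.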
i
Usage examples in Kotlin notebooks are very welcome :)
👍 4
h
Will try to get that up. I've added usage examples to the README for a lot of the tools.
@Ilya Muradyan, added a README.ipynb which contains interactive examples. Removed the examples from README.md so that nothing falls out of sync. https://github.com/londogard/londogard-nlp-toolkit/blob/main/README.ipynb
Also added support for:
• SentencePieceTokenizer (including simple download for 275 languages, with ~7 different vocab sizes to choose from)
• BpeEmbeddings, which are byte-pair-encoded (BPE) embeddings. They have been shown to be very effective with little space (11 MB performs approximately the same as 6 GB of fastText embeddings)
And now there's a helper to simply instantiate either WordEmbeddings or LightWordEmbeddings through LanguageSupport, which will download embeddings from fastText,
meaning that there are 157 languages supported from the get-go!
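As a rough idea of what the byte-pair encoding behind BpeEmbeddings does to a word before the embedding lookup, here is a minimal sketch. It is not how the toolkit or SentencePiece implements it: `bpeSegment` and the merge list are hypothetical, and real models learn tens of thousands of merges from a corpus.

```kotlin
// Hypothetical sketch of BPE segmentation with a toy merge list.
// Starts from single characters and applies each learned merge in priority order.
fun bpeSegment(word: String, merges: List<Pair<String, String>>): List<String> {
    var symbols: List<String> = word.map { it.toString() }
    for ((left, right) in merges) {
        val merged = mutableListOf<String>()
        var i = 0
        while (i < symbols.size) {
            if (i + 1 < symbols.size && symbols[i] == left && symbols[i + 1] == right) {
                merged += left + right // fuse the adjacent pair into one symbol
                i += 2
            } else {
                merged += symbols[i]
                i += 1
            }
        }
        symbols = merged
    }
    return symbols
}

fun main() {
    // Toy merges, assumed for illustration only.
    val merges = listOf("l" to "o", "lo" to "w", "e" to "r")
    println(bpeSegment("lower", merges)) // prints [low, er]
}
```

Each resulting subword then gets its own vector, which is why a small subword vocabulary can cover words that a word-level embedding table has never seen.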
🔥 1
i
Cool! Could you please PR a descriptor for your library? A link to it will be included in this list. Example: https://github.com/Kotlin/kotlin-jupyter/blob/master/libraries/kmath.json
h
Will do for sure when I find the time! Hopefully this weekend 🙂
👍 1
i
Thank you, merged!
🙏 1