londogard-nlp-toolkit 1.1.0 release! :partying_fac...
# datascience
h
londogard-nlp-toolkit 1.1.0 release! 🥳 One of the, if not the, best NLP toolkits on the JVM. GitHub: londogard/londogard-nlp-toolkit. The updates include optimizations, classifiers, vectorizers & unsupervised keyword extraction. See 🧵 for more details.
🎉 3
👍 4
Updates:
• 🚀 Vectorizers
  ◦ BagOfWords (CountVectorizer), TF-IDF, BM-25
• 🚀 Classifiers
  ◦ Binary Classifiers: Logistic Regression, Naïve Bayes
  ◦ Regression: Linear Regression
  ◦ Sequence Classifier: Hidden Markov Model
• 🚀 Keyword Extraction
  ◦ Co-occurrence based
• 🚀 Sentence Splitting
• 🗲 More Efficient Caching
  ◦ Using Caffeine, which supplies a state-of-the-art cache
  ◦ Also added for LightWordEmbeddings
• 📖 Documentation by JavaDoc via Dokka
• Multiple dependency updates
  ◦ Including my own addition to DJL, which supplies Windows support for the SentencePiece Tokenizer

What it already contains:
• Word Embeddings (multiple variants)
• Sentence Embeddings (multiple variants)
• Stemming
• WordFrequencies
• Stopwords
• Tokenizers (including subword-tokenization)

See the README for more information, or open the Kotlin Jupyter Notebook with interactive docs.
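For readers unfamiliar with the vectorizers listed above, here is a minimal, library-agnostic sketch of TF-IDF weighting in plain Kotlin. It does not use or mirror the toolkit's API: `tokenize` and `tfIdf` are hypothetical names, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is an assumed variant, so the toolkit's own vectorizer may weight terms differently.

```kotlin
import kotlin.math.ln

// Hypothetical helper (not the toolkit's tokenizer): lowercase, split on non-alphanumerics.
fun tokenize(text: String): List<String> =
    text.lowercase().split(Regex("[^\\p{L}\\p{N}]+")).filter { it.isNotEmpty() }

// TF-IDF with raw term counts and an assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1.
fun tfIdf(docs: List<String>): List<Map<String, Double>> {
    val tokenized = docs.map(::tokenize)
    val n = tokenized.size
    // Document frequency: the number of documents each term occurs in.
    val df = tokenized.flatMap { it.toSet() }.groupingBy { it }.eachCount()
    return tokenized.map { tokens ->
        tokens.groupingBy { it }.eachCount().mapValues { (term, count) ->
            count * (ln((1.0 + n) / (1.0 + df.getValue(term))) + 1.0)
        }
    }
}

fun main() {
    val docs = listOf("the cat sat", "the dog sat", "the cat ran")
    // One sparse term -> weight map per document.
    tfIdf(docs).forEach(::println)
}
```

Terms that occur in every document (here "the") get the lowest weight, while terms unique to one document (here "dog" and "ran") get the highest.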
Another fun fact: I moved from EJML to Multik as my main matrix library. EJML is now only used for one case, computing an SVD.
h
Cool project. Do you know if NLP techniques have been applied to do log-data modelling, segmentation and classification? E.g. to spot anomalous events in a series of log records?
h
Thanks 😊 There are some projects I've heard about that attempted this, but most production systems end up using simple rules. For the current unsupervised state of the art, see LogBERT (https://www.researchgate.net/publication/349913512_LogBERT_Log_Anomaly_Detection_via_BERT). I think the most successful NLP application for logs is better search (so-called "semantic search"). Another application I've seen is aggregating and deduplicating logs to create a better overview. Logs are something I've thought about tackling personally, moving from simpler to more advanced techniques.
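The aggregate-and-dedupe idea can be sketched with simple rules, which is roughly where "simpler techniques" would start. The snippet below is an illustrative assumption, not taken from any project mentioned here: `masks`, `toTemplate`, and `dedupe` are hypothetical names, and the masking patterns are placeholders that real logs would need to adapt.

```kotlin
// Hypothetical masking rules, applied in order; real logs usually need domain-specific patterns.
val masks = listOf(
    Regex("\\b\\d+\\.\\d+\\.\\d+\\.\\d+\\b") to "<IP>",
    Regex("\\b0x[0-9a-fA-F]+\\b") to "<HEX>",
    Regex("\\b\\d+\\b") to "<NUM>",
)

// Replace the variable parts of a log line so similar events collapse to one template.
fun toTemplate(line: String): String =
    masks.fold(line) { acc, (pattern, token) -> pattern.replace(acc, token) }

// Group lines by template and count occurrences, giving a compact overview.
fun dedupe(lines: List<String>): Map<String, Int> =
    lines.groupingBy(::toTemplate).eachCount()

fun main() {
    val logs = listOf(
        "Connection from 10.0.0.1 port 5432",
        "Connection from 10.0.0.7 port 5433",
        "Disk usage at 91 percent",
    )
    dedupe(logs).forEach { (template, count) -> println("$count x $template") }
}
```

Templates with unusually low counts are a natural first place to look for anomalous events before moving on to learned models.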
h
Thanks for this great and very useful answer.
👍 1
I love the release wording "One of the, if not the, best". Humble but also making a bold statement.
😅 1
h
Gotta be a bit bold ;)