# mathematics
р
https://github.com/londogard/londogard-nlp-toolkit is the one, @Hampus Londögård, it is great. I haven't dived too deep into it, but would it be possible to make it compatible with the tokenisation required by BERT?
h
I do support https://github.com/google/sentencepiece (via a DJL dependency), which is a type of subword tokenization. I'm pretty sure it is used to train a few BERTs, but the original BERT used WordPiece, right? To support a pretrained BERT you need to use the same tokenizer with the same parameters. The SentencePiece (DJL's) that is included is very fast; it uses JNI to call the C++ SentencePiece. I could run a benchmark this weekend on the 1GB test they use. 🙂
What is SentencePiece
SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo].
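Roughly, using it looks something like this - a minimal sketch assuming DJL's SentencePiece extension (`ai.djl.sentencepiece.SpTokenizer`) and a pretrained `.model` file; double-check the exact API against DJL's docs:

```kotlin
import ai.djl.sentencepiece.SpTokenizer
import java.nio.file.Paths

fun main() {
    // "wiki-10k.model" is a placeholder for whatever SentencePiece model you trained/downloaded
    SpTokenizer(Paths.get("wiki-10k.model")).use { tokenizer ->
        // tokenize() calls into the C++ SentencePiece library over JNI
        val pieces = tokenizer.tokenize("Subword tokenization on the JVM")
        println(pieces) // e.g. [▁Sub, word, ▁token, ization, ▁on, ▁the, ▁J, VM]
    }
}
```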
р
Oh nice, yes I think it's wordpiece.
What would be your approach then? It will be very easy for us to load a pretrained BERT with the Kotlin PyTorch binding I am writing (or the one JavaCPP is working on as well), but I don't know yet what to do about tokenisation.
h
For now this toolkit is mainly about giving access to the different types of components you sometimes take for granted in Python, like:
• Stemming
• Stopwords
• WordProbabilities
• Different types of embeddings (pretrained, no neural network)
• Tokenizers that make sense
I'm a big fan of bpemb - https://bpemb.h-its.org/ - embeddings based on subwords, which greatly reduces size while maintaining performance. Really impressive, and this is also supported in the toolkit. They found that an 11MB embedding had equivalent performance to 6GB fastText embeddings! 😮 That was in English; I think morphologically rich languages like Finnish & Swedish could see even better results.
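The trick that makes those tiny files work, roughly: every word gets split into subword pieces by a SentencePiece model, and the word vector is the average of the pieces' vectors, so you only store vectors for ~10k pieces instead of millions of words. A toy sketch of that idea (made-up vectors, not the toolkit's actual API):

```kotlin
// Average the subword vectors to get a word vector; pieces missing from the
// vocabulary are simply skipped.
fun averageSubwordVectors(
    subwords: List<String>,
    vectors: Map<String, FloatArray>,
    dim: Int
): FloatArray {
    val sum = FloatArray(dim)
    var count = 0
    for (piece in subwords) {
        val v = vectors[piece] ?: continue
        for (i in 0 until dim) sum[i] += v[i]
        count++
    }
    if (count > 0) for (i in 0 until dim) sum[i] /= count
    return sum
}

fun main() {
    // Tiny fake vocabulary standing in for the ~10k BPEmb subword vectors.
    val vectors = mapOf(
        "▁token" to FloatArray(3) { 0.1f },
        "ization" to FloatArray(3) { 0.3f }
    )
    // "tokenization" -> ["▁token", "ization"] would come from SentencePiece.
    val vec = averageSubwordVectors(listOf("▁token", "ization"), vectors, dim = 3)
    println(vec.toList()) // [0.2, 0.2, 0.2]
}
```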
WIP is to add simple tools for actually extracting things, like classifiers & keyword extraction.
Sorry for the slow response, had a busy weekend ^^ Because of time I never set up a good environment to test, but here we go. It ran a lot slower than the HF timings (5 min vs 20 s (!)), but I'd like to note a few things that might greatly impact the results:
1. Running on a laptop with not-great specs + other programs running at the same time
2. Using a standard JVM on WSL through Windows. Should perhaps have used GraalVM
3. Not "warming it up" (only ran through the text once before)
4. JNI vs pure Rust (shuffling 1GB of data across the boundary is NOT a smart move hehe. Allowing the native code to directly read and write files would be much faster)
5. My SentencePiece model was a generic one trained on Wikipedia with vocab size = 10k (the "default mode" in londogard-nlp-toolkit). The configuration of a subword tokenizer makes a big difference, as it splits the words in wildly different ways
But I was indeed disappointed to see such a huge difference when running on large files. I'm thinking about either
A. Creating bindings to HF Tokenizers, or
B. Simply trying to wrap the current tokenizer to allow tokenization directly on disk instead of transferring strings from JVM to native to JVM.
B is the really important one if you want to achieve really high speeds and throughput.
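To make B concrete, here is what the two paths look like as a hypothetical interface (nothing like this exists in the toolkit yet, it's just to show where the FFI cost sits):

```kotlin
import java.nio.file.Path

// Hypothetical sketch only -- illustrating options A/B above, not a real API.
interface SubwordTokenizer {
    // Today's path: every string is copied JVM -> native -> JVM,
    // so 1GB of text means gigabytes of copies plus per-call JNI overhead.
    fun tokenize(sentence: String): List<String>

    // Option B: hand the native side only two file paths and let it stream
    // the data itself; the JVM never touches the 1GB of text.
    fun tokenizeFile(input: Path, output: Path)
}
```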
In my own applications I never reach these big amounts of text data when on the JVM (only with deep learning, and then I'm on Python). So I hadn't noticed this, and the tokenization is indeed really fast when running on smaller data.
р
Sorry, so what did you benchmark against what there?
I should say that typically the reason we want to port NLP tasks to the JVM is precisely because the data comes from the JVM. So if FFI hurts too much, then one needs to look for a pure JVM solution I believe (at least for tokenisation). SparkNLP is something to consider, for example.
h
SparkNLP often goes through FFI to both Python & native code though. I ran the tokenizer on 1GB of random text data, like HF Tokenizers do, but the subword tokenization in my lib wraps a native solution as I haven't had the time to reimplement a JVM variant (which is on my TODO). The FFI was crazy expensive in that case 😛
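The run itself was basically a loop of this shape (the file name and the split-on-whitespace tokenizer here are stand-ins, not what I actually used):

```kotlin
import java.io.File
import kotlin.system.measureTimeMillis

fun main() {
    val file = File("random-1gb.txt")              // placeholder input file
    val tokenize = { s: String -> s.split(" ") }   // stand-in for the real tokenizer
    // a short warm-up pass so the JIT has at least seen the hot path
    file.useLines { lines -> lines.take(10_000).forEach { tokenize(it) } }
    val ms = measureTimeMillis {
        file.useLines { lines -> lines.forEach { tokenize(it) } }
    }
    println("Full pass took $ms ms")
}
```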
р
Yes, bad wording on my part, I did not mean that SparkNLP was a pure JVM solution. Have you ever run any benchmarks against it? And in terms of accuracy? It would be great to actually have a Kotlin-friendly API that integrates smoothly with https://github.com/JetBrains/kotlin-spark-api
On the other hand, Spark is a beast, so sometimes one just wants a small, fast and lightweight solution.
h
Yeah, small, fast & lightweight is what I aim for. But any full-JVM program that doesn't use native code should be easy to distribute on Spark.
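Something like this should just work once the tokenizer is pure JVM - a sketch with kotlin-spark-api, where `simpleTokenize` is a stand-in regex tokenizer rather than anything from the toolkit:

```kotlin
import org.jetbrains.kotlinx.spark.api.*

// Stand-in tokenizer; swap in a real pure-JVM tokenizer here.
fun simpleTokenize(text: String): List<String> =
    text.lowercase().split(Regex("\\W+")).filter { it.isNotBlank() }

fun main() = withSpark {
    dsOf("SentencePiece on the JVM", "Spark is a beast")
        .map { simpleTokenize(it).joinToString(" ") }
        .show()
}
```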