# ai
f
Is there a KMP library for running LLM inference (locally)?
s
Not KMP, but there is a pure Java library based on Project Panama and the Vector API. It's pretty neat - https://github.com/tjake/Jlama
1
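For reference, a rough sketch of what in-process inference with Jlama looks like from Kotlin, adapted from the project's README; package names, the `generate(...)` signature, and the model path are approximations and may differ between Jlama versions:

```kotlin
import com.github.tjake.jlama.model.ModelSupport
import com.github.tjake.jlama.safetensors.DType
import java.io.File
import java.util.UUID

fun main() {
    // Assumes a Jlama-quantized model has already been downloaded to this directory.
    val localModelPath = File("./models/Llama-3.2-1B-Instruct-JQ4")

    // Load the model; the two DTypes are the working memory type and the quantized memory type.
    val model = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8)

    // Build a chat prompt using the model's own prompt template.
    val ctx = model.promptSupport().get().builder()
        .addUserMessage("Explain Project Panama in one sentence.")
        .build()

    // temperature = 0.0f, up to 256 new tokens, no streaming callback.
    val response = model.generate(UUID.randomUUID(), ctx, 0.0f, 256) { _, _ -> }
    println(response.responseText)
}
```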
p
Not quite pure, after all there's some native code in there. But yeah, it's a good library. There is also nirmato-ollama, a client for Ollama.
👍 1
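Note that an Ollama client talks to a separately running Ollama server rather than doing in-process inference. I don't know nirmato-ollama's own API, but the underlying REST call it wraps looks roughly like this (a minimal Ktor sketch; the model tag is just a placeholder):

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.engine.cio.CIO
import io.ktor.client.request.post
import io.ktor.client.request.setBody
import io.ktor.client.statement.bodyAsText
import io.ktor.http.ContentType
import io.ktor.http.contentType
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    val client = HttpClient(CIO)
    // Ollama listens on localhost:11434 by default; /api/generate does a one-shot completion.
    val response = client.post("http://localhost:11434/api/generate") {
        contentType(ContentType.Application.Json)
        setBody("""{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}""")
    }
    println(response.bodyAsText())
    client.close()
}
```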
f
Thanks guys. But Jlama only does CPU inference, and I'm looking for in-process inference. I think I'll make a KMP library starting with java-llama.cpp (not to be confused with Jlama).
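A hypothetical sketch of the kind of common surface such a KMP wrapper could expose; none of these names come from an existing library, and java-llama.cpp (JVM/Android) or a WebGPU backend (web) would sit behind the platform actuals:

```kotlin
import kotlinx.coroutines.flow.Flow

// Common API in commonMain; each target supplies its own implementation.
interface LlmSession {
    // Streams generated tokens for the given prompt.
    fun generate(prompt: String, maxTokens: Int = 256): Flow<String>
    fun close()
}

// Each target provides its own loader via expect/actual.
expect suspend fun loadModel(modelPath: String): LlmSession
```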
p
Interesting, what targets do you need? Btw, there's also KInference for running ONNX models, and here's an example with GPT-2. But I'm not sure if it will be supported in the future; there haven't been any commits recently.
f
Currently desktop, Android, web.
So for web, a WebGPU backend is important.
s
> CPU inference
Yeah, I think WebGPU support is WIP - https://github.com/tjake/Jlama/pull/150 , but this won't be multiplatform.
f
The state of LLM inference outside of desktop is sad: web libraries either lack GPU inference or freeze the browser. No Android library as far as I can see. Actually, there might be something.
Yeah, so apparently java-llama.cpp doesn't use the GPU version of llama.cpp.
After 2 weeks of trying to make it work, I'm starting to think we are not at the point where we can have good local LLM inference. For anyone interested, this is the sorry state of local LLM inference across all platforms other than iOS. I'm guessing Python libraries do better, but we don't have that in the KMP world. If anyone has suggestions, I'm happy to hear them.

**Current issues of LLM backends**

**All**
• Do not differentiate between the loading progress of downloading and of loading from disk

**Web**

WebLLM
• Lags out the entire browser when running inference (#694)
• Progress listener reports only raw text output instead of structured info (related: #666)
• Doesn't work on Firefox (#644)
• Can cancel inference, but can't cancel model loading (#499)
• Requires special MLC-LLM model builds

Transformers.js
• Hard caps at certain model sizes (#952)
• No way to cancel inference or loading (#1182)
• Vendor locked to Hugging Face repositories; only ONNX models supported

**Desktop**

Java-llama.cpp
• No progress indicator (#113)
• Supports only the GGUF format
• No GPU build available yet
• Model loading may be impossible to interrupt

Jlama
• Developed from scratch by a single person; limited model support
• Only CPU inference right now (#150)
• Most promising one to convert to KMP and use across all platforms (nice)

**Android**

Java-llama.cpp
• Same issues as desktop Java-llama.cpp
• Even more difficult to build and integrate

MediaPipe
• Doesn't support chat templates (#5558)
• Very fast (nice)
• Only .task files supported
• No way to interrupt inference; even closing the model doesn't work (#5740)
• Probably no way to cancel model loading
• No progress indicator (#6002)
👍 1
s
@Fudge there is also GPULlama3 - https://github.com/beehive-lab/GPULlama3.java
t
There is also KInference with TFJS and full MPP support: https://github.com/JetBrains-Research/kinference
m
Hi @Fudge, I am playing around with the same topic (https://skainet.sk). Do you have particular models, model formats, HW requirements, use cases, reasons why you want to run LLMs locally? I am particularly interested in the KMP aspect: what platforms are you targeting? It could help us in prioritizing the tasks we are working on, in order to get the first version out...
f
Hey @Michal Harakal, local LLM inference on edge devices, i.e. Android, web, iOS, is extremely important to me, as most users are on those platforms, and many users value the privacy, control, and reduced costs that local LLMs provide.
I am looking to build an app based on LLM chat
@suresh @TanVD Thank you for the suggestions, I will take a look around
GPULlama3 seems to have limited model support, and KInference unfortunately does not seem to be maintained (at least not at the level I would expect for a library in this space).
m
@Fudge this sounds very interesting. Do you have some adapted, fine-tuned (or similar) models, or do you want to go with existing models? Do you have preferences regarding file format (gguf, safetensors, ...)?
I understand you focus on edge and on-device AI, but for the JVM, as something to start with, I can recommend Jlama from Jake, because of its maturity and the great support on Discord...
f
@Michal Harakal It's all about being able to use as many models as possible, including fine-tunes. If Google releases a new model, it should be supported ASAP. Same with Meta, OpenAI, etc. The format doesn't matter, as long as we can convert the original model to the format we use. Most models are PyTorch (safetensors), but GGUF is more convenient for applications.
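As a side note on telling the two formats apart on disk: GGUF files start with the ASCII magic "GGUF", while safetensors files start with an 8-byte little-endian header length followed by a JSON header. A small JVM-only sketch (not from the thread, just an illustration):

```kotlin
import java.io.File

// Detect a model file's format from its first bytes.
fun detectModelFormat(file: File): String {
    val head = file.inputStream().use { it.readNBytes(16) }
    return when {
        head.size >= 4 && head.copyOfRange(0, 4).decodeToString() == "GGUF" -> "gguf"
        head.size >= 9 && head[8] == '{'.code.toByte() -> "safetensors"
        else -> "unknown"
    }
}
```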