After 2 weeks of trying to make it work, I'm starting to think we are not at the point where we can have good local LLM inference.
For anyone interested, this is the sorry state of local LLM inference across all platforms other than iOS. I'm guessing Python libraries do better, but we don't have those in the KMP world. If anyone has suggestions, I'm happy to hear them.
Current issues of LLM backends
All
• Do not differentiate between download progress and loading-from-disk progress (see the sketch below for the distinction I mean)
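To be concrete, this is the kind of distinction I'd want a loading API to report. Purely illustrative, none of the backends below expose anything like this:

```kotlin
// Purely illustrative progress model, not an API of any backend listed here:
// I want to show "downloading" and "loading from disk" differently in the UI.
sealed interface ModelLoadProgress {
    /** Model files are being fetched over the network. */
    data class Downloading(val downloadedBytes: Long, val totalBytes: Long?) : ModelLoadProgress

    /** Files are already on disk and are being loaded into memory. */
    data class LoadingFromDisk(val fraction: Float) : ModelLoadProgress

    /** Ready for inference. */
    data object Ready : ModelLoadProgress
}
```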
Web
WebLLM
• Lags out the entire browser when running inference (#694)
• Progress listener reports only raw text output instead of structured info (related: #666)
• Doesn't work on Firefox (#644)
• Can cancel inference, but can't cancel model loading (#499); see the sketch after this list for what I'd want here
• Requires special MLC-LLM model builds
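For context on the cancellation complaints (here and below): in common KMP code I'd like to hide every backend behind something like the interface in this sketch. The names are mine and entirely hypothetical; the recurring problem is that the platform implementations have no way to actually honour cancellation of loading, and sometimes not even of inference.

```kotlin
import kotlinx.coroutines.flow.Flow

// Hypothetical common-code contract; the pain point is that most backends
// give the platform-specific implementations no way to honour cancellation.
interface LlmBackend {
    /** Suspends until the model is ready; cancelling the coroutine should abort the download/load. */
    suspend fun loadModel(modelId: String, onProgress: (ModelLoadProgress) -> Unit)

    /** Streams generated tokens; cancelling collection should interrupt generation. */
    fun generate(prompt: String): Flow<String>

    /** Frees native/GPU resources. */
    fun close()
}
```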
Transformers.js
• Hard caps at certain model sizes (#952)
• No way to cancel inference or loading (#1182)
• Vendor-locked to Hugging Face repositories; only ONNX models supported
Desktop
Java-llama.cpp
• No progress indicator (#113)
• Supports only GGUF format
• No GPU build available yet
• Model loading may be impossible to interrupt
Jllama
• Developed from scratch by a single person; limited model support
• Only CPU inference right now (#150)
• Most promising one to port to KMP and use across all platforms (nice)
Android
Java-llama.cpp
• Same issues as desktop Java-llama.cpp
• Even more difficult to build and integrate
MediaPipe
• Doesn't support chat templates (#5558), so you have to format the prompt yourself (rough sketch at the end of this list)
• Very fast (nice)
• Only .task files supported
• No way to interrupt inference; even closing the model doesn't work (#5740)
• Probably no way to cancel model loading
• No progress indicator (#6002)
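Since MediaPipe won't apply a chat template for you, the only workaround I can see is formatting the conversation by hand before passing it in. Rough sketch below, with Gemma-style turn tags as far as I remember them; double-check the exact format against the model card of whatever .task file you're running.

```kotlin
// Rough sketch of applying a chat template by hand, since MediaPipe won't do it.
// The turn tags are the Gemma-style ones as I remember them; verify against the
// model card before relying on this.
data class ChatMessage(val role: String, val content: String)

fun formatGemmaStylePrompt(messages: List<ChatMessage>): String = buildString {
    for (message in messages) {
        // Gemma uses "model" rather than "assistant" for the reply role.
        val role = if (message.role == "assistant") "model" else message.role
        append("<start_of_turn>").append(role).append('\n')
        append(message.content)
        append("<end_of_turn>\n")
    }
    // Leave the last turn open so generation continues as the model role.
    append("<start_of_turn>model\n")
}
```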