After 2 weeks of trying to make it work, I'm starting to think we are not at the point where we can have good local LLM inference.
For anyone interested, this is the sorry state of local LLM inference across all platforms other than iOS. I'm guessing Python libraries do better, but we don't have those in the KMP world. If anyone has suggestions, I'm happy to hear them.
Current issues of LLM backends
All
• Do not differentiate between download progress and loading-from-disk progress (see the sketch below for the distinction I mean)
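To be concrete, this is the kind of distinction I'd want a loading API to report. Purely illustrative, none of the backends below expose anything like this:

```kotlin
// Purely illustrative progress model, not an API of any backend listed here:
// I want to show "downloading" and "loading from disk" differently in the UI.
sealed interface ModelLoadProgress {
    /** Model files are being fetched over the network. */
    data class Downloading(val downloadedBytes: Long, val totalBytes: Long?) : ModelLoadProgress

    /** Files are already on disk and are being loaded into memory. */
    data class LoadingFromDisk(val fraction: Float) : ModelLoadProgress

    /** Ready for inference. */
    data object Ready : ModelLoadProgress
}
```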
Web
WebLLM
• Lags out the entire browser when running inference (#694)
• Progress listener reports only raw text output instead of structured info (related: #666)
• Doesn't work on Firefox (#644)
• Can cancel inference, but can't cancel model loading (#499); see the sketch after this list for what I'd want here
• Requires special MLC-LLM model builds
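For context on the cancellation complaints (here and below): in common KMP code I'd like to hide every backend behind something like the interface in this sketch. The names are mine and entirely hypothetical; the recurring problem is that the platform implementations have no way to actually honour cancellation of loading, and sometimes not even of inference.

```kotlin
import kotlinx.coroutines.flow.Flow

// Hypothetical common-code contract; the pain point is that most backends
// give the platform-specific implementations no way to honour cancellation.
interface LlmBackend {
    /** Suspends until the model is ready; cancelling the coroutine should abort the download/load. */
    suspend fun loadModel(modelId: String, onProgress: (ModelLoadProgress) -> Unit)

    /** Streams generated tokens; cancelling collection should interrupt generation. */
    fun generate(prompt: String): Flow<String>

    /** Frees native/GPU resources. */
    fun close()
}
```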
Transformers.js
• Hard caps at certain model sizes (#952)
• No way to cancel inference or loading (#1182)
• Vendor-locked to Hugging Face repositories; only ONNX models supported
Desktop
Java-llama.cpp
• No progress indicator (#113)
• Supports only GGUF format
• No GPU build available yet
• Model loading may be impossible to interrupt
Jllama
• Developed from scratch by a single person; limited model support
• Only CPU inference right now (#150)
• Most promising one to port to KMP and use across all platforms (nice)
Android
Java-llama.cpp
• Same issues as desktop Java-llama.cpp
• Even more difficult to build and integrate
MediaPipe
• Doesn't support chat templates (#5558), so you have to format the prompt yourself (rough sketch at the end of this list)
• Very fast (nice)
• Only .task files supported
• No way to interrupt inference; even closing the model doesn't work (#5740)
• Probably no way to cancel model loading
• No progress indicator (#6002)
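Since MediaPipe won't apply a chat template for you, the only workaround I can see is formatting the conversation by hand before passing it in. Rough sketch below, with Gemma-style turn tags as far as I remember them; double-check the exact format against the model card of whatever .task file you're running.

```kotlin
// Rough sketch of applying a chat template by hand, since MediaPipe won't do it.
// The turn tags are the Gemma-style ones as I remember them; verify against the
// model card before relying on this.
data class ChatMessage(val role: String, val content: String)

fun formatGemmaStylePrompt(messages: List<ChatMessage>): String = buildString {
    for (message in messages) {
        // Gemma uses "model" rather than "assistant" for the reply role.
        val role = if (message.role == "assistant") "model" else message.role
        append("<start_of_turn>").append(role).append('\n')
        append(message.content)
        append("<end_of_turn>\n")
    }
    // Leave the last turn open so generation continues as the model role.
    append("<start_of_turn>model\n")
}
```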