llama.cpp
The C/C++ engine powering local AI — lightning-fast inference that Ollama and LM Studio build on.
About llama.cpp
llama.cpp is the foundational C/C++ library for running quantized LLMs on consumer hardware. Created by Georgi Gerganov, it powers tools like Ollama and LM Studio behind the scenes. It supports the GGUF model format and GPU offloading, and runs on virtually any platform.
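The library itself exposes a C API, but a quick way to see GGUF loading and GPU offloading in action is through the community llama-cpp-python bindings (an assumption here; they wrap llama.cpp but are a separate project). The model path and layer count below are placeholders.

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
# The model path and n_gpu_layers value are placeholders; adjust for your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # any GGUF-quantized model file
    n_gpu_layers=-1,   # offload all layers to the GPU (requires a CUDA/Metal/Vulkan build)
    n_ctx=4096,        # context window size
)

out = llm("Explain GGUF quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```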
Key Features
- C/C++ for maximum performance
- GGUF quantization format
- GPU offloading (CUDA, Metal, Vulkan)
- Server mode with an OpenAI-compatible API (see the sketch after this list)
- Runs on everything from Raspberry Pi to servers
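As a sketch of the server mode mentioned above: llama-server exposes an OpenAI-compatible endpoint (by default on localhost:8080), so any OpenAI client library can talk to it. The port, model name, and prompt below are assumptions; match them to how you launched the server.

```python
# Query a locally running llama-server through its OpenAI-compatible API.
# Assumes the server was started separately, e.g. listening on localhost:8080.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # default llama-server address; change if needed
    api_key="sk-no-key-required",         # the local server does not check the key
)

resp = client.chat.completions.create(
    model="local-model",  # llama-server answers with whatever model it was launched with
    messages=[{"role": "user", "content": "Say hello from llama.cpp."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```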
Pros & Cons
Pros
+ Among the fastest local inference engines
+ Runs on virtually any hardware
+ Foundation of the local AI ecosystem
Cons
- Command-line interface only
- Requires compilation for best performance
- Steep learning curve for beginners
Use Cases
- Building local AI applications
- Maximum-performance local inference
- Embedded AI in apps
- Research and benchmarking
Pricing
Open Source
Free and open-source. MIT license.
Who It's For
- C/C++ developers
- ML engineers
- Embedded systems developers
- Performance enthusiasts