TL;DR

  • I’m building a local chatbot to ask questions about my garden. This series documents how I make it fast enough to actually use.
  • The critical metrics are TTFT and ITL (latency), plus Answer Quality (a small eval set of gardening questions), so we can confirm speed gains don’t degrade the model.
  • The stack is llama.cpp, which supports both Apple Silicon (Metal) and CPU-only inference.

The Project

I grow a garden and often have questions about companion planting, soil amendments, what to do when the leaves start yellowing in July, etc. I wanted a chatbot I could ask those questions without sending them to a cloud API. Something local, private, and fast enough that it doesn’t feel like a chore to use.

The “fast enough” part turned out to be more interesting than I expected. A chatbot that pauses for three seconds before the first word appears feels broken, even if the answer is adequate. One that streams smoothly and starts immediately feels like a conversation.

This series is about closing that gap. We’ll start from a working local setup, measure where we are, and improve it one step at a time. Each post focuses on a single lever so the before/after is clear.

The Metrics

We will use three metrics to determine whether a chatbot feels like a conversation and whether the answers are actually worth reading:

| Metric | Low | High |
|---|---|---|
| TTFT (Time to First Token) | First word appears quickly. Feels snappy and attentive | First token is delayed. Feels laggy or frozen |
| ITL (Inter-Token Latency) | Tokens stream smoothly and continuously | Pauses between tokens make output feel choppy and slow |
| Answer Quality (Eval Score) | Low score: answers are vague, incorrect, or off-topic; the optimization degraded the model | High score: answers are accurate, specific, and useful; quality is preserved |

TTFT matters more for first impressions. A 2-second wait before anything appears feels broken, even if generation is fast after that. ITL matters for readability. If tokens arrive faster than you can read, you won’t notice the gaps; if they arrive slower, you end up waiting on the stream, which gets frustrating.
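To make the two latency metrics concrete, here is a minimal sketch of how they could be computed from any token stream. The names `measure_stream` and `fake_stream` are hypothetical, and `fake_stream` is only a stand-in for a real model so the example runs without llama.cpp. The injectable `clock` parameter is an assumption added to make the timing logic easy to check.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], List[float]]:
    """Return (ttft_seconds, itl_seconds) for one streamed response.

    TTFT is the time from starting to consume the stream until the first
    token arrives. Each ITL entry is the gap between two consecutive tokens.
    """
    start = clock()
    previous = None
    ttft = None
    itls = []
    for _ in tokens:
        now = clock()
        if previous is None:
            ttft = now - start            # first token: this is TTFT
        else:
            itls.append(now - previous)   # later tokens: inter-token gap
        previous = now
    return ttft, itls


def fake_stream(n_tokens: int = 5, first_delay: float = 0.05,
                gap: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: waits, then yields tokens."""
    time.sleep(first_delay)               # simulates prompt processing
    for i in range(n_tokens):
        if i > 0:
            time.sleep(gap)               # simulates per-token decode time
        yield f"tok{i}"


ttft, itls = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms")
print(f"mean ITL: {sum(itls) / len(itls) * 1000:.1f} ms")
```

With a real model, the generator passed to `measure_stream` would be the streaming output of the inference library rather than `fake_stream`.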

Answer Quality tracks whether the model still gives useful answers after optimization. We’ll use a small fixed eval set (about ten gardening questions with reference answers) scored by an LLM judge on a 1–5 scale and reported as a pass rate. The questions stay fixed across every post so results are directly comparable.
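As a rough illustration of turning judge scores into a pass rate, here is a small sketch. The pass threshold (a score of 4 or higher) is an assumption made for this example, not something the series has fixed yet, and the scores are made-up numbers rather than real results. `pass_rate` is a hypothetical helper name.

```python
from typing import Sequence

# Assumption for illustration: a judge score of 4 or 5 counts as a pass.
PASS_THRESHOLD = 4


def pass_rate(scores: Sequence[int], threshold: int = PASS_THRESHOLD) -> float:
    """Fraction of eval questions whose 1-5 judge score meets the threshold."""
    if not scores:
        raise ValueError("no scores to summarize")
    for score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range 1-5: {score}")
    passed = sum(1 for score in scores if score >= threshold)
    return passed / len(scores)


# Made-up scores for a ten-question eval set, one score per question.
example_scores = [5, 4, 3, 5, 4, 2, 5, 4, 4, 3]
print(f"pass rate: {pass_rate(example_scores):.0%}")  # prints "pass rate: 70%"
```

Because the questions stay fixed, the same function can be applied to every post’s scores and the resulting rates compared directly.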

In later posts we’ll report p50, p95, and p99 rather than averages. Averages hide the worst cases, and slow outliers are what stick in memory.

| Percentile | Purpose |
|---|---|
| P50 | Used to detect broad regressions |
| P95 | Used to tune system performance |
| P99 | Used to expose architectural bottlenecks & outliers |
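The sketch below shows why percentiles tell a different story than the average. It uses the nearest-rank definition of a percentile, which is one of several common definitions and is an assumption here; later posts may use a different one. The latency values are made up for illustration, and `percentile` is a hypothetical helper.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Made-up inter-token latencies in milliseconds: mostly steady, one stall.
itl_ms = [20] * 18 + [25, 400]

print(f"mean: {mean(itl_ms)} ms")            # 39.25, matches no real sample
print(f"p50:  {percentile(itl_ms, 50)} ms")  # 20, the typical token
print(f"p95:  {percentile(itl_ms, 95)} ms")  # 25
print(f"p99:  {percentile(itl_ms, 99)} ms")  # 400, the stall
```

The mean lands at a value that describes neither the typical token nor the stall, while p50 and p99 report each of them separately.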

Why macOS + llama.cpp

Most LLM inference optimization content assumes NVIDIA hardware. That’s fine if you have at least one NVIDIA GPU, but it leaves out a lot of people, including me on my home machine, unless I rent extra hardware.

Apple Silicon gets Metal acceleration, but even an older Intel machine can follow along. The profiling tool we’ll use in Part 3 (Instruments) ships free with Xcode, so no special hardware or licenses are needed. And llama.cpp, which we’ll use throughout, runs on Metal and CPU with the same code. When we’re done optimizing for Apple Silicon, the same approach carries over to CPU-only laptops.

llama.cpp is one of the most portable options in the local inference space. It powers applications like Ollama under the hood, has excellent quantization support, and has a Python API (llama-cpp-python) that keeps the examples readable without hiding what’s happening.

The Series Roadmap

  1. Part 1: Introduction (this post): metrics, stack, and what’s coming
  2. Part 2: Baseline: install llama.cpp, pick a model, write a minimal decode loop, measure TTFT, ITL, and model answer quality
  3. Part 3: Profiling: use Instruments and llama.cpp’s built-in timing to see where time is actually going
  4. Part 4: Optimization 1: quantization level choices and their impact on speed vs. quality
  5. Part 5 and onward: more optimizations, plus dropping down to CPU-only for readers without Apple Silicon

Each post in Parts 2–5 will have a runnable code example in the repo.

What’s Next

In Part 2 we’ll install llama.cpp, pick a small model, write a minimal Python decode loop, and ask it a question about tomatoes. Then we’ll measure our baseline TTFT, ITL, and answer quality.

Code

The full code is in the chatbot-macos-optimization repo.

References / Further Reading