TL;DR

  • Installed llama-cpp-python with Metal GPU support; Qwen2.5 14B Q4_K_M runs in ~9 GB RAM
  • Baseline warm TTFT: 214 ms (p50), 220 ms (p95), 221 ms (p99); cold TTFT: 382 ms; ITL: 70.8 ms/tok (p50), 71.7 ms/tok (p95/p99)
  • Eval score: 90% pass rate on 10 fixed gardening questions. Our optimizations must not degrade the quality of the model below this threshold.

The Problem

I typed my first question, something about why tomato leaves go yellow in July, and waited. The cursor held still for a moment, then words started arriving.

It worked, but there were two problems. Before the first word appeared there was a slight drag that made me wonder whether something had gone wrong. And once words did start arriving, they came slowly enough that reading felt choppy: a word, a pause, another word. The experience was frustrating. Both problems are what this series is about.

Before we can improve anything we need baseline numbers for our model latency and quality.

Context

This is Part 2 of Optimizing a Chatbot on macOS. The goal is a local garden chatbot that feels fast enough to actually use, works offline, and does not rely on external or cloud GPUs.

The Approach

Key Decisions

Model: Qwen2.5 14B Instruct. At 3B, inference on Apple Silicon is already fast enough that the optimization headroom is limited (TTFT barely registers as a pause), and the 3B model performs considerably worse on the eval, with a 70% pass rate. At 14B, TTFT is in the 210–230 ms range, which is perceptible and worth optimizing. The model also answers specific gardening questions more reliably (90% pass rate), giving us a cleaner quality signal to protect as we tune. Q4_K_M stays within ~9 GB.

Quantization: Q4_K_M. A good all-around default that balances size and quality. We will compare quantization levels explicitly in Part 4; for now this is the starting point. Q4_K_M is a roughly 4-bit quantization that keeps quality close to the original while cutting memory to roughly a third of what the 16-bit model needs.
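As a rough sanity check on the ~9 GB figure, weight size can be estimated as parameter count times bits per weight. This is a back-of-the-envelope sketch, not a measurement: the parameter count (~14.7 billion) and the average of ~4.85 bits per weight for Q4_K_M are assumptions that do not come from this post, and the estimate ignores the KV cache and runtime buffers.

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumptions, not figures from this post:
N_PARAMS = 14.7e9      # approximate parameter count of the 14B model
Q4_K_M_BITS = 4.85     # approximate average bits per weight for Q4_K_M

q4_gb = estimated_size_gb(N_PARAMS, Q4_K_M_BITS)
print(f"Estimated Q4_K_M weights: ~{q4_gb:.1f} GB")
```

The estimate lands just under 9 GB, which is in line with the figure quoted above.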

llama-cpp-python over the raw CLI. The Python bindings keep examples readable without hiding what’s happening. Metal acceleration is identical either way.

Steps

1. Install llama-cpp-python with Metal support:

CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

Verify that Metal is active at model load. If you run the script with the --verbose flag, you should see ggml_metal_... lines in the console output.
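If the console output is long, the check can be automated by saving it to a file and scanning for that prefix. The helper below is a minimal sketch: metal_lines is a hypothetical name, and the sample text is a placeholder rather than real llama.cpp output, since the exact log lines vary between versions.

```python
def metal_lines(log_text: str) -> list[str]:
    """Return the log lines that mention the Metal backend prefix."""
    return [line for line in log_text.splitlines() if "ggml_metal" in line]


# Placeholder text for illustration only; not real llama.cpp output.
sample_log = "\n".join(
    [
        "some unrelated startup line",
        "ggml_metal_... (placeholder for a real Metal log line)",
    ]
)

found = metal_lines(sample_log)
print("Metal lines found" if found else "No Metal lines; check the build flags")
```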

2. Download the model:

pip install huggingface_hub
hf download bartowski/Qwen2.5-14B-Instruct-GGUF \
  Qwen2.5-14B-Instruct-Q4_K_M.gguf --local-dir models/

GGUF is the file format llama.cpp uses for quantized models.

3. The decode loop.

The script loads the model, warms it up, then streams answers to 10 questions. TTFT is measured as the wall time from the function call to the moment the first token arrives. ITL is the time to generate all tokens after the first, divided by the number of those tokens: a per-token latency that is a rough proxy for streaming throughput.

import time


class Benchmarker:

    ...

    def _ask(
        self, *, question: str, max_tokens: int, stop_tokens: list[str]
    ) -> tuple[str, float, float]:
        """Stream one answer and return (answer, TTFT in ms, ITL in ms/token)."""
        prompt = self.template.format(question=question)
        chunks: list[str] = []
        first_token_time: float | None = None
        start = time.perf_counter()

        for chunk in self.model(
            prompt, max_tokens=max_tokens, stream=True, stop=stop_tokens
        ):
            text = chunk["choices"][0]["text"]
            if first_token_time is None:
                first_token_time = time.perf_counter()
            chunks.append(text)

        end = time.perf_counter()
        if first_token_time is None:
            raise RuntimeError("model generated no tokens")
        answer = "".join(chunks).strip()
        ttft_ms = (first_token_time - start) * 1000
        # Tokenize for accurate ITL. Streaming chunks are not 1:1 with tokens
        n_tokens = len(self.model.tokenize(answer.encode(), add_bos=False))
        if n_tokens < 2:
            raise RuntimeError(
                f"model generated {n_tokens} token(s); need at least 2 to measure ITL"
            )
        time_after_first = end - first_token_time
        itl_ms = time_after_first / (n_tokens - 1) * 1000
        return answer, ttft_ms, itl_ms
    
    ...
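The timing logic can be exercised without loading a model. The sketch below applies the same TTFT and ITL definitions to any iterable of text chunks. measure_stream and fake_stream are hypothetical names, not part of the repo, and unlike _ask this version counts chunks instead of re-tokenizing, so it is only exact when each chunk is one token.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[float, float, int]:
    """Return (ttft_ms, itl_ms, n_chunks) for an iterable of text chunks."""
    start = clock()
    first = None  # timestamp of the first chunk, once seen
    n_chunks = 0
    for _ in stream:
        if first is None:
            first = clock()
        n_chunks += 1
    end = clock()
    if first is None or n_chunks < 2:
        raise RuntimeError("need at least 2 chunks to measure ITL")
    ttft_ms = (first - start) * 1000
    itl_ms = (end - first) / (n_chunks - 1) * 1000
    return ttft_ms, itl_ms, n_chunks


def fake_stream(n: int, ttft_s: float, itl_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: wait, then yield n chunks."""
    time.sleep(ttft_s)
    yield "tok"
    for _ in range(n - 1):
        time.sleep(itl_s)
        yield "tok"


ttft_ms, itl_ms, n = measure_stream(fake_stream(5, 0.02, 0.01))
print(f"TTFT {ttft_ms:.0f} ms, ITL {itl_ms:.1f} ms/tok over {n} chunks")
```

Passing the clock as a parameter is a design choice for testing: a scripted clock makes the arithmetic checkable without real delays.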

4. The eval set.

Ten fixed gardening questions with reference answers, scored by claude-haiku-4-5 on a 1–5 scale. Answers scoring 4 or 5 count as passing. The questions cover companion planting, soil amendments, watering, pests, and plant nutrition: topics representative of what I actually ask the chatbot.

The questions stay fixed across every post in the series. If an optimization changes the eval score, we know it affected answer quality.

We haven’t validated that claude-haiku’s scores align with human judgment. LLM judges tend to reward confident, verbose answers and can be inconsistent across runs. The 4-or-5 threshold was chosen pragmatically, not calibrated against human labels. For this series, “90% pass rate” means “nothing obviously broke” rather than “objectively good.”

Results

Running all 10 questions after a warm start:

Metric         p50    p95    p99
TTFT (ms)      214    220    221
ITL (ms/tok)   70.8   71.7   71.7

Cold TTFT (first inference, no prior warmup): 382 ms

Eval pass rate: 90% (9/10 questions scored ≥4)

Every optimization post will repeat this table so the before/after is unambiguous.
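For reference, here is one way such percentiles can be computed from per-question samples. This is a sketch that assumes linear interpolation between the closest ranks; the post does not say which percentile method the benchmark script uses, and the sample values are synthetic, not the measured runs.

```python
def percentile(samples: list[float], p: float) -> float:
    """Percentile (0-100) using linear interpolation between closest ranks."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


# Synthetic latencies in ms, for illustration only.
samples_ms = [100.0 + 10 * i for i in range(10)]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(samples_ms, p):.1f} ms")
```

With only 10 samples and this method, p95 and p99 both fall between the two slowest runs, so they say more about the worst one or two questions than about a true tail.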

Lessons Learned

Cold TTFT is nearly double warm TTFT. The first inference after loading the model took 382 ms, 78% slower than the 214 ms warm baseline. That gap comes from Metal command buffer initialization and KV cache allocation (the KV cache is the memory that holds attention keys and values for earlier tokens), which happen once and are amortized over subsequent calls. In practice: the first question a user asks takes close to 400 ms before any text appears; every question after that takes about 214 ms.
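The cold versus warm comparison is simple arithmetic on the two numbers above, spelled out here so the same calculation can be repeated in later posts:

```python
cold_ms = 382.0  # cold TTFT from the results above
warm_ms = 214.0  # warm TTFT p50 from the results above

ratio = cold_ms / warm_ms
slowdown_pct = (ratio - 1) * 100
one_time_overhead_ms = cold_ms - warm_ms

print(f"cold/warm ratio: {ratio:.2f}x")
print(f"slowdown: {slowdown_pct:.1f}%")
print(f"one-time overhead: {one_time_overhead_ms:.0f} ms")
```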

Warm TTFT variance is tight. The spread from p50 to p99 is only 7 ms. Prefill time on Apple Silicon is consistent run to run. The jitter a user might attribute to “the model being slow sometimes” is not coming from here.

The model already decodes faster than you can read, but chunky streaming is still perceptible. At 70.8 ms/tok, Qwen 14B produces roughly 14 tokens per second, which exceeds a comfortable reading speed of 5–7 tok/s. But inter-token latency is not perfectly uniform (attention cost grows with sequence length, so later tokens take slightly longer than earlier ones), and watching text appear token-by-token is perceptually different from reading. A token every 70 ms is individually visible even when the average throughput is adequate. TTFT is the dominant lever; reducing the 214 ms wait before the first word matters more than smoothing ITL variance.
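The throughput figure follows directly from the inter-token latency. A small sketch of the conversion, using the reading-speed range quoted above as the comparison point:

```python
def tokens_per_second(itl_ms: float) -> float:
    """Convert inter-token latency in ms/token to tokens per second."""
    return 1000.0 / itl_ms


itl_p50_ms = 70.8                # ITL p50 from the results above
reading_speed_tok_s = (5.0, 7.0)  # comfortable reading range from the text

tps = tokens_per_second(itl_p50_ms)
headroom = tps / reading_speed_tok_s[1]

print(f"decode rate: {tps:.1f} tok/s")
print(f"headroom over the fastest reading speed: {headroom:.1f}x")
```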

The failing eval question is a clear miss, not a borderline one. Question 4 (“When should I start hardening off tomato seedlings?”) scored 2/5. The model answered “about one week before you plan to transplant them outside” and missed both the frost date framing and the 1–2 week window the reference specifies. The other nine questions: seven scored 4, two scored 5. There are no borderline passes at risk of failing if quality degrades slightly; the one failure is an identifiable knowledge gap, not noise.

That makes the 90% threshold a reasonable guard for this series.
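The guard itself can be sketched in a few lines. pass_rate and meets_baseline are hypothetical helper names, not taken from the repo. The score list reproduces the distribution reported above (one 2, seven 4s, two 5s), but apart from question 4 the order is illustrative.

```python
PASS_THRESHOLD = 4    # scores of 4 or 5 count as passing
BASELINE_RATE = 0.9   # 9 of 10 questions in the baseline run


def pass_rate(scores: list[int], threshold: int = PASS_THRESHOLD) -> float:
    """Fraction of judge scores at or above the passing threshold."""
    if not scores:
        raise ValueError("no scores")
    return sum(score >= threshold for score in scores) / len(scores)


def meets_baseline(scores: list[int], baseline: float = BASELINE_RATE) -> bool:
    """True if an optimized run has not dropped below the baseline pass rate."""
    return pass_rate(scores) >= baseline


# Question 4 scored 2; the order of the remaining scores is illustrative.
baseline_scores = [4, 4, 4, 2, 4, 4, 4, 4, 5, 5]

print(f"pass rate: {pass_rate(baseline_scores):.0%}")
print(f"meets baseline: {meets_baseline(baseline_scores)}")
```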

When to Use This (or Not)

  • Good for: macOS with Apple Silicon, privacy-first local inference, 16+ GB RAM, Python-based experimentation
  • Not good for: NVIDIA GPU owners (llama.cpp on CUDA has a different install path and hardware-specific tuning); machines with less than 16 GB RAM (the ~9 GB model file plus system overhead leaves little headroom)

If you just want a working local chatbot without measuring it, Ollama wraps llama.cpp and is simpler to set up.

Code

The full code is in the chatbot-macos-optimization repo.

References / Further Reading