MilkCrunch

Rent the Builder, Own the Tool

· by Michael Doornbos · 3744 words

The frontier model doesn’t do my work. It builds the tool that does my work. That’s the move.

Two years ago, when I tried running an open-weight model locally, it was a parlor trick. I could do it, but the output reminded me why I was paying for the hosted version.

That isn’t true anymore.

Open-weight models have closed most of the gap on routine work: code completion, summarization, classification, drafting, translation, and basic reasoning. The frontier labs still pull ahead on the hard stuff, the agentic stuff, the very-long-context stuff. For the work I find myself using AI for most days, a model running on a laptop is fine.

The piece I wrote earlier today on the hosted AI pricing problem covered the cost side. This is the other half. If hosted plans are being sold at a loss and the subsidies are ending, the question isn’t only about what to do when prices rise. It’s what role the hosted models should play in the first place.

I keep coming back to one shift. The frontier model isn’t what my work is about. It’s the thing I hired once to build the tools that do my work. Rent the builder. Own the tool. Run your work on the tool, not the frontier.

This piece is how I’m thinking through that shift. Some of what I’m working on might be useful for you, too.

What changed

The models got smaller and smarter. Llama, Qwen, Mistral, DeepSeek, Gemma. The 7-to-14 billion parameter class went from “interesting research” to “actually useful for daily work” in about eighteen months. The 30B class is doing things that needed GPT-4 not long ago.

The licenses got friendlier. Not all of them, but enough to run a capable model commercially without lawyering up first. The open-weight movement is the open-source story that actually worked this decade.

The tooling got boring in a good way. Ollama is the one I keep coming back to. It’s a thin, well-behaved wrapper around llama.cpp that does one job. Pull a model, run it, and expose a local API. Type ollama run llama3.2, and the inference server is up. The Unix philosophy keeps winning.
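Here is a minimal sketch of what "expose a local API" looks like from code, assuming Ollama is running on its default port (11434) and that llama3.2 has already been pulled. The helper names build_request and generate are mine, not part of Ollama.

```python
import json
import urllib.request

# Ollama's default local address; adjust if you changed the port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the model's text reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the server up, generate("llama3.2", "Explain unified memory in two sentences.") should return plain text. Nothing here leaves localhost.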

The hardware caught up. Apple’s unified memory architecture turned consumer Macs into reasonable inference machines almost by accident. A second-hand workstation with a single 24GB GPU can run most useful models. I don’t need a data center anymore. I need a desk.

What “good enough” actually means

I used to write off local models because they couldn’t match the frontier on everything. They don’t need to.

Once I started paying attention to what I was actually using AI for, most of it turned out to be routine work.

Code completion. Local models do this well. The autocomplete pattern is exactly the kind of solved-problem space where models shine. It doesn’t need genius, just context, low latency, and reasonable suggestions. A local model never rate-limits me and never sends my private repo to a third party.

Summarization and drafts. Meeting notes, email drafts, document summaries. A 7B model does this fine, and the output gets a human pass anyway.

Classification and extraction. Pull entities out of text, tag a document, route a ticket. Smaller models excel here. Often, a fine-tuned local model beats a generic hosted one on the specific task.

Translation and reformatting. Mature problems. Local models handle them.

Quick lookups and explanations. The “rubber duck” usage that’s become so common. Local works. I don’t need the smartest model to be a thinking partner for the next ten minutes.

The hosted frontier still earns its keep on the work that actually needs it. Hard architectural reasoning. Refactors that span thousands of files. Novel-domain code. Anything where I need the smartest available model, and I’m willing to pay for it.

Where local still loses

I’m not going to pretend local is ready for everything. It isn’t.

Frontier reasoning. The biggest hosted models still pull ahead on hard problems. Math, complex code generation, multi-step planning. I feel it whenever I try to swap them out for everything.

Very long context. Hosted models routinely handle hundreds of thousands of tokens. Local models technically can, but the memory bill and the speed penalty are brutal at that scale.

Multimodal. Vision, audio, video. Open-weight models exist here and are improving, but the gap is wider than in text.

Bleeding-edge agentic tool use. The orchestration stacks that chain dozens of tool calls are tuned for hosted models. They work locally, but with friction.

Specialty work. Anything where the frontier got an extra year of training on the exact problem you’re solving.

I think of it as a portfolio. Local for the routine. Hosted for the hard. I try to notice which one I’m reaching for, and why.

What it costs to play

Hardware is the other side of the math. The floor is lower than I thought it would be going in, and the ceiling is reachable without selling a kidney.

What you probably already have. Any Apple Silicon Mac with 16GB of unified memory runs a quantized 7-to-8B model at usable speed. 32GB gets you to the 13-14B range. Apple’s machines punch above their weight because of the unified memory trick. RAM is VRAM, and that single design choice put a capable inference machine in front of millions of people who didn’t know they had one.

On the PC side, a discrete NVIDIA card with 8GB of VRAM (a five-year-old RTX 3070, for example) still runs 7B-class models. A 12GB card pushes into the 13B range. If you have a gaming machine made in the last five years, you can run a useful model on it tonight.

Modest spend, big jump ($800 to $1,500). A used RTX 3090 with 24GB of VRAM is the obvious play. The used market has settled in the $700-to-$900 range, and 24GB is enough to run 30B models comfortably and 70B models with aggressive quantization. The other path at this tier is a Mac Mini M4 Pro with 24GB or 48GB of unified memory, which is quieter, draws far less power, and remains useful for everything else a Mac can do.

Serious spend ($2,500 to $5,000). This is where the 70B-class models become daily usable. A Mac Studio with 96GB or 128GB of unified memory runs them at speeds that don’t make you wait. A workstation with a 48GB NVIDIA card (e.g., RTX A6000) does the same job with more raw compute and more heat. A new RTX 5090 with 32GB sits in this tier when you can actually buy one at list price, which is not always the case.

Beyond that. Multi-GPU rigs, datacenter cards, and used H100-class hardware exist, but they stop being personal infrastructure and become a team purchase. The math at that tier is the same as any small company does when buying versus renting.

NVIDIA consumer cards stay in tight supply, and prices above MSRP are routine on anything new. The used market is the value path, and has been for years now. Apple Silicon is the quiet winner for individual buyers because its supply chain isn’t competing with hyperscaler orders for the same parts.
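The sizing rules of thumb in this section come down to simple arithmetic: weights take roughly parameters times bits per weight divided by eight, plus some headroom. The sketch below assumes 4-bit quantization and a 20% allowance for the KV cache and runtime buffers. That 20% is my own rough assumption, not a measured figure, and real usage varies with context length.

```python
def estimated_memory_gb(params_billions: float, bits_per_weight: float,
                        overhead: float = 0.2) -> float:
    """Rough memory needed to load a quantized model, in GB.

    weights = parameters * bits / 8; `overhead` is an assumed allowance
    for the KV cache and runtime buffers, not a measured figure.
    """
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * (1 + overhead)


def fits(params_billions: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the estimate fits in the given RAM or VRAM budget."""
    return estimated_memory_gb(params_billions, bits_per_weight) <= memory_gb


for size in (8, 14, 30, 70):
    print(f"{size}B at 4-bit: ~{estimated_memory_gb(size, 4):.1f} GB")
```

By this estimate a 30B model at 4-bit fits in 24GB and a 70B model does not, which is why the 70B class needs either heavier quantization or the bigger-memory tier. On a Mac, remember the operating system shares the same pool.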

The economic case

A workstation with a decent GPU runs $2,000 to $5,000. A used M-series Mac runs less. Amortize that over three years, and you get a fixed monthly cost that doesn’t change when your provider doubles its API rates.

A hosted plan at $200/month is $7,200 over three years. Two heavy users at that rate spend $400 a month, which crosses the cost of a $2,000 machine in five months and a $5,000 one in just over a year. A team of ten crosses it in one to three months. And that’s at today’s price, which is almost certainly not tomorrow’s price. The hardware bill doesn’t change when the subsidy ends. The hosted bill does.

That math doesn’t favor local for everyone. It favors local for teams running constant, routine inference where the frontier isn’t required. Which, in my experience, is most of the work.
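The break-even arithmetic is easy to rerun with your own numbers. This sketch assumes every user pays for their own $200/month plan, that one machine serves all of them, and it ignores electricity and maintenance.

```python
import math


def breakeven_months(hardware_cost: float, users: int,
                     plan_per_user_month: float = 200.0) -> int:
    """Months until cumulative hosted spend reaches the hardware cost."""
    monthly_hosted = users * plan_per_user_month
    return math.ceil(hardware_cost / monthly_hosted)


# One $200/month plan over three years.
three_year_hosted = 200 * 36

for users in (1, 2, 10):
    low = breakeven_months(2_000, users)
    high = breakeven_months(5_000, users)
    print(f"{users} user(s): break-even in {low} to {high} months")
```

Swap in your real plan price and hardware quote. The shape of the answer matters more than the exact month.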

The other piece of the math is rate limits. Hosted plans have them, and a local setup doesn’t. The tokens I’d burn asking the same model the same thing in a loop are free at home, and that’s not nothing for anyone doing repeated extraction, classification, or batch processing.

The quiet bonus

My prompts aren’t going anywhere when the model runs locally. Not into a training set, not into a logged conversation, not through someone else’s network.

This isn’t a paranoid concern for most of what I do. It is a real one for anyone touching private code, customer data, medical records, legal documents, or anything under an NDA. The compliance story for hosted AI is slowly improving. The compliance story for “the model is on this laptop” is simple and has been for thirty years.

Where to start

If you’ve never tried this, install Ollama. Pull a model that fits in your RAM. Llama 3 or Mistral Small for general work, and Qwen Coder if you also want code completion. 8B for 16GB machines, 14B-30B if you have more. Point your editor at the local API and use it for a week.

You’ll notice three things, the same three I did. The speed beats your expectations. The quality covers more of your work than you thought. And the frontier model you were paying for ends up with a narrower role.

That’s the goal. Not replacement. Optionality.

One small example

This is the kind of thing I started using local models for. Say you have a folder of meeting notes from the past month, and you want one document pulling out the action items across all of them.

cat meetings/*.md | ollama run llama3 \
  "Pull every action item out of these notes.
   Format as: assignee, deadline, action.
   Skip anything already marked done."

Four-line shell command. Runs in maybe twenty seconds on a recent Mac, doesn’t touch the internet, and costs nothing in API tokens. You get back a table you would have spent thirty minutes building by hand.

The pattern works for anything text-shaped. Tag a directory of receipts. Classify a year of support tickets by urgency. Draft first-pass replies to a batch of emails. Pull keywords out of every blog post you’ve written. It’s cat, grep, and awk for text that needs a little bit of understanding rather than just pattern matching.

The pipe is the whole trick. Ollama exposes the model as something that takes stdin and writes stdout, which makes it a shell tool rather than a service. That fits the way I already work.
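When the job outgrows a one-liner, the same idea carries over to a script. This is a sketch of the "classify support tickets by urgency" case, assuming one ticket per .txt file and the same ollama run command used above. The function names and the three labels are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import Callable

LABELS = ("low", "medium", "high")


def run_ollama(prompt: str, model: str = "llama3") -> str:
    """Call `ollama run MODEL PROMPT` and return what it prints to stdout."""
    result = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def classify_urgency(text: str, ask: Callable[[str], str] = run_ollama) -> str:
    """Ask the model for one label; fall back to 'unknown' on anything else."""
    prompt = (
        "Classify the urgency of this support ticket as low, medium, or high. "
        "Reply with one word only.\n\n" + text
    )
    answer = ask(prompt).strip().lower().rstrip(".")
    return answer if answer in LABELS else "unknown"


def classify_directory(folder: str,
                       ask: Callable[[str], str] = run_ollama) -> dict[str, str]:
    """Map each .txt file name in `folder` to its urgency label."""
    return {
        path.name: classify_urgency(path.read_text(encoding="utf-8"), ask)
        for path in sorted(Path(folder).glob("*.txt"))
    }
```

Passing the model call in as ask keeps the parsing logic testable without a model running. Small models sometimes answer with a sentence instead of a word, so the 'unknown' bucket is worth reviewing by hand.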

Your own specialty model

The point with local AI isn’t running someone else’s model. It’s making one that’s better than theirs on the work I actually do.

A specialty model is an open-weight base model that’s been further trained on a specific domain. It takes far less compute to make than the base model did, runs on hardware you own, and can outperform a much larger generic model on the narrow problem you trained it for.

This used to be expensive enough that only research labs bothered. That changed when parameter-efficient fine-tuning techniques like LoRA and QLoRA arrived. They let you adjust the model’s behavior by training a small set of additional weights rather than the whole network. A 7B model that took weeks of GPU time to train from scratch can be fine-tuned for a specific task in hours on a single 24GB card.
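To see why training "a small set of additional weights" is so much cheaper, count them. LoRA leaves a weight matrix frozen and learns its update as the product of two thin matrices. The 4096 x 4096 size below is an illustrative layer shape I picked, not a figure from any particular model card.

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights LoRA adds to one d_out x d_in matrix.

    LoRA learns the update as B @ A, with B of shape (d_out, rank)
    and A of shape (rank, d_in), leaving the original matrix frozen.
    """
    return rank * (d_out + d_in)


# Illustrative numbers: a 4096 x 4096 projection and rank 16.
full = 4096 * 4096
added = lora_params(4096, 4096, rank=16)
print(f"full matrix: {full:,} weights")
print(f"LoRA adapter: {added:,} weights ({added / full:.2%} of full)")
```

At rank 16 the adapter is under one percent of the matrix it modifies. QLoRA applies the same trick on top of a 4-bit quantized base, which is what brings the memory bill down to a single consumer card.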

A few examples already in the wild.

Code-specific models. Qwen Coder and Code Llama are base models tuned for programming. They beat their generic siblings on code tasks. A team can take one of these and further tune it on its own codebase so the autocomplete actually knows the project’s conventions.

Medical and scientific models. PubMedBERT and the BioGPT line are trained on biomedical literature. Pharma and research teams build their own internal versions for drug discovery, clinical decision support, and literature review.

Legal models. Law firms have started fine-tuning open-weight bases on case law and their own document corpora. The output is more useful for that firm than the best hosted model, partly because the hosted model has never read the firm’s internal templates.

Customer support agents. A small model fine-tuned on your support ticket history and product docs answers questions in your company’s voice and gets the product details right. The hosted equivalent guesses, plausibly and sometimes wrongly.

Building one is more concrete than it sounds.

Collect your domain data. Code, tickets, documents, conversations, whatever the model needs to learn the patterns of your world. The quality of this data is the whole game.

Pick a base model. Llama, Qwen, Mistral, and the others all release base versions for exactly this purpose. Pick one in a size that fits your hardware.

Use a fine-tuning framework. Unsloth, axolotl, and Apple’s MLX-LM on Mac hardware are the common starting points. Each abstracts away the parts you don’t want to handwrite.

Evaluate honestly. Build a test set from real examples of the work you want the model to do. Compare the fine-tuned model against the base model, its own previous version, and the hosted model you’d otherwise use. A specialty model that’s worse than the base on its own domain is a failure, and you only find that out with measurement.

Deploy it like any other local model. Ollama can serve custom models, and so can llama.cpp directly. The same inference stack runs your tune.
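For the evaluation step in the list above, the harness can start very small. This sketch scores any number of candidate models on the same held-out pairs. The word-overlap F1 is a crude stand-in I chose for illustration; the right metric depends on the task, and for free-form text a human read is still needed.

```python
from typing import Callable, Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Crude word-overlap F1; a stand-in for whatever metric fits your task."""
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    common = len(pred & ref)
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_score(model: Callable[[str], str],
               test_set: Sequence[tuple[str, str]]) -> float:
    """Average token_f1 of a model over (prompt, reference) pairs."""
    return sum(token_f1(model(p), ref) for p, ref in test_set) / len(test_set)


def compare(models: dict[str, Callable[[str], str]],
            test_set: Sequence[tuple[str, str]]) -> dict[str, float]:
    """Score every candidate on the same held-out test set."""
    return {name: mean_score(fn, test_set) for name, fn in models.items()}
```

Each model is just a function from prompt to text, so the base model, the fine-tune, and a hosted model can all be dropped into the same dictionary and compared on equal terms.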

I think most people miss this part. A fine-tuned model is yours. It encodes what your team has figured out, in a form nobody else has. Send that same data to a hosted provider and, depending on the plan and its terms, it can end up in their training pipeline. The questions you ask the cloud model are the same questions you’d train a specialty model on. One makes you faster. The other makes them smarter.

The catch is that this requires real work. Data curation is hard. Evaluation is harder. Most teams that try fine-tuning produce something mediocre before they produce something good. The reward, when you get there, is a model that does the narrow work better than anything you could rent.

A worked example

Pick a use case. Say you want a small model that drafts first-pass replies to customer support tickets in your team’s actual voice, not a generic AI tone. I’ll walk through that one.

The data you need is your ticket history, specifically pairs of incoming questions and the reply your team actually sent. A year of those is usually plenty. Export them as JSONL, one line per pair:

{"prompt": "I can't log in to my account...", "completion": "Hi, sorry about the trouble. Have you tried..."}

That’s the whole dataset format for most fine-tuning frameworks. A few thousand of these will start to teach a model your team’s voice, the structure your replies usually take, and a lot of common product knowledge along the way.
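Producing that file is a few lines once the tickets are exported. This sketch assumes you already have the history as (question, reply) pairs from your help desk. How you get them out depends on the tool, so that part is left out.

```python
import json
from typing import Iterable, TextIO


def write_pairs(pairs: Iterable[tuple[str, str]], out: TextIO) -> int:
    """Write (question, reply) pairs as JSONL and return how many were kept."""
    kept = 0
    for question, reply in pairs:
        question, reply = question.strip(), reply.strip()
        if not question or not reply:
            continue  # skip tickets with no usable question or reply
        record = {"prompt": question, "completion": reply}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        kept += 1
    return kept
```

json.dumps escapes any newlines inside a reply, so each record stays on one line. Open the output with open("tickets.jsonl", "w", encoding="utf-8") and pass the file object as out.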

Pick a base model in a size you can run. Llama 3.1 8B is a reasonable default, since a 4-bit quantized copy fits on a 16GB Mac and a wide range of GPUs.

Then run the fine-tune. Unsloth’s notebook, on a single 24GB GPU, completes training in 2 to 6 hours for a dataset like this. The default settings work for a first attempt:

from unsloth import FastLanguageModel

# Load the base model in 4-bit so it fits on a single 24GB GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Attach LoRA adapters; only these small added weights get trained.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
)
# train with your JSONL

That’s roughly the whole thing, minus the trainer loop, which is another twenty lines of standard code Unsloth gives you as a starting template.

Test before you ship it. Take 20 real tickets the model has never seen, run them through both the fine-tuned and base models, and read the outputs side by side. Sometimes the fine-tune is sharper. Sometimes it overfits and parrots specific old replies. You only find that out by looking.
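The parroting failure is one you can partly automate. This sketch uses the standard library's difflib to flag outputs that are near-copies of a reply the model saw in training. The 0.9 threshold is an arbitrary starting point of mine, and the function expects a non-empty list of training replies.

```python
from difflib import SequenceMatcher
from typing import Sequence


def closest_training_reply(output: str,
                           training_replies: Sequence[str]) -> tuple[float, str]:
    """Return (similarity, reply) for the training reply most like `output`."""
    return max(
        (SequenceMatcher(None, output, reply).ratio(), reply)
        for reply in training_replies
    )


def looks_parroted(output: str, training_replies: Sequence[str],
                   threshold: float = 0.9) -> bool:
    """Flag outputs that are near-copies of a reply seen in training.

    The 0.9 threshold is an arbitrary starting point, not a tuned value.
    """
    similarity, _ = closest_training_reply(output, training_replies)
    return similarity >= threshold
```

It compares every output against every training reply, which is fine for 20 test tickets and a few thousand replies. It does not replace reading the outputs. It only tells you which ones to read first.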

When the output is good enough, export the model in GGUF format, drop it into Ollama, and you’re serving it the same way you serve any other model:

ollama create support-replies -f Modelfile
ollama run support-replies "I can't log in to my account..."
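The Modelfile those commands read can be tiny. As a sketch, this helper renders one with a FROM line pointing at the exported GGUF plus an optional temperature and system prompt. The file name support-replies.gguf and the prompt text are placeholders, not anything Unsloth or Ollama produces by default.

```python
from pathlib import Path
from typing import Optional


def render_modelfile(gguf_path: str, system_prompt: str = "",
                     temperature: Optional[float] = None) -> str:
    """Render a minimal Ollama Modelfile for a local GGUF file."""
    lines = [f"FROM {gguf_path}"]
    if temperature is not None:
        lines.append(f"PARAMETER temperature {temperature}")
    if system_prompt:
        # Triple quotes let the system prompt span several lines.
        lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


def write_modelfile(directory: str, **kwargs) -> Path:
    """Write the rendered Modelfile into `directory` and return its path."""
    path = Path(directory) / "Modelfile"
    path.write_text(render_modelfile(**kwargs), encoding="utf-8")
    return path
```

For example, write_modelfile(".", gguf_path="./support-replies.gguf", temperature=0.3) leaves a Modelfile in the current directory, ready for ollama create.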

The first time you do this, plan to spend a day on it. The dataset is half the work, the training is the easy part, and the evaluation is where you find out whether the result is actually any good. The second time, an afternoon is enough.

Bootstrap with the frontier

These days, I think of the frontier model as a contractor rather than a utility. Pay it once to build the thing, then run the thing yourself.

The most obvious version of this is to use a hosted model to help write the local model’s fine-tuning code, eval runner, and Ollama Modelfile. Thirty minutes of frontier-model help to design a workflow, which I then run for years on my own hardware. Small expense up front, recurring win after.

It goes further than tooling. The most useful frontier-to-local pattern right now is distillation, where you use a smart hosted model to generate the training data for a smaller local one. Run a frontier model on a thousand of your real customer questions and capture its responses. Now you have a dataset that captures something close to the frontier’s reasoning on your domain, in your format. Fine-tune Llama 3.1 8B on that, and you’ve got a local model that does most of what the frontier did for that task, at a fraction of the ongoing cost.
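The data-generation half of distillation is a loop. In this sketch, teacher is any function that sends a question to the hosted model and returns its answer. I've left it as a parameter because the client code depends on the provider you use.

```python
import json
from typing import Callable, Iterable, Iterator


def distill(questions: Iterable[str],
            teacher: Callable[[str], str]) -> Iterator[dict[str, str]]:
    """Yield prompt/completion records built from a teacher model's answers."""
    seen: set[str] = set()
    for question in questions:
        question = question.strip()
        if not question or question in seen:
            continue  # skip blanks and exact duplicates
        seen.add(question)
        answer = teacher(question).strip()
        if answer:
            yield {"prompt": question, "completion": answer}


def to_jsonl(records: Iterable[dict[str, str]]) -> str:
    """Serialize records in the one-object-per-line format shown earlier."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Deduplicating before calling the teacher keeps the bill down, since repeated questions are common in real ticket queues. The output drops straight into the same fine-tuning step as hand-written pairs.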

There are a few other patterns I keep coming back to.

Labeling. You have 10,000 support tickets but no time to categorize them by hand. A frontier model can label them in an afternoon for a few dollars. Then you train a local classifier on the labels, and the classifier handles every new ticket for free from then on.

Evaluation. Build a test set for your local model and have a frontier model grade the outputs. “LLM-as-judge” is now standard practice for evals at scale. The local model is the worker, and the frontier model is the QA.

Glue code. Hand the frontier model your stack and ask it to write the orchestration layer that calls your local model, handles retries, manages context windows, and falls back to a hosted model when the local one bails. This is exactly the kind of solved-problem code hosted AI is good at.

Prompt engineering. Iterate prompts against the frontier model first, because feedback is faster and the surface area is bigger. Once a prompt is working, port it to your local model and tune it to address the gap. The local model usually needs more explicit instructions, but the structure carries over.
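Of the patterns above, the glue code is the easiest to show. This is a sketch of the retry-then-fall-back logic, with both models passed in as plain functions. The retry count and delay are placeholder defaults, and treating an empty reply as a failure is my assumption about what "bails" means.

```python
import time
from typing import Callable


def ask_with_fallback(prompt: str,
                      local: Callable[[str], str],
                      hosted: Callable[[str], str],
                      retries: int = 2,
                      delay_seconds: float = 1.0) -> tuple[str, str]:
    """Try the local model first; return (source, reply).

    The local call is attempted `retries + 1` times. An exception or an
    empty reply counts as a failure. If every attempt fails, the hosted
    model is called once.
    """
    for attempt in range(retries + 1):
        try:
            reply = local(prompt).strip()
            if reply:
                return "local", reply
        except Exception:
            pass  # treat any local error as a failed attempt
        if attempt < retries:
            time.sleep(delay_seconds)
    return "hosted", hosted(prompt)
```

Returning the source alongside the reply makes it easy to log how often the hosted model actually gets called, which is the number that tells you whether the local model is pulling its weight.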

The frontier model isn’t a service I pay for every month. It’s a power tool I rent for the day, use to build something, and then put back. I bought time. The thing I built is mine.

The frontier builds the tool

I keep coming back to the personal version of this. I have a vault of notes that goes back over thirty years. Lab logs, journals, project notes, what worked and what didn’t, every Obsidian file synced, and every old text file imported. It’s the most useful corpus I have. I’d like a model to actually know it, not just see chunks of it when I paste them in.

When I start a lab log entry, I usually go into my own notes. Did I try this already? Did it work? What’s the existing note I should link to? A frontier model handles all of that, and handles it well. I’ve been feeding mine the vault for the better part of a year, and it’s been useful. The question isn’t whether the frontier can do this work. It’s whether an expensive, subsidized frontier model is the right tool for it.

A small local model trained on my own vault doesn’t need to beat the frontier. It needs to do this specific work well, on hardware I own, without metering every query. It’s not about which model is smartest. It’s about which model is right-sized.

I use a frontier model once, as a contractor. It reads a sample of the vault. It helps me design the training data: Q&A pairs grounded in my actual notes, note continuations, or wikilink suggestions, whatever shape I want the tool to take. The frontier writes the fine-tuning pipeline and the eval, then sets up the Modelfile. Thirty minutes of frontier work, not afternoons. The current models are good enough that the entire build is a single conversation.
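As one possible shape for the wikilink-suggestion data, the links already in the vault are the labels. This sketch turns each paragraph that contains Obsidian-style [[links]] into a record where the prompt is the plain text and the completion is the list of linked notes. The record layout is my own guess at a useful format, not a standard.

```python
import re
from typing import Iterator

# Matches [[Target]], [[Target|alias]], and [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")


def wikilink_pairs(note_text: str) -> Iterator[dict[str, str]]:
    """Yield one training record per paragraph that contains wikilinks.

    The prompt is the paragraph with link brackets stripped; the
    completion is the comma-separated list of notes it linked to.
    """
    for paragraph in note_text.split("\n\n"):
        targets = [t.strip() for t in WIKILINK.findall(paragraph)]
        if not targets:
            continue
        plain = WIKILINK.sub(lambda m: m.group(1).strip(), paragraph).strip()
        yield {"prompt": plain, "completion": ", ".join(sorted(set(targets)))}
```

Run it over every note in the vault and the result is a dataset built entirely from decisions I already made by hand, with nothing leaving the machine.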

Then I train. On hardware I own, with data that never leaves my machine. The output is a small model that knows the vault.

After that, the frontier is optional. The daily work runs on the tool, not the frontier, whether that’s drafting lab log entries, suggesting wikilinks, finding precedent in old notes, or summarizing decade-long threads. The frontier model is called only when I need something only it can do; otherwise, it’s idle and unbilled.

The frontier model doesn’t do my work. It builds the tool that does my work. That’s the move.

One expense, paid once. The result is a tool right-sized for my specific world, one that gets more useful as the vault grows, and one I own.

Most of the AI conversation right now is about renting capability by the month. The more useful version is hiring a contractor to build something you keep. The frontier did the building. I kept the tool. I run my projects, my business, and my daily writing on it. Not on the frontier. The tool gets better every time I add to the vault.

The era we’re heading into

Hosted AI doesn’t go away. The frontier keeps moving, and the people willing to pay for it will keep paying. What changes is the share of routine work that runs on your own hardware.

Five years from now, I think the typical developer probably runs three or four models in parallel. Something local handles autocomplete and routine work. A mid-tier hosted model is the everyday assist. The frontier comes out for the hard problems. A specialty model might handle their specific domain.

The teams that figure this out first will spend less on AI and lose less sleep over their providers’ pricing decisions. The teams that don’t will keep paying loss-leader rates until the loss-leader phase ends.

Local AI got good. I noticed. You might want to.


What’s the first thing you’d move to a local model if you set one up tomorrow? I’m at mike@imapenguin.com | @mrdoornbos


#Opinion #Ai