Build Log: Local LLM on an Intel N100 Mini PC — $180 Gets You 3 Tokens/Second

TL;DR: We built a local AI inference server on an Intel N100 mini PC using Ollama. Results: 2-4 tokens/second on 7B models, a 16GB RAM bottleneck, and 6W idle power. It's viable for batch processing and light automation, but not for real-time chat. Total build cost: $180.

The Problem: API Costs at Scale

Running AI agents in production means making thousands of LLM API calls per month. At $0.14-$3.00 per million input tokens, even a modest automation pipeline can hit $50-100/month in API costs. For tasks that don't need frontier models — embeddings, classification, simple generation — paying per-token feels wasteful when you have hardware sitting idle.

We wanted to know: can a $180 Intel N100 mini PC replace cloud API calls for non-critical AI workloads? The answer is nuanced — but the data tells a clear story.

What We Built

An always-on local AI inference server running Ollama on a CachyOS Linux system. The hardware is a generic N100 mini PC with 16GB DDR4 RAM and a 512GB NVMe SSD. Total parts cost: $180.

| Component | Specification |
| --- | --- |
| CPU | Intel N100 (4 E-cores, up to 3.4 GHz) |
| RAM | 16GB DDR4-3200 (single channel) |
| Storage | 512GB NVMe SSD |
| Power | 6W idle, ~15W under load |
| OS | CachyOS Linux (Arch-based) |
| Software | Ollama + OpenClaw (local model fallback) |
| Cost | $180 total |

The Build Process

Setting up Ollama on the N100 was straightforward:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull models for testing
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:7b
ollama pull nomic-embed-text

# Verify the API is running (lists the locally installed models)
curl http://localhost:11434/api/tags

# Integrate with OpenClaw as a local provider:
# add an entry under the "ollama" provider in models.json
```

The real work was testing which models actually ran within the N100's constraints. 16GB of RAM disappears quickly: the OS uses ~2GB, Ollama uses another ~500MB, and anything much larger than the 7-8B class at Q4 quantization would cause swapping.
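
The memory budget above can be written out as simple arithmetic. This is a minimal sketch using only figures quoted in this article (~2GB for the OS, ~500MB for Ollama, ~5GB for a 7B model at Q4, 8GB at Q8); the function name is ours, and the real headroom also has to cover the context (KV cache), concurrent requests, and whatever else runs on the box.

```python
# Rough RAM budget for the N100 build, using the figures quoted in the article.
TOTAL_RAM_GB = 16.0
OS_GB = 2.0        # "the OS uses ~2GB"
OLLAMA_GB = 0.5    # "Ollama uses another ~500MB"


def headroom_gb(model_weights_gb: float) -> float:
    """RAM left over once the OS, Ollama, and the model weights are resident."""
    return TOTAL_RAM_GB - OS_GB - OLLAMA_GB - model_weights_gb


if __name__ == "__main__":
    # Weight sizes are the article's approximate figures, not measured values.
    for label, weights in [("7B @ Q4", 5.0), ("7B @ Q8", 8.0)]:
        print(f"{label}: {headroom_gb(weights):.1f} GB headroom")
```

The remaining headroom shrinks further with long contexts and parallel requests, which is where swapping tends to start.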

Benchmarks: Tokens Per Second

We tested three generation models and one embedding model across two prompt types: short prompts (50 tokens) and long prompts (2000 tokens). The three generation models ran at Q4_K_M quantization.

| Model | Short Prompt | Long Prompt | Context Load Time | Usability |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B (Q4) | 2.1 tok/s | 1.8 tok/s | ~3s | Slow, borderline usable |
| Mistral 7B (Q4) | 3.6 tok/s | 3.1 tok/s | ~2s | Almost reading speed |
| Qwen 2.5 7B (Q4) | 3.2 tok/s | 2.8 tok/s | ~2.5s | Slow but functional |
| nomic-embed-text | 850 tok/s | 620 tok/s | Instant | Excellent, ideal for embeddings |

The key finding: Mistral 7B at Q4 quantization is the sweet spot for the N100. At 3.6 tok/s it is close to reading speed: too slow for interactive chat, but perfectly usable for batch processing, summarization, and classification where a 30-second wait per response is acceptable.

By contrast, running the same models on our main machine with a GPU achieves 30-60 tok/s, roughly 10-15x faster. The tradeoff is power: the GPU machine consumes 150W+, while the N100 sips 6-15W.
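
For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of how tokens per second can be derived from Ollama's own timing fields. It assumes the default local endpoint and a non-streaming call to `/api/generate`, whose response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The helper names are ours, and this is not necessarily the exact procedure behind the table above.

```python
import json
import urllib.request

# Default Ollama address, as used in the setup commands; adjust for a remote host.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's timing fields (eval_duration is in nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request and return the parsed JSON response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # Generous timeout: at 2-4 tok/s a long answer can take minutes.
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)
```

With a server running, `tokens_per_second(generate("mistral:7b", "..."))` gives the generation speed for one request. Averaging over several runs smooths out the first-call model load time.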

Real-World Integration: OpenClaw Provider Fallback

The most valuable use case we found was integrating Ollama as a fallback tier in the OpenClaw provider chain. When the primary DeepSeek API is rate-limited or unavailable, the N100's local Ollama instance handles embedding and simple generation tasks:

OpenClaw models.json configuration:

```json
{
  "ollama": {
    "baseUrl": "http://n100.local:11434",
    "api": "ollama",
    "models": [{ "id": "mistral:7b", "name": "Mistral 7B" }]
  }
}
```
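
The fallback behaviour described above can be illustrated with a small, generic sketch. This is not OpenClaw's implementation, which we have not reproduced here. It only shows the idea of trying providers in order and falling through to the local model when the primary one raises. The `Provider` type and `generate_with_fallback` name are hypothetical.

```python
from typing import Callable, Sequence

# A provider is any callable that takes a prompt and returns text, or raises on failure.
Provider = Callable[[str], str]


def generate_with_fallback(
    providers: Sequence[tuple[str, Provider]], prompt: str
) -> tuple[str, str]:
    """Try each (name, provider) in order; return (name, text) from the first that succeeds."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(prompt)
        except Exception as exc:  # rate limit, timeout, connection error, ...
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In a chain like `[("deepseek", call_api), ("ollama", call_local)]`, the local model is only reached when the API call fails, so the slow path is paid only during outages or rate limiting.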

For nomic-embed-text, the N100 delivers 620-850 tok/s — more than adequate for vector embedding pipelines. We route all embedding operations to the local Ollama instance, saving ~$8/month in API costs while adding negligible latency.
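
As a rough illustration of routing embeddings to the local instance, the sketch below calls Ollama's `/api/embed` endpoint (available in recent Ollama versions) and compares two vectors. The host name is taken from the configuration above, and the function names are ours. Treat it as a starting point rather than the pipeline we actually run.

```python
import json
import math
import urllib.request

# Host name from the models.json example; replace with your own machine's address.
OLLAMA_EMBED_URL = "http://n100.local:11434/api/embed"


def embed(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Request one embedding vector per input text from the local Ollama server."""
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        OLLAMA_EMBED_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Sending texts as a list lets one request cover a whole batch of documents, which keeps per-request overhead small on a slow CPU.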

Cost Analysis

| Metric | N100 (Local) | DeepSeek V4 Flash (API) | Savings / Difference |
| --- | --- | --- | --- |
| Cost per 1M tokens | $0.00 (electricity) | $0.14 | ~100% |
| Monthly operating cost | $1.20 (electricity) | ~$15-25 | ~90% |
| Upfront hardware | $180 (one-time) | $0 | Recouped in 7-12 months |
| Speed (7B model) | 2-4 tok/s | 40-80 tok/s | API is 20x faster |
| Model quality | Older open models | Latest frontier models | API is significantly better |
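
The payback period in the table follows from simple arithmetic, sketched here with the article's own figures ($180 hardware, $1.20/month electricity, $15-25/month API spend, $0.12/kWh). The function names are ours. With these inputs the result is roughly 7.6 to 13 months, in line with the 7-12 months quoted, and the $1.20 electricity figure corresponds to an average draw near the loaded end of the 6-15W range.

```python
HARDWARE_COST = 180.00   # one-time, USD
LOCAL_MONTHLY = 1.20     # electricity, USD per month


def monthly_electricity_cost(avg_watts: float, price_per_kwh: float = 0.12) -> float:
    """Cost of running 24/7 for a 30-day month at a constant average draw."""
    return avg_watts / 1000 * 24 * 30 * price_per_kwh


def breakeven_months(api_monthly: float) -> float:
    """Months until the hardware cost is covered by avoided API spend."""
    savings = api_monthly - LOCAL_MONTHLY
    if savings <= 0:
        raise ValueError("local operation is not cheaper at this API spend")
    return HARDWARE_COST / savings


if __name__ == "__main__":
    for spend in (15.0, 25.0):
        print(f"${spend:.0f}/month API spend: {breakeven_months(spend):.1f} months to break even")
    for watts in (6, 15):
        print(f"{watts}W average: ${monthly_electricity_cost(watts):.2f}/month")
```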

Lessons Learned

What Worked Well

  • Embedding offloading: Moving all vector embeddings to the N100 saved significant API costs with zero quality difference. nomic-embed-text runs fast enough to be indistinguishable from an API call.
  • Provider fallback: Having a local model as a circuit-breaker fallback in OpenClaw's provider chain meant zero downtime during API outages — even if responses were slower.
  • Power efficiency: The N100 runs 24/7 consuming less power than a light bulb. At $0.12/kWh, the annual electricity cost is about $15.
  • Batch processing: Queuing non-urgent generation tasks for the local model worked well. We send overnight batch jobs to the N100 and collect results in the morning.
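
The overnight batch pattern from the list above can be as simple as a loop that writes one JSON line per prompt. This is a hedged sketch, not our production queue: `run_batch` and its `generate` parameter are hypothetical names, and `generate` stands for any function that sends a prompt to the local model and returns text.

```python
import json
from typing import Callable, Iterable


def run_batch(prompts: Iterable[str], generate: Callable[[str], str], out_path: str) -> int:
    """Run prompts one at a time, appending each result as a JSON line; returns the count written."""
    written = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for prompt in prompts:
            try:
                record = {"prompt": prompt, "response": generate(prompt)}
            except Exception as exc:
                # Record the failure and keep going so one bad prompt doesn't stop the run.
                record = {"prompt": prompt, "error": str(exc)}
            out.write(json.dumps(record) + "\n")
            out.flush()  # results survive if the job is interrupted overnight
            written += 1
    return written
```

Processing prompts sequentially suits the N100, since parallel requests compete for the same memory bandwidth and RAM.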

What We'd Do Differently

  • 32GB RAM would have been worth the upgrade. The 16GB limit is the single biggest constraint. Models swap to disk under load, turning 3 tok/s into 0.5 tok/s. Some N100 boards unofficially support 32GB — worth testing.
  • Dual-channel memory matters. The N100's single-channel memory bandwidth is a hidden bottleneck. Even a 2-core CPU with dual-channel DDR4 would likely outperform the 4-core N100 for LLM inference.
  • Don't use it for interactive chat. 2-4 tok/s feels frustrating compared to the instant responses of cloud APIs. Use it for fire-and-forget tasks where latency doesn't matter.
  • Quantization is non-negotiable. Running models at Q4 or Q3 quantization is essential. At Q8, the same 7B model needs 8GB of RAM just for weights, half of the machine's 16GB, which leaves little room for the operating system, context, and concurrent requests.
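
The memory-bandwidth and quantization points above can be tied together with a back-of-the-envelope estimate. It assumes token generation is memory-bound, meaning every generated token requires reading all model weights once, and it uses the theoretical peak of a single 64-bit DDR4-3200 channel. Both are simplifications, and the function names are ours. Under those assumptions the ceiling for a ~5GB Q4 model is about 5 tok/s, which is consistent with the 3.6 tok/s we measured.

```python
def peak_bandwidth_gb_s(mt_per_s: int, channels: int = 1, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth: transfers/s x bytes per transfer x channels."""
    return mt_per_s * bus_bytes * channels / 1000


def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed if each token streams all weights from RAM once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    single = peak_bandwidth_gb_s(3200)             # N100: single-channel DDR4-3200
    dual = peak_bandwidth_gb_s(3200, channels=2)   # hypothetical dual-channel system
    for label, weights in [("7B @ Q4 (~5GB)", 5.0), ("7B @ Q8 (~8GB)", 8.0)]:
        print(f"{label}: <= {max_tokens_per_second(single, weights):.1f} tok/s single-channel, "
              f"<= {max_tokens_per_second(dual, weights):.1f} tok/s dual-channel")
```

Real throughput lands below these ceilings, but the ratio shows why both a second memory channel and a smaller quantization help more than extra cores.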

The Verdict

The Intel N100 is a viable entry point for local AI inference at a very low price point. For $180, you get an always-on server that can handle embeddings, simple generation, and batch processing — all at near-zero operating cost. It's not a replacement for cloud APIs on performance, but as a fallback tier and embedding engine, it pays for itself within a year.

For the toolbrain.net setup, the N100 handles all embedding operations and acts as a last-resort fallback in the multi-provider failover chain. The combination of cloud API billing and local inference gives the best of both worlds: speed when you need it, free operation when you don't.

For more on building cost-efficient AI infrastructure, see our provider failover build log and OpenClaw automation guide.

Frequently Asked Questions

Is an N100 good for running LLMs?

For small models (7B and under at Q4 quantization), yes. For anything larger, no. It's best suited for embeddings, simple generation, and batch processing.

What models can I run on an N100?

Mistral 7B and Qwen 2.5 7B at Q4 quantization give the best results. Llama 3.1 8B is usable but slower. nomic-embed-text works excellently for embeddings. Models much larger than 8B will be too slow for practical use.

How much RAM do I need?

16GB is the practical minimum. 32GB is strongly recommended if your N100 board supports it. The OS uses ~2GB, Ollama uses ~500MB, and a 7B model at Q4 takes ~5GB of RAM. You need headroom for context, concurrent requests, and system operations.

Is it cheaper than using an API?

After the $180 upfront cost, the N100 costs about $1.20/month in electricity. The breakeven point is 7-12 months depending on your API usage volume. For heavy embedding workloads, it pays for itself much faster.
