🤖 Local LLM Enthusiast

DGX Spark

RTX Pro 6000

RTX 3090

I run and optimize local LLMs for private, low-latency, cost-controlled AI workflows. My focus is practical deployment: inference speed, eval quality, tool calling, and fitting models into real teams.

Hand-drawn wizard penguin tuning a private local LLM inference engine with cache, evals, and tool output

What I work on

Running models with llama.cpp, vLLM, and SGLang
Fine-tuning open models such as Qwen, Gemma, Llama, and Pi.dev variants
Building evals with lm-evaluation-harness, evalplus, and custom benchmarks
Optimizing quantization, batching, KV-cache usage, and throughput for real workloads
Connecting local models to agent tools, RAG pipelines, and coding workflows

Selected work

Shipped DFlash support for Blackwell consumer GPUs in Luce.
Benchmarked Bonsai models on DGX Spark / GB10.
Run local junior-dev-style agent workflows with Qwen and other coding models.

If your team needs private local inference or local LLM workflow design, start with my Forward Deployed Engineer page.