Local LLM Enthusiast
I run and optimize local LLMs for private, low-latency, cost-controlled AI workflows. My focus is practical deployment: inference speed, eval quality, tool calling, and fitting models into real teams.
What I work on
- Running models with llama.cpp, vLLM, and SGLang
- Fine-tuning open models such as Qwen, Gemma, Llama, and Pi.dev variants
- Building evals with lm-evaluation-harness, evalplus, and custom benchmarks
- Optimizing quantization, batching, KV-cache, and throughput for real workloads
- Connecting local models to agent tools, RAG pipelines, and coding workflows
Selected work
- Shipped DFlash support for Blackwell consumer GPUs in Luce.
- Benchmarked Bonsai models on DGX Spark / GB10.
- Run local junior-dev-style agent workflows with Qwen and coding models.
If your team needs private local inference or local LLM workflow design, start with my Forward Deployed Engineer page.