馃 Local LLM Enthusiast

DGX Spark · RTX Pro 6000 · RTX 3090

I run and optimize local LLMs for private, low-latency, cost-controlled AI workflows. My focus is practical deployment: inference speed, eval quality, tool calling, and fitting models into real teams.

What I work on

  • Running models with llama.cpp, vLLM, and SGLang
  • Fine-tuning open models such as Qwen, Gemma, Llama, and Pi.dev variants
  • Building evals with lm-evaluation-harness, evalplus, and custom benchmarks
  • Optimizing quantization, batching, KV-cache, and throughput for real workloads
  • Connecting local models to agent tools, RAG pipelines, and coding workflows

Selected work

  • Shipped DFlash support for Blackwell consumer GPUs in Luce.
  • Benchmarked Bonsai models on DGX Spark / GB10.
  • Run local junior-dev-style agent workflows with Qwen and other coding models.

If your team needs private local inference or local LLM workflow design, start with my Forward Deployed Engineer page.