<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>VLLM on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/vllm/</link>
        <description>Recent content in VLLM on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Fri, 10 Apr 2026 22:54:17 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/vllm/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Gemma 4 Local Runtime Guide: From One-Command Start to Dev Integration</title>
        <link>https://www.knightli.com/en/2026/04/10/gemma4-local-runtime-options/</link>
        <pubDate>Fri, 10 Apr 2026 22:54:17 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/10/gemma4-local-runtime-options/</guid>
        <description>&lt;p&gt;If you want to run Gemma 4 locally, you can choose from four practical paths depending on your goal and hardware.&lt;/p&gt;
&lt;h2 id=&#34;1-fastest-start-ollama-recommended&#34;&gt;1) Fastest start: Ollama (recommended)
&lt;/h2&gt;&lt;p&gt;This is the lowest-friction option for quick testing, daily chat, and local API usage.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama run gemma4
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Highlights:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works on Windows, macOS, and Linux&lt;/li&gt;
&lt;li&gt;Handles hardware acceleration automatically&lt;/li&gt;
&lt;li&gt;Offers OpenAI-style local API compatibility&lt;/li&gt;
&lt;/ul&gt;
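&lt;p&gt;Once a model is running under Ollama, the OpenAI-style local API can be exercised with a plain HTTP call. A minimal sketch, assuming the &lt;code&gt;gemma4&lt;/code&gt; tag from the command above and Ollama&amp;rsquo;s default port 11434:&lt;/p&gt;

```shell
# Query Ollama's OpenAI-compatible chat endpoint (default port 11434).
# The model tag "gemma4" is assumed from the `ollama run` example above.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```

&lt;p&gt;Because the endpoint shape matches OpenAI&amp;rsquo;s, most OpenAI client libraries work by pointing their base URL at &lt;code&gt;http://localhost:11434/v1&lt;/code&gt;.&lt;/p&gt;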
&lt;h2 id=&#34;2-gui-workflow-lm-studio--unsloth-studio&#34;&gt;2) GUI workflow: LM Studio / Unsloth Studio
&lt;/h2&gt;&lt;p&gt;If you prefer a desktop UI instead of terminal commands:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LM Studio: browse and run quantized Gemma 4 variants from Hugging Face (for example 4-bit or 8-bit), with a built-in view of resource usage.&lt;/li&gt;
&lt;li&gt;Unsloth Studio: supports both inference and low-VRAM fine-tuning, and is often friendlier on 6-8 GB GPUs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;3-low-spec-and-maximum-control-llamacpp&#34;&gt;3) Low-spec and maximum control: llama.cpp
&lt;/h2&gt;&lt;p&gt;Good for older hardware, CPU-focused setups, or users who want deeper runtime control.&lt;/p&gt;
&lt;p&gt;With &lt;code&gt;.gguf&lt;/code&gt; model files and quantization, Gemma 4 can be made practical on much smaller hardware budgets.&lt;/p&gt;
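&lt;p&gt;A minimal llama.cpp invocation might look like the following. The GGUF file name is a placeholder, not an official artifact; substitute whichever quantized Gemma 4 build you downloaded:&lt;/p&gt;

```shell
# Run a one-off prompt against a local GGUF model with llama.cpp.
# "gemma4-q4_k_m.gguf" is a placeholder file name for a quantized build.
./llama-cli -m ./models/gemma4-q4_k_m.gguf \
  -p "Explain quantization in one paragraph." \
  -n 256 \
  --threads 8   # tune to your CPU core count
```

&lt;p&gt;llama.cpp also ships &lt;code&gt;llama-server&lt;/code&gt;, which wraps the same GGUF model in a local HTTP server if you want API access rather than a one-shot CLI run.&lt;/p&gt;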
&lt;h2 id=&#34;4-developer-integration-transformers--vllm&#34;&gt;4) Developer integration: Transformers / vLLM
&lt;/h2&gt;&lt;p&gt;If you need Gemma 4 inside your own application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transformers: straightforward Python integration&lt;/li&gt;
&lt;li&gt;vLLM: high-throughput inference for stronger GPU environments&lt;/li&gt;
&lt;/ul&gt;
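&lt;p&gt;As a sketch of the vLLM path: &lt;code&gt;vllm serve&lt;/code&gt; wraps a Hugging Face model in an OpenAI-compatible server. The repository ID below is a placeholder; use the actual Gemma 4 repo name:&lt;/p&gt;

```shell
# Serve a model with vLLM's OpenAI-compatible API (default port 8000).
# "google/gemma-4" is a placeholder ID, not a confirmed repository name.
vllm serve google/gemma-4 --max-model-len 8192
```

&lt;p&gt;Application code then talks to &lt;code&gt;http://localhost:8000/v1&lt;/code&gt; with standard OpenAI-style requests, the same pattern as the Ollama endpoint above.&lt;/p&gt;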
&lt;h2 id=&#34;quick-selection&#34;&gt;Quick selection
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Need&lt;/th&gt;
          &lt;th&gt;Recommended tools&lt;/th&gt;
          &lt;th&gt;Hardware requirements&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;I just want it running now&lt;/td&gt;
          &lt;td&gt;Ollama&lt;/td&gt;
          &lt;td&gt;Low&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;I want a ChatGPT-like UI&lt;/td&gt;
          &lt;td&gt;LM Studio&lt;/td&gt;
          &lt;td&gt;Medium&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;My VRAM is limited (6-8 GB)&lt;/td&gt;
          &lt;td&gt;Unsloth / llama.cpp&lt;/td&gt;
          &lt;td&gt;Low&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;I am building local AI apps&lt;/td&gt;
          &lt;td&gt;Ollama / Transformers / vLLM&lt;/td&gt;
          &lt;td&gt;Medium to high&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;I need fine-tuning&lt;/td&gt;
          &lt;td&gt;Unsloth Studio&lt;/td&gt;
          &lt;td&gt;Medium to high&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;model-size-suggestion&#34;&gt;Model size suggestion
&lt;/h2&gt;&lt;p&gt;Gemma 4 comes in multiple sizes (for example E2B, E4B, 31B).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start with quantized E2B/E4B on mainstream laptops&lt;/li&gt;
&lt;li&gt;Move to larger variants only after your baseline pipeline is stable&lt;/li&gt;
&lt;/ul&gt;
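&lt;p&gt;In Ollama terms, that progression is just a tag change. The tags below are illustrative only; verify the real names in the model library before pulling:&lt;/p&gt;

```shell
# Start small, then scale up once the baseline pipeline is stable.
# Tags are illustrative; check the registry for the actual variant names.
ollama run gemma4:e2b   # quantized small variant for mainstream laptops
ollama run gemma4:31b   # larger variant after the baseline works
```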
</description>
        </item>
        
    </channel>
</rss>
