<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>AI Tools on KnightLi Blog</title>
        <link>https://www.knightli.com/en/categories/ai-tools/</link>
        <description>Recent content in AI Tools on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sun, 05 Apr 2026 22:09:11 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/categories/ai-tools/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>LLM Quantization Explained: How to Choose FP16, Q8, Q5, Q4, or Q2</title>
        <link>https://www.knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</link>
        <pubDate>Sun, 05 Apr 2026 22:09:11 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/</guid>
        <description>&lt;p&gt;The core goal of quantization is simple: trade a small amount of precision for a smaller model size, lower VRAM usage, and faster inference.&lt;br&gt;
For local deployment, picking the right quantization format is often more important than chasing a larger parameter count.&lt;/p&gt;
&lt;h2 id=&#34;what-is-quantization&#34;&gt;What Is Quantization
&lt;/h2&gt;&lt;p&gt;Quantization means compressing model parameters from higher-precision formats (such as &lt;code&gt;FP16&lt;/code&gt;) into lower-bit formats (such as &lt;code&gt;Q8&lt;/code&gt; and &lt;code&gt;Q4&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;A simple analogy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Original model: like a high-quality photo, clear but large.&lt;/li&gt;
&lt;li&gt;Quantized model: like a compressed photo, slightly less detail but lighter and faster.&lt;/li&gt;
&lt;/ul&gt;
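The bit width translates almost directly into file size. A rough back-of-the-envelope sketch (real quantized files add per-block scales and metadata, so treat these numbers as lower bounds; the 7B size is just an illustration):

```shell
# Approximate model size in GB: parameters (in billions) x bits per weight / 8.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 16   # FP16 -> 14.0
estimate_gb 7 8    # Q8   -> 7.0
estimate_gb 7 4    # Q4   -> 3.5
```

This is why dropping from FP16 to Q4 roughly quarters the download and memory footprint.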
&lt;h2 id=&#34;common-quantization-formats&#34;&gt;Common Quantization Formats
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Quantization&lt;/th&gt;
          &lt;th&gt;Precision / Bit Width&lt;/th&gt;
          &lt;th&gt;Size&lt;/th&gt;
          &lt;th&gt;Quality Loss&lt;/th&gt;
          &lt;th&gt;Recommended Use&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;FP16&lt;/td&gt;
          &lt;td&gt;16-bit float&lt;/td&gt;
          &lt;td&gt;Largest&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;Research, evaluation, max quality&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q8_0&lt;/td&gt;
          &lt;td&gt;8-bit integer&lt;/td&gt;
          &lt;td&gt;Larger&lt;/td&gt;
          &lt;td&gt;Almost none&lt;/td&gt;
          &lt;td&gt;High-end PCs, quality + performance&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q5_K_M&lt;/td&gt;
          &lt;td&gt;5-bit mixed&lt;/td&gt;
          &lt;td&gt;Medium&lt;/td&gt;
          &lt;td&gt;Slight&lt;/td&gt;
          &lt;td&gt;Daily driver, balanced choice&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
          &lt;td&gt;4-bit mixed&lt;/td&gt;
          &lt;td&gt;Smaller&lt;/td&gt;
          &lt;td&gt;Acceptable&lt;/td&gt;
          &lt;td&gt;General default, strong value&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q3_K_M&lt;/td&gt;
          &lt;td&gt;3-bit mixed&lt;/td&gt;
          &lt;td&gt;Very small&lt;/td&gt;
          &lt;td&gt;Noticeable&lt;/td&gt;
          &lt;td&gt;Low-spec devices, getting it running first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Q2_K&lt;/td&gt;
          &lt;td&gt;2-bit mixed&lt;/td&gt;
          &lt;td&gt;Smallest&lt;/td&gt;
          &lt;td&gt;Significant&lt;/td&gt;
          &lt;td&gt;Extreme resource limits, fallback&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;quantization-naming-rules&#34;&gt;Quantization Naming Rules
&lt;/h2&gt;&lt;p&gt;Take &lt;code&gt;gemma-4:4b-q4_k_m&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemma-4:4b&lt;/code&gt;: model name and parameter scale.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;q4&lt;/code&gt;: 4-bit quantization.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;k&lt;/code&gt;: K-quants (an improved quantization method).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt;: medium level (common options also include &lt;code&gt;s&lt;/code&gt;/small and &lt;code&gt;l&lt;/code&gt;/large).&lt;/li&gt;
&lt;/ul&gt;
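Because the convention is regular, the components can be pulled apart mechanically. A small illustration using plain shell string operations (the tag is the hypothetical example from above):

```shell
tag="gemma-4:4b-q4_k_m"

quant="${tag##*-}"    # everything after the last dash -> q4_k_m
bits="${quant#q}"     # strip the leading q           -> 4_k_m
bits="${bits%%_*}"    # keep the leading number       -> 4

echo "quant=$quant bits=$bits"   # quant=q4_k_m bits=4
```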
&lt;h2 id=&#34;quick-selection-by-vram&#34;&gt;Quick Selection by VRAM
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;RAM / VRAM&lt;/th&gt;
          &lt;th&gt;Recommended Quantization&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;4 GB&lt;/td&gt;
          &lt;td&gt;Q3_K_M / Q2_K&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;8 GB&lt;/td&gt;
          &lt;td&gt;Q4_K_M&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;16 GB&lt;/td&gt;
          &lt;td&gt;Q5_K_M / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;32 GB+&lt;/td&gt;
          &lt;td&gt;FP16 / Q8_0&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
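The table reduces to a simple decision rule. A sketch in shell, with thresholds taken from the table above:

```shell
# Map available RAM/VRAM (GB) to a starting quantization.
recommend_quant() {
  if   [ "$1" -le 4 ];  then echo "Q3_K_M or Q2_K"
  elif [ "$1" -le 8 ];  then echo "Q4_K_M"
  elif [ "$1" -le 16 ]; then echo "Q5_K_M or Q8_0"
  else                       echo "FP16 or Q8_0"
  fi
}

recommend_quant 8    # Q4_K_M
recommend_quant 32   # FP16 or Q8_0
```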
&lt;p&gt;Start with a version that runs stably on your machine, then move up in precision step by step instead of jumping straight to the biggest model.&lt;/p&gt;
&lt;h2 id=&#34;practical-tips&#34;&gt;Practical Tips
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;Q4_K_M&lt;/code&gt; by default and test real tasks first.&lt;/li&gt;
&lt;li&gt;If response quality is not enough, move up to &lt;code&gt;Q5_K_M&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If VRAM or speed is the main bottleneck, move down to &lt;code&gt;Q3_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use the same test set every time you switch quantization formats.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Quality first: &lt;code&gt;FP16&lt;/code&gt; or &lt;code&gt;Q8_0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Balance first: &lt;code&gt;Q5_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;General default: &lt;code&gt;Q4_K_M&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Low-spec fallback: &lt;code&gt;Q3_K_M&lt;/code&gt; or &lt;code&gt;Q2_K&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key is not &amp;ldquo;bigger is always better&amp;rdquo; but &amp;ldquo;the most stable and usable result within your hardware limits.&amp;rdquo;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Google Gemma 4 Model Comparison: How to Choose Between 2B/4B/26B/31B</title>
        <link>https://www.knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</link>
        <pubDate>Sun, 05 Apr 2026 08:30:00 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/05/google-gemma-4-model-comparison/</guid>
        <description>&lt;p&gt;Gemma 4 focuses on &lt;code&gt;multimodality&lt;/code&gt; and &lt;code&gt;local offline inference&lt;/code&gt;, with a full range from lightweight to high-performance models. For most local deployment users, the key is not choosing the largest model, but choosing the one that best matches hardware and task needs.&lt;/p&gt;
&lt;h2 id=&#34;gemma-4-model-comparison&#34;&gt;Gemma 4 Model Comparison
&lt;/h2&gt;&lt;blockquote&gt;
&lt;p&gt;The table below is for quick model selection. Actual performance and resource usage should be validated in your own environment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Model&lt;/th&gt;
          &lt;th&gt;Parameter Size&lt;/th&gt;
          &lt;th&gt;Positioning&lt;/th&gt;
          &lt;th&gt;Key Strengths&lt;/th&gt;
          &lt;th&gt;Main Limitations&lt;/th&gt;
          &lt;th&gt;Recommended Scenarios&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 2B&lt;/td&gt;
          &lt;td&gt;2B&lt;/td&gt;
          &lt;td&gt;Ultra-lightweight&lt;/td&gt;
          &lt;td&gt;Low latency, low resource usage, lowest deployment barrier&lt;/td&gt;
          &lt;td&gt;Limited performance on complex reasoning and long task chains&lt;/td&gt;
          &lt;td&gt;Mobile, IoT, lightweight Q&amp;amp;A, simple automation&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 4B&lt;/td&gt;
          &lt;td&gt;4B&lt;/td&gt;
          &lt;td&gt;Lightweight enhanced&lt;/td&gt;
          &lt;td&gt;Stronger understanding and generation than 2B, still easy to deploy locally&lt;/td&gt;
          &lt;td&gt;Limited ceiling for heavy coding and complex agent tasks&lt;/td&gt;
          &lt;td&gt;Local assistant, basic document work, multilingual daily tasks&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 26B&lt;/td&gt;
          &lt;td&gt;26B&lt;/td&gt;
          &lt;td&gt;High-performance (MoE)&lt;/td&gt;
          &lt;td&gt;Better reasoning and tool use, suitable for production workflows&lt;/td&gt;
          &lt;td&gt;Significantly higher VRAM requirement and hardware threshold&lt;/td&gt;
          &lt;td&gt;Coding assistant, complex workflows, enterprise internal agents&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Gemma 4 31B&lt;/td&gt;
          &lt;td&gt;31B&lt;/td&gt;
          &lt;td&gt;High-performance (dense)&lt;/td&gt;
          &lt;td&gt;Best overall capability and stronger stability on complex tasks&lt;/td&gt;
          &lt;td&gt;Highest resource cost and tuning complexity&lt;/td&gt;
          &lt;td&gt;Advanced reasoning, complex coding tasks, heavy automation&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;how-to-choose-start-from-hardware-and-tasks&#34;&gt;How to Choose: Start from Hardware and Tasks
&lt;/h2&gt;&lt;p&gt;If your top concern is whether it runs smoothly, use this guideline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;12GB&lt;/code&gt; VRAM: prioritize &lt;code&gt;4B&lt;/code&gt; or quantized variants of larger models.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;24GB&lt;/code&gt; VRAM: focus on &lt;code&gt;26B&lt;/code&gt;, and evaluate quantized &lt;code&gt;31B&lt;/code&gt; based on workload.&lt;/li&gt;
&lt;li&gt;Higher VRAM or multi-GPU: consider high-precision &lt;code&gt;31B&lt;/code&gt; setups.&lt;/li&gt;
&lt;/ul&gt;
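One way to sanity-check these tiers is a back-of-the-envelope fit test: Q4-class quantization costs roughly 4.5 bits per weight, plus runtime overhead for the KV cache and buffers. A sketch (the 4.5-bit and 20% overhead figures are rough assumptions, not measured values):

```shell
# Does a ~Q4-quantized model fit in a given VRAM budget?
# Assumes ~4.5 bits/weight plus ~20% overhead for KV cache and runtime buffers.
fits_q4() {  # fits_q4 PARAMS_IN_BILLIONS VRAM_GB
  awk -v p="$1" -v vram="$2" 'BEGIN { exit !(p * 4.5 / 8 * 1.2 <= vram) }'
}

fits_q4 4 8   && echo "4B fits in 8 GB"
fits_q4 26 12 || echo "26B does not fit in 12 GB"
```

By this estimate a 31B model at Q4 lands around 21 GB, which is exactly why 24 GB cards are the borderline case the list describes.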
&lt;p&gt;Prioritize stability and inference speed first, then scale up model size gradually.&lt;/p&gt;
&lt;h2 id=&#34;four-typical-use-cases&#34;&gt;Four Typical Use Cases
&lt;/h2&gt;&lt;h3 id=&#34;1-local-general-assistant&#34;&gt;1) Local General Assistant
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;4B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: strong balance between cost and quality, suitable for long-running local use.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-coding-and-automation&#34;&gt;2) Coding and Automation
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;26B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: more stable in multi-step tasks, tool calls, and script generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;3-advanced-reasoning-and-complex-agents&#34;&gt;3) Advanced Reasoning and Complex Agents
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;31B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: stronger robustness under complex context.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;4-edge-devices-and-lightweight-offline-use&#34;&gt;4) Edge Devices and Lightweight Offline Use
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Preferred model: &lt;code&gt;2B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Why: easiest to deploy on resource-constrained devices.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;deployment-suggestions-ollama&#34;&gt;Deployment Suggestions (Ollama)
&lt;/h2&gt;&lt;p&gt;A practical approach is to iterate in small steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;4B&lt;/code&gt; to establish a baseline (latency, memory, quality).&lt;/li&gt;
&lt;li&gt;Build a fixed test set from real tasks (for example, 20 common questions + 10 automation tasks).&lt;/li&gt;
&lt;li&gt;Compare &lt;code&gt;26B/31B&lt;/code&gt; against that set for accuracy, latency, and VRAM cost.&lt;/li&gt;
&lt;li&gt;Upgrade only when the gain is clear.&lt;/li&gt;
&lt;/ol&gt;
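Steps 2 and 3 can be driven by a small loop. A minimal sketch: the `testset.txt` layout (one prompt per line) and the model tag are placeholders, and `RUN` defaults to a stub so the loop is dry-runnable; set `RUN="ollama run"` to take real measurements.

```shell
# Run every prompt in a fixed test set against one model and report wall-clock time.
RUN="${RUN:-echo}"   # stub by default; override with RUN="ollama run" for real runs

bench() {  # bench MODEL_TAG TESTSET_FILE
  i=0
  while IFS= read -r prompt; do
    i=$((i + 1))
    start=$(date +%s)
    $RUN "$1" "$prompt" > /dev/null
    echo "$1 case $i: $(( $(date +%s) - start ))s"
  done < "$2"
}

printf '%s\n' "Summarize this paragraph" "Write a file-rename script" > testset.txt
bench "gemma4:4b" testset.txt
```

Running the same `bench` against each candidate model keeps the comparison apples-to-apples.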
&lt;p&gt;This avoids jumping to a large model too early and running into lag, low throughput, and maintenance overhead.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;The real value of Gemma 4 is not just larger parameter counts, but a practical model ladder from lightweight to high-performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For low-cost fast rollout: start with &lt;code&gt;2B/4B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For production-grade local AI workflows: prioritize &lt;code&gt;26B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For advanced reasoning and heavy automation: move to &lt;code&gt;31B&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In most cases, the best Gemma 4 choice is not the biggest model, but the one with the best fit for your hardware and task goals.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Analyzing Anthropic&#39;s docx Agent Skill: Features, Code Structure, Usage, and Caveats</title>
        <link>https://www.knightli.com/en/2026/04/04/analyze-docx-agent-skill/</link>
        <pubDate>Sat, 04 Apr 2026 11:00:00 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/04/04/analyze-docx-agent-skill/</guid>
        <description>&lt;p&gt;Anthropic&amp;rsquo;s &lt;code&gt;skills/docx&lt;/code&gt; is essentially a workflow spec plus a script toolkit for handling Word documents more reliably with AI.&lt;br&gt;
It does not just tell a model to &amp;ldquo;generate a &lt;code&gt;.docx&lt;/code&gt;.&amp;rdquo; Instead, it breaks document work into explicit paths: create, read, edit existing files, handle tracked changes, add comments, convert formats, and validate OOXML structure.&lt;/p&gt;
&lt;p&gt;If we reduce it to one line:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It treats &lt;code&gt;.docx&lt;/code&gt; as ZIP + XML + Office compatibility constraints, not as a black box.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;what-this-skill-solves&#34;&gt;What this skill solves
&lt;/h2&gt;&lt;p&gt;When general-purpose models handle Word files, we often see the same failure patterns:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;They output text, but not a structurally valid &lt;code&gt;.docx&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;They break OOXML while editing existing documents.&lt;/li&gt;
&lt;li&gt;They do not know which XML parts to update for comments or tracked changes.&lt;/li&gt;
&lt;li&gt;Output opens in one app but behaves inconsistently across Word, LibreOffice, and Google Docs.&lt;/li&gt;
&lt;li&gt;They lack clear routing for when to use &lt;code&gt;pandoc&lt;/code&gt; vs. unpack/edit/repack.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The value of this skill is that it front-loads those decisions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;pandoc&lt;/code&gt; or unpacking for reading and analysis.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;docx-js&lt;/code&gt; for creating new &lt;code&gt;.docx&lt;/code&gt; files.&lt;/li&gt;
&lt;li&gt;Use &amp;ldquo;unpack -&amp;gt; edit XML -&amp;gt; repack -&amp;gt; validate&amp;rdquo; for existing documents.&lt;/li&gt;
&lt;li&gt;Use dedicated scripts for tracked changes/comments/schema-sensitive operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That approach works because Word problems are usually not about wording quality. They are about structural correctness and compatibility.&lt;/p&gt;
&lt;h2 id=&#34;directory-and-code-structure&#34;&gt;Directory and code structure
&lt;/h2&gt;&lt;p&gt;This skill can be understood in four layers.&lt;/p&gt;
&lt;h2 id=&#34;1-guidance-layer-skillmd&#34;&gt;1. Guidance layer: &lt;code&gt;SKILL.md&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;SKILL.md&lt;/code&gt; does two important jobs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It defines trigger conditions.&lt;br&gt;
If a request mentions Word, &lt;code&gt;.docx&lt;/code&gt;, comments, tracked changes, TOC, page numbers, or polished document formatting, this skill should be activated.&lt;/li&gt;
&lt;li&gt;It defines execution routes.&lt;br&gt;
Different task types map to different toolchains, instead of improvising every run.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It also captures practical compatibility rules, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;docx-js&lt;/code&gt; defaults to A4, not US Letter.&lt;/li&gt;
&lt;li&gt;Landscape page sizing must follow &lt;code&gt;docx-js&lt;/code&gt; internals.&lt;/li&gt;
&lt;li&gt;Lists should not be built from manual Unicode bullets.&lt;/li&gt;
&lt;li&gt;Table width needs coordinated settings at table and cell levels.&lt;/li&gt;
&lt;li&gt;Image &lt;code&gt;type&lt;/code&gt; is required.&lt;/li&gt;
&lt;li&gt;Generated files should be validated.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is a strong signal that the goal is not just &amp;ldquo;generate something,&amp;rdquo; but &amp;ldquo;generate something that is robust.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;2-office-package-layer-scriptsoffice&#34;&gt;2. Office package layer: &lt;code&gt;scripts/office/*&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;This layer treats &lt;code&gt;.docx/.pptx/.xlsx&lt;/code&gt; as Open XML packages.&lt;/p&gt;
&lt;h3 id=&#34;unpackpy&#34;&gt;&lt;code&gt;unpack.py&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;This script unpacks files and prepares XML for safer editing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extracts ZIP package content&lt;/li&gt;
&lt;li&gt;Pretty-prints XML and &lt;code&gt;.rels&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Optionally runs &lt;code&gt;merge_runs&lt;/code&gt; for DOCX&lt;/li&gt;
&lt;li&gt;Optionally runs &lt;code&gt;simplify_redlines&lt;/code&gt; for DOCX&lt;/li&gt;
&lt;li&gt;Escapes smart quotes into XML entities&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So it is not just decompression. It normalizes content into an editing-friendly shape.&lt;/p&gt;
&lt;h3 id=&#34;packpy&#34;&gt;&lt;code&gt;pack.py&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;This script repacks a directory into &lt;code&gt;.docx/.pptx/.xlsx&lt;/code&gt;.&lt;br&gt;
Before packaging, it can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run validation and auto-repair&lt;/li&gt;
&lt;li&gt;Condense XML formatting safely&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If &lt;code&gt;--original&lt;/code&gt; is provided, it compares and validates against the source context.&lt;br&gt;
That matters because &amp;ldquo;repacked successfully&amp;rdquo; is not the same as &amp;ldquo;semantically safe.&amp;rdquo;&lt;/p&gt;
&lt;h3 id=&#34;validatepy&#34;&gt;&lt;code&gt;validate.py&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;This is the quality gate. It checks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;XML well-formedness&lt;/li&gt;
&lt;li&gt;Namespace correctness&lt;/li&gt;
&lt;li&gt;Unique ID constraints&lt;/li&gt;
&lt;li&gt;Relationship/content type consistency&lt;/li&gt;
&lt;li&gt;XSD compliance&lt;/li&gt;
&lt;li&gt;Whitespace preservation rules&lt;/li&gt;
&lt;li&gt;Insertion/deletion/comment marker constraints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For DOCX work, this is a core component, not an optional extra.&lt;/p&gt;
&lt;h3 id=&#34;sofficepy&#34;&gt;&lt;code&gt;soffice.py&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;This helper wraps LibreOffice execution for restricted/sandboxed environments.&lt;br&gt;
It configures &lt;code&gt;SAL_USE_VCLPLUGIN=svp&lt;/code&gt; and can apply a shim for AF_UNIX socket limitations when needed.&lt;/p&gt;
&lt;p&gt;That tells us the skill is designed for automated agent workflows, not only local manual usage.&lt;/p&gt;
&lt;h2 id=&#34;3-word-specific-layer-comments-revisions-and-redlines&#34;&gt;3. Word-specific layer: comments, revisions, and redlines
&lt;/h2&gt;&lt;h3 id=&#34;commentpy&#34;&gt;&lt;code&gt;comment.py&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;This script adds comments to DOCX, including required package plumbing across multiple parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;word/comments.xml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;commentsExtended.xml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;commentsIds.xml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;commentsExtensible.xml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;comment range markers in &lt;code&gt;document.xml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;declarations in &lt;code&gt;[Content_Types].xml&lt;/code&gt; and &lt;code&gt;document.xml.rels&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If comment parts do not exist yet, it can initialize templates and required relationships/content types.&lt;/p&gt;
&lt;h3 id=&#34;accept_changespy&#34;&gt;&lt;code&gt;accept_changes.py&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;This script accepts all tracked changes via LibreOffice headless + macro (&lt;code&gt;.uno:AcceptAllTrackedChanges&lt;/code&gt;) rather than fragile raw XML surgery.&lt;/p&gt;
&lt;p&gt;That is a pragmatic choice because accepting revisions is a behavior-level operation, not just deleting &lt;code&gt;&amp;lt;w:ins&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;w:del&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;
&lt;h3 id=&#34;validatorsredliningpy&#34;&gt;&lt;code&gt;validators/redlining.py&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;This is one of the most valuable pieces.&lt;br&gt;
It removes tracked changes for a specific author in both original and modified documents, then compares resulting text to verify that changes are properly represented in revision markup.&lt;/p&gt;
&lt;p&gt;So it validates revision semantics, not only XML syntax.&lt;/p&gt;
&lt;h2 id=&#34;4-schema-and-support-layer-schemas-helpers-templates&#34;&gt;4. Schema and support layer: &lt;code&gt;schemas/&lt;/code&gt;, &lt;code&gt;helpers/&lt;/code&gt;, &lt;code&gt;templates/&lt;/code&gt;
&lt;/h2&gt;&lt;h3 id=&#34;schemas&#34;&gt;&lt;code&gt;schemas/&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;Contains OOXML/ECMA/Microsoft-related XSD files used by validators.&lt;br&gt;
Validation is therefore grounded in formal schema constraints.&lt;/p&gt;
&lt;h3 id=&#34;helpers&#34;&gt;&lt;code&gt;helpers/&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;Includes utilities such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;merge_runs.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;simplify_redlines.py&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These stabilize XML structure for clearer edits and diffs.&lt;/p&gt;
&lt;h3 id=&#34;templates&#34;&gt;&lt;code&gt;templates/&lt;/code&gt;
&lt;/h3&gt;&lt;p&gt;Contains XML templates needed for comment support, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;comments.xml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;commentsExtended.xml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;commentsIds.xml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;commentsExtensible.xml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;people.xml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These templates help avoid package-level inconsistencies when creating comment-related parts.&lt;/p&gt;
&lt;h2 id=&#34;typical-usage-patterns&#34;&gt;Typical usage patterns
&lt;/h2&gt;&lt;p&gt;From &lt;code&gt;SKILL.md&lt;/code&gt;, the most common workflows are:&lt;/p&gt;
&lt;h2 id=&#34;scenario-1-readanalyze-an-existing-docx&#34;&gt;Scenario 1: Read/analyze an existing DOCX
&lt;/h2&gt;&lt;p&gt;Use &lt;code&gt;pandoc&lt;/code&gt; for text-level extraction with tracked changes:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pandoc --track-changes&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all document.docx -o output.md
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Use unpacking for raw XML inspection:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python scripts/office/unpack.py document.docx unpacked/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;scenario-2-create-a-new-docx&#34;&gt;Scenario 2: Create a new DOCX
&lt;/h2&gt;&lt;p&gt;Use &lt;code&gt;docx-js&lt;/code&gt; for generation (install the library first):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;npm install -g docx
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then validate:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python scripts/office/validate.py doc.docx
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;scenario-3-edit-an-existing-docx&#34;&gt;Scenario 3: Edit an existing DOCX
&lt;/h2&gt;&lt;p&gt;Core workflow:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python scripts/office/unpack.py document.docx unpacked/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# edit XML under unpacked/&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python scripts/office/pack.py unpacked/ output.docx --original document.docx
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;--original&lt;/code&gt; is the critical part because it enables stronger structural and revision-aware checks.&lt;/p&gt;
&lt;h2 id=&#34;scenario-4-accept-all-tracked-changes&#34;&gt;Scenario 4: Accept all tracked changes
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python scripts/accept_changes.py input.docx output.docx
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Requires LibreOffice; useful for producing a clean post-review file.&lt;/p&gt;
&lt;h2 id=&#34;scenario-5-add-comments&#34;&gt;Scenario 5: Add comments
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python comment.py unpacked/ &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Comment text&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python comment.py unpacked/ &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;Reply text&amp;#34;&lt;/span&gt; --parent &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You still need to place comment range markers in &lt;code&gt;document.xml&lt;/code&gt; where the comment should attach.&lt;/p&gt;
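For reference, those markers look roughly like this inside `document.xml` (a minimal sketch; the `w:id` values must match the comment's id in `word/comments.xml`):

```xml
<w:commentRangeStart w:id="0"/>
<w:r><w:t>the text being commented on</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r>
  <w:commentReference w:id="0"/>
</w:r>
```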
&lt;h2 id=&#34;key-caveats-to-remember&#34;&gt;Key caveats to remember
&lt;/h2&gt;&lt;h3 id=&#34;1-docx-is-not-a-plain-text-file&#34;&gt;1. &lt;code&gt;.docx&lt;/code&gt; is not a plain text file
&lt;/h3&gt;&lt;p&gt;A single edit may involve body XML, relationships, content types, comment parts, IDs, and schema constraints.&lt;/p&gt;
&lt;h3 id=&#34;2-docx-js-generation-still-needs-explicit-guardrails&#34;&gt;2. &lt;code&gt;docx-js&lt;/code&gt; generation still needs explicit guardrails
&lt;/h3&gt;&lt;p&gt;Defaults can be wrong for your target layout and compatibility goals.&lt;/p&gt;
&lt;h3 id=&#34;3-comments-and-tracked-changes-are-multi-part-operations&#34;&gt;3. Comments and tracked changes are multi-part operations
&lt;/h3&gt;&lt;p&gt;They are package-level features, not single-tag edits.&lt;/p&gt;
&lt;h3 id=&#34;4-opens-successfully-does-not-mean-correctly-modified&#34;&gt;4. &amp;ldquo;Opens successfully&amp;rdquo; does not mean &amp;ldquo;correctly modified&amp;rdquo;
&lt;/h3&gt;&lt;p&gt;Many issues only surface later during editing, reviewing, cross-app opening, or acceptance of changes.&lt;/p&gt;
&lt;h3 id=&#34;5-environment-readiness-matters&#34;&gt;5. Environment readiness matters
&lt;/h3&gt;&lt;p&gt;You need tools such as &lt;code&gt;pandoc&lt;/code&gt;, &lt;code&gt;LibreOffice/soffice&lt;/code&gt;, &lt;code&gt;docx-js&lt;/code&gt;, and Python deps (&lt;code&gt;defusedxml&lt;/code&gt;, &lt;code&gt;lxml&lt;/code&gt;) available.&lt;/p&gt;
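A quick preflight for those dependencies can save a failed run later. A sketch (adjust names to your install; for example, `soffice` may be exposed as `libreoffice` on some distros):

```shell
# Report any missing external tools before starting a DOCX workflow.
for tool in pandoc soffice python3; do
  command -v "$tool" > /dev/null || echo "missing: $tool"
done
python3 -c 'import defusedxml, lxml' 2> /dev/null || echo "missing Python deps: defusedxml, lxml"
```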
&lt;h2 id=&#34;what-this-skill-is-good-for-and-not&#34;&gt;What this skill is good for (and not)
&lt;/h2&gt;&lt;h3 id=&#34;good-fit&#34;&gt;Good fit
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Batch Word report generation&lt;/li&gt;
&lt;li&gt;Structured formal document production&lt;/li&gt;
&lt;li&gt;Automated edits to existing &lt;code&gt;.docx&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Tracked-changes aware workflows&lt;/li&gt;
&lt;li&gt;Automated comment insertion&lt;/li&gt;
&lt;li&gt;Agent/script-driven document pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;not-ideal&#34;&gt;Not ideal
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;Very simple PDF-only output cases&lt;/li&gt;
&lt;li&gt;Pure text extraction with no document fidelity requirement&lt;/li&gt;
&lt;li&gt;Fully manual visual editing workflows&lt;/li&gt;
&lt;li&gt;Zero-dependency expectations for end-to-end Word automation&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Anthropic&amp;rsquo;s &lt;code&gt;skills/docx&lt;/code&gt; is strong not because it can &amp;ldquo;generate Word files,&amp;rdquo; but because it encodes why Word automation fails and how to handle those failure modes systematically.&lt;br&gt;
It combines generation, low-level XML editing, revision semantics, schema validation, and cross-app compatibility into one executable workflow.&lt;/p&gt;
&lt;p&gt;If your use case includes existing DOCX edits, comments, tracked changes, or compatibility-sensitive automation, this design is very practical and high value.&lt;/p&gt;
&lt;p&gt;Code location: &lt;a class=&#34;link&#34; href=&#34;https://github.com/anthropics/skills/tree/main/skills/docx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/anthropics/skills/tree/main/skills/docx&lt;/a&gt;&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
