<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>LangExtract on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/langextract/</link>
        <description>Recent content in LangExtract on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Fri, 01 May 2026 02:58:21 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/langextract/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Google LangExtract: Extract Structured Data from Long Text with LLMs</title>
        <link>https://www.knightli.com/en/2026/05/01/google-langextract-llm-structured-data-extraction/</link>
        <pubDate>Fri, 01 May 2026 02:58:21 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/05/01/google-langextract-llm-structured-data-extraction/</guid>
        <description>&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; is an open-source Python library from Google for extracting structured information from unstructured text.&lt;/p&gt;
&lt;p&gt;Its use case is straightforward: give it a piece of text, a prompt, and a few examples, then let a large language model extract fields according to your definition and organize the result into data that can be processed.&lt;/p&gt;
&lt;p&gt;Unlike simply asking a model to summarize something, &lt;code&gt;LangExtract&lt;/code&gt; focuses on three things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extracting information into a fixed structure&lt;/li&gt;
&lt;li&gt;Preserving the relationship between extracted results and their source locations&lt;/li&gt;
&lt;li&gt;Supporting long documents and visual inspection&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you often need to extract entities, events, relationships, or attributes from reports, papers, medical notes, contracts, logs, or web pages, this kind of tool is more flexible than hand-written regular expressions and easier to connect to downstream data workflows than plain chat-style questioning.&lt;/p&gt;
&lt;h2 id=&#34;what-problem-does-it-solve&#34;&gt;What Problem Does It Solve?
&lt;/h2&gt;&lt;p&gt;Many text extraction tasks look simple, but become troublesome in practice.&lt;/p&gt;
&lt;p&gt;For example, you may want to extract:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;People, organizations, and locations&lt;/li&gt;
&lt;li&gt;Events, times, and participants&lt;/li&gt;
&lt;li&gt;Drugs, dosages, and adverse reactions&lt;/li&gt;
&lt;li&gt;Product models, parameters, and prices&lt;/li&gt;
&lt;li&gt;Contract clauses, obligations, and deadlines&lt;/li&gt;
&lt;li&gt;Error types and context from logs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the format is fixed, regular expressions or traditional parsers can work.&lt;br&gt;
But once the text becomes more natural, the rules quickly become complicated.&lt;/p&gt;
&lt;p&gt;Large language models are good at understanding natural language, but directly asking a model to &amp;ldquo;extract it&amp;rdquo; often causes several problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Output format is unstable&lt;/li&gt;
&lt;li&gt;It is unclear where the information came from in the source text&lt;/li&gt;
&lt;li&gt;Information in long documents is easily missed&lt;/li&gt;
&lt;li&gt;Batch processing is difficult&lt;/li&gt;
&lt;li&gt;Results are inconvenient to review manually&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; targets exactly this layer: it wraps the LLM&amp;rsquo;s language understanding in a more controllable extraction workflow.&lt;/p&gt;
&lt;h2 id=&#34;key-features-of-langextract&#34;&gt;Key Features of LangExtract
&lt;/h2&gt;&lt;h3 id=&#34;1-use-examples-to-constrain-the-extraction-format&#34;&gt;1. Use Examples to Constrain the Extraction Format
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; does not rely on a vague one-line prompt. Instead, it uses prompts and examples to tell the model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What to extract&lt;/li&gt;
&lt;li&gt;What each field is called&lt;/li&gt;
&lt;li&gt;How each field should be filled&lt;/li&gt;
&lt;li&gt;What to do when information is uncertain&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This few-shot approach works well for information extraction.&lt;br&gt;
The closer your examples are to real data, the more stable the model&amp;rsquo;s structured output becomes.&lt;/p&gt;
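&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; builds the request for you, so the following is only a rough sketch of the few-shot idea itself, not the library&amp;rsquo;s API: the task description, a few worked examples, and the target text are assembled into a single prompt.&lt;/p&gt;

```python
import json

def build_prompt(task, examples, text):
    """Assemble a few-shot extraction prompt: task description first,
    then worked examples, then the text to process."""
    parts = [task, ""]
    for ex in examples:
        parts.append("Text: " + ex["text"])
        parts.append("Output: " + json.dumps(ex["output"]))
        parts.append("")
    parts.append("Text: " + text)
    parts.append("Output:")
    return "\n".join(parts)

task = "Extract medication name and dosage. Use null for missing fields."
examples = [{"text": "Take ibuprofen 200 mg twice daily.",
             "output": {"medication": "ibuprofen", "dosage": "200 mg"}}]
prompt = build_prompt(task, examples, "Aspirin 81 mg once a day.")
```

&lt;p&gt;The example output is serialized exactly as the model should produce it, which is what makes the format stable across inputs.&lt;/p&gt;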
&lt;h3 id=&#34;2-extracted-results-can-link-back-to-the-source&#34;&gt;2. Extracted Results Can Link Back to the Source
&lt;/h3&gt;&lt;p&gt;The worst kind of extraction result is one that looks right but cannot be traced back.&lt;/p&gt;
&lt;p&gt;One of the key design points of &lt;code&gt;LangExtract&lt;/code&gt; is that it aligns extracted results with their source locations. During later review, you see more than a bare JSON result: you can jump back to the original text and check exactly where each piece of information came from.&lt;/p&gt;
&lt;p&gt;This matters in scenarios that require review, such as medical text, legal text, research material, and internal business documents.&lt;/p&gt;
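&lt;p&gt;As a toy illustration of the grounding idea (not how &lt;code&gt;LangExtract&lt;/code&gt; implements it), each extracted snippet can be mapped back to character offsets in the source, and anything that does not occur verbatim is flagged for review:&lt;/p&gt;

```python
def ground(source, snippets):
    """Attach character offsets to each extracted snippet so every
    result can be traced back to its exact location in the source."""
    grounded = []
    cursor = 0
    for snippet in snippets:
        start = source.find(snippet, cursor)
        if start == -1:
            # Snippet does not occur verbatim after the cursor: flag for review.
            grounded.append({"text": snippet, "start": None, "end": None})
        else:
            end = start + len(snippet)
            grounded.append({"text": snippet, "start": start, "end": end})
            cursor = end
    return grounded

note = "Patient given metformin 500 mg with breakfast."
spans = ground(note, ["metformin", "500 mg"])
```

&lt;p&gt;Slicing the source with the recorded offsets reproduces the snippet exactly, which is the property that makes review possible.&lt;/p&gt;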
&lt;h3 id=&#34;3-support-for-long-documents&#34;&gt;3. Support for Long Documents
&lt;/h3&gt;&lt;p&gt;Long-document extraction often runs into context-window limits, missed results, and duplicate results.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; provides a workflow for long text: split the document, process chunks in parallel, and then organize the extracted results.&lt;/p&gt;
&lt;p&gt;This makes it more suitable for complete reports, papers, long web pages, and bulk documents, rather than only short snippets.&lt;/p&gt;
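&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; manages chunking and parallel calls itself; a minimal sketch of the overlapping-window idea (not the library&amp;rsquo;s implementation) looks like this, with the overlap ensuring an entity that straddles one boundary still appears whole in a neighboring chunk:&lt;/p&gt;

```python
def chunk(text, size, overlap):
    """Split text into overlapping windows; results extracted from the
    overlap region are de-duplicated downstream."""
    assert size > overlap, "chunk size must exceed overlap"
    step = size - overlap
    return [text[start:start + size]
            for start in range(0, max(len(text) - overlap, 1), step)]

pieces = chunk("abcdefghij", 4, 1)
# pieces == ["abcd", "defg", "ghij"]
```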
&lt;h3 id=&#34;4-visual-inspection&#34;&gt;4. Visual Inspection
&lt;/h3&gt;&lt;p&gt;If extraction results are only available as JSON, problems are easy to miss.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; supports visualizing extracted results, making it easier to see what the model extracted and where it came from.&lt;br&gt;
This is useful for tuning prompts, checking missed extractions, and finding false positives.&lt;/p&gt;
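&lt;p&gt;The library generates an interactive HTML review page; as a plain-text stand-in for the same idea, grounded spans can be marked directly in the source so missed and spurious extractions stand out:&lt;/p&gt;

```python
def highlight(source, spans):
    """Mark grounded (start, end) spans with [[...]] so a reviewer can
    scan what was extracted and where it came from."""
    out = []
    prev = 0
    for start, end in sorted(spans):
        out.append(source[prev:start])
        out.append("[[" + source[start:end] + "]]")
        prev = end
    out.append(source[prev:])
    return "".join(out)

preview = highlight("Take aspirin 81 mg daily.", [(5, 12), (13, 18)])
# preview == "Take [[aspirin]] [[81 mg]] daily."
```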
&lt;h2 id=&#34;when-should-you-use-it&#34;&gt;When Should You Use It?
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; is suitable when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need to extract structured fields from natural-language text&lt;/li&gt;
&lt;li&gt;The text format is not fully fixed&lt;/li&gt;
&lt;li&gt;You need to preserve the relationship between extracted results and the source text&lt;/li&gt;
&lt;li&gt;You need to process longer documents&lt;/li&gt;
&lt;li&gt;Results require human review&lt;/li&gt;
&lt;li&gt;The output will later go into tables, databases, or data analysis workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Typical examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extracting symptoms, medications, dosages, and reactions from medical text&lt;/li&gt;
&lt;li&gt;Extracting parties, obligations, amounts, and deadlines from contracts&lt;/li&gt;
&lt;li&gt;Extracting subjects, methods, and conclusions from papers&lt;/li&gt;
&lt;li&gt;Extracting specification parameters from product documents&lt;/li&gt;
&lt;li&gt;Extracting issue types and resolutions from support records&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only need a temporary summary of a short piece of text, an ordinary chat model is enough.&lt;br&gt;
If you want to turn text into data that can be processed later, &lt;code&gt;LangExtract&lt;/code&gt; is a better fit.&lt;/p&gt;
&lt;h2 id=&#34;basic-installation&#34;&gt;Basic Installation
&lt;/h2&gt;&lt;p&gt;The project supports installation through &lt;code&gt;pip&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install langextract
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You can also install it from source:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/google/langextract.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; langextract
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e .
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want to use a model API, configure the API key for the corresponding model provider.&lt;br&gt;
The project documentation focuses on Gemini usage, and it can also connect to other model providers through adapters.&lt;/p&gt;
&lt;h2 id=&#34;basic-usage-flow&#34;&gt;Basic Usage Flow
&lt;/h2&gt;&lt;p&gt;A typical workflow looks like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Prepare the source text&lt;/li&gt;
&lt;li&gt;Clearly describe the extraction target&lt;/li&gt;
&lt;li&gt;Provide a few examples&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;LangExtract&lt;/code&gt; to perform extraction&lt;/li&gt;
&lt;li&gt;Inspect the structured result&lt;/li&gt;
&lt;li&gt;Generate a visualization page for review if needed&lt;/li&gt;
&lt;/ol&gt;
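&lt;p&gt;The steps above can be sketched as a small pipeline. This is only a shape sketch, not &lt;code&gt;LangExtract&lt;/code&gt;&amp;rsquo;s actual API, and the model call is stubbed so it runs without an API key:&lt;/p&gt;

```python
import json

def extract(text, prompt, examples, call_model):
    """Skeleton of the flow: build a request, call the model,
    parse the structured result. call_model stands in for a real LLM client."""
    request = {"prompt": prompt, "examples": examples, "text": text}
    raw = call_model(json.dumps(request))
    return json.loads(raw)

def fake_model(request_json):
    # Stubbed model response; a real client would send the request to an LLM.
    return json.dumps([{"class": "medication", "text": "aspirin"}])

result = extract("Aspirin daily.", "Extract medications.", [], fake_model)
```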
&lt;p&gt;The second and third steps are the most important.&lt;/p&gt;
&lt;p&gt;The prompt should clearly describe the task, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extract only information explicitly present in the text&lt;/li&gt;
&lt;li&gt;Do not fill in missing facts from common sense&lt;/li&gt;
&lt;li&gt;Leave fields empty when information is missing&lt;/li&gt;
&lt;li&gt;Keep the same field structure for the same type of entity&lt;/li&gt;
&lt;li&gt;Preserve source snippets or positions in the output&lt;/li&gt;
&lt;/ul&gt;
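&lt;p&gt;Rules like these can also be enforced after the fact. A minimal validator, written here as an illustrative sketch rather than anything the library provides, checks that every filled field appears verbatim in the source and that missing information stays empty:&lt;/p&gt;

```python
def validate(source, record, fields):
    """Return the names of fields whose values do not occur verbatim in
    the source text; None (missing) is always acceptable."""
    problems = []
    for field in fields:
        value = record.get(field)
        if value is None:
            continue                 # empty is allowed for missing info
        if value not in source:
            problems.append(field)   # likely hallucinated or paraphrased
    return problems

note = "Ibuprofen 200 mg as needed."
good = {"medication": "Ibuprofen", "dosage": "200 mg", "frequency": None}
bad = {"medication": "Advil", "dosage": "200 mg", "frequency": None}
fields = ["medication", "dosage", "frequency"]
```

&lt;p&gt;Here &lt;code&gt;bad&lt;/code&gt; fails because &amp;ldquo;Advil&amp;rdquo; never appears in the note, even though it names the same drug; a verbatim check deliberately rejects that kind of substitution.&lt;/p&gt;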
&lt;p&gt;Examples should be as close as possible to real inputs.&lt;br&gt;
If the real text has noise, abbreviations, line breaks, or table residue, the examples should reflect that.&lt;/p&gt;
&lt;h2 id=&#34;things-to-watch-out-for&#34;&gt;Things to Watch Out For
&lt;/h2&gt;&lt;p&gt;First, do not make the extraction task too broad.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Extract useful information&amp;rdquo; is too vague.&lt;br&gt;
A better instruction would be &amp;ldquo;extract medication name, dosage, frequency, and adverse reactions.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Second, do not fully trust model output.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; can align results with the source text, but that does not mean the model will never miss or mis-extract information. Important scenarios still require sampling checks or human review.&lt;/p&gt;
&lt;p&gt;Third, examples are more useful than long explanations.&lt;/p&gt;
&lt;p&gt;In information extraction tasks, models often rely more on examples to understand the output format.&lt;br&gt;
Instead of writing a long abstract rule set, provide a few high-quality examples.&lt;/p&gt;
&lt;p&gt;Fourth, pay attention to cost and speed for long documents.&lt;/p&gt;
&lt;p&gt;Splitting long documents, parallel extraction, and model calls all have costs. Before batch processing, use a small sample set to tune the prompt and field structure.&lt;/p&gt;
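&lt;p&gt;A back-of-the-envelope estimate helps here. With overlapping chunks, each model call advances the window by the chunk size minus the overlap, so the call count for a document is roughly (the numbers below are purely illustrative):&lt;/p&gt;

```python
import math

def estimate_calls(doc_chars, chunk_chars, overlap_chars):
    """Rough call count for chunked extraction: each call advances the
    window by (chunk - overlap) characters."""
    step = chunk_chars - overlap_chars
    return math.ceil(max(doc_chars - overlap_chars, 1) / step)

# Illustrative numbers: a 300k-character report, 4k chunks, 200 overlap.
calls = estimate_calls(300_000, 4_000, 200)
# calls == 79
```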
&lt;h2 id=&#34;how-is-it-different-from-regex-or-traditional-nlp&#34;&gt;How Is It Different from Regex or Traditional NLP?
&lt;/h2&gt;&lt;p&gt;Regular expressions are good for stable, well-defined text formats.&lt;/p&gt;
&lt;p&gt;Traditional NLP pipelines work well when task boundaries are clear and the model or dictionary is already prepared.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;LangExtract&lt;/code&gt; is better for text whose format is less fixed but whose meaning is clear.&lt;br&gt;
It does not require you to write a rule for every possible expression; instead, the LLM learns the extraction target from examples.&lt;/p&gt;
&lt;p&gt;But it is not a complete replacement for regular expressions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For fixed-format text, regex is cheaper and more stable&lt;/li&gt;
&lt;li&gt;For high-risk scenarios, validation and review are still required&lt;/li&gt;
&lt;li&gt;For large-scale batch processing, model-call cost matters&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A practical approach is to handle the rule-clear parts with code and use &lt;code&gt;LangExtract&lt;/code&gt; for the parts with more semantic variation.&lt;/p&gt;
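&lt;p&gt;The split is easy to see in code. In this hypothetical contract example, a regex reliably pulls the fixed-format part (ISO dates), while the question of who is at fault clearly needs the semantic pass:&lt;/p&gt;

```python
import re

def extract_fixed(text):
    """Handle the rule-clear part with a regex (ISO dates here);
    anything requiring semantic judgment is left for an LLM-based pass."""
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
    return {"dates": dates}

clause = "Payment due by 2026-06-30 unless the supplier is at fault."
fixed = extract_fixed(clause)
# fixed["dates"] == ["2026-06-30"]
```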
&lt;h2 id=&#34;who-is-it-for&#34;&gt;Who Is It For?
&lt;/h2&gt;&lt;p&gt;You may want to look at &lt;code&gt;LangExtract&lt;/code&gt; if you are doing any of the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Turning long text into tables&lt;/li&gt;
&lt;li&gt;Extracting entities and relationships from documents&lt;/li&gt;
&lt;li&gt;Cleaning data before putting it into a knowledge base&lt;/li&gt;
&lt;li&gt;Extracting fields from business text&lt;/li&gt;
&lt;li&gt;Building an LLM-driven information extraction prototype&lt;/li&gt;
&lt;li&gt;Preserving evidence between extracted results and source text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is not a &amp;ldquo;click once and understand every document&amp;rdquo; tool. It is more like a library for engineering an LLM extraction workflow.&lt;/p&gt;
&lt;p&gt;You still need to design fields, write examples, and inspect results.&lt;br&gt;
But compared with manually writing model calls, stitching prompts, and parsing output every time, it provides a more complete extraction framework.&lt;/p&gt;
&lt;h2 id=&#34;reference&#34;&gt;Reference
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/google/langextract&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;google/langextract&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;final-thought&#34;&gt;Final Thought
&lt;/h2&gt;&lt;p&gt;The value of &lt;code&gt;LangExtract&lt;/code&gt; is making &amp;ldquo;let an LLM find information in text&amp;rdquo; more controllable.&lt;/p&gt;
&lt;p&gt;It is not for casual summaries. It is for information extraction tasks with fields, evidence, and review requirements.&lt;br&gt;
If your work often turns long text into structured data, it is worth trying.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
