<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>AI Video on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/ai-video/</link>
        <description>Recent content in AI Video on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Fri, 08 May 2026 22:27:10 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/ai-video/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>miHoYo LPM 1.0 Explained: How an AI Video Model Could Reshape Game NPCs</title>
        <link>https://www.knightli.com/en/2026/05/08/lpm-1-0-ai-video-character-performance/</link>
        <pubDate>Fri, 08 May 2026 22:27:10 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/05/08/lpm-1-0-ai-video-character-performance/</guid>
<description>&lt;p&gt;LPM 1.0 is easy to mistake for yet another AI video generation model. Judging only by demos, it may not look as visually explosive as some text-to-video systems. But measured against the paper&amp;rsquo;s stated goal, it is not mainly trying to generate a good-looking clip. It is trying to make a digital character feel present during interaction.&lt;/p&gt;
&lt;p&gt;That is the biggest difference between LPM 1.0 and ordinary video models. A typical video model focuses on image quality, camera continuity, and prompt following. LPM 1.0 focuses on character performance: lip sync, rhythm, and expression while speaking; nods, gaze, pauses, and micro-expressions while listening; and stable identity across long interactions.&lt;/p&gt;
&lt;h2 id=&#34;from-generating-video-to-generating-performance&#34;&gt;From generating video to generating performance
&lt;/h2&gt;&lt;p&gt;LPM stands for Large Performance Model. The name matters because it shifts the task boundary from &amp;ldquo;video&amp;rdquo; to &amp;ldquo;performance&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;In real conversation, whether someone feels natural is not only about what they say. Listening is part of communication: the timing of nods, the direction of gaze, and subtle emotional changes all affect whether we believe a character is alive.&lt;/p&gt;
&lt;p&gt;Many digital human systems still attach text, speech, and lip motion to a character. The character can talk, but may not truly listen. It can output lines, but may not react continuously to the previous second of input. LPM 1.0 aims to turn passive playback into real-time interaction.&lt;/p&gt;
&lt;h2 id=&#34;the-three-hard-problems&#34;&gt;The three hard problems
&lt;/h2&gt;&lt;p&gt;The LPM 1.0 paper describes a trilemma in AI character performance: expressiveness, real-time inference, and long-horizon identity stability. A system may look detailed but run slowly, respond quickly but feel rigid, or hold a character&amp;rsquo;s identity for a short stretch but drift over longer sessions. Achieving all three at once is much harder.&lt;/p&gt;
&lt;p&gt;To address this, LPM 1.0 uses richer character conditioning. Instead of giving the model only one reference image, it introduces multi-granularity identity references, including global appearance, multi-view body images, and facial expression examples. The goal is to reduce hallucinated details such as profile shape, teeth, expression texture, and body proportions.&lt;/p&gt;
&lt;p&gt;The paper also separates speaking and listening behavior. Speaking audio mainly drives lip sync, speech rhythm, head motion, and body rhythm. Listening audio triggers gaze, nodding, posture changes, and micro-expressions. If both signals are mixed into one control stream, the model can easily learn the wrong behavior. LPM 1.0 models speaking and listening separately, then connects them in one online interaction system.&lt;/p&gt;
&lt;h2 id=&#34;base-lpm-and-online-lpm&#34;&gt;Base LPM and Online LPM
&lt;/h2&gt;&lt;p&gt;According to the public paper, LPM 1.0 is built on a 17B-parameter Diffusion Transformer. Base LPM learns high-quality, controllable, identity-consistent character performance video. Online LPM is a distilled streaming generator designed for low-latency, long-running interaction.&lt;/p&gt;
&lt;p&gt;This split is important. Offline models can focus on quality, but interactive systems cannot make users wait. When a user starts speaking, the character should begin listening immediately. When the character starts speaking, lip sync, expression, and body motion must follow at once. Online LPM is valuable because it compresses complex video generation into something closer to real-time interaction.&lt;/p&gt;
&lt;p&gt;So LPM 1.0 is not just a short-video asset tool for creators. It is closer to a visual engine for conversational agents, virtual streamers, and game NPCs: the language model understands and generates content, the speech model provides the voice, and LPM makes the on-screen character perform credibly.&lt;/p&gt;
&lt;h2 id=&#34;what-it-means-for-games&#34;&gt;What it means for games
&lt;/h2&gt;&lt;p&gt;In games, LPM 1.0 points less toward prettier cutscenes and more toward the next generation of interactive characters.&lt;/p&gt;
&lt;p&gt;Traditional NPCs rely on prewritten scripts, fixed animations, and limited branches. Players can talk to them, but their responses are usually predesigned. In the AI era, the target goes further: different players may experience different story paths in the same world, and the same character may respond with actions, emotions, and dialogue that fit each player&amp;rsquo;s context.&lt;/p&gt;
&lt;p&gt;That is what a truly personalized game experience needs underneath. Language models can generate lines, and behavior systems can choose goals, but if the character on screen still looks stiff, players will struggle to believe it understands them. LPM 1.0 tries to fill that visual and performance layer.&lt;/p&gt;
&lt;h2 id=&#34;not-a-finished-magic-product&#34;&gt;Not a finished magic product
&lt;/h2&gt;&lt;p&gt;LPM 1.0 should still be understood as a technical direction, not an immediately scalable commercial product. The paper and demos show a possibility: real-time, full-duplex, identity-stable character video generation is getting closer to usable. But before it can enter games broadly, there are still problems around cost, latency, edge deployment, content safety, character rights, multiplayer scenes, and engine integration.&lt;/p&gt;
&lt;p&gt;A more realistic path may start with virtual streamers, AI companions, story interaction, character support agents, and educational coaching. As model cost falls and latency improves, the technology can move into more complex game systems.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The value of LPM 1.0 is not whether it can generate the most spectacular video clip. It is that it pushes AI video from &amp;ldquo;image generation&amp;rdquo; toward &amp;ldquo;character presence&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;If future games become more personalized, more dynamic, and more dependent on AI characters, language, speech, motion, expression, and identity consistency must be designed together. LPM 1.0 offers one possible path: digital characters that do not just talk, but listen, react, and remain recognizably themselves over long interactions.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2604.07823&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;arXiv: LPM 1.0: Video-based Character Performance Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://large-performance-model.github.io/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;LPM 1.0 project page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Pixelle-Video: An Open-Source AI Engine for Generating Short Videos From One Topic</title>
        <link>https://www.knightli.com/en/2026/05/07/pixelle-video-ai-short-video-engine/</link>
        <pubDate>Thu, 07 May 2026 20:25:17 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/05/07/pixelle-video-ai-short-video-engine/</guid>
        <description>&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/AIDC-AI/Pixelle-Video&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Pixelle-Video&lt;/a&gt; is an open-source fully automated short-video generation engine from AIDC-AI. Its goal is direct: the user enters a topic, and the system automatically writes the script, generates AI images or videos, creates voice narration, adds background music, and renders the final video.&lt;/p&gt;
&lt;p&gt;This kind of tool is useful for batch short-video creation, knowledge explainers, talking-head content, novel recaps, history and culture videos, and self-media experiments. It is not a single text-to-video model. It is a production pipeline that connects several AI capabilities.&lt;/p&gt;
&lt;h2 id=&#34;what-it-automates&#34;&gt;What It Automates
&lt;/h2&gt;&lt;p&gt;Pixelle-Video&amp;rsquo;s default flow can be summarized as:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;enter a topic or fixed script;&lt;/li&gt;
&lt;li&gt;use an LLM to generate narration;&lt;/li&gt;
&lt;li&gt;plan scenes and generate images or video clips;&lt;/li&gt;
&lt;li&gt;use TTS to create voice narration;&lt;/li&gt;
&lt;li&gt;add background music;&lt;/li&gt;
&lt;li&gt;apply a video template and render the final result.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The README describes the flow as &amp;ldquo;script generation → image planning → frame-by-frame processing → video composition.&amp;rdquo; The modular design is clear: each step can be replaced, tuned, or connected to a custom workflow.&lt;/p&gt;
&lt;h2 id=&#34;key-features&#34;&gt;Key Features
&lt;/h2&gt;&lt;p&gt;The project covers a fairly complete set of capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;AI script writing: automatically generate narration from a topic;&lt;/li&gt;
&lt;li&gt;AI image generation: create illustrations for each line or scene;&lt;/li&gt;
&lt;li&gt;AI video generation: connect to video generation models such as WAN 2.1;&lt;/li&gt;
&lt;li&gt;TTS voice: support Edge-TTS, Index-TTS, and other options;&lt;/li&gt;
&lt;li&gt;background music: use built-in BGM or custom music;&lt;/li&gt;
&lt;li&gt;multiple aspect ratios: support vertical, horizontal, and other video sizes;&lt;/li&gt;
&lt;li&gt;multiple models: connect to GPT, Qwen, DeepSeek, Ollama, and more;&lt;/li&gt;
&lt;li&gt;ComfyUI workflows: use built-in workflows or replace image, TTS, and video generation steps.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recent updates also mention motion transfer, digital-human talking videos, image-to-video pipelines, multilingual TTS voices, RunningHub support, and a Windows all-in-one package. The project is clearly moving beyond a simple script toward a fuller creation tool.&lt;/p&gt;
&lt;h2 id=&#34;installation-and-launch&#34;&gt;Installation and Launch
&lt;/h2&gt;&lt;p&gt;Windows users can first look at the official all-in-one package. It is designed to reduce setup friction: no manual Python, uv, or ffmpeg installation is required. After extracting the package, run &lt;code&gt;start.bat&lt;/code&gt;, open the web interface, and configure the required APIs and image generation service.&lt;/p&gt;
&lt;p&gt;For source installation, the README gives this basic flow:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/AIDC-AI/Pixelle-Video.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; Pixelle-Video
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv run streamlit run web/app.py
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The source route is suitable for macOS and Linux users, and for anyone who wants to modify templates, workflows, or service configuration. The main prerequisites are &lt;code&gt;uv&lt;/code&gt; and &lt;code&gt;ffmpeg&lt;/code&gt;.&lt;/p&gt;
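&lt;p&gt;If &lt;code&gt;uv&lt;/code&gt; or &lt;code&gt;ffmpeg&lt;/code&gt; is missing, a minimal setup sketch could look like the following; it assumes macOS with Homebrew or a Debian/Ubuntu system, so adjust the package manager commands for your platform:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Install uv via the official installer script
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ffmpeg -- pick the line that matches your system
brew install ffmpeg        # macOS (Homebrew)
sudo apt install ffmpeg    # Debian / Ubuntu
&lt;/code&gt;&lt;/pre&gt;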
&lt;h2 id=&#34;configuration-priorities&#34;&gt;Configuration Priorities
&lt;/h2&gt;&lt;p&gt;On first use, the key is not to click &amp;ldquo;generate&amp;rdquo; right away but to connect the external services properly first.&lt;/p&gt;
&lt;p&gt;LLM configuration determines script quality. You can choose models such as Qwen, GPT, DeepSeek, or Ollama, then fill in the API Key, Base URL, and model name. If you want to minimize cost, local Ollama is one option. If you want more stable results, a cloud model is usually easier.&lt;/p&gt;
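&lt;p&gt;As a concrete illustration of the low-cost local route, a minimal sketch with Ollama could look like this; the model tag and settings below are example values to type into the web interface, not official defaults:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Illustrative only: run a local model with Ollama and point Pixelle-Video at it
ollama pull qwen2.5:7b     # example model tag; any chat-capable model works
ollama serve               # exposes an OpenAI-compatible API on port 11434

# Example values for the LLM settings in the web UI:
#   Base URL:   http://localhost:11434/v1
#   Model name: qwen2.5:7b
#   API Key:    any non-empty string (a local Ollama server does not check it)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cloud providers follow the same pattern: the Base URL points at an OpenAI-compatible endpoint, and the model name matches whatever that provider exposes.&lt;/p&gt;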
&lt;p&gt;Image and video generation configuration determines visual quality. The project supports local ComfyUI and RunningHub. Users who understand ComfyUI can place their own workflows under &lt;code&gt;workflows/&lt;/code&gt; to replace the default image, video, or TTS pipeline.&lt;/p&gt;
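&lt;p&gt;Swapping in a custom image workflow can be as simple as copying an exported ComfyUI workflow file into that directory; the file name below is hypothetical, and the expected naming and node conventions are defined by the project, so check its workflow documentation first:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;# Hypothetical example: replace the default image workflow with your own export
cp ~/comfyui-exports/my_image_workflow.json workflows/
&lt;/code&gt;&lt;/pre&gt;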
&lt;p&gt;Template configuration determines the final visual form. The project organizes video templates under &lt;code&gt;templates/&lt;/code&gt;, with naming rules for static templates, image templates, and video templates. For creators, this is more practical than generating raw assets only, because the output is a video that can be previewed and downloaded directly.&lt;/p&gt;
&lt;h2 id=&#34;who-it-is-for&#34;&gt;Who It Is For
&lt;/h2&gt;&lt;p&gt;Pixelle-Video is especially suitable for three groups:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Short-video creators&lt;/strong&gt; who want to turn ideas into draft videos quickly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AIGC tool users&lt;/strong&gt; who want to connect LLMs, ComfyUI, TTS, and video composition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developers and automation users&lt;/strong&gt; who want to modify templates, workflows, or integrate their own materials and models.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you only want to make one polished, premium video, it may not replace manual editing. But if you want to generate many explainers, talking-head videos, or science and education videos with a consistent structure, its pipeline approach is valuable.&lt;/p&gt;
&lt;h2 id=&#34;things-to-note&#34;&gt;Things to Note
&lt;/h2&gt;&lt;p&gt;The ceiling of this kind of tool is determined by multiple links in the chain. A weak script model produces empty content; a weak image model gives scattered visuals; unnatural TTS makes the video feel rough; and a poor template weakens the final result.&lt;/p&gt;
&lt;p&gt;So it is better to start with one fixed scenario, such as a &amp;ldquo;60-second vertical science explainer.&amp;rdquo; Fix the LLM, visual style, TTS voice, BGM, and template first, then expand to more topics.&lt;/p&gt;
&lt;p&gt;The project supports a local free setup, but local setups often require a GPU, ComfyUI configuration, and model files. Users without a local inference environment can reduce setup difficulty by using a cloud LLM plus RunningHub, while keeping an eye on usage cost.&lt;/p&gt;
&lt;h2 id=&#34;short-take&#34;&gt;Short Take
&lt;/h2&gt;&lt;p&gt;Pixelle-Video is interesting not merely because it can &amp;ldquo;generate a video from one sentence.&amp;rdquo; Its real value is that it breaks short-video production into replaceable modules: script, visuals, voice, music, templates, and rendering. For ordinary users, it is a low-barrier AI video tool. For developers, it is closer to a hackable short-video automation framework.&lt;/p&gt;
&lt;p&gt;If you are studying AI short-video pipelines, or want to connect ComfyUI, TTS, LLMs, and template rendering into a usable product, Pixelle-Video is worth trying and dissecting.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
