<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>GUI Agent on KnightLi Blog</title>
        <link>https://www.knightli.com/en/tags/gui-agent/</link>
        <description>Recent content in GUI Agent on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Tue, 19 May 2026 10:56:50 +0800</lastBuildDate><atom:link href="https://www.knightli.com/en/tags/gui-agent/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Let AI Operate Your Computer? UI-TARS-desktop Connects Desktop, Browser, and Tools</title>
        <link>https://www.knightli.com/en/2026/05/19/ui-tars-desktop-multimodal-ai-agent-stack/</link>
        <pubDate>Tue, 19 May 2026 10:56:50 +0800</pubDate>
        
        <guid>https://www.knightli.com/en/2026/05/19/ui-tars-desktop-multimodal-ai-agent-stack/</guid>
        <description>&lt;p&gt;&lt;code&gt;bytedance/UI-TARS-desktop&lt;/code&gt; is ByteDance&amp;rsquo;s open-source multimodal AI agent project. It is not a single desktop app but an agent stack. The current README covers two projects: &lt;code&gt;Agent TARS&lt;/code&gt; and &lt;code&gt;UI-TARS Desktop&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Project URL: &lt;a class=&#34;link&#34; href=&#34;https://github.com/bytedance/UI-TARS-desktop&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/bytedance/UI-TARS-desktop&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Official site: &lt;a class=&#34;link&#34; href=&#34;https://agent-tars.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://agent-tars.com&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;At the time of writing, the GitHub API showed about 34k stars, TypeScript as the main language, and an Apache-2.0 license. The README describes it as an &amp;ldquo;Open-Source Multimodal AI Agent Stack.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;difference-between-agent-tars-and-ui-tars-desktop&#34;&gt;Difference Between Agent TARS and UI-TARS Desktop
&lt;/h2&gt;&lt;p&gt;The README places the two projects in one comparison table:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Agent TARS&lt;/code&gt;: a general multimodal AI agent stack that connects GUI agents, vision, terminal, browser, and product workflows.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;UI-TARS Desktop&lt;/code&gt;: a desktop application based on UI-TARS models, providing native GUI agent capabilities for operating local or remote computers and browsers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Simply put, Agent TARS is closer to a general agent runtime, while UI-TARS Desktop is the desktop entry point for GUI operation.&lt;/p&gt;
&lt;h2 id=&#34;what-agent-tars-can-do&#34;&gt;What Agent TARS Can Do
&lt;/h2&gt;&lt;p&gt;Agent TARS mainly provides a CLI and Web UI. Its goal is to let multimodal models complete tasks in a way that is closer to how a person works, using MCP (Model Context Protocol) and other tools.&lt;/p&gt;
&lt;p&gt;Core capabilities listed in the README include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One-command CLI startup, supporting both a headful Web UI and a headless server.&lt;/li&gt;
&lt;li&gt;Hybrid browser agent control through GUI Agent, DOM, or mixed strategies.&lt;/li&gt;
&lt;li&gt;Event Stream for tracing and debugging data flows.&lt;/li&gt;
&lt;li&gt;MCP integration for mounting MCP Servers and real tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Quick start:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;npx @agent-tars/cli@latest
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Global installation:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;npm install @agent-tars/cli@latest -g
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Run with a model provider:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;agent-tars --provider volcengine --model doubao-1-5-thinking-vision-pro-250428 --apiKey your-api-key
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;agent-tars --provider anthropic --model claude-3-7-sonnet-latest --apiKey your-api-key
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;what-ui-tars-desktop-can-do&#34;&gt;What UI-TARS Desktop Can Do
&lt;/h2&gt;&lt;p&gt;UI-TARS Desktop is a desktop GUI Agent. Built on the UI-TARS and Seed-1.5-VL / 1.6 model families, it focuses on letting the model understand the screen and execute mouse and keyboard operations.&lt;/p&gt;
&lt;p&gt;Capabilities listed in the README include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Natural language control.&lt;/li&gt;
&lt;li&gt;Screenshots and visual recognition.&lt;/li&gt;
&lt;li&gt;Precise mouse and keyboard control.&lt;/li&gt;
&lt;li&gt;Cross-platform support for Windows, macOS, and browsers.&lt;/li&gt;
&lt;li&gt;Real-time feedback and status display.&lt;/li&gt;
&lt;li&gt;Local processing with an emphasis on privacy and security.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example tasks include changing VS Code settings, checking GitHub issues, and operating remote computers or browsers.&lt;/p&gt;
&lt;h2 id=&#34;why-gui-agents-matter&#34;&gt;Why GUI Agents Matter
&lt;/h2&gt;&lt;p&gt;Traditional automation depends on APIs, DOM, or scripts. A GUI Agent starts from the interface: it sees buttons, input boxes, menus, and state, then operates through mouse and keyboard.&lt;/p&gt;
&lt;p&gt;This matters for two reasons. First, many applications do not have stable APIs, or their APIs do not cover the full workflow. A GUI Agent can work through the same surface a human uses.&lt;/p&gt;
&lt;p&gt;Second, multimodal models can handle screenshots, documents, web pages, and app interfaces, combining visual understanding with execution.&lt;/p&gt;
&lt;p&gt;The limitations are equally clear. GUI operations are affected by resolution, language, layout changes, pop-ups, and network latency. Production workflows still need permission control, confirmation steps, and rollback plans.&lt;/p&gt;
&lt;h2 id=&#34;relationship-with-mcp&#34;&gt;Relationship With MCP
&lt;/h2&gt;&lt;p&gt;Agent TARS emphasizes MCP integration. MCP is useful because it gives agents a unified way to call browsers, files, command lines, databases, internal services, and other tools.&lt;/p&gt;
&lt;p&gt;For complex tasks, GUI clicking alone is not reliable enough. A better pattern is often to match each step to the most dependable channel:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use APIs where APIs are available.&lt;/li&gt;
&lt;li&gt;Use vision when page state must be understood.&lt;/li&gt;
&lt;li&gt;Use browser control when real web interaction is needed.&lt;/li&gt;
&lt;li&gt;Use GUI Agent when local software must be operated.&lt;/li&gt;
&lt;/ul&gt;
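&lt;p&gt;The pattern above can be sketched as a small routing function. This is only an illustration and not code from Agent TARS: the &lt;code&gt;Step&lt;/code&gt; fields, the &lt;code&gt;Channel&lt;/code&gt; names, and the rule that an available API takes precedence over everything else are all assumptions made for the example.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    """Hypothetical names for the four options in the list above."""
    API = "api"
    VISION = "vision"
    BROWSER = "browser"
    GUI_AGENT = "gui_agent"


@dataclass
class Step:
    """One task step, described by what it needs (all fields are assumptions)."""
    has_api: bool = False
    needs_page_state: bool = False
    needs_web_interaction: bool = False
    needs_local_app: bool = False


def plan_channels(step):
    """Return the list of channels to use for a single step."""
    # Assumption: when an API exists, it replaces every other channel.
    if step.has_api:
        return [Channel.API]
    channels = []
    if step.needs_page_state:
        channels.append(Channel.VISION)
    if step.needs_web_interaction:
        channels.append(Channel.BROWSER)
    if step.needs_local_app:
        channels.append(Channel.GUI_AGENT)
    return channels
```

&lt;p&gt;A step that needs both page understanding and web interaction would get vision and browser control together, while a step with a usable API would skip the GUI entirely.&lt;/p&gt;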
&lt;p&gt;Projects like UI-TARS-desktop are exploring how to place these capabilities in one agent stack.&lt;/p&gt;
&lt;h2 id=&#34;what-to-watch-out-for&#34;&gt;What To Watch Out For
&lt;/h2&gt;&lt;p&gt;First, desktop agents carry execution risk. They can operate the mouse, keyboard, and browser, so permissions must be limited to prevent accidental file changes, account operations, payments, or actions on production systems.&lt;/p&gt;
&lt;p&gt;Second, remote computer and remote browser control needs a clear security boundary. Do not expose unauthenticated control endpoints to the public internet.&lt;/p&gt;
&lt;p&gt;Third, multimodal models can misread interfaces. Critical operations should require human confirmation, especially delete, submit, pay, publish, trade, or other irreversible actions.&lt;/p&gt;
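&lt;p&gt;A minimal sketch of such a confirmation gate follows. It is not part of UI-TARS-desktop: the verb list simply mirrors the examples above, and &lt;code&gt;run_action&lt;/code&gt;, &lt;code&gt;execute&lt;/code&gt;, and &lt;code&gt;confirm&lt;/code&gt; are hypothetical names chosen for illustration.&lt;/p&gt;

```python
# Hypothetical verb list; it mirrors the irreversible actions named in the text.
IRREVERSIBLE_VERBS = frozenset({"delete", "submit", "pay", "publish", "trade"})


def requires_confirmation(action_verb):
    """Return True when the verb is on the irreversible list."""
    return action_verb.strip().lower() in IRREVERSIBLE_VERBS


def run_action(action_verb, execute, confirm):
    """Run `execute` unless the action is irreversible and not approved.

    `execute` takes no arguments and performs the action.
    `confirm` receives the verb and returns True or False; in a real
    workflow it would ask a human (an assumption of this sketch).
    """
    if requires_confirmation(action_verb) and not confirm(action_verb):
        return "skipped"
    execute()
    return "executed"
```

&lt;p&gt;With this shape, routine clicks pass straight through, while a delete or payment only runs after the &lt;code&gt;confirm&lt;/code&gt; callback approves it.&lt;/p&gt;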
&lt;h2 id=&#34;who-it-is-for&#34;&gt;Who It Is For
&lt;/h2&gt;&lt;p&gt;UI-TARS-desktop suits developers exploring GUI agents, teams building AI assistants for desktop workflows, and researchers comparing browser, DOM, MCP, and visual-control strategies. It is not yet a simple consumer assistant.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;UI-TARS-desktop is worth watching because it moves AI agents from &amp;ldquo;answering in chat&amp;rdquo; toward &amp;ldquo;seeing the screen and operating tools.&amp;rdquo; Its value is not only in desktop control, but in combining GUI, browser, terminal, and MCP capabilities in one stack.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
