PC Desktop Automation Support
Midscene can drive native keyboard and mouse controls to support PC desktop automation on Windows, macOS, and Linux.
Because it relies on a visual model, the automation works with any desktop application, whether built with Electron, Qt, WPF, or native technologies. When debugging UI automation scripts, developers only need to focus on the final user experience.
The PC desktop automation solution comes with all the features of Midscene:
- Supports zero-code trial using Playground
- Supports JavaScript SDK for scripting
- Supports automation scripts in YAML format and command-line tools
- Supports HTML reports to replay all operation paths
- Works across Windows, macOS, and Linux platforms
- Headless mode for Linux CI via Xvfb (no physical display required)
- Multi-display support for complex setups
Showcases
Prompt (macOS): Help me post a tweet promoting Midscene's support for AutoGLM through safari, with the following requirements:
- Text content: Midscene now supports AutoGLM!
- Media content: Use the AutoGLM video from the download folder!
View the full report for this task: report.html
Prompt (Windows): Open Sauce Demo e-commerce site, login and add items to cart
View the full report for this task: report.html
Prompt (macOS): Open Google and query San Jose tomorrow weather temperature
View the full report for this task: report.html
Prompt (Linux): Open TodoMVC, add multiple tasks and filter them
View the full report for this task: report.html
See more showcases: showcases
This guide walks you through everything required to automate PC desktop applications with Midscene: install dependencies, configure model credentials, and run your first JavaScript script.
Control PC desktop with JavaScript: https://github.com/web-infra-dev/midscene-example/tree/main/computer/javascript-sdk-demo
Integrate Vitest for testing: https://github.com/web-infra-dev/midscene-example/tree/main/computer/vitest-demo
Control a remote Windows desktop over RDP: https://github.com/web-infra-dev/midscene-example/tree/main/computer/rdp-demo
Test Obsidian (an Electron app) on headless Linux CI with @midscene/computer: https://github.com/web-infra-dev/midscene-example/tree/main/computer/electron-demo
Set up API keys for model
Set your model configuration in environment variables. For details, refer to Model strategy and Model configuration.
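As a rough sketch, a script can fail fast when model credentials are missing instead of failing on the first AI call. The variable names below are assumptions for illustration only; use the names that Model configuration lists for your model. The helper `missingModelEnv` is hypothetical and not part of Midscene.

```typescript
// Assumed variable names, for illustration only; the authoritative list is in Model configuration.
const ASSUMED_MODEL_ENV_KEYS: readonly string[] = [
  "MIDSCENE_MODEL_BASE_URL",
  "MIDSCENE_MODEL_API_KEY",
  "MIDSCENE_MODEL_NAME",
];

type ModelEnv = Record<string, string | undefined>;

// Return the keys that are unset or blank in the given environment.
function missingModelEnv(
  env: ModelEnv,
  keys: readonly string[] = ASSUMED_MODEL_ENV_KEYS,
): string[] {
  return keys.filter((key) => (env[key] ?? "").trim() === "");
}
```

In a script, calling `missingModelEnv(process.env)` before creating the agent and aborting when the result is non-empty gives a clearer error than a failed model request.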
System Requirements
Node.js
Node.js 18.19.0 or higher is required.
Platform-Specific Dependencies
macOS: Accessibility permissions are required for keyboard and mouse control. When you run the script for the first time, macOS will prompt you to grant access. Go to System Settings > Privacy & Security > Accessibility and enable permissions for the application running your script (e.g., Terminal, iTerm2, VS Code, WebStorm, or other IDEs). For more details, see nut.js macOS setup.
Windows: No extra setup is needed for ordinary apps. However, Windows isolates input across privilege levels (UIPI): a non-elevated process cannot send mouse or keyboard input to a window that runs as Administrator (elevated). The input is silently dropped — the cursor still moves to the right spot, but clicks and keystrokes have no effect. Prefer running the target application without Administrator privileges. If the target application must stay elevated, run the terminal or Node.js that launches Midscene as Administrator too, so both processes share the same privilege level. See Windows: clicks have no effect on some apps.
Linux: ImageMagick is required for screenshot functionality.
Headless Linux (CI): To run desktop automation on a headless Linux server (e.g. GitHub Actions), install Xvfb and its dependencies, then enable headless mode:
Xvfb creates a virtual display so that mouse, keyboard, and screenshot operations work without a physical monitor. See API Reference for details.
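One way to keep a single script usable both locally and in CI is to compute the agent options from the environment. The sketch below uses the `headless` and `xvfbResolution` option names from the API reference; the helper `headlessOptionsFor` is hypothetical, and treating `CI=true` as "running in CI" is an assumption.

```typescript
// `headless` and `xvfbResolution` are the option names from the API reference.
interface HeadlessOptions {
  headless?: boolean;
  xvfbResolution?: string;
}

// Enable the Xvfb-backed virtual display only on Linux CI runners.
// Treating CI=true as "running in CI" is an assumption of this sketch.
function headlessOptionsFor(
  platform: string,
  env: Record<string, string | undefined>,
): HeadlessOptions {
  if (platform !== "linux" || env["CI"] !== "true") return {};
  return { headless: true, xvfbResolution: "1920x1080x24" };
}
```

The result of `headlessOptionsFor(process.platform, process.env)` can then be passed to `agentForComputer()`. On macOS and Windows it is an empty object, so the Linux-only option is never set there.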
Try Playground (no code)
Playground is the fastest way to validate the connection and observe AI-driven steps without writing code. It shares the same core as @midscene/computer, so anything that works here will behave the same once scripted.
- Launch the Playground CLI:
- Click the gear icon in the Playground window, then paste your API key configuration. Refer back to Model configuration if you still need credentials.
Start experiencing
After configuration, you can start using Midscene right away. It provides several key operation tabs:
- Act: interact with the page. This is Auto Planning, corresponding to aiAct. For example:
- Query: extract JSON data from the interface, corresponding to aiQuery. Similar methods include aiBoolean(), aiNumber(), and aiString() for directly extracting booleans, numbers, and strings.
- Assert: understand the page and assert; if the condition is not met, throw an error, corresponding to aiAssert.
- Tap: click on an element. This is Instant Action, corresponding to aiTap.
For the difference between Auto Planning and Instant Action, see the API document.
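To illustrate how the Query and Assert tabs translate into code, here is a minimal sketch that reads a shopping cart. The `ExtractionAgent` interface is a structural stand-in for the real agent, the single-string signatures are an assumption, and the prompts and the `readCart` helper are illustrative.

```typescript
// Structural subset of the agent; method names come from the tabs described above,
// and the single-string signatures are an assumption of this sketch.
interface ExtractionAgent {
  aiQuery<T>(query: string): Promise<T>;
  aiNumber(prompt: string): Promise<number>;
  aiBoolean(prompt: string): Promise<boolean>;
  aiAssert(assertion: string): Promise<unknown>;
}

interface CartSummary {
  items: string[];
  total: number;
  checkoutVisible: boolean;
}

async function readCart(agent: ExtractionAgent): Promise<CartSummary> {
  // Query: describe the shape you want and get JSON back.
  const items = await agent.aiQuery<string[]>("string[], names of the items in the cart");
  // Typed shortcuts for single values.
  const total = await agent.aiNumber("the cart total in dollars");
  const checkoutVisible = await agent.aiBoolean("is a Checkout button visible?");
  // Assert: the returned promise rejects when the condition does not hold.
  await agent.aiAssert("the cart badge count matches the number of listed items");
  return { items, total, checkoutVisible };
}
```

Writing the function against a small interface rather than the concrete agent class keeps it easy to exercise with a fake agent in unit tests.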
Integration with Midscene Agent
Once Playground works, move to a repeatable script with the JavaScript SDK.
Step 1. Install dependencies
Step 2. Write your first script
Create example.ts:
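A minimal sketch of what such a script might contain is shown below. The `FirstScriptAgent` interface is a structural stand-in for the agent returned by `agentForComputer()`, the single-string signatures are assumed, and the prompts are placeholders to adapt to your own application.

```typescript
// Structural subset of the agent used by this script; the real agent comes from
// agentForComputer() in @midscene/computer. Single-string signatures are assumed.
interface FirstScriptAgent {
  aiAct(action: string): Promise<unknown>;
  aiTap(target: string): Promise<unknown>;
  aiAssert(assertion: string): Promise<unknown>;
}

// Prompts are illustrative; adapt them to the app you are automating.
async function runExample(agent: FirstScriptAgent): Promise<void> {
  // Auto Planning: the model breaks the instruction into steps.
  await agent.aiAct("open the system text editor and type 'Hello from Midscene'");
  // Instant Action: a single click on a described element.
  await agent.aiTap("the Save button or menu item");
  // Fail the run if the expected state is not visible.
  await agent.aiAssert("a save dialog is visible");
}

// In example.ts, with @midscene/computer installed:
//   import { agentForComputer } from "@midscene/computer";
//   const agent = await agentForComputer();
//   await runExample(agent);
```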
Step 3. Run the script
Connect to a Remote Windows Desktop via RDP
@midscene/computer can also drive a remote Windows desktop directly over RDP through the dedicated agentForRDPComputer() factory.
Prerequisites
- A reachable Windows machine with RDP enabled.
- FreeRDP installed on the machine running your script.
Example
Common RDP Options
- host: Remote Windows host or IP.
- port: RDP port. Defaults to 3389.
- username / password: Account credentials for the remote session.
- domain: Optional Windows domain.
- ignoreCertificate: Skip certificate validation for self-signed setups.
- desktopWidth / desktopHeight: Request a specific remote desktop resolution.
- adminSession: Request the remote admin session when the server allows it.
RDP sessions are exposed to Midscene as a single remote display. You can still use the same aiAct, aiQuery, aiAssert, and report features as local desktop automation.
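Credentials for a remote session are usually better kept out of source files. The sketch below builds an options object from environment variables, using the option names listed above. The `RDP_*` variable names and the `rdpOptionsFromEnv` helper are hypothetical, and the value types in `RdpConnectionOptions` are assumptions.

```typescript
// Field names follow the option list above; the value types are assumptions.
interface RdpConnectionOptions {
  host: string;
  port: number;
  username?: string;
  password?: string;
  domain?: string;
  ignoreCertificate?: boolean;
}

// RDP_* variable names are hypothetical; use whatever your secrets store provides.
function rdpOptionsFromEnv(env: Record<string, string | undefined>): RdpConnectionOptions {
  const host = env["RDP_HOST"];
  if (!host) throw new Error("RDP_HOST is required");
  const port = env["RDP_PORT"] ? Number(env["RDP_PORT"]) : 3389; // documented default
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid RDP port: ${env["RDP_PORT"]}`);
  }
  return {
    host,
    port,
    username: env["RDP_USERNAME"],
    password: env["RDP_PASSWORD"],
    domain: env["RDP_DOMAIN"],
    // Only skip certificate validation when explicitly requested.
    ignoreCertificate: env["RDP_IGNORE_CERT"] === "true",
  };
}
```

The returned object can be passed to `agentForRDPComputer()`. Leaving `ignoreCertificate` off by default keeps certificate validation in place unless a self-signed setup explicitly opts out.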
Multi-Display Support
If you have multiple displays, you can control a specific one:
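Choosing the display can be separated from creating the agent. The sketch below picks an id from the list returned by `ComputerDevice.listDisplays()`. The `DisplayInfoLike` shape (id, name, optional primary flag) is an assumption about `DisplayInfo`, and `chooseDisplayId` is a hypothetical helper.

```typescript
// Assumed shape of DisplayInfo (id, name, optional primary flag); check the
// TypeScript types shipped with @midscene/computer before relying on it.
interface DisplayInfoLike {
  id: string;
  name: string;
  primary?: boolean;
}

// Prefer a display whose name contains the hint, then the primary display,
// then the first one listed. Returns undefined when no displays are reported.
function chooseDisplayId(displays: DisplayInfoLike[], nameHint?: string): string | undefined {
  const hint = nameHint?.toLowerCase();
  const byName = hint
    ? displays.find((d) => d.name.toLowerCase().includes(hint))
    : undefined;
  const chosen = byName ?? displays.find((d) => d.primary) ?? displays[0];
  return chosen?.id;
}
```

The chosen id would then be passed as the `displayId` option of `agentForComputer()`.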
Example Usage
Basic Mouse Operations
Keyboard Operations
Query Information
Complex Workflows
Environment Check
You can check if your system is properly configured:
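The check is most useful when its result is turned into an early, readable failure. The sketch below formats a result object; the `EnvironmentCheckLike` field names are an assumption about what `checkComputerEnvironment()` returns, and `describeEnvironment` is a hypothetical helper.

```typescript
// Assumed shape of EnvironmentCheck; the field names here are illustrative.
interface EnvironmentCheckLike {
  available: boolean;
  platform: string;
  displays: number;
  error?: string;
}

// Return a one-line summary, or throw with the reported reason when unavailable.
function describeEnvironment(check: EnvironmentCheckLike): string {
  if (!check.available) {
    throw new Error(
      `Desktop automation unavailable on ${check.platform}: ${check.error ?? "unknown reason"}`,
    );
  }
  return `${check.platform}: ${check.displays} display(s) ready`;
}
```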
FAQ
macOS: Script cannot control mouse or keyboard
macOS requires Accessibility permissions for keyboard and mouse control. Go to System Settings > Privacy & Security > Accessibility and enable the toggle for the application running your script (e.g., Terminal, iTerm2, VS Code, or WebStorm).
If you have already granted permission but it still doesn't work, try removing the app from the Accessibility list and re-adding it — macOS sometimes caches stale permissions.
Windows: clicks have no effect on some apps
If the cursor moves to the correct position but clicks or key presses do nothing on a particular application — while other apps work fine — check whether the target app is running as Administrator (elevated). Windows UIPI blocks input injected from a lower-privilege process into an elevated window and drops it silently, with no error.
Prefer lowering the target application's privilege level first, for example by launching it without "Run as Administrator" or disabling any setting that always starts it elevated. If the target app must stay elevated, run the terminal or Node.js that launches Midscene as Administrator so it matches the target app's privilege level, then try again. System-level shortcuts such as Win+Tab are handled by the shell and keep working even when this happens, which is why keyboard shortcuts may appear to work while in-app clicks do not.
The health check logged at connection time prints this troubleshooting link when Midscene is not running as Administrator on Windows.
Linux: Screenshots or interactions fail on a headless server
A headless Linux environment (e.g. CI) has no physical display. You need to install Xvfb and ImageMagick, and enable headless mode:
Or set the environment variable:
API reference
This section documents the PC desktop-specific APIs provided by @midscene/computer.
For common APIs that work across all platforms, see Common API Reference.
Agent Creation
agentForComputer(opts?): Promise<ComputerAgent>
Create an agent for local desktop automation.
Backward compatibility: agentFromComputer is still available as an alias.
agentForRDPComputer(opts): Promise<ComputerAgent<RDPDevice>>
Create an agent for remote Windows desktop automation over RDP.
Parameters:
Local Desktop Options
- displayId (optional): Specify which display to control. Get available displays with ComputerDevice.listDisplays().
- customActions (optional): Add custom actions to the device.
- headless (optional, Linux only): Set to true to start a virtual display via Xvfb, enabling desktop automation on headless Linux servers and CI environments without a physical display. Can also be set via the MIDSCENE_COMPUTER_HEADLESS_LINUX=true environment variable.
- xvfbResolution (optional): Resolution for the Xvfb virtual display. Defaults to '1920x1080x24'.
RDP Options
- host: Remote Windows host or IP.
- port: RDP port. Defaults to 3389.
- username / password: Credentials for the remote session.
- domain: Optional Windows domain.
- adminSession: Request the remote admin session when the server allows it.
- ignoreCertificate: Skip certificate validation for self-signed setups.
- securityProtocol: Choose 'auto', 'tls', 'nla', or 'rdp'.
- desktopWidth / desktopHeight: Request a specific remote desktop resolution.
A complete demo of testing Obsidian (an Electron app) on headless Linux CI with @midscene/computer: https://github.com/web-infra-dev/midscene-example/tree/main/computer/electron-demo
Example:
Example: connect to a remote Windows desktop over RDP
A runnable demo that connects to a remote Windows machine over RDP, opens Settings, navigates into Windows Update, and emits a structured report: https://github.com/web-infra-dev/midscene-example/tree/main/computer/rdp-demo
Use localAddress only when the machine running Midscene has multiple outbound
routes and the RDP server must be reached from a specific local source IP. Pass
an IP address, not a network interface name.
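Because an interface name such as eth0 is easy to pass by mistake, a small guard can reject it before connecting. This sketch uses Node's built-in `net.isIP`; the helper `requireIpLocalAddress` is hypothetical and not part of Midscene.

```typescript
import { isIP } from "node:net";

// isIP returns 4 or 6 for valid IPv4/IPv6 addresses and 0 for anything else.
function requireIpLocalAddress(value: string): string {
  if (isIP(value) === 0) {
    throw new Error(`localAddress must be an IP address, got "${value}"`);
  }
  return value;
}
```

The validated value can then be used as the `localAddress` option when calling `agentForRDPComputer()`.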
Device Management
ComputerDevice.listDisplays(): Promise<DisplayInfo[]>
List all available displays.
Returns:
Example:
checkComputerEnvironment(): Promise<EnvironmentCheck>
Check if the computer environment is properly configured.
Returns:
Example:
ComputerAgent
The ComputerAgent class extends PageAgent<ComputerDevice> and inherits all common agent methods:
- aiAct(action: string): Perform an action with AI
- aiQuery(query: string): Extract information with AI
- aiAssert(assertion: string): Assert a condition with AI
- aiWaitFor(condition: string): Wait for a condition
- aiLocate(description: string): Locate an element
- And more...
Instant actions are also available for direct, deterministic control once an element is located:
- aiTap(), aiDoubleClick(), aiRightClick(), aiHover(): Mouse actions
- aiInput(), aiClearInput(), aiKeyboardPress(): Keyboard actions
- aiScroll(): Scroll action
See Common API Reference for details.
Available Actions
The ComputerDevice supports the following actions:
Mouse Actions
Tap (Click)
Single click at the target location.
DoubleClick
Double-click at the target location.
RightClick
Right-click to open context menu.
MouseMove (Hover)
Move the mouse to an element — also known as hovering — for example to reveal a hover menu or tooltip.
DragAndDrop
Drag from one location and drop at another.
Keyboard Actions
KeyboardPress
Press keyboard keys with optional modifiers.
Supported keys:
- Regular keys: a-z, 0-9, Enter, Escape, Space, Tab, etc.
- Arrow keys: ArrowUp, ArrowDown, ArrowLeft, ArrowRight
- Function keys: F1-F12
- Modifiers: Command/Cmd (macOS), Control/Ctrl, Alt, Shift, Win (Windows)
- Media keys: VolumeUp, VolumeDown, Mute, etc.
Examples:
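Shortcuts often differ only in the primary modifier, so a cross-platform script can compute it once. In this sketch the key names Command and Control come from the supported-keys list, while joining keys with "+" and phrasing the request through `aiAct` are assumptions. The helpers `primaryModifier`, `shortcut`, and `selectAllAndCopy` are hypothetical.

```typescript
// Key names (Command, Control) come from the supported-keys list; joining keys
// with "+" is an assumption about the accepted combo format.
function primaryModifier(platform: string): "Command" | "Control" {
  return platform === "darwin" ? "Command" : "Control";
}

function shortcut(platform: string, key: string): string {
  return `${primaryModifier(platform)}+${key}`;
}

// Structural subset of the agent; only aiAct is needed here.
interface ShortcutAgent {
  aiAct(action: string): Promise<unknown>;
}

// Select all and copy using the platform's primary modifier.
async function selectAllAndCopy(agent: ShortcutAgent, platform: string): Promise<void> {
  await agent.aiAct(`press ${shortcut(platform, "A")}`);
  await agent.aiAct(`press ${shortcut(platform, "C")}`);
}
```

Passing `process.platform` as the second argument selects Command on macOS and Control on Windows and Linux.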
Input
Type text into an input field.
ClearInput
Clear the content of an input field.
Scroll Actions
Scroll the screen or a specific area.
Display Actions
ListDisplays
Get information about all connected displays.
When you use RDP, ListDisplays returns the current remote session as a single display.
Examples
Open Application and Navigate
Multi-Display Workflow
Web Browser Automation
TypeScript Types
See Also
- Common API Reference - APIs that work across all platforms
- Model Configuration - Configure AI models
- Caching - Improve performance with caching

