iOS Automation Support

Midscene can drive WebDriver tools to support iOS automation.

By adopting a visual-model approach, the automation process works with any app tech stack, whether the app is built natively or with Flutter, React Native, or Lynx. Developers only need to focus on the final experience when debugging UI automation scripts.

The iOS UI automation solution comes with all the features of Midscene:

  • Supports zero-code trial using Playground.
  • Supports JavaScript SDK.
  • Supports automation scripts in YAML format and command-line tools.
  • Supports HTML reports to replay all operation paths.

Showcases

Prompt: Open Twitter and auto-like the first tweet by @midscene_ai

View the full report of this task: report.html

See more showcases: showcases

Understand WebDriverAgent

WebDriver is a standard protocol established by the W3C for browser automation, providing a unified API to control different browsers and applications. The protocol defines how client and server communicate, enabling automation tools to control various user interfaces across platforms.

Thanks to the Appium team and other open source communities, many excellent libraries now expose desktop and mobile device automation through the WebDriver protocol. These tools include:

  • Appium - Cross-platform mobile automation framework
  • WebDriverAgent - Service dedicated to iOS device automation
  • Selenium - Web browser automation tool
  • WinAppDriver - Windows application automation tool

Midscene adapts to the WebDriver protocol, so developers can use AI models to automate any device that supports WebDriver. With this design, Midscene can not only perform traditional operations such as clicking and typing, but also:

  • Understand interface content and context
  • Execute complex multi-step operations
  • Perform intelligent assertions and validations
  • Extract and analyze interface data

On the iOS platform, Midscene connects to iOS devices through WebDriverAgent, allowing you to control iOS apps and the system using natural language descriptions.

This guide walks you through everything required to automate an iOS device with Midscene: connect a real phone through WebDriverAgent, configure model credentials, try the no-code Playground, and run your first JavaScript script.

Set up API keys for model

Set your model configuration in environment variables. See Model strategy for details.

export MIDSCENE_MODEL_BASE_URL="https://replace-with-your-model-service-url/v1"
export MIDSCENE_MODEL_API_KEY="replace-with-your-api-key"
export MIDSCENE_MODEL_NAME="replace-with-your-model-name"
export MIDSCENE_MODEL_FAMILY="replace-with-your-model-family"

For more configuration details, please refer to Model strategy and Model configuration.
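Before launching a script, it can help to fail fast when one of these variables is unset. The following is a minimal sketch and not part of Midscene: the helper name missingModelEnv is hypothetical, and it assumes your setup needs all four variables from the export example above.

```typescript
// Hypothetical helper (not a Midscene API): list model variables that are unset.
// Assumption: your setup needs all four variables from the export example.
const REQUIRED_MODEL_ENV = [
  'MIDSCENE_MODEL_BASE_URL',
  'MIDSCENE_MODEL_API_KEY',
  'MIDSCENE_MODEL_NAME',
  'MIDSCENE_MODEL_FAMILY',
] as const;

function missingModelEnv(env: Record<string, string | undefined>): string[] {
  // Treat empty or whitespace-only values as missing.
  return REQUIRED_MODEL_ENV.filter((key) => !env[key]?.trim());
}

// In a Node.js script you would pass process.env instead of a literal object.
const missing = missingModelEnv({
  MIDSCENE_MODEL_BASE_URL: 'https://replace-with-your-model-service-url/v1',
  MIDSCENE_MODEL_API_KEY: 'replace-with-your-api-key',
});
console.log(missing); // [ 'MIDSCENE_MODEL_NAME', 'MIDSCENE_MODEL_FAMILY' ]
```

Taking the environment as a parameter keeps the check easy to unit test.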

Preparation

Install Node.js

Install Node.js 18 or higher.

Prepare API Key

Prepare an API Key for a visual language (VL) model.

You can find supported models and configurations for Midscene.js in the Model strategy documentation.

Prepare WebDriver Server

Before getting started, you need to set up the iOS development environment:

  • macOS (required for iOS development)
  • Xcode and Xcode command line tools
  • iOS Simulator or real device

Environment Configuration

Before using Midscene iOS, you need to prepare the WebDriverAgent service.

Version Requirement

WebDriverAgent version must be >= 7.0.0

Please refer to the official WebDriverAgent documentation for setup.

Verify Environment Configuration

After completing the configuration, you can verify whether the service is working properly by accessing WebDriverAgent's status endpoint:

Access URL: http://localhost:8100/status

Correct Response Example:

{
  "value": {
    "build": {
      "version": "10.1.1",
      "time": "Sep 24 2025 18:56:41",
      "productBundleIdentifier": "com.facebook.WebDriverAgentRunner"
    },
    "os": {
      "testmanagerdVersion": 65535,
      "name": "iOS",
      "sdkVersion": "26.0",
      "version": "26.0"
    },
    "device": "iphone",
    "ios": {
      "ip": "10.91.115.63"
    },
    "message": "WebDriverAgent is ready to accept commands",
    "state": "success",
    "ready": true
  },
  "sessionId": "BCAD9603-F714-447C-A9E6-07D58267966B"
}

If you can access this endpoint and receive a JSON response similar to the one above, WebDriverAgent is properly configured and running.
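The same check can be scripted. The sketch below is illustrative and not a Midscene API: isWdaUsable and checkWda are hypothetical names. It reads only the ready and build.version fields from the response above and compares the major version against the documented minimum of 7.0.0. It assumes Node.js 18 or later, where fetch is available globally.

```typescript
// Fields this sketch reads from the /status response shown above.
interface WdaStatus {
  value?: {
    ready?: boolean;
    build?: { version?: string };
  };
}

// The documented minimum is 7.0.0, so comparing the major version is enough.
function meetsMinimumVersion(version: string, minMajor = 7): boolean {
  const major = Number.parseInt(version.split('.')[0] ?? '', 10);
  return Number.isInteger(major) && major >= minMajor;
}

// Hypothetical helper: true when WDA reports ready and the version is >= 7.0.0.
function isWdaUsable(status: WdaStatus): boolean {
  const version = status.value?.build?.version;
  return (
    status.value?.ready === true &&
    version !== undefined &&
    meetsMinimumVersion(version)
  );
}

// Fetch the status endpoint and apply the check (defined here, not called).
async function checkWda(baseUrl = 'http://localhost:8100'): Promise<boolean> {
  const response = await fetch(`${baseUrl}/status`);
  if (!response.ok) return false;
  return isWdaUsable((await response.json()) as WdaStatus);
}
```

Calling checkWda() at the top of a script gives an early, readable failure instead of a connection error in the middle of a run.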

Try Playground (no code)

Playground is the fastest way to validate the connection and observe AI-driven steps without writing code. It shares the same core as @midscene/ios, so anything that works here will behave the same once scripted.

  1. Launch the Playground CLI:

npx --yes @midscene/ios-playground

  2. Click the gear button to enter the configuration page and paste your API key config. Refer back to Model configuration if you still need credentials.

Start experiencing

After configuration, you can start using Midscene right away. It provides several key operation tabs:

  • Act: interact with the page. This is Auto Planning, corresponding to aiAct. For example:

Type “Midscene” in the search box, run the search, and open the first result
Fill out the registration form and make sure every field passes validation

  • Query: extract JSON data from the interface, corresponding to aiQuery. Similar methods include aiBoolean(), aiNumber(), and aiString() for directly extracting booleans, numbers, and strings. For example:

Extract the user ID from the page and return JSON data in the { id: string } structure

  • Assert: understand the page and assert; if the condition is not met, throw an error. This corresponds to aiAssert. For example:

There is a login button on the page, with a user agreement link below it

  • Tap: click on an element. This is Instant Action, corresponding to aiTap. For example:

Click the login button

For the difference between Auto Planning and Instant Action, see the API document.

Integration with Midscene Agent

Once Playground works, move to a repeatable script with the JavaScript SDK.

Step 1. Install dependencies

npm install @midscene/ios dotenv --save-dev

Step 2. Write scripts

Save the following code as ./demo.ts. It opens Safari on the device, searches eBay, and asserts the result list.

./demo.ts
import 'dotenv/config'; // load Midscene environment variables from .env if present
import {
  IOSAgent,
  IOSDevice,
  agentFromWebDriverAgent,
} from '@midscene/ios';

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
Promise.resolve(
  (async () => {
    // Method 1: Create device and agent directly
    const page = new IOSDevice({
      wdaPort: 8100,
      wdaHost: 'localhost',
    });

    // 👀 Initialize Midscene agent
    const agent = new IOSAgent(page, {
      aiActionContext:
        'If any location, permission, user agreement, etc. popup appears, click agree. If login page appears, close it.',
    });
    await page.connect();

    // Method 2: Or use convenience function (recommended)
    // const agent = await agentFromWebDriverAgent({
    //   wdaPort: 8100,
    //   wdaHost: 'localhost',
    //   aiActionContext: 'If any location, permission, user agreement, etc. popup appears, click agree. If login page appears, close it.',
    // });

    // 👀 Directly open ebay.com webpage (recommended approach)
    await page.launch('https://ebay.com');
    await sleep(3000);

    // 👀 Enter keywords and perform search
    await agent.aiAct('Search for "Headphones"');

    // 👀 Wait for loading to complete
    await agent.aiWaitFor('At least one headphone product is displayed on the page');
    // Or you can use a simple sleep:
    // await sleep(5000);

    // 👀 Understand page content and extract data
    const items = await agent.aiQuery(
      '{itemTitle: string, price: Number}[], find product titles and prices in the list',
    );
    console.log('Headphone product information', items);

    // 👀 Use AI assertion
    await agent.aiAssert('Multiple headphone products are displayed on the interface');

    await page.destroy();
  })(),
);
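The result of aiQuery comes from a model, so a script that depends on its shape may want a runtime check before using the data. The following is a minimal sketch under that assumption. ProductItem, isProductItem, and keepValidItems are hypothetical names, and the sample array is made-up illustration data, not real output.

```typescript
// Shape requested in the aiQuery prompt of demo.ts.
interface ProductItem {
  itemTitle: string;
  price: number;
}

// Type guard for one entry of that shape.
function isProductItem(value: unknown): value is ProductItem {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.itemTitle === 'string' &&
    typeof record.price === 'number' &&
    Number.isFinite(record.price)
  );
}

// Hypothetical helper: keep only well-formed entries from an untyped result.
function keepValidItems(result: unknown): ProductItem[] {
  return Array.isArray(result) ? result.filter(isProductItem) : [];
}

// In demo.ts this could be: const products = keepValidItems(items);
const sample: unknown = [
  { itemTitle: 'Wireless Headphones', price: 59.99 },
  { itemTitle: 'Mystery item', price: 'N/A' },
];
console.log(keepValidItems(sample).length); // prints 1
```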

Step 3. Run

npx tsx demo.ts

Step 4. View the report

Successful runs print Midscene - report file updated: /path/to/report/some_id.html. Open the generated HTML file in a browser to replay every interaction, query, and assertion.
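In CI it can be useful to pick the report path out of the captured console output, for example to upload the file as a build artifact. The sketch below assumes the log line has exactly the form quoted above and that the path contains no spaces. extractReportPath is a hypothetical helper, not a Midscene API.

```typescript
// Hypothetical helper: pull the report path out of captured console output.
// Assumptions: the log line matches the form quoted above, and the path
// contains no whitespace.
function extractReportPath(output: string): string | undefined {
  const match = output.match(/Midscene - report file updated: (\S+\.html)/);
  return match?.[1];
}

const log = 'Midscene - report file updated: /path/to/report/some_id.html';
console.log(extractReportPath(log)); // prints /path/to/report/some_id.html
```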

API reference and more resources

Looking for constructors, helper methods, and platform-only device APIs? See the iOS API reference below for detailed parameter lists plus advanced topics like custom actions. For API surfaces shared across platforms, head to the common API reference.

FAQ

Why can't I control my device through WebDriverAgent even though it's connected?

Please check the following:

  1. Developer Mode: Ensure it's enabled in Settings > Privacy & Security > Developer Mode
  2. UI Automation: Ensure it's enabled in Settings > Developer > UI Automation
  3. Device Trust: Ensure the device trusts the current Mac

What are the differences between simulators and real devices?

Feature                 Real Device              Simulator
Port Forwarding         Requires iproxy          Not required
Developer Mode          Must enable              Auto-enabled
UI Automation Settings  Must enable manually     Auto-enabled
Performance             Real device performance  Depends on Mac performance
Sensors                 Real hardware            Simulated data

How do I use a custom WebDriverAgent port and host?

You can specify the WebDriverAgent port and host through the IOSDevice constructor or agentFromWebDriverAgent:

// Method 1: Using IOSDevice
const device = new IOSDevice({
  wdaPort: 8100,        // Custom port
  wdaHost: '192.168.1.100', // Custom host
});

// Method 2: Using convenience function (recommended)
const agent = await agentFromWebDriverAgent({
  wdaPort: 8100,        // Custom port
  wdaHost: '192.168.1.100', // Custom host
});

For real devices, you also need to forward the WebDriverAgent port with iproxy on the machine the device is attached to:

iproxy 8100 8100 YOUR_DEVICE_ID
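When the forwarded local port differs from the port WebDriverAgent listens on, the iproxy command and the device options have to agree. The sketch below pairs the two. It assumes the positional iproxy form shown above, where the first number is the local port and the second is the device port, and it assumes WebDriverAgent listens on 8100 on the device. forwardingPlan is a hypothetical helper.

```typescript
// Hypothetical helper: derive matching iproxy and IOSDevice settings.
// Assumptions: positional iproxy syntax (local port, device port, device ID)
// and WebDriverAgent listening on port 8100 on the device.
function forwardingPlan(localPort: number, deviceId: string, devicePort = 8100) {
  return {
    // Run this on the Mac that the device is plugged into.
    command: `iproxy ${localPort} ${devicePort} ${deviceId}`,
    // Pass these to new IOSDevice(...) or agentFromWebDriverAgent(...).
    deviceOptions: { wdaHost: 'localhost', wdaPort: localPort },
  };
}

const plan = forwardingPlan(8300, 'YOUR_DEVICE_ID');
console.log(plan.command); // prints iproxy 8300 8100 YOUR_DEVICE_ID
console.log(plan.deviceOptions.wdaPort); // prints 8300
```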

How do I get a smoother live screen preview in Playground?

Playground's screen preview supports two modes:

  • Polling mode (default): Captures screenshots one by one via the WDA screenshot API, achieving ~5-10fps.
  • Native MJPEG stream (recommended): Proxies WDA's built-in MJPEG Server directly for higher frame rate and lower latency.

To enable the native MJPEG stream, forward the WDA MJPEG Server port (default 9100) to localhost:

# Required for real devices only (simulators don't need this)
iproxy 9100 9100 YOUR_DEVICE_ID

Playground automatically probes port 9100 on startup. If available, the log will show MJPEG: streaming via native WDA MJPEG server; otherwise it falls back to polling mode automatically.
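The probe-then-fallback behavior can be pictured with a small sketch. This is not Playground's implementation. choosePreviewMode is a hypothetical function, and the probe is passed in as a parameter so the decision logic can run without a device.

```typescript
type PreviewMode = 'mjpeg' | 'polling';

// Hypothetical sketch of the fallback decision: use the MJPEG stream when the
// probe succeeds, and fall back to polling when it fails or throws.
async function choosePreviewMode(
  probe: () => Promise<boolean>,
): Promise<PreviewMode> {
  try {
    return (await probe()) ? 'mjpeg' : 'polling';
  } catch {
    // A refused connection or timeout is treated the same as "not available".
    return 'polling';
  }
}

// Example with a probe that reports the port as unreachable.
choosePreviewMode(async () => false).then((mode) => console.log(mode)); // prints polling
```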

More

API reference

Use this doc when you need to customize iOS device behavior, wire Midscene into WebDriverAgent-driven workflows, or troubleshoot WDA requests. For shared constructor options (reporting, hooks, caching, etc.), see the platform-agnostic API reference (Common).

Action Space

IOSDevice uses the following action space; the Midscene Agent can use these actions while planning tasks:

  • Tap — Tap an element.
  • DoubleClick — Double-tap an element.
  • Input — Enter text with replace/typeOnly/clear modes (append is a deprecated alias for typeOnly). Supports optional autoDismissKeyboard parameter.
  • Scroll — Scroll from an element or screen center in any direction, including scroll-to-top/bottom/left/right helpers.
  • DragAndDrop — Drag from one element to another.
  • KeyboardPress — Press a specified key.
  • LongPress — Long-press a target element with optional duration.
  • Pinch — Two-finger pinch gesture. Use scale > 1 to zoom in, scale < 1 to zoom out.
  • ClearInput — Clear the contents of an input field.
  • Launch — Open a URL, bundle identifier, or URL scheme.
  • Terminate — Close a running iOS app by its bundle identifier.
  • RunWdaRequest — Call WebDriverAgent REST endpoints directly.
  • IOSHomeButton — Trigger the iOS system Home action.
  • IOSAppSwitcher — Open the iOS multitasking view.

IOSDevice

Create a WebDriverAgent-backed instance that an IOSAgent can drive.

Import

import { IOSDevice } from '@midscene/ios';

Constructor

const device = new IOSDevice({
  // device options...
});

Device options

  • wdaPort?: number — WebDriverAgent port. Default 8100.
  • wdaHost?: string — WebDriverAgent host. Default 'localhost'.
  • iOSDeviceClassOverride?: string — Optional npm module path that replaces the default IOSDevice when using agentFromWebDriverAgent() or iOS Playground. The module must export an IOSDevice class or a default class.
  • autoDismissKeyboard?: boolean — Hide the keyboard after text input. Default true.
  • customActions?: DeviceAction<any>[] — Additional device actions exposed to the agent.

Usage notes

  • Ensure Developer Mode is enabled and WDA can reach the device; use iproxy when forwarding ports from a real device.
  • Use wdaHost/wdaPort to target remote devices or custom WDA deployments.
  • For shared interaction methods, see API reference (Common).

Examples

Quick start

import { IOSAgent, IOSDevice } from '@midscene/ios';

const device = new IOSDevice({ wdaHost: 'localhost', wdaPort: 8100 });
await device.connect();

const agent = new IOSAgent(device, {
  aiActionContext: 'If any permission dialog appears, accept it.',
});

await agent.launch('https://ebay.com');
await agent.aiAct('Search for "Headphones"');
const items = await agent.aiQuery(
  '{itemTitle: string, price: Number}[], list headphone products',
);
console.log(items);

Custom host and port

const device = new IOSDevice({
  wdaHost: '192.168.1.100',
  wdaPort: 8300,
});
await device.connect();

IOSAgent

Wire Midscene's AI planner to an IOSDevice for UI automation over WebDriverAgent.

Import

import { IOSAgent } from '@midscene/ios';

Constructor

const agent = new IOSAgent(device, {
  // common agent options...
});

iOS-specific options

  • customActions?: DeviceAction<any>[] — Extend planning with actions defined via defineAction.
  • appNameMapping?: Record<string, string> — Map friendly app names to bundle identifiers. When you pass an app name to launch(target) or terminate(bundleId), the agent will look up the bundle ID in this mapping. If no mapping is found, it will attempt to use target as-is. User-provided mappings take precedence over default mappings.
  • All other fields match API constructors: generateReport, reportFileName, aiActionContext, modelConfig, cacheId, createOpenAIClient, onTaskStartTip, and more.
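The lookup order described for appNameMapping can be sketched as follows. This illustrates the documented precedence and is not Midscene's implementation. resolveTarget is a hypothetical function, the app names Settings and Browser are example keys, and exact-match lookup is an assumption.

```typescript
type AppNameMapping = Record<string, string>;

// Look up own keys only, so names like "constructor" are not resolved by accident.
function lookup(mapping: AppNameMapping, key: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(mapping, key) ? mapping[key] : undefined;
}

// Illustrative resolution order: user mapping, then defaults, then target as-is.
function resolveTarget(
  target: string,
  userMapping: AppNameMapping = {},
  defaultMapping: AppNameMapping = {},
): string {
  return lookup(userMapping, target) ?? lookup(defaultMapping, target) ?? target;
}

const defaults: AppNameMapping = { Settings: 'com.apple.Preferences' };
const custom: AppNameMapping = { Browser: 'com.apple.mobilesafari' };

console.log(resolveTarget('Browser', custom, defaults)); // com.apple.mobilesafari
console.log(resolveTarget('Settings', custom, defaults)); // com.apple.Preferences
console.log(resolveTarget('myapp://profile', custom, defaults)); // myapp://profile
```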

Usage notes

iOS-specific methods

agent.launch()

Launch a web URL, native application bundle, or custom scheme.

function launch(target: string): Promise<void>;

  • target: string — Target address (web URL, Bundle Identifier, URL scheme, tel/mailto, etc.) or app name. If you pass an app name and it exists in appNameMapping, it will be automatically resolved to the mapped Bundle ID; otherwise, target will be launched as-is.

await agent.launch('https://www.apple.com');
await agent.launch('com.apple.Preferences');
await agent.launch('myapp://profile/user/123');
await agent.launch('tel:+1234567890');

agent.terminate()

Terminate (close) a running iOS app by its bundle ID.

function terminate(bundleId: string): Promise<void>;

  • bundleId: string — The bundle identifier of the app to terminate (e.g. com.apple.Preferences). If you pass an app name and it exists in appNameMapping, it will be automatically resolved to the mapped Bundle ID.

await agent.terminate('com.apple.Preferences');
await agent.terminate('com.apple.mobilesafari');

agent.runWdaRequest()

Execute raw WebDriverAgent REST calls when you need low-level control.

function runWdaRequest(
  method: string,
  endpoint: string,
  data?: Record<string, any>,
): Promise<any>;

  • method: string — HTTP verb (GET, POST, DELETE, etc.).
  • endpoint: string — WebDriverAgent endpoint path.
  • data?: Record<string, any> — Optional JSON body.

const screen = await agent.runWdaRequest('GET', '/wda/screen');
await agent.runWdaRequest('POST', '/session/test/wda/pressButton', { name: 'home' });

  • agent.home(): Promise<void> — Return to the Home screen.
  • agent.appSwitcher(): Promise<void> — Reveal the multitasking view.
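To make the three runWdaRequest parameters concrete, the sketch below shows how a method, endpoint, and optional JSON body could map onto a plain HTTP request. It is illustrative only and does not describe how Midscene sends requests internally. buildWdaRequest is a hypothetical function, and the base URL assumes the default host and port from this document.

```typescript
interface WdaRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body?: string };
}

// Illustrative only: map (method, endpoint, data) onto an HTTP request.
// Assumption: WebDriverAgent is reachable at the default localhost:8100.
function buildWdaRequest(
  method: string,
  endpoint: string,
  data?: Record<string, unknown>,
  baseUrl = 'http://localhost:8100',
): WdaRequest {
  // Accept endpoints with or without a leading slash.
  const path = endpoint.startsWith('/') ? endpoint : `/${endpoint}`;
  const init: WdaRequest['init'] = {
    method: method.toUpperCase(),
    headers: { 'Content-Type': 'application/json' },
  };
  // Only attach a body when data was provided.
  if (data !== undefined) init.body = JSON.stringify(data);
  return { url: `${baseUrl}${path}`, init };
}

const request = buildWdaRequest('POST', '/session/test/wda/pressButton', { name: 'home' });
console.log(request.url); // prints http://localhost:8100/session/test/wda/pressButton
console.log(request.init.body); // prints {"name":"home"}
```

With Node.js 18 or later, such a request could be sent with fetch(request.url, request.init).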

Helper utilities

agentFromWebDriverAgent()

Connect to WebDriverAgent and return a ready-to-use IOSAgent.

function agentFromWebDriverAgent(
  opts?: PageAgentOpt & IOSDeviceOpt,
): Promise<IOSAgent>;

  • opts?: PageAgentOpt & IOSDeviceOpt — Combine common agent options with IOSDevice settings.
  • Set MIDSCENE_IOS_DEVICE_CLASS_OVERRIDE to apply the same device class override through the environment. An explicit option takes precedence over the environment variable.

import { agentFromWebDriverAgent } from '@midscene/ios';

const agent = await agentFromWebDriverAgent({
  wdaHost: 'localhost',
  wdaPort: 8100,
  iOSDeviceClassOverride: '@your-scope/ios-device',
  aiActionContext: 'Accept permission dialogs automatically.',
});
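The precedence between the iOSDeviceClassOverride option and the MIDSCENE_IOS_DEVICE_CLASS_OVERRIDE variable can be pictured as below. This sketch shows the documented rule and is not Midscene's code. resolveDeviceClassOverride is a hypothetical function, and '@env-scope/ios-device' is a made-up module name.

```typescript
// Illustrative only: the explicit option wins, then the environment variable.
function resolveDeviceClassOverride(
  explicit: string | undefined,
  env: Record<string, string | undefined>,
): string | undefined {
  return explicit ?? env['MIDSCENE_IOS_DEVICE_CLASS_OVERRIDE'];
}

// In a Node.js script you would pass process.env as the second argument.
const envVars = { MIDSCENE_IOS_DEVICE_CLASS_OVERRIDE: '@env-scope/ios-device' };

console.log(resolveDeviceClassOverride('@your-scope/ios-device', envVars)); // @your-scope/ios-device
console.log(resolveDeviceClassOverride(undefined, envVars)); // @env-scope/ios-device
console.log(resolveDeviceClassOverride(undefined, {})); // undefined
```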

Extending custom interaction actions

Extend the Agent's action space by supplying customActions with handlers created via defineAction. These actions appear after the built-in ones and can be called during planning.

import { getMidsceneLocationSchema, z } from '@midscene/core';
import { defineAction } from '@midscene/core/device';
import { agentFromWebDriverAgent } from '@midscene/ios';

const ContinuousClick = defineAction({
  name: 'continuousClick',
  description: 'Click the same target repeatedly',
  paramSchema: z.object({
    locate: getMidsceneLocationSchema(),
    count: z.number().int().positive().describe('How many times to click'),
  }),
  async call({ locate, count }) {
    console.log('click target center', locate.center);
    console.log('click count', count);
  },
});

const agent = await agentFromWebDriverAgent({
  customActions: [ContinuousClick],
});

await agent.aiAct('Click the red button five times');

See also