iOS Automation Support
Midscene can drive WebDriver tools to support iOS automation.
Because Midscene takes a visual-model approach, the automation process works with any app tech stack, whether built with Native, Flutter, React Native, or Lynx. When debugging UI automation scripts, developers only need to focus on the final on-screen experience.
The iOS UI automation solution comes with all the features of Midscene:
- Supports zero-code trial using Playground.
- Supports JavaScript SDK.
- Supports automation scripts in YAML format and command-line tools.
- Supports HTML reports to replay all operation paths.
Showcases
Prompt: Open Twitter and auto-like the first tweet by @midscene_ai
View the full report of this task: report.html
See more showcases: showcases
Understand WebDriverAgent
WebDriver is a standard protocol established by the W3C for browser automation, providing a unified API to control different browsers and applications. The protocol defines how client and server communicate, enabling automation tools to control various user interfaces across platforms.
Through the efforts of the Appium team and other open source communities, the industry now has many excellent libraries that expose desktop and mobile device automation through the WebDriver protocol. These tools include:
- Appium - Cross-platform mobile automation framework
- WebDriverAgent - Service dedicated to iOS device automation
- Selenium - Web browser automation tool
- WinAppDriver - Windows application automation tool
Midscene adapts to the WebDriver protocol, which means developers can use AI models to perform intelligent automated operations on any device that supports WebDriver. Through this design, Midscene can not only perform traditional operations like clicking and typing, but also:
- Understand interface content and context
- Execute complex multi-step operations
- Perform intelligent assertions and validations
- Extract and analyze interface data
On iOS, Midscene connects to devices through WebDriverAgent, allowing you to control iOS apps and the system using natural language descriptions.
This guide walks you through everything required to automate an iOS device with Midscene: connect a real phone through WebDriverAgent, configure model credentials, try the no-code Playground, and run your first JavaScript script.
Control iOS devices with JavaScript: https://github.com/web-infra-dev/midscene-example/blob/main/ios/javascript-sdk-demo
Integrate Vitest for testing: https://github.com/web-infra-dev/midscene-example/tree/main/ios/vitest-demo
Set up API keys for model
Set your model configuration in environment variables. For details, refer to Model strategy and Model configuration.
Preparation
Install Node.js
Install Node.js 18 or higher.
Prepare API Key
Prepare an API Key for a visual language (VL) model.
You can find supported models and configurations for Midscene.js in the Model strategy documentation.
Prepare WebDriver Server
Before getting started, you need to set up the iOS development environment:
- macOS (required for iOS development)
- Xcode and Xcode command line tools
- iOS Simulator or real device
Environment Configuration
Before using Midscene iOS, you need to prepare the WebDriverAgent service.
WebDriverAgent version must be >= 7.0.0
Please refer to the official documentation for setup:
- Simulator Configuration: Run Prebuilt WDA
- Real Device Configuration: Real Device Configuration
Verify Environment Configuration
After completing the configuration, you can verify whether the service is working properly by accessing WebDriverAgent's status endpoint:
Access URL: http://localhost:8100/status
Correct Response Example:
If you can reach this endpoint and receive a JSON response similar to the example, WebDriverAgent is properly configured and running.
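The same check can be scripted. The following is a minimal sketch, not part of Midscene: it assumes the /status payload carries value.ready and, optionally, value.build.version, which is what recent WebDriverAgent builds are understood to return but is not confirmed by this guide. The names checkWdaStatus, meetsMinimum, and MIN_WDA_VERSION are hypothetical.

```typescript
// Hypothetical helper: validates an already-parsed WebDriverAgent /status payload.
// Assumption: the payload looks like { value: { ready: boolean, build?: { version: string } } }.
const MIN_WDA_VERSION = [7, 0, 0] as const; // from "WebDriverAgent version must be >= 7.0.0"

interface WdaStatusCheck {
  ok: boolean;
  reason: string;
}

// Compare a dotted version string against a numeric minimum, segment by segment.
function meetsMinimum(version: string, minimum: readonly number[]): boolean {
  const parts = version.split('.').map((p) => Number.parseInt(p, 10) || 0);
  for (let i = 0; i < minimum.length; i++) {
    const have = parts[i] ?? 0;
    if (have !== minimum[i]) return have > minimum[i];
  }
  return true; // every compared segment is equal
}

function checkWdaStatus(payload: unknown): WdaStatusCheck {
  const value = (payload as { value?: { ready?: unknown; build?: { version?: unknown } } } | null)?.value;
  if (!value || typeof value !== 'object') {
    return { ok: false, reason: 'missing "value" object' };
  }
  if (value.ready !== true) {
    return { ok: false, reason: 'WebDriverAgent is not ready' };
  }
  // The version check is skipped when the payload does not report a version.
  const version = value.build?.version;
  if (typeof version === 'string' && !meetsMinimum(version, MIN_WDA_VERSION)) {
    return { ok: false, reason: `WebDriverAgent ${version} is older than 7.0.0` };
  }
  return { ok: true, reason: 'ready' };
}

// Usage on Node.js 18+ (requires a running WebDriverAgent):
//   const res = await fetch('http://localhost:8100/status');
//   console.log(checkWdaStatus(await res.json()));
```

Keeping the validation separate from the HTTP call makes it easy to reuse the check against a custom host or port.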
Try Playground (no code)
Playground is the fastest way to validate the connection and observe AI-driven steps without writing code. It shares the same core as @midscene/ios, so anything that works here will behave the same once scripted.
- Launch the Playground CLI:
- Click the gear button to enter the configuration page and paste your API key config. Refer back to Model configuration if you still need credentials.

Start experiencing
After configuration, you can start using Midscene right away. It provides several key operation tabs:
- Act: interact with the page. This is Auto Planning, corresponding to aiAct. For example:
- Query: extract JSON data from the interface, corresponding to aiQuery. Similar methods include aiBoolean(), aiNumber(), and aiString() for directly extracting booleans, numbers, and strings.
- Assert: understand the page and assert; if the condition is not met, throw an error, corresponding to aiAssert.
- Tap: click on an element. This is Instant Action, corresponding to aiTap.
For the difference between Auto Planning and Instant Action, see the API document.
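To make the tab-to-method correspondence concrete, here is a small sketch under stated assumptions. PlaygroundAgent is a hypothetical structural type covering only the four methods named above; it assumes each accepts a natural-language prompt and returns a Promise, while the real Midscene signatures may accept further options. runTab and TAB_TO_METHOD are illustrative names, not Midscene APIs.

```typescript
// Hypothetical structural type: only the four methods named in the tab list.
// Assumption: each takes a prompt string and returns a Promise.
interface PlaygroundAgent {
  aiAct(prompt: string): Promise<unknown>;
  aiQuery(prompt: string): Promise<unknown>;
  aiAssert(prompt: string): Promise<unknown>;
  aiTap(prompt: string): Promise<unknown>;
}

type PlaygroundTab = 'Act' | 'Query' | 'Assert' | 'Tap';

// Each Playground tab and the agent method it corresponds to.
const TAB_TO_METHOD: Record<PlaygroundTab, keyof PlaygroundAgent> = {
  Act: 'aiAct',
  Query: 'aiQuery',
  Assert: 'aiAssert',
  Tap: 'aiTap',
};

// Illustrative dispatcher: run a prompt through the method behind a given tab.
function runTab(agent: PlaygroundAgent, tab: PlaygroundTab, prompt: string): Promise<unknown> {
  const method = TAB_TO_METHOD[tab];
  return agent[method](prompt);
}

// With a real agent this might read:
//   await runTab(agent, 'Tap', 'the Search button');
//   await runTab(agent, 'Assert', 'a list of results is visible');
```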
Integration with Midscene Agent
Once Playground works, move to a repeatable script with the JavaScript SDK.
Step 1. Install dependencies
Step 2. Write scripts
Save the following code as ./demo.ts. It opens Safari on the device, searches eBay, and asserts the result list.
Step 3. Run
Step 4. View the report
Successful runs print Midscene - report file updated: /path/to/report/some_id.html. Open the generated HTML file in a browser to replay every interaction, query, and assertion.
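If you capture the script's console output, for example in CI, the report path can be pulled out of that log line. This is a minimal sketch that assumes the line format is exactly the one quoted above; findReportPath is a hypothetical helper, not a Midscene API.

```typescript
// Assumption: the log line has the exact prefix quoted in this guide
// and the report path ends in ".html".
const REPORT_LINE = /Midscene - report file updated: (.+\.html)\s*$/;

// Hypothetical helper: return the most recently logged report path, if any.
function findReportPath(output: string): string | undefined {
  let last: string | undefined;
  for (const line of output.split(/\r?\n/)) {
    const match = REPORT_LINE.exec(line);
    if (match) last = match[1]; // later lines override earlier ones
  }
  return last;
}
```

The helper keeps the last match because the message says the report file was "updated", so the line may be printed more than once during a run.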
API reference and more resources
Looking for constructors, helper methods, and platform-only device APIs? See the iOS API reference below for detailed parameter lists plus advanced topics like custom actions. For API surfaces shared across platforms, head to the common API reference.
FAQ
Why can't I control my device through WebDriverAgent even though it's connected?
Please check the following:
- Developer Mode: Ensure it's enabled in Settings > Privacy & Security > Developer Mode
- UI Automation: Ensure it's enabled in Settings > Developer > UI Automation
- Device Trust: Ensure the device trusts the current Mac
What are the differences between simulators and real devices?
How do I use a custom WebDriverAgent port and host?
You can specify the WebDriverAgent port and host through the IOSDevice constructor or agentFromWebDriverAgent:
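As a rough illustration of how these two options behave, the sketch below fills in the documented defaults (localhost and 8100) and rejects obviously invalid ports. The option names wdaHost and wdaPort come from the Device options list; resolveWdaTarget and wdaStatusUrl are hypothetical helpers, and the calls in the trailing comments assume @midscene/ios is installed.

```typescript
// Option names and defaults mirror the "Device options" list in this document.
interface WdaTargetOptions {
  wdaHost?: string; // default 'localhost'
  wdaPort?: number; // default 8100
}

// Hypothetical helper: apply defaults and validate the port range.
function resolveWdaTarget(opts: WdaTargetOptions = {}): Required<WdaTargetOptions> {
  const wdaHost = opts.wdaHost ?? 'localhost';
  const wdaPort = opts.wdaPort ?? 8100;
  if (!Number.isInteger(wdaPort) || wdaPort < 1 || wdaPort > 65535) {
    throw new RangeError(`Invalid WebDriverAgent port: ${wdaPort}`);
  }
  return { wdaHost, wdaPort };
}

// Hypothetical helper: the status URL to probe for a resolved target.
function wdaStatusUrl(target: Required<WdaTargetOptions>): string {
  return `http://${target.wdaHost}:${target.wdaPort}/status`;
}

// With @midscene/ios installed, the resolved options would be passed on, e.g.:
//   const device = new IOSDevice(resolveWdaTarget({ wdaHost: '192.168.1.100' }));
//   const agent = await agentFromWebDriverAgent(resolveWdaTarget({ wdaPort: 8101 }));
// (agentFromWebDriverAgent is assumed here to return a Promise.)
```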
For remote devices, you also need to set up port forwarding accordingly:
How do I get a smoother live screen preview in Playground?
Playground's screen preview supports two modes:
- Polling mode (default): Captures screenshots one by one via the WDA screenshot API, achieving ~5-10fps.
- Native MJPEG stream (recommended): Proxies WDA's built-in MJPEG Server directly for higher frame rate and lower latency.
To enable the native MJPEG stream, forward the WDA MJPEG Server port (default 9100) to localhost:
Playground automatically probes port 9100 on startup. If available, the log will show MJPEG: streaming via native WDA MJPEG server; otherwise it falls back to polling mode automatically.
More
- For every Agent method, check the API reference (Common).
- For the iOS API reference, see iOS Agent API.
- Demo projects
  - iOS JavaScript SDK demo: https://github.com/web-infra-dev/midscene-example/blob/main/ios/javascript-sdk-demo
  - iOS + Vitest demo: https://github.com/web-infra-dev/midscene-example/tree/main/ios/vitest-demo
API reference
Use this doc when you need to customize iOS device behavior, wire Midscene into WebDriverAgent-driven workflows, or troubleshoot WDA requests. For shared constructor options (reporting, hooks, caching, etc.), see the platform-agnostic API reference (Common).
Action Space
IOSDevice uses the following action space; the Midscene Agent can use these actions while planning tasks:
- Tap - Tap an element.
- DoubleClick - Double-tap an element.
- Input - Enter text with replace / typeOnly / clear modes (append is a deprecated alias for typeOnly). Supports an optional autoDismissKeyboard parameter.
- Scroll - Scroll from an element or screen center in any direction, including scroll-to-top/bottom/left/right helpers.
- DragAndDrop - Drag from one element to another.
- KeyboardPress - Press a specified key.
- LongPress - Long-press a target element with optional duration.
- Pinch - Two-finger pinch gesture. Use scale > 1 to zoom in, scale < 1 to zoom out.
- ClearInput - Clear the contents of an input field.
- Launch - Open a URL, bundle identifier, or URL scheme.
- Terminate - Close a running iOS app by its bundle identifier.
- RunWdaRequest - Call WebDriverAgent REST endpoints directly.
- IOSHomeButton - Trigger the iOS system Home action.
- IOSAppSwitcher - Open the iOS multitasking view.
IOSDevice
Create a WebDriverAgent-backed instance that an IOSAgent can drive.
Import
Constructor
Device options
- wdaPort?: number - WebDriverAgent port. Default 8100.
- wdaHost?: string - WebDriverAgent host. Default 'localhost'.
- iOSDeviceClassOverride?: string - Optional npm module path that replaces the default IOSDevice when using agentFromWebDriverAgent() or iOS Playground. The module must export an IOSDevice class or a default class.
- autoDismissKeyboard?: boolean - Hide the keyboard after text input. Default true.
- customActions?: DeviceAction<any>[] - Additional device actions exposed to the agent.
Usage notes
- Ensure Developer Mode is enabled and WDA can reach the device; use iproxy when forwarding ports from a real device.
- Use wdaHost / wdaPort to target remote devices or custom WDA deployments.
- For shared interaction methods, see API reference (Common).
Examples
Quick start
Custom host and port
IOSAgent
Wire Midscene's AI planner to an IOSDevice for UI automation over WebDriverAgent.
Import
Constructor
iOS-specific options
- customActions?: DeviceAction<any>[] - Extend planning with actions defined via defineAction.
- appNameMapping?: Record<string, string> - Map friendly app names to bundle identifiers. When you pass an app name to launch(target) or terminate(bundleId), the agent will look up the bundle ID in this mapping. If no mapping is found, it will attempt to use target as-is. User-provided mappings take precedence over default mappings.
- All other fields match API constructors: generateReport, reportFileName, aiActionContext, modelConfig, cacheId, createOpenAIClient, onTaskStartTip, and more.
Usage notes
- Use one agent per device connection.
- iOS-only helpers such as launch, terminate, and runWdaRequest are also exposed in YAML scripts. See iOS platform-specific actions.
- For shared interaction methods, see API reference (Common).
iOS-specific methods
agent.launch()
Launch a web URL, native application bundle, or custom scheme.
- target: string - Target address (web URL, Bundle Identifier, URL scheme, tel/mailto, etc.) or app name. If you pass an app name and it exists in appNameMapping, it will be automatically resolved to the mapped Bundle ID; otherwise, target will be launched as-is.
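The lookup order described for appNameMapping can be sketched as follows. This is an illustration of the documented rule only, not Midscene's implementation: resolveLaunchTarget is a hypothetical function, defaultMapping stands in for the built-in mappings (whose contents are not listed here), and exact, case-sensitive name matching is an assumption.

```typescript
// Illustrative only: resolve an app name to a bundle ID using the documented rule.
// Assumption: names are matched exactly (case-sensitive).
function resolveLaunchTarget(
  target: string,
  userMapping: Record<string, string> = {},
  defaultMapping: Record<string, string> = {},
): string {
  // User-provided entries take precedence over defaults with the same name.
  const merged: Record<string, string> = { ...defaultMapping, ...userMapping };
  const hit = Object.prototype.hasOwnProperty.call(merged, target) ? merged[target] : undefined;
  // No mapping found: use the target as-is (URL, bundle ID, URL scheme, ...).
  return hit ?? target;
}

// Example with made-up mappings:
//   resolveLaunchTarget('Settings', {}, { Settings: 'com.apple.Preferences' })
//     returns 'com.apple.Preferences'
//   resolveLaunchTarget('https://ebay.com') returns 'https://ebay.com'
```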
agent.terminate()
Terminate (close) a running iOS app by its bundle ID.
- bundleId: string - The bundle identifier of the app to terminate (e.g. com.apple.Preferences). If you pass an app name and it exists in appNameMapping, it will be automatically resolved to the mapped Bundle ID.
agent.runWdaRequest()
Execute raw WebDriverAgent REST calls when you need low-level control.
- method: string - HTTP verb (GET, POST, DELETE, etc.).
- endpoint: string - WebDriverAgent endpoint path.
- data?: Record<string, any> - Optional JSON body.
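A short sketch of how these parameters fit together, under stated assumptions: WdaRequester is a hypothetical structural type that mirrors the parameter list above and assumes the method returns a Promise of the parsed response. requestWdaStatus and postToWda are illustrative wrappers, and the only endpoint used is /status, the one mentioned earlier in this guide.

```typescript
// Hypothetical structural type mirroring the documented parameters.
// Assumption: the call resolves with the parsed WebDriverAgent response.
interface WdaRequester {
  runWdaRequest(method: string, endpoint: string, data?: Record<string, any>): Promise<unknown>;
}

// Illustrative wrapper: query the /status endpoint without a request body.
function requestWdaStatus(agent: WdaRequester): Promise<unknown> {
  return agent.runWdaRequest('GET', '/status');
}

// Illustrative wrapper: POST a JSON body to an endpoint path chosen by the caller.
function postToWda(
  agent: WdaRequester,
  endpoint: string,
  body: Record<string, unknown>,
): Promise<unknown> {
  return agent.runWdaRequest('POST', endpoint, body);
}

// With a real agent this might read:
//   const status = await requestWdaStatus(agent);
```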
Navigation helpers
- agent.home(): Promise<void> - Return to the Home screen.
- agent.appSwitcher(): Promise<void> - Reveal the multitasking view.
Helper utilities
agentFromWebDriverAgent()
Connect to WebDriverAgent and return a ready-to-use IOSAgent.
- opts?: PageAgentOpt & IOSDeviceOpt - Combine common agent options with IOSDevice settings.
- Set MIDSCENE_IOS_DEVICE_CLASS_OVERRIDE to apply the same device class override through the environment. An explicit option takes precedence over the environment variable.
Extending custom interaction actions
Extend the Agent's action space by supplying customActions with handlers created via defineAction. These actions appear after the built-in ones and can be called during planning.
See also
- Integrate with any interface for custom actions and schemas.

