Android Automation Support

Midscene can drive adb tools to support Android automation.

By adopting a vision-model approach, the automation works with any app tech stack, whether the app is built with Native, Flutter, React Native, or Lynx. When debugging UI automation scripts, developers only need to focus on the final on-screen experience.

The Android UI automation solution comes with all the features of Midscene:

  • Supports zero-code trial using Playground.
  • Supports JavaScript SDK.
  • Supports automation scripts in YAML format and command-line tools.
  • Supports HTML reports to replay all operation paths.

Showcases

Prompt: Open the Booking App, search for a hotel in Tokyo for four adults on Christmas, with a score of 8 or above.

View the full report of this task: report.html

See more showcases: showcases

This guide walks you through everything required to automate an Android device with Midscene: connect a real phone over adb, configure model credentials, try the no-code Playground, and run your first JavaScript script.

Set up API keys for model

Set your model configuration through environment variables:

export MIDSCENE_MODEL_BASE_URL="https://replace-with-your-model-service-url/v1"
export MIDSCENE_MODEL_API_KEY="replace-with-your-api-key"
export MIDSCENE_MODEL_NAME="replace-with-your-model-name"
export MIDSCENE_MODEL_FAMILY="replace-with-your-model-family"

For more configuration details, please refer to Model strategy and Model configuration.
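
A missing variable often only surfaces at the first model call. As a minimal sketch, assuming all four variables above are required for your setup, a script could check them before it starts. The helper name `missingModelEnv` is hypothetical and not part of Midscene.

```typescript
// Hypothetical pre-flight check; not part of the Midscene API.
const REQUIRED_MODEL_VARS = [
  'MIDSCENE_MODEL_BASE_URL',
  'MIDSCENE_MODEL_API_KEY',
  'MIDSCENE_MODEL_NAME',
  'MIDSCENE_MODEL_FAMILY',
] as const;

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function missingModelEnv(env: Env): string[] {
  return REQUIRED_MODEL_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === '';
  });
}

// In Node.js you would pass process.env instead of this literal.
const missing = missingModelEnv({
  MIDSCENE_MODEL_BASE_URL: 'https://replace-with-your-model-service-url/v1',
  MIDSCENE_MODEL_API_KEY: 'replace-with-your-api-key',
});
console.log(missing); // the two names that were not provided
```

Taking the environment as a parameter keeps the check easy to test and avoids a hard dependency on Node.js globals.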

Prepare your Android device

Before scripting, confirm adb can talk to your device and the device trusts your machine.

Install adb and set ANDROID_HOME

adb --version

Output like the following indicates success:

Android Debug Bridge version 1.0.41
Version 34.0.4-10411341
Installed as /usr/local/bin//adb
Running on Darwin 24.3.0 (arm64)

Then confirm that ANDROID_HOME is set:

echo $ANDROID_HOME

Any non-empty output means it is configured:

/Users/your_username/Library/Android/sdk

Enable USB debugging and verify the device

In the developer options of the system settings, enable USB debugging (and USB debugging (Security settings), if present), then connect the device via USB.

Verify the connection:

adb devices -l

Example success output:

List of devices attached
s4ey59	device usb:34603008X product:cezanne model:M2006J device:cezan transport_id:3
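
If you want to check this output from a script (for example in CI), the listing can be parsed with a few lines of code. The following is an illustrative sketch rather than a Midscene API: `parseAdbDevices` is a hypothetical helper, and it assumes the whitespace-separated layout shown above.

```typescript
interface AdbDeviceLine {
  serial: string;
  state: string; // e.g. "device", "offline", "unauthorized"
  props: Record<string, string>; // key:value pairs such as model:M2006J
}

// Parses the text printed by `adb devices -l` (hypothetical helper).
function parseAdbDevices(output: string): AdbDeviceLine[] {
  return output
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Skip blank lines, the header, and adb daemon notices that start with "*".
    .filter((line) => line !== '' && !line.startsWith('List of devices') && !line.startsWith('*'))
    .map((line) => {
      const [serial = '', state = '', ...rest] = line.split(/\s+/);
      const props: Record<string, string> = {};
      for (const token of rest) {
        const idx = token.indexOf(':');
        if (idx > 0) props[token.slice(0, idx)] = token.slice(idx + 1);
      }
      return { serial, state, props };
    });
}
```

A device is ready for automation when its state is `device`; a state such as `unauthorized` usually means the debugging prompt on the phone has not been accepted yet.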

Try Playground (no code)

Playground is the fastest way to validate the connection and observe AI-driven steps without writing code. It shares the same core as @midscene/android, so anything that works here will behave the same once scripted.

  1. Launch the Playground CLI:

npx --yes @midscene/android-playground

  2. Click the gear icon in the Playground window, then paste your API key configuration. Refer back to Model configuration if you still need credentials.

Start exploring

After configuration, you can start using Midscene right away. Playground provides several key operation tabs:

  • Act: interact with the page. This is Auto Planning, corresponding to aiAct. For example:
      Type “Midscene” in the search box, run the search, and open the first result
      Fill out the registration form and make sure every field passes validation
  • Query: extract JSON data from the interface, corresponding to aiQuery. Similar methods include aiBoolean(), aiNumber(), and aiString() for directly extracting booleans, numbers, and strings. For example:
      Extract the user ID from the page and return JSON data in the { id: string } structure
  • Assert: understand the page and check a condition, corresponding to aiAssert. If the condition is not met, an error is thrown. For example:
      There is a login button on the page, with a user agreement link below it
  • Tap: click on an element. This is Instant Action, corresponding to aiTap. For example:
      Click the login button

For the difference between Auto Planning and Instant Action, see the API document.
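
To make the tab-to-method correspondence concrete, here is a rough sketch that replays the example prompts through the four methods named above. `TabAgent` is a hypothetical structural type introduced only for this illustration; the real signatures in @midscene/android may accept more options or return different types.

```typescript
// Hypothetical minimal type covering only the four methods named above.
// The actual signatures in @midscene/android may differ.
interface TabAgent {
  aiAct(prompt: string): Promise<unknown>;
  aiQuery(prompt: string): Promise<unknown>;
  aiAssert(prompt: string): Promise<unknown>;
  aiTap(prompt: string): Promise<unknown>;
}

// Runs one example prompt per Playground tab, in order, and returns the query result.
async function replayPlaygroundTabs(agent: TabAgent): Promise<unknown> {
  // Act tab (Auto Planning)
  await agent.aiAct('Type "Midscene" in the search box, run the search, and open the first result');
  // Query tab
  const user = await agent.aiQuery(
    'Extract the user ID from the page and return JSON data in the { id: string } structure',
  );
  // Assert tab: expected to throw if the condition is not met
  await agent.aiAssert('There is a login button on the page, with a user agreement link below it');
  // Tap tab (Instant Action)
  await agent.aiTap('Click the login button');
  return user;
}
```

Because the function depends only on the method names, it can be exercised with a stub object before pointing it at a real agent.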

Integration with Midscene Agent

Once Playground works, move to a repeatable script with the JavaScript SDK.

Step 1. Install dependencies

npm install @midscene/android dotenv --save-dev

Step 2. Write scripts

Save the following code as ./demo.ts. It opens the browser on the device, searches eBay, and asserts the result list.

./demo.ts
import 'dotenv/config'; // load Midscene environment variables from .env if present
import {
  AndroidAgent,
  AndroidDevice,
  getConnectedDevices,
} from '@midscene/android';

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
Promise.resolve(
  (async () => {
    const devices = await getConnectedDevices();
    const device = new AndroidDevice(devices[0].udid);

    const agent = new AndroidAgent(device, {
      aiActionContext:
        'If any location, permission, user agreement, etc. popup, click agree. If login page pops up, close it.',
    });
    await device.connect();

    await agent.aiAct('open browser and navigate to ebay.com');
    await sleep(5000);
    await agent.aiAct('type "Headphones" in search box, hit Enter');
    await agent.aiWaitFor('There is at least one headphone product');

    const items = await agent.aiQuery(
      '{itemTitle: string, price: number}[], find item in list and corresponding price',
    );
    console.log('headphones in stock', items);

    await agent.aiAssert('There is a category filter on the left');
  })(),
);

Step 3. Run

npx tsx demo.ts

Step 4. View the report

Successful runs print Midscene - report file updated: /path/to/report/some_id.html. Open the generated HTML file in a browser to replay every interaction, query, and assertion.
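
In CI it can be handy to pick the report path out of the captured log so it can be uploaded as an artifact. The sketch below assumes the log line keeps the exact wording quoted above; `extractReportPath` is a hypothetical helper, not something Midscene provides.

```typescript
// Extracts the report path from a log line such as
// "Midscene - report file updated: /path/to/report/some_id.html".
// Returns null when the line does not match (hypothetical helper).
function extractReportPath(logLine: string): string | null {
  const match = /Midscene - report file updated:\s*(\S.*\.html)\s*$/.exec(logLine);
  return match?.[1] ?? null;
}
```

If the log format changes in a later release, the function returns null instead of a wrong path, so callers can fail loudly.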

Advanced

Use this section when you need to customize device behavior, wire Midscene into your framework, or troubleshoot adb issues. For detailed constructor parameters, jump to the API reference (Android).

Extend Midscene on Android

Use defineAction() for custom gestures and pass them through customActions. Midscene will append them to the planner so AI can call your domain-specific action names.

import { getMidsceneLocationSchema, z } from '@midscene/core';
import { defineAction } from '@midscene/core/device';
import { AndroidAgent, AndroidDevice, getConnectedDevices } from '@midscene/android';

const ContinuousClick = defineAction({
  name: 'continuousClick',
  description: 'Click the same target repeatedly',
  paramSchema: z.object({
    locate: getMidsceneLocationSchema(),
    count: z.number().int().positive().describe('How many times to click'),
  }),
  async call(param) {
    const { locate, count } = param;
    console.log('click target center', locate.center);
    console.log('click count', count);
  },
});

const devices = await getConnectedDevices();
const device = new AndroidDevice(devices[0].udid);
await device.connect();

const agent = new AndroidAgent(device, {
  customActions: [ContinuousClick],
});

await agent.aiAct('click the red button five times');

See Integrate with any interface for a deeper explanation of custom actions and action schemas.
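
The call handler in the example above only logs its parameters. One possible way to perform the taps, sketched under the assumption that locate.center is an [x, y] pair in device pixels, is to build `input tap` shell commands and send each one through an adb shell (for instance with agent.runAdbShell(), described in the API reference). `buildTapCommands` is a hypothetical helper.

```typescript
// Builds the `input tap` shell commands for a repeated tap.
// Assumes `center` is an [x, y] pixel pair, like `locate.center` in the example above.
function buildTapCommands(center: [number, number], count: number): string[] {
  if (!Number.isInteger(count) || count <= 0) {
    throw new RangeError('count must be a positive integer');
  }
  // `input tap` expects integer coordinates, so round defensively.
  const x = Math.round(center[0]);
  const y = Math.round(center[1]);
  return Array.from({ length: count }, () => `input tap ${x} ${y}`);
}
```

Keeping command construction separate from execution makes the gesture easy to unit test without a device attached.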

More

Complete example (Vitest + AndroidAgent)

import {
  AndroidAgent,
  AndroidDevice,
  getConnectedDevices,
} from '@midscene/android';
import type { TestStatus } from '@midscene/core';
import { ReportMergingTool } from '@midscene/core/report';
import { sleep } from '@midscene/core/utils';
import type ADB from 'appium-adb';
import {
  afterAll,
  afterEach,
  beforeAll,
  beforeEach,
  describe,
  it,
} from 'vitest';

describe('Android Settings Test', () => {
  let page: AndroidDevice;
  let adb: ADB;
  let agent: AndroidAgent;
  let startTime: number;
  let itTestStatus: TestStatus = 'passed';
  const reportMergingTool = new ReportMergingTool();

  beforeAll(async () => {
    const devices = await getConnectedDevices();
    page = new AndroidDevice(devices[0].udid);
    adb = await page.getAdb();
  });

  beforeEach((ctx) => {
    startTime = performance.now();
    agent = new AndroidAgent(page, {
      groupName: ctx.task.name,
    });
  });

  afterEach((ctx) => {
    if (ctx.task.result?.state === 'pass') {
      itTestStatus = 'passed';
    } else if (ctx.task.result?.state === 'skip') {
      itTestStatus = 'skipped';
    } else if (ctx.task.result?.errors?.[0].message.includes('timed out')) {
      itTestStatus = 'timedOut';
    } else {
      itTestStatus = 'failed';
    }
    reportMergingTool.append({
      reportFilePath: agent.reportFile as string,
      reportAttributes: {
        testId: `${ctx.task.name}`,
        testTitle: `${ctx.task.name}`,
        testDescription: 'description',
        testDuration: (Date.now() - ctx.task.result?.startTime!) | 0,
        testStatus: itTestStatus,
      },
    });
  });

  afterAll(() => {
    reportMergingTool.mergeReports('my-android-setting-test-report');
  });

  it('toggle wlan', async () => {
    await adb.shell('input keyevent KEYCODE_HOME');
    await sleep(1000);
    await adb.shell('am start -n com.android.settings/.Settings');
    await sleep(1000);
    await agent.aiAct('find and enter WLAN setting');
    await agent.aiAct(
      'toggle WLAN status *once*, if WLAN is off pls turn it on, otherwise turn it off.',
    );
  });

  it('toggle bluetooth', async () => {
    await adb.shell('input keyevent KEYCODE_HOME');
    await sleep(1000);
    await adb.shell('am start -n com.android.settings/.Settings');
    await sleep(1000);
    await agent.aiAct('find and enter bluetooth setting');
    await agent.aiAct(
      'toggle bluetooth status *once*, if bluetooth is off pls turn it on, otherwise turn it off.',
    );
  });
});

Tip: Merged reports are stored in midscene_run/report by default. Override the directory with MIDSCENE_RUN_DIR when running in CI.
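
The status mapping inside afterEach can also be pulled into a small pure function, which makes it testable on its own. This is a sketch: `ReportStatus` is a local stand-in for the four status strings used in the example, and `TaskResultLike` models only the fields the example reads from Vitest. Unlike the example, it checks every error message rather than only the first.

```typescript
// Local stand-in for the four status strings used in the example above.
type ReportStatus = 'passed' | 'failed' | 'timedOut' | 'skipped';

// Only the fields of the Vitest task result that the example reads.
interface TaskResultLike {
  state?: string;
  errors?: Array<{ message?: string }>;
}

// Maps a Vitest task result to a report status (hypothetical helper).
function toReportStatus(result: TaskResultLike | undefined): ReportStatus {
  if (result?.state === 'pass') return 'passed';
  if (result?.state === 'skip') return 'skipped';
  // Optional chaining keeps this safe when there are no errors or no message.
  const timedOut = result?.errors?.some((e) => e.message?.includes('timed out') ?? false) ?? false;
  return timedOut ? 'timedOut' : 'failed';
}
```

In the test file, the afterEach hook could then call toReportStatus(ctx.task.result) instead of the if/else chain.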

FAQ

Why can't I control the device even though I've connected it?

A common error is:

Error:
Exception occurred while executing 'tap':
java.lang.SecurityException: Injecting input events requires the caller (or the source of the instrumentation, if any) to have the INJECT_EVENTS permission.

Make sure the device is unlocked and that USB debugging is enabled in developer options (including USB debugging (Security settings), if the device offers it).

Text input is cleared or lost after typing

After entering text, Midscene automatically dismisses the keyboard. The default behavior sends an ESC key event. However, some input fields (especially those inside WebView) listen for the ESC key event, which can cause side effects such as:

  • Clearing the text just entered
  • Closing the popup/modal containing the input field
  • Navigating away from the current page

You can try the following solutions in order of priority:

Option 1: Use the BACK key (Android back button) to dismiss the keyboard

Set keyboardDismissStrategy to 'back-first' to use the Android BACK key instead of ESC to dismiss the keyboard:

const device = new AndroidDevice('device-id', {
  keyboardDismissStrategy: 'back-first',
});

Option 2: Disable auto keyboard dismiss

If your input field also listens for the BACK key, you can disable auto keyboard dismiss entirely and let the AI Agent or subsequent actions manage the keyboard state:

const device = new AndroidDevice('device-id', {
  autoDismissKeyboard: false,
});

With auto dismiss disabled, the keyboard will remain visible and may cover a large portion of the screen. You can work around this by:

  • Using aiAct to manually dismiss the keyboard, e.g. await agent.aiAct('tap the collapse button on the keyboard')
  • Installing and switching to ADBKeyBoard — a minimal virtual keyboard that takes up very little screen space, so it barely affects screen interactions even when visible

English text is rewritten by the Android keyboard

If the report shows the correct input parameter, but the app receives different text, missing text, or Chinese/pinyin candidates, the active Android input method may be rewriting the text. This can happen when pure ASCII text goes through the native adb input text path while a Chinese keyboard or autocorrect keyboard is active.

Use the imeStrategy option to force all text input through yadb:

const device = new AndroidDevice('device-id', {
  imeStrategy: 'always-yadb',
});

For YAML scripts:

android:
  imeStrategy: always-yadb

Or set the environment variable:

export MIDSCENE_ANDROID_IME_STRATEGY=always-yadb

This is different from text being cleared after typing. If the text is entered correctly and then disappears, check keyboardDismissStrategy or autoDismissKeyboard instead.
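
To reason about which input path a given string is likely to take, the documented rule for imeStrategy can be approximated in a few lines. This is only a rough illustration: Midscene's real detection logic may differ, the format-specifier pattern is an assumption, and `wouldUseYadb` is a hypothetical name.

```typescript
type ImeStrategy = 'always-yadb' | 'yadb-for-non-ascii';

// Rough approximation of the documented rule; the real detection may differ.
function wouldUseYadb(text: string, strategy: ImeStrategy = 'yadb-for-non-ascii'): boolean {
  if (strategy === 'always-yadb') return true;
  // Any character outside the 7-bit ASCII range (e.g. ö, é, ñ, CJK).
  const hasNonAscii = /[^\x00-\x7F]/.test(text);
  // Format specifiers such as %s or %d (assumed pattern: "%" followed by a letter).
  const hasFormatSpecifier = /%[a-zA-Z]/.test(text);
  return hasNonAscii || hasFormatSpecifier;
}
```

Under this approximation, plain English text returns false with the default strategy, which is exactly the case where an active Chinese or autocorrect keyboard can interfere.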

How do I use a custom adb path or remote adb server?

Set the environment variables first:

export MIDSCENE_ADB_PATH=/path/to/adb
export MIDSCENE_ADB_REMOTE_HOST=192.168.1.100
export MIDSCENE_ADB_REMOTE_PORT=5037

You can also provide the same information via the constructor:

const device = new AndroidDevice('s4ey59', {
  androidAdbPath: '/path/to/adb',
  remoteAdbHost: '192.168.1.100',
  remoteAdbPort: 5037,
});
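
If a script needs to support both styles, the environment variables can be read into the constructor option shape shown above. In this sketch, explicit overrides win over the environment; that precedence is a choice made for the example, not a statement about how Midscene resolves conflicts, and `adbOptionsFromEnv` is a hypothetical helper.

```typescript
interface AdbConnectionOptions {
  androidAdbPath?: string;
  remoteAdbHost?: string;
  remoteAdbPort?: number;
}

// Builds constructor options from the environment variables shown above.
// Explicit overrides win over the environment (a choice made for this sketch).
function adbOptionsFromEnv(
  env: Record<string, string | undefined>,
  overrides: AdbConnectionOptions = {},
): AdbConnectionOptions {
  const result: AdbConnectionOptions = {};
  const adbPath = env['MIDSCENE_ADB_PATH'];
  if (adbPath) result.androidAdbPath = adbPath;
  const host = env['MIDSCENE_ADB_REMOTE_HOST'];
  if (host) result.remoteAdbHost = host;
  // Environment values are strings; only accept a positive integer port.
  const port = Number(env['MIDSCENE_ADB_REMOTE_PORT']);
  if (Number.isInteger(port) && port > 0) result.remoteAdbPort = port;
  return { ...result, ...overrides };
}
```

The result could then be passed as the second argument to new AndroidDevice(deviceId, options); in Node.js, pass process.env as the first argument.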

API reference

Use this doc when you need to customize Midscene's Android automation or review Android-only constructor options. For shared parameters (reporting, hooks, caching, etc.), see the platform-agnostic API reference (Common).

Action Space

AndroidDevice uses the following action space; the Midscene Agent can use these actions while planning tasks:

  • Tap — Tap an element.
  • DoubleClick — Double-tap an element.
  • Input — Enter text with replace/typeOnly/clear modes (append is a deprecated alias for typeOnly). Supports optional autoDismissKeyboard parameter.
  • Scroll — Scroll from an element or screen center in any direction, with helpers to reach the top, bottom, left, or right.
  • DragAndDrop — Drag from one element to another.
  • KeyboardPress — Press a specified key.
  • LongPress — Long-press a target element with optional duration.
  • PullGesture — Pull up or down (e.g., to refresh) with optional distance and duration.
  • Pinch — Two-finger pinch gesture. Use scale > 1 to zoom in, scale < 1 to zoom out.
  • ClearInput — Clear the contents of an input field.
  • Launch — Open a web URL or package/.Activity string.
  • Terminate — Force-stop an app by package name.
  • RunAdbShell — Execute raw adb shell commands.
  • AndroidBackButton — Trigger the system back action.
  • AndroidHomeButton — Return to the home screen.
  • AndroidRecentAppsButton — Open the multitasking/recent apps view.

AndroidDevice

Create a connection to an adb-managed device that an AndroidAgent can drive.

Import

import { AndroidDevice, getConnectedDevices } from '@midscene/android';

Constructor

const device = new AndroidDevice(deviceId, {
  // device options...
});

Device options

  • deviceId: string — Value returned by adb devices or getConnectedDevices().
  • autoDismissKeyboard?: boolean — Automatically hide the keyboard after input. Default true.
  • keyboardDismissStrategy?: 'esc-first' | 'back-first' — Order for dismissing keyboards. Default 'esc-first'.
  • androidAdbPath?: string — Custom path to the adb executable.
  • remoteAdbHost?: string / remoteAdbPort?: number — Point to a remote adb server.
  • imeStrategy?: 'always-yadb' | 'yadb-for-non-ascii' — Choose when to invoke yadb for text input. Default 'yadb-for-non-ascii'.
    • 'yadb-for-non-ascii' (default) — Uses yadb for Unicode characters (including Latin Unicode like ö, é, ñ), Chinese, Japanese, and format specifiers (like %s, %d). Pure ASCII text uses the faster native adb input text.
    • 'always-yadb' — Always uses yadb for all text input, providing maximum compatibility but slightly slower for pure ASCII text.
  • displayId?: number — Target a specific virtual display if the device mirrors multiple displays.
  • screenshotResizeScale?: number — Deprecated. This option has been removed and no longer has any effect. Use screenshotShrinkFactor in AgentOpt instead to control the size of screenshots sent to the AI model.
  • minScreenshotBufferSize?: number — Screenshot buffer size validation threshold in bytes. Buffers below this value are treated as failed or corrupted captures. Default 1024 (1KB). Set to 0 to skip only this size check; Midscene still rejects empty buffers and invalid image formats.
  • alwaysRefreshScreenInfo?: boolean — Re-query rotation and screen size every step. Default false.
  • scrcpyConfig?: object — Scrcpy high-performance screenshot configuration, disabled by default. See Scrcpy Screenshot Mode below.

Scrcpy Screenshot Mode

By default, Midscene captures screenshots via adb shell screencap, which takes ~500–2000ms per call. Enabling Scrcpy mode streams H.264 video from the device and captures frames in real time, reducing screenshot latency to approximately 100–200ms.

How to enable:

const device = new AndroidDevice(deviceId, {
  scrcpyConfig: {
    enabled: true,
  },
});

Optional parameters:

Parameter      Type     Default  Description
enabled        boolean  false    Enable Scrcpy screenshots
maxSize        number   0        Max video dimension (width or height). 0 = no scaling
videoBitRate   number   2000000  H.264 encoding bitrate (bps)
idleTimeoutMs  number   30000    Auto-disconnect after idle (ms). Set to 0 to disable

Tip: Scrcpy mode automatically falls back to ADB screenshots if the connection fails. No extra error handling is needed.
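
When only some fields are set, it helps to see the effective configuration. The sketch below merges a partial config with the defaults listed above. The `ScrcpyConfig` interface and `withScrcpyDefaults` helper are hypothetical, and how Midscene merges defaults internally is an assumption.

```typescript
interface ScrcpyConfig {
  enabled: boolean;
  maxSize: number;
  videoBitRate: number;
  idleTimeoutMs: number;
}

// Defaults copied from the parameter list above.
const SCRCPY_DEFAULTS: ScrcpyConfig = {
  enabled: false,
  maxSize: 0, // 0 = no scaling
  videoBitRate: 2_000_000, // bps
  idleTimeoutMs: 30_000, // 0 disables auto-disconnect
};

// Fills in any missing field with its documented default (hypothetical helper).
function withScrcpyDefaults(partial: Partial<ScrcpyConfig> = {}): ScrcpyConfig {
  return { ...SCRCPY_DEFAULTS, ...partial };
}

const effective = withScrcpyDefaults({ enabled: true, maxSize: 1024 });
console.log(effective);
```

Here maxSize: 1024 is an arbitrary example value; a smaller maximum dimension trades image detail for less video data.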

Usage notes

  • Discover devices with getConnectedDevices(); the udid matches adb devices.
  • Supports remote adb via remoteAdbHost/remoteAdbPort; set androidAdbPath if adb is not on PATH.
  • Use screenshotShrinkFactor in AgentOpt to cut latency on high-DPI devices.

Examples

Quick start

import { AndroidAgent, AndroidDevice, getConnectedDevices } from '@midscene/android';

const [first] = await getConnectedDevices();
const device = new AndroidDevice(first.udid);
await device.connect();

const agent = new AndroidAgent(device, {
  aiActionContext: 'If a permissions dialog appears, accept it.',
});

await agent.launch('https://www.ebay.com');
await agent.aiAct('search "Headphones" and wait for results');
const items = await agent.aiQuery(
  '{itemTitle: string, price: number}[], find item in list and corresponding price',
);
console.log(items);

Launch native packages

await agent.launch('com.android.settings/.Settings');
await agent.back();
await agent.home();

AndroidAgent

Wire Midscene's AI planner to an AndroidDevice for UI automation.

Import

import { AndroidAgent } from '@midscene/android';

Constructor

const agent = new AndroidAgent(device, {
  // common agent options...
});

Android-specific options

  • customActions?: DeviceAction[] — Extend planning with actions defined via defineAction.
  • appNameMapping?: Record<string, string> — Map friendly app names to package names. When you pass an app name to launch(target), the agent will look up the package name in this mapping. If no mapping is found, it will attempt to launch target as-is. User-provided mappings take precedence over default mappings.
  • All other fields match API constructors: generateReport, reportFileName, aiActionContext, modelConfig, cacheId, createOpenAIClient, onTaskStartTip, and more.
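
The appNameMapping lookup described above can be pictured as a small classification step on the string passed to launch(target). The following sketch is illustrative only: the URL and package/.Activity detection rules, the exact-match lookup, and the `describeLaunchTarget` name are all assumptions, not Midscene's implementation.

```typescript
type LaunchTarget =
  | { kind: 'url'; value: string }
  | { kind: 'activity'; packageName: string; activity: string }
  | { kind: 'package'; packageName: string };

// Classifies a launch target (hypothetical helper; detection rules are assumptions).
function describeLaunchTarget(
  target: string,
  appNameMapping: Record<string, string> = {},
): LaunchTarget {
  if (/^https?:\/\//i.test(target)) return { kind: 'url', value: target };
  const slash = target.indexOf('/');
  if (slash > 0) {
    // package/.Activity form, e.g. com.android.settings/.Settings
    return {
      kind: 'activity',
      packageName: target.slice(0, slash),
      activity: target.slice(slash + 1),
    };
  }
  // Friendly app name: use the mapping if present, otherwise pass the target as-is.
  const hasMapping = Object.prototype.hasOwnProperty.call(appNameMapping, target);
  return { kind: 'package', packageName: hasMapping ? appNameMapping[target]! : target };
}
```

For example, with a mapping of { Settings: 'com.android.settings' }, the target 'Settings' resolves to the package com.android.settings, while an unmapped name is kept unchanged.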

Android-specific methods

agent.launch()

Launch a web URL or native Android activity/package.

function launch(target: string): Promise<void>;

  • target: string — Can be a web URL, a string in package/.Activity format (e.g., com.android.settings/.Settings), an app package name, or an app name. If you pass an app name and it exists in appNameMapping, it will be automatically resolved to the mapped package name; otherwise, target will be launched as-is.

agent.runAdbShell()

Run a raw adb shell command through the connected device. Pass only the shell command itself, without the adb shell prefix.

function runAdbShell(command: string, opt?: { timeout?: number }): Promise<string>;

  • command: string — Command passed verbatim to adb shell. For example, use input tap 100 200, not adb shell input tap 100 200.
  • opt.timeout?: number — Optional command execution timeout in milliseconds.

const result = await agent.runAdbShell('dumpsys battery', { timeout: 60 * 1000 });
console.log(result);

await agent.runAdbShell('input tap 100 200');

agent.terminate()

Terminate (force-stop) a running Android app.

function terminate(uri: string): Promise<void>;

  • uri: string — Package name, app name in appNameMapping, or package/.Activity (only the package part is used).

await agent.terminate('com.android.settings');

  • agent.back(): Promise<void> — Trigger the Android system Back action.
  • agent.home(): Promise<void> — Return to the launcher.
  • agent.recentApps(): Promise<void> — Open the Recents/Overview screen.

Helper utilities

agentFromAdbDevice()

Create an AndroidAgent from any connected adb device.

function agentFromAdbDevice(
  deviceId?: string,
  opts?: PageAgentOpt & AndroidDeviceOpt,
): Promise<AndroidAgent>;

  • deviceId?: string — Connect to a specific device; omitted means “first available”.
  • opts?: PageAgentOpt & AndroidDeviceOpt — Combine agent options with AndroidDevice settings.

getConnectedDevices()

Enumerate adb devices Midscene can drive.

function getConnectedDevices(): Promise<Array<{
  udid: string;
  state: string;
  port?: number;
}>>;
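
Scripts that index devices[0] directly will fail with an unhelpful error when nothing is connected. A small guard like the one below can produce a clearer message. It is a sketch that assumes the state field mirrors the adb states (such as device, offline, or unauthorized); `pickReadyDevice` is a hypothetical helper.

```typescript
// Same shape as the entries returned by getConnectedDevices() above.
interface ConnectedDevice {
  udid: string;
  state: string;
  port?: number;
}

// Picks a device whose state is "device", optionally by udid (hypothetical helper).
// Assumes `state` mirrors adb's own states.
function pickReadyDevice(devices: ConnectedDevice[], preferredUdid?: string): ConnectedDevice {
  const ready = devices.filter((d) => d.state === 'device');
  const chosen = preferredUdid ? ready.find((d) => d.udid === preferredUdid) : ready[0];
  if (!chosen) {
    throw new Error(
      preferredUdid
        ? `Device ${preferredUdid} is not connected or not ready`
        : 'No ready adb device found; check the output of "adb devices -l"',
    );
  }
  return chosen;
}
```

The returned udid can then be passed to new AndroidDevice(udid) as in the examples above.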