Testing Without Test Code: AI-Driven QA on Mobile Apps

Test automation has a dirty secret: the tests become a second codebase. XCUITest, Appium, Detox — they're all variations on the same idea. You write code that mimics a user. When the UI changes, you update the test code. When a third-party screen (a payment WebView, a deeplink landing) breaks an element ID, you find out in CI, three hours after the engineer went home.

The problem isn't that test automation is wrong. It's that it optimizes for the wrong thing. It automates mechanical execution, not judgment. A senior QA engineer isn't valuable because they tap buttons faster. They're valuable because they notice that the keyboard shifted the layout and the coordinate they're tapping is now someone else's button. That kind of judgment is hard to encode in a test script.

We've been exploring a different approach with one of our clients — a consumer mobile app with a multi-path checkout flow. What we found is worth writing down.

What we did

Instead of writing test code, we wrote scenario documents. Plain Markdown. Each scenario describes a user journey as a sequence of actions and expected states — the kind of thing a QA engineer would write in a test plan, except structured enough that an AI agent can read it and know what to do.
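To make "structured enough that an agent can read it" concrete, here is a minimal sketch of what such a document and a reader for it could look like. The "## Scenario:", "Do:" and "Expect:" conventions, the `exampleapp://` URL, and every name in the code are assumptions made for illustration, not the format used in the project.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical scenario text. The heading and the "Do:" / "Expect:" markers
# are conventions invented for this sketch.
SCENARIO_MD = """\
## Scenario: checkout from a deeplink

1. Do: open the deeplink exampleapp://design/123
   Expect: the product screen is visible
2. Do: tap "Checkout"
   Expect: the payment form is visible
"""


@dataclass
class Step:
    action: str
    expected: str = ""


@dataclass
class Scenario:
    title: str
    steps: List[Step] = field(default_factory=list)


def parse_scenario(text: str) -> Scenario:
    """Turn the Markdown above into a title plus ordered steps."""
    scenario = Scenario(title="")
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## Scenario:"):
            scenario.title = line[len("## Scenario:"):].strip()
            continue
        match = re.match(r"\d+\.\s*Do:\s*(.+)", line)
        if match:
            scenario.steps.append(Step(action=match.group(1)))
        elif line.startswith("Expect:") and scenario.steps:
            # An expectation belongs to the most recent action.
            scenario.steps[-1].expected = line[len("Expect:"):].strip()
    return scenario
```

An agent that reads prose does not strictly need a parser like this. The sketch only shows that the same document can stay readable to people while still carrying enough structure to be split into actions and expected states.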

The agent runs on the real iOS simulator. It uses the OS-level accessibility tree to find elements, fires real keystrokes via osascript, launches deeplinks with xcrun simctl openurl, and takes screenshots at each step. It reads the scenario, executes it, and writes a report — in the same Markdown format, with the screenshots it took inline.
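The simulator-facing commands named above can be wrapped in a few small helpers. This is a sketch under assumptions: the function names are hypothetical, `booted` targets whichever simulator is currently running, and the keystroke helper assumes the Simulator window is frontmost and that the calling process has been granted Accessibility permission on macOS.

```python
import subprocess
from typing import List


def open_deeplink_cmd(url: str, device: str = "booted") -> List[str]:
    """Command that opens a URL (for example a deeplink) in the simulator."""
    return ["xcrun", "simctl", "openurl", device, url]


def screenshot_cmd(path: str, device: str = "booted") -> List[str]:
    """Command that saves a screenshot of the simulator to `path`."""
    return ["xcrun", "simctl", "io", device, "screenshot", path]


def keystroke_cmd(text: str) -> List[str]:
    """Command that types `text` into whichever app is frontmost."""
    # Escape backslashes and double quotes for the AppleScript string literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    script = f'tell application "System Events" to keystroke "{escaped}"'
    return ["osascript", "-e", script]


def run(cmd: List[str]) -> str:
    """Execute a command and return its stdout; raises if it exits non-zero."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Building the argument lists separately from executing them keeps the helpers easy to check on a machine without Xcode. A call such as `run(open_deeplink_cmd("exampleapp://design/123"))` would only work on a Mac with a booted simulator.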

The test run produces the documentation. The scenario document and the run report share one Markdown format, readable by anyone on the team.

In a recent session we ran three order paths through a live Shopify checkout in the same test run — a shared-design deeplink, a merch-order deeplink, and a blank garment with artwork applied — all on a real simulator, all against the Bogus Gateway, all within about three minutes. Three successful orders, one network error recovered with a retry, and a post-checkout artist-name popup that was required on one path and correctly absent on the other two.
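Recovering from a transient network error with a retry can be sketched as follows. This is not how the agent decided to retry in that session. It is a generic retry-with-backoff helper, and the name `with_retry`, the use of `ConnectionError`, and the delay values are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Wait 2s, 4s, 8s, ... so back-to-back requests are spaced out.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter lets the helper be exercised without real waiting.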

What makes this different

Traditional automation

  • Tests are code — maintained separately from specs
  • Brittle to element ID changes and layout shifts
  • Binary output: pass or fail
  • Blind to visual state — asserts on properties, not pixels
  • Fails silently on third-party screens it can't inspect
  • Writing new test coverage requires an engineer

AI-driven scenario execution

  • Tests are prose — the same document QA would write anyway
  • Adapts to layout shifts; retries with revised coordinates
  • Output is a narrative run report, not a boolean
  • Takes screenshots and uses them as ground truth
  • Operates on any screen the OS can see, including WebViews
  • New scenario = new Markdown section, not new code

The adaptation problem

The most interesting thing we observed wasn't what went right — it was how the agent handled what went wrong.

In a Shopify checkout WebView, when the keyboard appears, the entire layout shifts. The element coordinates from before the keyboard appeared are now wrong. In our session, the agent noticed that keystrokes were landing in the previously focused field rather than the intended one — because the element it tapped had moved after the keyboard pushed the layout up. It adapted: dismissed the keyboard between fields, re-queried element positions, and re-typed from a clean state.

Traditional test code fails here. It has the old coordinates hardcoded, or it has a brittle query for an element ID that may or may not be in the right position. The agent doesn't have coordinates hardcoded — it has intent. It knows it's trying to fill the Card number field. When that fails, it figures out why and tries again.
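The recovery described above (dismiss the keyboard, re-query positions, type again) can be written as a loop keyed on intent rather than coordinates. The `Driver` interface below is hypothetical and does not correspond to any real library. It stands in for whatever the agent uses to read the accessibility tree and send input.

```python
from typing import Protocol, Tuple


class Driver(Protocol):
    """Hypothetical device interface, invented for this sketch."""

    def dismiss_keyboard(self) -> None: ...
    def find_center(self, label: str) -> Tuple[int, int]: ...
    def tap(self, x: int, y: int) -> None: ...
    def focused_label(self) -> str: ...
    def type_text(self, text: str) -> None: ...
    def read_value(self, label: str) -> str: ...


def fill_field(driver: Driver, label: str, text: str, max_attempts: int = 3) -> int:
    """Fill the field named `label`; return how many attempts it took."""
    for attempt in range(1, max_attempts + 1):
        driver.dismiss_keyboard()         # restore the layout without a keyboard
        x, y = driver.find_center(label)  # re-query every time, never reuse coordinates
        driver.tap(x, y)
        if driver.focused_label() != label:
            continue                      # the tap landed elsewhere; do not type
        driver.type_text(text)
        if driver.read_value(label) == text:
            return attempt
    raise RuntimeError(f"could not fill {label!r} after {max_attempts} attempts")
```

The check on `focused_label()` before typing is the step that prevents keystrokes from landing in the previously focused field.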

Similarly, after an artwork generation step that takes ~25 seconds, the accessibility tree returned the pre-generation UI even though the screenshot showed the generation was complete. The agent treated the screenshot as authoritative, ignored the stale accessibility dump, and continued correctly. That's a judgment call a static test runner can't make.
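One way to express "the screenshot wins when the two sources disagree" is shown below. It is a sketch, not the agent's actual logic: `read_a11y` and `read_screen` are hypothetical callables that each return a short state label, and how a screenshot becomes a label (for example via a vision model) is outside its scope.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Observation:
    state: str
    source: str  # "agreed" or "screenshot"
    note: str = ""


def observe(
    read_a11y: Callable[[], str],
    read_screen: Callable[[], str],
    retries: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Observation:
    """Give the accessibility tree a few chances to catch up, then trust the screenshot."""
    seen = read_screen()
    for _ in range(retries):
        if read_a11y() == seen:
            return Observation(state=seen, source="agreed")
        sleep(delay)
        seen = read_screen()  # the screen may also have moved on
    return Observation(
        state=seen,
        source="screenshot",
        note="accessibility tree looked stale; worth an INVESTIGATE entry",
    )
```

Recording the disagreement in `note` keeps the stale dump from being silently dropped, so it can surface in the run report.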

The report is the output

After each run, what you have is a Markdown document with:

  • Screenshots at every significant step
  • A precise narrative of what happened
  • An INVESTIGATE section, written by the agent, that reads like a senior QA engineer's testing notes

The INVESTIGATE items are not vague failures. They are specific observations: the coordinate trap created by the Pay Now button when the keyboard is up; the hypothesis that the network error on the third consecutive order may correlate with rapid back-to-back runs; the A11y buffering lag that appears specifically during state transitions.
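The shape of such a report can be sketched as a small renderer. The heading text, the `StepResult` type, and the screenshot paths are assumptions for illustration. In the workflow described here the agent writes the report itself, so this only shows what the resulting Markdown might contain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    description: str
    outcome: str
    screenshot: str = ""  # path relative to the report, if one was taken


def render_report(title: str, steps: List[StepResult], investigate: List[str]) -> str:
    """Render a run as Markdown: numbered steps, inline screenshots, INVESTIGATE notes."""
    lines = [f"# Run report: {title}", ""]
    for i, step in enumerate(steps, start=1):
        lines.append(f"{i}. {step.description}: {step.outcome}")
        if step.screenshot:
            # Indented so the image stays inside the numbered list item.
            lines.append(f"   ![step {i}]({step.screenshot})")
    if investigate:
        lines += ["", "## INVESTIGATE", ""]
        lines += [f"- {item}" for item in investigate]
    return "\n".join(lines) + "\n"
```

Because the output is plain Markdown, it renders in the same tools the team already uses for the scenario documents.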

A product manager can read this document. A developer can act on the INVESTIGATE items directly. There's no translation step between "the test failed" and "here's what actually happened."

This is the part that surprised us most. We expected the agent to execute steps. We didn't expect it to produce better QA notes than most human testers do.

What it doesn't replace

Unit tests, component tests, contract tests, snapshot tests — these are fine as code and should stay as code. They run fast, they're deterministic, they catch regressions at the right layer. Nothing here changes that.

What this replaces is the expensive, brittle, high-maintenance end-to-end UI test suite that everyone has and no one trusts. The one that's 40% flaky, passes locally, fails in CI, and takes a dedicated engineer to maintain. That suite exists because there was no better option for exercising real user flows on real device surfaces. Now there is.

Where this is going

The scenario documents are the asset. Once you have them, you can run them against any build, on any simulator, triggered by CI or by a human. The agent adapts to UI changes without requiring the scenario to change — unless the actual user flow changes, in which case updating a Markdown document is the right level of effort anyway.

We're running this on a production mobile app with a real payment integration, multiple checkout paths, deeplinks, artwork generation, and multi-screen flows. It works. The runs produce reports we actually read, flag issues we act on, and require no test code to maintain.

If you're carrying a failing E2E suite, or if your mobile QA process is a human doing the same taps before every release, it's worth a conversation.

start@eloquentix.com
