AI for Software Testing

AI for testing is a category with one mature use case (test generation), several emerging ones (test prioritisation, flake detection, oracle inference), and a few that don't quite work yet (autonomous test maintenance). This page is the working set, with honest assessment of where each fits.

Test generation: the obvious win, with caveats

LLMs can write tests from code or from a description. For routine cases this works well; for edge cases it's hit-or-miss.

Patterns that work:

Test scaffolding from code

```

You: Generate unit tests for this function:

[function]

AI: [tests with happy path, common edge cases, error conditions]

```

Wins: typing-heavy. The model writes the assertions, the fixtures, the parameterized cases. You read, edit, accept.

What you get: 70-90% of the test code without typing it. What you don't get: tests that catch bugs you didn't anticipate. The model writes tests that *match the implementation*, which means tests that pass against the buggy code as readily as the correct code.

Mitigation: pair with property-based testing (Hypothesis, fast-check, jqwik) or fuzzing. The model writes the unit tests; the property-based runner explores the input space.

Tests from specification / docstring

```

You: Write tests verifying this function matches its docstring:

[function with docstring]

AI: [tests grounded in what the docstring says, not what the code does]

```

This catches docstring-vs-code mismatches. Also catches under-specified docstrings (the model can only test what's specified).

Tests for legacy code

```

You: Generate tests for this untested function. The tests should pass

on the current behaviour — they're characterisation tests for refactoring safety.

AI: [tests of current behaviour, including any quirks]

```

Useful before refactoring. The tests pin down behaviour; refactor proceeds; tests fail when behaviour changes.

Caveat: the tests will encode bugs as expected behaviour. You're not testing for correctness; you're testing for stability. Convert to spec-tests later.

What test generation doesn't do well

- **Discover edge cases the implementation doesn't already handle.** The model sees the implementation; it tests what's there. Bugs of omission survive.

- **Match team conventions** without explicit guidance. Show it your existing tests; ask it to match.

- **Generate good integration tests.** Unit tests are easy; integration tests need realistic fixtures and the model rarely produces them well.

- **Think about timing-sensitive bugs.** Concurrency, race conditions — model won't generate tests that surface these without explicit prompting.

Test prioritisation

Run the most-likely-to-fail tests first. AI predicts which tests are most likely to fail given the diff.

Approaches:

- **Train a classifier** on historical "test outcome given diff" pairs. Predicts failure probability per test.

- **Use code-aware heuristics**: tests touching the changed files first; tests that historically failed for similar diffs first.

- **LLM ranking** of tests by relevance to the diff.

When it pays off:

- Long test suites (> 30 minutes) where you want fast signal on likely failures.

- Pre-commit hooks where you can't run the full suite.

- Flaky tests where re-running them last (instead of first) reduces noise.

This is mostly a CI optimisation, not a testing-quality improvement. But it's a meaningful CI optimisation — getting failure signal in 5 minutes instead of 30 minutes affects developer behaviour.

Flake detection

Tests that pass and fail nondeterministically erode trust. AI can:

- **Identify likely flake patterns** in test code (timing-sensitive code, network calls, randomness).

- **Cluster historical failures** to identify "this test fails 1% of the time, here's the pattern."

- **Suggest fixes** — add explicit waits, use deterministic clocks, mock the flaky boundary.

Pure AI flake detection is less mature than pattern-based detection. The combination — historical-pattern detector + LLM suggestion of root cause — works well.

Oracle inference

The "oracle problem" in testing: how do you know if the output is correct? For generated tests, the oracle is usually the existing implementation, which (as discussed) means tests that match implementation rather than spec.

Emerging pattern: derive the oracle from natural-language specification.

```

You: This function should compute compound interest using monthly compounding.

Given a $1000 initial investment at 5% annual interest for 1 year:

- Calculate expected output from the specification.

- Generate test cases verifying the implementation matches.

AI: Expected output: $1051.16 (at 12 monthly compoundings).

[generates parameterized tests with computed expected values]

```

This works when the spec is precise enough. For fuzzy specifications, the model will plausibly invent answers; verify.

Property-based testing assistance

Property-based testing is "given any valid input, this property should hold." Hypothesis (Python), fast-check (JS), jqwik (Java).

AI helps with:

- **Generating properties** from code or spec. "What invariants does this function maintain?"

- **Generating shrinking strategies** for custom data types.

- **Explaining failed tests** — the property failed for input X; the model can sometimes explain why.

Property-based testing's big win is catching edge cases that hand-written tests miss; AI helps you write the properties without changing the win.

Mutation testing

Mutation testing flips bits in your code; checks that at least one test fails per mutation. If no test fails, your tests don't actually cover that code.

AI angle:

- **Generate mutations targeted at code patterns** the model thinks tests miss.

- **Suggest tests** that would kill surviving mutants.

Mature mutation-testing tools (Stryker, PIT) exist. AI is augmentation, not replacement.

End-to-end and UI test generation

Several tools (Tessera, Mabl, Octomind) generate UI tests from descriptions or by recording user behaviour. AI generates the test steps; humans verify they're correct.

Trade-off: UI tests are notoriously brittle. AI-generated UI tests are no exception. The benefit is reducing the human cost of writing UI tests; the cost is maintaining them as the UI changes.

For LLM-based UI tests specifically: the test description should be at the user-intent level ("user can submit a feedback form") rather than the DOM level ("click button with id submit-btn"). This makes them more robust to UI changes but moves more responsibility into the AI runtime.

Workflows that work

TDD with AI assistance

1. Engineer writes the test specification (what should happen).

2. AI fills in the test code.

3. Engineer runs; sees it fail.

4. AI implements the function; engineer reviews.

5. Tests pass; refactor.

This is faster than pure-human TDD without sacrificing the discipline. The engineer thinks through cases; AI types.

Coverage gap filling

Identify uncovered branches; ask AI to generate tests targeting them.

```

Coverage report: lines 142-158 of payment.py are uncovered.

You: Generate tests that exercise the error-handling code in the

payment processor, lines 142-158.

```

Caveat: you'll get tests that hit those lines, not necessarily tests that *should pass* — sometimes covering uncovered code reveals it shouldn't behave that way.

Mutation hardening

When mutation testing reveals weak tests, ask AI to generate stronger ones:

```

You: This mutation survived: changing `<` to `<=` on line 47.

Generate a test that would fail if the mutation were applied.

```

Targeted, often produces good edge-case tests.

Anti-patterns

- **Generating tests after the fact for already-buggy code.** Tests will encode the bugs.

- **Trusting AI test names without reading the assertions.** "test_handles_empty_input" might not actually test empty input.

- **Generating tests at scale without review.** 1000 generated tests where 800 are vacuous adds noise to the suite.

- **Replacing thinking with generation.** "What should this function do?" is still your job. AI can write tests for what you specify; can't decide what to specify.

- **Letting AI fix tests by changing assertions.** Common: test fails, AI "fixes" by relaxing assertion. Catches bugs in development; introduces them in production.

Honest assessment

For most teams in 2026:

- **Test generation** is a real productivity gain on routine code.

- **Test prioritisation in CI** is a real time-saver on large suites.

- **AI-generated UI tests** are still inferior to well-written human ones, but cheaper.

- **Autonomous test maintenance** ("AI updates tests when the code changes") is more aspirational than reliable.

The right adoption pattern: layer AI assistance on top of existing testing discipline, not as a replacement. Teams without good testing practice can't AI their way out of it.

Tools

- **GitHub Copilot, Cursor** — inline test completion.

- **Codium / Qodo** — test-focused AI tooling.

- **Diffblue Cover** (Java) — automated test generation; mature.

- **Mabl, Tessera, Octomind** — UI test generation.

- **Stryker, PIT** — mutation testing (not AI but pairs well).

Further reading

- [AiPairProgramming]() — broader AI coding context

- [AiForCodeReview]() — review of test code specifically

- [AgentTesting]() — testing AI agents (vs using AI for testing)

- [TestDrivenDevelopment] — discipline AI augments, not replaces