This is an article about a technical issue, but basically it's about a heroic attempt to deal with the impossible.
At the risk of appearing presumptuous, I would like to share my experiences with you: it's about establishing automated user interface tests.
I owe you an explanation of why I called this "impossible": the general theory of testing is to stimulate the system under test with some known input.
Provided the system behaves in a reproducible way, we can then compare the system's output with a nominal output and apply some suitable
pass/fail criterion to mark the test as passed or failed. That, at least, is the theory.
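The general principle can be sketched in a few lines. This is a minimal illustration of the stimulate/compare/criterion idea, not code from KLayout; all names are made up:

```python
# Minimal sketch of the general test principle: stimulate the system
# under test with a known input, then compare the observed output
# against a nominal output using a pass/fail criterion.

def run_test(system_under_test, stimulus, nominal_output, criterion):
    """Apply a stimulus and judge the response with a pass/fail criterion."""
    actual_output = system_under_test(stimulus)
    return criterion(actual_output, nominal_output)

# A trivial "system" and an exact-match criterion for illustration:
double = lambda x: 2 * x
exact_match = lambda actual, nominal: actual == nominal

print(run_test(double, 21, 42, exact_match))  # True -> test passed
```

The interesting part, as argued below, is not this skeleton but the choice of `criterion`: deciding what counts as a relevant difference.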
For a classical software test, this principle holds well. But the theory tends to neglect an important issue: the choice of the
test cases, and the nominal output against which the system's response is compared, are the result of an engineering process.
What is tested, and how the response is checked, should be guided by a specification or by wider system knowledge. Thus
the test reflects the "important" aspects of the system while putting less emphasis on the less important ones.
Why is this important? Because that choice reflects the freedom of implementation, and that freedom is basically what engineering
is about. A system should meet the requirements laid down in a specification. What is not part of the specification
should be left to the implementor, and ideally this freedom will be used to find a solution that is simple and effective.
A test should focus on the specification and not the implementation details.
That is not a new concept: it is the basis of the test-driven software engineering approach. Programmers, being particularly lazy, tend
to adopt such methods quickly and use the implementation freedom to minimize their coding effort. In addition, focusing the testing on the
important aspects enables refactoring - a technique where code is frequently reorganized or even rewritten. Being the opposite of the
"never change a running system" principle, refactoring can be a very effective way of keeping code maintainable, provided there
is good test coverage and a high tolerance of the tests for irrelevant implementation side effects.
Coming back to UI tests: most frameworks available for UI test automation follow the theory. Somehow, the user interface is stimulated,
e.g. by recording and replaying mouse movements, button events and keyboard interactions. That is the input side. On the response side,
however, there are numerous channels, mostly graphical in some way. How should a test system deal with that output? That is the crucial
point here, and taking into account my previous statements about the importance of careful and selective response checking, it should not be hard
to see that we have an issue. These are the facts: I have not seen many people performing automated user interface tests, and from those who do, I have heard a lot of complaints about
high maintenance effort, unreliable execution and much else. There seem to be many people who regard automated user interface tests as "impossible".
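The record-and-replay idea on the input side, combined with a response log on the output side, can be sketched as follows. This is an illustration under my own assumptions, not the actual KLayout framework; the event names and the stub handler are invented:

```python
# Illustrative sketch: UI stimulation by replaying a recorded event
# stream, with the observable responses captured in a test log that
# is compared against a golden reference log.

recorded_events = [
    ("click", "zoom_fit_button"),
    ("key", "Escape"),
]

def replay(events, handler, log):
    """Feed recorded events into the UI and log each observable response."""
    for kind, target in events:
        response = handler(kind, target)
        log.append(f"{kind} {target} -> {response}")

def fake_ui_handler(kind, target):
    # Stand-in for a real user interface; returns a describable response.
    return "ok"

golden_log = ["click zoom_fit_button -> ok", "key Escape -> ok"]

log = []
replay(recorded_events, fake_ui_handler, log)
print(log == golden_log)  # True if the responses match the reference
```

The hard part in practice is not the replay loop but deciding what the `response` strings should contain - which is exactly the selectivity problem discussed above.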
With KLayout I tried to deal with that issue. Here are my considerations:
And this is the implementation:
So far, this approach holds well. In fact, some simple tests have been automated very successfully using this concept.
However, it gets nasty when it comes to the details. For example, a simple comparison of test logs is not very useful. It frequently happens
that the drawing of a layout changes slightly, e.g. because of different rounding or similar effects. Then just a few pixels
of the drawn image change. To evaluate the differences between two test logs with embedded images, a more elaborate solution is
required than a simple "diff". For that purpose, KLayout comes with a small utility ("gtfui") which is basically a graphical "diff" tool with the
capability of comparing images and showing image differences. This utility has proven extremely useful, but of course it took
some effort to build.
Some issues have not been solved yet:
However, the tests have proven extremely useful. With user interface tests, it is possible to cover a broad variety of functionality,
and tests are easily created (recorded). The UI test suite has shown me a couple of bugs after refactoring sessions which were not
found by the unit tests. Another conclusion is that checking selected widgets for correct content is a very powerful way of keeping
tests maintainable. However, such check points must not be placed too sparsely; otherwise it is hard to track down the root cause of a test failure.
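The check-point idea can be illustrated like this: rather than comparing raw graphics, assert the content of selected widgets at well-chosen points of the test. The widget names and values below are made up for illustration:

```python
# Sketch of a "check point": compare the contents of selected widgets
# against expected values and record the result in the test log.

def check_point(widgets, expected, log):
    """Compare selected widget contents against expected values."""
    for name, value in expected.items():
        actual = widgets.get(name)
        status = "ok" if actual == value else f"FAIL (got {actual!r})"
        log.append(f"checkpoint {name}: {status}")

# Hypothetical widget state at some point during a replayed test:
widgets = {"cell_name_label": "TOP", "layer_count_field": "12"}

log = []
check_point(widgets, {"cell_name_label": "TOP"}, log)
print(log)  # ['checkpoint cell_name_label: ok']
```

Placing such check points densely enough means that when a test fails, the first failing check point points close to the root cause instead of leaving you with a diverged log far downstream.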
Coming back to the title, I would conclude that I did not manage to completely tame the beast. Right now, I have about 70 user interface tests;
most of them run stably and provide a high degree of coverage for database and user interface functionality, but some frequently require
updating. I would like
to see Troll Tech integrate something like my solution into Qt, which would save me the effort of implementing a test framework and enhance
the stability of the tests.
On the other hand, I feel
that a real solution would involve a user interface architecture which is consistently optimized for testability, e.g. by providing a view model
layer below the user interface that could be used as a test interface.
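A hedged sketch of that idea: a view model layer holds the state the user interface displays, so tests can drive and inspect the application through the view model instead of through pixels and mouse events. The class and attribute names are invented for illustration:

```python
# Sketch of a testable view model layer: tests exercise application
# state directly, without any graphics or event simulation involved.

class ZoomViewModel:
    """State layer sitting between the UI widgets and the core."""

    def __init__(self):
        self.zoom_level = 1.0

    def zoom_in(self):
        # The widget's zoom button would call this same method.
        self.zoom_level *= 2.0

vm = ZoomViewModel()
vm.zoom_in()
print(vm.zoom_level)  # 2.0
```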
That, however, is currently beyond my scope, and what I really would like to see right now is a cold beer sitting on my table ...