Saturday, 31 May 2014

HtmlUnit - For Integration Testing and Webcrawling

To put it in just a few words: HtmlUnit is a web browser without a window.

Intended for integration testing, HtmlUnit lets a program manipulate a webpage at a high level, i.e. as if it were using a normal web browser. The calling program can fill in and submit forms, click buttons, image maps and hyperlinks, or activate JavaScript-created objects. JavaScript, cookies and AJAX are supported. So are proxies and automatic redirection.

GUI integration testing

This kind of testing is about as close to human testing as we can get with automated testing. Testing static webpages is fairly easy because the content is loaded only once from the remote server, but nowadays more webpages have dynamic content than not. Once the page is loaded, not only the outward appearance but also the content itself is changed with the help of JavaScript, CSS (Cascading Style Sheets), AJAX and Adobe Flash (although Flash, being a self-contained "applet" or video player, is outside the scope of HtmlUnit).

With HtmlUnit the test program can "crawl" through the HTML code section by section, confirming that the content is correct. Or it can jump straight to a certain part identified by its id or name attribute. It can "hover" the (emulated) mouse pointer over a piece of text or a button on a form, or e.g. select an item from a select list that is wired to JavaScript, and then confirm that the page or form content changes as planned.

HtmlUnit does UI testing for webpages, or more precisely, it tests how the HTML elements and the JavaScript work together.


Because HtmlUnit is a headless (i.e. windowless) web browser, it can also be used to programmatically browse websites and extract information. On many webpages JavaScript is intimately linked to the processing of forms, so that a form cannot be submitted properly without JavaScript's help. Pages of this kind are of course examples of poor web form design (concerns are not separated; business logic is mixed with the program flow), but ours being an imperfect world, even they must be accepted. And that's where HtmlUnit shows what it's made of.

There are plenty of pages where the user only needs to log in through the front page and the sought-after information is immediately available, perhaps via a simple form. Think of logging in to your telephone company's website only to see how much balance or network quota you still have left for the current month. Many simple hardware devices, such as home routers, provide only a web interface, with no SOAP or REST API. HtmlUnit to the rescue! Earlier it was impossible, or close to it, for a program to get to this content.

Let's see an example in Java:
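import java.io.IOException;
import java.util.List;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlPasswordInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

final WebClient webClient = new WebClient(BrowserVersion.CHROME, proxyIP, proxyPort);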

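We have imported some HtmlUnit element classes. We create a new WebClient instance by telling it which browser it should spoof and which server to use as a proxy. Both of these are optional. Spoofing matters because an HTTP server or the client-side JavaScript sometimes changes the layout of the page depending on the requesting browser. We also enable redirection, JavaScript support and cookie support; these are controlled through webClient.getOptions() (setRedirectEnabled, setJavaScriptEnabled) and webClient.getCookieManager() (setCookiesEnabled). A simpler way is to rely on the defaults: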

   final WebClient webClient = new WebClient();

Let's continue. We want to find the submit button and input fields for userid and password. Once we get them, we can finish logging in by clicking the submit button and loading a new page in the bargain.

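HtmlAnchor linkToJobAdPage = null;
try {
    // Load the login page.
    final HtmlPage titlePage = webClient.getPage(hostname);
    final List<HtmlForm> forms = titlePage.getForms();
    // Iterate through the list to find what we need. As a placeholder,
    // this sketch assumes the login form is the first form on the page.
    final HtmlForm loginForm = forms.get(0);

    final HtmlInput submitButton = loginForm.getInputByName("login");
    final HtmlTextInput usernameTextField = loginForm.getInputByName("login_id");
    final HtmlPasswordInput passwordTextField = loginForm.getInputByName("login_password");

    // userId and password are placeholders for your own credentials.
    usernameTextField.setValueAttribute(userId);
    passwordTextField.setValueAttribute(password);

    // Clicking the submit button logs us in and returns the next page.
    final HtmlPage entryPage = submitButton.click();

    final List<HtmlAnchor> links = entryPage.getAnchors();
    for (HtmlAnchor link : links) {
        logger.debug("Entry Page link: " + link.asXml());
        // Put the text that identifies the wanted link between the quotes.
        if (link.asXml().contains("")) {
            linkToJobAdPage = link;
            break;
        }
    }
} catch (IOException e) {
    logger.error("Could not log in or read the entry page", e);
}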
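
The link search above follows a simple pattern: walk a list and keep the first element whose markup contains a marker string. Here is a minimal sketch of that pattern on its own. To stay runnable without HtmlUnit on the classpath, it uses plain strings as stand-ins for what link.asXml() would return; the class name LinkFinder, the method findFirstContaining and the sample markup are made up for illustration and are not part of HtmlUnit.

```java
import java.util.Arrays;
import java.util.List;

class LinkFinder {

    /**
     * Returns the first entry whose text contains the marker,
     * or null when no entry matches.
     */
    static String findFirstContaining(List<String> markups, String marker) {
        for (String markup : markups) {
            if (markup.contains(marker)) {
                return markup;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-ins for the strings link.asXml() would return.
        List<String> links = Arrays.asList(
                "<a href=\"/news\">News</a>",
                "<a href=\"/jobs\">Open positions</a>",
                "<a href=\"/contact\">Contact</a>");

        String jobLink = findFirstContaining(links, "/jobs");
        System.out.println(jobLink);
    }
}
```

With real HtmlUnit objects the same loop would compare link.asXml() (or link.getHrefAttribute()) against the marker and keep the HtmlAnchor itself, ready to be clicked. Returning null for "not found" mirrors the null-initialised variables used in the snippet above.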

HtmlUnit for Perl

HtmlUnit is not a Java monopoly just because it was developed in Java. It's also available for other programming languages.

Celerity is a JRuby wrapper around HtmlUnit – a headless Java browser with JavaScript support.

WWW::HtmlUnit is the Perl equivalent, an Inline::Java-based wrapper of the HtmlUnit v2.14 library.