13.4.  Whitespace Processing

Almost all tests compare strings. Many comparisons would fail if whitespace were left exactly as it is. You can therefore control how whitespace is handled with the attribute whitespaces="..", using one of three predefined constants. NORMALIZE is the default if nothing is declared:

<!-- Constants for whitespace processing: -->

<xxx whitespaces="IGNORE"    />  1
<xxx whitespaces="KEEP"      />  2
<xxx whitespaces="NORMALIZE" />  3

1  All whitespace is deleted before two strings are compared.

2  Existing whitespace is left unchanged.

3  Whitespace at the beginning and at the end of a string is deleted. Any sequence of whitespace characters within the text is reduced to a single space.
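The effect of the three constants can be illustrated with a small Java sketch. It is not PDFUnit's implementation. The class WhitespaceModes, the enum Mode and the method apply are hypothetical names chosen for this illustration, and the sketch assumes that "whitespace" means the characters matched by \s in a Java regular expression.

```java
public class WhitespaceModes {

    /** Hypothetical enum mirroring the three constants; not a PDFUnit type. */
    public enum Mode { IGNORE, KEEP, NORMALIZE }

    /** Returns the text as it would look before comparison (assumed semantics). */
    public static String apply(String text, Mode mode) {
        switch (mode) {
            case IGNORE:
                // Delete every whitespace character.
                return text.replaceAll("\\s+", "");
            case KEEP:
                // Leave the text untouched.
                return text;
            case NORMALIZE:
                // Trim both ends, then collapse inner runs to one space.
                return text.trim().replaceAll("\\s+", " ");
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        String raw = "  Page # 1 \n of   4  ";
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> [" + apply(raw, mode) + "]");
        }
    }
}
```

Under these assumptions, two strings that differ only in line breaks and indentation become equal after IGNORE or NORMALIZE, but stay different under KEEP.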

The constants can be used in the following tags:

<!-- Tags that allow you to define the whitespace processing -->

<hasXXXAction>
  <containing       whitespaces=".." />
  <matchingComplete whitespaces=".." />
</hasXXXAction>

<hasText>
  <containing       whitespaces=".." />
  <notContaining    whitespaces=".." />
  <matchingComplete whitespaces=".." />
</hasText>

An example:

<testcase name="hasText_WithLineBreaks_UsingIGNORE">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="FIRST_PAGE">
      <matchingComplete whitespaces="IGNORE">
        PDFUnit - Automated PDF Tests http://pdfunit.com/
        This is a document that is used for unit tests of PDFUnit itself.
        Content on first page.
        odd pagenumber
        Page # 1 of 4
      </matchingComplete>
    </hasText>
  </assertThat>
</testcase>

The expected string in this example is written with many line breaks, and they differ from the line breaks inside the PDF page. With whitespaces="IGNORE" the test still succeeds, because all whitespace is deleted from both strings before they are compared.

As an exception to this rule, tags that use regular expressions never change whitespace automatically. It is up to you to build the whitespace handling into the regular expression, for example like this:

(?ms).*print(.*)

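The prefix (?ms) sets two flags so that the search can extend over multiple lines: m (multi-line mode) lets ^ and $ match at the beginning and end of each line, and s (dot-all mode) lets the dot also match line breaks. Line breaks are therefore treated like any other character.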
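The following Java sketch shows how these flags behave in java.util.regex. It is an illustration only: the class RegexWhitespaceDemo, its two helper methods and the sample page text are made up for this example, and whether PDFUnit matches the whole text or searches for a partial match is an assumption that is not confirmed here.

```java
import java.util.regex.Pattern;

public class RegexWhitespaceDemo {

    /** True if the regex matches the entire text (assumed matching mode). */
    public static boolean matchesWholeText(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    /** True if the regex matches somewhere inside the text. */
    public static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Made-up page text with line breaks, as it might come from a PDF.
        String page = "Content on first page.\nPlease print this page.\nPage #\n 1 of 4";

        // With (?ms) the dot also matches line breaks.
        System.out.println(matchesWholeText("(?ms).*print(.*)", page));

        // Without the flags the dot stops at the first line break.
        System.out.println(matchesWholeText(".*print(.*)", page));

        // Whitespace handled explicitly inside the expression with \s+.
        System.out.println(containsMatch("Page\\s+#\\s+1\\s+of\\s+4", page));
    }
}
```

Writing \s+ wherever the page may contain spaces or line breaks gives an effect similar to NORMALIZE, but only at the places where you put it.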