The most common test case for PDF documents is probably to check the
presence of expected text. That can be done with the tag <hasText />
which can contain other tags and attributes.
<!-- Tags to verify content: -->
<hasText />

<!-- Nested tags of <hasText />: -->
<hasText>
  <inClippingArea />     (optional)

  <!-- Comparing content: -->
  <containing />         (optional)
  <endingWith />         (optional)
  <matchingComplete />   (optional)
  <matchingRegex />      (optional)
  <startingWith />       (optional)

  <!-- Prove the absence of text: -->
  <notContaining />      (optional)
  <notEndingWith />      (optional)
  <!-- <notMatchingRegex /> is intentionally not provided -->
  <notMatchingRegex />   (optional)
  <notStartingWith />    (optional)
</hasText>

<!-- Attributes of <hasText /> to select pages. -->
<!-- One of these attributes has to be used: -->
<hasText on=".." />
<hasText onPage=".." />
<hasText onEveryPageAfter=".." />
<hasText onEveryPageBefore=".." />
<hasText onAnyPageAfter=".." />
<hasText onAnyPageBefore=".." />

<!-- Whitespace processing, see: 13.4: “Whitespace Processing” -->
If you are looking for a text on the first page of a letter, test it this way:
<testcase name="hasText_OnFirstPage_Containing">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="FIRST_PAGE">
      <containing>Content on first page.</containing>
    </hasText>
  </assertThat>
</testcase>
You can declare specific pages using the attribute on="..", which provides several constants, e.g. on="FIRST_PAGE", on="EVERY_PAGE", on="ODD_PAGES" etc. Chapter 13.2: “Page Selection” describes more constants and how to use them.
The next example searches a text on the last page:
<testcase name="hasText_OnLastPage">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="LAST_PAGE">
      <containing>Content on last page.</containing>
    </hasText>
  </assertThat>
</testcase>
You can also test individual pages using the attribute onPage="..":
<testcase name="hasText_OnIndividualPages">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText onPage="2, 3">
      <containing>Content on</containing>
    </hasText>
  </assertThat>
</testcase>
Page numbers in the attribute onPage=".." must be separated by commas. Chapter 13.2: “Page Selection” describes page selection in detail.
There are three constants for the attribute on=".." to search text on multiple pages: on="EACH_PAGE", on="EVERY_PAGE" and on="ANY_PAGE".
<testcase name="hasText_OnEveryPage">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="EVERY_PAGE">
      <startingWith>PDFUnit</startingWith>
    </hasText>
  </assertThat>
</testcase>
<testcase name="hasText_OnAnyPage">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ANY_PAGE">
      <containing>Page # 3</containing>
    </hasText>
  </assertThat>
</testcase>
The constants on="EVERY_PAGE" and on="EACH_PAGE" require that the text really exists on every page. When you use the constant on="ANY_PAGE", a test is successful if the expected text exists on one or more of the pages.
The logic of the two previous examples is clear. But the logic becomes unclear when you negate both statements. In everyday speech, the difference between “Every page does not contain the expected text” and “Any page does not contain the expected text” is unclear. And the last sentence itself has an unclear meaning.
To avoid mistakes, PDFUnit does not allow negated tests with the constant on="ANY_PAGE".
The following test is not allowed and throws an exception:
<testcase name="hasText_NotMatchingRegex">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ANY_PAGE">
      <notEndingWith>wrongValueIntended</notEndingWith>
    </hasText>
  </assertThat>
</testcase>
The error message is:
Searching text 'ON_ANY_PAGE' in combination with negated methods is not supported.
Instead of asking that “any page does NOT contain an expected text” it is better to write “every page contains the expected text” and catch the exception.
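The ambiguity described above can be made concrete with a small Python sketch. It is not PDFUnit code: the page texts and the helper names `on_every_page` and `on_any_page` are invented for illustration, and each page is modeled simply as a string.

```python
# Hypothetical model: every page is reduced to its plain text.
pages = ["PDFUnit page 1", "PDFUnit page 2", "Other text on page 3"]


def starts_with_pdfunit(text):
    return text.startswith("PDFUnit")


def on_every_page(pages, predicate):
    """True if the predicate holds for all pages."""
    return all(predicate(text) for text in pages)


def on_any_page(pages, predicate):
    """True if the predicate holds for at least one page."""
    return any(predicate(text) for text in pages)


every_page = on_every_page(pages, starts_with_pdfunit)
any_page = on_any_page(pages, starts_with_pdfunit)

# Two possible readings of "any page does not start with 'PDFUnit'":
# (a) at least one page lacks the text
reading_a = on_any_page(pages, lambda text: not starts_with_pdfunit(text))
# (b) no page at all has the text
reading_b = not on_any_page(pages, starts_with_pdfunit)
```

For these sample pages the two readings give different answers, which is the kind of mistake the restriction on negated tests with on="ANY_PAGE" is meant to prevent.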
When searching text, line breaks and other whitespace are ignored in the expected text as well as in the text being tested. In the following example the text to be searched for belongs to the document “Digital Signatures for PDF Documents” by Bruno Lowagie (iText). The first chapter has some line breaks:
[Figure: excerpt from the first chapter with the searched text marked]
The following tests for the marked text use different line breaks. They both succeed:
<!--
  The PDF document has a (visible) line break after the word "The".
  The search string does not contain a line break.
-->
<testcase name="hasText_ContainingLineBreaks_LineBreakInPDF">
  <assertThat testDocument="digitalsignatures20121017.pdf">
    <hasText on="FIRST_PAGE">
      <containing>The technology was conceived</containing>
    </hasText>
  </assertThat>
</testcase>
<!--
  The expected search string intentionally contains other line breaks.
-->
<testcase name="hasText_ContainingLineBreaks_LineBreakInExpectedString">
  <assertThat testDocument="digitalsignatures20121017.pdf">
    <hasText on="FIRST_PAGE">
      <containing>
        The technology
        was conceived
      </containing>
    </hasText>
  </assertThat>
</testcase>
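Why both tests succeed can be sketched in Python. The sketch assumes that "ignoring" whitespace means collapsing every run of whitespace to a single space on both sides before comparing. PDFUnit's actual rules are described in chapter 13.4 and may differ. The function names are invented for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and line breaks to one space (assumption)."""
    return re.sub(r"\s+", " ", text).strip()


def contains_ignoring_whitespace(page_text, expected):
    """Compare after normalizing both the page text and the expected text."""
    return normalize_whitespace(expected) in normalize_whitespace(page_text)


# The page text has a line break after "The", as in the PDF described above.
page_text = "The\ntechnology was conceived"

# Search string without line breaks.
found_one_line = contains_ignoring_whitespace(
    page_text, "The technology was conceived")

# Search string with line breaks in other places.
found_other_breaks = contains_ignoring_whitespace(
    page_text, "\n  The technology\n  was conceived\n")
```

Both comparisons succeed because the line breaks disappear on both sides before the substring check.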
Text can be searched not only on whole pages, but also in a section of a page. Chapter 13.6: “Defining Page Areas” describes that topic.
You can verify that your PDF document does not have empty pages:
<testcase name="hasText_AnyPageEmpty">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="EVERY_PAGE" />
  </assertThat>
</testcase>
If you want to verify that a page or a section of a page does not contain text, you can use the tag <hasNoText />:
<testcase name="hasNoTextInClippingArea">
  <assertThat testDocument="&pdfdir;/emptyPages/pagesPartiallyEmpty.pdf">
    <hasNoText on="FIRST_PAGE">
      <inClippingArea upperLeftX="70" upperLeftY="80" width="90" height="60" />
    </hasNoText>
  </assertThat>
</testcase>
It is annoying to write a separate test for every expected text on a page. So you can invoke some tags more than once:
<testcase name="hasText_Containing_MultipleTokens">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ODD_PAGES">
      <containing>on</containing>
      <containing>page</containing>
      <containing>odd pagenumber</containing>
    </hasText>
  </assertThat>
</testcase>
<testcase name="hasText_NotContaining_MultipleTokens">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="FIRST_PAGE">
      <notContaining>even pagenumber</notContaining>
      <notContaining>Page #2</notContaining>
    </hasText>
  </assertThat>
</testcase>
In the first example the test is successful when all expected tokens are found, and the second test is successful when none of the expected tokens are found.
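The "all tokens" and "no token" rules can be written down as a short Python sketch. It only models the logic stated above for a single page. The sample page text and the function names are invented and are not part of PDFUnit.

```python
def contains_all(page_text, tokens):
    """Repeated <containing />: every token must be found."""
    return all(token in page_text for token in tokens)


def contains_none(page_text, tokens):
    """Repeated <notContaining />: no token may be found."""
    return not any(token in page_text for token in tokens)


# Invented sample text for one page.
page_text = "Content on first page. odd pagenumber"

all_found = contains_all(page_text, ["on", "page", "odd pagenumber"])
none_found = contains_none(page_text, ["even pagenumber", "Page #2"])

# A single missing token makes the positive test fail.
one_missing = contains_all(page_text, ["on", "last page"])
```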
The tags <startingWith /> and <endingWith /> can each be used only once.
Multiple text comparisons all relate to the pages specified in the outer tag <hasText />:
<testcase name="hasText_MultipleInvocation">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ANY_PAGE">
      <startingWith>PDFUnit</startingWith>
      <containing>Content on last page.</containing>
      <matchingRegex>.*[Cc]ontent.*</matchingRegex>
      <endingWith>of 4</endingWith>
    </hasText>
  </assertThat>
</testcase>
If the validations refer to different pages, the tag <hasText /> must be used multiple times:
<!--
  Different pages and different comparisons in one concatenated statement.
  This test works, but it is not recommended. When the test fails,
  the error analysis is more complicated than if you had 3 individual tests.
-->
<testcase name="hasText_ComplexSearchOverDifferentPages">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText on="ANY_PAGE">
      <startingWith>PDFUnit - Automated PDF Tests</startingWith>
    </hasText>
    <hasText on="EVEN_PAGES">
      <containing>Content</containing>
      <containing>even pagenumber</containing>
    </hasText>
    <hasText on="ODD_PAGES">
      <containing>odd pagenumber</containing>
    </hasText>
  </assertThat>
</testcase>
This test is not good because the name of the test is not clear enough.
Do you need to know whether an expected text can be found on any page after the first page? Such a test looks like this:
<testcase name="hasText_OnAnyPageAfter">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText onAnyPageAfter="1">
      <containing>Content on</containing>
    </hasText>
  </assertThat>
</testcase>
Page numbers start from “1”.
Invalid page limits are not necessarily an error. In the following example, the text is searched for on all pages between 1 and 99 (exclusive). Although the document has only 4 pages, the test ends successfully because the expected string is found on page 1:
<!--
  Attention: the document has the search token on page 1.
  And '1' is before '99'. So this test ends successfully.
-->
<testcase name="hasText_OnAllPagesBefore_WrongPageNumber">
  <assertThat testDocument="content/diverseContentOnMultiplePages.pdf">
    <hasText onAnyPageBefore="99">
      <containing>Content on</containing>
    </hasText>
  </assertThat>
</testcase>
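One way to picture why a limit of 99 is harmless for a 4-page document is the following Python sketch. It assumes that both limits are exclusive and that the selection is simply restricted to pages that exist. The text above states the exclusive bound only for the "before" case, so the "after" case is an assumption. The function names are invented and do not belong to PDFUnit.

```python
def pages_before(upper_limit, page_count):
    """Existing page numbers below upper_limit (exclusive), counted from 1."""
    return list(range(1, min(upper_limit, page_count + 1)))


def pages_after(lower_limit, page_count):
    """Existing page numbers above lower_limit (exclusive, by assumption)."""
    return list(range(max(lower_limit + 1, 1), page_count + 1))


# A limit beyond the end of a 4-page document still selects pages 1 to 4.
before_99 = pages_before(99, 4)

# Pages after the first page of the same document.
after_1 = pages_after(1, 4)

# A lower limit beyond the end of the document selects nothing.
after_99 = pages_after(99, 4)
```

Under these assumptions an "any page" test with the limit 99 inspects pages 1 to 4 and succeeds as soon as one of them contains the expected text.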
The visible sequence of text on a PDF page does not necessarily correspond to the text sequence within the PDF document. This could prevent PDFUnit from recognizing text sequences. However, PDFUnit uses iText's powerful text recognition, which assembles text objects based on their positions on a page.
Although the text in the next example is a separate text object in a frame, a test for the text sequence "the beginning. This is content" succeeds:
[Figure: page with text placed in a frame]
<!--
  The PDF document does not store the text in the visible order.
-->
<testcase name="hasText_TextNotInVisibleOrder">
  <assertThat testDocument="content/contentNotInVisibleOrder.pdf">
    <hasText on="FIRST_PAGE">
      <containing>
        Content at the beginning.
        This is content, placed in a frame by OpenOffice.
        Content at the end.
      </containing>
    </hasText>
  </assertThat>
</testcase>
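The idea of assembling text by position can be illustrated with a deliberately simple Python sketch. It is not iText's algorithm: it only sorts text chunks from top to bottom and left to right. The type `TextChunk`, the function `in_reading_order` and all coordinates are invented for illustration.

```python
from typing import NamedTuple


class TextChunk(NamedTuple):
    x: float   # horizontal position of the chunk
    y: float   # vertical position; larger values are higher on the page
    text: str


def in_reading_order(chunks):
    """Join chunks top to bottom, then left to right."""
    ordered = sorted(chunks, key=lambda chunk: (-chunk.y, chunk.x))
    return " ".join(chunk.text for chunk in ordered)


# Storage order differs from the visible order: the framed text comes last.
stored = [
    TextChunk(50, 700, "Content at the beginning."),
    TextChunk(50, 600, "Content at the end."),
    TextChunk(80, 650, "This is content, placed in a frame by OpenOffice."),
]

visible_text = in_reading_order(stored)
found = "the beginning. This is content" in visible_text
```

Sorting by position places the framed text between the other two chunks, so the sequence that spans two text objects is found.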