The most common test case for PDF documents is probably to check the presence of expected text. Various methods can be used:
// Testing page content: .hasText() // pages has to be specified before // Validating expected text: .hasText().containing(..) .hasText().containing(.., WhitespaceProcessing) .hasText().endingWith(..) .hasText().endingWith(.., WhitespaceProcessing) .hasText().equalsTo(..) .hasText().equalsTo(.., WhitespaceProcessing) .hasText().matchingRegex(..) .hasText().startingWith(..) // Prove the absence of defined text: .hasText().notContaining(..) .hasText().notContaining(.., WhitespaceProcessing) .hasText().notEndingWith(..) .hasText().notMatchingRegex(..) .hasText().notStartingWith(..) // Validate multiple text in an expected order: .hasText().inOrder(..) .hasText().containingFirst(..).then(..) // Comparing visible text with ZUGFeRD data: .hasText.containingZugferdData(..)
Chapter 13.5: “Whitespace Processing” describes the different options to handle whitespaces. |
|
Chapter 3.39: “ZUGFeRD” describes how to compare the visible content of a PDF document with the invisible ZUGFeRD data. |
If you are looking for a text on the first page of a letter, test it this way:
@Test public void hasText_OnFirstPage() throws Exception { String filename = "documentUnderTest.pdf"; AssertThat.document(filename) .restrictedTo(FIRST_PAGE) .hasText() .containing("Content on first page.") ; }
The next example searches a text on the last page:
@Test public void hasText_OnLastPage() throws Exception { String filename = "documentUnderTest.pdf"; AssertThat.document(filename) .restrictedTo(LAST_PAGE) .hasText() .containing("Content on last page.") ; }
Also, you can tests individual pages:
@Test public void hasText_OnIndividualPages() throws Exception { String filename = "documentUnderTest.pdf"; PagesToUse pages23 = PagesToUse.getPages(2, 3); AssertThat.document(filename) .restrictedTo(pages23) .hasText() .containing("Content on") ; }
Using the method |
Several constants are available for typical pages, e.g. FIRST_PAGE, LAST_PAGE, EVEN_PAGES
und ODD_PAGES
.
Chapter
13.2: “Page Selection”
describes more constants and how to use them.
There are three constants for searching text on multiple pages:
ANY_PAGE
, EACH_PAGE
and
EVERY_PAGE
. The last two are functionally identical.
Together they enable a higher linguistic flexibility.
@Test public void hasText_OnEveryPage() throws Exception { String filename = "documentUnderTest.pdf"; AssertThat.document(filename) .retricted(EVERY_PAGE) .hasText() .startingWith("PDFUnit") ; }
@Test public void hasText_OnAnyPage() throws Exception { String filename = "documentUnderTest.pdf"; AssertThat.document(filename) .restrictedTo(ANY_PAGE) .hasText() .containing("Page # 3") ; }
The constants EVERY_PAGE
and EACH_PAGE
require that the text really exists on each page.
When you use the constant ANY_PAGE
, a test is successful
if the expected text exists on one or more pages.
Text can be searched not only on whole pages, but also in a region of a page. Chapter 3.32: “Text - in Page Regions” describes that topic.
Do you need to know that an expected text can be found on every page except the first page? Such a test looks like this:
@Test public void hasText_OnAllPagesAfter3() throws Exception { String filename = "documentUnderTest.pdf"; PagesToUse pagesAfter3 = ON_EVERY_PAGE.after(3); AssertThat.document(filename) .restrictedTo(pagesAfter3) .hasText() .containing("Content") ; }
Page numbers start from “1”.
Invalid page limits are not necessarily an error. In the following example, the text is searched for on all pages between 1 and 99 (exclusive). Although the document has only 4 pages, the test ends successfully because the expected string is found on page 1:
/** * Attention: The document has the search token on page 1. * And '1' is before '99'. So, this test ends successfully. */ @Test public void hasText_OnAnyPageBefore_WrongUpperLimit() throws Exception { String filename = "documentUnderTest.pdf"; AssertThat.document(filename) .restrictedTo(AnyPage.before(99)) .hasText() .containing("Content on") ; }
Text to be validated may extend over 2 or more pages, but be interrupted by the header and footer on each page. Such text can be tested as follows:
@Test public void hasText_SpanningOver2Pages() throws Exception { String filename = "documentUnderTest.pdf"; String textOnPage1 = "Text starts on page 1 and "; String textOnPage2 = "continues on page 2"; String expectedText = textOnPage1 + textOnPage2; PagesToUse pages1to2 = PagesToUse.spanningFrom(1).to(2); // Define the section without header and footer: int leftX = 18; int upperY = 30; int width = 182; int height = 238; PageRegion regionWithoutHeaderAndFooter = new PageRegion(leftX, upperY, width, height); AssertThat.document(filename) .restrictedTo(pages1to2) .restrictedTo(regionWithoutHeaderAndFooter) .hasText() .containing(expectedText) ; }
Other methods can be used instead of the method .containing()
to compare text.
The absence of text can also be a test objective, particularly in a certain region of a page. Tests for that are follow common speech patterns:
@Test public void hasText_NotMatchingRegex() throws Exception { String filename = "documentUnderTest.pdf"; PagesToUse page2 = PagesToUse.getPage(2); PageRegion region = new PageRegion(70, 80, 90, 60); AssertThat.document(filename) .restrictedTo(page2) .restrictedTo(region) .hasNoText() ; }
When searching text, line breaks and other whitespaces are ignored in the expected text as well as in the text being tested. In the following example the text to be searched belongs to the document “Digital Signatures for PDF Documents” from Bruno Lowagie (iText). The first chapter has some line breaks:
The following tests for the marked text use different line breaks. They both succeed because whitespaces will be normalized by default:
/** * The expected search string does not contain a line break. */ @Test public void hasText_LineBreakInPDF() throws Exception { String filename = "digitalsignatures20121017.pdf"; String text = "The technology was conceived"; AssertThat.document(filename) .restrictedTo(FIRST_PAGE) .hasText() .containing(text) ; }
/** * The expected search string intentionally contains other line breaks. */ @Test public void hasText_LineBreakInExpectedString() throws Exception { String filename = "digitalsignatures20121017.pdf"; String text = "The " + "\n " + "technology " + "\n " + "was " + "\n " + "conceived"; AssertThat.document(filename) .restrictedTo(FIRST_PAGE) .hasText() .containing(text) ; }
If a normalization of whitespace is not desired, most methods allow an additional parameter defining the intended whitespace handling. The following constants are available:
// Constants to define whitespace processing:
WhitespaceProcessing.IGNORE
WhitespaceProcessing.NORMALIZE
WhitespaceProcessing.KEEP
You can verify that your PDF document does not have empty pages:
@Test public void hasText_AnyPageEmpty() throws Exception { String filename = "documentUnderTest.pdf"; AssertThat.document(filename) .hasText() ; }
It is annoying to write a separate test for every expected text.
So, it is possible to invoke the methods containing(..)
and
notContaining(..)
with an array of expected texts:
@Test public void hasText_Containing_MultipleTokens() throws Exception { String filename = "documentUnderTest.pdf"; AssertThat.document(filename) .restrictedTo(ODD_PAGES) .hasText() .containing("on", "page", "odd pagenumber") // multiple search tokens ; }
@Test public void hasText_NotContaining_MultipleTokens() throws Exception { String filename = "documentUnderTest.pdf"; AssertThat.document(filename) .restrictedTo(FIRST_PAGE) .hasText() .notContaining("even pagenumber", "Page #2") ; }
The first example is successful when all expected tokens are found, and the second test is successful when none of the expected tokens are found.
All chained test methods refer to the same pages:
@Test public void hasText_MultipleInvocation() throws Exception { String filename = "documentUnderTest.pdf"; AssertThat.document(filename) .restrictedTo(ANY_PAGE) .hasText() .startingWith("PDFUnit") .containing("Content on last page.") .matchingRegex(".*[Cc]ontent.*") .endingWith("of 4") ; }
The visible sequence of text on a PDF page does not necessarily correspond to the text sequence within the PDF document. The next screenshot shows a text frame which is not part of the 'normal' text of the page body. That is the reason why the next test succeeds:
@Test public void hasText_TextNotInVisibleOrder() throws Exception { String filename = "documentUnderTest.pdf"; String firstAndLastLine = "Content at the beginning. Content at the end."; AssertThat.document(filename) .restrictedTo(FIRST_PAGE) .hasText() .containing(firstAndLastLine) ; }
If you imagine removing the border of the frame, it seems that the text inside the frame is the middle part of one text block. However, a test expecting the complete text would fail.