3.30. Text

Overview

The most common test case for PDF documents is probably to check the presence of expected text. Various methods can be used:

// Testing page content:
.hasText()      // pages has to be specified before

// Validating expected text:
.hasText().containing(..) 
.hasText().containing(.., WhitespaceProcessing)    
.hasText().endingWith(..)
.hasText().endingWith(.., WhitespaceProcessing)
.hasText().equalsTo(..) 
.hasText().equalsTo(.., WhitespaceProcessing)
.hasText().matchingRegex(..) 
.hasText().startingWith(..) 

// Prove the absence of defined text:
.hasText().notContaining(..) 
.hasText().notContaining(.., WhitespaceProcessing)
.hasText().notEndingWith(..)
.hasText().notMatchingRegex(..) 
.hasText().notStartingWith(..)

// Validate multiple text in an expected order:
.hasText().inOrder(..)
.hasText().containingFirst(..).then(..)

// Comparing visible text with ZUGFeRD data:
.hasText.containingZugferdData(..)

	Chapter 13.5: “Whitespace Processing” describes the different options to handle whitespaces.
	Chapter 3.39: “ZUGFeRD” describes how to compare the visible content of a PDF document with the invisible ZUGFeRD data.

Text on Individual Pages

If you are looking for a text on the first page of a letter, test it this way:

@Test
public void hasText_OnFirstPage() throws Exception {
  String filename = "documentUnderTest.pdf";

  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .hasText() 
            .containing("Content on first page.") 
  ;
}

The next example searches a text on the last page:

@Test
public void hasText_OnLastPage() throws Exception {
  String filename = "documentUnderTest.pdf";

  AssertThat.document(filename)
            .restrictedTo(LAST_PAGE)
            .hasText() 
            .containing("Content on last page.")
  ;
}

Also, you can tests individual pages:

@Test
public void hasText_OnIndividualPages() throws Exception {
  String filename = "documentUnderTest.pdf";
  PagesToUse pages23 = PagesToUse.getPages(2, 3);  

  AssertThat.document(filename)
            .restrictedTo(pages23)
            .hasText() 
            .containing("Content on")
  ;
}

Using the method getPages(Integer[]) any individual combination of pages can be used. For a single page you can use the method PagesToUse.getPage(int).

Several constants are available for typical pages, e.g. FIRST_PAGE, LAST_PAGE, EVEN_PAGES und ODD_PAGES. Chapter 13.2: “Page Selection” describes more constants and how to use them.

Text on All Pages

There are three constants for searching text on multiple pages: ANY_PAGE, EACH_PAGE and EVERY_PAGE. The last two are functionally identical. Together they enable a higher linguistic flexibility.

@Test 
public void hasText_OnEveryPage() throws Exception {
  String filename = "documentUnderTest.pdf";
  
  AssertThat.document(filename)
            .retricted(EVERY_PAGE)
            .hasText()
            .startingWith("PDFUnit")
  ;
}

@Test  
public void hasText_OnAnyPage() throws Exception {
  String filename = "documentUnderTest.pdf";

  AssertThat.document(filename)
            .restrictedTo(ANY_PAGE)
            .hasText() 
            .containing("Page # 3") 
  ;
}

The constants EVERY_PAGE and EACH_PAGE require that the text really exists on each page. When you use the constant ANY_PAGE, a test is successful if the expected text exists on one or more pages.

Text in Parts of a Page

Text can be searched not only on whole pages, but also in a region of a page. Chapter 3.32: “Text - in Page Regions” describes that topic.

Individual Pages with Upper and Lower Limit

Do you need to know that an expected text can be found on every page except the first page? Such a test looks like this:

@Test
public void hasText_OnAllPagesAfter3() throws Exception {
  String filename = "documentUnderTest.pdf";
  PagesToUse pagesAfter3 = ON_EVERY_PAGE.after(3);   

  AssertThat.document(filename)
            .restrictedTo(pagesAfter3)
            .hasText() 
            .containing("Content")
  ;
}

The value '3' is an exclusive lower bound. Validation starts at page 4.

Page numbers start from “1”.

Invalid page limits are not necessarily an error. In the following example, the text is searched for on all pages between 1 and 99 (exclusive). Although the document has only 4 pages, the test ends successfully because the expected string is found on page 1:

/**
 * Attention: The document has the search token on page 1. 
 * And '1' is before '99'. So, this test ends successfully.
 */
@Test
public void hasText_OnAnyPageBefore_WrongUpperLimit() throws Exception {
  String filename = "documentUnderTest.pdf";
  
  AssertThat.document(filename)
            .restrictedTo(AnyPage.before(99))
            .hasText()
            .containing("Content on")
  ;
}

Validate Text That Spans Over Multiple Pages

Text to be validated may extend over 2 or more pages, but be interrupted by the header and footer on each page. Such text can be tested as follows:


@Test
public void hasText_SpanningOver2Pages() throws Exception {
  String filename = "documentUnderTest.pdf";
  String textOnPage1 = "Text starts  on page 1 and ";
  String textOnPage2 = "continues on page 2";
  String expectedText = textOnPage1 + textOnPage2;
  PagesToUse pages1to2 = PagesToUse.spanningFrom(1).to(2);
  
  // Define the section without header and footer:
  int leftX  = 18;
  int upperY = 30;
  int width  = 182;
  int height = 238;
  PageRegion regionWithoutHeaderAndFooter = new PageRegion(leftX, upperY, width, height);
  
  AssertThat.document(filename)
            .restrictedTo(pages1to2)
            .restrictedTo(regionWithoutHeaderAndFooter)
            .hasText() 
            .containing(expectedText)
  ;
}

Other methods can be used instead of the method .containing() to compare text.

Negated Search

The absence of text can also be a test objective, particularly in a certain region of a page. Tests for that are follow common speech patterns:

@Test
public void hasText_NotMatchingRegex() throws Exception {
  String filename = "documentUnderTest.pdf";
    
  PagesToUse page2 = PagesToUse.getPage(2);
  PageRegion region = new PageRegion(70, 80, 90, 60);
  AssertThat.document(filename)
            .restrictedTo(page2)
            .restrictedTo(region)
            .hasNoText()
  ;
}

Line Breaks in Text

When searching text, line breaks and other whitespaces are ignored in the expected text as well as in the text being tested. In the following example the text to be searched belongs to the document “Digital Signatures for PDF Documents” from Bruno Lowagie (iText). The first chapter has some line breaks:

The following tests for the marked text use different line breaks. They both succeed because whitespaces will be normalized by default:

/**
 * The expected search string does not contain a line break. 
 */
@Test
public void hasText_LineBreakInPDF() throws Exception {
  String filename = "digitalsignatures20121017.pdf";
  String text = "The technology was conceived";

  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .hasText() 
            .containing(text)
  ;
}

/**
 * The expected search string intentionally contains other line breaks. 
 */
@Test
public void hasText_LineBreakInExpectedString() throws Exception {
  String filename = "digitalsignatures20121017.pdf";
  String text = "The " +
                "\n " +
                "technology " +
                "\n " +
                "was " +
                "\n " +
                "conceived";
                
  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .hasText() 
            .containing(text)
  ;
}

If a normalization of whitespace is not desired, most methods allow an additional parameter defining the intended whitespace handling. The following constants are available:

// Constants to define whitespace processing:
WhitespaceProcessing.IGNORE
WhitespaceProcessing.NORMALIZE
WhitespaceProcessing.KEEP

No Empty Pages

You can verify that your PDF document does not have empty pages:

@Test
public void hasText_AnyPageEmpty() throws Exception {
  String filename = "documentUnderTest.pdf";
  
  AssertThat.document(filename)
            .hasText()
  ;
}

Multiple Search Tokens

It is annoying to write a separate test for every expected text. So, it is possible to invoke the methods containing(..) and notContaining(..) with an array of expected texts:

@Test
public void hasText_Containing_MultipleTokens() throws Exception {
  String filename = "documentUnderTest.pdf";
  
  AssertThat.document(filename)
            .restrictedTo(ODD_PAGES)
            .hasText() 
            .containing("on", "page", "odd pagenumber") // multiple search tokens
  ;
}

@Test
public void hasText_NotContaining_MultipleTokens() throws Exception {
  String filename = "documentUnderTest.pdf";
  
  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .hasText() 
            .notContaining("even pagenumber", "Page #2") 
  ;
}

The first example is successful when all expected tokens are found, and the second test is successful when none of the expected tokens are found.

Concatenation of Methods

All chained test methods refer to the same pages:

@Test
public void hasText_MultipleInvocation() throws Exception {
  String filename = "documentUnderTest.pdf";
  
  AssertThat.document(filename)
            .restrictedTo(ANY_PAGE)
            .hasText() 
            .startingWith("PDFUnit")
            .containing("Content on last page.")
            .matchingRegex(".*[Cc]ontent.*")  
            .endingWith("of 4")
  ;
}

Visible Text Order - Potential Problem

The visible sequence of text on a PDF page does not necessarily correspond to the text sequence within the PDF document. The next screenshot shows a text frame which is not part of the 'normal' text of the page body. That is the reason why the next test succeeds:

@Test
public void hasText_TextNotInVisibleOrder() throws Exception {
  String filename = "documentUnderTest.pdf";
  
  String firstAndLastLine = "Content at the beginning. Content at the end.";
                           
  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .hasText()
            .containing(firstAndLastLine) 
             
  ;
}

If you imagine removing the border of the frame, it seems that the text inside the frame is the middle part of one text block. However, a test expecting the complete text would fail.

Prev	Up	Next
3.29. Tagged Documents	Home	3.31. Text - in Images (OCR)