PDFUnit can extract text from images and can validate this text in the
same way as normal text. The syntax of these OCR tests follows natural
language as much as possible. Each OCR test starts with the method chain
hasImage().withText() or hasImage().withTextInRegion().
The following methods can be used to validate text in images:
// Tests for text in images:
.hasImage().withText().containing(..)
.hasImage().withText().endingWith(..)
.hasImage().withText().equalsTo(..)
.hasImage().withText().matchesRegex(..)
.hasImage().withText().startingWith(..)

// Tests for text in parts of an image:
.hasImage().withTextInRegion(imageRegion).containing(..)
.hasImage().withTextInRegion(imageRegion).endingWith(..)
.hasImage().withTextInRegion(imageRegion).equalsTo(..)
.hasImage().withTextInRegion(imageRegion).matchesRegex(..)
.hasImage().withTextInRegion(imageRegion).startingWith(..)
The text recognition function uses the OCR processor Tesseract.
The following example uses a PDF file which contains an image showing the text of the novel 'Little Lord Fauntleroy'. The image has a slightly coloured background.
@Test
public void hasImageWithText() throws Exception {
  String filename = "ocr_little-lord-fauntleroy.pdf";
  int leftX  =  10; // millimeter
  int upperY =  35;
  int width  = 160;
  int height = 135;
  PageRegion pageRegion = new PageRegion(leftX, upperY, width, height);
  String expectedText = "Cedric himself knew nothing whatever about it.";

  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .restrictedTo(pageRegion)
            .hasImage()
            .withText()
            .containing(expectedText)
  ;
}
If you look at the image, you can see a line break after the word 'nothing'. Despite this line break, the test succeeds because PDFUnit removes all whitespace before comparing the OCR text with the expected text.
Steps in the normalization of OCR text:
Characters are converted to lower case
All whitespace is deleted
12 different hyphen/dash characters are deleted
10 different underscore characters are deleted
Punctuation characters are deleted
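The steps above can be approximated with a few regular expressions. The following is a minimal sketch and not PDFUnit's actual implementation: the class OcrTextNormalizer and its methods are hypothetical, and the Unicode categories Pd (dash punctuation) and Pc (connector punctuation, which includes the underscore) only stand in for the 12 hyphen/dash and 10 underscore characters, whose exact list is not given here.

```java
import java.util.Locale;

// Illustrative only: approximates the normalization steps listed above.
// The exact character sets used by PDFUnit are an assumption here.
final class OcrTextNormalizer {

    private OcrTextNormalizer() {
    }

    static String normalize(String text) {
        return text
            .toLowerCase(Locale.ROOT)     // 1. convert to lower case
            .replaceAll("\\s+", "")       // 2. delete whitespace, including line breaks
            .replaceAll("\\p{Pd}", "")    // 3. delete hyphens and dashes (assumed: category Pd)
            .replaceAll("\\p{Pc}", "")    // 4. delete underscores (assumed: category Pc)
            .replaceAll("\\p{P}", "");    // 5. delete remaining punctuation
    }

    // Mimics a "containing" check on normalized text.
    static boolean containsNormalized(String ocrText, String expectedText) {
        return normalize(ocrText).contains(normalize(expectedText));
    }

    public static void main(String[] args) {
        // OCR output with a line break after "nothing", as in the example image.
        String ocrText = "Cedric himself knew nothing\nwhatever about it.";
        String expectedText = "Cedric himself knew nothing whatever about it.";

        System.out.println(normalize(ocrText));
        // prints cedrichimselfknewnothingwhateveraboutit
        System.out.println(containsNormalized(ocrText, expectedText));
        // prints true
    }
}
```

Because both sides are normalized before comparison, differences in case, line breaks and punctuation between the OCR result and the expected text do not affect the outcome in this sketch.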
The result of text recognition can be improved by "training" the OCR processor. Language-specific training data can be downloaded from https://github.com/tesseract-ocr/tessdata.
Sometimes an expected text should be located in a certain region of an image. You can define an image region to handle such a requirement:
@Test
public void hasImageWithTextInRegion() throws Exception {
  String filename = "ocr_little-lord-fauntleroy.pdf";
  int leftX  =  10; // millimeter
  int upperY =  35;
  int width  = 160;
  int height = 135;
  PageRegion pageRegion = new PageRegion(leftX, upperY, width, height);

  int imgLeftX  = 250; // pixel
  int imgUpperY =  90;
  int imgWidth  = 130;
  int imgHeight =  30;
  ImageRegion imageRegion = new ImageRegion(imgLeftX, imgUpperY, imgWidth, imgHeight);

  String expectedText = "Englishman";

  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .restrictedTo(pageRegion)
            .hasImage()
            .withTextInRegion(imageRegion)
            .containing(expectedText)
  ;
}
The unit for image region values is always pixels, since images in PDFs may be scaled. Using millimeters could therefore lead to incorrect measurements. To find the right values for an image region, extract all images from the PDF and use a simple image processing tool to get the values for the desired region. PDFUnit provides the tool ExtractImages to extract images. Chapter 9.7: “Extract Images from PDF” explains how to use it.
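One way to double-check candidate region values is to crop the extracted image to that region and look at the result. The sketch below uses only the Java standard library (BufferedImage.getSubimage and ImageIO). The class ImageRegionPreview and its methods are hypothetical and not part of PDFUnit, and the sketch assumes the image has already been extracted to a file in a format that ImageIO can read.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: crops an extracted image to a pixel region so the
// values intended for an image region can be checked visually.
final class ImageRegionPreview {

    private ImageRegionPreview() {
    }

    // All values are pixels, measured from the upper left corner of the image.
    // getSubimage throws a RasterFormatException if the region leaves the image.
    static BufferedImage crop(BufferedImage image, int leftX, int upperY, int width, int height) {
        return image.getSubimage(leftX, upperY, width, height);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ImageRegionPreview <extracted-image> <output.png>");
            return;
        }
        BufferedImage image = ImageIO.read(new File(args[0]));
        if (image == null) {
            throw new IOException("Unsupported image format: " + args[0]);
        }
        // Same pixel values as in the example test above.
        BufferedImage region = crop(image, 250, 90, 130, 30);
        ImageIO.write(region, "png", new File(args[1]));
    }
}
```

If the cropped file shows exactly the expected word, the same four pixel values can be passed to the image region in the test.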
Watermarks and some other text in images may be intentionally rotated or flipped. Such text can be validated using the following methods:
// Methods to rotate and flip images before OCR processing:
.hasImage().flipped(FlipDirection).withText()...
.hasImage().rotatedBy(Rotation).withText()...
The horribly mangled text in the next image can be validated.
The text in this image is rotated 270 degrees and flipped vertically. If you know these values, you can check the text:
@Test
public void testFlippedAndRotated() throws Exception {
  String filename = "image-with-rotated-and-flipped-text.pdf";
  int leftX  = 80; // in millimeter
  int upperY = 65;
  int width  = 50;
  int height = 75;
  PageRegion pageRegion = new PageRegion(leftX, upperY, width, height);
  String expectedText = "text rotated 270 and flipped vertically";

  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .restrictedTo(pageRegion)
            .hasImage()
            .rotatedBy(Rotation.DEGREES_270)
            .flipped(FlipDirection.VERTICAL)
            .withText()
            .equalsTo(expectedText)
  ;
}
Allowed values for rotation or flipping are:
Rotation.DEGREES_0
Rotation.DEGREES_90
Rotation.DEGREES_180
Rotation.DEGREES_270

FlipDirection.NONE
FlipDirection.HORIZONTAL
FlipDirection.VERTICAL
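To illustrate what rotating and flipping an image before OCR processing involves, here is a minimal sketch in plain Java. It is not PDFUnit's implementation: the class ImageOrientation and its methods are hypothetical, the rotation direction is assumed to be clockwise, and a vertical flip is assumed to mirror the image from top to bottom.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper showing rotation in 90 degree steps and flipping
// by copying pixels. Direction conventions are assumptions.
final class ImageOrientation {

    private ImageOrientation() {
    }

    // Rotates clockwise by quarterTurns * 90 degrees.
    static BufferedImage rotateClockwise(BufferedImage src, int quarterTurns) {
        int turns = ((quarterTurns % 4) + 4) % 4;
        BufferedImage result = src;
        for (int i = 0; i < turns; i++) {
            result = rotate90(result);
        }
        return result;
    }

    private static BufferedImage rotate90(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        // Width and height swap places after a quarter turn.
        BufferedImage dst = new BufferedImage(h, w, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                dst.setRGB(h - 1 - y, x, src.getRGB(x, y));
            }
        }
        return dst;
    }

    // Mirrors the image from top to bottom.
    static BufferedImage flipVertical(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                dst.setRGB(x, h - 1 - y, src.getRGB(x, y));
            }
        }
        return dst;
    }

    // Mirrors the image from left to right.
    static BufferedImage flipHorizontal(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                dst.setRGB(w - 1 - x, y, src.getRGB(x, y));
            }
        }
        return dst;
    }
}
```

Under these assumptions, an image whose text is rotated and flipped is brought back to a readable orientation by applying the matching transformations before the pixels are handed to the OCR processor.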