Chapter 10. Validation Constraints in Excel Files

Rules for validating PDF documents can be written in Excel files. The structure of these files and the available validation functions are described in the following sections.

Structure of an Excel File

An Excel file is searched for sheets with the following names:

Excel Sheet Name    Comment
regions             Definitions of page regions
check               Definition of test cases for individual PDF documents
compare             Definition of test cases comparing the document under test with a reference document

The expected structure of all three sheets is described in the following sections. In all three sheets, an asterisk '*' in the first column marks a line as a comment.

The order of the columns must not be changed. Additional columns after the expected ones are allowed. Additional sheets are also allowed.

A sheet can have empty lines. However, when a sheet has too many of them, the last lines with data may not be read. It is therefore safer to put the character '*' in the first column of each otherwise empty line.
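The comment convention can be illustrated with a minimal Python sketch. It assumes the rows of a sheet have already been read into lists of cell values, and it treats any first cell that starts with '*' as a comment marker. The function names are hypothetical and are not part of PDFUnit.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]


def is_comment(row: Row) -> bool:
    """Return True when the first cell of the row starts with '*'.

    Assumption: a marker followed by more text (e.g. '* disabled')
    also counts as a comment.
    """
    first = row[0] if row else None
    return isinstance(first, str) and first.strip().startswith("*")


def data_rows(rows: List[Row]) -> List[Row]:
    """Keep only the rows that are not commented out."""
    return [row for row in rows if not is_comment(row)]
```

For example, data_rows([["r1", "10"], ["*", "note"]]) keeps only the first row.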

Definition of Page Regions - Sheet 'regions'

Usually you will need to restrict a validation rule to a region of a page rather than apply it to the whole page. For example, comparing the full-page text of two documents makes no sense when the text contains a date. PDFUnit therefore requires each test case to reference a defined page region.

A page region is defined by four values: the x and y coordinates of its upper-left corner, its width, and its height. All values are interpreted in millimeters. Decimal values are allowed, but PDFUnit rounds them to integer values. The following image shows examples:

In addition to the four columns with numeric values, the sheet contains the column 'id'. Each ID has to be unique and is referenced by the test case definitions in the Excel sheets 'check' and 'compare'.
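A region definition can be modelled as in the following Python sketch. The class and function names are hypothetical. The sketch assumes rounding with Python's built-in round(), which may differ from the rounding PDFUnit applies, and it adds a conversion to PDF points (1 point = 1/72 inch, 1 inch = 25.4 mm) purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


@dataclass(frozen=True)
class Region:
    """A page region in millimeters, measured from the upper-left corner."""
    id: str
    x: int
    y: int
    width: int
    height: int

    @classmethod
    def from_row(cls, region_id, x, y, width, height) -> "Region":
        # Assumption: decimals are rounded with Python's round().
        return cls(region_id, round(float(x)), round(float(y)),
                   round(float(width)), round(float(height)))

    def to_points(self) -> Tuple[float, float, float, float]:
        """Convert x, y, width, height from millimeters to PDF points."""
        return tuple(v * POINTS_PER_INCH / MM_PER_INCH
                     for v in (self.x, self.y, self.width, self.height))


def index_regions(regions: Iterable[Region]) -> Dict[str, Region]:
    """Build an id -> region lookup and reject duplicate IDs."""
    index: Dict[str, Region] = {}
    for region in regions:
        if region.id in index:
            raise ValueError(f"duplicate region id: {region.id}")
        index[region.id] = region
    return index
```

The lookup returned by index_regions is what a test case in the sheets 'check' or 'compare' would use to resolve the value of its 'region' column.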

Test Cases for Single PDF Documents - Sheet 'check'

The sheet 'check' must be used to define test cases which relate to a single PDF document. It does not cover test cases in which two documents are compared with each other. The existing columns are:

Column Name       Comment
id                Name (ID) of the test case
pages             Pages to which the test is applied
region            Name of a page region that is defined in the Excel sheet 'regions'
constraint        Kind of validation. The allowed values are described below.
expected value    The expected value, if a validation needs one.
whitespace        How whitespace is handled. The allowed values are described below.
message           An error message, optionally with placeholders. The placeholders are also described below.

Pages to which a Test Case is Restricted - Column 'pages'

A test case definition is often restricted to individual pages of a document. The following list shows all available syntax elements:

Pages                                        Syntax in Excel
a single page                                1
multiple, individual pages                   1, 3, 5
all pages                                    all
all pages after a given page (inclusive)     2...
all pages before a given page (inclusive)    ...5
all pages between (inclusive)                2...5

Two page numbers must be separated by a blank. A comma is optional.
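The page syntax can be made concrete with a small parser. The following Python sketch is not PDFUnit's implementation. It assumes that the total page count of the document is known, that ranges and single pages may be mixed in one cell, and that page numbers outside the document are silently dropped. The function name parse_pages is hypothetical.

```python
import re
from typing import List


def parse_pages(spec: str, page_count: int) -> List[int]:
    """Expand a 'pages' cell into a sorted list of page numbers."""
    spec = spec.strip()
    if spec == "all":
        return list(range(1, page_count + 1))

    pages = set()
    # Page numbers are separated by a blank; a preceding comma is optional.
    for token in re.split(r",?\s+", spec):
        if "..." in token:
            start, _, end = token.partition("...")
            first = int(start) if start else 1          # '...5' starts at page 1
            last = int(end) if end else page_count      # '2...' ends at the last page
            pages.update(range(first, last + 1))
        else:
            pages.add(int(token))

    # Assumption: pages outside the document are ignored.
    return sorted(p for p in pages if 1 <= p <= page_count)
```

With a ten-page document, parse_pages("2...5", 10) yields the pages 2, 3, 4 and 5, and parse_pages("1, 3, 5", 10) yields 1, 3 and 5.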

Different Ways of Comparing Text - Column 'constraint'

Values in the column 'constraint' in the sheet 'check' are used for specifying how the actual content of a document and the expected text will be compared. The following list shows the allowed values:

Keyword                Behaviour
'must contain'         The text in the column 'expected value' must be part of the page region. Additionally, this constraint type requires whitespace handling information.
'must not contain'     The text in the column 'expected value' must not exist in the page region of the document. Additionally, this constraint type requires whitespace handling information.
'must be empty'        The referenced region must not contain any text.
'must not be empty'    The referenced region must have text.
'must match'           The text in the column 'expected value' will be taken as a regular expression and executed against the text in the referenced region. At least one piece of text must match.
'must not match'       The text in the column 'expected value' will be executed as a regular expression against the text in the referenced region. The test is successful if no match is found.

The column 'constraint' must not be empty. If it is empty, PDFUnit reports an error.

The sheet 'compare' allows other values in the column 'constraint'. Those values are described below in the section on the sheet 'compare'.
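The six text constraints of the sheet 'check' can be summarized in a short Python sketch. It is an illustration only: the function name check_text is hypothetical, whitespace handling is left out, a region containing only whitespace is assumed to count as empty, and the regular expressions are evaluated with Python's re module, whose dialect may differ from the one PDFUnit uses.

```python
import re


def check_text(constraint: str, region_text: str, expected: str = "") -> bool:
    """Return True when the region text satisfies the constraint."""
    if constraint == "must contain":
        return expected in region_text
    if constraint == "must not contain":
        return expected not in region_text
    if constraint == "must be empty":
        # Assumption: whitespace-only text counts as empty.
        return region_text.strip() == ""
    if constraint == "must not be empty":
        return region_text.strip() != ""
    if constraint == "must match":
        # One matching piece of text anywhere in the region is enough.
        return re.search(expected, region_text) is not None
    if constraint == "must not match":
        return re.search(expected, region_text) is None
    # An empty or unknown constraint is an error.
    raise ValueError(f"unknown constraint: {constraint!r}")
```

For example, check_text("must match", "Invoice 2024-05-01", r"\d{4}-\d{2}-\d{2}") succeeds because the region contains a piece of text that matches the date pattern.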

Validate Signatures and Images - Column 'constraint'

The column 'constraint' in the sheet 'check' can also contain keywords for validating signatures or images in a document:

Keyword                   Behaviour
'is signed'               The PDF document has to be signed.
'is signed by'            A PDF document has to be signed. The expected name of the signatory has to be put into the column 'expected value'.
'has number of images'    The number of all visible images in the referenced page region will be compared with the number of the expected images, which must be put into the column 'expected value'.

Handling Whitespace - Column 'whitespace'

Text comparisons may fail due to differences in the whitespace. For example, text which is rendered in different fonts may have line breaks at different positions.

Keyword        Behaviour
'ignore'       All whitespace is removed from the text before two strings are compared.
'keep'         Whitespace is taken 'as is'.
'normalize'    Whitespace at the beginning and the end of a text is deleted. Multiple whitespace characters between words are reduced to one blank.

Wrong values in the column 'whitespace' will cause an error message. If the 'whitespace' column is left blank, the program defaults to 'normalize'.

Some validations, such as comparing bookmarks, are independent of whitespace. For such validations, any declaration of whitespace handling is ignored.
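The three whitespace modes can be sketched in Python as follows. The function name apply_whitespace is hypothetical, and the sketch assumes that 'whitespace' means any character matched by the regular expression \s (blanks, tabs and line breaks).

```python
import re


def apply_whitespace(text: str, mode: str = "") -> str:
    """Prepare a text for comparison according to the 'whitespace' column."""
    mode = mode.strip() or "normalize"   # a blank cell defaults to 'normalize'
    if mode == "ignore":
        return re.sub(r"\s+", "", text)  # remove all whitespace
    if mode == "keep":
        return text                      # leave the text unchanged
    if mode == "normalize":
        # Trim both ends and reduce inner whitespace runs to one blank.
        return re.sub(r"\s+", " ", text).strip()
    raise ValueError(f"invalid whitespace value: {mode!r}")
```

Both the actual text and the expected text would be passed through the same function before they are compared, so that a line break at a different position no longer causes a failure in the modes 'ignore' and 'normalize'.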

Expected Value - Column 'expected value'

For a test case to verify that a region of a test document contains an expected value, the Excel file must provide a column for that value. This column is named 'expected value'.

If the 'constraint' column has the value 'must match' or 'must not match', the content of the 'expected value' column is used as a regular expression. More information about regular expressions can be found online, for example on Wikipedia.

If the column 'constraint' has the value 'has number of images', then the content of the 'expected value' column will be parsed to an integer value.
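How the content of the 'expected value' column depends on the constraint can be sketched as follows. The function name interpret_expected is hypothetical, the cell content is assumed to arrive as a string, and the regular expression is compiled with Python's re module, which may not accept exactly the same syntax as PDFUnit.

```python
import re
from typing import Pattern, Union


def interpret_expected(constraint: str, raw: str) -> Union[str, int, Pattern[str]]:
    """Convert the raw 'expected value' cell according to the constraint."""
    if constraint in ("must match", "must not match"):
        return re.compile(raw)      # the cell holds a regular expression
    if constraint == "has number of images":
        return int(raw.strip())     # the cell holds an integer
    return raw                      # otherwise the cell holds plain text
```

An invalid regular expression raises re.error and a non-numeric image count raises ValueError, so a malformed cell is detected before any document is validated.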

Error Messages with Placeholders - Column 'message'

Individual error messages can be defined in the Excel sheets. These messages are shown in addition to PDFUnit's validation messages. An error message in the Excel sheet may have placeholders for runtime data. The following image shows some examples:

As the image shows, placeholders in a message are enclosed in curly brackets. The following placeholders can be used:

Placeholder     Meaning
{id}            The ID of the current test case
{pages}         The number of the page on which the error was detected
{region}        The value in the column 'region'
{constraint}    The value in the column 'constraint'

Placeholders can be used anywhere inside a text. At runtime the values of the placeholders are enclosed in single quotation marks, so there is no need to use single quotes in the error messages in the Excel sheet.
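The substitution of placeholders can be sketched in Python. The function name format_message and the shape of the values dictionary are assumptions made for this illustration. Placeholders without a runtime value are left unchanged here, which may differ from PDFUnit's behaviour.

```python
import re
from typing import Mapping

PLACEHOLDERS = ("id", "pages", "region", "constraint")
_PATTERN = re.compile(r"\{(" + "|".join(PLACEHOLDERS) + r")\}")


def format_message(template: str, values: Mapping[str, object]) -> str:
    """Replace each known placeholder with its value in single quotes."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            return match.group(0)        # no value available: keep as is
        return "'" + str(values[name]) + "'"

    return _PATTERN.sub(substitute, template)
```

The message "Test {id} failed on page {pages}." with the values id = T1 and pages = 2 becomes "Test 'T1' failed on page '2'.".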

Test Cases Comparing Two Documents - Sheet 'compare'

This Excel sheet can be used to declare validation rules for comparing two PDF documents. One document is the 'document under test' and the second document is a reference document.

For comparative testing no information about an expected text need be given in the Excel sheet, so the column 'expected value' is not provided in the sheet 'compare'.

The meaning of columns is the same as described in the above sections for the sheet 'check'. But in the 'compare' sheet, other values are allowed in the 'constraint' column:

Keyword              Behaviour
'same text'          Two PDF documents must have the same text in the referenced region. Additionally, this constraint type requires whitespace handling information.
'same appearance'    The referenced regions of the two PDF documents must be identical when compared as rendered images.
'same bookmarks'     The two PDF documents must have the same bookmarks. Obviously this validation does not require page regions, but for technical reasons the column 'region' must not be empty. Instead, the value 'NO_REGION' should be supplied.

The image below shows a test comparing bookmarks of a 'PDF under test' with a reference PDF:

PDFUnit looks for the reference document in a subdirectory of the folder that contains the PDF under test. The subdirectory must be named 'reference', and the reference document must have the same filename as the PDF under test.
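The location of the reference document follows directly from this rule. A minimal Python sketch, with the hypothetical function name reference_path:

```python
from pathlib import Path


def reference_path(pdf_under_test) -> Path:
    """Return the expected location of the reference document.

    The reference lies in the subdirectory 'reference' next to the
    PDF under test and has the same filename.
    """
    pdf = Path(pdf_under_test)
    return pdf.parent / "reference" / pdf.name
```

For a document under test at out/invoice.pdf, the reference document is expected at out/reference/invoice.pdf.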

Error Messages at Runtime

The validation of a PDF document does not end with the first detected failure. All defined rules of an Excel file are processed, and then an error message is created for each detected failure.
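The described behaviour, processing all rules and reporting every failure, can be sketched as follows. The names run_all and Rule and the wording of the messages are assumptions made for this illustration. Each rule is modelled as an ID together with a function that returns True on success.

```python
from typing import Callable, List, Tuple

Rule = Tuple[str, Callable[[], bool]]


def run_all(rules: List[Rule]) -> List[str]:
    """Run every rule and return one message per failed rule."""
    failures: List[str] = []
    for rule_id, check in rules:
        # Do not stop at the first failure; record it and continue.
        if not check():
            failures.append(f"Rule '{rule_id}' failed")
    return failures
```

A caller would report all collected messages at the end and treat the document as valid only when the returned list is empty.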