Rules for validating PDF documents can be written in Excel files. Their structure and the provided validation functions are described in the following chapters.
An Excel file is searched for sheets with the following names:
Excel Sheet Name | Comment |
---|---|
regions | Definitions of page regions |
check | Definition of test cases for individual PDF documents |
compare | Definition of test cases comparing the document under test with a reference document |
The expected structure of all three sheets is described in the following sections. In all three sheets, an asterisk '*' in the first column marks a line as a comment.
The order of the columns must not be changed. Additional columns after the expected ones are allowed. Additional sheets are also allowed.
A sheet can have empty lines. However, when a sheet has too many of them, the last lines with data may not be read.
It is therefore better to put the character '*' in the first column of each empty line.
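The comment rule can be illustrated with a short Python sketch. It is not PDFUnit code: the function name `data_rows` and the representation of a sheet as a list of tuples are assumptions made for this example, as is the decision to treat any first cell that starts with '*' as a comment marker.

```python
def data_rows(rows):
    """Yield the rows that carry data, skipping comments and empty lines."""
    for row in rows:
        # Normalise cells to stripped strings so None and "" look the same.
        cells = ["" if c is None else str(c).strip() for c in row]
        if not any(cells):
            continue  # completely empty line
        if cells[0].startswith("*"):
            continue  # commented line (assumption: prefix match on '*')
        yield row


# Hypothetical sheet content: a comment, two data rows, an empty row.
sheet = [
    ("*", "regions used by the invoice tests"),
    ("header", 10, 10, 190, 30),
    (None, None),
    ("* disabled", 0, 0, 1, 1),
    ("footer", 10, 270, 190, 20),
]
```

Calling `list(data_rows(sheet))` on the sample keeps only the `header` and `footer` rows.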
In most cases, a validation rule needs to be restricted to a region of a page rather than applied to the whole page. For example, it makes no sense to compare the full-page text of two documents when the text contains a date. PDFUnit therefore requires that each test case references a defined page region.
A page region is defined by four values: the x and y coordinates of its upper-left corner, plus its width and height. All values are interpreted in millimeters. Values may contain decimals, but PDFUnit rounds them to the next integer. The following image shows examples:
You can see that the sheet contains the column 'id' in addition to the 4 columns with numeric values. This ID has to be unique and is referenced by the test case definitions in the Excel sheets 'check' and 'compare'.
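A minimal sketch of how region definitions and the uniqueness requirement could be modelled follows. The class `Region`, the function `load_regions`, and the column order (id, x, y, width, height) are assumptions for illustration, not PDFUnit's implementation. The sketch does not reproduce the rounding PDFUnit applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    id: str
    x: float       # millimeters, upper-left corner
    y: float       # millimeters, upper-left corner
    width: float   # millimeters
    height: float  # millimeters


def load_regions(rows):
    """Build a lookup table of regions and reject duplicate IDs."""
    regions = {}
    for region_id, x, y, width, height in rows:  # assumed column order
        if region_id in regions:
            raise ValueError(f"duplicate region id: {region_id!r}")
        regions[region_id] = Region(
            region_id, float(x), float(y), float(width), float(height)
        )
    return regions


regions = load_regions([
    ("header", 10, 10, 190, 30),
    ("footer", 10, 270, 190, 20),
])
```

A test case can then resolve its 'region' value with `regions["header"]`, and a misspelled name fails early with a `KeyError`.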
The sheet 'check' must be used to define test cases which relate to a single PDF document. It does not cover test cases in which two documents are compared with each other. The existing columns are:
Column Name | Comment |
---|---|
id | Name (ID) of the test case |
pages | Pages this test should be used on |
region | Name of a page region defined in the Excel sheet 'regions' |
constraint | Kind of validation. The allowed values are described below. |
expected value | The expected value, if a validation needs one. |
whitespace | This column contains a value indicating how whitespace will be handled. The allowed values are described below. |
message | In this column an error message with placeholders can be defined. The placeholders are also described below. |
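Because the column order is fixed and additional columns are allowed, a row can be read by position. The following sketch shows one way to do that. The names `CheckCase` and `parse_check_row` are hypothetical, and padding short rows with empty strings is a choice made for this example.

```python
from typing import NamedTuple


class CheckCase(NamedTuple):
    id: str
    pages: str
    region: str
    constraint: str
    expected_value: str
    whitespace: str
    message: str


def parse_check_row(row):
    """Map the first seven cells of a row onto the 'check' columns."""
    n = len(CheckCase._fields)
    # Extra columns after the expected ones are ignored.
    cells = ["" if c is None else str(c).strip() for c in row[:n]]
    # Assumption: missing trailing cells are treated as empty.
    cells += [""] * (n - len(cells))
    return CheckCase(*cells)


case = parse_check_row(
    ("T1", "1", "header", "must contain", "Invoice", "normalize",
     "Test {id} failed", "internal note in an extra column")
)
```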
A test case definition is often restricted to individual pages of a document. The following list shows all available syntax elements:
Pages | Syntax in Excel |
---|---|
a single page | 1 |
multiple, individual pages | 1, 3, 5 |
all pages | all |
all pages after a given page (inclusive) | 2... |
all pages before a given page (inclusive) | ...5 |
all pages between (inclusive) | 2...5 |
Two page numbers must be separated by a blank. A comma is optional.
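The page syntax is small enough to sketch as a parser. The function `parse_pages` and its `page_count` parameter are assumptions for illustration. PDFUnit's own parsing may differ, for example in how it treats ranges that exceed the document length, which this sketch does not check.

```python
import re


def parse_pages(spec, page_count):
    """Expand a 'pages' cell into a list of 1-based page numbers."""
    spec = spec.strip()
    if spec == "all":
        return list(range(1, page_count + 1))

    # Ranges: '2...', '...5', '2...5' (both ends inclusive).
    m = re.fullmatch(r"(\d*)\s*\.\.\.\s*(\d*)", spec)
    if m and (m.group(1) or m.group(2)):
        first = int(m.group(1)) if m.group(1) else 1
        last = int(m.group(2)) if m.group(2) else page_count
        return list(range(first, last + 1))

    # Individual pages separated by blanks, with optional commas.
    tokens = [t for t in re.split(r"[,\s]+", spec) if t]
    return [int(t) for t in tokens]
```

For a four-page document, `parse_pages("2...", 4)` yields pages 2, 3 and 4, while `parse_pages("1 3", 4)` and `parse_pages("1, 3", 4)` both yield pages 1 and 3.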
Values in the column 'constraint' in the sheet 'check' are used for specifying how the actual content of a document and the expected text will be compared. The following list shows the allowed values:
Keyword | Behaviour |
---|---|
'must contain' | The text in the column 'expected value' must be part of the page region. Additionally, this constraint type requires whitespace handling information. |
'must not contain' | The text in the column 'expected value' must not exist in the page region of the document. Additionally, this constraint type requires whitespace handling information. |
'must be empty' | The referenced region must not contain any text. |
'must not be empty' | The referenced region must have text. |
'must match' | The text in the column 'expected value' will be taken as a regular expression and executed against the text in the referenced region. At least one piece of text must match. |
'must not match' | The text in the column 'expected value' will be executed as a regular expression against the text in the referenced region. The test is successful if no match is found. |
The column 'constraint' must not be empty. If it is, PDFUnit reports an error.
The sheet 'compare' may have other values in the column 'constraint'. Those values are described below.
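The semantics of the six text constraints can be summarised in a small sketch. The function `check_text` is hypothetical and uses Python's `re` module, whose regular expression dialect is not necessarily the one PDFUnit uses. Treating whitespace-only text as empty is also an assumption. Whitespace handling is left out here.

```python
import re


def check_text(constraint, region_text, expected=""):
    """Return True if the region text satisfies the constraint."""
    if constraint == "must contain":
        return expected in region_text
    if constraint == "must not contain":
        return expected not in region_text
    if constraint == "must be empty":
        return region_text.strip() == ""  # assumption: blanks count as empty
    if constraint == "must not be empty":
        return region_text.strip() != ""
    if constraint == "must match":
        # One match anywhere in the region text is enough.
        return re.search(expected, region_text) is not None
    if constraint == "must not match":
        return re.search(expected, region_text) is None
    # Covers an empty 'constraint' cell as well as unknown keywords.
    raise ValueError(f"unknown or empty constraint: {constraint!r}")
```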
The column 'constraint' in the sheet 'check' can also contain keywords for validating signatures or images in a document:
Keyword | Behaviour |
---|---|
'is signed' | The PDF document has to be signed. |
'is signed by' | A PDF document has to be signed. The expected name of the signatory has to be put into the column 'expected value'. |
'has number of images' | The number of all visible images in the referenced page region will be compared with the number of the expected images, which must be put into the column 'expected value'. |
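Text comparisons may fail due to differences in the whitespace. For example, text which is rendered in different fonts may have line breaks at different positions. The column 'whitespace' therefore controls how whitespace is handled, using one of the following keywords: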
Keyword | Behaviour |
---|---|
'ignore' | All whitespaces are removed from the text before two strings are compared. |
'keep' | Whitespaces will be taken 'as is'. |
'normalize' | Whitespace at the beginning or the end of a text is deleted. Multiple whitespaces between words are reduced to one blank. |
Wrong values in the column 'whitespace' will cause an error message. If the 'whitespace' column is left blank, the program defaults to 'normalize'.
Some validations, such as comparing bookmarks, are independent of whitespaces. For such validations, any declaration of whitespace handling will be ignored.
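The three whitespace modes, the default, and the error on wrong values can be sketched as follows. The function `apply_whitespace` is an illustration, not PDFUnit's code. It assumes that "whitespace" covers blanks, tabs and line breaks, so a single line break between words also becomes one blank under 'normalize'.

```python
import re


def apply_whitespace(text, mode=""):
    """Prepare text for comparison according to the 'whitespace' column."""
    mode = mode.strip() or "normalize"  # blank cell defaults to 'normalize'
    if mode == "ignore":
        return re.sub(r"\s+", "", text)
    if mode == "keep":
        return text
    if mode == "normalize":
        return re.sub(r"\s+", " ", text).strip()
    raise ValueError(f"invalid whitespace mode: {mode!r}")
```

Both sides of a comparison would be passed through the same mode before they are compared, so that `"Total:\n 42"` and `"Total: 42"` are equal under 'normalize' but not under 'keep'.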
For a test case to verify that a region of a test document contains an expected value, the Excel file must provide a column for that value. This column is named 'expected value'.
If the 'constraint' column has the values 'must match' or 'must not match', the contents of the 'expected value' column are used as a regular expression. More information about regular expressions can be found online, for example on Wikipedia.
If the column 'constraint' has the value 'has number of images', then the content of the 'expected value' column will be parsed to an integer value.
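How the 'expected value' cell is interpreted depends on the constraint. A sketch of that dispatch, with the hypothetical function `interpret_expected` and Python's `re` standing in for whatever regular expression engine PDFUnit uses:

```python
import re


def interpret_expected(constraint, raw):
    """Convert the raw 'expected value' cell according to the constraint."""
    if constraint in ("must match", "must not match"):
        return re.compile(raw)        # used as a regular expression
    if constraint == "has number of images":
        return int(str(raw).strip())  # parsed to an integer
    return raw                        # plain text for all other constraints
```

An invalid pattern raises `re.error` and a non-numeric image count raises `ValueError`, so a malformed cell is reported before any PDF is examined.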
Individual error messages can be defined in the Excel sheets. These messages are shown in addition to PDFUnit's validation messages. An error message in the Excel sheet may have placeholders for runtime data. The following image shows some examples:
The image shows clearly that placeholders in a text are enclosed with curly brackets. The following placeholders can be used:
Placeholder | Meaning |
---|---|
{id} | The ID of the current test case |
{pages} | The page number of the page where the error is detected |
{region} | The value in the column 'region' |
{constraint} | The value in the column 'constraint' |
Placeholders can be used anywhere inside a text. The values of the placeholders at runtime are enclosed in single quotation marks, so there is no need to use single quotes in the error messages in the Excel sheet.
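Placeholder substitution with automatic quoting can be sketched like this. The function `render_message` and the sample message are made up for the example. Only the four placeholders listed above are replaced, and any other text in curly brackets is left untouched.

```python
import re

PLACEHOLDERS = ("id", "pages", "region", "constraint")


def render_message(template, values):
    """Replace known placeholders with their quoted runtime values."""
    def substitute(match):
        name = match.group(1)
        return "'" + str(values[name]) + "'"  # values appear in single quotes

    pattern = r"\{(" + "|".join(PLACEHOLDERS) + r")\}"
    return re.sub(pattern, substitute, template)


message = render_message(
    "Test {id} failed on page {pages} in region {region}.",
    {"id": "T1", "pages": 2, "region": "header", "constraint": "must contain"},
)
```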
The Excel sheet 'compare' can be used to declare validation rules for comparing two PDF documents. One document is the 'document under test' and the second document is a reference document.
For comparative testing no information about an expected text need be given in the Excel sheet, so the column 'expected value' is not provided in the sheet 'compare'.
The meaning of columns is the same as described in the above sections for the sheet 'check'. But in the 'compare' sheet, other values are allowed in the 'constraint' column:
Keyword | Behaviour |
---|---|
'same text' | Two PDF documents must have the same text in the referenced region. Additionally, this constraint type requires whitespace handling information. |
'same appearance' | The referenced regions of the two PDF documents must be identical when compared as rendered images. |
'same bookmarks' | The two PDF documents must have the same bookmarks. Obviously this validation does not require page regions, but for technical reasons the column 'region' must not be empty. Instead, the value 'NO_REGION' should be supplied. |
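A sketch of a plausibility check for rows of the 'compare' sheet follows. The function `validate_compare_row` and its messages are assumptions for illustration. In particular, the sketch accepts 'NO_REGION' for every constraint, which PDFUnit may or may not do.

```python
COMPARE_CONSTRAINTS = {"same text", "same appearance", "same bookmarks"}


def validate_compare_row(region, constraint, region_ids):
    """Return a list of problems found in one 'compare' row."""
    problems = []
    if constraint not in COMPARE_CONSTRAINTS:
        problems.append(f"invalid constraint {constraint!r}")
    if not region:
        # Also applies to 'same bookmarks', which expects 'NO_REGION'.
        problems.append("column 'region' must not be empty")
    elif region != "NO_REGION" and region not in region_ids:
        problems.append(f"unknown region {region!r}")
    return problems
```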
The image below shows a test comparing bookmarks of a 'PDF under test' with a reference PDF:
PDFUnit looks for the reference document of a given PDF document in a subdirectory of that document's folder. The subdirectory must be named 'reference', and the reference file must have the same filename as the PDF under test.
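The lookup rule translates into a one-line path computation. The function name `reference_path` and the sample path are hypothetical. `PurePosixPath` is used only so that the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def reference_path(pdf_under_test):
    """Return where the reference document is expected to be."""
    pdf = PurePosixPath(pdf_under_test)
    # Same filename, inside the 'reference' subdirectory of the PDF's folder.
    return pdf.parent / "reference" / pdf.name


expected_location = reference_path("/data/out/invoice.pdf")
```

For the sample path, the reference document is expected at `/data/out/reference/invoice.pdf`.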