13.11.  Using XPath

General Comments about XPath in PDFUnit

Using XPath to evaluate parts of a PDF document opens a wider range of testing capabilities than an API alone can provide.

Several chapters in this manual describe XPath tests. This chapter gives an overview and refers to the chapters that cover each topic in detail.

// Validating a single PDF using XPath:
.hasXFAData().matchingXPath(..)         3.37: “XFA Data” 
.hasXMPData().matchingXPath(..)         3.38: “XMP Data” 
.hasZugferdData().matchingXPath(..)     3.39: “ZUGFeRD” 

// Comparing two documents using XPath: 
.haveXFAData().matchingXPath(..)        4.15: “Comparing XFA Data” 
.haveXMPData().matchingXPath(..)        4.16: “Comparing XMP Data” 

Extract Data as XML

PDFUnit provides a utility program for every part of a PDF document that can be tested using XML/XPath. Each utility extracts the corresponding information into an XML file:

// Utilities to extract XML from PDF:

com.pdfunit.tools.ExtractBookmarks
com.pdfunit.tools.ExtractFieldInfo
com.pdfunit.tools.ExtractFontInfo
com.pdfunit.tools.ExtractNamedDestinations
com.pdfunit.tools.ExtractSignatureInfo
com.pdfunit.tools.ExtractXFAData
com.pdfunit.tools.ExtractXMPData
com.pdfunit.tools.ExtractZugferdData

The utilities are described in chapter 9.1: “General Remarks for all Utilities”.

Namespaces with Prefix

A namespace with an existing prefix will be detected automatically by PDFUnit.

Default Namespace

The default namespace is not detected automatically because the XML standard allows namespaces to be defined multiple times within one XML document. A default namespace therefore has to be declared explicitly, and the XPath expression has to refer to it with a prefix:

/**
 * The default namespace has to be declared, 
 * but any alias can be used for it.
 */
@Test
public void hasXFAData_UsingDefaultNamespace() throws Exception {
  String filename = "documentUnderTest.pdf";
  DefaultNamespace defaultNS = new DefaultNamespace("http://www.xfa.org/schema/xci/2.6/");
  XMLNode aliasFoo = new XMLNode("foo:log/foo:to", "memory", defaultNS);

  AssertThat.document(filename)
            .hasXFAData()
            .withNode(aliasFoo)
  ;
}

It may seem strange to use an arbitrary prefix, but the Java standard requires a user-defined one; it cannot be omitted. In real projects, please choose a more meaningful name than "foo".

The following test uses a default namespace with an XPathExpression:

@Test
public void hasXMPData_MatchingXPath_WithDefaultNamespace() throws Exception {
  String filename = "documentUnderTest.pdf";

  String xpathAsString = "//default:format = 'application/pdf'";
  String stringDefaultNS = "http://purl.org/dc/elements/1.1/";
  DefaultNamespace defaultNS = new DefaultNamespace(stringDefaultNS);        
  XPathExpression expression = new XPathExpression(xpathAsString, defaultNS); 

  AssertThat.document(filename)
            .hasXMPData()
            .matchingXPath(expression)
  ;
}
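
The prefix requirement can be reproduced with the XPath API of the JDK alone, without PDFUnit. The following is a minimal sketch, not PDFUnit's internal implementation: the class name DefaultNamespaceDemo, the sample XML, and the helper method evaluate() are assumptions made for illustration. It shows that an unprefixed path does not select elements that live in a default namespace, while the same path with a user-defined prefix bound to that namespace does.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class DefaultNamespaceDemo {

  static final String NS = "http://www.xfa.org/schema/xci/2.6/";

  // Hypothetical sample data: every element is in the default namespace.
  static final String XML = "<log xmlns=\"" + NS + "\"><to>memory</to></log>";

  /** Evaluates the expression against XML and returns the string value of the result. */
  static String evaluate(String expression) {
    XPath xpath = XPathFactory.newInstance().newXPath();

    // Bind the arbitrary prefix "foo" to the default namespace of the document.
    xpath.setNamespaceContext(new NamespaceContext() {
      @Override
      public String getNamespaceURI(String prefix) {
        return "foo".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
      }
      @Override
      public String getPrefix(String namespaceURI) {
        return null; // not needed for this sketch
      }
      @Override
      public Iterator<String> getPrefixes(String namespaceURI) {
        return null; // not needed for this sketch
      }
    });

    try {
      return xpath.evaluate(expression, new InputSource(new StringReader(XML)));
    } catch (XPathExpressionException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Unprefixed names refer to "no namespace", so nothing is selected.
    System.out.println("[" + evaluate("/log/to") + "]");
    // With the prefix, the element in the default namespace is found.
    System.out.println("[" + evaluate("/foo:log/foo:to") + "]");
  }
}
```

With the sample data above, the unprefixed expression yields an empty string and the prefixed one yields "memory". The prefix in the expression does not have to match anything in the document; only the namespace URI bound to it matters.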

XPath Compatibility

XPath expressions can use all syntax elements and functions of XPath that the underlying engine supports. PDFUnit uses the XPath engine of the JDK, so the JDK version determines which features of the XPath standard are available.
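
Whether a particular function is available can be checked against the JDK before it is used in a test. The sketch below is an assumption-laden illustration, not part of PDFUnit: the class XPathFeatureCheck and the method isSupported() are hypothetical names. It asks the default XPath engine of the running JDK to compile an expression and reports whether that succeeded. The javax.xml.xpath API is specified against XPath 1.0, so a function introduced in a later XPath version, such as lower-case(), is expected to be rejected by the built-in engine, while the XPath 1.0 function translate() is accepted.

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

public class XPathFeatureCheck {

  /** Returns true if the XPath engine of the running JDK can compile the expression. */
  static boolean isSupported(String expression) {
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
      xpath.compile(expression);
      return true;
    } catch (XPathExpressionException e) {
      // Thrown for syntax errors and for functions the engine does not know.
      return false;
    }
  }

  public static void main(String[] args) {
    // translate() belongs to XPath 1.0.
    System.out.println(isSupported("translate('PDF', 'PDF', 'pdf')"));
    // lower-case() was introduced with XPath 2.0.
    System.out.println(isSupported("lower-case('PDF')"));
  }
}
```

The result depends on the XPath implementation that XPathFactory.newInstance() finds at run time; if an external XPath library is configured, the second check may succeed.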

Chapter 13.12: “JAXP-Configuration” describes the JAXP configuration of a JRE/JDK and how to use an external XML parser or XSLT processor.