9.2. Convert Unicode Text into Hex Code

Java “understands” Unicode, as does XML, and so does PDFUnit. Section 7, “Unicode”, covers Unicode in detail.

This section describes a utility program that converts a Unicode string into its ASCII hex code. The hex code can be used in many of your tests. If you only need a small number of Unicode characters, it is easier to use the ASCII hex code than to install a new font on your computer. You may also not have permission to install anything.

The utility ConvertUnicodeToHex converts any string into ASCII by replacing every non-ASCII character with its corresponding Unicode hex code. For example, the Euro character is converted into \u20AC.
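
The following Java sketch illustrates the kind of escaping described here. It is not the source code of ConvertUnicodeToHex; the class name UnicodeEscapeSketch and the method toAsciiHex are hypothetical names chosen for this illustration, and the real utility may treat special cases differently.

```java
public class UnicodeEscapeSketch {

    /**
     * Illustrative only: copies ASCII characters unchanged and replaces every
     * other character with a backslash, the letter 'u', and four uppercase
     * hex digits. This is an assumption about the escaping, not PDFUnit code.
     */
    public static String toAsciiHex(String input) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);                                  // plain ASCII
            } else {
                sb.append(String.format("\\u%04X", (int) c));  // escaped form
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample text (a-umlaut, o-umlaut, u-umlaut, space, Euro sign,
        // space, at sign) is written with escapes so that this source file
        // compiles independently of its own encoding.
        String input = "\u00E4\u00F6\u00FC \u20AC @";
        System.out.println(toAsciiHex(input));
    }
}
```

Run with the sample text, the sketch prints \u00E4\u00F6\u00FC \u20AC @. The Euro sign becomes \u20AC, while the space and the @ character stay unchanged because they are already ASCII.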

The input file can be in any encoding, but you have to specify that encoding when you start the program.

Program Start

Start the Java program and pass the encoding of the input file with the JVM option -Dfile.encoding:

::
:: Converting Unicode content of the input file to hex code.
::
  
@echo off
setlocal
set CLASSPATH=./lib/pdfunit-2015.10/*;%CLASSPATH%

set TOOL=com.pdfunit.tools.ConvertUnicodeToHex
set OUT_DIR=./tmp
set IN_FILE=convert-unicode-to-hex.in.txt

java -Dfile.encoding=UTF-8 %TOOL%  %IN_FILE%  %OUT_DIR% 
endlocal

Input

The input file convert-unicode-to-hex.in.txt contains this data:

äöü € @

Output

The name of the output file is derived from the name of the input file. In this example, the file _convert-unicode-to-hex.out.txt is generated with the following content:

#Unicode created by com.pdfunit.tools.ConvertUnicodeToHex
#Wed Jan 16 21:50:04 CET 2013
convert-unicode-to-hex.in_as-ascii=\u00E4\u00F6\u00FC \u20AC @

The output file is written in the encoding of the Java runtime, which is taken from the system property file.encoding.
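
The generated output looks like a Java properties file (comment lines starting with #, followed by a key=value line). Assuming that is the case, the standard class java.util.Properties can read it back and decode the hex codes. The sketch below demonstrates this on an in-memory copy of the content; the class name ReadHexOutputSketch and the method decode are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ReadHexOutputSketch {

    /**
     * Loads text in Java properties format and returns the value stored under
     * the given key. Properties.load() decodes the hex escapes while reading.
     */
    public static String decode(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // In-memory copy of the generated content. A double backslash in this
        // Java source stands for a single backslash in the file.
        String fileContent =
              "#Unicode created by com.pdfunit.tools.ConvertUnicodeToHex\n"
            + "convert-unicode-to-hex.in_as-ascii=\\u00E4\\u00F6\\u00FC \\u20AC @\n";

        String text = decode(fileContent, "convert-unicode-to-hex.in_as-ascii");

        // Compare against the original sample text, written with escapes so
        // that this source file does not depend on its own encoding.
        System.out.println(text.equals("\u00E4\u00F6\u00FC \u20AC @"));
    }
}
```

To read the real file instead of the in-memory string, a java.io.Reader opened on the output file could be passed to Properties.load() in the same way.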

Leading and trailing whitespace in the input string is trimmed! If you need it for your test, add it by hand afterwards.
