From Simon Pepping <spepp...@leverkruid.eu>
Subject New experimental hyphenation patterns
Date Wed, 25 Nov 2009 20:40:35 GMT
I just uploaded new experimental hyphenation patterns for FOP, see
http://sourceforge.net/projects/offo, select the tab files, select the
newest files, or the files in offo-hyphenation-utf8/0.1.

From the readme file (index.html in the downloaded zip files):

Recently the TeX community has converted its hyphenation pattern
files to UTF-8 format. Most of these pattern files can be trivially
converted to pattern files in the XML format used by FOP. Therefore
the OFFO maintainer joined the maintainers of the TeX hyphenation
patterns, and in the future the hyphenation patterns offered by OFFO
will be simple conversions from the TeX patterns.

This is the first release of the TeX utf-8 patterns for FOP. There are
a few unsolved problems:

Naming: FOP uses the POSIX naming convention ll_CC for language and
country. There are a couple of patterns that do not fit into this
convention:

When a language uses various alternative scripts, the script name is
appended to the file name, e.g. sh_Cyrl and sh_Latn. The user will
have to rename the pattern file of his preferred script in the jar
file by removing the script suffix. The final solution is probably to
merge the patterns for different scripts in one pattern file.

When a language uses various alternative spelling rules, some
descriptive suffix is appended to the file name, e.g. de_1901; users
who prefer these pattern files over the default ones will have to
rename the pattern files in the jar file.
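The renaming step described above can be scripted. The following is a minimal sketch, not part of the OFFO distribution: a jar file is a zip archive, so it copies every entry into a new jar and stores the chosen pattern file under the default name. The entry names used in the usage note (`hyph/sh_Latn.xml`, `hyph/sh.xml`) and the jar file names are assumptions for illustration only; list the real entry names first, for example with `jar tf`.

```python
import zipfile


def rename_jar_entry(src_jar, dst_jar, old_name, new_name):
    """Copy src_jar to dst_jar, storing the entry old_name under new_name.

    An existing entry called new_name is dropped, so the renamed
    pattern file replaces the default one.
    """
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        names = src.namelist()
        if old_name not in names:
            raise KeyError(f"{old_name} not found in {src_jar}")
        for name in names:
            if name == new_name and name != old_name:
                continue  # skip the old default; the renamed entry takes its place
            target = new_name if name == old_name else name
            dst.writestr(target, src.read(name))
```

Usage, with hypothetical file and entry names: `rename_jar_entry("fop-hyph.jar", "fop-hyph-latn.jar", "hyph/sh_Latn.xml", "hyph/sh.xml")`. The original jar is left untouched, so the change is easy to undo.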

Licenses: No overview of the licenses has yet been made. To find
information about the license, one has to look into the comments in
the XML or TeX pattern files.

Comments: The conversion from TeX to XML is done by a
program. Comments pose a problem, because in TeX the trailing
newline is part of the comment. In comment sections in XML this is less
desirable, and we have done our best to format comments in a legible
way. However, at the moment the formatting is spoiled by text data
between comments (usually blank lines), and all following comments are
on a single line.
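To illustrate the comment problem, here is a minimal sketch of one possible approach. It is not the conversion program used for these patterns, and the function name `tex_comments_to_xml` is made up for this example. It groups consecutive TeX comment lines into one XML comment, keeps the line breaks, and separates double hyphens, which are not allowed inside XML comments. Blank lines end a comment group and are dropped; other lines are passed through unchanged and unescaped.

```python
import re


def tex_comments_to_xml(lines):
    """Turn runs of TeX '%' comment lines into XML comments.

    Returns a list of output chunks: XML comments (possibly multi-line)
    and any non-comment, non-blank lines, stripped of outer whitespace.
    """
    out, block = [], []

    def flush():
        if block:
            body = "\n".join(block)
            # "--" is illegal inside an XML comment: put a space after
            # every hyphen that is directly followed by another hyphen.
            body = re.sub(r"-(?=-)", "- ", body)
            out.append(f"<!-- {body} -->")
            block.clear()

    for line in lines:
        stripped = line.strip()
        if stripped.startswith("%"):
            # Drop the leading percent signs and surrounding whitespace.
            block.append(stripped.lstrip("%").strip())
        else:
            flush()  # a blank or non-comment line ends the current group
            if stripped:
                out.append(stripped)
    flush()
    return out
```

The space before the closing `-->` also ensures that a comment ending in a hyphen stays well-formed.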

Classes: The TeX patterns, and therefore also the XML patterns, do not
contain classes, i.e. a list of characters used in words (Unicode
class Letter). Since 3 September 2009 these classes are built into
FOP. Therefore these patterns can only be used with FOP versions
created after that date. Until now no release was made after that
date, and these patterns only work with code from the subversion

Not included:

There are no separate hyphenation patterns for Norwegian Nynorsk and
Norwegian Bokmal. Instead, there is a single pattern file for

There are no patterns for Esperanto, because the TeX pattern file is
not in a format that can be converted to XML.

There are no patterns for Hungarian, because the TeX pattern file
contains too many patterns for my machine to compile (stack overflow).

I would appreciate your comments on the usability of these hyphenation
patterns.

Regards, Simon

Simon Pepping
home page: http://www.leverkruid.eu

