xmlgraphics-fop-users mailing list archives

From "J.Pietschmann" <j3322...@yahoo.de>
Subject Re: Hyphenation foundry [was: Re: proposed font project]
Date Wed, 16 Jun 2004 21:15:42 GMT
Simon Pepping wrote:
> I think it is time to create a project for the hyphenation files at
> Sourceforge. The project should be a home for all sorts of accessories
> to FOP, or even to FO processors in general. Do you want to
> participate? Do you know a nice name?

Well, sf.net would appeal to a larger body of developers, I think,
and is certainly easier to manage for small projects, but we
can also ask on jakarta-commons, xml-commons and even declare it
a FOP (or XML graphics) subproject.

Anyway, I just uploaded
  http://cvs.apache.org/~pietsch/t.tar.gz
which contains several unfinished pieces I produced over the last year:
- Utilities to generate tables for the Unicode line break property
- A class keeping a line break state according to TR14, which should
  be easier to use than java.text.BreakIterator for FOP
- A Java port of MySpell
- An attempt at providing a layered hierarchy for spell checking
  and hyphenation interfaces.
- A Java port of the link grammar parser (incomplete, badly designed,
  buggy and without approval of the original authors, *please* use
  only for personal study, don't redistribute).
- An attempt at a morphological analyzer for German words.
Somehow, the simple port of patgen as well as other attempts at
simplifying the current FOP hyphenator are missing, I hope I
remember to upload them tomorrow.

If someone wants some problems to chew on:
- Implementation of an optimized trie or ternary or PATRICIA tree.
  Issues here: The FOP implementation packs both tree construction and
  retrieval into a single class, while the data structure is WORM.
  Furthermore, while it is fast, it could be implemented with much
  less memory, especially peak memory during construction. I ultimately
  concluded compiling the data into Java bytecode would be the best.
  Consider inserting the words WORD and WORM. A PATRICIA tree would
  collapse this to
    root: WOR -> leaf D
              -> leaf M
  In order to map this, the root node gets an operation "match string"
  with the string "WOR" leading to the subtree. Statistical compression
  could optimize the necessary operation, like "switch array", match
  2char string, match 3char string, match n-char string etc. May utilize
  BCEL.
- Institutionalized alphabet transformation. This is somewhat of a
  generalization of the hyphenation character classes. Java uses 16bit
  characters, but in many languages it is rare that more than 256
  characters are actually used in words. TeX/PatGen also map the
  characters onto the numbers 1..N (<256), folding character
  classification into the process. Mapping chars onto bytes saves almost
  half the memory. Because there are languages which require more than
  256 characters, at least two implementations of the trie/whatever
  holding the patterns are necessary, one where the keys are byte
  sequences, another with char sequences. Too bad generics aren't ready
  yet, but if the data is byte compiled into a Java class, the compiler
  may analyze the patterns and decide whether bytes are sufficient.
  Stuff like Unicode character normalization should probably be folded
  into the classification/alphabet transformation too. It would be too
  bad if hyphenation failed because someone decided to use unnormalized
  characters like FI LIGATURE.
- API design. Need a hierarchy of interfaces which allow polymorphism
  at various levels:
   + Hyphenator
       implementations: pattern hyphenator, dictionary hyphenator,
       composite hyphenator: delegate to a collection of child
       hyphenators
   + Pattern hyphenator - pattern storage
      implementations: HashTable (very easy to understand but slow),
      R/W-trie, optimized WORM class, ...
   + Dictionary hyphenator - dictionary ...
  For reuse in interactive applications, R/W storage may be useful (user
  dictionaries)
- Generalized line breaking strategies. Possible strategies
  + naive, break before the first non-space after a space
  + TR14
  + break before any character
  + pattern, regexp or dictionary based
- Other ideas: API for processing the Unicode data files. Optimized
  compilation of Unicode properties into Java class data: select the
  properties you want, get it. Use this to get the latest Unicode data
  into your Java applications rather than the outdated stuff in the
  JRE.
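
For the WORD/WORM example above, a minimal sketch of the split between
construction and retrieval could look like the following. It is not the
FOP implementation and it does not compile anything to bytecode; it only
shows a throwaway builder collapsing single-child chains into "match
string" edges and handing back a frozen, read-only tree. The names
RadixBuilder and RadixNode are made up for illustration, and the code
assumes Java 8 or later.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical read-only node: the retrieval half of a WORM structure. */
final class RadixNode {
    final String label;          // characters matched on the edge into this node
    final boolean terminal;      // true if a stored word ends here
    final RadixNode[] children;  // never modified after construction

    RadixNode(String label, boolean terminal, RadixNode[] children) {
        this.label = label;
        this.terminal = terminal;
        this.children = children;
    }

    boolean contains(String key) {
        return contains(key, 0);
    }

    private boolean contains(String key, int pos) {
        // The "match string" operation: the whole edge label must match.
        if (!key.startsWith(label, pos)) {
            return false;
        }
        pos += label.length();
        if (pos == key.length()) {
            return terminal;
        }
        for (RadixNode child : children) {
            if (child.label.charAt(0) == key.charAt(pos)) {
                return child.contains(key, pos);
            }
        }
        return false;
    }
}

/** Hypothetical builder: a plain one-char-per-node trie, discarded after freeze(). */
final class RadixBuilder {
    private final Map<Character, RadixBuilder> next = new TreeMap<>();
    private boolean terminal;

    void add(String word) {
        RadixBuilder node = this;
        for (int i = 0; i < word.length(); i++) {
            node = node.next.computeIfAbsent(word.charAt(i), c -> new RadixBuilder());
        }
        node.terminal = true;
    }

    RadixNode freeze() {
        return freeze("");
    }

    private RadixNode freeze(String prefix) {
        RadixBuilder node = this;
        StringBuilder label = new StringBuilder(prefix);
        // Collapse chains of non-terminal nodes that have exactly one child.
        while (!node.terminal && node.next.size() == 1) {
            Map.Entry<Character, RadixBuilder> only = node.next.entrySet().iterator().next();
            label.append(only.getKey());
            node = only.getValue();
        }
        RadixNode[] kids = new RadixNode[node.next.size()];
        int i = 0;
        for (Map.Entry<Character, RadixBuilder> e : node.next.entrySet()) {
            kids[i++] = e.getValue().freeze(String.valueOf(e.getKey()));
        }
        return new RadixNode(label.toString(), node.terminal, kids);
    }
}
```

Adding WORD and WORM to a RadixBuilder and calling freeze() gives a root
labelled "WOR" with the two leaves D and M. A bytecode-compiled variant
would replace the RadixNode objects with generated match code, which
this sketch does not attempt.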
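
The alphabet transformation could be sketched roughly as below. This is
only an illustration under several assumptions of mine: the class name
AlphabetMap is hypothetical, NFKC normalization plus lowercasing stands
in for whatever classification a real implementation would use, and
code 0 is reserved for characters outside the alphabet. It covers only
the byte-keyed case, for alphabets of at most 255 characters.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical mapping of word characters onto the codes 1..N (N < 256). */
final class AlphabetMap {
    private final Map<Character, Byte> codes = new HashMap<>();

    AlphabetMap(String alphabet) {
        if (alphabet.length() > 255) {
            throw new IllegalArgumentException("alphabet needs char keys, not bytes");
        }
        for (int i = 0; i < alphabet.length(); i++) {
            // Codes above 127 wrap to negative bytes; read them back with (b & 0xFF).
            codes.put(alphabet.charAt(i), (byte) (i + 1));
        }
    }

    /** Normalizes, folds case, then maps each char to its code (0 = not in alphabet). */
    byte[] encode(String word) {
        // NFKC replaces compatibility characters, e.g. U+FB01 (fi ligature) becomes "fi".
        String folded = Normalizer.normalize(word, Normalizer.Form.NFKC)
                .toLowerCase(Locale.ROOT);
        byte[] out = new byte[folded.length()];
        for (int i = 0; i < folded.length(); i++) {
            Byte code = codes.get(folded.charAt(i));
            out[i] = (code == null) ? (byte) 0 : code.byteValue();
        }
        return out;
    }
}
```

With the alphabet "abcdefghijklmnopqrstuvwxyz", the inputs "Fish" and
"\uFB01sh" both encode to the same four bytes, so a pattern lookup keyed
on bytes sees no difference between them.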
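
As a starting point for the API discussion, the top of the hierarchy
might look something like this. All names and signatures here are
hypothetical, including the convention that hyphenate() returns null
for an unknown word so that a composite can fall through to its next
child. The pattern hyphenator and its storage layer are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical top-level interface: break offsets, or null if the word is unknown. */
interface Hyphenator {
    int[] hyphenate(String word);
}

/** Writable dictionary, as a user dictionary in an interactive application would be. */
final class DictionaryHyphenator implements Hyphenator {
    private final Map<String, int[]> entries = new HashMap<>();

    /** Accepts an entry with explicit break points, such as "hy-phen-ation". */
    void add(String hyphenated) {
        StringBuilder word = new StringBuilder();
        List<Integer> points = new ArrayList<>();
        for (int i = 0; i < hyphenated.length(); i++) {
            char c = hyphenated.charAt(i);
            if (c == '-') {
                points.add(word.length());
            } else {
                word.append(c);
            }
        }
        int[] offsets = new int[points.size()];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = points.get(i);
        }
        entries.put(word.toString(), offsets);
    }

    @Override
    public int[] hyphenate(String word) {
        int[] offsets = entries.get(word);
        return (offsets == null) ? null : offsets.clone();
    }
}

/** Delegates to child hyphenators in order; the first non-null answer wins. */
final class CompositeHyphenator implements Hyphenator {
    private final List<Hyphenator> children;

    CompositeHyphenator(List<Hyphenator> children) {
        this.children = new ArrayList<>(children);
    }

    @Override
    public int[] hyphenate(String word) {
        for (Hyphenator child : children) {
            int[] offsets = child.hyphenate(word);
            if (offsets != null) {
                return offsets;
            }
        }
        return null;
    }
}
```

A composite built from a user dictionary followed by a pattern
hyphenator would let dictionary entries override the patterns without
either implementation knowing about the other.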


J.Pietschmann


