Hi,
for indexing PDF files we have to undo word hyphenation. The basic idea
is simply to remove the hyphen when a new line and a small letter
follows. Of course this approach isnt 100%-foolproofed but checking
against a dictionary wouldnt be as well...
Since we face this problem too when highlighting using HTMLCharStripper
(yes, we do have hyphenation in our HTML docs...) it seems to me I have
to adjust the JFlex generated StandardTokenizerImpl.
Is this the right approach and hwo would I have to modify this script?
Thanks
Wulf
PS: I see that there are changes made in the brand new 3.1.0 version we
are using 3.0.3, but as far I understand no relevant changes in this
respect.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
|