lucene-java-user mailing list archives

From Wulf Berschin
Subject Undo hyphenation when indexing
Date Fri, 01 Apr 2011 15:50:28 GMT

For indexing PDF files we have to undo word hyphenation. The basic idea 
is simply to remove the hyphen when it is followed by a line break and a 
lowercase letter. Of course this approach isn't 100% foolproof, but 
checking against a dictionary wouldn't be either...
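
To make the rule concrete, here is a minimal sketch in plain Java, applied to the extracted text before it reaches the analyzer. The class name Dehyphenator and the method undoHyphenation are made up for illustration and are not part of Lucene. It also assumes that the hyphen must directly follow a letter and that spaces or tabs around the line break may be dropped.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: joins words that were
// hyphenated across a line break in extracted text.
final class Dehyphenator {

    // A hyphen directly after a letter, then optional spaces/tabs, a line
    // break, optional spaces/tabs, and a lowercase letter on the next line.
    // The letters themselves are matched by lookarounds, so only the hyphen
    // and the line break are removed.
    private static final Pattern HYPHEN_AT_LINE_END =
        Pattern.compile("(?<=\\p{L})-[ \\t]*\\r?\\n[ \\t]*(?=\\p{Ll})");

    private Dehyphenator() {
    }

    // Returns the text with line-end hyphenation undone; everything else
    // is left untouched.
    static String undoHyphenation(String text) {
        return HYPHEN_AT_LINE_END.matcher(text).replaceAll("");
    }
}
```

Calling Dehyphenator.undoHyphenation("hyphen-\nation") gives "hyphenation", while "Lucene-\nSolr" is left alone because an uppercase letter follows. A genuine compound broken at its hyphen, such as "well-\nknown", is wrongly joined to "wellknown", which is the non-foolproof case mentioned above.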

Since we also face this problem when highlighting with HTMLCharStripper 
(yes, we do have hyphenation in our HTML docs...), it seems to me that I 
have to adjust the JFlex-generated StandardTokenizerImpl.

Is this the right approach, and how would I have to modify this script?


PS: I see that changes were made in the brand-new 3.1.0 version (we 
are using 3.0.3), but as far as I understand no relevant changes in this 
