lucene-java-user mailing list archives

From "Uwe Schindler" <>
Subject RE: preserving markup of content ?
Date Thu, 08 Apr 2010 06:27:39 GMT
The "simple" solution is very easy:
Index the markup-free document by adding a field with Field.Index.ANALYZED and Field.Store.NO,
so it is searchable but not stored. Then add the same data again (but with markup) to the index
with Field.Store.YES but Field.Index.NO, so it is stored but not searchable. If you like, you can
even do this with the same field name.
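
To illustrate the idea only, here is a toy model in plain Java. It is not Lucene and all names in it are hypothetical: the search terms are built from the markup-free text, while the marked-up text is kept purely for retrieval. In the Lucene 3.0 API the two steps would correspond to `new Field(name, plainText, Field.Store.NO, Field.Index.ANALYZED)` and `new Field(name, markedUpText, Field.Store.YES, Field.Index.NO)`. The regex-based stripping assumes simple angle-bracket tags, which may not match your transcription markup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not Lucene): terms come from plain text, the stored value keeps markup. */
final class IndexedVsStoredSketch {
    // term -> ids of documents containing it (the "indexed, not stored" side)
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
    // doc id -> original text with markup (the "stored, not indexed" side)
    private final List<String> stored = new ArrayList<String>();

    /** Assumption: markup is simple <...> tags; swap in your own stripper. */
    static String stripMarkup(String markedUp) {
        return markedUp.replaceAll("<[^>]*>", " ");
    }

    /** Adds one document and returns its id. */
    int add(String markedUp) {
        int docId = stored.size();
        stored.add(markedUp);
        String plain = stripMarkup(markedUp).toLowerCase(Locale.ROOT);
        for (String term : plain.split("\\W+")) {
            if (term.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            Set<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    /** Returns the stored (marked-up) text of every document containing the term. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int id : ids) {
                hits.add(stored.get(id));
            }
        }
        return hits;
    }
}
```

Searching for a word from the transcript returns the text with its markup intact, while searching for a tag or attribute name finds nothing, because those never reached the term index.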

This works as long as you don't need query highlighting, because the offsets recorded for the
markup-free field do not match positions inside the text with markup. If you do need
highlighting, you have to write your own analyzer that removes the markup in the tokenizer but
preserves the original offsets. One example is the Wikipedia contrib in Lucene, which has
a hand-crafted analyzer that can handle MediaWiki markup syntax.
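
As a rough sketch of what "removes the markup but preserves the original offsets" means, the plain-Java class below (hypothetical names, not a Lucene Tokenizer) skips tags while tokenizing and records each term's start and end position in the original marked-up string. In a real Lucene tokenizer these positions would be reported through the token's offset attribute. The sketch assumes simple `<...>` tags with no `>` inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch (not a Lucene Tokenizer): skip tags, keep original offsets. */
final class OffsetPreservingSketch {

    /** A term plus its [start, end) character offsets in the ORIGINAL marked-up text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Assumption: markup is <...> tags with no '>' inside attribute values. */
    static List<Token> tokenize(String markedUp) {
        List<Token> tokens = new ArrayList<Token>();
        int n = markedUp.length();
        int i = 0;
        while (i < n) {
            char c = markedUp.charAt(i);
            if (c == '<') {
                // Skip the whole tag; it produces no token.
                int close = markedUp.indexOf('>', i);
                i = (close < 0) ? n : close + 1;
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < n && Character.isLetterOrDigit(markedUp.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(markedUp.substring(start, i), start, i));
            } else {
                i++; // whitespace or punctuation between terms
            }
        }
        return tokens;
    }

    /** Wraps matching terms in [[ ]] directly inside the marked-up text. */
    static String highlight(String markedUp, String term) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokenize(markedUp)) {
            if (t.term.equalsIgnoreCase(term)) {
                out.append(markedUp, last, t.start);
                out.append("[[").append(markedUp, t.start, t.end).append("]]");
                last = t.end;
            }
        }
        return out.append(markedUp, last, n(markedUp)).toString();
    }

    private static int n(String s) {
        return s.length();
    }
}
```

Because the offsets refer to the marked-up string, a highlighter can insert its markers into the stored text without disturbing the surrounding tags.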

Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen

> -----Original Message-----
> From: Sulman Sarwar []
> Sent: Thursday, April 08, 2010 5:04 AM
> To:
> Subject: preserving markup of content ?
> Hi All,
> I am working on some language data and I need to index/search it. I
> have used Lucene for indexing plain text documents before (no
> fancy tricks, just plain text indexing). The data that I have now is
> transcribed text and is heavily marked up (it is mostly conversations
> and interviews). I can easily remove the markup, extract the text,
> and feed it to a Lucene indexer, but I need to preserve some important
> markup so that the text makes sense at retrieval time.
> If I leave the required markup intact and index the documents, I
> fear the markup will get tokenized too and will become searchable. I
> don't want the markup to be searchable, but I need to keep it
> attached to the actual text to make retrieval easy. Can
> you suggest what to do and how? Correct me if I am wrong. :)
> Thanks for the help.
> Sulman.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
