lucene-java-user mailing list archives

From Sulman Sarwar
Subject preserving markup of content ?
Date Thu, 08 Apr 2010 03:04:08 GMT
Hi All,

I am working on some language data that I need to index and search. I
have used Lucene before to index plain text documents (no fancy
tricks, just plain text indexing). The data I have now is transcribed
text, mostly conversations and interviews, and it is heavily marked
up. I can easily strip the markup, extract the text, and feed it to a
Lucene indexer, but I need to preserve some important markup so that
the text makes sense at retrieval time. If I leave the required
markup intact and index the documents, I fear the markup will be
tokenized too and become searchable. I don't want the markup to be
searchable, but I need to keep it attached to the actual text somehow
to make retrieval easy. Can you suggest how to do this? Correct me
if I am wrong. :)
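
To make the question concrete, here is a minimal sketch of the split
described above: one copy of each passage with the markup stripped
(the part that should be searchable) and one copy with the markup
left intact (the part that should only be handed back at retrieval).
This is plain Java, not Lucene code. The class name MarkupSplit, the
Fields holder, and the assumption that the markup uses angle-bracket
tags are all hypothetical, chosen only for illustration; the real
transcription markup may look quite different.

```java
import java.util.regex.Pattern;

public class MarkupSplit {
    // Assumption: markup looks like <tag ...>, </tag>, or <tag/>.
    // Real transcription markup may need a proper parser instead of a regex.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    /** Holds the two versions of one transcript passage. */
    public static final class Fields {
        public final String searchable; // markup removed: the text to tokenize and index
        public final String display;    // markup intact: kept for retrieval, not for search

        Fields(String searchable, String display) {
            this.searchable = searchable;
            this.display = display;
        }
    }

    /** Splits a marked-up passage into a searchable copy and a display copy. */
    public static Fields split(String markedUp) {
        // Replace each tag with a space so neighbouring words do not fuse together.
        String noTags = TAG.matcher(markedUp).replaceAll(" ");
        // Collapse the whitespace runs left behind by the removed tags.
        String plain = SPACES.matcher(noTags).replaceAll(" ").trim();
        return new Fields(plain, markedUp);
    }

    public static void main(String[] args) {
        Fields f = split("<u who=\"A\">so <pause/> what happened next?</u>");
        System.out.println(f.searchable); // so what happened next?
        System.out.println(f.display);    // the original string, tags included
    }
}
```

In Lucene terms, the two strings would presumably go into two fields
of the same document: the stripped text indexed (and optionally not
stored), and the marked-up text stored but not indexed, so it comes
back with each hit without ever being tokenized.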

Thanks for the help.

