lucene-java-user mailing list archives

From "Mailing Lists Account" <>
Subject Re: Correlating matched terms with Document
Date Thu, 23 Jan 2003 12:34:07 GMT
Hi Lukas,

1. My problem was not the parser. I am able to extract the required text
from the HTML document and index it. But when Lucene returns a Hit, I am not
sure how I can correlate it back to the different portions of the HTML
document. Assuming that I use JTidy and have a DOM, how will I know whether
the matched keywords come from a hyperlink or a header node?

2. I didn't know that a Phrase query would work once I split the content into
multiple fields. I will try this out. However, your last suggestion, to have
one "content" field that contains everything plus separate fields for
hyperlink and header tags, should work too.
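
A minimal sketch of that field split, assuming the page has already been
cleaned into well-formed XHTML with lower-case tag names (for example by
JTidy). It uses only the JDK's DOM parser, and the class and method names
(FieldSplitter, split, textOf) are hypothetical. Each resulting string would
then be indexed as its own Lucene field; that step is not shown here.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: builds the text for the "content", "headlines"
// and "hrefs" fields from one well-formed XHTML string.
class FieldSplitter {

    /** Concatenates the text of every element that has one of the given tag names. */
    static String textOf(Document doc, String... tags) {
        StringBuilder sb = new StringBuilder();
        for (String tag : tags) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getTextContent().trim()).append('\n');
            }
        }
        return sb.toString();
    }

    /** Returns field name -> text. Assumes input without a DOCTYPE declaration. */
    static Map<String, String> split(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Map<String, String> fields = new LinkedHashMap<>();
            // Everything, including link and headline text.
            fields.put("content", doc.getDocumentElement().getTextContent());
            // Only headline text (h1 to h3 here; extend as needed).
            fields.put("headlines", textOf(doc, "h1", "h2", "h3"));
            // Only the visible text of hyperlinks.
            fields.put("hrefs", textOf(doc, "a"));
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

With this layout a phrase always matches "content" if it occurs at all, and
an additional match in "headlines" or "hrefs" tells you where it came from.
Note that getTextContent() joins adjacent elements without inserting
whitespace, so a real version may want to add separators between blocks.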

In any case, I need to retokenize/reparse the original document to figure
out whether the matched terms belong to a specific tag.
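
A rough sketch of that re-parsing step, again assuming well-formed XHTML
with lower-case tags and using only the JDK's DOM parser. The names
(TagAwareTokens, Tagged, tagsFor) are hypothetical, and the simple
lower-case/split tokenization is an assumption: it would have to mirror
whatever analyzer was used at index time for the terms to line up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: re-tokenizes a document and remembers, for each
// token, the nearest enclosing hyperlink or header tag.
class TagAwareTokens {

    /** A lower-cased word plus the tag category it was found in. */
    static final class Tagged {
        final String term;
        final String tag;
        Tagged(String term, String tag) { this.term = term; this.tag = tag; }
    }

    /** Depth-first walk that carries the current tag category down the tree. */
    static void walk(Node node, String current, List<Tagged> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            for (String w : node.getNodeValue().toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) out.add(new Tagged(w, current));
            }
            return;
        }
        String name = node.getNodeName();
        if (name.equals("a") || name.matches("h[1-6]")) current = name;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, current, out);
        }
    }

    /** Tokens outside any link or header are labelled "text". */
    static List<Tagged> tokenize(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml))).getDocumentElement();
            List<Tagged> out = new ArrayList<>();
            walk(root, "text", out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    /** All tag categories in which the given matched term occurs. */
    static Set<String> tagsFor(String term, List<Tagged> tokens) {
        Set<String> tags = new TreeSet<>();
        for (Tagged t : tokens) {
            if (t.term.equals(term.toLowerCase())) tags.add(t.tag);
        }
        return tags;
    }
}
```

After a hit, each term of the query can be passed to tagsFor to see
whether it occurred in a link, a header, or plain text.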


----- Original Message -----
From: "Lukas Zapletal" <>
To: "Lucene Users List" <>
Sent: Wednesday, January 22, 2003 2:35 AM
Subject: Re: Correlating matched terms with Document

> Hello
> >I have a strange requirement. I am indexing a single HTML Document and
> >searching it immediately for one or more keywords (Boolean/Phrase query).
> >When the keywords are found in the document, I would like to
> >know if the matched keywords are from hyperlink text, a paragraph or one
> ><h1>, <h2> etc tags.
> >
> I suggest using JTidy when parsing the document. It supports the XML DOM model.
> You can easily extract all hyperlinks or headlines. The great thing is
> that even BAD documents are cleaned and repaired before DOM parsing.
> >a) I cannot add multiple fields as I need to do "Phrase" query.
> >
> Why not split it into fields? It should work...
> >b) During the tokenization, I know exactly if a particular token is from
> >specific tag. Can this be stored in
> >the index as some user-defined flags or something like that and later
> >retrieve it. Looking at the API, it doesn't seem to be possible.
> >I see that I can associate token type (such as "word", "eol" ) with the
> >analyzer token, but this is not stored in the index.
> >
> Change parser. See JTidy above.
> >c) One option seems to be to re-tokenize the document after search - like
> >some of the highlight summary examples are doing.  Then
> >I can match the document tokens with the terms.
> >
> >
> I suggest making these fields:
> content: all content with links and headlines
> headlines: only headlines
> hrefs: only hrefs etc
> When you search for a phrase, it will match at least content. If there
> are also hits from headlines, you know the match is in a headline. Am I right?
> --
> Lukas Zapletal      []
> --
> To unsubscribe, e-mail:
> For additional commands, e-mail:
