lucene-java-user mailing list archives

From Lukas Zapletal <l...@root.cz>
Subject Re: Correlating matched terms with Document
Date Tue, 21 Jan 2003 21:05:47 GMT
Hello

>I have a strange requirement. I am indexing a single HTML Document and
>searching it immediately for one or more keywords (Boolean/Phrase query). 
>When the keywords are found in the document, I would like to
>know if the matched keywords are from hyperlink text, a paragraph or one of
><h1>, <h2> etc tags.  
>
I suggest using JTidy when parsing the document. It supports the DOM XML
model, so you can easily extract all hyperlinks or headlines. The great
thing is that malformed documents are also cleaned and repaired before DOM
parsing.
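
A minimal sketch of that extraction step, for illustration only. It uses the
JDK's built-in DOM parser on markup that is already well-formed XHTML, as a
stand-in for JTidy (which would produce a comparable org.w3c.dom.Document
from messy HTML). The class name, method names and sample page are made up
for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of JTidy or Lucene.
class TagTextExtractor {

    /** Stand-in for JTidy: parses input that is already well-formed XHTML. */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Collects the trimmed text content of every element with the given tag name. */
    public static List<String> textOf(Document doc, String tag) {
        List<String> out = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>Lucene tips</h1>"
                + "<p>See the <a href=\"faq.html\">project FAQ</a> first.</p>"
                + "</body></html>";
        Document doc = parse(page);
        System.out.println("headlines: " + textOf(doc, "h1"));
        System.out.println("links:     " + textOf(doc, "a"));
    }
}
```

The same textOf call can be repeated for h2, h3 and so on to gather all
headline levels.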

>a) I cannot add multiple fields as I need to do "Phrase" query.
>
Why not split it into fields? It should work...

>b) During the tokenization, I know exactly if a particular token is from a
>specific tag. Can this be stored in
>the index as some user-defined flags or something like that and later
>retrieve it. Looking at the API, it doesn't seem to be possible.
>I see that I can associate token type (such as "word", "eol" ) with the
>analyzer token, but this is not stored in the index.
>
Change the parser. See JTidy above.

>c) One option seems to be to re-tokenize the document after search - like
>some of the highlight summary examples are doing.  Then
>I can match the document tokens with the terms.
>  
>
I suggest creating these fields:

content: all content with links and headlines
headlines: only headlines
hrefs: only hrefs etc

When you search for a phrase, it will match at least content. If there
are also hits in headlines, you know the match is in a headline. Am I right?
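
A rough sketch of the idea in plain Java, assuming the text for each field
has already been extracted. Simple substring matching on normalized text
stands in for running the same phrase query against each Lucene field; no
Lucene API is used here, and the class name, method names and sample text
are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper illustrating the per-field check; not Lucene code.
class FieldPhraseMatcher {

    /** Lowercases and replaces runs of non-letter/digit characters with one space. */
    public static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
    }

    /** Returns the names of all fields whose text contains the phrase as whole words. */
    public static List<String> fieldsMatching(Map<String, String> fields, String phrase) {
        // Pad with spaces so "news" does not match inside "newsletter".
        String needle = " " + normalize(phrase) + " ";
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String haystack = " " + normalize(e.getValue()) + " ";
            if (haystack.contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", "Lucene tips. See the project FAQ first.");
        fields.put("headlines", "Lucene tips");
        fields.put("hrefs", "project FAQ");

        // A hit in "headlines" or "hrefs" tells you where the phrase came from;
        // a hit in "content" alone means it came from ordinary body text.
        System.out.println(fieldsMatching(fields, "lucene tips"));
        System.out.println(fieldsMatching(fields, "project faq"));
        System.out.println(fieldsMatching(fields, "see the"));
    }
}
```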

-- 
Lukas Zapletal      [lzap@root.cz]
http://www.tanecni-olomouc.cz/lzap