lucene-solr-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: MoreLikeThis: How to get quality terms from html from content stream?
Date Sat, 08 Aug 2009 14:42:47 GMT

On Aug 7, 2009, at 5:23pm, Jay Hill wrote:

> I'm using the MoreLikeThisHandler with a content stream to get documents
> from my index that match content from an html page like this:
> http://localhost:8080/solr/mlt?stream.url=http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/08/06/SP5R194Q13.DTL&mlt.fl=body&rows=4&debugQuery=true
>
> But, not surprisingly, the query generated is meaningless because a lot of
> the markup is picked out as terms:
> <str name="parsedquery_toString">
> body:li body:href body:div body:class body:a body:script body:type body:js
> body:ul body:text body:javascript body:style body:css body:h body:img
> body:var body:articl body:ad body:http body:span body:prop
> </str>
>
> Does anyone know a way to transform the html so that the content can be
> parsed out of the content stream and processed w/o the markup? Or do I need
> to write my own HTMLParsingMoreLikeThisHandler?

You'd want to parse the HTML to extract only text first, and use that  
for your index data.

Both the Nutch and Tika open-source projects include examples of using
HTML parsers (based on TagSoup or CyberNeko) to generate content
suitable for indexing.
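
To make the "extract only text first" step concrete, here is a minimal sketch. It is not taken from Nutch or Tika, and it does not use TagSoup or CyberNeko; it swaps in Python's standard-library html.parser purely for illustration. The names TextExtractor and html_to_text are made up for this example, and a real crawler-grade parser would handle malformed markup far better.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []      # text fragments found outside skipped elements

    def handle_starttag(self, tag, attrs):
        # Tag names and attributes (href, class, ...) are never recorded,
        # so they cannot leak into the extracted text as terms.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with spaces and collapse whitespace runs.
        # Assumption: a stray space inside a word split by inline tags is
        # acceptable for this sketch.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (illustrative only)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        '<html><head><style>p{color:red}</style>'
        '<script type="text/javascript">var ad = 1;</script></head>'
        '<body><div class="a"><p>Giants win <a href="x">again</a></p></div>'
        '</body></html>'
    )
    print(html_to_text(sample))
```

The resulting plain text could then be saved to a file and passed via stream.url, which is the approach Jay reports working below.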

-- Ken

> If I parse the content out to a plain text file and point the stream.url
> param to file:///parsedfile.txt it works great.
>
> -Jay

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378

