nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Nutch/Solr: storing the page cache in Solr
Date Thu, 14 May 2009 11:59:16 GMT
Siddhartha Reddy wrote:
> I'm trying to patch Nutch to allow the page cache to be added to the 
> Solr index when using the SolrIndexer tool. Is there any reason this is 
> not done by default? The Solr schema even has the "cache" field but it 
> is left empty.
> 

This issue is more complicated than it looks. We would also need to handle
non-string content such as various binary formats (PDF, Office, images,
etc.), and there is no support for this in Solr (yet).

Additionally, storing large binary blobs in a Lucene index has some
performance consequences.

Currently Nutch uses Solr for searching, and a separate (set of) segment 
servers for content serving.

> I'm enclosing a patch of the changes I have made. I have done some 
> testing and this seems to work fine. Can someone please take a look at 
> it and let me know if I'm doing anything wrong? I'm especially not sure 
> about the character encoding to assume when converting the Content 
> (which is stored as byte[]) to a String; I'm getting the encoding from 
> Metadata (using the key Metadata.ORIGINAL_CHAR_ENCODING) but it is 
> always null.

The patch looks OK, if handling String content is all you need. The
character encoding should be available in ParseData.getMeta().
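
A minimal sketch of the byte[]-to-String step discussed above, assuming the
encoding name has already been looked up from the parse metadata and may
still be null. The class name ContentDecoder, the method decode, and the
UTF-8 fallback are illustrative assumptions, not part of Nutch or of the
enclosed patch; the metadata lookup itself is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not a Nutch class.
class ContentDecoder {

    // Decodes raw page bytes using the given charset name.
    // Falls back to UTF-8 (an assumed default) when the name is
    // null, blank, malformed, or not supported by this JVM.
    static String decode(byte[] raw, String charsetName) {
        Charset charset = StandardCharsets.UTF_8;
        if (charsetName != null && !charsetName.trim().isEmpty()) {
            try {
                charset = Charset.forName(charsetName.trim());
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Keep the UTF-8 fallback chosen above.
            }
        }
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // Encoding known: decode with the declared charset.
        System.out.println(decode(latin1, "ISO-8859-1"));
        // Encoding missing (null): the fallback is used instead.
        System.out.println(decode("plain text".getBytes(StandardCharsets.UTF_8), null));
    }
}
```

Using an explicit fallback avoids depending on the platform default charset,
which can differ between the machine that crawls and the one that indexes.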

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

