nutch-dev mailing list archives

From Siddhartha Reddy <>
Subject Re: Nutch/Solr: storing the page cache in Solr
Date Fri, 15 May 2009 09:06:58 GMT
Thanks a lot, Andrzej. I only need handling of String content at the moment;
so this should suffice. But if someone would like to store other content as
well, they can take a look at the Binary FieldType that is in the works for
Solr (


On Thu, May 14, 2009 at 5:29 PM, Andrzej Bialecki <> wrote:

> Siddhartha Reddy wrote:
>> I'm trying to patch Nutch to allow the page cache to be added to the Solr
>> index when using the SolrIndexer tool. Is there any reason this is not done
>> by default? The Solr schema even has the "cache" field but it is left empty.
> This issue is more complicated. We would also need to handle non-string
> content such as various binary formats (PDF, Office, images, etc.), and
> there is no support for this in Solr (yet).
> Additionally, storing large binary blobs in a Lucene index has some
> performance consequences.
> Currently Nutch uses Solr for searching, and a separate (set of) segment
> servers for content serving.
>> I'm enclosing a patch of the changes I have made. I have done some testing
>> and this seems to work fine. Can someone please take a look at it and let me
>> know if I'm doing anything wrong? I'm especially not sure about the
>> character encoding to assume when converting the Content (which is stored as
>> byte[]) to a String; I'm getting the encoding from Metadata (using the key
>> Metadata.ORIGINAL_CHAR_ENCODING) but it is always null.
> The patch looks ok, if handling String content is all you need. Char
> encoding should be available in ParseData.getMeta().
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>  Contact: info at sigram dot com
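
For anyone following the encoding question above, here is a minimal sketch of the byte[] to String step. It uses only the JDK and is not taken from the patch. The class name CacheContentDecoder and the decode method are hypothetical, and the encoding name is assumed to have been looked up beforehand (for example from ParseData.getMeta(), as suggested above). Because that lookup can return null, the sketch falls back to a caller-supplied default charset instead of failing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: turns raw page bytes into a String.
class CacheContentDecoder {

    // Decodes content with the named encoding. Falls back to the given
    // charset when the name is null, empty, malformed, or not supported
    // by this JVM.
    static String decode(byte[] content, String encodingName, Charset fallback) {
        Charset charset = fallback;
        if (encodingName != null && !encodingName.trim().isEmpty()) {
            try {
                charset = Charset.forName(encodingName.trim());
            } catch (IllegalArgumentException e) {
                // Covers both IllegalCharsetNameException and
                // UnsupportedCharsetException; keep the fallback.
            }
        }
        return new String(content, charset);
    }

    public static void main(String[] args) {
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Encoding name known (as if read from the parse metadata).
        System.out.println(decode(raw, "UTF-8", StandardCharsets.ISO_8859_1));

        // Encoding name missing (null): the fallback charset is used.
        System.out.println(decode(raw, null, StandardCharsets.UTF_8));
    }
}
```

The choice of fallback is a judgment call: UTF-8 is a common default, but a crawl dominated by legacy pages may decode better with a different one.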
