nutch-dev mailing list archives

From Alfonso Nishikawa <alfonso.nishik...@gmail.com>
Subject Re: Query about indexing crawled data from Nutch to Solr
Date Thu, 27 Nov 2014 01:29:39 GMT
Hi, Prashant,

What version of Nutch are you using?

Regards,

Alfonso Nishikawa

2014-11-26 19:33 GMT+01:00 Prashant Shekar <shekarprashant@gmail.com>:

> Hi,
>
> I have a question about how raw data crawled by Nutch is indexed into
> Solr. We crawled the Acadis dataset using Nutch, and it retrieved 47,580
> files. However, when we indexed these files into Solr, only 2,929 of them
> were actually indexed. I have two questions:
>
> 1) Why might only 2,929 of the 47,580 files have been indexed in Solr?
> Does Solr perform some deduplication on its end that Nutch does not?
>
> 2) When I counted the unique URLs, I found 12,201 of them. We used the
> URL as the key for Solr indexing. So, if there were no errors while
> indexing to Solr, can the number of indexed files still be less than
> 12,201?
>
> Thanks,
> Prasanth Iyer
>
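
On question 2, a minimal sketch of why keying on the URL caps the index size at the number of unique URLs. It assumes the usual unique-key behaviour, where adding a document whose key already exists replaces the earlier one. The sketch models that with a plain Python dict and does not call Solr or Nutch; `index_by_key` and the sample records are hypothetical names and data for illustration only.

```python
def index_by_key(docs, key="url"):
    """Model a key-based index: a later document with the same key
    replaces the earlier one (assumed unique-key semantics)."""
    index = {}
    for doc in docs:
        index[doc[key]] = doc  # overwrite on duplicate key
    return index


# Hypothetical crawl output: three fetched records, two sharing a URL.
docs = [
    {"url": "http://example.org/a", "content": "first fetch"},
    {"url": "http://example.org/a", "content": "second fetch"},
    {"url": "http://example.org/b", "content": "other page"},
]

index = index_by_key(docs)
print(len(docs), "fetched records ->", len(index), "indexed documents")
```

Under that assumption, the indexed count can never exceed the number of unique keys. It can still be lower if some documents are skipped or filtered before they reach the index, which this sketch does not model.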
