lucene-solr-user mailing list archives

From Jon Baer <jonb...@gmail.com>
Subject Re: solr with hadoop
Date Tue, 22 Jun 2010 16:47:14 GMT
I was playing around w/ Sqoop the other day; it's a simple Cloudera tool for imports (MySQL
-> HDFS) @ http://www.cloudera.com/developers/downloads/sqoop/

It seems to me it would be pretty efficient to dump to HDFS and have something like Data
Import Handler read from hdfs:// directly ...

Has this route been discussed / developed before (i.e. DIH w/ an hdfs:// handler)?

- Jon

On Jun 22, 2010, at 12:29 PM, MitchK wrote:

> 
> I wanted to add a Jira-issue about exactly what Otis is asking here.
> Unfortunately, I haven't time for it because of my exams.
> 
> However, I'd like to add a question to Otis' ones:
> If you distribute the indexing process this way, are you able to replicate
> the different documents correctly?
> 
> Thank you.
> - Mitch
> 
> Otis Gospodnetic-2 wrote:
>> 
>> Stu,
>> 
>> Interesting!  Can you provide more details about your setup?  By "load
>> balance the indexing stage" you mean "distribute the indexing process",
>> right?  Do you simply take your content to be indexed, split it into N
>> chunks where N matches the number of TaskNodes in your Hadoop cluster and
>> provide a map function that does the indexing?  What does the reduce
>> function do?  Does that call IndexWriter.addIndexes or do you do that
>> outside Hadoop?
>> 
>> Thanks,
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> 
>> ----- Original Message ----
>> From: Stu Hood <stuhood@webmail.us>
>> To: solr-user@lucene.apache.org
>> Sent: Monday, January 7, 2008 7:14:20 PM
>> Subject: Re: solr with hadoop
>> 
>> As Mike suggested, we use Hadoop to organize our data en route to Solr.
>> Hadoop allows us to load balance the indexing stage, and then we use
>> the raw Lucene IndexWriter.addIndexes method to merge the data to be
>> hosted on Solr instances.
>> 
>> Thanks,
>> Stu
>> 
>> 
>> 
>> -----Original Message-----
>> From: Mike Klaas <mike.klaas@gmail.com>
>> Sent: Friday, January 4, 2008 3:04pm
>> To: solr-user@lucene.apache.org
>> Subject: Re: solr with hadoop
>> 
>> On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:
>> 
>>> I have a huge index (about 110 million documents, 100 fields
>>> each), but the size of the index is reasonable, about 70 GB.
>>> All I need is to increase performance, since some queries that match
>>> a large number of documents run slowly.
>>> So I was wondering: are there any benefits to using Hadoop for this? If
>>> so, what direction should I go? Has anybody integrated
>>> Solr with Hadoop? Does it give any performance boost?
>>> 
>> Hadoop might be useful for organizing your data en route to Solr, but
>> I don't see how it could be used to boost performance over a huge
>> Solr index.  To accomplish that, you need to split it up over two
>> machines (for which you might find Hadoop useful).
>> 
>> -Mike
>> 
> -- 
> View this message in context: http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914589.html
> Sent from the Solr - User mailing list archive at Nabble.com.
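
For readers following the thread, the pattern Stu and Otis discuss (split the content into N chunks, index each chunk independently, then merge the partial indexes) can be sketched roughly as below. This is only a toy, in-memory illustration in plain Python, not Hadoop or Lucene code: the function names (split_into_chunks, build_partial_index, merge_indexes, distributed_index) are made up for this example, a thread pool stands in for the cluster's map tasks, and the merge step only plays the role that an index-merge call such as Lucene's IndexWriter.addIndexes would play in a real setup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(docs, n):
    """Split docs into n chunks (round-robin), one per worker."""
    return [docs[i::n] for i in range(n)]


def build_partial_index(chunk):
    """'Map' step (hypothetical): index one chunk independently of the others."""
    index = defaultdict(set)
    for doc_id, text in chunk:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(partials):
    """Merge step (hypothetical): combine partial indexes into a single index."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged


def distributed_index(docs, n_workers):
    """Index the chunks in parallel, then merge the partial results."""
    chunks = split_into_chunks(docs, n_workers)
    # A thread pool stands in for the cluster's map tasks in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(build_partial_index, chunks))
    return merge_indexes(partials)


# Made-up sample documents: (doc_id, text)
docs = [
    (1, "solr with hadoop"),
    (2, "hadoop for indexing"),
    (3, "merge the indexes for solr"),
]
index = distributed_index(docs, n_workers=2)
```

In this sketch the merged result does not depend on how many chunks the documents were split into, which is the property that makes it possible to spread the indexing stage across workers and merge afterwards.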

