lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Generating Index offline and loading into solrcloud
Date Thu, 19 Nov 2015 23:00:01 GMT
Sure, you can use Lucene to create indexes for shards
if (and only if) you deal with the routing issues....
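
To make the routing issue concrete, here is a minimal Python sketch of
the idea: every document id has to map deterministically to one shard,
and each offline index may only contain the documents for its own shard.
The names shard_for and partition are hypothetical, and crc32 is only a
stand-in hash picked for the sketch. It is not the hash SolrCloud uses,
so a real offline build would have to reproduce the collection's actual
router and hash ranges instead.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number in [0, num_shards)."""
    # ASSUMPTION: crc32 is a stand-in hash for illustration only. It is NOT
    # the hash SolrCloud uses, so these shard numbers will not match a real
    # collection.
    h = zlib.crc32(doc_id.encode("utf-8"))  # unsigned 32-bit value
    # Split the 32-bit hash space into num_shards contiguous ranges.
    return h * num_shards // 2**32


def partition(doc_ids, num_shards):
    """Group ids by shard so each offline index holds one shard's docs."""
    shards = {n: [] for n in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards
```

The point of the sketch is only that the same id always lands in the
same bucket; if the offline mapping disagrees with the one the running
cluster uses, documents end up in shards where Solr will not look for
them.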

About updates: I'm not talking about atomic updates at all.
The usual model for Solr is that if you have a uniqueKey
defined, new versions of documents replace old versions
of documents based on that uniqueKey. MRIT does not
guarantee that replacement, that's all.
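
As a rough illustration of the difference, here is a toy Python model
(plain dicts and lists, not Solr or Lucene code; both function names are
hypothetical). The first function mimics the replace-by-uniqueKey
behavior, the second mimics a bulk build that never looks up existing
ids.

```python
def index_with_unique_key(index: dict, docs):
    """Keyed add: a doc whose id already exists replaces the old version."""
    for doc in docs:
        index[doc["id"]] = doc
    return index


def index_append_only(index: list, docs):
    """Bulk build with no id lookup: every doc is appended as-is."""
    index.extend(docs)
    return index


batch = [{"id": "a", "v": 1}, {"id": "b", "v": 1}]

keyed = {}
index_with_unique_key(keyed, batch)
index_with_unique_key(keyed, batch)  # second run overwrites, no growth

appended = []
index_append_only(appended, batch)
index_append_only(appended, batch)  # second run duplicates every doc
```

After running the same batch twice, keyed holds 2 documents and appended
holds 4, which is why an append-only load is not idempotent.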

Best,
Erick

On Thu, Nov 19, 2015 at 12:56 PM, KNitin <nitin.tnvl@gmail.com> wrote:
> Thanks, Erick. It looks like MRIT runs an embedded Solr per mapper/reducer
> and uses that to index documents. Is that the recommended model? Can we use
> the raw Lucene libraries to generate the index and then load it into
> SolrCloud (barring the complexities of indexing into the right shard and
> merging)?
>
> I am thinking of using this for regular offline indexing, which needs to be
> idempotent. When you say update, do you mean partial updates using _set?
> If we add and delete every time for a document, that should work, right
> (since all docs are indexed by doc id which contains all operational
> history)? Let me know if I am missing something.
>
> On Thu, Nov 19, 2015 at 12:09 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> Note two things:
>>
>> 1> this is running on Hadoop
>> 2> it is part of the standard Solr release as MapReduceIndexerTool,
>> look in the contribs...
>>
>> If you're trying to do this yourself, you must be very careful to index
>> docs to the correct shard, then merge the correct shards. MRIT does this
>> all automatically.
>>
>> Additionally, it has the cool feature that if (and only if) your Solr
>> index is running over HDFS, the --go-live option will automatically
>> merge the indexes into the appropriate running Solr instances.
>>
>> One caveat. This tool doesn't handle _updating_ documents. So if you
>> run it twice on the same data set, you'll have two copies of every doc.
>> It's designed as a bulk initial-load tool.
>>
>> Best,
>> Erick
>>
>>
>>
>> On Thu, Nov 19, 2015 at 11:45 AM, KNitin <nitin.tnvl@gmail.com> wrote:
>> > Great. Thanks!
>> >
>> > On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon <sameer@measuredsearch.com>
>> > wrote:
>> >
>> >> If you are trying to create a large index and want speedups there, you
>> >> could use the MapReduceTool -
>> >> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr.
>> >> At a high level, it takes your files (CSV, JSON, etc.) as input and can
>> >> create either a single or a sharded index that you can then copy to
>> >> your Solr servers. I've used this to create indexes that include
>> >> hundreds of millions of documents in a fairly decent amount of time.
>> >>
>> >> Thanks,
>> >> --
>> >> *Sameer Maggon*
>> >> Measured Search
>> >> www.measuredsearch.com <http://measuredsearch.com/>
>> >>
>> >> On Thu, Nov 19, 2015 at 11:17 AM, KNitin <nitin.tnvl@gmail.com> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > I was wondering if there are existing tools that will generate a Solr
>> >> > index offline (in SolrCloud mode) that can later be loaded into
>> >> > SolrCloud, before I decide to implement my own. I found some tools
>> >> > that do only Solr-based index loading (non-ZK mode). Is there one
>> >> > with ZK mode enabled?
>> >> >
>> >> >
>> >> > Thanks in advance!
>> >> > Nitin
>> >> >
>> >>
>>
