lucene-solr-user mailing list archives

From Bharath Kumar <bharath.mvku...@gmail.com>
Subject Re: How to re-index SOLR data
Date Thu, 11 Aug 2016 00:24:45 GMT
Hi All,

Thanks so much for your inputs. We have a MySQL data source, and I think we
will try to re-index using the MySQL data.

I wanted something where I can export all my current data, say to an Excel
file or some other data source, and then import it on another node into the
same collection starting from empty data.

On Tue, Aug 9, 2016 at 8:44 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> Assuming you can re-index....
>
> Consider "collection aliasing". Say your current collection is C1.
> Create C2 (using the same cluster, Zookeeper and the like). Go
> ahead and index to C2 (however you do that). NOTE: the physical
> machines may be _different_ than C1, or not. That's up to you. The
> critical bit is that you use the same Zookeeper.
>
> Now, when you are done, you use the Collections API CREATEALIAS
> command to point a "pseudo collection" (call it "prod") at C2. This is
> seamless to the users.
>
> The flaw in my plan so far is that you probably go at Collection C1
> directly. So what you might do is create the "prod" alias and point it at
> C1. Now change your LB (or client or whatever) to use the "prod"
> collection,
> then when indexing is complete use CREATEALIAS to point "prod" at C2
> instead.
>
> This is actually a quite well-tested process, often used when you want to
> change "atomically", e.g. when you reindex the same data nightly but want
> all the new data available in its entirety only after it has been QA'd or
> such.
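[The alias switch described above is a single Collections API call per step. A minimal sketch; the hostname and collection names (C1, C2, "prod") are placeholders for your own cluster:]

```shell
# Placeholder host; any node in the SolrCloud cluster works.
SOLR="http://localhost:8983/solr"

# Step 1: create the alias "prod" pointing at the live collection C1.
STEP1="$SOLR/admin/collections?action=CREATEALIAS&name=prod&collections=C1"

# Step 2: once C2 is fully indexed and verified, repoint "prod" at C2.
# CREATEALIAS on an existing alias simply moves it, atomically.
STEP2="$SOLR/admin/collections?action=CREATEALIAS&name=prod&collections=C2"

# Issue each with curl when ready, e.g.:  curl "$STEP1"
echo "$STEP1"
echo "$STEP2"
```

[Clients querying "prod" never see the switch; only the alias target changes.]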
>
> Best,
> Erick
>
> On Tue, Aug 9, 2016 at 2:43 PM, John Bickerstaff
> <john@johnbickerstaff.com> wrote:
> > In my case, I've done two things... neither of them involved taking the
> > data from SOLR to SOLR... although in my reading, I've seen that this is
> > theoretically possible (i.e. sending data from one SOLR server to another
> > SOLR server and having the second SOLR instance re-index...)
> >
> > I haven't used the python script...  that was news to me, but it sounds
> > interesting...
> >
> > What I've done is one of the following:
> >
> > a. Get the data from the original source (database, whatever) and massage
> > it again so that it's ready for SOLR, and then submit it to my new
> > SolrCloud for indexing.
> >
> > b. Keep a separate store of EVERY Solr document as it comes out of my
> > code (in XML) and store it in Kafka or a text file.  Then it's easy to
> > push back into another SOLR instance any time - multiple times if
> > necessary.
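[If you do keep such a store, replaying it is mostly a matter of batching the saved documents into update requests. A minimal sketch in Python; the field names, batch size, and JSON payload format are illustrative assumptions (the real payload depends on whether you stored XML or JSON):]

```python
import json

def batch_updates(docs, batch_size=500):
    """Yield JSON payloads for Solr's /update handler, batch_size docs each."""
    for i in range(0, len(docs), batch_size):
        yield json.dumps(docs[i:i + batch_size])

# Example: three stored documents, replayed in batches of two.
stored = [{"id": str(n), "title_s": f"doc {n}"} for n in range(3)]
payloads = list(batch_updates(stored, batch_size=2))
# Each payload can be POSTed to http://<host>:8983/solr/<collection>/update
```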
> >
> > I'm guessing you don't have the data stored away as in "b"...  And if you
> > don't have a way of getting the data from some central source, then "a"
> > won't work either...  Which leaves you with the concept of sending data
> > from SOLR "A" to SOLR "B" and having "B" reindex...
> >
> > This might serve as a starting point in that case...
> > https://wiki.apache.org/solr/HowToReindex
> >
> > You'll note that there are limitations and a strong caveat against doing
> > this with SOLR, but if you have no other option, then it's the best you
> > can do.
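[For the Solr-to-Solr route, one well-documented way to stream every document out of the source (Solr 4.7+) is cursorMark deep paging. A minimal sketch of the request parameters involved; the uniqueKey field name 'id' is an assumption:]

```python
def export_params(cursor="*", rows=1000):
    """Parameters for one page of a cursorMark export from /select.

    cursorMark paging requires a sort that includes the uniqueKey field
    ('id' assumed here). Start with cursor='*', then pass back the
    nextCursorMark from each response until it stops changing.
    """
    return {
        "q": "*:*",
        "rows": rows,
        "sort": "id asc",
        "cursorMark": cursor,
        "wt": "json",
    }

first_page = export_params()
```

[Each page of results is then re-posted to the destination's /update handler, which is essentially what the reindexing helpers discussed later in this thread automate.]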
> >
> > Do you have the ability to get all the data again from an authoritative
> > source?  (Relational Database or something similar?)
> >
> > On Tue, Aug 9, 2016 at 3:21 PM, Bharath Kumar <bharath.mvkumar@gmail.com>
> > wrote:
> >
> >> Hi John,
> >>
> >> Thanks so much for your inputs. We have time to build another system. So
> >> how did you index the same data on the main SOLR node to the new SOLR
> >> node? Did you use the re-index python script? The new data will be
> >> indexed correctly with the new rules, but what about the old data?
> >>
> >> Our SOLR data is around 30GB with around 60 million documents. We use
> >> SOLR cloud with 3 solr nodes and 3 zookeepers.
> >>
> >> On Tue, Aug 9, 2016 at 2:13 PM, John Bickerstaff <john@johnbickerstaff.com>
> >> wrote:
> >>
> >> > In case this helps...
> >> >
> >> > Assuming you have the resources to build a copy of your production
> >> > environment and assuming you have the time, you don't need to take your
> >> > production down - or even affect its processing...
> >> >
> >> > What I've done (with admittedly smaller data sets) is build a separate
> >> > environment (usually on VMs) and once it's set up, I do the new
> >> > indexing according to the new "rules" (like your change of long to
> >> > string).
> >> >
> >> > Then, in a sense, I don't care how long it takes because it is not
> >> > affecting Prod.
> >> >
> >> > When it's done, I simply switch my load balancer to point to the new
> >> > environment and shut down the old one.
> >> >
> >> > To users, this could be seamless if you handle the load balancer
> >> > correctly and have it refuse new connections to the old servers while
> >> > routing all new connections to the new Solr servers...
> >> >
> >> > On Tue, Aug 9, 2016 at 3:04 PM, Bharath Kumar <bharath.mvkumar@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Nick and Shawn,
> >> > >
> >> > > Thanks so much for the pointers. I will try that out. Thank you again!
> >> > >
> >> > > On Tue, Aug 9, 2016 at 9:40 AM, Nick Vasilyev <nick.vasilyev1@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hi, I work on a python Solr Client
> >> > > > <http://solrclient.readthedocs.io/en/latest/> library and there is a
> >> > > > reindexing helper module that you can use if you are on Solr 4.9+. I
> >> > > > use it all the time and I think it works pretty well. You can
> >> > > > re-index all documents from a collection into another collection or
> >> > > > dump them to the filesystem as JSON. It also supports parallel
> >> > > > execution and can run independently on each shard. There is also a
> >> > > > way to resume if your job craps out half way through, provided your
> >> > > > existing schema is set up with a good date field and unique id.
> >> > > >
> >> > > > You can read the documentation here:
> >> > > > http://solrclient.readthedocs.io/en/latest/Reindexer.html
> >> > > >
> >> > > > Code is pretty short and is here:
> >> > > > https://github.com/moonlitesolutions/SolrClient/blob/master/SolrClient/helpers/reindexer.py
> >> > > >
> >> > > > Here is sample:
> >> > > > from SolrClient import SolrClient
> >> > > > from SolrClient.helpers import Reindexer
> >> > > >
> >> > > > r = Reindexer(SolrClient('http://source_solr:8983/solr'),
> >> > > >               SolrClient('http://destination_solr:8983/solr'),
> >> > > >               source_coll='source_collection',
> >> > > >               dest_coll='destination-collection')
> >> > > > r.reindex()
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Tue, Aug 9, 2016 at 9:56 AM, Shawn Heisey <apache@elyograg.org>
> >> > > > wrote:
> >> > > >
> >> > > > > On 8/9/2016 1:48 AM, bharath.mvkumar wrote:
> >> > > > > What would be the best way to re-index the data in the SOLR
> >> > > > > cloud? We have around 65 million documents and we are planning
> >> > > > > to change the schema by changing the unique key type from long
> >> > > > > to string. How long does it take to re-index 65 million
> >> > > > > documents in SOLR and can you please suggest how to do that?
> >> > > > >
> >> > > > > There is no magic bullet.  And there's no way for anybody but
> >> > > > > you to determine how long it's going to take.  There are people
> >> > > > > who have achieved over 50K inserts per second, and others who
> >> > > > > have difficulty reaching 1000 per second.  Many factors affect
> >> > > > > indexing speed, including the size of your documents, the
> >> > > > > complexity of your analysis, the capabilities of your hardware,
> >> > > > > and how many threads/processes you are using at the same time
> >> > > > > when you index.
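[To make that concrete, a quick back-of-the-envelope for the 65 million documents mentioned earlier, at the two rates Shawn cites. Pure arithmetic; actual throughput has to be measured on your own cluster:]

```python
total_docs = 65_000_000

# Wall-clock hours at a few indexing rates (documents per second).
for rate in (1_000, 10_000, 50_000):
    hours = total_docs / rate / 3600
    print(f"{rate:>6} docs/sec -> {hours:5.1f} hours")
```

[So the same reindex could plausibly take anywhere from well under an hour to the better part of a day, which is why the separate-environment / alias-switch approaches above are attractive: production stays up regardless of how long it takes.]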
> >> > > > >
> >> > > > > Here's some more detailed info about reindexing, but it's
> >> > > > > probably not what you wanted to hear:
> >> > > > >
> >> > > > > https://wiki.apache.org/solr/HowToReindex
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Shawn
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Thanks & Regards,
> >> > > Bharath MV Kumar
> >> > >
> >> > > "Life is short, enjoy every moment of it"
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Thanks & Regards,
> >> Bharath MV Kumar
> >>
> >> "Life is short, enjoy every moment of it"
> >>
>



-- 
Thanks & Regards,
Bharath MV Kumar

"Life is short, enjoy every moment of it"
