lucene-solr-user mailing list archives

From Paras Lehana <paras.leh...@indiamart.com>
Subject Re: [Q] Faster Atomic Updates - use docValues?
Date Thu, 05 Dec 2019 05:57:04 GMT
Hey Erick,

> This is a huge red flag to me: "(but I could only test for the first few
> thousand documents"


Yup, that's probably where the culprit lies. I could only test on the first
batch because a full comparison would have taken a day. I tweaked the merge
values and kept whatever gave a speed boost. My first batch of 5 million
docs took only 40 minutes (atomic updates included), while the last batch of
5 million took more than 18 hours. If this is a mergePolicy issue, should I
also have run an optimize between batches? I remember that when I posted a
single XML of 80 million docs after optimizing a core that already held 30
XMLs of 5 million each, all 80 million were indexed in just a day.
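
To put rough numbers on that slowdown, here is a back-of-the-envelope sketch in Python. The batch size and durations are the ones mentioned above; the helper name `docs_per_second` is only illustrative, and the last batch is rounded down to exactly 18 hours even though it took longer.

```python
def docs_per_second(docs: int, seconds: float) -> float:
    """Average indexing throughput over a whole batch."""
    return docs / seconds

BATCH_DOCS = 5_000_000

# First batch: 40 minutes. Last batch: "more than 18 hours", so 18 h is a lower bound.
first_batch = docs_per_second(BATCH_DOCS, 40 * 60)
last_batch = docs_per_second(BATCH_DOCS, 18 * 60 * 60)

print(f"first batch: ~{first_batch:.0f} docs/s")
print(f"last batch:  ~{last_batch:.0f} docs/s")
print(f"slowdown:    ~{first_batch / last_batch:.0f}x")
```

That works out to roughly 2,083 docs/s for the first batch against about 77 docs/s for the last one, so throughput dropped by a factor of at least 27.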



> The indexing rate you’re seeing is abysmal unless these are _huge_
> documents


The documents only contain the suggestion name, possible titles,
phonetics/spellcheck/synonym fields and numerical fields for boosting. They
are far smaller than what a typical search document would contain.
Auto-Suggest is only concerned with suggestions, so you can guess how simple
the documents are.


> Some data is held on the heap and some in the OS RAM due to MMapDirectory


I'm using StandardDirectoryFactory (which lets Solr choose the right
implementation). I'm also planning to read more about these (looking forward
to using MMap). Thanks for the article!
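
As a rough illustration of the heap versus OS RAM split, the sketch below applies the "about 25% of physical RAM" starting point from Erick's message quoted further down. The 64 GB machine size and the helper name `starting_heap_gb` are assumptions for the example, not details of the actual server.

```python
def starting_heap_gb(physical_ram_gb: float, heap_fraction: float = 0.25) -> float:
    """Starting Java heap size; whatever remains is left to the OS page cache."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    return physical_ram_gb * heap_fraction

ram_gb = 64  # hypothetical machine size, for illustration only
heap_gb = starting_heap_gb(ram_gb)

# The heap would be set via the JVM's -Xmx option; the remainder stays
# available to the OS, which is where memory-mapped index files are cached.
print(f"heap (-Xmx): {heap_gb:.0f} GB, left for OS cache/MMap: {ram_gb - heap_gb:.0f} GB")
```

This is only a starting point; a particular installation may need a larger fraction on the heap.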


You're right. I should change one thing at a time. Let me experiment and
then I will summarize here what I tried. Thank you for your responses. :)

On Wed, 4 Dec 2019 at 20:31, Erick Erickson <erickerickson@gmail.com> wrote:

> This is a huge red flag to me: "(but I could only test for the first few
> thousand documents"
>
> You’re probably right that that would speed things up, but pretty soon when
> you’re indexing your entire corpus there are lots of other considerations.
>
> The indexing rate you’re seeing is abysmal unless these are _huge_ documents,
> but you indicate that at the start you’re getting 1,400 docs/second so I
> don’t think the complexity of the docs is the issue here.
>
> Do note that when we’re throwing RAM figures out, we need to draw a sharp
> distinction between Java heap and total RAM. Some data is held on the heap
> and some in the OS RAM due to MMapDirectory, see Uwe’s excellent article:
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Uwe recommends about 25% of your available physical RAM be allocated to Java
> as a starting point. Your particular Solr installation may need a larger
> percent, IDK.
>
> But basically I’d go back to all default settings and change one thing at a
> time. First, I’d look at GC performance. Is it taking all your CPU? In which
> case you probably need to increase your heap. I pick this first because it’s
> very common that this is a root cause.
>
> Next, I’d put a profiler on it to see exactly where I’m spending time.
> Otherwise you wind up making random changes and hoping one of them works.
>
> Best,
> Erick
>
> > On Dec 4, 2019, at 3:21 AM, Paras Lehana <paras.lehana@indiamart.com> wrote:
> >
> > (but I could only test for the first few
> > thousand documents

-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*
