lucene-solr-user mailing list archives

From Martin Koch <...@issuu.com>
Subject Re: SolrCloud and external file fields
Date Wed, 21 Nov 2012 07:53:54 GMT
Mikhail

I appreciate your input, it's very useful :)

On Wed, Nov 21, 2012 at 6:30 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Martin,
> This deployment seems a little confusing to me. You have a 16-way, fairly
> virtual "box" and send 16 requests for a really heavy operation at the same
> moment, so it does not surprise me that you lose it for some period of
> time. At that time you should see more than 16 in the load average metrics.
> I suggest sending the commit to those cores one by one, accepting
> inconsistency and some sort of blinking as a trade-off for availability. In
> this case only a single virtual CPU will be fully consumed by the commit's
> _thread divergence action_ and the others will serve requests.
>

I wasn't aware until now that it is possible to send a commit to one core
only. What we observed was the effect of curl
localhost:8080/solr/update?commit=true but perhaps we should experiment
with solr/coreN/update?commit=true. A quick trial run seems to indicate
that a commit to a single core causes commits on all cores.
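
The one-by-one commit suggestion above can be sketched in Python roughly as follows. This is only an illustration: the host and port come from the curl command, while the core names (core0 to core15), the pause between commits, and the helper names are assumptions. As noted, in this setup a commit to one core may still trigger commits on the others, so the sketch shows the request pattern, not a guaranteed fix.

```python
import time
import urllib.request

BASE_URL = "http://localhost:8080/solr"  # host and port taken from the curl example


def commit_url(core, base_url=BASE_URL):
    """Build the update URL that asks a single core to commit."""
    return f"{base_url}/{core}/update?commit=true"


def http_get(url):
    """Default sender: issue a GET and return the HTTP status code."""
    with urllib.request.urlopen(url) as response:
        return response.status


def commit_one_by_one(cores, send=http_get, pause_seconds=0.0):
    """Send the commit to each core in turn, never to two cores at once."""
    statuses = {}
    for core in cores:
        statuses[core] = send(commit_url(core))  # waits for this core's answer
        time.sleep(pause_seconds)  # hypothetical breathing room between commits
    return statuses


# Hypothetical core names; substitute whatever the installation defines.
CORES = [f"core{n}" for n in range(16)]
```

Calling `commit_one_by_one(CORES)` would then walk through the cores sequentially; passing a different `send` function makes it easy to dry-run the loop without touching Solr.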


Perhaps I should clarify that we are using SOLR as a black box; we do not
touch the code at all - we only install the distribution WAR file and
proceed from there.


> Also, from my POV such deployments should start from at least *16* 4-way
> vboxes; it's more expensive, but availability should be much better during
> CPU-consuming operations.
>

Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with
16 cores? Or am I misunderstanding something :) ?


> Other details: if you use a single jetty for all of them, are you sure that
> jetty's threadpool doesn't limit requests? Is it large enough?
> You have 60G and set -Xmx=10G. Are you sure that the total size of the cores'
> index directories is less than 45G?
>

The total index size is 230 GB, so it won't fit in RAM, but we're using an
SSD disk to minimize disk access time. We have tried putting the EFF onto a
RAM disk, but this didn't have a measurable effect.

Thanks,
/Martin


> Thanks
>
>
> On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <mak@issuu.com> wrote:
>
> > Mikhail
> >
> > PSB
> >
> > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > mkhludnev@griddynamics.com> wrote:
> >
> > > Martin,
> > >
> > > Please find additional questions from me below.
> > >
> > > Simone,
> > >
> > > I'm sorry for hijacking your thread. The only thing I've heard about it
> > > at recent ApacheCon sessions is that Zookeeper is supposed to replicate
> > > those files as configs under solr home. And I'm really looking forward
> > > to knowing how it works with huge files in production.
> > >
> > > Thank You, Guys!
> > >
> > > On 20.11.2012 at 18:06, "Martin Koch" <mak@issuu.com> wrote:
> > > >
> > > > Hi Mikhail
> > > >
> > > > Please see answers below.
> > > >
> > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > mkhludnev@griddynamics.com> wrote:
> > > >
> > > > > Martin,
> > > > >
> > > > > Thank you for telling your own "war-story". It's really useful for
> > > > > the community.
> > > > > The first question might not seem very sensible, but would you tell
> > > > > me what blocks searching during the EFF reload, when it's triggered
> > > > > by a handler or by a listener?
> > > > >
> > > >
> > > > We continuously index new documents using CommitWithin to get regular
> > > > commits. However, we observed that the EFFs were not re-read, so we
> > > > had to do external commits (curl '.../solr/update?commit=true') to
> > > > force a reload. When this is done, solr blocks. I can't tell you
> > > > exactly why it's doing that (it was related to SOLR-3985).
> > >
> > > Is there a chance to get a thread dump when they are blocked?
> > >
> > >
> > Well I could try to recreate the situation. But the setup is fairly
> > simple: Create a large EFF in a largeish index with many shards. Issue a
> > commit, and then try to do a search. Solr will not respond to the search
> > before the commit has completed, and this will take a long time.
> >
> >
> > >
> > > >
> > > >
> > > > > I don't really get the sentence about sequential commits and the
> > > > > number of cores. Do I get it right that the file is replicated via
> > > > > Zookeeper? Doesn't it
> > > > >
> > > >
> > > > Again, this is observed behavior. When we issue a commit on a system
> > > > with many solr cores using EFFs, the system blocks for a long time
> > > > (15 minutes). We do NOT use zookeeper for anything. The EFF is a
> > > > symlink from each core's index dir to the actual file, which is
> > > > updated by an external process.
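
A rough Python sketch of the symlink arrangement just described: one shared file, swapped in atomically by the external process, with a symlink in each core's directory pointing at it. All names here (`publish_eff`, `link_into_cores`, the paths, the link name) are made up for illustration; the thread does not show the actual update process.

```python
import os
import tempfile


def publish_eff(lines, target_path):
    """Write the new file next to the target, then swap it in atomically."""
    directory = os.path.dirname(target_path)
    # mkstemp creates the file with mode 0600; adjust the permissions if
    # Solr runs as a different user than the process writing the file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.writelines(line + "\n" for line in lines)
    os.replace(tmp_path, target_path)  # readers see the old or the new file


def link_into_cores(target_path, core_dirs, link_name):
    """Point a symlink in every core's directory at the shared file."""
    for core_dir in core_dirs:
        link_path = os.path.join(core_dir, link_name)
        if os.path.lexists(link_path):
            os.remove(link_path)  # replace a stale link or file
        os.symlink(target_path, link_path)
```

Because the links point at a path rather than at a particular file, republishing the shared file does not require touching the links again.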
> > >
> > > Hold on, I asked about Zookeeper because the subject mentions SolrCloud.
> > >
> > > Do you use SolrCloud, SolrShards, or are these cores just replicas of
> > > the same index?
> > >
> >
> > Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a
> > bit unsure about the terminology here, but we've got a single index
> > divided into 16 shards. Each shard is hosted in a solr core.
> >
> >
> > > Also, about the symlink - don't you share that file via some NFS?
> > >
> > No, we generate the EFF on the local solr host (there is only one
> > physical host that holds all shards), so there is no need for NFS or
> > copying files around. No need for Zookeeper either.
> >
> >
> > > How many cores do you run per box?
> > >
> > This box is a 16-virtual core (8 hyperthreaded cores) with 60GB of RAM.
> > We run 16 solr cores on this box in Jetty.
> >
> >
> > > Do the boxes have plenty of RAM to cache the filesystem besides the JVM
> > > heaps?
> > >
> > Yes. We've allocated 10GB for jetty, and left the rest for the OS.
> >
> >
> > > I assume you use 64 bit linux and mmap directory. Please confirm that.
> > >
> > >
> > We use 64-bit linux. I'm not sure about the mmap directory or where that
> > would be configured in solr - can you explain that?
> >
> > >
> > > >
> > > >
> > > > > cause a scalability problem or a long time to reload? Will it help
> > > > > if we have, let's say, an ExternalDatabaseField which pulls values
> > > > > from JDBC, i.e.
> > > > >
> > > >
> > > > I think the possibility of having some fields being retrieved from an
> > > > external, dynamically updatable store would be really interesting.
> > > > This could be JDBC, something in-memory like redis, or a NoSql product
> > > > (e.g. Cassandra).
> > >
> > > Ok. Let's keep it in mind as a possible direction.
> > >
> >
> > Alternatively, an API that would allow updating a single field for a
> > document might be an option.
> >
> >
> > >
> > > >
> > > >
> > > > > Why can't all cores read these values simultaneously?
> > > > >
> > > >
> > > > Again, this is a solr implementation detail that I can't answer :)
> > > >
> > > >
> > > > > Can you confirm that the IDs in the file are ordered by the index
> > > > > term order?
> > > > >
> > > >
> > > > Yes, we sorted the files (standard UNIX sort).
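
A small Python sketch of producing such a sorted file body. It assumes the usual `key=value` line format for external files, and it assumes that "index term order" for string IDs means plain byte order of the UTF-8 encoded key (the same order `LC_ALL=C sort` gives); the function name is made up for illustration.

```python
def eff_lines(values):
    """Render {doc_id: value} as 'id=value' lines, sorted by the UTF-8 bytes of the id."""
    ordered = sorted(values.items(), key=lambda item: item[0].encode("utf-8"))
    return [f"{doc_id}={value}" for doc_id, value in ordered]
```

Sorting on the encoded bytes rather than on the strings themselves avoids locale-dependent ordering, which is the usual surprise with a plain UNIX sort.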
> > > >
> > > >
> > > > > AFAIK it can impact load time.
> > > > >
> > > > Yes, it does
> > >
> > > Ok, I've got that you are aware of it, and your IDs are just strings, not
> > > integers.
> > >
> > >
> > Yes, ids are strings.
> >
> > >
> > > >
> > > >
> > > > > Regarding your post-query solution, can you tell me: if the query
> > > > > found 10000 docs, but I need to display only the first page with
> > > > > 100 rows, do I need to pull all 10K results to the frontend to
> > > > > order them by rank?
> > > > >
> > > > >
> > > > In our architecture, the clients query an API that generates the SOLR
> > > > query, retrieves the relevant additional fields that we need, and
> > > > returns the relevant JSON to the front-end.
> > > >
> > > > In our use case, results are returned from SOLR by the 10's, not by
> > > > the 1000's, so it is a manageable job. Even so, if solr returned
> > > > thousands of results, it would be up to the implementation of the api
> > > > to augment only the results that needed to be returned to the
> > > > front-end.
> > > >
> > > > Even so, patching up a JSON structure with 10000 results should be
> > > > possible.
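
The "augment only the results that are returned" idea can be sketched in a few lines of Python. The field name `reads`, the `id` key, and the `lookup_reads` callable (standing in for whatever store holds the live counts) are assumptions for illustration, not details from the thread.

```python
def augment_page(docs, lookup_reads, start=0, rows=10):
    """Attach the current read count only to the documents on the requested page."""
    page = docs[start:start + rows]
    # Build new dicts so the original Solr response is left untouched.
    return [dict(doc, reads=lookup_reads(doc["id"])) for doc in page]
```

The lookup cost is then proportional to the page size, not to the number of hits.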
> > >
> > > You are right. I'm concerned anyway because retrieving the whole result
> > > is expensive, and not always possible.
> > >
> > >
> > In our case, getting the whole result is almost impossible, because that
> > would be millions of documents, and returning the Nth result seems to be
> > a quadratic (or worse) operation in SOLR.
> >
> > >
> > > >
> > > >
> > > > > I'd really appreciate it if you commented on the questions above.
> > > > > PS: It's time to pitch: how much can
> > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > ExternalFileField" help you?
> > > > >
> > > > >
> > > > It looks very interesting :) Does it make it possible to avoid
> > > > re-reading the EFF on every commit, and only re-read the values that
> > > > have actually changed?
> > >
> > >
> > > You don't need a commit (in SOLR-4085) to reload the file content, but
> > > after a commit you need to read the whole file and scan all key terms
> > > and postings. That's because the EFF sits on top of the top-level
> > > searcher; it's the Solr-like way. In some future we might have a
> > > per-segment EFF; in this case adding a segment will trigger a full file
> > > scan, but in the index only that new segment will be scanned. It should
> > > be faster. You know, straightforward sharing of internal data
> > > structures between different index views/generations is not possible.
> > > If you are asking about applying delta changes to the external file,
> > > that's something we did ourselves: http://goo.gl/P8GFq . This feature
> > > is much more doubtful and vague, although it might be the next
> > > contribution after SOLR-4085.
> > >
> > > >
> > > > /Martin
> > > >
> > > >
> > > > >
> > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <mak@issuu.com> wrote:
> > > > >
> > > > > > Solr 4.0 does support using EFFs, but it might not give you what
> > > > > > you're hoping for.
> > > > > >
> > > > > > We tried using Solr Cloud, and have given up again.
> > > > > >
> > > > > > The EFF is placed in the parent of the index directory in each
> > > > > > core; each core reads the entire EFF and picks out the IDs that
> > > > > > it is responsible for.
> > > > > >
> > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't answer
> > > > > > queries) while re-reading the EFF. Even worse, it seems that the
> > > > > > time to re-read the EFF is multiplied by the number of cores in
> > > > > > use (i.e. the EFF is re-read by each core sequentially). The
> > > > > > contents of the EFF become active after the first EXTERNAL commit
> > > > > > (commitWithin does NOT work here) after the file has been
> > > > > > updated.
> > > > > >
> > > > > > In our case, the EFF was quite large - around 450MB - and we use
> > > > > > 16 shards, so when we triggered an external commit to force
> > > > > > re-reading, the whole system would block for several (10-15)
> > > > > > minutes. This won't work in a production environment. The reason
> > > > > > for the size of the EFF is that we have around 7M documents in
> > > > > > the index; each document has a 45 character ID.
> > > > > >
> > > > > > We got some help to try to fix the problem so that the re-read of
> > > > > > the EFF proceeds in the background (see
> > > > > > here <https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > a fix on the 4.1 branch). However, even though the re-read
> > > > > > proceeds in the background, launching solr now takes at least as
> > > > > > long as re-reading the EFFs. Again, this is not good enough for
> > > > > > our needs.
> > > > > >
> > > > > > The next issue is that you cannot sort on EFF fields (though you
> > > > > > can return them as values using &fl=field(my_eff_field)). This is
> > > > > > also fixed in the 4.1 branch here
> > > > > > <https://issues.apache.org/jira/browse/SOLR-4022>.
> > > > > >
> > > > > > So: Even after these fixes, EFF performance is not that great.
> > > > > > Our solution is as follows: The actual value of the popularity
> > > > > > measure (say, reads) that we want to report to the user is
> > > > > > inserted into the search response post-query by our query
> > > > > > front-end. This value will then be the authoritative value at the
> > > > > > time of the query. The value of the popularity measure that we
> > > > > > use for boosting in the ranking of the search results is only
> > > > > > updated when the value has changed enough so that the impact on
> > > > > > the boost will be significant (say, more than 2%). This does
> > > > > > require frequent re-indexing of the documents that have
> > > > > > significant changes in the number of reads, but at least we won't
> > > > > > have to update a document if it moves from, say, 1000000 to
> > > > > > 1000001 reads.
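
The "only re-index when the boost changes by more than 2%" rule might look roughly like this in Python. The boost function below (a base-10 logarithm of the read count) is purely hypothetical, since the thread does not say how the boost is derived from the number of reads; only the 2% threshold comes from the message.

```python
import math


def boost(reads):
    """Hypothetical boost function; the real one is not given in the thread."""
    return math.log10(1 + reads)


def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """True when the boost moved by more than `threshold` relative to the indexed boost."""
    old = boost(indexed_reads)
    new = boost(current_reads)
    if old == 0:
        return new != 0  # any boost at all is a change from zero
    return abs(new - old) / old > threshold
```

With a logarithmic boost like this one, a move from 1000000 to 1000001 reads leaves the boost practically unchanged, so the document is not re-indexed, while a jump from 10 to 1000 reads is.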
> > > > > >
> > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > >
> > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simoneg@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > > > > > > However, in this index, an external file field is used for
> > > > > > > popularity ranking.
> > > > > > >
> > > > > > > Does SolrCloud support external file fields? How does it cope
> > > > > > > with sharding and replication? Where should the external file
> > > > > > > be placed now that the index folder is not local but in the
> > > > > > > cloud?
> > > > > > >
> > > > > > > Are there other best practices to deal with the use cases
> > > > > > > external file fields were used for, like popularity/ranking, in
> > > > > > > SolrCloud? Custom ValueSources going to something external?
> > > > > > >
> > > > > > > Thanks in advance,
> > > > > > > Simone
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours
> > > > > Mikhail Khludnev
> > > > > Principal Engineer,
> > > > > Grid Dynamics
> > > > >
> > > > > <http://www.griddynamics.com>
> > > > >  <mkhludnev@griddynamics.com>
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mkhludnev@griddynamics.com>
>
