lucene-solr-user mailing list archives

From Martin Koch <...@issuu.com>
Subject Re: SolrCloud and external file fields
Date Wed, 21 Nov 2012 11:56:58 GMT
Mikhail,

PSB

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch <mak@issuu.com> wrote:
>
> >
> > I wasn't aware until now that it is possible to send a commit to one core
> > only. What we observed was the effect of curl
> > localhost:8080/solr/update?commit=true but perhaps we should experiment
> > with solr/coreN/update?commit=true. A quick trial run seems to indicate
> > that a commit to a single core causes commits on all cores.
> >
> You should see something like this in the log:
> ... SolrCmdDistributor .... Distrib commit to: ...
>
Yup, a commit to a single core results in a commit on all cores.
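The two commit forms discussed above differ only in the URL path. A minimal sketch of how they are built, assuming the host and port from the curl example; the core name `core3` is a hypothetical stand-in for `coreN`. In this thread, both forms ended up committing on all cores.

```python
from typing import Optional
from urllib.parse import urlencode


def commit_url(base: str, core: Optional[str] = None) -> str:
    """Build an explicit-commit URL, optionally addressed to a single core."""
    root = base.rstrip("/")
    path = f"{root}/{core}/update" if core else f"{root}/update"
    return f"{path}?{urlencode({'commit': 'true'})}"


if __name__ == "__main__":
    base = "http://localhost:8080/solr"  # host and port as in the curl example
    print(commit_url(base))              # commit without naming a core
    print(commit_url(base, "core3"))     # commit addressed to one (hypothetical) core
```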


> >
> >
> > Perhaps I should clarify that we are using SOLR as a black box; we do not
> > touch the code at all - we only install the distribution WAR file and
> > proceed from there.
> >
> I still don't understand how you deploy/launch Solr. How many Jetty
> instances do you start? Do you pass -DzkRun -DzkHost -DnumShards=2, or do
> you specify the shards= param for every request and distribute updates
> yourself? What collections do you create, and with which settings?
>
We let SOLR do the sharding using one collection with 16 SOLR cores
holding one shard each. We launch only one instance of Jetty with the
following arguments:

-DnumShards=16
-DzkHost=<zookeeperhost:port>
-Xmx10G
-Xms10G
-Xmn2G
-server

Would you like to see the solrconfig.xml?

/Martin


> >
> >
> > > Also, from my POV such deployments should start from at least *16* 4-way
> > > vboxes. It's more expensive, but availability should be much better
> > > during CPU-consuming operations.
> > >
> >
> > Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
> > with 16 cores? Or am I misunderstanding something :) ?
> >
> I prefer to start from 16 hosts with 4 cores each.
>
>
> >
> >
> > > Other details: if you use a single Jetty for all of them, are you sure
> > > that Jetty's thread pool doesn't limit requests? Is it large enough?
> > > You have 60G and set -Xmx10G. Are you sure that the total size of the
> > > cores' index directories is less than 45G?
> > >
> > The total index size is 230 GB, so it won't fit in RAM, but we're using
> > an SSD to minimize disk access time. We have tried putting the EFF onto
> > a RAM disk, but this didn't have a measurable effect.
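The memory figures in this exchange (60G of RAM, a 10G heap, a 230 GB index) can be put into a rough budget. This is a back-of-the-envelope sketch only: it assumes everything outside the JVM heap is available as OS page cache, which is an upper bound, and the function name is made up for illustration.

```python
def page_cache_coverage(ram_gb: float, heap_gb: float, index_gb: float) -> float:
    """Best-case fraction of the index that can sit in the OS page cache."""
    cache_gb = max(ram_gb - heap_gb, 0.0)  # memory left over after the JVM heap
    return min(cache_gb / index_gb, 1.0)


if __name__ == "__main__":
    coverage = page_cache_coverage(ram_gb=60, heap_gb=10, index_gb=230)
    print(f"at most {coverage:.0%} of the index fits in the page cache")
```

With these numbers at most about a fifth of the index can be cached, which is consistent with relying on an SSD for the rest.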
> >
> > Thanks,
> > /Martin
> >
> >
> > > Thanks
> > >
> > >
> > > On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch <mak@issuu.com> wrote:
> > >
> > > > Mikhail
> > > >
> > > > PSB
> > > >
> > > > On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev <
> > > > mkhludnev@griddynamics.com> wrote:
> > > >
> > > > > Martin,
> > > > >
> > > > > Please find additional question from me below.
> > > > >
> > > > > Simone,
> > > > >
> > > > > I'm sorry for hijacking your thread. The only thing I've heard about it
> > > > > at recent ApacheCon sessions is that Zookeeper is supposed to replicate
> > > > > those files as configs under solr home. And I'm really looking forward
> > > > > to learning how it works with huge files in production.
> > > > >
> > > > > Thank You, Guys!
> > > > >
> > > > > On 20.11.2012 18:06, "Martin Koch" <mak@issuu.com> wrote:
> > > > > >
> > > > > > Hi Mikhail
> > > > > >
> > > > > > Please see answers below.
> > > > > >
> > > > > > On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev <
> > > > > > mkhludnev@griddynamics.com> wrote:
> > > > > >
> > > > > > > Martin,
> > > > > > >
> > > > > > > Thank you for telling your own "war story". It's really useful
> > > > > > > for the community.
> > > > > > > The first question might seem naive, but would you tell me what
> > > > > > > blocks searching during EFF reload, when it's triggered by a
> > > > > > > handler or by a listener?
> > > > > > >
> > > > > >
> > > > > > We continuously index new documents using CommitWithin to get
> > > > > > regular commits. However, we observed that the EFFs were not
> > > > > > re-read, so we had to do external commits
> > > > > > (curl '.../solr/update?commit=true') to force reload. When this
> > > > > > is done, solr blocks. I can't tell you exactly why it's doing
> > > > > > that (it was related to SOLR-3985).
> > > > >
> > > > > Is there a chance to get a thread dump when they are blocked?
> > > > >
> > > > >
> > > > Well I could try to recreate the situation. But the setup is fairly
> > > > simple: Create a large EFF in a largeish index with many shards. Issue
> > > > a commit, and then try to do a search. Solr will not respond to the
> > > > search before the commit has completed, and this will take a long time.
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > I don't really get the sentence about sequential commits and
> > > > > > > number of cores. Do I understand correctly that the file is
> > > > > > > replicated via Zookeeper? Doesn't it
> > > > > > >
> > > > > >
> > > > > > Again, this is observed behavior. When we issue a commit on a
> > > > > > system with many solr cores using EFFs, the system blocks for a
> > > > > > long time (15 minutes). We do NOT use zookeeper for anything. The
> > > > > > EFF is a symlink from each core's index dir to the actual file,
> > > > > > which is updated by an external process.
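The symlink layout described above can be sketched as follows. This assumes a POSIX filesystem and Solr's documented `external_<fieldname>` naming convention for EFF files. The directory names and the field name `popularity` are hypothetical, and the thread does not show the actual script used.

```python
import tempfile
from pathlib import Path
from typing import List


def link_eff(shared_file: Path, core_data_dirs: List[Path], field: str) -> List[Path]:
    """Point each core's data dir at one shared external file via a symlink."""
    links = []
    for data_dir in core_data_dirs:
        data_dir.mkdir(parents=True, exist_ok=True)
        link = data_dir / f"external_{field}"
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link or file from an earlier run
        link.symlink_to(shared_file.resolve())
        links.append(link)
    return links


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    shared = root / "popularity.txt"   # the file an external process rewrites
    shared.write_text("doc-1=12.0\n")
    dirs = [root / f"core{i}" / "data" for i in range(16)]
    print(len(link_eff(shared, dirs, "popularity")), "links created")
```

Because every core sees the same file through its link, the external process only has to rewrite one file; each core still reads the whole file on reload.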
> > > > >
> > > > > Hold on, I asked about Zookeeper because the subject mentions
> > > > > SolrCloud.
> > > > >
> > > > > Do you use SolrCloud, SolrShards, or are these cores just replicas
> > > > > of the same index?
> > > > >
> > > >
> > > > Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm
> > > > a bit unsure about the terminology here, but we've got a single index
> > > > divided into 16 shards. Each shard is hosted in a solr core.
> > > >
> > > >
> > > > > Also, about the symlink - don't you share that file via some NFS?
> > > > >
> > > > No, we generate the EFF on the local solr host (there is only one
> > > > physical host that holds all shards), so there is no need for NFS or
> > > > copying files around. No need for Zookeeper either.
> > > >
> > > >
> > > > > How many cores do you run per box?
> > > > >
> > > > This box has 16 virtual cores (8 hyperthreaded cores) and 60GB of
> > > > RAM. We run 16 solr cores on this box in Jetty.
> > > >
> > > >
> > > > > Do the boxes have plenty of RAM to cache the filesystem besides the
> > > > > JVM heaps?
> > > > >
> > > > Yes. We've allocated 10GB for Jetty, and left the rest for the OS.
> > > >
> > > >
> > > > > I assume you use 64-bit Linux and the mmap directory. Please confirm
> > > > > that.
> > > > >
> > > > >
> > > > We use 64-bit linux. I'm not sure about the mmap directory or where
> > > > that would be configured in solr - can you explain that?
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > cause a scalability problem or a long time to reload? Would it
> > > > > > > help if we had, let's say, an ExternalDatabaseField which pulls
> > > > > > > values from JDBC?
> > > > > > >
> > > > > >
> > > > > > I think the possibility of having some fields being retrieved
> > > > > > from an external, dynamically updatable store would be really
> > > > > > interesting. This could be JDBC, something in-memory like redis,
> > > > > > or a NoSql product (e.g. Cassandra).
> > > > >
> > > > > Ok. Let's have it in mind as a possible direction.
> > > > >
> > > >
> > > > Alternatively, an API that would allow updating a single field for a
> > > > document might be an option.
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > why all cores can't read these values simultaneously?
> > > > > > >
> > > > > >
> > > > > > Again, this is a solr implementation detail that I can't answer :)
> > > > > >
> > > > > >
> > > > > > > Can you confirm that the IDs in the file are ordered by the
> > > > > > > index term order?
> > > > > > >
> > > > > >
> > > > > > Yes, we sorted the files (standard UNIX sort).
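For illustration, an EFF is a text file of `key=value` lines, and the question above is whether the keys are sorted the way the index orders its terms. A minimal sketch, assuming string IDs and float values: Python compares strings by code point, which matches UTF-8 byte order. A plain UNIX `sort` is locale-dependent, so `LC_ALL=C sort` is the variant that gives byte order. The function name and sample IDs are made up.

```python
from typing import Dict


def render_eff(values: Dict[str, float]) -> str:
    """Render key=value lines with keys sorted in code-point order."""
    lines = [f"{key}={value}" for key, value in sorted(values.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical IDs; uppercase sorts before lowercase in code-point order.
    print(render_eff({"doc-b": 2.0, "doc-a": 1.5, "Doc-c": 3.0}), end="")
```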
> > > > > >
> > > > > >
> > > > > > > AFAIK it can impact load time.
> > > > > > >
> > > > > > Yes, it does
> > > > >
> > > > > Ok, I understand that you are aware of it, and that your IDs are
> > > > > just strings, not integers.
> > > > >
> > > > >
> > > > Yes, ids are strings.
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > Regarding your post-query solution: if a query found 10000 docs
> > > > > > > but I need to display only the first page with 100 rows, do I
> > > > > > > need to pull all 10K results to the frontend to order them by
> > > > > > > rank?
> > > > > > >
> > > > > > >
> > > > > > In our architecture, the clients query an API that generates the
> > > > > > SOLR query, retrieves the relevant additional fields that we need,
> > > > > > and returns the relevant JSON to the front-end.
> > > > > >
> > > > > > In our use case, results are returned from SOLR by the 10's, not
> > > > > > by the 1000's, so it is a manageable job. Even so, if solr
> > > > > > returned thousands of results, it would be up to the
> > > > > > implementation of the api to augment only the results that needed
> > > > > > to be returned to the front-end.
> > > > > >
> > > > > > Even so, patching up a JSON structure with 10000 results should
> > > > > > be possible.
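The post-query step described above could look roughly like this. It is a sketch under assumptions, not the API from the thread: `fetch_reads` stands in for whatever external store holds the current values, and the field names `id` and `reads` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def augment_page(
    docs: Sequence[dict],
    fetch_reads: Callable[[List[str]], Dict[str, int]],
    id_field: str = "id",
    out_field: str = "reads",
) -> List[dict]:
    """Attach the current external value to each document of one result page."""
    ids = [doc[id_field] for doc in docs]
    reads = fetch_reads(ids)  # one batched lookup for the page, not one per doc
    # Build new dicts so the search response itself is left unmodified.
    return [{**doc, out_field: reads.get(doc[id_field])} for doc in docs]


if __name__ == "__main__":
    store = {"doc-1": 1000001, "doc-2": 42}  # stand-in for the external store
    page = [{"id": "doc-2", "title": "B"}, {"id": "doc-1", "title": "A"}]
    print(augment_page(page, lambda ids: {i: store[i] for i in ids if i in store}))
```

Only the page that is actually returned gets augmented, so the cost depends on the page size rather than on the total number of hits.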
> > > > >
> > > > > You are right. I'm concerned anyway, because retrieving the whole
> > > > > result is expensive, and not always possible.
> > > > >
> > > > >
> > > > In our case, getting the whole result is almost impossible, because
> > > > that would be millions of documents, and returning the Nth result
> > > > seems to be a quadratic (or worse) operation in SOLR.
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > I'd really appreciate it if you commented on the questions above.
> > > > > > > PS: It's time to pitch: how much can
> > > > > > > https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
> > > > > > > ExternalFileField" help you?
> > > > > > >
> > > > > > >
> > > > > > It looks very interesting :) Does it make it possible to avoid
> > > > > > re-reading the EFF on every commit, and only re-read the values
> > > > > > that have actually changed?
> > > > >
> > > > >
> > > > > You don't need a commit (in SOLR-4085) to reload the file content,
> > > > > but after a commit you need to read the whole file and scan all key
> > > > > terms and postings. That's because the EFF sits on top of the
> > > > > top-level searcher; it's the Solr-like way. In the future we might
> > > > > have a per-segment EFF. In that case adding a segment will still
> > > > > trigger a full file scan, but in the index only the new segment will
> > > > > be scanned. It should be faster. You know, straightforward sharing
> > > > > of internal data structures between different index
> > > > > views/generations is not possible. If you are asking about applying
> > > > > delta changes to the external file, that's something we did
> > > > > ourselves: http://goo.gl/P8GFq . This feature is much more doubtful
> > > > > and vague, although it might be the next contribution after
> > > > > SOLR-4085.
> > > > >
> > > > > >
> > > > > > /Martin
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <mak@issuu.com> wrote:
> > > > > > >
> > > > > > > > Solr 4.0 does support using EFFs, but it might not give you
> > > > > > > > what you're hoping for.
> > > > > > > >
> > > > > > > > We tried using Solr Cloud, and have given up again.
> > > > > > > >
> > > > > > > > The EFF is placed in the parent of the index directory in
> > > > > > > > each core; each core reads the entire EFF and picks out the
> > > > > > > > IDs that it is responsible for.
> > > > > > > >
> > > > > > > > In the current 4.0.0 release of solr, solr blocks (doesn't
> > > > > > > > answer queries) while re-reading the EFF. Even worse, it seems
> > > > > > > > that the time to re-read the EFF is multiplied by the number
> > > > > > > > of cores in use (i.e. the EFF is re-read by each core
> > > > > > > > sequentially). The contents of the EFF become active after the
> > > > > > > > first EXTERNAL commit (commitWithin does NOT work here) after
> > > > > > > > the file has been updated.
> > > > > > > >
> > > > > > > > In our case, the EFF was quite large - around 450MB - and we
> > > > > > > > use 16 shards, so when we triggered an external commit to
> > > > > > > > force re-reading, the whole system would block for several
> > > > > > > > (10-15) minutes. This won't work in a production environment.
> > > > > > > > The reason for the size of the EFF is that we have around 7M
> > > > > > > > documents in the index; each document has a 45 character ID.
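The figures quoted above can be checked with some quick arithmetic. This is a rough sketch under stated assumptions: IDs are taken as one byte per character, "450MB" is read as decimal megabytes, and the per-core time assumes the 16 cores re-read the file one after another in equal time, as the observation above suggests. All names are illustrative.

```python
DOCS = 7_000_000          # "around 7M documents"
ID_CHARS = 45             # "45 character ID", assumed to be 1 byte per character
EFF_BYTES = 450 * 10**6   # "around 450MB", read here as decimal megabytes
CORES = 16                # one solr core per shard


def bytes_per_entry(eff_bytes: int = EFF_BYTES, docs: int = DOCS) -> float:
    """Average size of one key=value line in the external file."""
    return eff_bytes / docs


def per_core_seconds(total_minutes: float, cores: int = CORES) -> float:
    """Time per core if the cores re-read the file sequentially, equally fast."""
    return total_minutes * 60 / cores


if __name__ == "__main__":
    print("bytes of IDs alone:", DOCS * ID_CHARS)
    print("average bytes per line:", round(bytes_per_entry(), 1))
    print("seconds per core:", per_core_seconds(10), "to", per_core_seconds(15))
```

The IDs alone account for 315 MB, so the remaining roughly 19 bytes per line would be the separator, the value, and the newline.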
> > > > > > > >
> > > > > > > > We got some help to try to fix the problem so that the re-read
> > > > > > > > of the EFF proceeds in the background (see
> > > > > > > > here <https://issues.apache.org/jira/browse/SOLR-3985> for
> > > > > > > > a fix on the 4.1 branch). However, even though the re-read
> > > > > > > > proceeds in the background, launching solr now takes at least
> > > > > > > > as long as re-reading the EFFs. Again, this is not good enough
> > > > > > > > for our needs.
> > > > > > > >
> > > > > > > > The next issue is that you cannot sort on EFF fields (though
> > > > > > > > you can return them as values using &fl=field(my_eff_field)).
> > > > > > > > This is also fixed in the 4.1 branch here
> > > > > > > > <https://issues.apache.org/jira/browse/SOLR-4022>.
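As an illustration of the `fl=field(...)` form mentioned above, the following sketch URL-encodes such a request's parameters. The query `*:*`, the extra `id` field, and the `rows` value are assumptions for the example; only `field(my_eff_field)` comes from the message.

```python
from urllib.parse import urlencode


def eff_query_params(query: str, eff_field: str, rows: int = 10) -> str:
    """Encode select parameters that return an EFF value via fl=field(...)."""
    params = {
        "q": query,
        "fl": f"id,field({eff_field})",  # the id field is an assumed extra
        "rows": rows,
    }
    return urlencode(params)


if __name__ == "__main__":
    # Append the result to a select URL such as .../solr/select?
    print(eff_query_params("*:*", "my_eff_field"))
```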
> > > > > > > >
> > > > > > > > So: Even after these fixes, EFF performance is not that great.
> > > > > > > > Our solution is as follows: The actual value of the popularity
> > > > > > > > measure (say, reads) that we want to report to the user is
> > > > > > > > inserted into the search response post-query by our query
> > > > > > > > front-end. This value will then be the authoritative value at
> > > > > > > > the time of the query. The value of the popularity measure
> > > > > > > > that we use for boosting in the ranking of the search results
> > > > > > > > is only updated when the value has changed enough so that the
> > > > > > > > impact on the boost will be significant (say, more than 2%).
> > > > > > > > This does require frequent re-indexing of the documents that
> > > > > > > > have significant changes in the number of reads, but at least
> > > > > > > > we won't have to update a document if it moves from, say,
> > > > > > > > 1000000 to 1000001 reads.
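The re-indexing rule described above can be sketched as a simple threshold check. As a simplifying assumption, the 2% is applied here to the relative change in the read count itself rather than to the resulting boost; the function and parameter names are hypothetical.

```python
def needs_reindex(indexed_reads: int, current_reads: int, threshold: float = 0.02) -> bool:
    """True when the read count has drifted enough to justify re-indexing."""
    if indexed_reads == 0:
        return current_reads > 0  # any first read is a significant change
    change = abs(current_reads - indexed_reads) / indexed_reads
    return change > threshold


if __name__ == "__main__":
    print(needs_reindex(1000000, 1000001))  # tiny relative change
    print(needs_reindex(1000, 1030))        # 3% change
```

The value shown to the user still comes from the post-query lookup, so skipping a re-index only delays the effect on ranking.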
> > > > > > > >
> > > > > > > > /Martin Koch - ISSUU - senior systems architect.
> > > > > > > >
> > > > > > > > On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simoneg@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > > I'm planning to move a quite big Solr index to SolrCloud.
> > > > > > > > > However, in this index, an external file field is used for
> > > > > > > > > popularity ranking.
> > > > > > > > >
> > > > > > > > > Does SolrCloud support external file fields? How does it
> > > > > > > > > cope with sharding and replication? Where should the
> > > > > > > > > external file be placed now that the index folder is not
> > > > > > > > > local but in the cloud?
> > > > > > > > >
> > > > > > > > > Are there otherwise other best practices to deal with the
> > > > > > > > > use cases external file fields were used for, like
> > > > > > > > > popularity/ranking, in SolrCloud? Custom ValueSources going
> > > > > > > > > to something external?
> > > > > > > > >
> > > > > > > > > Thanks in advance,
> > > > > > > > > Simone
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Sincerely yours
> > > > > > > Mikhail Khludnev
> > > > > > > Principal Engineer,
> > > > > > > Grid Dynamics
> > > > > > >
> > > > > > > <http://www.griddynamics.com>
> > > > > > >  <mkhludnev@griddynamics.com>
> > > > > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > <http://www.griddynamics.com>
> > >  <mkhludnev@griddynamics.com>
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mkhludnev@griddynamics.com>
>
