lucene-solr-user mailing list archives

From Mikhail Khludnev <mkhlud...@griddynamics.com>
Subject Re: SolrCloud and exernal file fields
Date Tue, 20 Nov 2012 11:28:27 GMT
Martin,

Thank you for sharing your own "war story". It's really useful for the
community.
My first question may seem naive, but could you tell me what blocks
searching during the EFF reload, whether it's triggered by a handler or
by a listener?
I don't quite follow the sentence about sequential commits and the number
of cores. Do I understand correctly that the file is replicated via
ZooKeeper? Doesn't that cause a scalability problem or a long reload time?
Would it help to have, let's say, an ExternalDatabaseField that pulls
values via JDBC? I.e., why can't all cores read these values
simultaneously?
Can you confirm that the IDs in the file are ordered by index term order?
AFAIK it can impact load time.
Regarding your post-query solution: if a query finds 10000 docs but I need
to display only the first page of 100 rows, do I need to pull all 10K
results to the front end to order them by rank?

I'd really appreciate it if you could comment on the questions above.
PS: Time for a pitch: how much could
https://issues.apache.org/jira/browse/SOLR-4085 "Commit-free
ExternalFileField" help you?
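
To make the key-ordering question concrete, here is a minimal Python sketch
of writing an external file with its keys in sorted order. It assumes the
usual EFF layout of one key=value pair per line; the file name
external_popularity and the function write_external_file are hypothetical,
and plain string sorting is assumed to match the index term order of a
string ID field.

```python
import os
import tempfile


def write_external_file(path, values):
    """Write key=value lines sorted by key, then swap the file into place.

    `values` maps document IDs to numeric values. Sorting by key is an
    assumption about what helps load time; it is not required by the format.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename
    # never exposes a half-written file to a reader.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".eff-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as out:
            for key in sorted(values):
                out.write(f"{key}={values[key]}\n")
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Hypothetical file name and values, written to a scratch directory.
    target = os.path.join(tempfile.mkdtemp(), "external_popularity")
    write_external_file(target, {"doc-b": 2.5, "doc-a": 10, "doc-c": 0})
    print(open(target, encoding="utf-8").read())
```

The new values would still only become visible after the reload discussed
in this thread; the sketch covers file generation only.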



On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch <mak@issuu.com> wrote:

> Solr 4.0 does support using EFFs, but it might not give you what you're
> hoping for.
>
> We tried using Solr Cloud, and have given up again.
>
> The EFF is placed in the parent of the index directory in each core; each
> core reads the entire EFF and picks out the IDs that it is responsible for.
>
> In the current 4.0.0 release of Solr, Solr blocks (doesn't answer queries)
> while re-reading the EFF. Even worse, it seems that the time to re-read the
> EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by
> each core sequentially). The contents of the EFF become active after the
> first EXTERNAL commit (commitWithin does NOT work here) after the file has
> been updated.
>
> In our case, the EFF was quite large - around 450MB - and we use 16 shards,
> so when we triggered an external commit to force re-reading, the whole
> system would block for several (10-15) minutes. This won't work in a
> production environment. The reason for the size of the EFF is that we have
> around 7M documents in the index; each document has a 45 character ID.
>
> We got some help to try to fix the problem so that the re-read of the EFF
> proceeds in the background (see
> here<https://issues.apache.org/jira/browse/SOLR-3985> for
> a fix on the 4.1 branch). However, even though the re-read proceeds in the
> background, launching Solr now takes at least as long as re-reading the
> EFFs. Again, this is not good enough for our needs.
>
> The next issue is that you cannot sort on EFF fields (though you can return
> them as values using &fl=field(my_eff_field)). This is also fixed in the 4.1
> branch here <https://issues.apache.org/jira/browse/SOLR-4022>.
>
> So: Even after these fixes, EFF performance is not that great. Our solution
> is as follows: The actual value of the popularity measure (say, reads) that
> we want to report to the user is inserted into the search response
> post-query by our query front-end. This value will then be the
> authoritative value at the time of the query. The value of the popularity
> measure that we use for boosting in the ranking of the search results is
> only updated when the value has changed enough so that the impact on the
> boost will be significant (say, more than 2%). This does require frequent
> re-indexing of the documents that have significant changes in the number of
> reads, but at least we won't have to update a document if it moves from,
> say, 1000000 to 1000001 reads.
>
> /Martin Koch - ISSUU - senior systems architect.
>
> On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni <simoneg@apache.org> wrote:
>
> > Hi all,
> > I'm planning to move a fairly big Solr index to SolrCloud. However, in
> > this index, an external file field is used for popularity ranking.
> >
> > Does SolrCloud support external file fields? How does it cope with
> > sharding and replication? Where should the external file be placed now
> > that the index folder is not local but in the cloud?
> >
> > Are there any other best practices for the use cases that external file
> > fields were used for, like popularity/ranking, in SolrCloud? Custom
> > ValueSources going to something external?
> >
> > Thanks in advance,
> > Simone
> >
>
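
The update rule in Martin's message quoted above (re-index only when the
boost changes by more than about 2%, and attach the authoritative count to
the response post-query) can be sketched roughly as follows. This is a
minimal illustration, not his implementation: the thread does not say how
reads map to a boost, so the log-based boost() is an assumption, and the
names needs_reindex and attach_reads are hypothetical.

```python
import math


def boost(reads):
    # Assumed boost function; the thread does not specify the real one.
    return math.log10(1 + reads)


def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """True if the boost moved by more than `threshold` (relative change)."""
    old = boost(indexed_reads)
    new = boost(current_reads)
    if old == 0:
        # No baseline to compare against: any non-zero boost is a change.
        return new != 0
    return abs(new - old) / old > threshold


def attach_reads(page_docs, live_reads):
    """Return copies of one page of results with the live read count set.

    Only the documents on the displayed page are touched; ranking is assumed
    to have been done by Solr using the (slightly stale) indexed boost.
    """
    return [
        dict(doc, reads=live_reads.get(doc["id"], doc.get("reads", 0)))
        for doc in page_docs
    ]


if __name__ == "__main__":
    print(needs_reindex(1000000, 1000001))  # tiny change in the boost
    print(needs_reindex(1000, 2000))        # large change in the boost
    print(attach_reads([{"id": "a", "reads": 5}], {"a": 7}))
```

Under these assumptions a move from 1000000 to 1000001 reads does not
trigger a re-index, while a doubling from 1000 to 2000 does.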



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhludnev@griddynamics.com>
