lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Solr faceting vs. Lucene faceting
Date Thu, 13 Dec 2012 11:21:01 GMT
As I said, if someone volunteers to do some work on the Solr side, I will
gladly participate in that effort.
I just don't even know where to start w/ Solr :).

One thing that would be really great is if we could build an adapter (I think
someone mentioned that word here) which supports basic facets capabilities,
so that we can at least benchmark Solr's current implementation vs. the
implementation w/ the module. I'm talking about something very basic, à la
the test Mike and I run on the module (counting 1-2 facets, simple hierarchy,
simple queries).

Then we can at least tell whether moving Solr to the module makes sense,
before we continue to develop all of Solr's current functionality on top of
the module.

Shai


On Thu, Dec 13, 2012 at 12:59 PM, Robert Muir <rcmuir@gmail.com> wrote:

> even as a step it would be nice to have lucene's faceting exposed to
> solr in a way that only works with a single node.
>
> because it supports NRT, doesn't need to build up massive top-level
> data structures and so on, many people that currently need multiple
> nodes might be able to work just fine with a single node.
>
> On Wed, Dec 12, 2012 at 2:28 AM, Shai Erera <serera@gmail.com> wrote:
> > There are two ways you can work with the taxonomy index in a distributed
> > environment (at least, these are the things that we've tried):
> > (1) replicate the taxonomy to all shards, so that each shard sees the
> > entire global taxonomy
> > (2) each shard maintains its own taxonomy.
> >
> > (1) only makes sense when the shards are built by a side process, e.g.
> > MapReduce, and then copied to their respective nodes.
> > If you index like that, then your distributed faceted search (correcting
> > the counts of categories) is done on ordinals rather than strings.
> >
> > (2) is the one that makes sense to most people, and is also NRT (where #1
> > isn't!). Each shard maintains its local search +
> > taxonomy indexes. In that mode, the counts correction cannot be done on
> > ordinals, and has to be done on strings.
> >
> > When you're doing distributed faceted search, you cannot just ask each
> > shard to return the top-10 categories for the "Author" dimension/facet,
> > because then you might (1) miss a category that should have been in the
> > total top-10 and (2) return a category in the top-10 with incorrect
> > counts. What you can do is ask for C*10, where C is an over-counting
> > factor. You'd still hope for the best though, b/c theoretically, you
> > could still miss a category / have an incorrect count for one.
> >
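The over-requesting idea above can be sketched in a few lines. This is only an illustration of merging on strings, not the facets module's code: the function names (`shard_top`, `merged_top_k`) and the sample author counts are made up, and each shard is modeled as a plain dict of label to count.

```python
from collections import Counter

def shard_top(counts, n):
    """Return the n most frequent (label, count) pairs of one shard."""
    return Counter(counts).most_common(n)

def merged_top_k(shards, k, c):
    """Ask every shard for its top c*k labels, sum the counts per label
    (merging on strings), and keep the global top k."""
    merged = Counter()
    for counts in shards:
        for label, count in shard_top(counts, c * k):
            merged[label] += count
    return merged.most_common(k)

# Hypothetical per-shard counts for the "Author" dimension.
shards = [
    {"Asimov": 10, "Clarke": 9, "Herbert": 8},
    {"Le Guin": 10, "Dick": 9, "Herbert": 8},
]

# With c=3 every label reaches the merger, so the result is exact.
exact = merged_top_k(shards, k=1, c=3)
# With c=1 each shard reports only its local winner, so "Herbert",
# the true global winner with 16 documents, is never seen.
truncated = merged_top_k(shards, k=1, c=1)
```

A larger `c` makes a miss less likely but sends more labels over the wire, which is the trade-off between the two taxonomy layouts.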
> > The difference between the two approaches is how big C can be. In the
> > first approach, since all you transmit on the wire are integers, and the
> > merge is done on integers, you can set C much higher than in the second
> > approach. In practice though, since more and more applications are
> > interested in real-time search, we keep a local taxonomy index per shard
> > and do the merge on the strings.
> >
> > Also, at really large scale, exact counts for categories may not be so
> > useful. How is Science Fiction (123,367,129) different than Drama
> > (145,465,987) !? To the user these are just categories that are
> > associated with more documents than they can digest anyway :).
> >
> > For that, we do sampling and display %, which is more consumable by
> > users, and then you don't need to worry about exact counts.
> >
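As a rough illustration of sampling and showing percentages instead of exact counts, here is a minimal sketch. It is not how the facets module samples: the function name `sampled_percentages`, the fixed sampling rate, and the assumption that every matching document carries exactly one category are all simplifications made for this example.

```python
import random
from collections import Counter

def sampled_percentages(doc_categories, sample_rate, seed=0):
    """Count categories on a random sample of the matching documents and
    report each category's share of the sample as a percentage."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    sample = [c for c in doc_categories if rng.random() < sample_rate]
    if not sample:
        return {}
    counts = Counter(sample)
    return {label: 100.0 * n / len(sample) for label, n in counts.items()}

# Hypothetical result set: 60% Drama, 40% Science Fiction.
docs = ["Drama"] * 6000 + ["Science Fiction"] * 4000
shares = sampled_percentages(docs, sample_rate=0.1)
```

Only about a tenth of the documents are visited, and the shares stay close to the true 60/40 split, which is usually all a user needs at this scale.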
> > I think that I wrote a bit too long an answer to your question :).
> >
> > Regarding not deleting categories, we've thought about it in the past and
> > I'm not sure it's a problem. I mean, in theory, yes, you could
> > end up w/ a taxonomy index that has many unused categories. But:
> >
> > * Whenever we were dealing with timestamp-based applications, at large
> >   scale, they always created shards per time (e.g. per day / hour)
> >   and when the taxonomy index is local to the shard, then it's gone
> >   completely when the shard is gone.
> >
> > * You can always re-map the ordinals to new ones by running a side
> >   process which checks which of the categories are unused, adds those
> >   that are in use to a new taxonomy index and rewrites the payload
> >   postings of the search index. It sounds expensive, but we've never
> >   had to do it yet, so I don't know how expensive.
> >
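The re-mapping side process described in the second bullet could look roughly like the sketch below. It uses plain Python lists rather than Lucene's taxonomy index or payloads, and it assumes a flat taxonomy (no parent categories to preserve); `compact_taxonomy` and the sample data are hypothetical.

```python
def compact_taxonomy(taxonomy, doc_ordinals):
    """Build a new taxonomy holding only the categories still referenced
    by some document, and rewrite every document's ordinals to match.

    taxonomy:     list where index = old ordinal, value = category label
    doc_ordinals: one list of old ordinals per live document
    """
    used = sorted({o for ords in doc_ordinals for o in ords})
    remap = {old: new for new, old in enumerate(used)}
    new_taxonomy = [taxonomy[old] for old in used]
    new_docs = [[remap[o] for o in ords] for ords in doc_ordinals]
    return new_taxonomy, new_docs

# Hypothetical taxonomy in which ordinals 1 and 2 are no longer used.
taxonomy = ["Author/A", "Date/2012-12-11", "Author/B", "Date/2012-12-12"]
docs = [[0, 3], [3]]
new_taxonomy, new_docs = compact_taxonomy(taxonomy, docs)
```

The cost is a full pass over the documents plus a rewrite of their ordinals, which is why it only pays off when a large share of the categories is dead.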
> > At the end of the day, the facets module lets you build the faceted
> > search that best suits your needs. It can work entirely off-disk, it can
> > be loaded in-memory (similar to FieldCache, Mike and I are working on
> > some improvements there - you're welcome to join!), it can support exact
> > counts or sampling, other aggregation methods than just counting and
> > many more.
> >
> > The sidecar taxonomy index is not as bad as it sounds. As I've told you,
> > many IBM products have been working with it for many years, at small
> > and large scale.
> >
> > I think that Solr could benefit from this module too, and I hope that I
> > don't sound too biased :).
> > Having Solr reusing Lucene modules is important IMO.
> >
> > Shai
> >
> >
> > On Wed, Dec 12, 2012 at 1:12 AM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
> >>
> >> Hi Shai,
> >>
> >> thanks for your blog, I am looking forward to your future posts!
> >>
> >> Just two questions: you mentioned that you have been running this in
> >> production in distributed mode. If I understand it correctly the idea is
> >> there is only a single taxonomy index even if the distributed mode means
> >> that the data indices were partitioned/sharded. (Thus the ordinals are
> >> global). The taxonomy index is not partitioned/sharded itself. Am I
> >> correct?
> >>
> >> Also what seems to be an interesting implication of this implementation
> >> is the fact that the taxonomy index never cares about deleted documents
> >> (categories that are obsolete). In practice this is probably not a big
> >> deal because the taxonomy index is small but I can imagine this might be
> >> problematic in some situations (for example imagine that the categories
> >> would be based on a highly granular timestamp, that could create a lot
> >> of categories over a short period of time and those would be kept
> >> "forever" and still growing...).
> >> (^^ I am just trying to understand how it works.)
> >>
> >> Regards,
> >> Lukas
> >
> >
>
