cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Petrov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12915) SASI: Index intersection can be very inefficient
Date Mon, 12 Dec 2016 14:35:59 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742064#comment-15742064
] 

Alex Petrov commented on CASSANDRA-12915:
-----------------------------------------

This was something I have initially suspected, this was also why I've asked whether the problem
is related to {{LIKE}} [previously|https://issues.apache.org/jira/browse/CASSANDRA-12915?focusedCommentId=15725035&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15725035].

Certainly, {{LIKE}} queries are much less performant, although I'm afraid my benchmarking
has shown bit smaller differences (although still obviously slower).

In order to do the "simple" query, SASI simply grabs the so called term from the index. Term
search is relatively inexpensive, and
locating a single term is cheap. Now, when we're moving to the {{LIKE}} query, instead of
fetching a single term, we're fetching 
_many terms_ using {{TermIterator}}. For each of this terms, {{TokenTree}} is going to be
composed. Unfortunately, on that
level not that many optimisations are done. Most of them are performed on the {{TokenTree}}
level. 

So of course the slowness of the fact that we're doing a whole lot of work is showing up when
one index result is disproportionally 
larger than the other one, but you can see somewhat similar effects even when the token trees
are of comparable sizes (although larger
than just a couple of items).

A couple of things to try / play with in order to improve your query performance: 

   * Take a closer look at tokenisers, might be you'll be better off by tokenising your data
differently and performing EQ queries instead
of {{LIKE}} term ranges
   * Looks like one of the items in your index is disproportionally larger than the other
(possibly even binary, or of a very low cardinality).
It might be possible to just do drop index and do filtering. If you really have to query the
second column, use "native" 2i

To summarise: I agree there's an inefficiency that's shown by the query type you're proposing.
I think we should solve this inefficiency
by changing the way {{searchRange}} / {{searchPoint}} is working in the {{OnDiskIndex}}, but
this may be a very big change, plus we have 
to take into account how that plays with the fact that the index is SSTable attached. 

Having that said, whenever we have a proper query planner at hand on the top-level, we should
definitely include rule-based optimisation
process, and this might become one of the rules. To fully solve it, we'll also have to have
proper cardinality estimation (maybe with 
intersections etc) in place. 

> SASI: Index intersection can be very inefficient
> ------------------------------------------------
>
>                 Key: CASSANDRA-12915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: sasi
>            Reporter: Corentin Chary
>             Fix For: 3.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in some cases.
Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the intersection (and
effectively only use 'index1') the query will run in a few tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection when we know
it will be slow. Probably when the range size factor is very small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message