cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Petrov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12915) SASI: Index intersection can be very inefficient
Date Mon, 12 Dec 2016 17:37:58 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742552#comment-15742552
] 

Alex Petrov commented on CASSANDRA-12915:
-----------------------------------------

bq. But really it would be better if the query optimizer was in the database rather than in
our code.

I'm sorry if it sounded like I'm opposed to the change. I'm just saying that the bottleneck
is in the way Terms are iterated, so we may want to address this one first. Also, I'm sure
this way does not bring in any regressions, but since if 
we improve Term iteration, we may bring the performance to the levels comparable with what
happens in the {{TokenTree}}, which would mean that we don't have to ignore one of the indexes.
In my opinion, join of two indexes 
is a manifestation and not a source of the problem, so I'd rather fix the cause.

bq. Do you see a quick way of fixing the 'empty iterator' issue ? 

Sure, I have a patch in my branch, which pretty much literally returns an empty iterator (one
that'd return {{false}} for {{hasNext}} and {{endOfData}} for {{next}}. 

bq. Would you agree to chance the format to include proper cardinality estimation to the index
(so we can build upon later)

It's not my decision, you can post the proposal to the Mailing List and everyone who's involved
with SASI can contribute to the discussion and provide insightful feedback. At this point
I'm not sure about 
a) how exactly to do that, since technically we have counts on token tree level, but it seems
that with LIKE the problem is one level higher, on TERM level, so we have to know how many
items would match a non-EQ query. I'm not aware
of any way to do this without a linear scan, tree or trie (e.g. estimate). My initial comment
was related to EQ. 
b) if we really need to follow this path or understand how to optimise {{LIKE}} prefix queries
instead. But this may be a very significant chunk of work. 

bq. What would be the ETA for a proper query planner ? If O(months), would it make sense to
merge something like what I did to bring some performance improvement in the meantime ?

Unfortunately I do not have any estimates on that. In my personal opinion, Query Planner in
SASI is secondary to the improvements to RangeIterators, #11990 (and all opportunities that
it opens before us), possible optimisations for how LIKE queries are iterated
are going to both solve this particular problem and give much better results in "average case".


Hope that helps. 

> SASI: Index intersection can be very inefficient
> ------------------------------------------------
>
>                 Key: CASSANDRA-12915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: sasi
>            Reporter: Corentin Chary
>             Fix For: 3.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in some cases.
Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the intersection (and
effectively only use 'index1') the query will run in a few tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection when we know
it will be slow. Probably when the range size factor is very small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message