cassandra-commits mailing list archives

From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes
Date Mon, 01 Dec 2014 18:23:13 GMT


Benedict commented on CASSANDRA-6976:

bq. I recall someone on the Mechanical Sympathy group pointing out that you can warm an entire
last level cache in some small amount of time, I think it was 30ish milliseconds. I can't
find the post and I could be very wrong, but it was definitely milliseconds. My guess is that
in the big picture cache effects aren't changing the narrative that this takes 10s to 100s
of milliseconds.

Sure it does - if an action that is likely memory bound (like this one - after all, it does
very little computation and touches no disk) takes time X with a warmed cache, and only
touches data that fits in cache, it will take X*K with a cold cache for some K (significantly)
> 1. In real operation, especially with many tokens, a cold cache is quite likely given
the lack of locality and the amount of data as the cluster grows. This is actually one avenue
for improving this behaviour, if we cared to: keep the number of cache lines touched low by
working with primitives for the token ranges and inet addresses, reducing the constant factors.
This would also improve the normal code paths, not just range slices.
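To make the "work with primitives" idea concrete, here is a minimal sketch (illustrative names only - `TokenRing` and `ownerOf` are not Cassandra's actual API): a sorted long[] keeps the whole ring lookup within a handful of cache lines, where an array of boxed Token objects would scatter one pointer dereference per element.

```java
import java.util.Arrays;

// Hypothetical sketch: a flat, sorted long[] ring instead of boxed Token
// objects. Eight longs fit in one 64-byte cache line, so a binary search
// over even a large ring touches very few lines.
final class TokenRing {
    private final long[] tokens;   // sorted ring tokens
    private final int[] owners;    // owners[i] owns the range (tokens[i-1], tokens[i]]

    TokenRing(long[] tokens, int[] owners) {
        this.tokens = tokens;
        this.owners = owners;
    }

    /** Returns the index of the node owning the given key's token. */
    int ownerOf(long keyToken) {
        int idx = Arrays.binarySearch(tokens, keyToken);
        if (idx < 0)
            idx = -idx - 1;        // insertion point: first token >= keyToken
        if (idx == tokens.length)
            idx = 0;               // wrap around the ring
        return owners[idx];
    }
}

public class RingDemo {
    public static void main(String[] args) {
        TokenRing ring = new TokenRing(new long[]{-100, 0, 100}, new int[]{0, 1, 2});
        System.out.println(ring.ownerOf(-50));  // falls in (-100, 0] -> node 1
        System.out.println(ring.ownerOf(500));  // wraps past the last token -> node 0
    }
}
```

The same shape (parallel primitive arrays, index-based ownership) would apply to inet addresses as well, avoiding the per-object overhead entirely.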

bq. If it is slow, what is the solution? Even if we lazily materialize the ranges the run
time of fetching batches of results dominates the in-memory compute of getRestrictedRanges.
When we talked use cases it seems like people would be using paging programmatically so only
console users would see this poor performance outside of the lookup table use case you mentioned.

For a lookup (i.e. small) table query, or a range query that can be serviced entirely by the
local node, it is quite unlikely that the fetching would dominate when talking about timescales
>= 1ms.

bq. I didn't quite follow this. Are you talking about getLiveSortedEndpoints called from getRangeSlice?
I haven't dug deep enough into getRangeSlice to tell you where the time in that goes exactly.
I would have to do it again and insert some probes. I assumed it was dominated by sending
remote requests.

Yes - for your benchmark it would not have spent much time here, since the sort would
be a no-op and the list a single entry; but as the number of data centres and the replication
factor grow, and with NetworkTopologyStrategy in use, this could be a significant time expenditure.
In aggregate it also accounts for a certain percentage of cpu time spent on all queries.
However, since the sort order is actually pretty stable, sorting only when the sort order
changes would be a way to eliminate this cost.
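The "sort only when the order changes" idea could look roughly like the following sketch (all names - `CachedEndpoints`, `ringVersion` - are illustrative, not Cassandra's actual classes): keep the sorted list alongside a version number, and re-sort only when the ring version moves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cache the proximity-sorted endpoint list and
// invalidate it only when the ring topology changes, so the per-query
// cost drops to a version comparison.
final class CachedEndpoints {
    private long cachedVersion = -1;
    private List<String> cachedSorted = List.of();

    /** Returns endpoints in sorted order, re-sorting only on a ring change. */
    synchronized List<String> liveSorted(long ringVersion, List<String> endpoints) {
        if (ringVersion != cachedVersion) {
            List<String> sorted = new ArrayList<>(endpoints);
            sorted.sort(String::compareTo);  // stand-in for a snitch's proximity order
            cachedSorted = sorted;
            cachedVersion = ringVersion;
        }
        return cachedSorted;
    }
}
```

The trade-off is the usual one for memoization: a stale read window bounded by how promptly the version is bumped when topology changes.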

bq. Benchmarking in what scope? This microbenchmark, defaults for workloads in cstar, tribal
knowledge when doing performance work?

Like I said, please do feel free to drop this particular line of enquiry for the moment, since
even with all of the above I doubt this is a pressing matter. But I don't think this is the
end of the topic entirely - at some point this cost will be a more measurable percentage
of the work done. These kinds of costs are simply not a part of any of our current benchmarking
methodology, since our default configs avoid the code paths entirely (a single DC, low RF,
low node count, no vnodes, and SimpleStrategy), and that is something we should address.

In the meantime, though, it might be worth having a simple short-circuit path for queries
that can be answered by the local node alone.
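A short-circuit of that kind might be sketched as follows (hypothetical names throughout - `RangeRouter`, `isLocalOnly`, and the hard-coded local address are illustrative assumptions, not Cassandra's routing code): check whether the local node replicates the queried range before doing any endpoint sorting or filtering at all.

```java
import java.util.Set;

// Hypothetical sketch of a local-node fast path: when the local node is a
// replica for the queried range, skip remote endpoint selection entirely.
final class RangeRouter {
    static final String LOCAL = "127.0.0.1";  // assumed local broadcast address

    /** True when the local node replicates the range, so (at CL.ONE) the
     *  query can be answered without any remote hop. */
    static boolean isLocalOnly(Set<String> replicasForRange) {
        return replicasForRange.contains(LOCAL);
    }

    static String route(Set<String> replicasForRange) {
        // Fast path: answer locally, bypassing sorting/filtering of endpoints.
        if (isLocalOnly(replicasForRange))
            return LOCAL;
        // Slow path placeholder: would sort remaining replicas by proximity.
        return replicasForRange.iterator().next();
    }
}
```

A real implementation would also have to gate this on the consistency level, since anything above ONE still needs the other replicas.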

> Determining replicas to query is very slow with large numbers of nodes or vnodes
> --------------------------------------------------------------------------------
>                 Key: CASSANDRA-6976
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benedict
>            Assignee: Ariel Weisberg
>              Labels: performance
>         Attachments:, jmh_output.txt, jmh_output_murmur3.txt,
> As described in CASSANDRA-6906, this can be ~100ms for a relatively small cluster with
vnodes, which is longer than it will spend in transit on the network. This should be much
faster.
This message was sent by Atlassian JIRA
