spark-user mailing list archives

From Benjamin Ross
Subject RE: Evaluating spark + Cassandra for our use cases
Date Tue, 18 Aug 2015 20:34:39 GMT
Hi Jorn,
Of course we're planning on doing a proof of concept here - the difficulty is that our timeline
is short, so we cannot afford too many PoCs before we have to make a decision.  We also need
to figure out *which* databases to evaluate in a proof of concept.

Note that one tricky aspect of our problem is that we need to support window functions partitioned
on a per account basis.  I've found that support for window functions is very limited in most
databases, and they're also generally slow when available.
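
To make the window-function requirement concrete, here is a minimal sketch in plain Python of a
per-account trailing window.  It assumes rows of (account, month, amount) with months written
as "YYYY-MM" strings; the function name and the sample data are hypothetical.  It mirrors a
row-based window (SUM ... OVER (PARTITION BY account ORDER BY month ROWS BETWEEN 2 PRECEDING
AND CURRENT ROW)), so a missing month is not treated as a zero.

```python
from collections import defaultdict, deque


def trailing_sum_per_account(rows, window=3):
    """Return (account, month, trailing_sum) for each input row.

    The sum covers the current row and up to window - 1 preceding rows
    within the same account, ordered by month.
    """
    # PARTITION BY account
    partitions = defaultdict(list)
    for account, month, amount in rows:
        partitions[account].append((month, amount))

    result = []
    for account in sorted(partitions):
        recent = deque(maxlen=window)  # holds only the last `window` amounts
        # ORDER BY month ("YYYY-MM" strings sort chronologically)
        for month, amount in sorted(partitions[account]):
            recent.append(amount)
            result.append((account, month, sum(recent)))
    return result


# Hypothetical monthly rollup rows: (account, month, amount)
rows = [
    ("a1", "2015-05", 100),
    ("a1", "2015-06", 150),
    ("a1", "2015-07", 80),
    ("a1", "2015-08", 40),
    ("a2", "2015-07", 500),
]
windowed = trailing_sum_per_account(rows)
```

Each account is processed independently, which is the property that has to hold for the
partitions to be spread across workers.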

Also, 1 customer certainly does not have 100M transactions per month.  There are 100M transactions
total for a given customer when we roll everything up to be per-month.  We do not care about
granularity smaller than a month.  There are also many columns that we care about - on the
order of many thousands.

What makes you suggest that we do not need in-memory technology?


From: Jörn Franke
Sent: Tuesday, August 18, 2015 4:14 PM
To: Benjamin Ross
Cc: Ron Gonzalez
Subject: Re: Evaluating spark + Cassandra for our use cases


First you need to make your SLA clear. It does not sound to me like they are defined very well,
or that your solution is necessary for the scenario. I also find it hard to believe that 1
customer has 100 million transactions per month.

Time series data is easy to precalculate - you do not necessarily need in-memory technology.
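
As a rough illustration of what precalculation could look like, the sketch below rolls raw
transactions up into per-account, per-product monthly totals in plain Python.  The record
layout (account, product, date, amount), the function name, and the sample data are
assumptions made for the example, not details from this thread.

```python
from collections import defaultdict
from datetime import date


def rollup_monthly(transactions):
    """Aggregate (account, product, day, amount) records into monthly totals.

    Returns a dict keyed by (account, product, "YYYY-MM").
    """
    totals = defaultdict(int)
    for account, product, day, amount in transactions:
        totals[(account, product, day.strftime("%Y-%m"))] += amount
    return dict(totals)


# Hypothetical raw transactions, amounts in whole dollars
transactions = [
    ("a1", "X", date(2015, 6, 3), 120),
    ("a1", "X", date(2015, 6, 20), 90),
    ("a1", "X", date(2015, 7, 1), 100),
    ("a2", "Y", date(2015, 8, 5), 30),
]
monthly = rollup_monthly(transactions)
```

Queries then read the much smaller monthly table instead of scanning raw transactions.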

I recommend that your company do a proof of concept and get more details/clarification on the
requirements before risking millions of dollars of investment.

On Tue, 18 Aug 2015 at 21:18, Benjamin Ross wrote:
My company is interested in building a real-time time-series querying solution using Spark
and Cassandra.  Specifically, we’re interested in setting up a Spark system against Cassandra
running a hive thrift server.  We need to be able to perform real-time queries on time-series
data – things like, how many accounts have spent in total more than $300 on product X in
the past 3 months, and purchased product Y in the past month.
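
The example query can be sketched in plain Python against monthly rollups.  This is only an
illustration of the query shape, assuming totals keyed by (account, product, "YYYY-MM") and
assuming the caller supplies which months count as "the past 3 months" and "the past month";
the function name and sample data are hypothetical.

```python
from collections import defaultdict


def matching_accounts(monthly, last_three_months, last_month, threshold=300):
    """Accounts that spent more than `threshold` in total on product X during
    `last_three_months` and also purchased product Y in `last_month`."""
    spent_on_x = defaultdict(int)
    bought_y = set()
    for (account, product, month), amount in monthly.items():
        if product == "X" and month in last_three_months:
            spent_on_x[account] += amount
        if product == "Y" and month == last_month and amount > 0:
            bought_y.add(account)
    return {
        account
        for account, total in spent_on_x.items()
        if total > threshold and account in bought_y
    }


# Hypothetical monthly totals keyed by (account, product, month)
monthly = {
    ("a1", "X", "2015-06"): 210,
    ("a1", "X", "2015-07"): 100,
    ("a1", "Y", "2015-08"): 30,
    ("a2", "X", "2015-08"): 500,  # spends enough on X but never buys Y
    ("a3", "X", "2015-07"): 50,   # buys Y but spends too little on X
    ("a3", "Y", "2015-08"): 20,
}
accounts = matching_accounts(
    monthly,
    last_three_months={"2015-06", "2015-07", "2015-08"},
    last_month="2015-08",
)
account_count = len(accounts)
```

The answer to "how many accounts" is the size of the returned set.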

These queries need to be fast – preferably sub-second but we can deal with a few seconds
if absolutely necessary.  The data sizes are in the millions of records when rolled up to
monthly records: something on the order of 100M per customer.

My question is, based on experience, how hard would it be to get Cassandra and Spark working
together to give us sub-second response times in this use case?  Note that we’ll need to
use DataStax enterprise (which is unappealing from a cost standpoint) because it’s the only
thing that provides the hive spark thrift server to Cassandra.

The two top contenders for our solution are Spark+Cassandra and Druid.

Neither of these solutions works perfectly out of the box:

- Druid would need to be modified, possibly hacked, to support the queries we require.
  I’m also not clear how operationally ready it is.

- Cassandra and Spark would require paying money for DataStax Enterprise.  It really
  feels like it’s going to be tricky to configure Cassandra and Spark to be lightning fast
  for our use case.  Finally, window functions (which we need – see above) are not supported
  unless we use a pre-release milestone of the DataStax Spark Cassandra Connector.

I was wondering if anyone had any thoughts.  How easy is it to get Spark and Cassandra down
to sub-second speeds in our use case?

