spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?
Date Thu, 24 Nov 2016 21:44:26 GMT
I am not sure what use case you want to demonstrate with a select count in general. Maybe you
can elaborate more on what your use case is.

Aside from this, it is a Cassandra issue. What is the setup of Cassandra? Dedicated nodes?
How many? Replication strategy? Consistency configuration? How is the data spread across the nodes?
Cassandra is more for use cases where you have a lot of data but select only a subset of
it, or where you have a lot of single writes.

If you want to analyze it, you have to export it once to Parquet, ORC, etc. and then run queries
on it. Depending on your use case you may want to go with Hive 2 + Tez + LLAP or Spark for that.
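A minimal sketch of such an export in spark-shell, assuming the connector's DataFrame source and reusing the keyspace/table names from the thread below; the HDFS output path is made up:

    // Read the table once through the spark-cassandra-connector's
    // DataFrame source and write it out as Parquet.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "cuneiform", "table" -> "blocks"))
      .load()

    df.write.parquet("hdfs:///exports/cuneiform_blocks")  // hypothetical path

    // Subsequent analytics hit the columnar copy, not Cassandra.
    spark.read.parquet("hdfs:///exports/cuneiform_blocks").count()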

> On 24 Nov 2016, at 20:52, kant kodali <kanth909@gmail.com> wrote:
> 
> Some accurate numbers here: it took me 1 hr 30 mins to count 698,705,723 rows (~700 million)
> 
> and my code is just this:
> 
> sc.cassandraTable("cuneiform", "blocks").cassandraCount
> 
> 
> 
>> On Thu, Nov 24, 2016 at 10:48 AM, kant kodali <kanth909@gmail.com> wrote:
>> Take a look at this https://github.com/brianmhess/cassandra-count
>> 
>> Now it is just a matter of incorporating it into spark-cassandra-connector, I guess.
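>> For what it's worth, cassandraCount already pushes the count down to Cassandra (a server-side count per Spark partition), and the tool above takes the same idea further: split the token ring, run COUNT(*) per range, and sum the partial counts. A rough single-threaded sketch of that idea with the Java driver 3.x (the partition key column "id" is a made-up name, and the real tool runs the ranges in parallel):
>> 
>>     import com.datastax.driver.core.Cluster
>>     import scala.collection.JavaConverters._
>> 
>>     val cluster = Cluster.builder().addContactPoint("170.99.99.134").build()
>>     val session = cluster.connect()
>> 
>>     // One COUNT(*) per token range keeps each query a bounded scan.
>>     val total = cluster.getMetadata.getTokenRanges.asScala.toSeq
>>       .flatMap(_.unwrap().asScala)  // split ranges that wrap around the ring
>>       .map { r =>
>>         session.execute(
>>           s"SELECT COUNT(*) FROM cuneiform.blocks " +
>>           s"WHERE token(id) > ${r.getStart} AND token(id) <= ${r.getEnd}")
>>           .one().getLong(0)
>>       }.sum
>> 
>>     println(s"total rows: $total")
>>     cluster.close()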
>> 
>>> On Thu, Nov 24, 2016 at 1:01 AM, kant kodali <kanth909@gmail.com> wrote:
>>> According to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
>>> 
>>> I tried the following but it still looks like it is taking forever
>>> 
>>> sc.cassandraTable(keyspace, table).cassandraCount
>>> 
>>>> On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth909@gmail.com> wrote:
>>>> I would be glad if SELECT COUNT(*) FROM hello could return any value for that size :) I can say for sure it didn't return anything for 30 mins, and I would probably need to build more patience to sit for a few more hours after that! Cassandra recommends using nodetool cfstats (ColumnFamilyStats), which gives a pretty good estimate but not an accurate value.
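>>>> If it helps, that estimate shows up as the "Number of keys (estimate)" line in the cfstats output, and it counts partitions rather than rows, e.g.:
>>>> 
>>>>     nodetool cfstats cuneiform.blocks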
>>>> 
>>>>> On Thu, Nov 24, 2016 at 12:48 AM, Anastasios Zouzias <zouzias@gmail.com> wrote:
>>>>> How fast is Cassandra without Spark on the count operation?
>>>>> 
>>>>> cqlsh> SELECT COUNT(*) FROM hello;
>>>>> 
>>>>> (this is not equivalent to what you are doing, but it might help you find the root cause)
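>>>>> (One caveat: a COUNT(*) over a billion rows will almost certainly blow past cqlsh's default request timeout, so you would probably need to raise it when starting the shell; the timeout value here is arbitrary:)
>>>>> 
>>>>>     cqlsh --request-timeout=3600 170.99.99.134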
>>>>> 
>>>>>> On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth909@gmail.com> wrote:
>>>>>> I have the following code
>>>>>> 
>>>>>> I invoke spark-shell as follows
>>>>>> 
>>>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864
>>>>>> 
>>>>>> code
>>>>>> 
>>>>>>     scala> val df = spark.sql("SELECT test from hello") // Billion rows in hello and test column is 1KB
>>>>>>     
>>>>>>     df: org.apache.spark.sql.DataFrame = [test: binary]
>>>>>>     
>>>>>>     scala> df.count
>>>>>>     
>>>>>>     [Stage 0:>   (0 + 2) / 13] // I don't know what these numbers mean precisely.
>>>>>> 
>>>>>> If I invoke spark-shell as follows
>>>>>> 
>>>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>>>> 
>>>>>> code
>>>>>> 
>>>>>> 
>>>>>>     val df = spark.sql("SELECT test from hello") // This has about a billion rows
>>>>>>     
>>>>>>     scala> df.count
>>>>>>     
>>>>>>     
>>>>>>     [Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely?
>>>>>> 
>>>>>> 
>>>>>> Neither of these versions worked; Spark keeps running forever, and I have been waiting for more than 15 mins with no response. Any ideas on what could be wrong and how to fix it?
>>>>>> 
>>>>>> I am using Spark 2.0.2
>>>>>> and spark-cassandra-connector_2.11-2.0.0-M3.jar
>>>>>> 
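>>>>>> As a footnote on the two progress lines above: Spark's console progress bar prints (completed tasks + active tasks) / total tasks for the stage, so (0 + 2) / 13 means 2 of 13 tasks are running and none have finished yet. Note also that spark.cassandra.input.split.size_in_mb is expressed in megabytes, so 67108864 asks for splits of roughly 64 TB each, which would explain why the first run planned only 13 tasks while the second run, without that option, planned 24686. A sketch of the first invocation with the split-size option simply dropped, falling back to the connector's default:
>>>>>> 
>>>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12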
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> -- Anastasios Zouzias
