spark-user mailing list archives

From Anastasios Zouzias <zouz...@gmail.com>
Subject Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?
Date Thu, 24 Nov 2016 08:48:10 GMT
How fast is Cassandra without Spark on the count operation?

cqlsh> SELECT COUNT(*) FROM hello;

(this is not equivalent to what you are doing, but it might help you find the
root cause)
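If the plain cqlsh count also stalls, the connector's own server-side count is another data point. A minimal sketch, assuming a spark-shell with the Cassandra connector on the classpath and a keyspace named `ks` (the keyspace name is not given in the thread, so it is a placeholder here):

```scala
import com.datastax.spark.connector._

// cassandraCount() pushes the counting down to Cassandra per token range,
// so no row data has to be shipped to the Spark executors.
val serverSideCount = sc.cassandraTable("ks", "hello").cassandraCount()

// For comparison, df.count reads every row through Spark first.
val df = spark.sql("SELECT test FROM hello")
val sparkSideCount = df.count
```

If `cassandraCount()` finishes quickly while `df.count` does not, the bottleneck is in how Spark reads the table rather than in Cassandra itself.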

On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth909@gmail.com> wrote:

> I have the following code
>
> I invoke spark-shell as follows
>
>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
> --executor-memory 15G --executor-cores 12 --conf
> spark.cassandra.input.split.size_in_mb=67108864
>
> code
>
>     scala> val df = spark.sql("SELECT test from hello") // Billion rows in
> hello and test column is 1KB
>
>     df: org.apache.spark.sql.DataFrame = [test: binary]
>
>     scala> df.count
>
>     [Stage 0:>   (0 + 2) / 13] // I don't know what these numbers mean
> precisely.
>
> If I invoke spark-shell as follows
>
>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>
> code
>
>
>     val df = spark.sql("SELECT test from hello") // This has about billion
> rows
>
>     scala> df.count
>
>
>     [Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely?
>
>
> Both of these versions didn't work: Spark keeps running forever, and I have
> been waiting for more than 15 minutes with no response. Any ideas on what
> could be wrong and how to fix this?
>
> I am using Spark 2.0.2
> and spark-cassandra-connector_2.11-2.0.0-M3.jar
>
>
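On the progress-bar numbers: `[Stage 0:> (0 + 2) / 13]` reads as (tasks completed + tasks running) / total tasks in the stage. That total is the partition count of the scan, which `spark.cassandra.input.split.size_in_mb` controls. The value is in megabytes, so `67108864` asks for ~64 TB splits and collapses a billion rows into only 13 partitions, while the default of 64 MB yields the 24686 tasks seen in the second run. A sketch of how to inspect this from the shell:

```scala
// Check how many partitions (= stage tasks) the scan will use.
val df = spark.sql("SELECT test FROM hello")
println(df.rdd.getNumPartitions)
// Reported ~13 with split.size_in_mb=67108864, ~24686 with the default 64 MB,
// matching the two progress bars in the quoted message.
```

With only 13 partitions over ~1 TB of data (a billion rows at ~1 KB each), each task scans far too much; with the default, the task count is healthier and the job should at least show steady progress.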


-- 
-- Anastasios Zouzias
<azo@zurich.ibm.com>
