spark-user mailing list archives

From Sam Flint <sam.fl...@magnetic.com>
Subject Spark response times for queries seem slow
Date Mon, 05 Jan 2015 23:20:14 GMT
I am running a PySpark job over 4 GB of data that is split into 17 Parquet
files on an HDFS cluster. This is all in Cloudera Manager.

Here is the query the job is running:

parquetFile.registerTempTable("parquetFileone")

results = sqlContext.sql(
    "SELECT sum(total_impressions), sum(total_clicks) "
    "FROM parquetFileone GROUP BY hour")


I also ran it this way:

mapped = parquetFile.map(
    lambda row: (str(row.hour), (row.total_impressions, row.total_clicks)))
counts = mapped.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
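
To make the intent explicit, here is a minimal sketch in plain Python (no
Spark) of the per-hour sums that both versions are meant to compute. The Row
type, the sum_by_hour helper, and the sample values are made up for
illustration; only the column names come from the job above.

```python
from collections import namedtuple

# Hypothetical row type; only the field names come from the job above.
Row = namedtuple("Row", ["hour", "total_impressions", "total_clicks"])


def sum_by_hour(rows):
    """Sum impressions and clicks per hour, mirroring the reduceByKey version."""
    totals = {}
    for row in rows:
        key = str(row.hour)  # same string key as in the map step
        impressions, clicks = totals.get(key, (0, 0))
        totals[key] = (impressions + row.total_impressions,
                       clicks + row.total_clicks)
    return totals


# Made-up sample rows, for illustration only.
sample = [Row(0, 100, 3), Row(0, 50, 1), Row(1, 70, 2)]
print(sum_by_hour(sample))
```

The work per row is a single dictionary update, which is why 8 - 10 minutes
for 4 GB seems high to me.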


Both runs took anywhere from 8 to 10 minutes.

I am wondering if there is a configuration that needs to be tweaked or if
this is the expected response time.

The machines have 30 GB of RAM and 4 cores. It seems the CPUs are just getting
pegged, and that is what is taking so long.

 Any help on this would be amazing.

Thanks,


-- 

*MAGNE**+**I**C*

*Sam Flint* | *Lead Developer, Data Analytics*
