That sounds slow to me.

It looks like your SQL query is grouping by a column that isn't in the select list; I'm a little surprised that even works.  But you're getting the same time when reducing manually?

Have you looked at the shuffle amounts in the UI for the job?  Are you certain there aren't a disproportionate number of rows with the same hour (e.g. null hour)?
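
One quick way to check for that kind of skew is to look at the per-hour row counts and see what share of rows the largest key holds. Below is a minimal sketch in plain Python; the helper name key_skew and the sample data are made up for illustration, and the commented countByValue() line is only an assumption about how you might pull the real counts from your RDD.

```python
from collections import Counter


def key_skew(counts):
    """Return (key, share) for the most frequent key in a {key: row_count} mapping."""
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    # Compare on the count only, so mixed key types (e.g. None and ints) are fine.
    key, n = max(counts.items(), key=lambda kv: kv[1])
    return key, float(n) / total


# Hypothetical stand-in data. On the cluster, something like
#   counts = parquetFile.map(lambda row: row.hour).countByValue()
# should give the real per-hour row counts (assumption, not tested here).
sample_hours = [None] * 6 + [0, 1, 2, 3]
counts = Counter(sample_hours)

top_key, share = key_skew(counts)
print("most common hour: %r (%.0f%% of rows)" % (top_key, share * 100))
```

If one key (say a null hour) holds a large share of the rows, a single reduce task ends up doing most of the work, which could explain a long tail on the job.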

On Mon, Jan 5, 2015 at 5:20 PM, Sam Flint <sam.flint@magnetic.com> wrote:
I am running a pyspark job over 4 GB of data that is split into 17 parquet files on an HDFS cluster.  This is all managed through Cloudera Manager.

Here is the query the job is running:

parquetFile.registerTempTable("parquetFileone")

results = sqlContext.sql("SELECT sum(total_impressions), sum(total_clicks) FROM parquetFileone group by hour")


I also ran it this way:
mapped = parquetFile.map(lambda row: (str(row.hour), (row.total_impressions, row.total_clicks)))
counts = mapped.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))


My results were anywhere from 8 to 10 minutes.

I am wondering if there is a configuration that needs to be tweaked or if this is the expected response time.

The machines have 30 GB of RAM and 4 cores. It seems the CPUs are just getting pegged and that is what is taking so long.

Any help on this would be amazing.

Thanks,


--

MAGNE+IC

Sam Flint Lead Developer, Data Analytics