spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Can this performance be improved?
Date Fri, 15 Apr 2016 06:27:07 GMT
You could use a different file format and the Dataset or DataFrame API instead of an RDD.
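For what it's worth, a minimal PySpark sketch of that approach might look like the following. This assumes Spark 2.x (where SparkSession and the built-in CSV reader exist), a hypothetical file path, the default column name "_c1" for the second column (matching x[1] in the quoted code below), and Parquet as just one example of a columnar format:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-age-groups").getOrCreate()

# Read the CSV as a DataFrame instead of a raw RDD; Catalyst can then
# prune everything except the one column the query needs, and the split
# and projection run in the JVM rather than in Python lambdas.
patients = spark.read.csv("patients.csv")  # hypothetical path

# _c1 is the default name of the second column, i.e. x[1] below.
age_groups = patients.select("_c1").distinct()

# Converting once to a columnar format such as Parquet makes repeated
# runs cheaper still, since only the selected column is read from disk.
patients.write.parquet("patients.parquet")  # hypothetical path
age_groups = spark.read.parquet("patients.parquet").select("_c1").distinct()

The main win is that the per-record Python lambda overhead disappears, and with a columnar source the other three columns are never scanned at all.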

> On 14 Apr 2016, at 23:21, Bibudh Lahiri <bibudhlahiri@gmail.com> wrote:
> 
> Hi,
>     As part of a larger program, I am extracting the distinct values of some columns
> of an RDD with 100 million records and 4 columns. I am running Spark in standalone
> cluster mode on my laptop (2.3 GHz Intel Core i7, 10 GB 1333 MHz DDR3 RAM) with all
> 8 cores given to a single worker. So my statement is something like this:
> 
> age_groups = patients_rdd.map(lambda x:x.split(",")).map(lambda x: x[1]).distinct()
> 
>    It is taking about 3.8 minutes. It is spawning 89 tasks when dealing with this
> RDD because (I guess) the block size is 32 MB and the entire file is 2.8 GB, so
> there are roughly 2.8*1024/32 ≈ 89 blocks. The ~4 minute time means it is
> processing about 50k records per second per core/task.
> 
>    Does this performance look typical, or is there room for improvement?
> 
> Thanks
>             Bibudh
> 
>    
> 
> -- 
> Bibudh Lahiri
> Data Scientist, Impetus Technologies
> 5300 Stevens Creek Blvd
> San Jose, CA 95129
> http://knowthynumbers.blogspot.com/
>  
