spark-user mailing list archives

From Manoj Samel <manojsamelt...@gmail.com>
Subject Spark 1.4 - memory bloat in group by/aggregate???
Date Fri, 26 Jun 2015 21:13:44 GMT
Hi,


   - Spark 1.4 on a single-node machine. Run spark-shell.
   - Reading from a Parquet file with a bunch of text columns and a couple of
   amount columns in decimal(14,4). On-disk size of the file is 376M; it has
   ~100 million rows.
   - rdd1 = sqlContext.read.parquet
   - rdd1.cache
   - group_by_df =
   rdd1.groupBy("a").agg(sum(rdd1("amount1")), sum(rdd1("amount2")))
   - group_by_df.cache
   - group_by_df.count // Trigger action - results in 725 rows (the full
   spark-shell session is sketched below)
   - Run top on the machine.
   - In the Spark UI, the storage tab shows the base ParquetRDD size as 2.3GB
   (several times the 376M on-disk size); the size of group_by_df is 43.2 KB.
   This seems ok.
   - However, the "top" command shows the process resident memory (RES) jumping
   from 2g at start to 31g after the count. This seems excessive for one
   group-by operation and will lead to trouble for repeated similar operations
   on the data ...
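
For reference, here is roughly what the spark-shell session looks like in one
place (the parquet path below is just a placeholder for the actual file, and
sum is imported from org.apache.spark.sql.functions):

    import org.apache.spark.sql.functions.sum

    // read and cache the ~100M-row parquet file (376M on disk)
    val rdd1 = sqlContext.read.parquet("/path/to/data.parquet")
    rdd1.cache()

    // group by column "a" and sum the two decimal(14,4) amounts
    val group_by_df = rdd1.groupBy("a")
      .agg(sum(rdd1("amount1")), sum(rdd1("amount2")))
    group_by_df.cache()

    // trigger the job - returns 725 rows; RES jumps from ~2g to ~31g here
    group_by_df.count()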

Any thoughts?

Thanks,
