spark-user mailing list archives

From Alexey Romanchuk <alexey.romanc...@gmail.com>
Subject Re: Delayed hotspot optimizations in Spark
Date Fri, 10 Oct 2014 08:09:47 GMT
Hey Sean and Spark users!

Thanks for the reply. I tried -Xcomp just now: startup took a few
minutes (as expected), but the first query was as slow as before:
Oct 10, 2014 3:03:41 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 30 columns in 12897 ms:
121.64837 rec/ms, 3649.451 cell/ms

and the next one:

Oct 10, 2014 3:05:03 PM INFO: parquet.hadoop.InternalParquetRecordReader:
Assembled and processed 1568899 records from 1 columns in 1757 ms:
892.94196 rec/ms, 892.94196 cell/ms
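
For comparing such log lines side by side, here is a minimal Python sketch that
recomputes the two rates from the raw counts. The regular expression is based
only on the summary text quoted in this message, and the function names are
made up for illustration; it is not part of Parquet or Spark.

```python
import re

# Assumption: the summary text looks like the lines quoted in this thread.
LINE_RE = re.compile(
    r"Assembled and processed (\d+) records from (\d+) columns in (\d+) ms"
)


def parse_read_stats(text):
    """Return a (records, columns, millis) tuple for each summary found in text."""
    return [tuple(int(g) for g in m.groups()) for m in LINE_RE.finditer(text)]


def throughput(records, columns, millis):
    """Return (records per ms, cells per ms), where cells = records * columns."""
    return records / millis, records * columns / millis


if __name__ == "__main__":
    log = """
    Assembled and processed 1568899 records from 30 columns in 12897 ms
    Assembled and processed 1568899 records from 1 columns in 1757 ms
    """
    for records, columns, millis in parse_read_stats(log):
        rec_ms, cell_ms = throughput(records, columns, millis)
        print(f"{columns:>2} columns: {rec_ms:.2f} rec/ms, {cell_ms:.2f} cell/ms")
```

Note that the two lines report different column counts (30 and 1), so rec/ms
alone may not be a like-for-like comparison; cell/ms accounts for that.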

I doubt it is caching or something similar, because CPU load is 100% on
the worker and jstack shows that the worker is reading from the Parquet file.

Any ideas?

Thanks!

On Fri, Oct 10, 2014 at 2:55 PM, Sean Owen <sowen@cloudera.com> wrote:

> You could try setting "-Xcomp" for executors to force JIT compilation
> upfront. I don't know if it's a good idea overall but might show
> whether the upfront compilation really helps. I doubt it.
>
> However, isn't this almost surely due to caching somewhere, in Spark SQL
> or HDFS? I really doubt HotSpot makes a difference compared to these
> much larger factors.
>
> On Fri, Oct 10, 2014 at 8:49 AM, Alexey Romanchuk
> <alexey.romanchuk@gmail.com> wrote:
> > Hello spark users and developers!
> >
> > I am using HDFS + Spark SQL + Hive schema + Parquet as the storage format.
> > I have a lot of Parquet files: one file per day, each fitting in one HDFS
> > block. The strange thing is that the first Spark SQL query is very slow.
> >
> > To reproduce the situation I use only one core: the first query takes
> > 97 sec and all subsequent queries take only 13 sec. Of course I query
> > different data, but it has the same structure and size. The situation can
> > be reproduced after restarting the Thrift server.
> >
> > Here is the information about Parquet file reads from the worker node:
> >
> > Slow one:
> > Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader:
> > Assembled and processed 1560251 records from 30 columns in 11686 ms:
> > 133.51454 rec/ms, 4005.4363 cell/ms
> >
> > Fast one:
> > Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
> > Assembled and processed 1568899 records from 1 columns in 1373 ms:
> > 1142.6796 rec/ms, 1142.6796 cell/ms
> >
> > As you can see, the second read is 10x faster than the first. Most of the
> > query time is spent working with the Parquet file.
> >
> > This problem is really annoying, because most of my Spark tasks contain
> > just one SQL query plus data processing, and to speed up my jobs I put a
> > special warm-up query in front of every job.
> >
> > My assumption is that HotSpot optimizations kick in during the first
> > read. Do you have any idea how to confirm or solve this performance
> > problem?
> >
> > Thanks for any advice!
> >
> > P.S. -XX:+PrintCompilation shows a huge number of HotSpot compilations,
> > but I cannot figure out which are important and which are not.
>
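
One possible way to get an overview of a large -XX:+PrintCompilation output
is to count compilation events per package. The Python sketch below assumes
only that each compiled method shows up somewhere in a line as
"some.package.Class::method"; the exact column layout varies between JVM
versions, so the other columns are ignored. The function name and the depth
parameter are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a compiled method appears as "some.package.Class::method".
METHOD_RE = re.compile(r"([\w.$/]+)::([\w$<>]+)")


def count_by_package(lines, depth=2):
    """Count compilation events per package prefix of `depth` components."""
    counts = Counter()
    for line in lines:
        match = METHOD_RE.search(line)
        if match is None:
            continue  # not a compilation line, skip it
        parts = match.group(1).replace("/", ".").split(".")
        # Drop the class name, keep the first `depth` package components.
        package = ".".join(parts[:-1][:depth]) or "(default)"
        counts[package] += 1
    return counts
```

For example, with the output saved to a file (the name compilation.log is
hypothetical), count_by_package(open("compilation.log")).most_common(20)
would list the packages with the most compilation events, which may help to
see whether the Parquet reader classes dominate during the first query.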
