spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Subject Fwd: Spark streaming takes longer time to read json into dataframes
Date Sat, 16 Jul 2016 04:26:20 GMT
---------- Forwarded message ----------
From: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>
Date: Sat, Jul 16, 2016 at 9:30 AM
Subject: Re: Spark streaming takes longer time to read json into dataframes
To: Jean Georges Perrin <jgp@jgp.net>


Hello,

I need it on memory.  Increased executor memory to 25G and executor cores
to 3. Got same result. There is always one task running under executor for
rdd.read.json() because rdd partition size is 1 . Doing hash partitioning
inside foreachRDD is a good approach?

Regards,
Diwakar.

On Sat, Jul 16, 2016 at 9:20 AM, Jean Georges Perrin <jgp@jgp.net> wrote:

> Do you need it on disk or just push it to memory? Can you try to increase
> memory or # of cores (I know it sounds basic)
>
> > On Jul 15, 2016, at 11:43 PM, Diwakar Dhanuskodi <
> diwakar.dhanuskodi@gmail.com> wrote:
> >
> > Hello,
> >
> > I have 400K json messages pulled from Kafka into spark streaming using
> DirectStream approach. Size of 400K messages is around 5G.  Kafka topic is
> single partitioned. I am using rdd.read.json(_._2) inside foreachRDD to
> convert  rdd into dataframe. It takes almost 2.3 minutes to convert into
> dataframe.
> >
> > I am running in Yarn client mode with executor memory as 15G and
> executor cores as 2.
> >
> > Caching rdd before converting into dataframe  doesn't change processing
> time. Whether introducing hash partitions inside foreachRDD  will help?
> (or) Will partitioning topic and have more than one DirectStream help?. How
> can I approach this situation to reduce time in converting to dataframe..
> >
> > Regards,
> > Diwakar.
>
>

Mime
View raw message