spark-user mailing list archives

From Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Subject Re: Spark streaming takes longer time to read json into dataframes
Date Mon, 18 Jul 2016 02:33:19 GMT
Hi,

Repartitioning would create a shuffle over the network, which I should avoid to reduce
processing time, because a single batch can carry up to 5G of messages.
Partitioning the topic and parallelizing receiving in the Direct Stream might do the trick.
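
A minimal sketch of that setup, assuming Spark 1.6 with the Kafka 0.8 direct API
(ssc, kafkaParams, the topic name and the partition count of 8 are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// With the direct approach each Kafka partition maps 1:1 to an RDD partition,
// so a topic with e.g. 8 partitions is consumed by 8 parallel tasks per batch,
// with no repartition() shuffle needed.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("json-topic"))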


Sent from Samsung Mobile.

-------- Original message --------
From: Martin Eden <martineden131@gmail.com>
Date: 16/07/2016 14:01 (GMT+05:30)
To: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Spark streaming takes longer time to read json into dataframes

Hi,

I would just do a repartition on the initial direct DStream, since otherwise each RDD in the
stream has exactly as many partitions as the Kafka topic (in your case 1). That way receiving
is still done in only 1 thread, but at least the processing further down is done in parallel.
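
Something along these lines (a sketch; directStream and the partition count of 8 are
placeholders to tune for your cluster):

// repartition() costs one shuffle per batch, but the JSON parsing and any
// further processing then run across 8 tasks instead of 1.
val repartitioned = directStream.repartition(8)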

If you want to parallelize your receiving as well, I would partition the Kafka topic; then
the RDDs in the initial DStream will have as many partitions as you set in Kafka.

Have you seen this? http://spark.apache.org/docs/latest/streaming-kafka-integration.html

M

On Sat, Jul 16, 2016 at 5:26 AM, Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com> wrote:

---------- Forwarded message ----------
From: Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com>
Date: Sat, Jul 16, 2016 at 9:30 AM
Subject: Re: Spark streaming takes longer time to read json into dataframes
To: Jean Georges Perrin <jgp@jgp.net>


Hello, 

I need it in memory. I increased executor memory to 25G and executor cores to 3, but got the
same result. There is always exactly one task running under the executor for rdd.read.json()
because the RDD partition count is 1. Is doing hash partitioning inside foreachRDD a good
approach?
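
A minimal sketch of that idea, assuming the (key, message) tuples from the direct stream;
sqlContext and the partition count of 8 are placeholders:

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Spread the single Kafka partition across executors before parsing...
    val json = rdd.map(_._2).repartition(8)
    // ...so sqlContext.read.json() runs as 8 parallel tasks instead of 1.
    val df = sqlContext.read.json(json)
    df.count() // placeholder action
  }
}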

Regards, 
Diwakar. 

On Sat, Jul 16, 2016 at 9:20 AM, Jean Georges Perrin <jgp@jgp.net> wrote:
Do you need it on disk, or can you just push it to memory? Can you try to increase memory
or the # of cores? (I know it sounds basic.)

> On Jul 15, 2016, at 11:43 PM, Diwakar Dhanuskodi <diwakar.dhanuskodi@gmail.com> wrote:
>
> Hello,
>
> I have 400K json messages pulled from Kafka into Spark Streaming using the DirectStream
> approach. The total size of the 400K messages is around 5G. The Kafka topic has a single
> partition. I am using rdd.read.json(_._2) inside foreachRDD to convert the rdd into a
> dataframe. It takes almost 2.3 minutes to convert into a dataframe.
>
> I am running in Yarn client mode with 15G executor memory and 2 executor cores.
>
> Caching the rdd before converting it into a dataframe doesn't change the processing time.
> Will introducing hash partitions inside foreachRDD help? Or will partitioning the topic
> and having more than one DirectStream help? How can I approach this situation to reduce
> the time spent converting to a dataframe?
>
> Regards,
> Diwakar.



