spark-user mailing list archives

From Abdeali Kothari <abdealikoth...@gmail.com>
Subject Re: Repartition not working on a csv file
Date Sun, 01 Jul 2018 02:52:25 GMT
My entire CSV is less than 20KB.
Somewhere in between, I do a broadcast join with 3500 records from another
file.
After the broadcast join I have a lot of processing to do. Overall, the time
to process a single record goes up to ~5 minutes on one executor.
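
Roughly what that part of the job looks like (a minimal sketch; the paths,
column names, and the join key below are placeholders, not my real ones):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    # main input: ~500 records, less than 20KB on disk
    df = spark.read.csv("main.csv", header=True)

    # lookup table (~3500 records), joined with an explicit broadcast hint
    lookup = spark.read.csv("lookup.csv", header=True)
    joined = df.join(broadcast(lookup), on="id")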

I'm trying to increase the number of partitions my data is in, ideally down to
at most 1 record per executor (currently Spark creates 2 tasks, and hence uses
2 executors... I want it split into at least 100 tasks so I get ~5 records per
task => ~20-25 min per task).
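
The repartition call itself looks like this (again a sketch, carrying over the
placeholder names from above):

    # try to force ~100 tasks so each task handles ~5 records
    repartitioned = joined.repartition(100)
    # ...yet the job still runs with only 2 tasks (hence the subject line)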


On Sun, Jul 1, 2018, 07:58 yujhe.li <liyujhe@gmail.com> wrote:

> Abdeali Kothari wrote
> > I am using Spark 2.3.0 and trying to read a CSV file which has 500
> > records.
> > When I try to read it, Spark says that it has two stages: 10 and 11,
> > which then join into stage 12.
>
> What's your CSV size per file? I think the Spark optimizer may pack many
> small files into one task when reading them.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
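
If it's the file-packing behaviour you describe, I believe these are the
reader configs that control it (a sketch; I haven't verified that lowering
them helps for a single tiny file):

    # defaults are 128MB and 4MB respectively; lowering both should let
    # Spark spread even small inputs across more read tasks
    spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024)
    spark.conf.set("spark.sql.files.openCostInBytes", 1024)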
