spark-user mailing list archives

From Takeshi Yamamuro <linguin....@gmail.com>
Subject Re: [Thriftserver2] Controlling number of tasks
Date Wed, 03 Aug 2016 16:30:42 GMT
Hi,

HiveThriftServer2 itself has no such functionality.
Have you tried adaptive execution in Spark?
https://issues.apache.org/jira/browse/SPARK-9850
I have not used it yet myself, but this experimental feature seems to tune
the number of tasks depending on partition size.
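
If you want to experiment with it, a rough sketch (untested, assuming
Spark 2.0; please double-check the config keys, they may change between
releases):

  // Enable adaptive execution for the session. Note that this tunes the
  // number of post-shuffle tasks, not the number of input splits.
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  // Target size in bytes for each post-shuffle partition (64 MB here).
  spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
    (64 * 1024 * 1024).toString)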

// maropu


On Thu, Aug 4, 2016 at 1:13 AM, Chanh Le <giaosudau@gmail.com> wrote:

> I believe there is no way to reduce the number of tasks on the Hive side
> with coalesce, because Hive just reads the files, and the task count
> depends on how many files you put in. So what I did was coalesce at the
> ETL layer, producing as few files as possible to reduce the I/O time
> spent reading them.
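>
> For example, a rough sketch of that compaction step (the paths and the
> target file count are made up, tune them for your data volume):
>
>   // Read the many small half-hourly files and rewrite them as a few
>   // larger ones; fewer files means fewer read tasks downstream.
>   val df = sqlContext.read.parquet("/data/raw/2016-08-03")
>   df.coalesce(8).write.mode("overwrite").parquet("/data/compact/2016-08-03")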
>
>
> > On Aug 3, 2016, at 7:03 PM, Yana Kadiyska <yana.kadiyska@gmail.com> wrote:
> >
> > Hi folks, I have an ETL pipeline that drops a file every half hour. When
> > Spark reads these files, I end up with 315K tasks for a dataframe reading
> > a few days' worth of data.
> >
> > I know with a regular Spark job I can use coalesce to bring the number
> > of tasks down. Is there a way to tell HiveThriftServer2 to coalesce? I
> > have a line in hive-conf that says to use CombinedInputFormat, but I'm
> > not sure it's working.
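> >
> > (For reference, the coalesce I mean in a regular job looks roughly like
> > this; the path, table name, and partition count are illustrative:)
> >
> >   // Collapse the one-task-per-file partitioning into a manageable
> >   // number of tasks before registering the table for SQL queries.
> >   val df = sqlContext.read.parquet("/path/to/drops").coalesce(200)
> >   df.registerTempTable("events")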
> >
> > (Obviously having fewer large files is better, but I don't control the
> > file generation side of this.)
> >
> > Tips much appreciated
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>


-- 
---
Takeshi Yamamuro
