spark-user mailing list archives

From Chanh Le <giaosu...@gmail.com>
Subject Re: [Thriftserver2] Controlling number of tasks
Date Wed, 03 Aug 2016 16:13:09 GMT
I believe there is no way to reduce the number of tasks with coalesce on the Hive side, because Hive just reads the files as they are, and the task count depends on how many files you put in. The way I did it was to coalesce at the ETL layer, writing as few files as possible to reduce the I/O time spent reading them.
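Independent of Spark, the compaction idea above (merge the small drops into a few large files at the ETL layer, so downstream readers see fewer splits) can be sketched with plain file I/O. This is only an illustration of the principle; the function name, paths, and target size are invented for the example, not taken from the thread:

```python
import os

def compact_files(paths, out_dir, target_bytes=128 * 1024 * 1024):
    """Merge many small input files into a few large output files.

    Each input is appended to the current output until it reaches
    target_bytes, then a new output file is started -- roughly what
    coalescing at the ETL layer achieves for downstream readers.
    """
    os.makedirs(out_dir, exist_ok=True)
    outputs, current, written = [], None, 0
    for path in paths:
        size = os.path.getsize(path)
        # Start a new output file if the current one would overflow.
        if current is None or written + size > target_bytes:
            if current is not None:
                current.close()
            out_path = os.path.join(out_dir, f"part-{len(outputs):05d}")
            current = open(out_path, "wb")
            outputs.append(out_path)
            written = 0
        with open(path, "rb") as src:
            current.write(src.read())
        written += size
    if current is not None:
        current.close()
    return outputs
```

In a real Spark ETL job the equivalent step would be `df.coalesce(n).write...` before handing the data to the Thriftserver's tables.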


> On Aug 3, 2016, at 7:03 PM, Yana Kadiyska <yana.kadiyska@gmail.com> wrote:
> 
> Hi folks, I have an ETL pipeline that drops a file every half hour. When Spark reads these
> files, I end up with 315K tasks for a dataframe reading a few days' worth of data.
> 
> I know that with a regular Spark job I can use coalesce to get to a lower number of tasks.
> Is there a way to tell HiveThriftserver2 to coalesce? I have a line in hive-conf that says
> to use CombinedInputFormat but I'm not sure it's working.
> 
> (Obviously having fewer, larger files is better, but I don't control the file-generation
> side of this.)
> 
> Tips much appreciated
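For the hive-conf line mentioned above, the usual Hive setting is `CombineHiveInputFormat`, which merges many small splits into fewer, larger ones. Whether it takes effect depends on how the Thriftserver reads the table (Spark SQL's native file readers use their own split logic, e.g. `spark.sql.files.maxPartitionBytes`). A hedged sketch of the hive-site.xml side only; the 256 MB value is illustrative, not tuned:

```xml
<!-- hive-site.xml: combine small files into larger input splits -->
<property>
  <name>hive.input.format</name>
  <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
</property>
<property>
  <!-- upper bound on a combined split, in bytes (example: 256 MB) -->
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>268435456</value>
</property>
```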


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

