spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: How to parallelize zip file processing?
Date Fri, 10 Aug 2018 21:30:44 GMT
Does the zip file contain only one file? If so, I fear only one core can be used to read it.
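
A quick way to answer that question is to list the archive's entries before involving Spark at all. The following is a minimal sketch using only Python's standard `zipfile` module; the function name `list_data_entries` is made up for illustration and is not part of Spark.

```python
import zipfile


def list_data_entries(zip_path):
    """Return the names of all non-directory entries in a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        # infolist() reads only the central directory, not the compressed data,
        # so this stays cheap even for a multi-gigabyte archive.
        return [info.filename for info in zf.infolist() if not info.is_dir()]
```

If this returns a single name, the whole archive is one compressed stream and has to be read sequentially.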

By the way, do you mean gzip? In that case you cannot decompress it in parallel...

How is the zip file created? Can’t you create several smaller ones instead?
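
If the producer of the archive cannot be changed, one option is to re-pack it once, up front, into several smaller archives that can then be handed to separate tasks. The sketch below assumes the archive holds a line-oriented text entry; the helper names `split_zip_entry` and `_write_part`, the `part-NNNNN` naming, and the `lines_per_part` parameter are all hypothetical choices for illustration. It uses only the standard library and runs on a single machine, so the split itself is still sequential.

```python
import os
import zipfile


def _write_part(out_dir, index, lines):
    """Write one chunk of lines into its own small zip file and return its path."""
    path = os.path.join(out_dir, f"part-{index:05d}.zip")
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as out:
        out.writestr(f"part-{index:05d}.txt", b"".join(lines))
    return path


def split_zip_entry(src_zip, entry_name, out_dir, lines_per_part):
    """Stream one text entry out of src_zip and re-pack it as several zips."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    with zipfile.ZipFile(src_zip) as zf, zf.open(entry_name) as raw:
        buf = []
        for line in raw:  # the opened entry yields lines as bytes
            buf.append(line)
            if len(buf) == lines_per_part:
                parts.append(_write_part(out_dir, len(parts), buf))
                buf = []
        if buf:  # leftover lines that did not fill a whole part
            parts.append(_write_part(out_dir, len(parts), buf))
    return parts
```

Only one chunk of lines is held in memory at a time, so `lines_per_part` controls both the memory use of the split and the size of each resulting file.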

> On 10. Aug 2018, at 22:54, mytramesh <turlapati.ramesh@gmail.com> wrote:
> 
> I know Spark doesn’t support zip files directly, since they are not distributable.
> Are there any techniques to process such a file quickly?
> 
> I am trying to process a zip file of around 4 GB. All the data is going to one executor,
> and only one task is being assigned to process it.
> 
> Even when I call the repartition method, the data gets partitioned, but on the same
> executor.
> 
> 
> How can I distribute the data to other executors?
> How can I get more tasks/threads assigned when it is partitioned on the same
> executor?
> 
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> 
