spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: splitting a huge file
Date Fri, 21 Apr 2017 18:39:18 GMT
What is your DWH technology?
If the file is on HDFS then, depending on the format, Spark can read parts of it in parallel.
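
A minimal sketch of what that could look like, assuming the file is plain text on HDFS (the paths and partition count are hypothetical placeholders):

    import org.apache.spark.sql.SparkSession

    // Spark splits a plain-text file on HDFS along block boundaries and
    // reads the blocks in parallel, so no single-threaded pre-splitting
    // step is needed.
    val spark = SparkSession.builder().appName("split-huge-file").getOrCreate()
    val df = spark.read.text("hdfs:///data/huge_input.txt")

    // Repartitioning before the write makes each task emit its own part
    // file, so one huge input comes out as ~2000 smaller files.
    df.repartition(2000)
      .write
      .text("hdfs:///data/huge_input_split")

Note that this only helps with splittable formats; a single gzip file, for example, cannot be read in parallel.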

> On 21. Apr 2017, at 20:36, Paul Tremblay <paulhtremblay@gmail.com> wrote:
> 
> We are tasked with loading a big file (possibly 2TB) into a data warehouse. In order to do this efficiently, we need to split the file into smaller files.
> 
> I don't believe there is a way to do this with Spark, because in order for Spark to distribute the file to the worker nodes, it first has to be split up, right?
> 
> We ended up using a single machine with a single thread to do the splitting. I just want to make sure I am not missing something obvious.
> 
> Thanks!
> 
> -- 
> Paul Henry Tremblay
> Attunix

