spark-user mailing list archives

From Jörn Franke <>
Subject Re: splitting a huge file
Date Fri, 21 Apr 2017 18:39:18 GMT
What is your DWH technology?
If the file is on HDFS then, depending on the format, Spark can read parts of it in parallel.
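The point about splittable formats can be illustrated with a minimal sketch of how Hadoop-style input splits work on a line-delimited text file: each reader is handed a byte range, skips the partial line at its start (unless it owns offset 0), and reads one line past its end, so every record is read exactly once without pre-splitting the file. This is an illustration of the idea, not Spark's or Hadoop's actual code; the function and variable names are assumptions.

```python
# Sketch (illustrative, not Spark/Hadoop source): parallel readers over
# byte ranges of a single line-delimited file, Hadoop input-split style.
import io

def read_split(f, start, end):
    """Read the records belonging to byte range [start, end).
    A reader not starting at offset 0 skips the partial first line
    (the previous split finishes it); every reader reads through the
    line that starts at or before `end`, so no record is lost."""
    f.seek(start)
    if start != 0:
        f.readline()              # skip partial line; previous split reads it
    records = []
    while f.tell() <= end:        # read lines starting at or before `end`
        line = f.readline()
        if not line:
            break
        records.append(line.rstrip(b"\n"))
    return records

data = b"alpha\nbravo\ncharlie\ndelta\necho\n"
f = io.BytesIO(data)
size = len(data)
n_splits = 3                      # in Spark these would be separate tasks
bounds = [size * i // n_splits for i in range(n_splits + 1)]
parts = [read_split(f, bounds[i], bounds[i + 1]) for i in range(n_splits)]
assert [r for p in parts for r in p] == data.split(b"\n")[:-1]
```

Because the boundary handling makes every split self-describing, formats with this property (plain text, most sequence/columnar formats; not a single gzip stream) let Spark read one 2TB file as many parallel tasks without anyone splitting it first.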

> On 21. Apr 2017, at 20:36, Paul Tremblay <> wrote:
> We are tasked with loading a big file (possibly 2TB) into a data warehouse. In order
> to do this efficiently, we need to split the file into smaller files.
> I don't believe there is a way to do this with Spark, because in order for Spark to distribute
> the file to the worker nodes, it first has to be split up, right?
> We ended up using a single machine with a single thread to do the splitting. I just want
> to make sure I am not missing something obvious.
> Thanks!
> -- 
> Paul Henry Tremblay
> Attunix
