spark-user mailing list archives

From Paul Tremblay <paulhtremb...@gmail.com>
Subject splitting a huge file
Date Fri, 21 Apr 2017 18:36:31 GMT
We are tasked with loading a big file (possibly 2 TB) into a data warehouse.
To do this efficiently, we need to split the file into smaller files.

I don't believe there is a way to do this with Spark, because in order for
Spark to distribute the file to the worker nodes, it first has to be split
up, right?

We ended up using a single machine with a single thread to do the
splitting. I just want to make sure I am not missing something obvious.
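
A single-threaded split of this kind can be sketched roughly as follows. This
is only an illustration, assuming a line-delimited file that must be cut at
line boundaries; the function name split_file, the part-NNNNN naming scheme,
and the max_bytes parameter are hypothetical and not the script that was
actually run.

```python
import os

def split_file(src_path, out_dir, max_bytes):
    """Split src_path into numbered parts, breaking only at line boundaries.

    Hypothetical helper for illustration. Each part holds at most max_bytes,
    except that a single line longer than max_bytes gets a part of its own.
    Returns the list of part paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        # Iterating the file object streams it line by line, so memory use
        # stays small no matter how large the input is.
        for line in src:
            # Open a new part at the start, or when this line would push
            # the current part over the size limit.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, "part-%05d" % len(parts))
                parts.append(part_path)
                out = open(part_path, "wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Something like split_file("big.csv", "chunks", 256 * 1024 * 1024) would then
produce parts of about 256 MB each; the paths and the size here are made up
for the example.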

Thanks!

-- 
Paul Henry Tremblay
Attunix
