spark-user mailing list archives

From Roger Marin <ro...@rogersmarin.com>
Subject Re: splitting a huge file
Date Sat, 22 Apr 2017 04:56:34 GMT
If the file is already in HDFS, you can use Spark to read it with an input
format suited to the file type; the input format handles splitting it
across the cluster.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html
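For example, with a plain-text file Spark will split the input at HDFS
block boundaries and you can write the pieces back out in parallel. A
minimal sketch in Scala (assuming Spark 2.x; the paths and partition count
are only illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SplitHugeFile").getOrCreate()

// Spark uses the Hadoop input format to split the file into partitions
// at HDFS block boundaries, so no single-machine pre-splitting is needed.
val lines = spark.read.textFile("hdfs:///data/huge.txt")

// repartition controls how many output files are produced; each
// partition is written out as a separate part file, in parallel.
lines.repartition(2000).write.text("hdfs:///data/split/")

Each partition becomes one part file, so 2000 partitions would give
roughly 1 GB files from a 2 TB input.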

On Sat, Apr 22, 2017 at 4:36 AM, Paul Tremblay <paulhtremblay@gmail.com>
wrote:

> We are tasked with loading a big file (possibly 2TB) into a data
> warehouse. In order to do this efficiently, we need to split the file into
> smaller files.
>
> I don't believe there is a way to do this with Spark, because in order for
> Spark to distribute the file to the worker nodes, it first has to be split
> up, right?
>
> We ended up using a single machine with a single thread to do the
> splitting. I just want to make sure I am not missing something obvious.
>
> Thanks!
>
> --
> Paul Henry Tremblay
> Attunix
>
