spark-user mailing list archives

From Vijayasarathy Kannan <kvi...@vt.edu>
Subject Re: Reading large files
Date Wed, 06 May 2015 21:22:31 GMT
Thanks.

In both cases, does the driver need enough memory to hold the entire file?
How do these functions behave when, for example, the binary file is 4 GB and
the available driver memory is less than that?
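For background on why fixed-length records make a distributed read possible at all: because every record occupies exactly `recordLength` bytes, each task can compute its own byte offset, seek there, and read only its slice, so no single process needs the whole file in memory. Below is a minimal pure-Python sketch of that offset arithmetic (the file name, record length, and partition count are made up for illustration; this is not Spark's actual implementation of `binaryRecords`):

```python
import os
import tempfile

RECORD_LENGTH = 8    # bytes per record (assumption for illustration)
NUM_PARTITIONS = 4

# Create a sample binary file of 1000 fixed-length records.
records = [i.to_bytes(RECORD_LENGTH, "big") for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"".join(records))

def read_partition(path, part, num_parts, record_len):
    """Read only the records belonging to one partition, by seeking
    to a byte offset computed from the fixed record length."""
    total = os.path.getsize(path) // record_len
    per_part = -(-total // num_parts)        # ceiling division
    start = part * per_part
    count = max(0, min(per_part, total - start))
    out = []
    with open(path, "rb") as f:
        f.seek(start * record_len)           # jump straight to our slice
        for _ in range(count):
            out.append(f.read(record_len))
    return out

# Each "executor" reads its own disjoint slice of the file.
parts = [read_partition(path, p, NUM_PARTITIONS, RECORD_LENGTH)
         for p in range(NUM_PARTITIONS)]
```

Reassembling the slices in partition order reproduces the original file, which is why no global coordination (beyond knowing the record length) is needed.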

On Wed, May 6, 2015 at 1:54 PM, Ulanov, Alexander <alexander.ulanov@hp.com>
wrote:

>  SparkContext has two methods for reading binary files: binaryFiles
> (reads multiple binary files into an RDD) and binaryRecords (reads
> fixed-length records from a binary file into an RDD). For example, I have
> a big binary file split into logical parts, so I can use “binaryFiles”. The
> possible problem is that the order of records across parts is not
> preserved, so I have to do a sortBy afterwards.
>
>
>
> Alexander
>
>
>
> *From:* Vijayasarathy Kannan [mailto:kvijay@vt.edu]
> *Sent:* Wednesday, May 06, 2015 10:38 AM
> *To:* user@spark.apache.org
> *Subject:* Reading large files
>
>
>
> Hi,
>
>
>
> Is there a way to read a large file in a parallel/distributed way? I have
> a single large binary file that I currently read on the driver program and
> then distribute to the executors (using groupBy(), etc.). I want to know if
> there's a way to make each executor read a specific/unique portion of the
> file, or to create RDDs from multiple portions of the file and finally
> union them.
>
>
>
> Thanks.
>
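The pattern Alexander describes, reading the logical parts independently and then restoring global order with a sort on a key derived from each part, can be sketched outside Spark as well. A minimal pure-Python illustration (the part-file naming scheme and key extraction are assumptions; in Spark the read would be `sc.binaryFiles` and the reorder an `rdd.sortBy`):

```python
import os
import random
import tempfile

# Write a "big file" pre-split into logical part files part-0000, part-0001, ...
tmp = tempfile.mkdtemp()
chunks = [f"chunk-{i}".encode() for i in range(6)]
for i, chunk in enumerate(chunks):
    with open(os.path.join(tmp, f"part-{i:04d}"), "wb") as f:
        f.write(chunk)

# A distributed read may hand back (filename, bytes) pairs in any order,
# so simulate that here with a shuffle.
pairs = []
for name in os.listdir(tmp):
    with open(os.path.join(tmp, name), "rb") as f:
        pairs.append((name, f.read()))
random.shuffle(pairs)

# Restore the original order by sorting on the part index embedded in the
# filename (the analogue of rdd.sortBy over a key extracted per record).
pairs.sort(key=lambda kv: int(kv[0].split("-")[1]))
reassembled = b"".join(data for _, data in pairs)
```

The design point is that order is recovered from a key carried with the data, not from the order in which the parts happen to be read.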
