spark-user mailing list archives

From "Ulanov, Alexander" <alexander.ula...@hp.com>
Subject RE: Reading large files
Date Wed, 06 May 2015 17:54:54 GMT
SparkContext has two methods for reading binary files: binaryFiles (reads multiple binary files
into an RDD, one record per file) and binaryRecords (reads fixed-length records from a flat binary
file into an RDD). For example, I have a big binary file split into logical parts, so I can use
“binaryFiles”. One possible problem is that the order of records between parts is not preserved,
so I have to do a sortBy afterwards.
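
A minimal PySpark sketch of both approaches follows. The 16-byte record layout, the helper names and the paths passed in are assumptions made for illustration and do not come from the original question; "sc" is assumed to be an already created SparkContext.

```python
import struct

# Hypothetical record layout: an 8-byte little-endian integer id followed by
# an 8-byte double. Replace it with the layout of your own file.
RECORD_FORMAT = "<qd"
RECORD_LENGTH = struct.calcsize(RECORD_FORMAT)  # 16 bytes


def decode_record(raw):
    """Unpack one fixed-length record into an (id, value) tuple."""
    return struct.unpack(RECORD_FORMAT, raw)


def split_records(blob):
    """Cut the full contents of one part file into decoded records.

    Trailing bytes that do not fill a whole record are ignored.
    """
    return [decode_record(blob[i:i + RECORD_LENGTH])
            for i in range(0, len(blob) - RECORD_LENGTH + 1, RECORD_LENGTH)]


def read_single_file(sc, path):
    """One large flat file: Spark splits it into fixed-length records,
    so the executors read the data instead of the driver."""
    return sc.binaryRecords(path, RECORD_LENGTH).map(decode_record)


def read_part_files(sc, directory):
    """Several part files: each file arrives whole as a (path, bytes) pair."""
    parts = sc.binaryFiles(directory)
    # Sorting by path restores the order of the parts, assuming the file
    # names sort in the intended order (an assumption of this sketch).
    return parts.sortByKey().flatMap(lambda kv: split_records(kv[1]))
```

Note that binaryFiles loads each part as a single record, so every part has to fit in the memory of one executor, and sorting by path shuffles the whole file contents. If that is too expensive, sorting the decoded records by a key stored inside each record is an alternative.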

Alexander

From: Vijayasarathy Kannan [mailto:kvijay@vt.edu]
Sent: Wednesday, May 06, 2015 10:38 AM
To: user@spark.apache.org
Subject: Reading large files

Hi,

Is there a way to read a large file in a parallel/distributed way? I have a single large
binary file which I currently read in the driver program and then distribute to the executors
(using groupBy(), etc.). I want to know if there's a way to make each executor read a
specific/unique portion of the file, or to create RDDs from multiple portions of the file and
finally union them.

Thanks.