spark-user mailing list archives

From "Nick Pentreath" <>
Subject Re: RDD[URI]
Date Thu, 30 Jan 2014 18:02:24 GMT
What is the precise use case and reasoning behind wanting to work on a File as the "record"
in an RDD?

CombineFileInputFormat may be useful in some way:

Sent from Mailbox for iPhone
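As a rough conceptual sketch of what CombineFileInputFormat buys you — packing many small files into fewer input splits so one task reads a batch of files rather than a single file — the packing idea can be shown as a plain Scala function. `packIntoSplits` and its parameters are made-up names for illustration, not Hadoop API:

```scala
import scala.collection.mutable.ListBuffer

// Conceptual sketch (not Hadoop API): CombineFileInputFormat roughly
// greedily packs many small files into combined splits no larger than
// maxSplitBytes, so each task processes a batch of files.
def packIntoSplits(fileSizes: Seq[(String, Long)],
                   maxSplitBytes: Long): List[List[String]] = {
  val splits = ListBuffer(ListBuffer.empty[String])
  var current = 0L
  for ((name, size) <- fileSizes) {
    if (splits.last.nonEmpty && current + size > maxSplitBytes) {
      splits += ListBuffer.empty[String] // start a new combined split
      current = 0L
    }
    splits.last += name
    current += size
  }
  splits.map(_.toList).toList
}
```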

On Thu, Jan 30, 2014 at 7:34 PM, Christopher Nguyen <>

> Philip, I guess the key problem statement is the "large collection of"
> part? If so this may be helpful, at the HDFS level:
> Otherwise you can always start with an RDD[fileUri] and go from there to an
> RDD[(fileUri, read_contents)].
> Sent while mobile. Pls excuse typos etc.
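A minimal sketch of the RDD[fileUri] -> RDD[(fileUri, read_contents)] step Christopher describes, with the per-file read written as a plain Scala function so the logic is visible without a cluster; the Spark wiring in the trailing comment is assumed, not taken from the thread:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// The per-file read you would map over an RDD of file URIs. This version
// handles local or NFS paths; an hdfs:// URI would need the Hadoop
// FileSystem API instead of java.nio.
def readContents(uri: String): (String, String) = {
  val bytes = Files.readAllBytes(Paths.get(uri))
  (uri, new String(bytes, StandardCharsets.UTF_8))
}

// Assumed Spark wiring (hypothetical, not from the thread):
//   val uris: RDD[String] = sc.parallelize(paths)
//   val files: RDD[(String, String)] = uris.map(readContents)
```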
> On Jan 30, 2014 9:13 AM, "尹绪森" <> wrote:
>> I am also interested in this. My current solution is to turn each file
>> into a single line of text, i.e. delete all '\n' characters, then prepend
>> the filename followed by a space:
>> [filename] [space] [content]
>> Anyone have better ideas ?
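The encoding 尹绪森 describes could be sketched as follows. The helper names are hypothetical; note that the scheme assumes filenames contain no spaces, and it loses the file's original line boundaries:

```scala
// Encode: strip newlines from the content, prepend "[filename] [space]".
def encodeFile(filename: String, content: String): String =
  filename + " " + content.replace("\n", "")

// Decode: split at the first space to recover (filename, content).
def decodeLine(line: String): (String, String) = {
  val idx = line.indexOf(' ')
  (line.substring(0, idx), line.substring(idx + 1))
}
```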
>> On Jan 31, 2014 at 12:18 AM, "Philip Ogren" <> wrote:
>>> In my Spark programming thus far, my unit of work has been a single row
>>> of an HDFS file, obtained by creating an RDD[Array[String]] with something like:
>>> spark.textFile(path).map(_.split("\t"))
>>> Now, I'd like to do some work over a large collection of files in which
>>> the unit of work is a single file (rather than a row from a file). Does
>>> Spark anticipate users creating an RDD[URI] or RDD[File] or some such and
>>> supporting actions and transformations that one might want to do on such an
>>> RDD?  Any advice and/or code snippets would be appreciated!
>>> Thanks,
>>> Philip