spark-user mailing list archives

From "Nick Pentreath" <nick.pentre...@gmail.com>
Subject Re: RDD[URI]
Date Thu, 30 Jan 2014 18:02:24 GMT
What is the precise use case and reasoning behind wanting to work on a File as the "record"
in an RDD?


CombineFileInputFormat may be useful in some way: http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/

https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/MultiFileWordCount.java

—
Sent from Mailbox for iPhone

On Thu, Jan 30, 2014 at 7:34 PM, Christopher Nguyen <ctn@adatao.com>
wrote:

> Philip, I guess the key problem statement is the "large collection of"
> part? If so this may be helpful, at the HDFS level:
> http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.
> Otherwise you can always start with an RDD[fileUri] and go from there to an
> RDD[(fileUri, read_contents)].
> Sent while mobile. Pls excuse typos etc.
> On Jan 30, 2014 9:13 AM, "尹绪森" <yinxusen@gmail.com> wrote:
>> I am also interested in this. My current solution is to turn each file into
>> a single line of text, i.e. deleting all '\n', then prepending the filename
>> followed by a space:
>>
>> [filename] [space] [content]
>>
>> Does anyone have a better idea?
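
The "[filename] [space] [content]" layout described above can be sketched in plain Scala. This is only an illustration: the object name `FileAsLine` and both helpers are hypothetical, and the sketch assumes filenames contain no whitespace, so that the first space in a record always marks the end of the filename.

```scala
// Hypothetical helpers for the "[filename] [space] [content]" record layout.
object FileAsLine {
  // Flatten one file into a single record: the filename, one space, then the
  // content with all line breaks deleted (lossy, as in the message above).
  def encode(filename: String, content: String): String = {
    require(!filename.exists(_.isWhitespace), "filename must not contain whitespace")
    filename + " " + content.replace("\r", "").replace("\n", "")
  }

  // Split a record back into (filename, content) at the first space.
  def decode(line: String): (String, String) = {
    val i = line.indexOf(' ')
    require(i >= 0, "record has no space separator")
    (line.substring(0, i), line.substring(i + 1))
  }
}
```

Once files are stored this way, something like spark.textFile(path).map(FileAsLine.decode) would give one (filename, content) pair per original file. Because line breaks are deleted rather than replaced, the original line structure cannot be recovered, which matters for line-oriented formats.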
>> On 2014-1-31 at 12:18 AM, "Philip Ogren" <philip.ogren@oracle.com> wrote:
>>
>>> In my Spark programming thus far, my unit of work has been a single row
>>> from an HDFS file, which I get by creating an RDD[Array[String]] with
>>> something like:
>>>
>>> spark.textFile(path).map(_.split("\t"))
>>>
>>> Now I'd like to do some work over a large collection of files in which
>>> the unit of work is a single file (rather than a row from a file). Does
>>> Spark anticipate users creating an RDD[URI] or RDD[File] or some such, and
>>> does it support the actions and transformations one might want on such an
>>> RDD? Any advice and/or code snippets would be appreciated!
>>>
>>> Thanks,
>>> Philip
>>>
>>
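
The suggestion in the thread to start from an RDD[fileUri] and map it to an RDD[(fileUri, read_contents)] might look roughly like the sketch below. The names `FilePerRecord`, `readContents` and `filesAsRecords` are hypothetical, and the sketch assumes that every file is UTF-8 text, is small enough to fit in one executor's memory, and is reachable from every worker through the Hadoop FileSystem API.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.io.Source

// Hypothetical helper object; not part of Spark.
object FilePerRecord {
  // Read one whole file into a String via the Hadoop FileSystem API.
  // Assumes UTF-8 text that fits in memory.
  def readContents(uri: String): String = {
    val path = new Path(uri)
    val fs = path.getFileSystem(new Configuration())
    val in = fs.open(path)
    try Source.fromInputStream(in, "UTF-8").mkString
    finally in.close()
  }

  // Distribute the list of URIs, then read each file on the executors,
  // producing one (uri, contents) record per file.
  def filesAsRecords(sc: SparkContext, uris: Seq[String]): RDD[(String, String)] =
    sc.parallelize(uris).map(uri => (uri, readContents(uri)))
}
```

With an existing SparkContext this could be called as FilePerRecord.filesAsRecords(sc, uris), where uris is the list of file locations gathered on the driver. The read is not locality-aware, so for many small files on HDFS the CombineFileInputFormat approach linked above may perform better. Spark 1.0 and later also provide SparkContext.wholeTextFiles, which returns (path, content) pairs directly.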