spark-user mailing list archives

From Philip Ogren <philip.og...@oracle.com>
Subject Re: RDD[URI]
Date Thu, 30 Jan 2014 18:25:38 GMT
Thank you for the links!  These look very useful.

I do not have a precise use case - at this point I'm just exploring what 
is possible/feasible.  As the blog post suggests, I might have a bunch of 
images lying around and want to collect metadata from them.  In 
my case, I do a lot of NLP, so I would like to process text from a 
large collection of documents, perhaps after running them through Tika.  Both 
of these use cases seem closely related from a Spark user's perspective.
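For the document case, something like the following sketch is what I have in 
mind.  It assumes Tika is on the classpath and that every worker node can 
read the files; the paths are made up for illustration:

    import java.io.File
    import org.apache.tika.Tika

    // Hypothetical paths; assumes Tika is on the classpath and the
    // files are readable from every worker node.
    val paths = spark.parallelize(Seq("/data/docs/a.pdf", "/data/docs/b.doc"))
    val texts = paths.mapPartitions { it =>
      val tika = new Tika()  // one parser instance per partition
      it.map(p => (p, tika.parseToString(new File(p))))
    }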



On 1/30/2014 11:02 AM, Nick Pentreath wrote:
> What is the precise use case and reasoning behind wanting to work on a 
> File as the "record" in an RDD?
>
> CombineFileInputFormat may be useful in some way: 
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
>
> https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/MultiFileWordCount.java
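To make that concrete, here is a rough sketch of wiring a combine-style input 
format into Spark.  It assumes a Hadoop version that ships 
CombineTextInputFormat (the new-API analogue of what the blog post builds by 
hand); the path is hypothetical:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    // Packs many small files into fewer splits; records are still
    // individual lines keyed by byte offset, not whole files.
    val lines = spark.newAPIHadoopFile(
      "hdfs:///data/small-files",
      classOf[CombineTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map(_._2.toString)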
>
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox> for iPhone
>
>
> On Thu, Jan 30, 2014 at 7:34 PM, Christopher Nguyen <ctn@adatao.com> wrote:
>
>     Philip, I guess the key problem statement is the "large collection
>     of" part? If so this may be helpful, at the HDFS level:
>     http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.
>
>     Otherwise you can always start with an RDD[fileUri] and go from
>     there to an RDD[(fileUri, read_contents)].
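That last suggestion might look roughly like this sketch, assuming the URIs 
point at HDFS (or another Hadoop-supported filesystem) and the files are 
small enough to read whole; the paths are made up:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import scala.io.Source

    val uris = spark.parallelize(Seq("hdfs:///docs/a.txt", "hdfs:///docs/b.txt"))
    val contents = uris.map { uri =>
      val path = new Path(uri)
      // Build the Configuration inside the closure so nothing
      // unserializable is captured from the driver.
      val fs = path.getFileSystem(new Configuration())
      val in = fs.open(path)
      try { (uri, Source.fromInputStream(in, "UTF-8").mkString) }
      finally { in.close() }
    }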
>
>     Sent while mobile. Pls excuse typos etc.
>
> On Jan 30, 2014 9:13 AM, "尹绪森" <yinxusen@gmail.com> wrote:
>
>         I am also interested in this. My current solution is to turn
>         each file into a single line of text, i.e. delete all '\n'
>         characters, then prepend the filename to the line, separated
>         by a space:
>
>         [filename] [space] [content]
>
>         Does anyone have better ideas?
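Reading that format back is then a small sketch like the following (path 
hypothetical), splitting on the first space only so spaces inside the 
content survive:

    val docs = spark.textFile("hdfs:///data/flattened.txt").map { line =>
      // limit = 2 splits on the first space only; assumes every line
      // actually contains a space after the filename.
      val Array(name, content) = line.split(" ", 2)
      (name, content)
    }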
>
>         On 2014-1-31 12:18 AM, "Philip Ogren"
>         <philip.ogren@oracle.com> wrote:
>
>             In my Spark programming thus far, my unit of work has been
>             a single row from an HDFS file, obtained by creating an
>             RDD[Array[String]] with something like:
>
>             spark.textFile(path).map(_.split("\t"))
>
>             Now, I'd like to do some work over a large collection of
>             files in which the unit of work is a single file (rather
>             than a row from a file). Does Spark anticipate users
>             creating an RDD[URI] or RDD[File] or some such, and does
>             it support the actions and transformations one might want
>             to perform on such an RDD?  Any advice and/or code snippets
>             would be appreciated!
>
>             Thanks,
>             Philip
>
>

