spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: access hdfs file name in map()
Date Fri, 30 May 2014 05:27:59 GMT
Currently there is no way to do this using textFile(). However, you
could pretty straightforwardly define your own subclass of HadoopRDD [1] in
order to get access to this information (likely using
mapPartitionsWithIndex to look up the InputSplit for a particular
partition).
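
For concreteness, here is a minimal sketch of that idea. Rather than
subclassing, it casts the RDD returned by sc.hadoopFile() to HadoopRDD and
uses the mapPartitionsWithInputSplit developer API (assuming a Spark build
that exposes it) to recover each partition's file name; sc is the
SparkContext, and the path glob is taken from Simon's example:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    // sc.hadoopFile() builds a HadoopRDD under the hood, so this cast holds.
    val hadoopRdd = sc
      .hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://test/path/*")
      .asInstanceOf[HadoopRDD[LongWritable, Text]]

    // Each partition corresponds to one InputSplit; for file-based input
    // formats that is a FileSplit, which carries the file's path.
    val linesWithFileName = hadoopRdd.mapPartitionsWithInputSplit(
      (split: InputSplit, iter: Iterator[(LongWritable, Text)]) => {
        val fileName = split.asInstanceOf[FileSplit].getPath.toString
        // Hadoop reuses the Text object, so materialize it with toString.
        iter.map { case (_, line) => (fileName, line.toString) }
      },
      preservesPartitioning = true
    )

Simon's XXX then becomes the first element of each tuple, e.g.
linesWithFileName.map { case (file, line) =>
  val p = line.split(","); (file, p(0), p(2)) }.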

Note that sc.textFile() is just a convenience function to construct a new
HadoopRDD [2].

[1] HadoopRDD:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93
[2] sc.textFile():
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456
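
For context, [2] amounts to roughly the following (a simplified
paraphrase; the real method also takes a minPartitions argument), which
shows why the file name is dropped before your map() ever runs:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.rdd.RDD

    // Approximately sc.textFile(path): build a HadoopRDD over TextInputFormat
    // and keep only the line contents, discarding the (offset, line) pairs;
    // that is why the file name is no longer visible downstream.
    def textFileEquivalent(path: String): RDD[String] =
      sc.hadoopFile(path, classOf[TextInputFormat],
                    classOf[LongWritable], classOf[Text])
        .map(pair => pair._2.toString)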


On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen <xchenum@gmail.com> wrote:

> Hello,
>
> A quick question about using Spark to parse CSV files stored on
> HDFS.
>
> I have something very simple:
> sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
> (XXX, p(0), p(2)))
>
> Here, I want to replace XXX with a string: the name of the CSV file the
> line came from. This is needed since some information, such as a date,
> may be encoded in the file name.
>
> In Hive, I am able to define an external table and use INPUT__FILE__NAME
> as a column in queries. I wonder whether Spark has something similar.
>
> Thanks!
> -Simon
>
