spark-user mailing list archives

From Aaron Davidson
Subject Re: access hdfs file name in map()
Date Fri, 30 May 2014 05:27:59 GMT
Currently there is no way to do this using textFile(). However, you
could fairly easily define your own subclass of HadoopRDD [1] in
order to get access to this information (likely using
mapPartitionsWithIndex to look up the InputSplit for a particular
partition).

Note that sc.textFile() is just a convenience function to construct a new
HadoopRDD [2].
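
For illustration, here is a rough sketch of the same idea. It does not subclass HadoopRDD; instead it assumes a Spark version in which HadoopRDD exposes the developer API mapPartitionsWithInputSplit, and it assumes the input is read with the old-API TextInputFormat so that every split is a FileSplit. The names FileNameTagging, tagLine and linesWithFileName are made up for this example.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{HadoopRDD, RDD}

object FileNameTagging {
  // Hypothetical helper: pair the source file name with fields 0 and 2 of a
  // CSV line. Throws if the line has fewer than three fields.
  def tagLine(fileName: String, line: String): (String, String, String) = {
    val p = line.split(",", -1) // -1 keeps trailing empty fields
    (fileName, p(0), p(2))
  }

  // Hypothetical helper: read text files and tag every line with the name of
  // the file it came from.
  def linesWithFileName(sc: SparkContext, path: String): RDD[(String, String, String)] = {
    // sc.hadoopFile is declared to return an RDD; the cast assumes the
    // underlying object is a HadoopRDD, as it is for this input format.
    val hadoopRdd = sc
      .hadoopFile[LongWritable, Text, TextInputFormat](path)
      .asInstanceOf[HadoopRDD[LongWritable, Text]]

    hadoopRdd.mapPartitionsWithInputSplit {
      (split: InputSplit, records: Iterator[(LongWritable, Text)]) =>
        // One partition corresponds to one input split, hence to one file.
        val fileName = split.asInstanceOf[FileSplit].getPath.getName
        // Convert each Text to a String right away, since Hadoop reuses the
        // same Text object across records.
        records.map { case (_, text) => tagLine(fileName, text.toString) }
    }
  }
}
```

With something like this, FileNameTagging.linesWithFileName(sc, "hdfs://test/path/*") would stand in for the textFile(...).map(...) chain in the question. If the developer API is not available in your version, the subclassing route described above is still the fallback.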

[1] HadoopRDD:
[2] sc.textFile():

On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen wrote:

> Hello,
> A quick question about using Spark to parse text-format CSV files stored
> on HDFS.
> I have something very simple:
> sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
> (XXX, p(0), p(2)))
> Here, I want to replace XXX with a string, which is the current csv
> filename for the line. This is needed since some information may be encoded
> in the file name, like date.
> In Hive, I am able to define an external table and use INPUT__FILE__NAME
> as a column in queries. I wonder if Spark has something similar.
> Thanks!
> -Simon
