spark-user mailing list archives

From Uri Laserson <laser...@cloudera.com>
Subject Access original filename in a map function
Date Wed, 19 Mar 2014 00:12:35 GMT
Hi spark-folk,

I have a directory full of files that I want to process using PySpark.
 There is some necessary metadata in the filename that I would love to
attach to each record in that file.  Using Java MapReduce, I would access

((FileSplit) context.getInputSplit()).getPath().getName()

in the setup() method of the mapper.

Using Hadoop Streaming, I can access the environment variable
map_input_file to get the filename.
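
For reference, a minimal sketch of the Streaming approach in Python. The helper name tag_lines and the tab-separated output are illustrative assumptions, and the fallback variable mapreduce_map_input_file is the name I believe newer Hadoop releases use for the same value:

```python
import os


def tag_lines(lines, environ=os.environ):
    """Yield each input line prefixed with the base name of its source file."""
    # Hadoop Streaming exports job configuration as environment variables.
    # map_input_file is the name used above; mapreduce_map_input_file is
    # assumed to be the equivalent on newer Hadoop releases.
    path = environ.get("map_input_file") or environ.get("mapreduce_map_input_file", "")
    # Fall back to a placeholder when neither variable is set (e.g. local runs).
    filename = os.path.basename(path) or "unknown"
    for line in lines:
        yield filename + "\t" + line.rstrip("\n")
```

In a mapper script this would be driven with something like `for record in tag_lines(sys.stdin): print(record)`.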

Is there something I can do in PySpark to get the filename?  Surely, one
solution would be to get the list of files first, load each one as an RDD
separately, and then union them together.  But listing the files in HDFS is
a bit annoying through Python, so I was wondering if the filename is
somehow attached to a partition.
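
A rough sketch of that list-then-union workaround, assuming the list of paths has already been obtained some other way (the listing step is not shown). The function name load_with_filenames is hypothetical; sc is an existing SparkContext, and only textFile, map, and SparkContext.union are used:

```python
import os


def load_with_filenames(sc, paths):
    """Return one RDD of (filename, line) pairs built from the given paths.

    sc is an existing SparkContext; paths is a non-empty list of file paths
    that the caller has already collected.
    """
    tagged = []
    for path in paths:
        name = os.path.basename(path)
        # Bind name as a default argument so each lambda keeps its own filename
        # instead of all of them seeing the last value of the loop variable.
        tagged.append(sc.textFile(path).map(lambda line, name=name: (name, line)))
    # Combine the per-file RDDs into a single RDD.
    return sc.union(tagged)
```

This creates one RDD per file, so it may not scale well to directories with a very large number of files.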

Thanks!

Uri

-- 
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laserson@cloudera.com
