spark-user mailing list archives

From "Xu (Simon) Chen" <xche...@gmail.com>
Subject Re: access hdfs file name in map()
Date Fri, 01 Aug 2014 17:42:34 GMT
Hi Roberto,

Ultimately, the info you need is set here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69

Being a Spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as
HadoopRDDWithEnv, which takes an additional parameter (varName) in the
constructor, and overrode the compute() function to return something like
"""split.getPipeEnvVars.getOrElse(varName, "") + "|" + value.toString()"""
as the value. This is obviously less general and makes certain assumptions
about the input data. You also need to write several wrappers in
SparkContext so that you can do something like sc.textFileWithEnv("hdfs
path", "mapreduce_map_input_file").

I was hoping to do something like
sc.textFile("hdfs_path").pipe("""/usr/bin/awk
"{print\"${mapreduce_map_input_file}\",$0}" """). This gives me a weird
Kryo buffer overflow exception, though. I haven't had a chance to look into
the details yet.
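
One thing that may be worth checking on the awk side: as far as I know,
pipe() tokenizes the command and runs it directly rather than through a
shell, so ${mapreduce_map_input_file} may never be expanded. A minimal sketch
of an alternative is to have awk read the variable itself through its ENVIRON
array. The file path and the tag_lines helper below are made up for
illustration, and this is not a confirmed fix for the Kryo exception.

```shell
# Hypothetical stand-in for the variable that Spark exports to the piped process.
export mapreduce_map_input_file="hdfs://namenode/data/part-00000"

# Prefix every input line with the file name. awk reads the variable from its
# own ENVIRON array, so no shell expansion of ${...} is required.
tag_lines() {
  awk '{ print ENVIRON["mapreduce_map_input_file"], $0 }'
}

# Simulate two records arriving on stdin, as a piped process would see them.
printf 'first line\nsecond line\n' | tag_lines
```

Because the awk program contains no shell variables, it should behave the
same whether or not a shell sits between Spark and awk.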

-Simon



On Fri, Aug 1, 2014 at 7:38 AM, Roberto Torella <roberto.torella@gmail.com>
wrote:

> Hi Simon,
>
> I'm trying to do the same but I'm quite lost.
>
> How did you do that? (Too direct? :)
>
>
> Thanks and ciao,
> r-
>
>
>
