You can make a Hadoop input format that passes through the name of the file. I generally find it easier, though, to just hit Hadoop for the file names and construct the RDDs myself.
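A minimal sketch of the "get the file names and construct the RDDs" approach: list the `customer=*` directories with Hadoop's FileSystem API, pull the id out of each directory name, and build one sequence-file RDD per customer. The Spark/Hadoop wiring is shown as comments since it needs a cluster; `customerIdFromDir` is a hypothetical helper, not anything from Spark itself.

```scala
// Hypothetical helper: extract the id from a Hive-style partition
// directory name such as "customer=123".
def customerIdFromDir(dirName: String): Option[String] =
  dirName.split("=", 2) match {
    case Array("customer", id) => Some(id)
    case _                     => None
  }

// Sketch of the Spark side (not compiled here; assumes `sc` is a
// SparkContext and the imports org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.{BytesWritable, Text}):
//
//   val fs   = new Path("s3://data/").getFileSystem(sc.hadoopConfiguration)
//   val dirs = fs.globStatus(new Path("s3://data/customer=*")).map(_.getPath)
//   val perCustomer = dirs.flatMap { dir =>
//     customerIdFromDir(dir.getName).map { id =>
//       id -> sc.sequenceFile(dir.toString, classOf[BytesWritable], classOf[Text])
//     }
//   }

println(customerIdFromDir("customer=123"))   // Some(123)
println(customerIdFromDir("part-00000"))     // None
```

Each RDD in `perCustomer` then carries its customer id alongside it, so you can tag rows (or write each customer's Parquet output separately) without ever needing the path at record level.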

On Tuesday, September 1, 2015, Matt K <matvey1414@gmail.com> wrote:
Just want to add - I'm looking to partition the resulting Parquet files by customer-id, which is why I'm looking to extract the customer-id from the path.

On Tue, Sep 1, 2015 at 7:00 PM, Matt K <matvey1414@gmail.com> wrote:
Hi all,

TL;DR - is there a way to extract the source path from an RDD via the Scala API?

I have sequence files on S3 that look something like this:
s3://data/customer=123/...
s3://data/customer=456/...

I am using Spark DataFrames to convert these sequence files to Parquet. As part of the processing, I actually need to know the customer-id. I'm doing something like this:

val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*", classOf[BytesWritable], classOf[Text])

val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema, delimiter))

val dataFrame = sql.createDataFrame(rowRdd, schema)


What I am trying to figure out is how to get the customer-id, which is part of the path. I am not sure if there's a way to extract the source path from the resulting HadoopRDD. Do I need to create one RDD per customer to get around this?


Thanks,

-Matt




--
www.calcmachine.com - easy online calculator.