spark-user mailing list archives

From Matt K <matvey1...@gmail.com>
Subject extracting file path using dataframes
Date Tue, 01 Sep 2015 23:00:58 GMT
Hi all,

TL;DR - is there a way to extract the source path from an RDD via the Scala
API?

I have sequence files on S3 that look something like this:
s3://data/customer=123/...
s3://data/customer=456/...

I am using Spark DataFrames to convert these sequence files to Parquet. As
part of the processing, I actually need to know the customer-id. I'm doing
something like this:

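import org.apache.hadoop.io.{BytesWritable, Text}

// sql is assumed to be a SQLContext; the glob reads every customer's files at once
val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*",
  classOf[BytesWritable],
  classOf[Text])

// Convert each Text value into a Row that matches the schema
val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema, delimiter))

val dataFrame = sql.createDataFrame(rowRdd, schema)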


What I am trying to figure out is how to get the customer-id, which is part
of the path. I am not sure if there's a way to extract the source path from
the resulting HadoopRDD. Do I need to create one RDD per customer to get
around this?


Thanks,

-Matt
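
One possible approach to the question above, offered as a sketch rather than a confirmed answer: HadoopRDD exposes a developer API, mapPartitionsWithInputSplit, that passes the input split of each partition to the mapping function. For file-based input formats the split is normally a FileSplit, which carries the file path, so the customer id can be parsed from it once per partition instead of creating one RDD per customer. The object name CustomerPaths and the helpers customerIdFromPath and withCustomerId are hypothetical names introduced here. The cast to HadoopRDD assumes that sequenceFile returns a HadoopRDD at runtime, which may vary by Spark version.

```scala
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{HadoopRDD, RDD}

object CustomerPaths {
  // Matches the "customer=<id>" path segment used in the layout from the question.
  private val CustomerSegment = "customer=([^/]+)".r

  /** Returns the customer id embedded in a path, or None if no such segment exists. */
  def customerIdFromPath(path: String): Option[String] =
    CustomerSegment.findFirstMatchIn(path).map(_.group(1))

  /** Pairs the text of each record with the customer id of the file it was read from. */
  def withCustomerId(sc: SparkContext, glob: String): RDD[(Option[String], String)] = {
    val rdd = sc.sequenceFile(glob, classOf[BytesWritable], classOf[Text])

    // Assumption: the RDD returned by sequenceFile is a HadoopRDD at runtime.
    val hadoopRdd = rdd.asInstanceOf[HadoopRDD[BytesWritable, Text]]

    hadoopRdd.mapPartitionsWithInputSplit {
      (split: InputSplit, records: Iterator[(BytesWritable, Text)]) =>
        // Assumption: file-based input formats hand back FileSplit instances.
        val path = split.asInstanceOf[FileSplit].getPath.toString
        val customerId = customerIdFromPath(path)

        // Hadoop reuses Writable objects, so copy each value into a String
        // before emitting it.
        records.map { case (_, value) => (customerId, value.toString) }
    }
  }
}
```

With a hypothetical file name such as part-00000, customerIdFromPath("s3://data/customer=123/part-00000") yields Some("123"). The resulting pairs could then be passed to a row-conversion step like the one in the message, with the customer id added as an extra column in the schema.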
