spark-user mailing list archives

From Krzysztof Zarzycki <k.zarzy...@gmail.com>
Subject Getting all files of a table
Date Tue, 01 Dec 2015 18:55:48 GMT
Hi there,
Do you know how I can easily get a list of all files of a Hive table?

What I want to achieve is to get all the files underneath a Parquet table
and then, using the sparksql-protobuf[1] library (a really handy library!)
and its helper class ProtoParquetRDD:

val protobufsRdd = new ProtoParquetRDD(sc, "files", classOf[MyProto])

access the underlying Parquet files as normal protocol buffers. But I don't
know how to get the list of files. When I pointed the call above at a
single file by hand, it worked well.
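
One possible way to collect the file list is to walk the table's storage
directory with the Hadoop FileSystem API. The sketch below is only an
illustration under stated assumptions: listTableFiles and its tableLocation
parameter are hypothetical names, and the table's location is assumed to be
known already (for example from the metastore). It skips files whose names
start with "_" or ".", on the assumption that those are markers or metadata
rather than data files.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: recursively lists the data files under a table directory.
// `tableLocation` is assumed to be the table's storage path.
def listTableFiles(tableLocation: String, conf: Configuration): Seq[String] = {
  val root = new Path(tableLocation)
  val fs = root.getFileSystem(conf)
  // The second argument asks for a recursive listing, so files inside
  // partition subdirectories are returned as well.
  val it = fs.listFiles(root, true)
  val files = ArrayBuffer.empty[String]
  while (it.hasNext) {
    val path = it.next().getPath
    val name = path.getName
    // Skip files whose own name starts with "_" or "." (e.g. _SUCCESS, .crc files).
    if (!name.startsWith("_") && !name.startsWith(".")) {
      files += path.toString
    }
  }
  files.toList
}
```

From a Spark shell this could be called as
listTableFiles("<table location>", sc.hadoopConfiguration), where the
location is a placeholder. Depending on the Spark version,
DataFrame.inputFiles may also give a best-effort list of files for a table;
whether it covers a given Hive table would need to be checked.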
The Parquet table was created with the same library and its implicit
hiveContext extension createDataFrame, which creates a DataFrame from a
protocol buffer class.

(The reverse read operation is needed to support legacy code: after
converting protocol buffers to Parquet, I still want some code to access
the Parquet data as normal protocol buffers.)

Maybe someone knows another way to get an RDD of protocol buffers from a
Parquet-stored table.

[1] https://github.com/saurfang/sparksql-protobuf

Thanks!
Krzysztof
