spark-user mailing list archives

From Krzysztof Zarzycki <k.zarzy...@gmail.com>
Subject Re: Getting all files of a table
Date Tue, 01 Dec 2015 20:01:20 GMT
Great, that worked! The only problem was that it returned all the files,
including _SUCCESS and _metadata, but I filtered for only the *.parquet
files.
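
For the archive, this is roughly what I ended up with (a sketch: the import
path is the one from the sparksql-protobuf README, "my_table" is a
placeholder, and I'm assuming ProtoParquetRDD accepts a comma-separated
path list the way Hadoop's FileInputFormat does):

import com.github.saurfang.parquet.proto.spark.ProtoParquetRDD

// inputFiles lists the table's data files; keep only *.parquet to drop
// the _SUCCESS and _metadata entries.
val parquetFiles = sqlContext.table("my_table").inputFiles
  .filter(_.endsWith(".parquet"))

val protobufsRdd =
  new ProtoParquetRDD(sc, parquetFiles.mkString(","), classOf[MyProto])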

Thanks Michael,
Krzysztof


2015-12-01 20:20 GMT+01:00 Michael Armbrust <michael@databricks.com>:

> sqlContext.table("...").inputFiles
>
> (this is best effort, but it should work for Hive tables).
>
> Michael
>
> On Tue, Dec 1, 2015 at 10:55 AM, Krzysztof Zarzycki <k.zarzycki@gmail.com>
> wrote:
>
>> Hi there,
>> Do you know how I can easily get a list of all the files of a Hive table?
>>
>> What I want to achieve is to get all the files underneath a Parquet
>> table, and then, using the sparksql-protobuf library [1] (a really handy
>> library!) and its helper class ProtoParquetRDD:
>>
>> val protobufsRdd = new ProtoParquetRDD(sc, "files", classOf[MyProto])
>>
>> access the underlying Parquet files as normal protocol buffers. But I
>> don't know how to get the file list. When I pointed the call above at a
>> single file by hand, it worked well.
>> The Parquet table was created with the same library and its implicit
>> hiveContext extension createDataFrame, which builds a DataFrame from a
>> protocol buffer class.
>>
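>> Roughly like this (a sketch from memory: the implicit import is the one
>> shown in the sparksql-protobuf README, and existingProtos stands for
>> whatever collection of MyProto messages you already have):
>>
>> import com.github.saurfang.parquet.proto.spark.sql._
>>
>> // The import brings in the implicit that adds createDataFrame for
>> // RDDs of protocol buffer messages.
>> val protos = sc.parallelize(existingProtos)
>> val df = hiveContext.createDataFrame(protos)
>> df.write.parquet("/path/to/my_table") // or saveAsTable(...)
>>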
>> (The reverse read path is needed to support legacy code: after converting
>> the protocol buffers to Parquet, I still want some code to be able to
>> access the Parquet data as normal protocol buffers.)
>>
>> Maybe someone knows another way to get an RDD of protocol buffers from a
>> Parquet-backed table.
>>
>> [1] https://github.com/saurfang/sparksql-protobuf
>>
>> Thanks!
>> Krzysztof
