spark-user mailing list archives

From Patrick McCarthy <pmccar...@dstillery.com.INVALID>
Subject Best practices for data lake file storage
Date Fri, 01 Nov 2019 15:33:28 GMT
Hi List,

I'm looking for resources to learn about how to store data on disk for
later access.

For a while my team has been using Spark on top of our existing HDFS/Hive
cluster without much say in what format the data is stored in. I'd like to
learn more about how to re-stage my data to speed up my own analyses, and
to start building the expertise to define new data stores.

One example of a problem I'm facing is data that is written to Hive using
a customized protobuf SerDe. The data contains many very complex types
(arrays of structs of arrays of...), and I often need only a few elements
of any particular record, yet the format forces Spark to deserialize the
entire object.
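
For instance, my (possibly naive) understanding is that if I re-staged
that table as Parquet, nested-schema pruning would let Spark read only the
leaves I ask for. A rough sketch of what I have in mind, with made-up
table and column names:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("restage-protobuf-table")
             .enableHiveSupport()
             # Prune unused leaves of nested structs on Parquet reads
             # (on by default in Spark 3.x, opt-in via this flag in 2.4).
             .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
             .getOrCreate())

    # One last full pass through the protobuf SerDe...
    events = spark.table("warehouse.events_protobuf")  # hypothetical table

    # ...re-staged as Parquet so later queries get column pruning
    # instead of deserializing every record in full.
    (events.write
           .mode("overwrite")
           .format("parquet")
           .saveAsTable("warehouse.events_parquet"))

    # Later analyses then touch only the fields they need:
    slim = spark.table("warehouse.events_parquet").select(
        "event_id",           # hypothetical top-level column
        "payload.device.os",  # hypothetical nested leaf
    )

Is that roughly the right pattern?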

The sorts of information I'm looking for:

   - Do's and don'ts of laying out a Parquet schema
   - Measuring / debugging read speed
   - How to bucket, index, etc. (a rough sketch of what I mean follows
   this list)
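
For the bucketing item, the kind of thing I imagine, again with made-up
names and assuming the SparkSession from the sketch above, is roughly:

    # Co-locate rows by a common join/filter key, sorted within buckets;
    # bucketBy requires saveAsTable, so this goes through the metastore.
    (spark.table("warehouse.events_parquet")
          .write
          .format("parquet")
          .bucketBy(64, "user_id")   # hypothetical join/filter key
          .sortBy("event_time")      # hypothetical sort column
          .saveAsTable("warehouse.events_bucketed"))

    # And for read-speed debugging, inspecting the physical plan to see
    # whether column pruning and filter pushdown actually happened:
    (spark.table("warehouse.events_bucketed")
          .select("event_id")
          .where("user_id = 42")
          .explain(True))

Is that the right direction, or are there better patterns?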

Thanks!
