spark-user mailing list archives

From Chanh Le <giaosu...@gmail.com>
Subject How to struct data in parquet format?
Date Mon, 04 Jul 2016 03:28:13 GMT
Hi everyone,
I am building a query engine for internal operational use. My data needs to be updated hourly and
daily. I am using Spark, Alluxio, and Zeppelin, storing the data as Parquet files.
The structure of my data is FILE_NAME/network_id=xxxx/time=yyyyy/


When I update data for one hour, I just keep appending to this path:


import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).partitionBy("network_id", "time").parquet(s"alluxio://master1:19998/$folderName")

Is that a good way to do it? I partition the data this way because my queries filter on two
columns, network and time.
However, I saw in the logs that when I add new data, Spark needs to OPEN every existing
partition folder before writing the new data into place.
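
For reference, this is roughly the query pattern the layout is meant to serve (a minimal
sketch; the filter values 1234 and "2016070400" are made up for illustration):

import sqlContext.implicits._

// Minimal sketch, assuming hypothetical filter values; with this layout,
// Spark can prune to the matching network_id/time directories instead of
// scanning the whole dataset.
val df = sqlContext.read.parquet(s"alluxio://master1:19998/$folderName")
df.filter($"network_id" === 1234 && $"time" === "2016070400").show()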

Thanks.