spark-user mailing list archives

From Daniel Haviv <danielru...@gmail.com>
Subject Merging Parquet Files
Date Wed, 19 Nov 2014 08:41:56 GMT
Hello,
I'm writing a process that ingests JSON files and saves them as Parquet
files.
The process is as follows:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRequests=sqlContext.jsonFile("/requests")
val parquetRequests=sqlContext.parquetFile("/requests_parquet")

jsonRequests.registerTempTable("jsonRequests")
parquetRequests.registerTempTable("parquetRequests")

val unified_requests=sqlContext.sql("select * from jsonRequests union " +
  "select * from parquetRequests")

unified_requests.saveAsParquetFile("/tempdir")

I then delete /requests_parquet and rename /tempdir to /requests_parquet.

Is there a better way to achieve this?

Another problem is that I receive many small JSON files and, as a result,
end up with many small Parquet files. I'd like to merge the JSON files into
a few Parquet files. How do I do that?

Thank you,
Daniel
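
For the small-files question, one possible direction (a sketch only, not
something confirmed in this thread) is to reduce the number of partitions
before writing, since Spark writes roughly one Parquet part file per
partition. The sketch assumes a spark-shell session where sc is already
defined and the Spark 1.1/1.2 SchemaRDD API used in the message above;
numOutputFiles is a hypothetical value that would need tuning to the data
volume.

```scala
// Assumption: run inside spark-shell (Spark 1.1/1.2), where `sc` already exists.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val jsonRequests = sqlContext.jsonFile("/requests")

// Hypothetical target number of output files; tune to the data volume.
val numOutputFiles = 4

// coalesce merges partitions without a full shuffle. Re-applying the schema
// keeps the result a SchemaRDD even if coalesce returns a plain RDD[Row].
val merged = sqlContext.applySchema(
  jsonRequests.coalesce(numOutputFiles),
  jsonRequests.schema)

// Roughly one Parquet part file is written per partition.
merged.saveAsParquetFile("/tempdir")
```

The same coalesce step could be applied to unified_requests before the
saveAsParquetFile call in the process described above.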
