spark-user mailing list archives

From Deepak Sharma <deepakmc...@gmail.com>
Subject Re: Optimized way to use spark as db to hdfs etl
Date Sat, 05 Nov 2016 16:12:12 GMT
Hi Rohit
You can use an accumulator and increment it as each record is processed.
Once the write action finishes, read the accumulator's value on the
driver; that gives you the count without a second scan.
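A minimal sketch of that idea, built around the snippet below (names like jdbcUrl, query, and the output path are placeholders carried over from your code; assumes the Spark 2.x Java API, where a pass-through map bumps a LongAccumulator before the write):

```java
import java.util.Properties;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.util.LongAccumulator;

public class EtlWithCount {

    static long etlFunction(SparkSession spark, String jdbcUrl, String query) {
        // Driver-side accumulator; executors add to it, driver reads it.
        LongAccumulator rows =
            spark.sparkContext().longAccumulator("rowsWritten");

        Properties properties = new Properties();
        properties.put("fetchSize", "5000");

        Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties);

        // Pass-through map: each row is returned unchanged, but the
        // accumulator is incremented as the row flows through the write.
        Dataset<Row> counted = dataset.map(
            (MapFunction<Row, Row>) row -> {
                rows.add(1);
                return row;
            },
            RowEncoder.apply(dataset.schema()));

        counted.write().format("parquet").save("pdfs-path");

        // The write action has materialized every row, so the
        // accumulator now holds the total; no extra count() job runs.
        return rows.value();
    }
}
```

One caveat worth knowing: accumulators count task attempts, so if a stage is retried the value can overcount; for an exact figure the records-written metric in the Spark UI (or a count() on the written Parquet, which only reads footers) is the safer check.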

HTH
Deepak

On Nov 5, 2016 20:09, "Rohit Verma" <rohit.verma@rokittech.com> wrote:

> I am using Spark to read from a database and write to HDFS as Parquet.
> Here is a code snippet.
>
> private long etlFunction(SparkSession spark) {
>     spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY");
>     Properties properties = new Properties();
>     properties.put("driver", "oracle.jdbc.driver");
>     properties.put("fetchSize", "5000");
>     Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties);
>     dataset.write().format("parquet").save("pdfs-path");
>     return dataset.count();
> }
>
> When I look at the Spark UI during the write, the number of records
> written is visible in the SQL tab under the query plan.
>
> The count() itself, however, is a heavy task, since it triggers a
> second read of the source.
>
> Can someone suggest the most optimized way to get the count?
>
> Thanks all..
>
