spark-user mailing list archives

From anbutech <anbutec...@outlook.com>
Subject Record count query parallel processing in databricks spark delta lake
Date Fri, 17 Jan 2020 18:18:55 GMT
Hi,

I have a question about the design of a PySpark monitoring script for a
large volume of source JSON data coming from more than 100 Kafka topics.
These topics are stored under a separate bucket in AWS S3. Each Kafka
topic holds multiple terabytes of JSON data, partitioned by year, month,
day, and hour. Each hour contains many JSON files in .gz compressed format.

What is the best way to process multiple terabytes of data read from S3,
partitioned by year, month, day, and hour, for all the source topics?

We are using Delta Lake on the Databricks platform. The query takes a
long time to return the record counts by year, month, and date.

What is the best approach to handle terabytes of data and get the record
counts for all the days?

Please help me with the problem below:

topics_list.csv
--------------
I'm planning to put all 150 topics in a CSV file, then read it and process
the data to get the daily record count.

I have to iterate over the topics from the CSV file one by one, using a for
loop or some other option, passing the year, month, and date as arguments
to get the record count for that particular day for all the topics.
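
To make the plan concrete, here is a minimal sketch of the first step in
plain Python: read the topic names and build one input path per topic for a
given day. The helper names read_topics and day_path are hypothetical. The
CSV shape (one topic name per row, first column) and the path layout
(zero-padded year/month/day directories, with a wildcard over the hour
directories) are assumptions based on the example path in this message, not
a confirmed layout.

```python
import csv
import io


def read_topics(csv_text):
    """Return topic names from CSV text, assuming one topic per row in column 1."""
    reader = csv.reader(io.StringIO(csv_text))
    # Skip empty rows and rows whose first cell is blank.
    return [row[0].strip() for row in reader if row and row[0].strip()]


def day_path(bucket, topic, year, month, day):
    """Build the input path covering every hour of one day for one topic.

    Assumed layout: s3a://<bucket>/<topic>/<yyyy>/<mm>/<dd>/<hour>/
    The trailing wildcard matches all hour directories of that day.
    """
    return f"s3a://{bucket}/{topic}/{year:04d}/{month:02d}/{day:02d}/*/"


topics = read_topics("topic1\ntopic2\n")
paths = {t: day_path("kafka-bucket_name", t, 2020, 1, 17) for t in topics}
```

Each value in paths could then be passed to spark.read.json, so a single
read covers the whole day instead of one read per hour.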

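# Read the JSON files for one topic and one hour partition
df = spark.read.json("s3a://kafka-bucket_name/topic_name/year/month/day/hour/")

# The view name must be passed as a string
df.createOrReplaceTempView("topic1_source")

spark.sql("select count(1) from topic1_source")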

Could you help me or suggest a good approach to run the query in parallel
for all the topics, so that I get the daily record count for all 150
topics efficiently using Apache Spark and Delta Lake on Databricks?
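
One option I am considering, shown here only as a rough sketch, is to submit
the per-topic counts from a thread pool on the driver, since Spark can run
jobs submitted from separate threads concurrently within one application.
The names count_all and spark_count are hypothetical, the worker count of 8
is an arbitrary placeholder, and I have not measured whether this is
actually faster on our data.

```python
from concurrent.futures import ThreadPoolExecutor


def count_all(paths_by_topic, count_fn, max_workers=8):
    """Call count_fn(path) for every topic concurrently; return {topic: count}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per topic, then collect the results by topic name.
        futures = {
            topic: pool.submit(count_fn, path)
            for topic, path in paths_by_topic.items()
        }
        return {topic: future.result() for topic, future in futures.items()}


def spark_count(spark):
    """Return a function that counts the JSON records under a path.

    `spark` is an existing SparkSession (predefined in a Databricks notebook).
    """
    return lambda path: spark.read.json(path).count()
```

In a notebook this would be called as
count_all(paths_by_topic, spark_count(spark)), where paths_by_topic maps
each topic name to its input path for the day.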

Thanks