spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From dddaaa <>
Subject How does readStream() and writeStream() work?
Date Fri, 03 Aug 2018 12:19:32 GMT
I'm wondering how does readStream() and writeStream() work internally
Lets take a simple example:

df = spark.readStream \
		.format("kafka") \
		.option("kafka.bootstrap.servers", kafka_brokers) \
		.option("subscribe", kafka_topic) \
		.load() \

res =
		.option("checkpointLocation", hdfs_dir + "checkpoint") \
		.option("path",  hdfs_dir + "data") \
		.trigger(processingTime= ' 60 seconds') \

This code read a kafka topic and then writes it to hdfs every 60 seconds.

My questions are:
1. when running readStream() what happens? does the spark job do something
or is it just like a "transformation" in spark's terminology and nothing
actually happens until an action is called?
2. writeStream() is started and the wrtiing happend every 60 seconds. Can I
intervene somehow in what happens when the actual writing occurs? for
example, can I write a log message "60 seconds passed, writing bulk to hdfs"
each time ?
3. is it possible to write to the same hdfs file each time the actual
writing occurs? for now it creates a new hdfs file each time.


Sent from:

To unsubscribe e-mail:

View raw message