spark-dev mailing list archives

From Abhishek Somani <abhisheksoman...@gmail.com>
Subject Custom datasource: when acquire and release a lock?
Date Fri, 24 May 2019 13:36:46 GMT
Hi experts,

I am trying to create a custom Spark DataSource (v1) to read from a
transactional data endpoint. I need to acquire a lock with the endpoint
before fetching data and release the lock after reading. Note that the lock
acquisition and release need to happen in the driver JVM.

I have created a custom RDD for this purpose. I tried acquiring the lock
in MyRDD.getPartitions(), and releasing it at the end of the job by
registering a QueryExecutionListener.
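For concreteness, here is a plain-Scala model of that lifecycle (no Spark dependency; `LockClient`, `MyRdd` and `onJobEnd` are hypothetical stand-ins for the endpoint's lock API, a DataSource v1 RDD, and the QueryExecutionListener callback respectively):

```scala
// Hypothetical stand-in for the endpoint's lock API -- not a real client.
class LockClient {
  var held = false
  var acquired = 0
  var released = 0
  def acquire(): Unit = { held = true; acquired += 1 }
  def release(): Unit = { held = false; released += 1 }
}

// Model of the attempted flow: the RDD takes the lock when its partitions
// are computed, and a job-end callback (playing the role of a
// QueryExecutionListener) releases it on the driver.
class MyRdd(lock: LockClient) {
  def getPartitions: Array[Int] = {
    lock.acquire()            // acquire before fetching partition metadata
    Array(0, 1, 2)            // pretend partitions from the endpoint
  }
  def collect(): Seq[Int] = getPartitions.toSeq
}

def onJobEnd(lock: LockClient): Unit = lock.release()

val lock = new LockClient
val rdd  = new MyRdd(lock)
rdd.collect()                 // lock acquired for the duration of the job
onJobEnd(lock)                // listener analogue releases it at job end
```

For a single action this balances: one acquire, one release.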

As I have now learnt, this is not the right approach: the RDD can get
reused by further actions WITHOUT getPartitions() being called again,
because an RDD caches its partitions. For example, if someone calls
Dataset.collect() twice, MyRDD.getPartitions() is invoked on the first
call, so I acquire the lock and release it at the end of that job. The
second time collect() is called, however, getPartitions() is NOT called
again, as the RDD is reused and its partitions have already been cached.
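The failure mode can be reproduced without Spark: the `lazy val` below mimics how an RDD memoizes the result of getPartitions(), so the lock is taken on the first action only, while the listener analogue fires after every action (all names here are illustrative, not Spark's):

```scala
// Counters standing in for the lock client (illustrative only).
var acquires = 0
var releases = 0

class CachedRdd {
  private def getPartitions: Array[Int] = {
    acquires += 1             // the lock would be taken here
    Array(0, 1, 2)
  }
  // Mimics RDD partition caching: getPartitions runs once, result memoized.
  lazy val partitions: Array[Int] = getPartitions
  def collect(): Seq[Int] = partitions.toSeq
}

def onJobEnd(): Unit = releases += 1   // QueryExecutionListener analogue

val rdd = new CachedRdd
rdd.collect(); onJobEnd()   // first action: partitions computed, lock taken
rdd.collect(); onJobEnd()   // second action: cached partitions, NO new acquire
```

After two actions the counts are unbalanced (one acquire, two releases), so the second collect() runs without the lock held.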

Can someone advise me on the right places to acquire and release the lock
with my data endpoint in this scenario?

Thanks a lot,
Abhishek Somani
