spark-dev mailing list archives

From: Jörn Franke <jornfra...@gmail.com>
Subject: Re: Custom datasource: when acquire and release a lock?
Date: Mon, 27 May 2019 05:07:57 GMT
What does your data source structure look like?
Can’t you release it at the end of the buildScan method?

What technology is used in the transactional data endpoint?
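
For context, a DataSource v1 scan is wired up roughly as below (a minimal
sketch; the relation name, the "url" option, and the comments about locking
are assumptions about your setup, not your actual code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.StructType

    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new EndpointRelation(sqlContext, parameters("url")) // "url" is a made-up option
    }

    class EndpointRelation(val sqlContext: SQLContext, url: String)
        extends BaseRelation with TableScan {
      override def schema: StructType = ???  // schema fetched from the endpoint
      // buildScan() runs on the driver, so a lock could be acquired here;
      // note the returned RDD is only evaluated later, when an action runs.
      override def buildScan(): RDD[Row] = ???  // e.g. new MyRDD(sqlContext.sparkContext, url)
    }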


> On 24.05.2019, at 15:36, Abhishek Somani <abhisheksomani88@gmail.com> wrote:
> 
> Hi experts,
> 
> I am trying to create a custom Spark DataSource (v1) to read from a
> transactional data endpoint. I need to acquire a lock with the endpoint
> before fetching data and release the lock after reading. Note that the
> lock acquisition and release need to happen in the driver JVM.
> 
> I have created a custom RDD for this purpose, and tried acquiring the
> lock in MyRDD.getPartitions() and releasing it at the end of the job by
> registering a QueryExecutionListener, roughly as sketched below.
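> 
> A sketch of this approach (EndpointLock is a hypothetical stand-in for
> the endpoint's lock API):
> 
>     import org.apache.spark.sql.execution.QueryExecution
>     import org.apache.spark.sql.util.QueryExecutionListener
> 
>     // Hypothetical client for the endpoint's locking API.
>     object EndpointLock {
>       def acquire(): Unit = ()
>       def release(): Unit = ()
>     }
> 
>     class ReleaseLockListener extends QueryExecutionListener {
>       // Called on the driver when an action completes successfully.
>       override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
>         EndpointLock.release()
>       // Called on the driver when an action fails.
>       override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
>         EndpointLock.release()
>     }
> 
>     // EndpointLock.acquire() is called inside MyRDD.getPartitions();
>     // the listener is registered via spark.listenerManager.register(...).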
> 
> Now, as I have learnt, this is not the right approach, because the RDD
> can get reused by further actions WITHOUT getPartitions() being called
> again (an RDD caches its partitions). For example, if someone calls
> Dataset.collect() twice: the first time, MyRDD.getPartitions() is
> invoked, I acquire the lock, and it is released at the end. The second
> time collect() is called, however, getPartitions() is NOT called again,
> as the RDD is reused and its partitions have already been cached.
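> 
> Concretely (assuming the source is registered under a made-up name
> "com.example.endpoint" and spark is the SparkSession):
> 
>     val df = spark.read.format("com.example.endpoint").load()
>     df.collect() // getPartitions() runs: lock acquired, then released by the listener
>     df.collect() // partitions are cached in the RDD: the lock is never re-acquired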
> 
> Can someone advise me on the right places to acquire and release a
> lock with my data endpoint in this scenario?
> 
> Thanks a lot,
> Abhishek Somani
