spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chang Chen <baibaic...@gmail.com>
Subject Re: Is RDD thread safe?
Date Mon, 25 Nov 2019 04:21:30 GMT
Sorry I did't describe clearly,  RDD id itself is thread-safe, how about
cached data?

See codes from BlockManager

def getOrElseUpdate(...)   = {
  get[T](blockId)(classTag) match {
   case ...
   case _ =>                                      // 1. no data is cached.
    // Need to compute the block
 }
 // Initially we hold no locks on this block
 doPutIterator(...) match{..}
}

Considering  two DAGs (contain the same cached RDD ) runs simultaneously,
if both returns none  when they get same block from BlockManager(i.e. #1
above), then I guess the same data would be cached twice.

If the later cache could override the previous data, and no memory is
waste, then this is OK

Thanks
Chang


Weichen Xu <weichen.xu@databricks.com> 于2019年11月25日周一 上午11:52写道:

> Rdd id is immutable and when rdd object created, the rdd id is generated.
> So why there is race condition in "rdd id" ?
>
> On Mon, Nov 25, 2019 at 11:31 AM Chang Chen <baibaichen@gmail.com> wrote:
>
>> I am wonder the concurrent semantics for reason about the correctness. If
>> the two query simultaneously run the DAGs which use the same cached
>> DF\RDD,but before cache data actually happen, what will happen?
>>
>> By looking into code a litter, I suspect they have different BlockID for
>> same Dataset which is unexpected behavior, but there is no race condition.
>>
>> However RDD id is not lazy, so there is race condition.
>>
>> Thanks
>> Chang
>>
>>
>> Weichen Xu <weichen.xu@databricks.com> 于2019年11月12日周二 下午1:22写道:
>>
>>> Hi Chang,
>>>
>>> RDD/Dataframe is immutable and lazy computed. They are thread safe.
>>>
>>> Thanks!
>>>
>>> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen <baibaichen@gmail.com>
>>> wrote:
>>>
>>>> Hi all
>>>>
>>>> I meet a case where I need cache a source RDD, and then create
>>>> different DataFrame from it in different threads to accelerate query.
>>>>
>>>> I know that SparkSession is thread safe(
>>>> https://issues.apache.org/jira/browse/SPARK-15135), but i am not sure
>>>> whether RDD  is thread safe or not
>>>>
>>>> Thanks
>>>>
>>>

Mime
View raw message