spark-dev mailing list archives

From Aniket Bhatnagar <aniket.bhatnagar@gmail.com>
Subject Re: Data source API | sizeInBytes should be moved to *Scan
Date Wed, 11 Feb 2015 18:47:03 GMT
Circling back on this. Did you get a chance to take another look?

Thanks,
Aniket

On Sun, Feb 8, 2015, 2:53 AM Aniket Bhatnagar <aniket.bhatnagar@gmail.com>
wrote:

> Thanks for looking into this. If this is true, isn't this an issue today?
> The default implementation of sizeInBytes is the broadcast threshold + 1.
> So, if Catalyst's cardinality estimation applies even a small filter
> selectivity, the estimate drops below the threshold and the relation gets
> broadcast. Therefore, shouldn't the default be much higher than the
> broadcast threshold?
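>
> For reference, the current default (paraphrased from the Spark sources;
> the exact config plumbing may differ between versions) looks roughly
> like this:
>
>   abstract class BaseRelation {
>     def sqlContext: SQLContext
>     def schema: StructType
>
>     // Defaults to spark.sql.defaultSizeInBytes, which is itself
>     // autoBroadcastJoinThreshold + 1: large enough that a relation
>     // with no estimate is never broadcast on its own, but any
>     // selectivity applied on top can push it under the threshold.
>     def sizeInBytes: Long = sqlContext.conf.defaultSizeInBytes
>   }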
>
> Also, since the default implementation of sizeInBytes already exists in
> BaseRelation, I am not sure why the same/similar default implementation
> can't be provided in *Scan-specific sizeInBytes functions, with Catalyst
> always trusting the size returned by the data source API (the default
> implementation being to never broadcast). Another option is to have
> sizeInBytes return Option[Long] so that Catalyst explicitly knows when
> the data source was able to estimate the size. The reason I would push
> for sizeInBytes in the *Scan interfaces is that at times the data source
> implementation can predict the output size more accurately. For example,
> data source implementations for MongoDB, ElasticSearch, Cassandra, etc.
> can easily use filter push-downs to query the underlying storage for the
> size. Such estimates will be more accurate than Catalyst's. Therefore, if
> it's not a fundamental change in Catalyst, I think this makes sense.
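>
> A minimal sketch of the Option[Long] idea (the sizeInBytes overload here
> is hypothetical, not the current API):
>
>   import org.apache.spark.rdd.RDD
>   import org.apache.spark.sql.Row
>   import org.apache.spark.sql.sources.Filter
>
>   trait PrunedFilteredScan {
>     def buildScan(requiredColumns: Array[String],
>                   filters: Array[Filter]): RDD[Row]
>
>     // Some(bytes) only when the source could actually use the
>     // pushed-down filters to estimate the output size (e.g. by
>     // querying the underlying store); None tells Catalyst to fall
>     // back to the never-broadcast default.
>     def sizeInBytes(filters: Array[Filter]): Option[Long] = None
>   }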
>
>
> Thanks,
> Aniket
>
>
> On Sat, Feb 7, 2015, 4:50 AM Reynold Xin <rxin@databricks.com> wrote:
>
>> We thought about this today after seeing this email. I actually built a
>> patch for this (adding filter/column information to data source stat
>> estimation), but ultimately dropped it due to the potential problems the
>> change could cause.
>>
>> The main problem I see is that column pruning/predicate pushdowns are
>> advisory, i.e. the data source might or might not apply those filters.
>>
>> Without significantly complicating the data source API, it is hard for
>> the optimizer (and future cardinality estimation) to know whether the
>> source actually applied the pushed-down filters/columns, and therefore
>> whether to incorporate their selectivity into its own estimates.
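>>
>> To make that concrete, here is a sketch (class name and data are made
>> up; import paths follow the 1.3-style API) of a perfectly legal
>> relation that ignores the pushed-down filters entirely:
>>
>>   import org.apache.spark.rdd.RDD
>>   import org.apache.spark.sql.{Row, SQLContext}
>>   import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
>>   import org.apache.spark.sql.types.{LongType, StructField, StructType}
>>
>>   class AdvisoryScanRelation(val sqlContext: SQLContext)
>>       extends BaseRelation with PrunedFilteredScan {
>>
>>     override def schema = StructType(Seq(StructField("id", LongType)))
>>
>>     override def buildScan(requiredColumns: Array[String],
>>                            filters: Array[Filter]): RDD[Row] = {
>>       // `filters` is advisory: returning every row is still correct,
>>       // because Catalyst evaluates the original predicates again on
>>       // top of whatever this scan produces.
>>       sqlContext.sparkContext.parallelize(1L to 100L).map(Row(_))
>>     }
>>   }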
>>
>> Imagine this scenario: a data source applies a filter and estimates the
>> filter's selectivity at 0.1, so the data it returns is already reduced
>> to 10% of the original size. Catalyst's own cardinality estimation then
>> estimates the same filter's selectivity at 0.1 again, so the estimated
>> data size becomes 1% of the original, falling below the broadcast
>> threshold. Catalyst decides to broadcast the table, but the actual data
>> is 10x the estimated size.
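>>
>> With made-up numbers, the double counting looks like this:
>>
>>   val original  = 1000L * 1000 * 1000       // 1 GB relation
>>   val reported  = (original * 0.1).toLong   // source already filtered:
>>                                             //   reports ~100 MB
>>   val estimated = (reported * 0.1).toLong   // Catalyst applies the same
>>                                             //   selectivity again: ~10 MB
>>   // The true post-filter size is ~100 MB, 10x the estimate. If the
>>   // broadcast threshold sits between the two, Catalyst broadcasts a
>>   // table far bigger than it thinks.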
>>
>> On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar <
>> aniket.bhatnagar@gmail.com> wrote:
>>
>>> Hi Spark SQL committers
>>>
>>> I have started experimenting with the data sources API, and I was
>>> wondering if it makes sense to move the method sizeInBytes from
>>> BaseRelation to the *Scan interfaces. This is because a relation may be
>>> able to leverage filter push-down to estimate its size, potentially
>>> making a very large relation broadcast-able. Thoughts?
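>>>
>>> To illustrate the mismatch (a simplified sketch of the current API,
>>> not the full definitions):
>>>
>>>   abstract class BaseRelation {
>>>     // The size estimate lives here, with no visibility into any
>>>     // pushed-down filters...
>>>     def sizeInBytes: Long
>>>   }
>>>
>>>   trait PrunedFilteredScan {
>>>     // ...while the filters only show up here, at scan time.
>>>     def buildScan(requiredColumns: Array[String],
>>>                   filters: Array[Filter]): RDD[Row]
>>>   }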
>>>
>>> Aniket
>>>
>>
>>
