calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrei Sereda <and...@sereda.cc>
Subject Re: Review Request (CALCITE-2689). ElasticSearch Adapter. Grouping on date / number fields.
Date Fri, 14 Dec 2018 15:02:50 GMT
> Is the query still going to be inefficient if the user adds the IS NOT
NULL predicate?
It will be efficient. For explicit null check aggregates can be pushed to
ES since sentinel will never be used.

> Otherwise, I don't find it very problematic to expect the users to right
their queries correclty.
I think users will quickly forget to add IS NOT NULL and then complain
about slow queries. What to do when there are several grouping columns
(GROUP BY a, b, c) ?

I'll start working on Composite aggregations
<https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html>
to remedy this problem (requires ES 6.1+)


On Fri, Dec 14, 2018 at 5:23 AM Stamatis Zampetakis <zabetak@gmail.com>
wrote:

> I have no strong opinion since I am not using the adapter.
>
> However, I have a small question regarding option 2.
>
> Is the query still going to be inefficient if the user adds the IS NOT NULL
> predicate?
> If yes, then I probably option 1 would be the best option for the moment.
> Otherwise, I don't find it
> very problematic to expect the users to right their queries correclty. The
> same happens with other database
> (e.g., Oracle) where sometimes you need to introduce NULL checks in the
> queries in order to use the
> existing indexes and have efficient queries.
>
> Best,
> Stamatis
>
>
> Στις Πέμ, 13 Δεκ 2018 στις 4:40 μ.μ., ο/η Andrei Sereda <andrei@sereda.cc>
> έγραψε:
>
> > Thanks, Stamatis.
> >
> > I agree that sentinels are not the best approach but without pushing
> > aggregations to ES they're not very usable to elastic users.
> >
> > So there are two bad choices:
> > 1) (Low?) Probability of collisions triggering invalid results
> (surprises)
> > but efficient query.
> > 2)  Inefficient query (fetching whole index in memory)
> >
> > I hope we can drop support for ES 2.x soon and enforce ES >= 6.1. If so,
> > one can use  Composite aggregations
> > <
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
> > >
> > which don't have these shortcomings.
> > What we can do now is have separate aggregation query for ES2 and ES6.
> >
> > What do you think?
> >
> > Regards,
> > Andrei.
> >
> >
> >
> > On Thu, Dec 13, 2018 at 6:16 AM Stamatis Zampetakis <zabetak@gmail.com>
> > wrote:
> >
> > > Hi Andrei,
> > >
> > > I haven't gone entirely over the new PR, but I think there are cases
> > where
> > > the result of the queries are going to be wrong (when values collide
> with
> > > sentinels).
> > > Another approach would be to introduce a rule in order to push the
> > > aggregation in ElasticSearch *only* if the field in questions is *not*
> > > nullable. This can make
> > > queries a bit more verbose but it guarantees that there will be no
> > suprises
> > > to the end-user.
> > >
> > > select max(amount), date from orders group by amount, date ->
> Aggregation
> > > not pushed on ES
> > > select max(amount), date from orders where amount is not null and date
> is
> > > not null group by amount, date -> Aggregation pushed on ES
> > >
> > > Sorry for the late reply!
> > >
> > > Best,
> > > Stamatis
> > >
> > >
> > > Στις Κυρ, 2 Δεκ 2018 στις 2:01 π.μ., ο/η Julian Hyde <jhyde@apache.org
> >
> > > έγραψε:
> > >
> > > > Interesting.
> > > >
> > > > One of the benefits that a SQL layer such as Calcite can bring is
> that
> > it
> > > > hides the details necessary to make operations like this work.
> > > >
> > > > Julian
> > > >
> > > >
> > > > > On Dec 1, 2018, at 5:22 AM, Kevin Risden <krisden@apache.org>
> wrote:
> > > > >
> > > > > I haven't had a chance to review but saw that Elastic has the same
> > > issue
> > > > > with aggregations.
> > > > >
> > > > > https://github.com/elastic/elasticsearch/issues/35745
> > > > >
> > > > > Kevin Risden
> > > > >
> > > > > On Wed, Nov 28, 2018, 20:46 Andrei Sereda <andrei@sereda.cc wrote:
> > > > >
> > > > >> Greetings ,
> > > > >> We have discovered an issue with ES aggregations when grouping
on
> > > > >> non-textual fields (date, long). Currently the following sql
fails
> > > > because
> > > > >> for missing value
> > > > >> <
> > > > >>
> > > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_missing_value_13
> > > > >>>
> > > > >> we inject __MISSING__ sentinel which is not date / number
> parseable
> > > (it
> > > > >> can’t be null either) :
> > > > >>
> > > > >> select max(amount), date from orders group by date -- special
ES
> > type
> > > > >>
> > > > >> The solution is to make sentinel type specific :
> > > > >>
> > > > >>   1. Integer.MIN_VALUE for integers
> > > > >>   2. 9999-12-31 for dates etc.
> > > > >>
> > > > >> For low cardinality types like boolean, byte, short I’m not
sure
> > what
> > > > to do
> > > > >> since there is high probability of collision between missing
field
> > and
> > > > >> actual value (eg. what value to choose for missing boolean?)
:
> > > > >>
> > > > >> select max(amount), isActive from orders group by isActive --
> > boolean
> > > > type
> > > > >>
> > > > >> Let me know if you solved this problem differently before.
> Composite
> > > > >> aggregations
> > > > >> <
> > > > >>
> > > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
> > > > >>>
> > > > >> (available since 6.1) should help in future.
> > > > >>
> > > > >> PR: https://github.com/apache/calcite/pull/946
> > > > >> JIRA: https://issues.apache.org/jira/browse/CALCITE-2689
> > > > >>
> > > > >> Many Thanks in Advance,
> > > > >> Andrei.
> > > > >>
> > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message