calcite-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Tiles & Lattice HOWTO
Date Mon, 04 May 2015 21:28:11 GMT
Julian,

I think that what you are talking about is related to the mathematical
definition of a lattice: a partially ordered set in which a least upper
bound and a greatest lower bound are defined for every pair of elements.

In this case, the objects in the set are aggregation queries and the
partial ordering refers to the variables included in the group by. You
can take the ordering to be defined between queries A and B such that
A < B if the variables in the group by part of A are a strict subset of
the variables in the group by part of B. The least upper bound of two
queries is then the query that groups by the union of their grouping
variables, and the greatest lower bound is the query that groups by the
intersection.
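
A small sketch, with made-up foodmart-style names, just to make that
concrete:

  A: select "the_year", sum("unit_sales") from star group by "the_year"
  B: select "product_id", sum("unit_sales") from star group by "product_id"

  LUB(A, B): group by "the_year", "product_id"  (union of grouping variables)
  GLB(A, B): group by ()                        (intersection; the grand total)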

This is all straightforward except for bringing in a fair bit of
terminology.

I am confused by one thing you say, however, and I wonder if others are as
well.

You say "A lattice represents a star (or snowflake) schema, not a general
schema".  If my interpretation of what you say is correct, the lattice is
not a schema at all, but instead a set of queries that are defined over a
schema with ordering defined as above.

Is it merely a shorthand to refer to a schema as equivalent to the lattice
composed of all the possible aggregation queries with all combinations of
dimensions?

On Mon, May 4, 2015 at 6:30 PM, Julian Hyde <julianhyde@gmail.com> wrote:

> A lattice represents a star (or snowflake) schema, not a general
> schema. In particular, all relationships must be many-to-one, heading
> from a fact table at the center of the star.
>
> The lattice definition uses a SQL statement to represent the star. SQL
> is a useful short-hand to represent several tables joined together,
> and to assign aliases to the column names (it is more convenient than
> inventing a new language to represent relationships, join conditions
> and cardinalities).
>
> Unlike regular SQL, order is important. If you put A before B in the
> FROM clause, and make a join between A and B, you are saying that
> there is a many-to-one foreign key relationship from A to B. (E.g. in
> the example lattice, the Sales fact table occurs before the Time
> dimension table, and before the Product dimension table. The Product
> dimension table occurs before the ProductClass outer dimension table,
> further down an arm of a snowflake.)
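>
> A minimal sketch of what such a star SQL could look like, assuming
> foodmart-style table names rather than the exact ones in the example
> model:
>
>   select *
>   from "sales_fact_1997" as "sales"
>   join "time_by_day" as "time" using ("time_id")
>   join "product" as "product" using ("product_id")
>   join "product_class" as "pc"
>     on "product"."product_class_id" = "pc"."product_class_id"
>
> The fact table comes first, each dimension table follows the table it
> hangs off, and the ProductClass join continues the snowflake arm from
> Product.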
>
> A lattice implies constraints. In the A to B relationship, there is a
> foreign key on A (i.e. every value of A's foreign key has a
> corresponding value in B's key), and a unique key on B (i.e. no key
> value occurs more than once). These constraints are really important,
> because they allow the planner to remove joins to tables whose columns
> are not being used, knowing that the query results will not change.
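>
> A sketch of the kind of rewrite this enables, again with illustrative
> foodmart-style names:
>
>   select "product_id", sum("unit_sales")
>   from "sales_fact_1997"
>   join "time_by_day" using ("time_id")
>   group by "product_id"
>
> Every "time_id" in the fact table matches exactly one row in
> "time_by_day", so the join neither adds nor removes rows; since no
> column of "time_by_day" is used, the planner can drop the join and
> scan the fact table alone.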
>
> Calcite does not check these constraints. If they are violated,
> Calcite will return wrong results.
>
> A lattice is a big, virtual join view. It is not materialized (it
> would be several times larger than the star schema, because of
> denormalization) and you probably wouldn't want to query it (far too
> many columns). So what is it useful for? As we said above, (a) the
> lattice declares some very useful primary and foreign key constraints,
> (b) it helps the query planner map user queries onto
> filter-join-aggregate materialized views (the most useful kind of
> materialized view for DW queries), (c) it gives Calcite a framework
> within which to gather stats about data volumes and user queries, and
> (d) it allows Calcite to automatically design and populate
> materialized views.
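>
> To illustrate (b) with a sketch (names are illustrative): if a tile
> has been materialized at the ["the_year", "quarter"] dimensionality
> with a sum("unit_sales") measure, a user query such as
>
>   select "the_year", sum("unit_sales")
>   from "sales_fact_1997" join "time_by_day" using ("time_id")
>   group by "the_year"
>
> can be answered by re-aggregating that tile, rolling quarters up into
> years, without touching the fact table.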
>
> Most star schema models force you to choose whether a column is a
> dimension or a measure. In a lattice, every column is a dimension
> column. (I.e. it can become one of the columns in the GROUP BY clause
> to query the star schema at a particular dimensionality). Any column
> can also be used in a measure; you define measures by giving the
> column and an aggregate function.
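>
> A sketch of how such a measure can be declared in the lattice JSON
> (the column name is illustrative):
>
>   "defaultMeasures": [
>     { "agg": "count" },
>     { "agg": "sum", "args": "unit_sales" }
>   ]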
>
> If "unit_sales" tends to be used much more often as a measure rather
> than a dimension, that's fine. Calcite's algorithm should notice that
> it is rarely aggregated, and not be inclined to create tiles that
> aggregate on it. (By "should" I mean "could and one day will". The
> algorithm does not currently take query history into account when
> designing tiles.)
>
> But someone might want to know whether orders with fewer than 5 items
> were more or less profitable than orders with more than 100. All of a
> sudden, "unit_sales" has become a dimension. If there's virtually zero
> cost to declaring a column a dimension column, I figured let's make
> them all dimension columns.
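>
> Such a question might look roughly like this (the bucketing and
> column names are illustrative):
>
>   select case when "unit_sales" < 5 then 'small orders'
>               else 'large orders' end as "bucket",
>          sum("store_sales") - sum("store_cost") as "profit"
>   from "sales_fact_1997"
>   where "unit_sales" < 5 or "unit_sales" > 100
>   group by case when "unit_sales" < 5 then 'small orders'
>                 else 'large orders' end
>
> Here "unit_sales" is being grouped on, i.e. used as a dimension.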
>
> The model allows for a particular table to be used more than once,
> with a different table alias. You could use this to model, say,
> OrderDate and ShipDate, with two uses of the Time dimension table.
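>
> In the star SQL that might look roughly like this (the "orders" table
> and its column names are illustrative):
>
>   from "orders" as "o"
>   join "time_by_day" as "order_date"
>     on "o"."order_date_id" = "order_date"."time_id"
>   join "time_by_day" as "ship_date"
>     on "o"."ship_date_id" = "ship_date"."time_id"
>
> The same physical table appears twice, once under each alias.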
>
> Most SQL systems require that the column names in a view are unique.
> This is hard to achieve in a lattice, because you often include
> primary and foreign key columns in a join. So Calcite lets you refer
> to columns in two ways. If the column name is unique, you can use it
> on its own: ["unit_sales"]. Whether or not it is unique in the
> lattice, it will be unique in its table, so you can always use it
> qualified by its table alias. Examples: ["sales", "unit_sales"],
> ["ship_date", "time_id"], ["order_date", "time_id"].
>
> A "tile" is a materialized table in a lattice, with a particular
> dimensionality. (What Kylin calls a "cuboid".) The "tiles" attribute
> of the lattice JSON element (JsonLattice) defines an initial set of
> tiles to materialize.
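>
> A sketch of a "tiles" attribute, with illustrative foodmart-style
> names:
>
>   "tiles": [ {
>     "dimensions": [ "the_year", ["time", "quarter"] ],
>     "measures": [
>       { "agg": "sum", "args": "unit_sales" },
>       { "agg": "count" }
>     ]
>   } ]
>
> Each entry gives a dimensionality (the grouping columns, qualified by
> table alias where necessary) and the measures to materialize at that
> dimensionality.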
>
> If you run the algorithm, you can omit the tiles attribute. Calcite
> will choose an initial set. If you include the tiles attribute, the
> algorithm will start with that list and then find other tiles that
> are complementary (i.e. that "fill in the gaps" left by the initial
> tiles).
>
> Yeah, CARDINALITY_MAP is currently a hack. I didn't want Calcite to
> have to compute column cardinalities by executing a "count(distinct
> ...)" query for each column. We need to introduce a
> CardinalityProvider SPI, with a default implementation that generates
> SQL, but allows people to use more efficient means, e.g. database
> stats or HyperLogLog. Column cardinalities (and joint cardinalities
> such as "count(distinct state, city, zipcode)") are really important
> to the algorithm. If computing cardinalities is slow the algorithm
> will be slow, and if they are inaccurate the algorithm will produce
> sub-optimal results.
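>
> For example, a default implementation might issue something like the
> following for a joint cardinality (column names are illustrative):
>
>   select count(*) from (
>     select distinct "state_province", "city", "postal_code"
>     from "customer") as t
>
> which can be arbitrarily expensive on a big table, hence the desire
> to fall back on database stats or HyperLogLog.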
>
> I should have written this all down in a blog post a long while ago.
> Sorry I didn't. Now I have.
>
> Julian
>
>
>
> On Sun, May 3, 2015 at 6:52 AM, Marc Prud'hommeaux <marc@glimpse.io>
> wrote:
> >
> > Can anyone point me to a short summary of how to construct and use the
> "tiles" model parameter in Calcite? The slides at
> http://www.slideshare.net/julianhyde/calcite-stratany2014 <
> http://www.slideshare.net/julianhyde/calcite-stratany2014> are
> intriguing, but I can't find any video of the presentation.
> >
> > And does it require a specific pre-existing star schema, or can it be
> > used with any arbitrary schema? Skimming
> > core/src/main/java/org/apache/calcite/materialize/Lattice.java it looks
> > like the CARDINALITY_MAP constant is hardwired to the foodmart database.
> >
>
