calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jul...@hydromatic.net>
Subject Re: Tiles & Lattice HOWTO
Date Mon, 11 May 2015 23:10:38 GMT
Moving CARDINALITY_MAP into the model seem to be a bit of a hack. Let’s fix https://issues.apache.org/jira/browse/CALCITE-711
first. After that, someone can contribute a statistic provider that reads from the model.

On May 8, 2015, at 5:33 PM, Marc Prud'hommeaux <marc@glimpse.io> wrote:

> Julian-
> 
> Thanks very much for the detailed explanation! It gives me enough to go on to experiment
with using it for my purposes (ad-hoc generation of lattice structures).
> 
> Regarding the CARDINALITY_MAP, it might be less of a hack if it was just moved into the
model; i.e., have a CARDINALITY_MAP array in hsqldb-foodmart-lattice-model.json rather than
hardcoded into Lattice.java. Also, as we discussed a few years ago at <http://julianhyde.blogspot.com/2012/04/how-should-mondrian-get-table-and.html>,
we might get some mileage from using DatabaseMetadata.getIndexInfo for cardinality estimates.
> 
> 	-Marc
> 
> 
>> On May 4, 2015, at 1:30 PM, Julian Hyde <julianhyde@gmail.com> wrote:
>> 
>> A lattice represents a star (or snowflake) schema, not a general
>> schema. In particular, all relationships must be many-to-one, heading
>> from a fact table at the center of the star.
>> 
>> The lattice definition uses a SQL statement to represent the star. SQL
>> is a useful short-hand to represent several tables joined together,
>> and assigning aliases to the column names (it more convenient than
>> inventing a new language to represent relationships, join conditions
>> and cardinalities).
>> 
>> Unlike regular SQL, order is important. If you put A before B in the
>> FROM clause, and make a join between A and B, you are saying that
>> there is a many-to-one foreign key relationship from A to B. (E.g. in
>> the example lattice, the Sales fact table occurs before the Time
>> dimension table, and before the Product dimension table. The Product
>> dimension table occurs before the ProductClass outer dimension table,
>> further down an arm of a snowflake.)
>> 
>> A lattice implies constraints. In the A to B relationship, there is a
>> foreign key on A (i.e. every value of A's foreign key has a
>> corresponding value in B's key), and a unique key on B (i.e. no key
>> value occurs more than once). These constraints are really important,
>> because it allows the planner to remove joins to tables whose columns
>> are not being used, and know that the query results will not change.
>> 
>> Calcite does not check these constraints. If they are violated,
>> Calcite will return wrong results.
>> 
>> A lattice is a big, virtual join view. It is not materialized (it
>> would be several times larger than the star schema, because of
>> denormalization) and you probably wouldn't want to query it (far too
>> many columns). So what is it useful for? As we said above, (a) the
>> lattice declares some very useful primary and foreign key constraints,
>> (b) it helps the query planner map user queries onto
>> filter-join-aggregate materialized views (the most useful kind of
>> materialized view for DW queries), (c) gives Calcite a framework
>> within which to gather stats about data volumes and user queries, (d)
>> allows Calcite to automatically design and populate materialized
>> views.
>> 
>> Most star schema models force you to choose whether a column is a
>> dimension or a measure. In a lattice, every column is a dimension
>> column. (I.e. it can become one of the columns in the GROUP BY clause
>> to query the star schema at a particular dimensionality). Any column
>> can also be used in a measure; you define measures by giving the
>> column and an aggregate function.
>> 
>> If "unit_sales" tends to be used much more often as a measure rather
>> than a dimension, that's fine. Calcite's algorithm should notice that
>> it is rarely aggregated, and not be inclined to create tiles that
>> aggregate on it. (By "should" I mean "could and one day will". The
>> algorithm does not currently take query history into account when
>> designing tiles.)
>> 
>> But someone might want to know whether orders with fewer than 5 items
>> were more or less profitable than orders with more than 100. All of a
>> sudden, "unit_sales" has become a dimension. If there's virtually zero
>> cost to declaring a column a dimension column, I figured let's make
>> them all dimension columns.
>> 
>> The model allows for a particular table to be used more than once,
>> with a different table alias. You could use this to model say
>> OrderDate and ShipDate, with two uses to the Time dimension table.
>> 
>> Most SQL systems require that the column names in a view are unique.
>> This is hard to achieve in a lattice, because you often include
>> primary and foreign key columns in a join. So Calcite lets you refer
>> to columns in two ways. If the column is unique, you can use its name,
>> ["unit_sales"]. Whether or not it is unique in the lattice, it will be
>> unique in its table, so you can use it qualified by its table alias.
>> Examples: ["sales", "unit_says", ["ship_date", "time_id"],
>> ["order_date", "time_id"].
>> 
>> A "tile" is a materialized table in a lattice, with a particular
>> dimensionality. (What Kylin calls a "cuboid".) The "tiles" attribute
>> of the lattice JSON element (JsonLattice) defines an initial set of
>> tiles to materialize.
>> 
>> If you run the algorithm, you can omit the tiles attribute. Calcite
>> will choose an initial set. If you include the tiles attribute, the
>> algorithm will start with that list and then start finding other tiles
>> that are complementary (i.e. "fill in the gaps" left by the initial
>> tiles).
>> 
>> Yeah, CARDINALITY_MAP is currently a hack. I didn't want Calcite to
>> have to compute column cardinalities by executing a  "count(distinct
>> ...)" query for each column. We need to introduce a
>> CardinalityProvider SPI, with a default implementation that generates
>> SQL, but allows people to use more efficient means, e.g. database
>> stats or HyperLogLog. Column cardinalities (and joint cardinalities
>> such as "count(distinct state, city, zipcode)") are really important
>> to the algorithm. If computing cardinalities is slow the algorithm
>> will be slow, and if they are inaccurate the algorithm will produce
>> sub-optimal results.
>> 
>> I should have written this all down in a blog post a long while ago.
>> Sorry I didn't. Now I have.
>> 
>> Julian
>> 
>> 
>> 
>> On Sun, May 3, 2015 at 6:52 AM, Marc Prud'hommeaux <marc@glimpse.io> wrote:
>>> 
>>> Can anyone point me to a short summary of how to construct and use the "tiles"
model parameter in Calcite? The slides at http://www.slideshare.net/julianhyde/calcite-stratany2014
<http://www.slideshare.net/julianhyde/calcite-stratany2014> are intriguing, but I can't
find any video of the presentation.
>>> 
>>> And does it require a specific pre-existing star schema, or can it be used with
any arbitrary schema? Skimming core/src/main/java/org/apache/calcite/materialize/Lattice.java
it looks like the CARDINALITY_MAP constant is hardwired to the foodmart database.
>>> 
> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message