calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jh...@apache.org>
Subject Re: SQL Query Set Analyzer
Date Mon, 06 Aug 2018 22:32:24 GMT
It’s hard to automatically recommend a set of MVs from past queries. The design space is
just too large. But if you are designing MVs for interactive BI, you can use the “lattice”
model. This works because many queries will be filter-join-aggregate queries on a star schema
(i.e. a central fact table and dimension tables joined over many-to-one relationships). (Or
perhaps a join between two or more such queries.) 

Do the queries you are trying to optimize have that pattern?

If so, you might start by creating a lattice for each such star schema. Then the lattice can
suggest MVs that are summary tables.

(Lattice suggester is one step more meta - it recommends lattices - but given where you are,
I would suggest hand-writing one or two lattices.)

Calcite is a framework, and this unfortunately means that you have to write Java code to use
these features. It might be easier if you use the new “server” module, which supports
CREATE MATERIALIZED VIEW as a DDL statement. Then you can create some demos for your colleagues
that are wholly or mostly SQL.

The simplest way to populate a materialized view is the CREATE MATERIALIZED VIEW statement.
It basically does the same as CREATE TABLE AS SELECT (executes a query, stores the results
in a table) but it leaves behind the metadata about where that data came from.

Materialized views can in principle be maintained incrementally, but how you do it depends
upon what changes are allowed (append only? Replace rows and write the old rows to an audit
table?). We’ve not done a lot of work on it. I believe the Hive folks have given this more
thought than I have.

Julian


> On Aug 3, 2018, at 11:11 PM, James Taylor <jamestaylor@apache.org> wrote:
> 
> Both the Lattice Suggestor and Quark sound like what I need for an
> automated solution, but I have some more basic follow up questions first.
> Here's our basic use case (very similar to Zheng Shao's, I believe):
> - Our company has stood up Presto for data analysts
> - Nightly ETL jobs populate Hive tables with lots of data
> - Analysts run adhoc queries over data using Presto
> - The top CPU using queries are pretty complex (2-3 pages of complex SQL,
> lots of joins and aggregation)
> 
> There are some basic/obvious stuff that can be done manually first:
> - Provide better visibility into which queries are expensive
> - Ask query owners to produce their own materialized views and manually
> change their queries to use them (I believe there's some amount of this
> already)
> 
> Then there's kind of a middle ground:
> - Ask query owners to identify what they think are the top few materialized
> views to build
> - Manually build these materialized views in the daily ETL job.
> - Use Calcite to rewrite the query to use the materialized views. Can
> Calcite do this and would it be a problem if the queries are Presto
> queries? I'd need to make sure I provided Calcite with the cost information
> it needs, right?
> - Dark launch to test that the rewritten query returns the same results as
> the original query (and measure the perf improvement)
> 
> But the more interesting stuff is:
> - Automatically identifying the materialized views that should be built.
> Sounds like both the Lattice Suggestor and Quark are potentially a good
> fit. I'm not clear on what is output by the Suggestor. Would it spit out a
> CREATE VIEW statement (or could what it outputs produce that)? How does the
> Suggestor compare with Quark?
> - Automatically build the materialized views. Would the Lattice framework
> or Quark help me with that? Would it be possible to incrementally build the
> materialized views or would it be necessary to build the materialized views
> from the beginning of time again and again (clearly not feasible given the
> size of the tables)? Maybe it depends on the aggregation functions that are
> used?
> - And the nirvana is a kind of feedback loop - based on the top N expensive
> queries, identify and build the materialized views, use them transparently
> during querying, and then retire them when they're infrequently used.
> 
> Would it be a better choice to build the materialized views as Druid
> tables? That'd require a Druid connector to Presto, though. This reminds me
> of the work you already did, Julian, with Hive+Druid (i.e. CALCITE-1731)
> but for Presto instead of Hive. Do you think any of that would transfer
> over in some way?
> 
> WDYT? Huge amount of work? Any advice is much appreciated.
> 
> Thanks,
> James
> 
> On Thu, Jul 26, 2018 at 11:29 AM, Julian Hyde <jhyde@apache.org <mailto:jhyde@apache.org>>
wrote:
> 
>> PS
>> 
>> +1 for Babel.
>> 
>> If you are analyzing a set of queries, it is very likely that these
>> queries were written to be executed against another database. Babel aims to
>> take such queries and convert them into Calcite relational algebra. The
>> process might occasionally be lossy, if Calcite's algebra does not support
>> a feature, but the algebra is still useful.
>> 
>>> On Jul 26, 2018, at 11:25 AM, Julian Hyde <jhyde@apache.org> wrote:
>>> 
>>> There are many possible analyzers, but Lattice Suggester is one that I
>> am working on and is relatively mature. It looks at lots of queries and
>> builds lattices (star schemas with measures) from those queries. It finds
>> commonality by “growing” lattices - adding measures, adding derived
>> expressions, adding many-to-one joins. Lattice Suggester takes a set of SQL
>> query strings, then it parses them (to SqlNode), validates, and converts to
>> relational algebra (RelNode). Then it looks for patterns in the relational
>> algebra. Working at the algebra level as opposed to the SQL parse tree is a
>> net benefit, but some things (e.g. figuring out the original column alias
>> for an expression) are a bit more difficult.
>>> 
>>> Lattice suggester is under development in my
>> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester <
>> https://github.com/julianhyde/calcite/tree/1870-lattice-suggester <https://github.com/julianhyde/calcite/tree/1870-lattice-suggester>>
>> branch. The tests pass, and I hope to have it merged into master in the
>> next month or two.
>>> 
>>> I agree with Devjyoti that it’s difficult to gather together all
>> possible analyses in one tool. Lattice Suggester is a good base for
>> analyses that model queries as a filter/project/aggregate of a pre-joined
>> star schema — a very common model in BI; for example, it could model which
>> combinations of columns are commonly used as filters.
>>> 
>>> For analyses that are not tied to star schemas, feel free to create new
>> tools. The tools would benefit from collaborative development, and I think
>> that Calcite would be a good home for them.
>>> 
>>> Julian
>>> 
>>> 
>>> 
>>> 
>>>> On Jul 25, 2018, at 10:28 PM, Devjyoti Patra <devjyotip@qubole.com
>> <mailto:devjyotip@qubole.com <mailto:devjyotip@qubole.com>>> wrote:
>>>> 
>>>> Hi Zheng,
>>>> 
>>>> At Qubole, we are building something very similar to what you are
>> looking
>>>> for. And from experience, I can tell you that it is a lot easy to build
>> it
>>>> than what one may think.
>>>> We use Calcite parser to parse the SQL into Sqlnode and then use
>> different
>>>> tree visitors to extract query attributes like  tables, filter columns,
>>>> joins, subqueries etc.,
>>>> 
>>>> Our approach is very similar to Uber's QueryParser project (
>>>> https://github.com/uber/queryparser <https://github.com/uber/queryparser>
<https://github.com/uber/ <https://github.com/uber/>
>> queryparser> ), but we go deeper in our analysis of
>>>> finding queries that are semantically similar to some canonicalized
>> form.
>>>> If you intend to begin from scratch, I can give you some pointers to get
>>>> started.
>>>> 
>>>> Thanks,
>>>> Devjyoti
>>>> 
>>>> 
>>>> On Thu, Jul 26, 2018 at 9:37 AM, Zheng Shao <zshao9@gmail.com <mailto:zshao9@gmail.com>
<mailto:
>> zshao9@gmail.com>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> We are thinking about starting a project to analyze huge number of SQL
>>>>> queries (think millions) to identify common patterns:
>>>>> * Common sub queries
>>>>> * Common filtering conditions (columns) for a table
>>>>> * Common join keys for table pairs
>>>>> 
>>>>> Are there any existing projects on that direction using Calcite?  Would
>>>>> love to leverage instead of building from scratch.
>>>>> 
>>>>> Zheng


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message