phoenix-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PHOENIX-838) Continuous queries
Date Tue, 11 Mar 2014 23:25:42 GMT

     [ https://issues.apache.org/jira/browse/PHOENIX-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated PHOENIX-838:
-----------------------------------

    Description: 
Support continuous queries. 

As a coprocessor application, Phoenix is well positioned to observe mutations and treat those
observations as an event stream.
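
A minimal sketch of that observer side, assuming a hypothetical WindowTaskManager as the
continuous query runtime; the RegionObserver hook itself is the real HBase coprocessor API
(0.98-era postPut signature), everything else is illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

/**
 * Illustrative only: forwards each committed mutation to a hypothetical per-region
 * continuous query runtime, turning observer callbacks into an event stream.
 */
public class ContinuousQueryObserver extends BaseRegionObserver {

  // Hypothetical runtime that routes events to the continuous queries registered
  // against this region; not an existing Phoenix class.
  private final WindowTaskManager tasks = new WindowTaskManager();

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, Durability durability) throws IOException {
    // Treat each mutation as one event; the task decides which windows it falls
    // into and updates its in-memory state incrementally.
    tasks.onEvent(put.getRow(), put.getFamilyCellMap(), System.currentTimeMillis());
  }
}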

Continuous queries are persistent queries that run server side, typically expressed as structured
queries with some extensions for defining a bounded subset of a potentially unbounded tuple
stream. A Phoenix user could create a materialized view using WINDOW and other OLAP extensions
to SQL discussed on PHOENIX-154 to define time- or tuple-based sliding windows, possibly
partitioned, and an aggregating or filtering operation over those windows. This would trigger
instantiation of a long-running distributed task on the cluster for incrementally maintaining
the view. ("Task" is meant here as a logical notion; it may not be a separate thread of execution.)
As the task receives observer events and performs work, it would update state in memory for
on-demand retrieval. For state reconstruction after failure, the WAL could be overloaded with
in-window event history and/or the in-memory state could be periodically checkpointed into
shadow stores in the region.
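
A rough sketch of what such a per-region task might keep in memory for a time-based sliding
window computing an incremental SUM; the class, its methods, and the checkpoint hook are all
hypothetical, not proposed API:

import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical in-memory state for one time-based sliding window maintaining an
 * incremental SUM. Events outside the window are evicted as time advances;
 * checkpoint() stands in for periodically flushing state to a shadow store.
 */
public class SlidingWindowSum {

  private static class Event {
    final long timestamp;
    final long value;
    Event(long timestamp, long value) { this.timestamp = timestamp; this.value = value; }
  }

  private final long windowMillis;
  private final Deque<Event> events = new ArrayDeque<>();
  private long sum;

  public SlidingWindowSum(long windowMillis) { this.windowMillis = windowMillis; }

  /** Incorporate one observed mutation into the running aggregate. */
  public synchronized void onEvent(long timestamp, long value) {
    events.addLast(new Event(timestamp, value));
    sum += value;
    evictOlderThan(timestamp - windowMillis);
  }

  /** Current window result, served on demand when the view is queried. */
  public synchronized long current(long now) {
    evictOlderThan(now - windowMillis);
    return sum;
  }

  private void evictOlderThan(long cutoff) {
    while (!events.isEmpty() && events.peekFirst().timestamp < cutoff) {
      sum -= events.removeFirst().value;
    }
  }

  /** Placeholder for periodic checkpointing of state into a shadow store. */
  public synchronized byte[] checkpoint() {
    // A real implementation might serialize (cutoff, sum, pending events) so state
    // can be rebuilt without replaying the full WAL.
    return Long.toString(sum).getBytes();
  }
}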

Users would pick up the latest state maintained by the continuous query by querying the view,
or perhaps Phoenix could do this transparently for any query the optimizer determines to be equivalent.
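
From the client's perspective this would just be an ordinary Phoenix JDBC query; the view name
EVENTS_LAST_HOUR below is made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadContinuousView {
  public static void main(String[] args) throws Exception {
    // Standard Phoenix JDBC URL; EVENTS_LAST_HOUR is a hypothetical
    // continuous-query view maintained server side.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT HOST, EVENT_COUNT FROM EVENTS_LAST_HOUR")) {
      while (rs.next()) {
        System.out.println(rs.getString("HOST") + "\t" + rs.getLong("EVENT_COUNT"));
      }
    }
  }
}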

This could be an important feature for Phoenix. Generally, Phoenix and HBase are meant to handle
high data volumes that overwhelm other data management options, so even subsets of the full
data may present scale challenges. Many use cases mix ad hoc or exploratory full table scans
with aggregates, rollups, or sampling queries over a subset or sample, and users want the
latter queries to run as fast as possible. If that work can be done inline with the process
of initially persisting mutations, then we trade some memory and CPU resources up front to
eliminate significant IO time later that would otherwise dominate.

An initial implementation could automatically partition continuous queries on region boundaries.
If this can be done, then failure handling and state reconstruction for continuous queries
would map naturally onto the existing HBase mechanisms for detecting and recovering from
regionserver failure. The following constructs should be excluded:

- DISTINCT (might require too much in-memory state)
- Joins (defeat partitioning)
- Subqueries (implementation complexity)

Queries not meeting these constraints would generate an exception at view creation time. Partitioning
could be exposed explicitly to the user, or the JDBC driver could collect per-region results in
parallel using an Endpoint invocation over all regions and perform a final global aggregation
or filtering step at the client.
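
That final client-side step might look something like the following; PartialCount is a stand-in
for whatever per-region partial result the Endpoint invocation would return:

import java.util.List;

/**
 * Hypothetical per-region partial result for an AVG over a window, plus the global
 * combine step the JDBC driver (or client) would run after invoking the continuous
 * query Endpoint on every region in parallel.
 */
public class GlobalAverage {

  public static class PartialCount {
    final long sum;
    final long count;
    public PartialCount(long sum, long count) { this.sum = sum; this.count = count; }
  }

  /** Combine per-region partials into the global aggregate. */
  public static double combine(List<PartialCount> partials) {
    long sum = 0, count = 0;
    for (PartialCount p : partials) {
      sum += p.sum;
      count += p.count;
    }
    return count == 0 ? Double.NaN : (double) sum / count;
  }
}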

Follow-on work could enable subqueries by stacking in the event model. The inner query would
generate an event that notifies the outer query when new results are ready, and the outer
query would pick up those results and process them further.
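
Wired this way, the inner query's task would publish a "results ready" event that the outer
query's task consumes, roughly along these lines (all names hypothetical):

/** Hypothetical callback by which an inner continuous query notifies an outer one. */
interface ResultListener {
  void onResultsReady(String innerQueryId, long windowEnd);
}

class OuterQueryTask implements ResultListener {
  @Override
  public void onResultsReady(String innerQueryId, long windowEnd) {
    // Fetch the inner query's latest window state and feed it into this task's
    // own windows as a new event.
  }
}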

It might also be useful follow-on work to extend server-side persistent query management with
an inactive-but-resident state. This would allow users to shed load by deactivating a subset
of persistent queries without requiring expensive reconstruction or losing state.

> Continuous queries
> ------------------
>
>                 Key: PHOENIX-838
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-838
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Andrew Purtell
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
