calcite-dev mailing list archives

From Julian Hyde <jul...@hydromatic.net>
Subject Re: Filter push
Date Tue, 14 Oct 2014 04:39:10 GMT
Vladimir,

I appreciate the critique of the design, and I agree with just about all of your points. However,
I think we need to help people like Dan, who is developing the RocksDB adapter, get to runnable
code faster. An example that is part of the optiq-csv demo project, and that requires no rules
or code generation, would achieve that.

See detailed comments inline.

Julian

On Oct 13, 2014, at 4:59 PM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:

>> * ProjectableCursorableTable goes further, and allows Calcite to
>> specify a list of projected fields and a list of filters. The cursor
>> must implement the projects, but it can choose which filters it is
>> able to implement.
> 
> I am against of such interfaces.
> I would be happy to be proven wrong.
> 
> This looks like a rabbit hole: it is a powerful feature, however
> 1) It seems hard to make it fast
> Effectively, it forces engine to interpret the whole thing since Calcite
> won't know if some of the filters are implemented by the table or not.
> We'll have to double-check if the list returned from "projectFilterScan" is
> valid (e.g. it does not contain completely new filters).

Adapters are not allowed to do that. Calcite would throw if an adapter returned filters that
(based on ==) were not in the original list.

We would know at planning time which of the filters the table can handle. The remaining filters
can be handled as they are today.
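
As a rough illustration of that contract (a sketch only, not Calcite code): the planner offers
a list of candidate filters, the adapter returns the ones it accepts, and anything returned
that is not the same object (==) as an offered filter is rejected. The names FilterSplit, of
and containsIdentity are hypothetical, and the filter type is left generic rather than tied
to RexNode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are Calcite APIs.
final class FilterSplit<F> {
  final List<F> pushed;    // filters the adapter agreed to implement
  final List<F> remaining; // filters the engine still has to evaluate

  private FilterSplit(List<F> pushed, List<F> remaining) {
    this.pushed = pushed;
    this.remaining = remaining;
  }

  /** Splits the offered candidates using the list the adapter says it accepted. */
  static <F> FilterSplit<F> of(List<F> candidates, List<F> accepted) {
    // Reject anything the adapter invented: identity (==), not equals().
    for (F a : accepted) {
      if (!containsIdentity(candidates, a)) {
        throw new IllegalArgumentException(
            "adapter returned a filter that was not offered: " + a);
      }
    }
    // Whatever the adapter did not accept is left for the engine.
    List<F> remaining = new ArrayList<>();
    for (F c : candidates) {
      if (!containsIdentity(accepted, c)) {
        remaining.add(c);
      }
    }
    return new FilterSplit<>(new ArrayList<>(accepted), remaining);
  }

  private static <F> boolean containsIdentity(List<F> list, F item) {
    for (F f : list) {
      if (f == item) {
        return true;
      }
    }
    return false;
  }
}
```

Because the split is computed once from two lists, it can be done at planning time, and the
remaining filters can go through whatever path handles them today.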

> 2) It does not look to scale well: tomorrow you'll want
> ProjectableCursorableIndexScanThenAccessTable once you realize some of the
> filtering can be checked against just the index contents. E.g. range scan
> of the key, then some fuzzy filter logic on the key itself, then table
> access for the rest with some more filters.

I agree, it doesn't scale well. I am following the mantra that simple things should be simple.

Pushing down projects and filters is by far the most common optimization.

If someone builds an adapter to a particular data source and their users are telling them
that they need (say) pushdown aggregation, they have already validated that Calcite is a useful
technology and they will not mind re-implementing filters and projects using rules.

> 3) I am not sure if those kind of interfaces would solve more complex
> cases: complex RexNodes (e.g. RexOver over RexOver over Rex..).
> Ideally, filters should be split to the ones that "can be implemented at
> storage and the ones that can not". I guess this has to be in some rule and
> "CursorableTable" is just a tiny bit. The logic to split the filters is not
> yet automagically solved by Calcite.

Good point. If the filters can be decomposed, then Calcite should do it before passing the
candidate filters to the adapter. The same goes for other transformations such as constant
folding. 

That said, it should be OK to pass complex filters to the adapter. The adapter can just say
no if it doesn't understand the filter.

> 
>> * CursorableTable is an optional interface that can be implemented by
>> any Table that allows you to get the results directly, without code
>> generation, and without creating a TableAccessRel or similar.
> 
> How is that better than AbstractQueryableTable?
> There is no need to do code generation if you need just a table scan.
> There is no need to create separate TableAccessRel either.
> 
> Here's the example:
> table definition:
> https://github.com/vlsi/optiq-mat-plugin/blob/master/mat-plugin/src/com/github/vlsi/mat/optiq/HeapSchema.java#L40
> table implementation:
> https://github.com/vlsi/optiq-mat-plugin/blob/master/mat-plugin/src/com/github/vlsi/mat/optiq/InstanceByClassTable.java#L27

Yes, I realized that later. In the prototype I am developing, I am considering going back
to AbstractQueryableTable. One thing against that approach is that Queryable is a big and
confusing interface, even with the help of AbstractQueryableTable.

I further thought of having the adapter writer override the where and select methods of the
Queryable, but that requires far too many lambda-style classes, and without flagging interfaces
it is not clear to Calcite at planning time whether a table is capable of implementing filter
and project.

> 
>> It returns a Cursor, which is similar to a JDBC ResultSet but much
>> simpler to implement,
> 
> We might just want "cursor convention", however it is a separate issue
> (e.g. getElementType -> Cursor.class | Object[].class |
> CustomDefinedPOJO.class)
> I do not like if "cursorable" would be a feature of "Cursorable" table.
> This will confuse users since "different kind of tables will have subtle
> differences and it would be impossible to pick the right one".

Yes. I came to the same conclusion. I am now thinking of the result type being Enumerable<Object[]>
or Queryable<Object[]>, which is basically what we have today.

>> and is more efficient than an Iterator or
>> Enumerable.
> 
> Can you please elaborate why Cursor would be so much better?
> I see nothing specific to Cursor that would make it more efficient.

When I looked at net.hydromatic.avatica.Cursor, I saw that it was not so easy to implement.
And, since you need to pass values via a Getter, every value has to be boxed. I sketched out
a simpler Cursor interface:

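interface Cursor2 {
  // One getter per primitive type; "column" is the 0-based column ordinal.
  boolean getBoolean(int column);
  byte getByte(int column);
  short getShort(int column);
  char getChar(int column);
  int getInt(int column);
  long getLong(int column);
  float getFloat(int column);
  double getDouble(int column);
  Object getObject(int column);
  // Whether the value of the given column in the current row is null.
  boolean isNull(int column);
  // Advances to the next row; returns false when there are no more rows.
  boolean moveNext();
  void close();
}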

You can implement Cursor2 without doing any per-row memory allocation. In my experience
that is really important for high-performance data processing.

Note also that isNull(int) takes a column ordinal, whereas Cursor.wasNull() requires the cursor
implementation to remember which column was referenced most recently.
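
To make the allocation point concrete, here is a minimal sketch of a Cursor2 over
column-oriented int arrays. It is an illustration under stated assumptions, not code from
Calcite or Avatica: the class IntColumnsCursor and its array layout are hypothetical, the
parameter name "column" is added so the interface compiles, and every column is assumed to
be declared INTEGER, so only getInt and isNull are implemented.

```java
// Cursor2 as sketched above, with parameter names added so that it compiles.
interface Cursor2 {
  boolean getBoolean(int column);
  byte getByte(int column);
  short getShort(int column);
  char getChar(int column);
  int getInt(int column);
  long getLong(int column);
  float getFloat(int column);
  double getDouble(int column);
  Object getObject(int column);
  boolean isNull(int column);
  boolean moveNext();
  void close();
}

// Hypothetical cursor over INTEGER columns stored as columns[column][row].
final class IntColumnsCursor implements Cursor2 {
  private final int[][] columns;
  private final boolean[][] nulls; // nulls[column][row] is true for a null value
  private final int rowCount;
  private int row = -1;            // positioned before the first row

  IntColumnsCursor(int[][] columns, boolean[][] nulls) {
    this.columns = columns;
    this.nulls = nulls;
    this.rowCount = columns.length == 0 ? 0 : columns[0].length;
  }

  // Returns 0 for a null value; the caller follows up with isNull(column).
  public int getInt(int column) {
    return nulls[column][row] ? 0 : columns[column][row];
  }

  // Takes the column ordinal, so no "last column read" state is needed.
  public boolean isNull(int column) {
    return nulls[column][row];
  }

  // Moving to the next row only bumps an index: no object is created per row.
  public boolean moveNext() {
    if (row + 1 >= rowCount) {
      return false;
    }
    row++;
    return true;
  }

  public void close() {}

  // Getters for types that no column is declared with are never called.
  private static UnsupportedOperationException unsupported() {
    return new UnsupportedOperationException("no column is declared with this type");
  }

  public boolean getBoolean(int column) { throw unsupported(); }
  public byte getByte(int column) { throw unsupported(); }
  public short getShort(int column) { throw unsupported(); }
  public char getChar(int column) { throw unsupported(); }
  public long getLong(int column) { throw unsupported(); }
  public float getFloat(int column) { throw unsupported(); }
  public double getDouble(int column) { throw unsupported(); }
  public Object getObject(int column) { throw unsupported(); }
}
```

A consumer would loop on moveNext(), call getInt(i) for each column, and call isNull(i) only
when a nullable column returned 0.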

> The downside of Cursor is the requirement to convert the values to suit
> each and every getter (30+ methods in Cursor$Accessor interface).
> For instance, the data might be stored internally as "int", and Calcite
> will use getString for some reason (who stops that?)
> This might be not that efficient and it even might surprise the developer
> who implements the Cursor.

The contract in Cursor2 would be that if the table declares a column as an INTEGER, then Calcite
would only call getInt() (possibly following up with a call to isNull() if the column is nullable
and getInt() returned 0).

And similarly for other types.

So the developer who implements Cursor2 would only have to implement one getter per column:
the one that matches the column's declared type.

> I bet no one would be able to implement Date/Timestamp kind of fields from
> the first and even the second try (especially getting all the getters
> right).

Yeah, JDBC is excruciating to implement for datetime values. In Cursor2, time and date values
would be represented by int, and timestamp by long, milliseconds since the zoneless epoch.
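
A small sketch of what such an encoding could look like, using java.time. The units for
date and time are assumptions, since the text above only gives the unit for timestamps:
here a date is taken to be days since 1970-01-01 and a time to be milliseconds since
midnight. The class ZonelessEncoding and its method names are hypothetical, and the
"zoneless" epoch is modelled by treating the local date-time as if it were UTC.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

// Hypothetical helper; the int/long encodings are assumptions, not a Calcite API.
final class ZonelessEncoding {
  private ZonelessEncoding() {}

  // DATE as int: days since 1970-01-01 (assumed unit).
  static int dateToInt(LocalDate date) {
    return Math.toIntExact(date.toEpochDay());
  }

  // TIME as int: milliseconds since midnight (assumed unit).
  static int timeToInt(LocalTime time) {
    return (int) (time.toNanoOfDay() / 1_000_000L);
  }

  // TIMESTAMP as long: milliseconds since 1970-01-01 00:00:00, with no time zone applied.
  static long timestampToLong(LocalDateTime timestamp) {
    return timestamp.toInstant(ZoneOffset.UTC).toEpochMilli();
  }
}
```

With values encoded this way, a Cursor2 implementation would serve dates and times through
getInt and timestamps through getLong, with no datetime-specific getters.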

That said, I decided to represent rows as Object[] for now. I might come back to Cursor2 if
we need an efficient interface to other data sources.

Julian

