drill-dev mailing list archives

From Jason Altekruse <altekruseja...@gmail.com>
Subject Re: buffer allocation of cast into var length type
Date Wed, 04 Dec 2013 03:27:12 GMT
Jinfeng,

I did not even think of actually turning integers into ASCII. While I know it
is part of SQL, it seems like such a crazy thing to do in a short-lived
query on a large dataset.

I would take a look at the code we are using for the project operator; that
is the last time I remember discussing passing buffers between different
value vectors. There we used it simply to change the metadata for a
column, where all that the project involved was a column name change, not a
mathematical operation.

With regard to the more involved case where you need to convert an integer to
its ASCII representation, how would the consumer know how big a buffer
to allocate? Would there be a pre-processing step where you determine the
number of digits needed to represent the integers/doubles in base 10? For
integers I guess we could zero-pad them all to the same length, but that
seems like it wouldn't be worth it for the little time we would save by
skipping the scan through the dataset.
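
As a rough illustration of such a pre-processing step, here is a minimal
two-pass sketch in plain Java. It uses java.nio.ByteBuffer rather than Drill's
own buffer and value vector classes, and the names IntToVarCharTwoPass,
asciiLength, computeOffsets and writeAscii are hypothetical. The first pass
counts digits to build the offsets; the second writes into a buffer of exactly
the required size.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Drill code: size the output buffer before writing.
final class IntToVarCharTwoPass {

    /** Bytes needed for the base-10 ASCII form of v, including a leading '-'. */
    static int asciiLength(int v) {
        if (v == Integer.MIN_VALUE) {
            return 11; // "-2147483648"; Math.abs would overflow here
        }
        int len = v < 0 ? 1 : 0;
        int abs = Math.abs(v);
        do {
            len++;
            abs /= 10;
        } while (abs != 0);
        return len;
    }

    /** Pass 1: offsets[i] is where value i starts; the last entry is the total size. */
    static int[] computeOffsets(int[] values) {
        int[] offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            offsets[i + 1] = offsets[i] + asciiLength(values[i]);
        }
        return offsets;
    }

    /** Pass 2: write the digits into a buffer allocated at exactly the total size. */
    static ByteBuffer writeAscii(int[] values, int[] offsets) {
        ByteBuffer data = ByteBuffer.allocate(offsets[values.length]);
        for (int v : values) {
            // Integer.toString keeps the sketch short; it allocates a String per value.
            data.put(Integer.toString(v).getBytes(StandardCharsets.US_ASCII));
        }
        data.flip(); // make the written bytes readable
        return data;
    }
}
```

The cost is that every value is visited twice, which is the extra scan
mentioned above.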

Another option is that we could always over-allocate the buffers and then
slice off the excess, but there is no really good way to avoid waste.
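
A minimal sketch of the over-allocate-and-slice idea, again in plain Java with
java.nio.ByteBuffer standing in for Drill's buffers. The class
OverAllocateAndSlice and the method castToAscii are hypothetical names. It
reserves the worst case of 11 bytes per 32-bit integer, writes in a single
pass, and then slices the buffer down to the bytes actually written.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Drill code: single pass with worst-case allocation.
final class OverAllocateAndSlice {

    // Longest base-10 form of a 32-bit int: "-2147483648" is 11 bytes.
    static final int MAX_INT_ASCII_BYTES = 11;

    /**
     * Writes every value as ASCII and records start offsets in offsetsOut,
     * which must have length values.length + 1.
     */
    static ByteBuffer castToAscii(int[] values, int[] offsetsOut) {
        ByteBuffer scratch = ByteBuffer.allocate(values.length * MAX_INT_ASCII_BYTES);
        for (int i = 0; i < values.length; i++) {
            offsetsOut[i] = scratch.position();
            scratch.put(Integer.toString(values[i]).getBytes(StandardCharsets.US_ASCII));
        }
        offsetsOut[values.length] = scratch.position();
        scratch.flip();         // limit = bytes written, position = 0
        return scratch.slice(); // view covering only the written bytes
    }
}
```

Note that slice() returns a view over the same backing memory, so the unused
tail is hidden but not released. That is the waste referred to above.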

Not sure if we want to open this can of worms, but there is another
possible solution that is related to some thoughts I have around making the
parquet reader faster. It is possible that we might have to break our
design of a single column always being represented by a single buffer.

In cases like this where it is hard to know the final buffer length, it
might be easier to allocate a reasonable guess and then just tack on
another buffer if we guessed wrong. I know that one of the main goals of
value vectors is that they are random access, with minimal overhead for
value extraction, but I think this might be a case where it would be worth
breaking it.

The simple implementation might look like the variable length vectors, with
a metadata buffer sitting in front of the data to describe the ranges of values
held in each of the buffers, e.g. values 1-400 are in buffer 1 and values
401-1000 are in buffer 2. (I would assume we would never exceed 5 or so buffers,
but it could provide extra flexibility.)
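
The range metadata could be sketched roughly as follows. This is an
illustration only, using fixed-width ints and java.nio.ByteBuffer to keep it
short; ChunkedIntVector and its methods are hypothetical names, and doubling
the size of each additional buffer is an assumption made for the example, not
something proposed in this thread.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Drill code: one logical column backed by several buffers.
final class ChunkedIntVector {
    private final List<ByteBuffer> chunks = new ArrayList<>();
    // Range metadata: firstIndex.get(c) is the index of the first value in chunk c.
    private final List<Integer> firstIndex = new ArrayList<>();
    private final int initialValues; // the "reasonable guess" for the first buffer
    private int count = 0;

    ChunkedIntVector(int initialValues) {
        this.initialValues = initialValues;
    }

    void add(int value) {
        ByteBuffer last = chunks.isEmpty() ? null : chunks.get(chunks.size() - 1);
        if (last == null || !last.hasRemaining()) {
            // Guessed wrong: tack on another buffer instead of copying the old one.
            int values = (last == null) ? initialValues : 2 * last.capacity() / Integer.BYTES;
            last = ByteBuffer.allocate(values * Integer.BYTES);
            chunks.add(last);
            firstIndex.add(count);
        }
        last.putInt(value);
        count++;
    }

    /** Random access costs one extra lookup in the range metadata. */
    int get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int c = chunks.size() - 1;
        while (firstIndex.get(c) > index) {
            c--; // linear scan is fine when only a handful of chunks exist
        }
        return chunks.get(c).getInt((index - firstIndex.get(c)) * Integer.BYTES);
    }

    int chunkCount() {
        return chunks.size();
    }
}
```

The lookup in get(index) is the extra step of indirection on every value
extraction.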

To prevent the need for an extra step of indirection with each value
extraction, we could change the interfaces on value vectors a bit to make
them expose an iterator, rather than a get(index) method. This would allow
for fetching the first buffer, reading all of its values with the same
overhead as we have now, until we hit the end of the buffer, and then we
could rely on an exception to indicate we ran out of values and at that
time swap to the second buffer.
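
A minimal sketch of that sequential-read idea in plain Java. The class
MultiBufferIntReader is a hypothetical name, and the sketch assumes fixed-width
ints and that each buffer has already been flipped for reading. It relies on
the BufferUnderflowException that java.nio.ByteBuffer.getInt() throws when
fewer than four bytes remain.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch, not Drill code: read sequentially across several buffers.
final class MultiBufferIntReader {
    private final List<ByteBuffer> buffers; // each one ready for reading
    private int current = 0;

    MultiBufferIntReader(List<ByteBuffer> buffers) {
        this.buffers = buffers;
    }

    /** Returns the next value, swapping to the following buffer when one runs out. */
    int next() {
        while (current < buffers.size()) {
            try {
                // Common path: a plain relative read with no range lookup.
                return buffers.get(current).getInt();
            } catch (BufferUnderflowException endOfBuffer) {
                current++; // this buffer is exhausted, move to the next one
            }
        }
        throw new NoSuchElementException("no more values");
    }
}
```

The exception is raised once per buffer rather than once per value, so with
only a few buffers it stays off the common path.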

-Jason


On Tue, Dec 3, 2013 at 8:59 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:

> Hi Jason,
>
> Good question.
>
> Actually, some type casts are *binary coercible*, which means there is no
> need internally to do any conversion, for instance char --> varchar,
> varchar --> varbinary, etc.
>
> For other cases, some transformation is required, since the binary
> representation of source type is different from the binary representation
> of target type.
> For instance, int -> varchar. The target type needs to keep each digit of the
> integer, while the source type is a 4-byte representation.
>
> I will look into whether it's possible to use the buffer in the output
> value vector directly, without copying into new buffer.
>
>
>
>
>
> On Tue, Dec 3, 2013 at 6:29 PM, Jason Altekruse <altekrusejason@gmail.com> wrote:
>
> > Hi Jinfeng,
> >
> > This might be a dumb question, but is there any transformation being
> > performed when going from a fixed length type to a variable length type?
> > That is, are the bytes in the buffer coming in going to be the same as the
> > bytes coming out of the cast?
> >
> > I understand that for casts like int-> long we need to add extra space
> > between each value, but is it possible that we could just hand the buffer
> > from one value vector type to the other without copying it into a new
> > buffer?
> >
> > We would still have to create a new buffer with the offsets of the
> > "variable length" values, but it would save us some time if we could do
> > this.
> >
> > -Jason Altekruse
> >
> >
> > On Tue, Dec 3, 2013 at 5:35 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I'm working on the explicit cast support in Drill. So far, I have prototyped
> > > the implementation for the first 3 categories, and would like to seek input
> > > from you regarding how to deal with the buffer allocation for casts from
> > > fixed-length types into var-length types.
> > >
> > > 1. cast from fixed-length type to fixed-length type
> > > eg:   float4 --> int,
> > >         int -> float4,
> > >
> > > 2. cast from var-length type to fixed-length type
> > > eg: varchar --> int
> > >       varbinary --> int
> > > (Still need to figure out how to handle overflow when casting)
> > >
> > > 3. cast from fixed-length type to var-length type
> > > eg:  int  -> varchar
> > >        bigint -> varbinary
> > >
> > > 4. cast from var-length type to var-length type
> > > eg:   varchar --> varchar
> > >         varbinary --> varchar
> > >
> > > For the 3rd one, i.e. from fixed-length to var-length type, it causes some
> > > problems for the current implementation, in terms of buffer allocation.
> > >
> > > For fixed-length types, Drill uses a Java primitive type in the ValueHolder.
> > > For instance, IntHolder.value is an int. But for var-length types, Drill
> > > uses a buffer to keep the value. When casting from int into varchar,
> > > the buffer for the VarCharHolder is not allocated, and we have to figure
> > > out a way to do the allocation before the cast.
> > >
> > > There seem to be 2 options:
> > > Option 1: allocate a buffer in the function template's setup() method. The
> > > buffer will be used in the eval() method.
> > > Problems with this option:
> > > 1) It needs two copies: first from the fixed-type input into the buffer
> > > allocated in setup(), then from that buffer into the buffer in the
> > > target vector.
> > > 2) It needs a cleanup() method added to the function template to release the
> > > allocated buffer; no such method currently exists in the code base.
> > >
> > > Option 2: the consumer of the output of the cast function will be responsible
> > > for pre-allocating the buffer in the target ValueVector for all the
> > > VarCharHolder() instances. The cast function will simply do the conversion and
> > > copy into the pre-allocated buffer in the target ValueVector.
> > > The good thing about this option is that it requires only 1 copy.
> > >
> > > I have prototyped the 1st option, and have not figured out how to implement
> > > the 2nd approach yet. But I would like to seek suggestions regarding those 2
> > > options before I proceed.
> > >
> > > Thanks!
> > >
> >
>
