drill-dev mailing list archives

From Jinfeng Ni <jinfengn...@gmail.com>
Subject Re: buffer allocation of cast into var length type
Date Wed, 04 Dec 2013 16:54:05 GMT
Hi Jason,




On Tue, Dec 3, 2013 at 7:27 PM, Jason Altekruse <altekrusejason@gmail.com> wrote:

>
> In regards to the more involved case where you need to convert an integer to
> its ASCII representation, how would the consumer know how big of a buffer
> to allocate? Would there be a pre-processing step where you determine the
> number of digits needed to represent the integers/doubles in base 10? For
> integers I guess we could zero fill them all to the same length, but that
> seems like it wouldn't be worth it for the little time we would save
> scanning through the dataset.
>
>
For the cast function, it seems simple: the user would specify the max length of
the target VARCHAR type, e.g. VARCHAR(10).
If the length is not big enough, truncation would happen during the
conversion, and a warning might be raised.
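
To make the truncation behavior concrete, here is a minimal Java sketch of
an int -> VARCHAR(n) cast. The class, the Result type and the method name are
hypothetical and not part of the Drill code base; it only assumes that the
target length counts bytes of the ASCII decimal form.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration only; not Drill code. */
final class CastTruncationSketch {

    /** The bytes that fit in the target type, plus a flag the caller could turn into a warning. */
    static final class Result {
        final byte[] bytes;
        final boolean truncated;

        Result(byte[] bytes, boolean truncated) {
            this.bytes = bytes;
            this.truncated = truncated;
        }
    }

    /** Casts an int to its ASCII decimal form, keeping at most maxLength bytes. */
    static Result castIntToVarChar(int value, int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        byte[] full = Integer.toString(value).getBytes(StandardCharsets.US_ASCII);
        if (full.length <= maxLength) {
            return new Result(full, false);
        }
        // Value does not fit in VARCHAR(maxLength): keep the leading bytes and report it.
        return new Result(Arrays.copyOf(full, maxLength), true);
    }
}
```

Whether a truncated numeric string should be a warning or an error is a
separate design decision; the sketch only reports that truncation happened.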

However, as far as I know, the current Drill code allocates a
pre-determined length for VarCharVector / VarBinaryVector (correct me if
I'm wrong). This makes sense when reading a schemaless Parquet file, since
the Parquet reader does not know the actual length of each column. But in
the cast case we do know the max length of the target type. In that
sense, I feel that VarCharVector / VarBinaryVector need a way to specify
the max length when we know the target type.
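
As a rough illustration of how a known target type bounds the allocation,
the sketch below computes the data bytes needed for a batch. The names are
hypothetical (this is not an existing VarCharVector API), and it assumes the
data buffer only needs min(declared length, longest possible source text)
bytes per value.

```java
/** Hypothetical sizing helper for illustration only; not Drill code. */
final class CastBufferSizingSketch {

    // Longest ASCII decimal form of each fixed-length type, sign included.
    static final int MAX_INT_CHARS = Integer.toString(Integer.MIN_VALUE).length();  // 11
    static final int MAX_BIGINT_CHARS = Long.toString(Long.MIN_VALUE).length();     // 20

    /**
     * Upper bound on the data bytes needed to cast recordCount values into
     * VARCHAR(declaredLength), given the longest text the source type can produce.
     */
    static long dataBytesNeeded(int recordCount, int declaredLength, int maxSourceChars) {
        if (recordCount < 0 || declaredLength < 0 || maxSourceChars < 0) {
            throw new IllegalArgumentException("arguments must be >= 0");
        }
        int perValue = Math.min(declaredLength, maxSourceChars);
        return (long) recordCount * perValue;
    }
}
```

For example, casting 4096 ints to VARCHAR(10) needs at most 40960 data bytes
under these assumptions, and a wider VARCHAR never needs more than 11 bytes
per int.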

Another issue with a pre-determined length is that the buffer may not be big
enough to hold all the incoming data. A cast from fixed-length input does not
have a serious problem here, since we know the max length. But for other
functions, like string concat, this pre-determined length may be an
issue.
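
One possible way to handle the concat case is to grow the buffer when the
guess turns out too small. The Java sketch below is only an illustration of
that idea using plain arrays; the class is hypothetical, it is not how Drill
vectors work today, and it ignores int overflow for very large buffers.

```java
import java.util.Arrays;

/** Hypothetical growable var-length store for illustration only; not Drill code. */
final class GrowableVarLenBufferSketch {
    private byte[] data;
    private int[] offsets;  // value i occupies data[offsets[i] .. offsets[i + 1])
    private int count;

    GrowableVarLenBufferSketch(int initialBytes, int initialValues) {
        data = new byte[Math.max(1, initialBytes)];
        offsets = new int[Math.max(1, initialValues) + 1];
    }

    /** Appends left followed by right as one value, growing the buffers if needed. */
    void appendConcat(byte[] left, byte[] right) {
        int start = offsets[count];
        int end = start + left.length + right.length;
        if (end > data.length) {
            // The pre-determined size was too small: at least double it.
            data = Arrays.copyOf(data, Math.max(end, data.length * 2));
        }
        if (count + 1 >= offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        System.arraycopy(left, 0, data, start, left.length);
        System.arraycopy(right, 0, data, start + left.length, right.length);
        offsets[++count] = end;
    }

    /** Returns a copy of the value at index. */
    byte[] get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return Arrays.copyOfRange(data, offsets[index], offsets[index + 1]);
    }

    int size() {
        return count;
    }

    int capacity() {
        return data.length;
    }
}
```

Growing by copying keeps a single contiguous buffer per column, at the cost
of an extra copy whenever the initial guess is wrong.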


> Another option is that we could always over-allocate the buffers and then
> slice off the excess, but there is no really good way to avoid waste.
>
> Not sure if we want to open this can of worms, but there is another
> possible solution that is related to some thoughts I have around making the
> parquet reader faster. It is possible that we might have to break our
> design of a single column always being represented by a single buffer.
>
> In cases like this where it is hard to know the final buffer length, it
> might be easier to allocate a reasonable guess and then just tack on
> another buffer if we guessed wrong. I know that one of the main goals of
> value vectors is that they are random access, with minimal overhead for
> value extraction, but I think this might be a case where it would be worth
> breaking it.
>
> The simple implementation might look like the variable length vectors, with
> a metadata buffer sitting in front of the data to describe ranges of values
> held in each of the buffers, i.e. values 1-400 are in buffer 1 and 401-1000
> are in buffer 2. (I would assume we would never exceed 5 or so buffers, but it
> could provide extra flexibility.)
>

I'm trying to look at how to copy into the buffer in the outgoing
record batch directly, instead of copying into a temp buffer. This seems to
require a change in the code generator for the function. I'll look into it
and will keep you updated.
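
To show what writing directly into the outgoing buffer could look like, here
is a minimal Java sketch that formats an int straight into a destination
byte array at a given position, with no temporary buffer. The class and
method are hypothetical, and a plain byte[] stands in for the vector's data
buffer.

```java
/** Hypothetical direct writer for illustration only; not Drill code. */
final class DirectIntWriterSketch {

    /**
     * Writes value in ASCII decimal into dest starting at pos and returns the
     * number of bytes written. The caller would record pos + length as the
     * next offset of the var-length value.
     */
    static int writeInt(int value, byte[] dest, int pos) {
        long v = value;               // widen so negating Integer.MIN_VALUE cannot overflow
        boolean negative = v < 0;
        if (negative) {
            v = -v;
        }
        int digits = 1;
        for (long t = v; t >= 10; t /= 10) {
            digits++;
        }
        int length = digits + (negative ? 1 : 0);
        if (pos < 0 || pos + length > dest.length) {
            throw new IndexOutOfBoundsException("not enough room at position " + pos);
        }
        int firstDigit = pos + (negative ? 1 : 0);
        if (negative) {
            dest[pos] = '-';
        }
        // Fill the digits from least significant to most significant.
        for (int i = pos + length - 1; i >= firstDigit; i--) {
            dest[i] = (byte) ('0' + (v % 10));
            v /= 10;
        }
        return length;
    }
}
```

Because the digit count is computed before any byte is written, the same
routine could also tell the caller how much space a value needs.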

Thanks!


> To prevent the need for an extra step of indirection with each value
> extraction, we could change the interfaces on value vectors a bit to make
> them expose an iterator, rather than a get(index) method. This would allow
> for fetching the first buffer, reading all of its values with the same
> overhead as we have now, until we hit the end of the buffer, and then we
> could rely on an exception to indicate we ran out of values and at that
> time swap to the second buffer.
>
> -Jason
>
>
> On Tue, Dec 3, 2013 at 8:59 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
>
> > Hi Jason,
> >
> > Good question.
> >
> > Actually, some type casts are *binary coercible*, meaning there is no
> > need to do any conversion internally, for instance char --> varchar,
> > varchar --> varbinary, etc.
> >
> > For other cases, some transformation is required, since the binary
> > representation of source type is different from the binary representation
> > of target type.
> > For instance, int -> varchar. The target type needs to keep each digit of the
> > integer, while the source type is a 4-byte representation.
> >
> > I will look into whether it's possible to use the buffer in the output
> > value vector directly, without copying into new buffer.
> >
> >
> >
> >
> >
> > On Tue, Dec 3, 2013 at 6:29 PM, Jason Altekruse <altekrusejason@gmail.com> wrote:
> >
> > > Hi Jinfeng,
> > >
> > > This might be a dumb question, but is there any transformation being
> > > performed when going from a fixed length type to a variable length type?
> > > That is, are the bytes in the buffer coming in going to be the same as the
> > > bytes coming out of the cast?
> > >
> > > I understand that for casts like int-> long we need to add extra space
> > > between each value, but is it possible that we could just hand the buffer
> > > from one value vector type to the other without copying it into a new
> > > buffer?
> > >
> > > We would still have to create a new buffer with the offsets of the
> > > "variable length" values, but it would save us some time if we could do
> > > this.
> > >
> > > -Jason Altekruse
> > >
> > >
> > > On Tue, Dec 3, 2013 at 5:35 PM, Jinfeng Ni <jinfengni99@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm working on the explicit cast support in Drill. So far, I have prototyped
> > > > the implementation for the first 3 categories, and would like to seek input
> > > > from you regarding how to deal with the buffer allocation for casts from
> > > > fixed-length types into var-length types.
> > > >
> > > > 1. cast from fixed-length type to fixed-length type
> > > > eg:   float4 --> int,
> > > >         int -> float4,
> > > >
> > > > 2. cast from var-length type to fixed-length type
> > > > eg: varchar --> int
> > > >       varbinary --> int
> > > > (Still need to figure out how to handle overflow issue when cast)
> > > >
> > > > 3. cast from fixed-length type to var-length type
> > > > eg:  int  -> varchar
> > > >        bigint -> varbinary
> > > >
> > > > 4. cast from var-length type to var-length type
> > > > eg:   varchar --> varchar
> > > >         varbinary --> varchar
> > > >
> > > > The 3rd one, i.e. from fixed-length to var-length type, causes some
> > > > problems for the current implementation in terms of buffer allocation.
> > > >
> > > > For fixed-length types, Drill uses a Java primitive type in the ValueHolder.
> > > > For instance, IntHolder.value is an int. But for var-length types, Drill
> > > > will use a buffer to keep the value. When casting from int into varchar,
> > > > the buffer for the VarCharHolder is not allocated, and we have to figure
> > > > out a way to do the allocation before the cast.
> > > >
> > > > There seem to be 2 options:
> > > > Option 1:  allocate a buffer in the function template's setup() method. The
> > > > buffer will be used in the eval() method.
> > > > Problems with this option:
> > > > 1) need to copy twice: first from the fixed-type input into the buffer
> > > > allocated in setup(), then from that buffer into the buffer in the
> > > > target vector.
> > > > 2) need to add a cleanup() method to the function template to release the
> > > > allocated buffer; no such method currently exists in the code base.
> > > >
> > > > Option 2:  the consumer of the output of the cast function will be responsible
> > > > for pre-allocating the buffer in the target ValueVector for all the
> > > > VarCharHolder().  The cast function will simply do the conversion and copy
> > > > into the pre-allocated buffer in the target ValueVector.
> > > > The good thing about this option is that it requires only 1 copy.
> > > >
> > > > I have prototyped the 1st option, and have not figured out how to implement
> > > > the 2nd approach yet. But I would like to seek suggestions regarding those 2
> > > > options before I proceed.
> > > >
> > > > Thanks!
> > > >
> > >
> >
>
