ignite-dev mailing list archives

From Valentin Kulichenko <valentin.kuliche...@gmail.com>
Subject Re: Custom string encoding
Date Mon, 03 Jul 2017 22:27:11 GMT
Yes, this needs to be tested and confirmed. I will work on it.

Would be great to get more details about indexes. I'm not sure I understand
the limitation there.

-Val

On Mon, Jul 3, 2017 at 7:21 AM, Dmitriy Setrakyan <dsetrakyan@apache.org>
wrote:

> Agree with Valya on the system-wide default. We need to have it.
>
> Also, are we certain that the encoding will provide 1-byte length for UTF-8
> for different languages? Would be nice to test it to confirm, as it has a
> potential to decrease the Ignite storage space by 2x in certain cases.
>
> D.
>
> On Sun, Jul 2, 2017 at 12:26 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
>
> > Vova,
> >
> > That's actually a good point. Probably that would be enough and there is no
> > need to introduce an abstract encoder. However, I still think it makes sense
> > to specify the default encoding in BinaryConfiguration and
> > BinaryTypeConfiguration.
> >
> > -Val
> >
> > On Sun, Jul 2, 2017 at 10:31 AM Vladimir Ozerov <vozerov@gridgain.com>
> > wrote:
> >
> > > Yes, this is exactly what non-UTF8 encodings do.
> > >
> > > On Sun, Jul 2, 2017 at 20:08, Dmitriy Setrakyan <dsetrakyan@apache.org> wrote:
> > >
> > > > On Sun, Jul 2, 2017 at 9:50 AM, Vladimir Ozerov <vozerov@gridgain.com>
> > > > wrote:
> > > >
> > > > > There is no need for custom encoders, as they are already built into
> > > > > Java.
> > > > >
> > > >
> > > > Will non-ASCII encodings fit into 1 byte? The whole point here is to save
> > > > space.
> > > >
> > > >
> > > > >
> > > > > On Sun, Jul 2, 2017 at 19:16, Dmitriy Setrakyan <dsetrakyan@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Vladimir, how would you plug in custom encoders in your design?
> > > > > >
> > > > > > On Sat, Jul 1, 2017 at 11:53 PM, Vladimir Ozerov <vozerov@gridgain.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Valya,
> > > > > > >
> > > > > > > Personally, I vote against this feature. BinaryConfiguration has proven
> > > > > > > to be inconvenient: it has to be configured before node start, it
> > > > > > > cannot be changed at runtime, and it requires classes on the server.
> > > > > > > Moreover, if you decide to change the encoding at some point, that
> > > > > > > would be impossible.
> > > > > > >
> > > > > > > I think we should add this feature at the API level instead. If a
> > > > > > > string is written in non-UTF-8 form, we will write it in a different
> > > > > > > format:
> > > > > > > [encoding_code][string]
> > > > > > >
> > > > > > > BinaryWriter.writeString(String fieldName, String val);
> > > > > > > BinaryWriter.writeString(String fieldName, String val, *String encoding*);
> > > > > > >
> > > > > > > BinaryReader.readString(String fieldName);
> > > > > > > BinaryReader.readString(String fieldName, *String encoding*);
> > > > > > >
> > > > > > > BinaryObjectBuilder.writeString(String fieldName, String val, *String encoding*);
> > > > > > >
> > > > > > > class MyClass {
> > > > > > >     *@BinaryString(encoding = "Cp1251")*
> > > > > > >     private String myCyrillicString;
> > > > > > > }
> > > > > > >
> > > > > > > Vladimir.
> > > > > > >
> > > > > > > On Sat, Jul 1, 2017 at 7:26 PM, Dmitriy Setrakyan <dsetrakyan@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <sergi.vladykin@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > In SQL indexes we may store partial strings and assume them to be
> > > > > > > > > in UTF-8; I don't think this can be abstracted away. But maybe
> > > > > > > > > this is not a big deal if we still use UTF-8 in indexes.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Sergi, why does it matter whether it is UTF-8 or a custom encoding?
> > > > > > > > Why can't we use our own compact encoding in indexes?
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 2017-07-01 10:13 GMT+03:00 Dmitriy Setrakyan <dsetrakyan@apache.org>:
> > > > > > > > >
> > > > > > > > > > Val, do you know how we compare strings in SQL queries? Will we
> > > > > > > > > > be able to use this encoder?
> > > > > > > > > >
> > > > > > > > > > Additionally, I think that the encoder is a bit too abstract.
> > > > > > > > > > Why not go even further and allow users to create their own
> > > > > > > > > > ASCII table for encoding?
> > > > > > > > > >
> > > > > > > > > > D.
> > > > > > > > > >
> > > > > > > > > > On Fri, Jun 30, 2017 at 6:49 PM, Valentin Kulichenko <
> > > > > > > > > > valentin.kulichenko@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Andrey,
> > > > > > > > > > >
> > > > > > > > > > > Can you elaborate more on this? What is your concern?
> > > > > > > > > > >
> > > > > > > > > > > -Val
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jun 30, 2017 at 6:17 PM Andrey Mashenkov <
> > > > > > > > > > > andrey.mashenkov@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Val,
> > > > > > > > > > > >
> > > > > > > > > > > > Looks like it makes sense.
> > > > > > > > > > > >
> > > > > > > > > > > > This will not affect the FullText index, as Lucene has its
> > > > > > > > > > > > own format for storing data.
> > > > > > > > > > > >
> > > > > > > > > > > > But... would it be compatible with H2 indexing? I doubt it.
> > > > > > > > > > > >
> > > > > > > > > > > > On Jul 1, 2017 at 2:27, "Valentin Kulichenko"
> > > > > > > > > > > > <valentin.kulichenko@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Currently the binary marshaller always encodes strings in
> > > > > > > > > > > > > UTF-8. However, sometimes it can be useful to customize
> > > > > > > > > > > > > this. For example, if data contains a lot of Cyrillic,
> > > > > > > > > > > > > Chinese or other symbols, but not so many Latin symbols,
> > > > > > > > > > > > > memory is used very inefficiently. In this case it would
> > > > > > > > > > > > > be great to encode the most frequently used symbols in
> > > > > > > > > > > > > one byte instead of two or three.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I propose to introduce a BinaryStringEncoder interface
> > > > > > > > > > > > > that will convert strings to byte arrays and back, and
> > > > > > > > > > > > > make it pluggable via BinaryConfiguration. This will
> > > > > > > > > > > > > allow users to plug in any encoding algorithms based on
> > > > > > > > > > > > > their requirements.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5655
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Val
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
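
The size question raised in the thread (whether non-Latin text fits into one byte per character) can be checked directly with the charsets built into the JDK. The following is a minimal sketch, not Ignite code: the class and method names are illustrative, and it assumes the running JDK provides the windows-1251 charset (the same charset as the "Cp1251" alias used in the thread). String literals use Unicode escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name, not part of Ignite).
final class EncodingSizeCheck {
    /** Number of bytes needed to encode s in the given charset. */
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    /** True if every character of s can be represented in the charset without loss. */
    static boolean lossless(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // Assumption: windows-1251 is available in this JDK.
        Charset cp1251 = Charset.forName("windows-1251");

        String[] samples = {
            "Hello",                                // Latin
            "\u041f\u0440\u0438\u0432\u0435\u0442", // Cyrillic, 6 letters
            "\u4f60\u597d"                          // Chinese, 2 ideographs
        };
        Charset[] charsets = {StandardCharsets.UTF_8, cp1251, StandardCharsets.UTF_16BE};

        for (String s : samples) {
            for (Charset cs : charsets) {
                // getBytes() silently substitutes unmappable characters, so flag lossy cases.
                System.out.printf("%-8s %-14s %2d bytes%s%n",
                    s, cs.name(), encodedLength(s, cs), lossless(s, cs) ? "" : " (lossy)");
            }
        }
    }
}
```

For the six-letter Cyrillic sample this gives 12 bytes in UTF-8 and 6 bytes in windows-1251, which is the 2x saving mentioned above. The two-character Chinese sample takes 6 bytes in UTF-8 and 4 in UTF-16BE, and cannot be represented in windows-1251 at all, so a single-byte encoding only helps for alphabets small enough to fit one code page.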

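The [encoding_code][string] layout proposed in the thread could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the one-byte code table, the 4-byte length prefix, and the class and method names are hypothetical and do not describe Ignite's actual binary format or API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a self-describing string encoding; not Ignite code.
final class EncodedStringSketch {
    // Assumed code table: the array index is the one-byte encoding code.
    private static final Charset[] CHARSETS = {
        StandardCharsets.UTF_8,          // code 0
        Charset.forName("windows-1251"), // code 1 (assumes the JDK provides it)
        StandardCharsets.UTF_16BE        // code 2
    };

    /** Writes [encoding_code:1 byte][length:4 bytes][encoded string bytes]. */
    static byte[] write(String val, int code) {
        byte[] payload = val.getBytes(CHARSETS[code]);
        return ByteBuffer.allocate(1 + 4 + payload.length)
            .put((byte) code)
            .putInt(payload.length)
            .put(payload)
            .array();
    }

    /** Reads a value produced by write(), picking the charset from the stored code. */
    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        Charset cs = CHARSETS[buf.get()];
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        return new String(payload, cs);
    }

    public static void main(String[] args) {
        String cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"; // 6 Cyrillic letters
        byte[] asUtf8 = write(cyrillic, 0);
        byte[] asCp1251 = write(cyrillic, 1);
        System.out.println("UTF-8 record:  " + asUtf8.length + " bytes");
        System.out.println("Cp1251 record: " + asCp1251.length + " bytes");
        System.out.println("Round trip ok: " + read(asCp1251).equals(cyrillic));
    }
}
```

Because the code travels with each value, a reader does not need any node-level configuration to decode it, which is the property argued for in the thread. The trade-off is one extra byte per string.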