ignite-dev mailing list archives

From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Re: Compression prototype
Date Tue, 28 Aug 2018 15:37:31 GMT
Hello!

Yes, we can tinker with the BinaryObject format, which is currently clearly
excessive.

But the best part of compression is that it will automatically remove this
redundancy for us, for free. Even if we had hairy XML as the binary object
format, it would still compress to roughly the same number of bytes. If we
have fast transparent compression, we can just skip this work. Of course,
codifying offsets can have other uses, but it also has a lot of
limitations.
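
A minimal sketch of the redundancy argument, assuming Python's standard zlib
module as a stand-in for whatever codec would actually be used. The record
layout, field names and XML-like tags are made up for illustration and are not
the real BinaryObject format. It encodes the same records compactly and
verbosely, then compares how far apart the two are before and after
compression:

```python
import random
import struct
import zlib

random.seed(42)  # fixed seed so repeated runs see the same data

# Hypothetical records (id, age, balance); not Ignite's BinaryObject layout.
records = [
    (random.getrandbits(31), random.randint(18, 90), random.getrandbits(31))
    for _ in range(5000)
]

# Compact encoding: three fixed-width little-endian integers per record.
compact = b"".join(struct.pack("<iii", *r) for r in records)

# Deliberately verbose XML-like encoding of exactly the same values.
verbose = "".join(
    '<object type="person">'
    f'<field name="id" type="int">{i}</field>'
    f'<field name="age" type="int">{a}</field>'
    f'<field name="balance" type="int">{b}</field>'
    "</object>"
    for i, a, b in records
).encode("ascii")

compact_z = zlib.compress(compact, 6)
verbose_z = zlib.compress(verbose, 6)

# How many times larger the verbose form is, before and after compression.
raw_gap = len(verbose) / len(compact)
compressed_gap = len(verbose_z) / len(compact_z)

print(f"raw:        compact={len(compact)} verbose={len(verbose)} gap={raw_gap:.1f}x")
print(f"compressed: compact={len(compact_z)} verbose={len(verbose_z)} gap={compressed_gap:.1f}x")
```

The repeated tags collapse into short back-references, so the gap between the
two encodings shrinks after compression. How close they end up depends on the
data and on the codec, so this only illustrates the direction of the effect.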

Regards,
-- 
Ilya Kasnacheev


Tue, 28 Aug 2018 at 18:30, Vyacheslav Daradur <daradurvs@gmail.com>:

> I have another suggestion which may help us reduce object size
> dramatically: implementing some kind of SQL schema.
>
> For now, the BinaryObject format is excessive: each serialized
> object stores the offset of every serialized field, even if the offset
> can be easily calculated.
>
> If we move this metadata from the serialized object to a separate
> entity, this will reduce the object's size.
> On Mon, Aug 27, 2018 at 2:53 PM Vyacheslav Daradur <daradurvs@gmail.com>
> wrote:
> >
> > > According to my benchmarks, the zstd compression algorithm [1] looks
> > > very interesting: it has a high compression ratio with quite good speed.
> > > AFAIK it supports external dictionaries, but I'm not sure about using
> > > it with dictionaries built "on the fly". Anyway, have a look at it (it
> > > has an ASF 2.0 friendly license).
> > >
> > > Also, here is a data generator / loader [2]. If it would be useful for
> > > you, we should ask Nikolay Izhikov to share public docs to start.
> >
> > [1] https://github.com/facebook/zstd
> > [2] https://github.com/nizhikov/ignite-cod-data-loader
> > On Mon, Aug 27, 2018 at 2:11 PM Ilya Kasnacheev
> > <ilya.kasnacheev@gmail.com> wrote:
> > >
> > > Hello Vyacheslav!
> > >
> > > > Unfortunately, I have not found any efficient algorithm that would
> > > > allow me to use an external dictionary as a pre-processed data
> > > > structure. If plain gzip is used without a dictionary, the compression
> > > > ratio is around 0.7, as opposed to the 0.4 that I get with the custom
> > > > implementation; AFAIR the performance was also worse. I didn't really
> > > > try it with a dictionary, but I assume performance will be even worse,
> > > > since it will have to scan the dictionary before getting to the actual
> > > > data.
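
For reference, a minimal sketch of what a preset (external) dictionary looks
like with stock DEFLATE, assuming Python's standard zlib module. The dictionary
contents and the sample record are invented for illustration. Note that it is
the zlib container that carries the preset-dictionary flag, not the gzip file
format. The sketch says nothing about speed and does not cover a pre-processed
dictionary structure; it only shows the size effect on one small object and
what happens when the dictionary is missing at read time:

```python
import zlib
from typing import Optional

# Hypothetical dictionary: byte strings expected to recur across many small
# objects. A real one would be gathered from sample data.
DICTIONARY = (
    b'{"firstName":"","lastName":"","city":"","country":"","status":"active"}'
)


def compress(payload: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Deflate payload, optionally primed with a preset dictionary."""
    if zdict is None:
        c = zlib.compressobj(level=6)
    else:
        c = zlib.compressobj(level=6, zdict=zdict)
    return c.compress(payload) + c.flush()


def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Inflate blob; the dictionary used for compression must be supplied."""
    d = zlib.decompressobj() if zdict is None else zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


record = (
    b'{"firstName":"Ivan","lastName":"Petrov","city":"Moscow",'
    b'"country":"RU","status":"active"}'
)

plain = compress(record)               # no dictionary
primed = compress(record, DICTIONARY)  # with preset dictionary

print(f"raw={len(record)} plain={len(plain)} primed={len(primed)}")

# Without the dictionary the primed blob cannot be read back at all.
try:
    decompress(primed)
    readable_without_dict = True
except zlib.error:
    readable_without_dict = False
print("readable without dictionary:", readable_without_dict)
```

The last step is the failure mode raised elsewhere in this thread: data
compressed against a dictionary is only as durable as the dictionary itself.
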
> > >
> > > > We have such a huge array of tests that we can just run them all with
> > > > compression enabled and see if there are any new failures. But the
> > > > impact of my commit is fairly low: it is only triggered when data is
> > > > written to a page (maybe to WAL also?), and we don't really do much
> > > > frivolous stuff to pages.
> > >
> > > > Still, I am very much interested in finding existing compression
> > > > implementations with support for an external dictionary. I am also
> > > > very much interested in having different implementations of
> > > > compression for Apache Ignite (such as per-page compression) and
> > > > comparing them by benchmark and by code impact. I am also very
> > > > interested in large standard datasets for Apache Ignite (or generators
> > > > thereof) so that we can run precise benchmarks on various compression
> > > > schemes. If you have any of the above, please get back to me.
> > >
> > > Regards,
> > > --
> > > Ilya Kasnacheev
> > >
> > >
> > > > Mon, 27 Aug 2018 at 11:35, Vyacheslav Daradur <daradurvs@gmail.com>:
> > >
> > > > Hi Igniters!
> > > >
> > > > Ilya, I'm glad to see one more person who is interested in the
> > > > compression feature in Ignite.
> > > >
> > > > > I looked through the pull request and want to share the following
> > > > > thoughts:
> > > >
> > > > > It's very dangerous to use a custom algorithm in this way: you store
> > > > > serialized data separately from the dictionary, and there are a lot
> > > > > of points where we may lose data: rebalancing, serialization errors,
> > > > > node reboots and so on.
> > > > >
> > > > > I'd suggest the following ways to improve reliability:
> > > > > - use well-known algorithms (e.g. zstd, deflater, lzma, gzip) that
> > > > > allow us to decompress data in any situation
> > > > > - store the dictionary inside the page with the data
> > > > >
> > > > > Also, we have had a lot of discussions [1] [2] about compression at
> > > > > the BinaryObject and BinaryMarshaller level, and Vladimir Ozerov was
> > > > > strictly against compression at this level.
> > > > > If something has changed since then, you may look through [1] [2] [3].
> > > > > I've done a lot of research comparing algorithms; it may be useful
> > > > > for you.
> > > >
> > > > [1]
> > > >
> http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-2-0-td10099.html
> > > > [2]
> > > >
> http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-td20679.html
> > > > [3] https://issues.apache.org/jira/browse/IGNITE-3592
> > > > [4] https://issues.apache.org/jira/browse/IGNITE-5226
> > > > [5] https://github.com/daradurvs/ignite-compression
> > > > > On Sat, Aug 25, 2018 at 2:51 AM Denis Magda <dmagda@apache.org> wrote:
> > > > >
> > > > > >
> > > > > > > Currently, the dictionary for decompression is only stored on
> > > > > > > heap. After restart there's compressed data in the PDS, but
> > > > > > > there's no dictionary :)
> > > > >
> > > > >
> > > > > > Basically, it means that I've lost my data, right? How about
> > > > > > persisting data to disk?
> > > > > >
> > > > > > Overall, we need Vladimir Ozerov to check the contribution. He was
> > > > > > the one who sponsored the IEP and knows the area best.
> > > > >
> > > > > --
> > > > > Denis
> > > > >
> > > > > > On Fri, Aug 24, 2018 at 4:31 AM Ilya Kasnacheev
> > > > > > <ilya.kasnacheev@gmail.com> wrote:
> > > > >
> > > > > > Hello!
> > > > > >
> > > > > > > It is somewhat a part of IEP-20, since I have updated it with
> > > > > > > this particular direction.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > --
> > > > > > Ilya Kasnacheev
> > > > > >
> > > > > > 2018-08-24 2:56 GMT+03:00 Denis Magda <dmagda@apache.org>:
> > > > > >
> > > > > > > Hi Ilya,
> > > > > > >
> > > > > > > > Sounds terrific! Is this part of the following Ignite
> > > > > > > > enhancement proposal?
> > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
> > > > > > >
> > > > > > > --
> > > > > > > Denis
> > > > > > >
> > > > > > > > On Thu, Aug 23, 2018 at 5:17 AM Ilya Kasnacheev
> > > > > > > > <ilya.kasnacheev@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello!
> > > > > > > >
> > > > > > > > > My plan was to add a compression section to the cache
> > > > > > > > > configuration, where you can enable compression, enable key
> > > > > > > > > compression (which has heavier performance implications),
> > > > > > > > > adjust dictionary gathering settings, and in the future
> > > > > > > > > possibly choose between algorithms. In fact I'm not sure
> > > > > > > > > about the last one, since my assumption is that you can
> > > > > > > > > always just use the latest and greatest, but maybe we can
> > > > > > > > > have e.g. a very fast but not very strong one vs. a slower
> > > > > > > > > but stronger one.
> > > > > > > > >
> > > > > > > > > I'm not sure yet whether we should share a dictionary between
> > > > > > > > > all caches or have a separate dictionary for every cache.
> > > > > > > > >
> > > > > > > > > With regard to the data format, of course there will be room
> > > > > > > > > for further extension.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > --
> > > > > > > > Ilya Kasnacheev
> > > > > > > >
> > > > > > > > 2018-08-23 15:13 GMT+03:00 Sergey Kozlov <
> skozlov@gridgain.com>:
> > > > > > > >
> > > > > > > > > Hi Ilya
> > > > > > > > >
> > > > > > > > > > Is there a plan to introduce it as an option of the Ignite
> > > > > > > > > > configuration? In that case, instead of a boolean type I
> > > > > > > > > > suggest using an enum, reserving the ability to add
> > > > > > > > > > compression algorithms in the future.
> > > > > > > > >
> > > > > > > > > > On Thu, Aug 23, 2018 at 1:09 PM, Ilya Kasnacheev
> > > > > > > > > > <ilya.kasnacheev@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hello!
> > > > > > > > > >
> > > > > > > > > > > I want to share my compression prototype with the
> > > > > > > > > > > developer community.
> > > > > > > > > >
> > > > > > > > > > > Long story short, it compresses BinaryObject byte[]
> > > > > > > > > > > values as they are written to a Durable Memory page,
> > > > > > > > > > > operating on a pre-built dictionary. Typical compression
> > > > > > > > > > > ratio is 0.4 (meaning 2.5x compression) using custom
> > > > > > > > > > > LZW+Huffman. Metadata, indexes and primitive values are
> > > > > > > > > > > entirely unaffected.
> > > > > > > > > >
> > > > > > > > > > > This is akin to DB2's table-level compression [1], but
> > > > > > > > > > > independently invented.
> > > > > > > > > >
> > > > > > > > > > > On Yardstick tests the performance hit is -6% with PDS
> > > > > > > > > > > and up to -25% (in throughput) with in-memory loads. It
> > > > > > > > > > > also means you can fit ~twice as much data into the same
> > > > > > > > > > > in-memory cluster, or have a higher RAM/disk ratio with
> > > > > > > > > > > a PDS cluster, saving on hardware or decreasing latency.
> > > > > > > > > >
> > > > > > > > > > > The code is available as PR 4295 [2] (set
> > > > > > > > > > > IGNITE_ENABLE_COMPRESSION=true to activate). Note that
> > > > > > > > > > > it will not presently survive a PDS node restart.
> > > > > > > > > > > The impact is very small; the patch should be applicable
> > > > > > > > > > > to most 2.x releases.
> > > > > > > > > >
> > > > > > > > > > > Sure, there's a long way to go before this prototype can
> > > > > > > > > > > have any hope of being included, but first I would like
> > > > > > > > > > > to hear input from fellow Igniters.
> > > > > > > > > >
> > > > > > > > > > See also IEP-20[3].
> > > > > > > > > >
> > > > > > > > > > 1.
> > > > > > > > > > https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10
> .
> > > > > > > > > > 5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0052331.html
> > > > > > > > > > 2. https://github.com/apache/ignite/pull/4295
> > > > > > > > > > 3.
> > > > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > > > > > > > 20%3A+Data+Compression+in+Ignite
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Ilya Kasnacheev
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Sergey Kozlov
> > > > > > > > > GridGain Systems
> > > > > > > > > www.gridgain.com
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards, Vyacheslav D.
> > > >
> >
> >
> >
> > --
> > Best Regards, Vyacheslav D.
>
>
>
> --
> Best Regards, Vyacheslav D.
>
