kafka-users mailing list archives

From Damian Guy <damian....@gmail.com>
Subject Re: Initializing StateStores takes *really* long for large datasets
Date Fri, 25 Nov 2016 15:54:30 GMT
Hi Frank,

If you have run the app before with the same applicationId, completely shut
it down, and then restarted it again, it will need to restore all of the
state which will take some time depending on the amount of data you have.
In this case the placement of the partitions doesn't take into account any
existing state stores, so it might need to load quite a lot of data if
nodes assigned certain partitions don't have that state-store (this is
something we should look at improving).

As for RocksDB tuning: you can provide an implementation of
RocksDBConfigSetter via the config
StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG. It has a single method:

public void setConfig(final String storeName, final Options options,
final Map<String, Object> configs)

In this method you can set various options on the provided Options object.
The options that might help in this case are:
options.setWriteBufferSize(..)  - default in streams is 32MB
options.setMaxWriteBufferNumber(..) - default in streams is 3
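
A minimal sketch of such a setter, for illustration only. The class name
CustomRocksDBConfig, the helper withRocksDBConfig, and the values 64MB and 4
are assumptions chosen for the example, not recommendations; measure before
settling on anything. Newer Kafka Streams releases also declare a close method
on RocksDBConfigSetter, so one is included without @Override to compile against
either version.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

// Hypothetical class name; any public class with a no-arg constructor works.
public class CustomRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        // Illustrative values only (assumptions): a larger memtable and one
        // extra write buffer compared with the defaults quoted above.
        options.setWriteBufferSize(64 * 1024 * 1024L);
        options.setMaxWriteBufferNumber(4);
    }

    // Required by newer Kafka Streams releases; harmless on older ones.
    // Nothing to release here because setConfig allocates no RocksDB objects.
    public void close(final String storeName, final Options options) {
    }

    // Hypothetical helper: registers this setter in the Streams properties.
    public static Properties withRocksDBConfig(final Properties props) {
        props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG,
                  CustomRocksDBConfig.class);
        return props;
    }
}
```

The setter is called once per store, so storeName can be used to apply
different options to the large stores only.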

However, I'm no expert on RocksDB and I suggest you have a look at
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide for more info.

Thanks,
Damian

On Fri, 25 Nov 2016 at 13:02 Frank Lyaruu <flyaruu@gmail.com> wrote:

> @Damian:
>
> Yes, it ran before, and it has that 200gb blob worth of Rocksdb stuff
>
> @Svante: It's on a pretty high end san in a managed private cloud, I'm
> unsure what the ultimate storage is, but I doubt there is a performance
> problem there.
>
> On Fri, 25 Nov 2016 at 13:37, Svante Karlsson <svante.karlsson@csi.se>
> wrote:
>
> > What kind of disk are you using for the rocksdb store? ie spinning or
> > ssd?
> >
> > 2016-11-25 12:51 GMT+01:00 Damian Guy <damian.guy@gmail.com>:
> >
> > > Hi Frank,
> > >
> > > Is this on a restart of the application?
> > >
> > > Thanks,
> > > Damian
> > >
> > > On Fri, 25 Nov 2016 at 11:09 Frank Lyaruu <flyaruu@gmail.com> wrote:
> > >
> > > > Hi y'all,
> > > >
> > > > I have a reasonably simple KafkaStream application, which merges about
> > > > 20 topics a few times.
> > > > The thing is, some of those topic datasets are pretty big, about 10M
> > > > messages. In total I've got about 200Gb worth of state in RocksDB,
> > > > the largest topic is 38 Gb.
> > > >
> > > > I had set the MAX_POLL_INTERVAL_MS_CONFIG to one hour to cover the
> > > > initialization time, but that does not seem nearly enough, I'm looking
> > > > at more than two hour startup times, and that starts to be a bit
> > > > ridiculous.
> > > >
> > > > Any tips / experiences on how to deal with this case? Move away from
> > > > Rocks and use an external data store? Any tuning tips on how to tune
> > > > Rocks to be a bit more useful here?
> > > >
> > > > regards, Frank
> > > >
> > >
> >
>
