carbondata-user mailing list archives

Subject Questions about rebuilding datamap
Date Mon, 06 Aug 2018 01:59:41 GMT
Hi community,

Rebuilding a datamap currently has some problems in carbondata. I'll explain the problems
and possible solutions here so that we can fix them.

Note: Users can refer to the repo for the concept of 'deferred-rebuild'.


`REBUILD DATAMAP datamap_name ON TABLE table_name` is used to refresh a specific datamap.


1. This operation can even be fired on a non-deferred-rebuild datamap, where it is not needed.

2. `REBUILD` in the current implementation rebuilds the whole datamap and discards the old
datamap storage. In most scenarios this is not needed. Besides, while generating the new
datamap data, we do not clear the old data first, which causes the rebuild to fail.


It seems that, of all the datamap types currently in carbondata, only `MV` needs to be rebuilt.

Index datamaps (including lucene, bloomfilter) and preaggregate datamaps (including timeseries)
organize the datamap data by segment, which maps to the segment in the main table. So we can
manage the datamap data at a fine granularity:

11. For a deferred-rebuild datamap, if we fire the `REBUILD DATAMAP` command on it, carbondata
will generate datamap data for the segments which do not have datamap data yet.

12. If all the segments already have datamap data, this command will return immediately.

13. If this datamap is non-deferred-rebuild, this command will return with an error message.

14. We will block concurrent rebuilds of the same datamap. A lock will be used to achieve this.
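
To make items 11 to 14 concrete, here is a minimal Python sketch of the proposed behaviour.
It is only an illustration: `DataMap`, `RebuildError` and `rebuild_datamap` are hypothetical
names, not carbondata classes, and the per-segment build step is a placeholder. It also
assumes that "block" means rejecting a second concurrent rebuild rather than making it wait.

```python
import threading
from dataclasses import dataclass, field


class RebuildError(Exception):
    """Raised when REBUILD cannot be run on a datamap (hypothetical)."""


@dataclass
class DataMap:
    # All names here are hypothetical; they do not mirror carbondata classes.
    name: str
    deferred_rebuild: bool
    built_segments: set = field(default_factory=set)
    lock: threading.Lock = field(default_factory=threading.Lock)


def rebuild_datamap(datamap, table_segments):
    """Build datamap data only for main-table segments that lack it.

    Returns the sorted list of segment ids built by this call.
    """
    # Item 13: reject the command for a non-deferred-rebuild datamap.
    if not datamap.deferred_rebuild:
        raise RebuildError(f"datamap {datamap.name} is not deferred-rebuild")

    # Item 14: allow only one rebuild per datamap at a time.
    # Assumption: a second rebuild is rejected instead of waiting.
    if not datamap.lock.acquire(blocking=False):
        raise RebuildError(f"datamap {datamap.name} is already being rebuilt")
    try:
        # Item 11: pick the segments that have no datamap data yet.
        missing = sorted(set(table_segments) - datamap.built_segments)
        # Item 12: nothing to do, so return immediately.
        if not missing:
            return []
        for segment_id in missing:
            # Placeholder for the real per-segment build step.
            datamap.built_segments.add(segment_id)
        return missing
    finally:
        datamap.lock.release()
```

With a datamap that already covers segment "0", calling `rebuild_datamap(dm, ["0", "1", "2"])`
builds only "1" and "2", and a second call returns an empty list.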

For the MV datamap, it seems that it is deferred-rebuild by default, and the structure of its
datamap data is different from that of other datamaps. We will leave it as it is, which means
the user will explicitly rebuild the datamap. We only have to ensure:

21. Since the MV datamap is deferred-rebuild by default, either `WITH DEFERRED REBUILD` is not
needed for an MV datamap, or we should require the user to explicitly specify
`WITH DEFERRED REBUILD` while creating an MV datamap. I'd prefer the former.

22. Block concurrent rebuilding for one MV datamap.

The last one:
31. Since deferred-rebuild is also a datamap property, how about letting the user specify
it explicitly in DMPROPERTIES?
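
As a rough sketch of what item 31 could look like, the flag could be resolved from the
DMPROPERTIES map with a per-provider default. The property key `deferred_rebuild` and the
function name are made up for illustration, and treating non-MV datamaps as
non-deferred by default is an assumption based on `WITH DEFERRED REBUILD` being opt-in.

```python
def is_deferred_rebuild(provider, dm_properties):
    """Resolve the deferred-rebuild flag for a datamap.

    The 'deferred_rebuild' key is hypothetical; no key name has been decided.
    """
    value = dm_properties.get("deferred_rebuild")
    if value is None:
        # Item 21: MV is deferred-rebuild by default.
        # Assumption: every other provider defaults to non-deferred.
        return provider.lower() == "mv"
    normalized = value.strip().lower()
    if normalized not in ("true", "false"):
        raise ValueError(f"invalid deferred_rebuild value: {value!r}")
    return normalized == "true"
```

For example, `is_deferred_rebuild("bloomfilter", {"deferred_rebuild": "true"})` would mark a
bloomfilter datamap as deferred-rebuild without any extra syntax in the CREATE statement.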
