db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Olivier Chedru" <oChe...@infovista.com>
Subject RE: Horizontal partitioning?
Date Wed, 18 Jul 2007 07:04:05 GMT
Hi David,

The purpose of my application is to keep measurement data locally, and
Derby is perfect to manage a local database. Because of deployment
reasons, I cannot use a distributed database.
The measurement data never change (they are just dropped when their
lifetime is reached). This application aims at running 24/24, and I need
to keep the database size stable once its lifetime is reached. Database
users will want to query data over a given short period of time, with
reasonable response times. 

If I use a single table, I will fall in the following issues:
- Index maintenance will become more and more expensive as the table
grows. And it will grow a lot: typically, I will store 10000 rows per
second and keep 1 week of data. This would generate more than 6 billion
- Dropping obsolete data will mean using DELETE FROM, which is much more
expensive than dropping a table. And this will probably lead to database
fragmentation over time.

So I try to proceed this way:
1 - When some criteria is met (period of time, number of rows...), I
create a new table without indexes.
2 - Data are stored using batch insert statements.
3 - Once the table is considered full (using the same criteria as
above), I create the necessary indexes on the table.
4 - Goto 1.

To provide an easy access to data and hide all those tables, a view
seems very appropriate. Unfortunately, as shown in my first post, the
view does not take into account constraints on the timestamp columns,
and scans all tables. No need to say the query never ends...

Any idea about this problem is welcome!


-----Original Message-----
From: david.vancouvering@gmail.com [mailto:david.vancouvering@gmail.com]
On Behalf Of David Van Couvering
Sent: Wednesday, July 18, 2007 7:09 AM
To: Derby Discussion
Subject: Re: Horizontal partitioning?

Hm, if you're creating a view on all the tables, that must mean all
the tables are under a single database?  If that's the case, then this
doesn't really sound like horizontal partitioning (which usually means
splitting it into multiple databases, each with its own disk and
usually each with its own CPU).

If you're keeping everything within one database, then partitioning
across multiple tables really is not useful or necessary, as Derby
itself accomplishes quick access to any row in the table using the
index on the primary key.

Normally, if you have N partitions, then you would have N databases on
N machines.  There would be no way to create a view across all of

This is how folks like eBay and Google do it, for example.  This
usually only works if the data in each partition is completely or
almost completely independent and you don't need to do queries that
span tables (e.g. you can issue the query against one specific
partition depending on things like timestamp, user id, geographic
location, etc).  If your queries can't be isolated to a single
partition, then your application code is going to have to deal with
logic normally assigned to the database, such as sorting and merging
data, and now you're in the realm of a clustered database, and having
worked on two of these in the past, I think I can safely say you don't
want to go there...


On 7/16/07, Olivier Chedru <oChedru@infovista.com> wrote:
> Hi all,
> I am trying to achieve horizontal partitioning with Derby.
> My application stores *a lot* of data as time passes, and it needs to
> obsolete data.
> Table partitioning would be the ideal solution, but it is not
implemented in
> Derby. So I create one table per period of time. All tables have the
> columns, the first one being the data timestamp. Dropping obsolete
data is
> then made easy: just drop old tables.
> Now, I need to query data without knowing how tables are organized.
So, I
> create a view on all tables (the view is deleted/created every time a
> table is created/dropped): fine.
> However, checking the runtime statistics, I notice ALL tables are
> when I use the view! Note I added constraints on the Timestamp field.
> Sample SQL:
> create table t1 (ts timestamp, a int, b int, c int);
> create table t2 (ts timestamp, a int, b int, c int);
> create table t3 (ts timestamp, a int, b int, c int);
> alter table t1 add constraint t1_c check (ts >= '2007-07-16
09:00:00.0' and
> ts < '2007-07-16 09:01:00.0');
> alter table t2 add constraint t2_c check (ts >= '2007-07-16
09:01:00.0' and
> ts < '2007-07-16 09:02:00.0');
> alter table t3 add constraint t3_c check (ts >= '2007-07-16
09:02:00.0' and
> ts < '2007-07-16 09:03:00.0');
> create view t_view as select * from t1 union select * from t2 union
select *
> from t3;
> select * from t_view where ts >= '2007-07-16 09:01:30.0' and ts <
> '2007-07-16 09:02:00.0';
> The displayed execution plan shows t1, t2 and t3 are scanned. I expect
> t2 is used. I also tried to force using constraints by defining the
> this way, but it did not improve the behavior:
> create view t_view as select * from t1 --DERBY-PROPERTIES
> union select * from t2 --DERBY-PROPERTIES constraint=t2_c union select
> from t3 --DERBY-PROPERTIES constraint=t3_c;
> Any idea on how to achieve horizontal partitioning?
> Thanks!
> Olivier.

View raw message