ignite-dev mailing list archives

From Stanislav Lukyanov <stanlukya...@gmail.com>
Subject Starting with missing PDS pieces
Date Mon, 04 Feb 2019 08:34:01 GMT
Hi Igniters,

I’d like to talk about Ignite startup when we have some of the persistence files missing.

This is related to the topic “Ignite index corruption issue -> unrecoverable cluster”
that is discussed nearby,
but not exactly the same – I’d like to avoid talking about indexes for now (let’s think
of them as normal partition files)
and focus on possible behavioral changes, not documentation.

We have three parts of the persistent storage:
- db/ - partition files
- cp/ - checkpoint markers
- wal/ - write-ahead log (let’s not make a distinction between wal/ and wal/archive/ for
now)
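
To make the cases below concrete, here is a minimal sketch of how one might detect which of the three pieces are present. It assumes a simplified, hypothetical layout in which db/, cp/ and wal/ sit side by side under one storage root, and it treats an empty directory as missing; the function name and the layout are illustrative only, not Ignite's actual code or directory structure.

```python
from pathlib import Path

# The three pieces of the persistent storage, as named in this message.
PIECES = ("db", "cp", "wal")


def present_pieces(storage_root):
    """Return a dict {piece: bool} for a hypothetical storage root.

    Assumption: a piece counts as present only if its directory
    exists and contains at least one entry.
    """
    root = Path(storage_root)
    result = {}
    for name in PIECES:
        piece_dir = root / name
        result[name] = piece_dir.is_dir() and any(piece_dir.iterdir())
    return result
```

For example, a root that contains only a populated db/ directory yields db=True, cp=False, wal=False, which is case 1 below.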

What if some of these pieces are missing? Currently we don’t handle it that well, but experience
shows that
bugs exist, disks fail and users make mistakes – all of which lead to files becoming inaccessible.

For starters, let’s not talk about missing db/ - if we’ve lost the base of our PDS we’re
in trouble, that’s understandable.

Here are the cases I’d like to discuss:
1. db/ is OK, cp/ and wal/ are completely missing.
This isn’t really too likely to happen due to a disk failure since cp/ is stored together
with db/.
But a user’s mistake or a bug in Ignite might lead to this.

Current behavior (AFAIK): Ignite doesn’t start.
I guess the current behavior is fine - we don’t know if the data is consistent (if we were
in the middle of a checkpoint or not),
so let’s not even try to use it.
But a user might want to still start with at least something (or may know for sure that the
data is consistent) – perhaps we could
allow that with some flag/option like “--force”.

2. db and cp are OK, wal is missing.
This is a highly likely situation – after all, we suggest that users have a WAL on a separate
disk (that may fail).
Because of that I think we should really be well-prepared for this.

There are two cases:

2a. cp/ shows that db/ is in a consistent state (Ignite was not stopped in the middle of a
checkpoint)
Current behavior (AFAIK): Ignite doesn’t start.
We could (almost) safely start here – the data is consistent after all. Might require the
user to acknowledge that
the start is without WAL (so we might’ve lost some updates made after the last checkpoint) by using,
again, “--force”.

2b. cp/ shows that db/ is in an inconsistent state (Ignite was stopped in the middle of a
checkpoint)
Current behavior (AFAIK): Ignite doesn’t start.
Current behavior is OK – we’re in an inconsistent state, so let’s not start. Whether to
allow a force-start in this case is an open question.
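
As a rough illustration of how 2a and 2b could be told apart, the sketch below checks whether the most recent checkpoint has both a start and an end marker. It assumes marker files are named <timestamp>-<id>-START.bin and <timestamp>-<id>-END.bin; that naming is an assumption made for this example and should be verified against the actual code, and the function name is hypothetical.

```python
import re

# Assumed marker naming: "<timestamp>-<checkpoint id>-START.bin" / "-END.bin".
MARKER_RE = re.compile(r"^(\d+)-(.+)-(START|END)\.bin$")


def stopped_mid_checkpoint(marker_names):
    """Return True if the latest started checkpoint has no END marker."""
    started = {}   # checkpoint id -> timestamp of its START marker
    ended = set()  # checkpoint ids that have an END marker
    for name in marker_names:
        match = MARKER_RE.match(name)
        if match is None:
            continue  # ignore files that are not markers
        timestamp, cp_id, kind = int(match.group(1)), match.group(2), match.group(3)
        if kind == "START":
            started[cp_id] = timestamp
        else:
            ended.add(cp_id)
    if not started:
        # No checkpoint recorded at all, so none was interrupted.
        return False
    last_id = max(started, key=started.get)
    return last_id not in ended
```

A False result corresponds to case 2a, a True result to case 2b.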

3. db and wal are OK, cp is missing.
Current behavior (AFAIK): Ignite will start.
The current behavior is really awkward. Since we don’t have cp/, we don’t have a way to
map wal/ to the state of db/, so wal/ is as good as missing.
I’d have the same behavior here as in case 1.
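
The proposal in cases 1-3 can be summarized as a small decision function. This is only a sketch of the behavior suggested in this message, not of what Ignite does today; the names Action and proposed_action are hypothetical, and the “--force” option is still just an idea. A missing db/ is deliberately left out of scope, as above.

```python
from enum import Enum


class Action(Enum):
    START = "start normally"
    START_IF_FORCED = "start only with an explicit force option"
    REFUSE = "do not start"


def proposed_action(db, cp, wal, cp_consistent=None):
    """Map the presence of db/, cp/ and wal/ to the behavior proposed here.

    cp_consistent is only meaningful when cp/ is present and wal/ is
    missing: True means the node was not stopped mid-checkpoint.
    """
    if not db:
        raise ValueError("missing db/ is out of scope for this discussion")
    if cp and wal:
        return Action.START  # nothing is missing
    if cp and not wal:
        if cp_consistent is None:
            raise ValueError("cp_consistent is required when only wal/ is missing")
        # Case 2a: consistent, start after acknowledgement. Case 2b: refuse
        # (whether a force-start should be allowed is left open).
        return Action.START_IF_FORCED if cp_consistent else Action.REFUSE
    # cp/ is missing: case 1 (wal/ missing too) and case 3 (wal/ present)
    # are treated the same way.
    return Action.START_IF_FORCED
```

Note that case 3 returns the same action as case 1, since wal/ cannot be mapped to db/ without cp/.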

WDYT? 

Thanks,
Stan
