ignite-dev mailing list archives

From Alexey Goncharuk <alexey.goncha...@gmail.com>
Subject Re: Rework locking architecture for MVCC and transactional SQL
Date Fri, 15 Dec 2017 15:55:11 GMT
That would be great, actually. We have a lot of data structures whose size
heavily depends on the topology size and the number of caches, and all of
those data structures are placed on-heap. I am looking forward to moving
that stuff to a limited memory region with disk offload.

The DataRegion and disk offload are fairly easy to implement - we already
have almost all of the mechanics in the current page memory implementation.
We only need to get rid of the copy-on-write and checkpointing logic and
alter the page eviction mechanics a little bit.
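
For illustration, a minimal sketch of how such a region might be configured
with the existing DataRegionConfiguration API. The region name, sizes and
swap path are placeholders, and the dedicated no-WAL lock region itself is
still hypothetical:

    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    // Volatile region: no WAL, no checkpoints, contents dropped on restart.
    DataRegionConfiguration lockRegion = new DataRegionConfiguration()
        .setName("tx-lock-region")                  // hypothetical region name
        .setPersistenceEnabled(false)               // no copy-on-write, no crash recovery
        .setInitialSize(64L * 1024 * 1024)
        .setMaxSize(256L * 1024 * 1024)
        .setSwapPath("/tmp/ignite-lock-swap");      // offload pages to disk on overflow

    IgniteConfiguration cfg = new IgniteConfiguration()
        .setDataStorageConfiguration(new DataStorageConfiguration()
            .setDataRegionConfigurations(lockRegion));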

Another side note: we could also look into range-lock mechanics. It would be
a greater effort to implement them, but I think it is worth it.

2017-12-15 16:59 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:

> Well, there is a completely different approach to handling the problem -
> lock escalation. E.g. SQL Server typically requires 96 bytes per lock -
> much less than Ignite, but still significant. And as soon as some query
> requires more than 5,000 row locks, it is escalated to an exclusive table
> lock, capping per-query lock memory at roughly 5,000 * 96 bytes ~= 480 KB.
> Not only does this eliminate the memory consumption problem, it also
> increases the performance of massive updates dramatically.
> On the other hand, this approach is more prone to deadlocks, since
> transactions updating disjoint data sets now have shared resources to be
> locked.
>
> There is no silver bullet apparently.
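>
> A rough sketch of what escalation could look like on our side; the class
> name and threshold are made up, and contention handling is omitted:
>
>     import java.util.HashSet;
>     import java.util.Set;
>
>     /** Hypothetical per-table lock bookkeeping with escalation. */
>     class TableLocks {
>         static final int ESCALATION_THRESHOLD = 5_000;
>
>         private final Set<Long> rowLocks = new HashSet<>(); // locked row ids
>         private boolean tableLock;                          // exclusive table lock
>
>         synchronized void lockRow(long rowId) {
>             if (tableLock)
>                 return;                                     // already covered by table lock
>
>             rowLocks.add(rowId);
>
>             if (rowLocks.size() > ESCALATION_THRESHOLD) {
>                 tableLock = true;                           // escalate: one lock instead of many
>                 rowLocks.clear();                           // free per-row bookkeeping memory
>             }
>         }
>     }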
>
> On Fri, Dec 15, 2017 at 4:42 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
>
> > Alex,
> >
> > That might be a very good idea. In fact, what you describe closely
> > resembles TempDB in MS SQL Server [1]. It is also offloaded to disk,
> > minimally logged and purged on restart. Ideally we could try designing
> > this component in a generic way, so that it could store a lot of
> > different temporary stuff (a rough sketch of a possible contract follows
> > the list):
> > 1) Locks
> > 2) UNDO data
> > 3) Sort/join data (for SELECT and CREATE INDEX, statistics, whatsoever)
> > 4) If needed - visibility info (e.g. for covering indexes and
> > purge/vacuum)
> >
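> > A minimal sketch of what such a generic contract might look like; the
> > interface and method names are hypothetical, not an existing API:
> >
> >     import java.nio.ByteBuffer;
> >
> >     /** Volatile store for transient data: disk-offloadable, minimally
> >      *  logged, purged on restart. */
> >     interface TempStore {
> >         /** Stores a payload and returns a link to it. */
> >         long put(ByteBuffer payload);
> >
> >         /** Reads a previously stored payload by its link. */
> >         ByteBuffer get(long link);
> >
> >         /** Releases the payload so the space can be reused. */
> >         void remove(long link);
> >     }
> >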
> > WDYT?
> >
> > Vladimir.
> >
> > [1] https://docs.microsoft.com/en-us/sql/relational-databases/databases/tempdb-database
> >
> > On Fri, Dec 15, 2017 at 4:26 PM, Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:
> >
> >> Vladimir,
> >>
> >> What about moving the entire locking mechanism to a separate off-heap
> >> memory region which would be volatile wrt restarts, but would still
> >> support offload to disk? In the current architecture, this means that
> >> we would need to allocate a separate DataRegion with no WAL and no
> >> crash recovery - locks are meaningless after a restart, and we would
> >> automatically drop them. It would be interesting to prototype this
> >> because I think we may be on par with on-heap lock placement, as we
> >> already proved for in-memory caches.
> >>
> >> 2017-12-14 21:53 GMT+03:00 Denis Magda <dmagda@apache.org>:
> >>
> >> > Vladimir,
> >> >
> >> > No, it’s crystal clear, thanks.
> >> >
> >> > If this approach works only for Ignite persistence-based deployments,
> >> > how will we handle locking for the pure in-memory and 3rd-party
> >> > database caching scenarios? As I understand it, the tuples will still
> >> > be stored in the page memory, while there won’t be any opportunity to
> >> > fall back to disk if the memory usage exceeds some threshold.
> >> >
> >> > —
> >> > Denis
> >> >
> >> > > On Dec 13, 2017, at 11:21 PM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> >> > >
> >> > > Denis,
> >> > >
> >> > > Sorry, maybe I was not clear enough - the "tuple approach" and the
> >> > > "persistent approach" are the same. By "tuple" I mean a row stored
> >> > > inside a data block. Currently we store lock information in the Java
> >> > > heap, and the proposal is to move it to data blocks. The main driver
> >> > > is memory - if there are a lot of rows to be locked, we will either
> >> > > run out of memory or produce serious memory pressure. For example,
> >> > > currently an update of 1M entries will consume ~500 MB of heap, i.e.
> >> > > roughly 500 bytes of lock bookkeeping per entry. With the proposed
> >> > > approach it will consume almost nothing. The drawback is an
> >> > > increased number of dirty data pages, but it should not be a problem
> >> > > because in the final implementation we will update data rows before
> >> > > the prepare phase anyway, so I do not expect any write amplification
> >> > > in the usual case.
> >> > >
> >> > > This approach is only applicable to Ignite persistence.
> >> > >
> >> > > On Thu, Dec 14, 2017 at 1:53 AM, Denis Magda <dmagda@apache.org> wrote:
> >> > >
> >> > >> Vladimir,
> >> > >>
> >> > >> Thanks for a thorough overview and proposal.
> >> > >>
> >> > >>> Also we could try employing a tiered approach:
> >> > >>> 1) Try to keep everything in-memory to minimize writes to blocks.
> >> > >>> 2) Fall back to persistent lock data if a certain threshold is reached.
> >> > >>
> >> > >> What are the benefits of the backed-by-persistence approach in
> >> > >> comparison to the one based on tuples? Specifically:
> >> > >> - will the persistence approach work for both 3rd-party and Ignite
> >> > >> persistence?
> >> > >> - any performance impacts depending on the chosen method?
> >> > >> - what’s faster to implement?
> >> > >>
> >> > >> —
> >> > >> Denis
> >> > >>
> >> > >>> On Dec 13, 2017, at 2:10 AM, Vladimir Ozerov <vozerov@gridgain.com> wrote:
> >> > >>>
> >> > >>> Igniters,
> >> > >>>
> >> > >>> As you probably know, we are actively working on the MVCC [1] and
> >> > >>> transactional SQL [2] features, which could be treated as a single
> >> > >>> huge improvement. We face a number of challenges, and one of them
> >> > >>> is locking.
> >> > >>>
> >> > >>> At the moment, information about all locks is kept in memory on a
> >> > >>> per-entry basis (see GridCacheMvccManager). For every locked key we
> >> > >>> maintain the current lock owner (XID) and the list of would-be-owner
> >> > >>> transactions. When a transaction is about to lock an entry, two
> >> > >>> scenarios are possible:
> >> > >>> 1) If the entry is not locked, we obtain the lock immediately.
> >> > >>> 2) If the entry is locked, we add the current transaction to the
> >> > >>> wait list and jump to the next entry to be locked. Once the first
> >> > >>> entry is released by the conflicting transaction, the current
> >> > >>> transaction becomes the owner of the first entry and tries to
> >> > >>> promote itself for subsequent entries.
> >> > >>>
> >> > >>> Once all required locks are obtained, a response is sent to the
> >> > >>> caller.
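> >> > >>>
> >> > >>> The sketch below loosely models this per-entry state; it is
> >> > >>> simplified and is not the actual GridCacheMvccManager code:
> >> > >>>
> >> > >>>     import java.util.ArrayDeque;
> >> > >>>     import java.util.Deque;
> >> > >>>
> >> > >>>     /** On-heap lock state kept for every locked key. */
> >> > >>>     class EntryLock {
> >> > >>>         private Long ownerXid;                                   // current lock owner
> >> > >>>         private final Deque<Long> waiters = new ArrayDeque<>();  // would-be owners, FIFO
> >> > >>>
> >> > >>>         /** Returns true if xid obtained the lock immediately (scenario 1). */
> >> > >>>         synchronized boolean tryLock(long xid) {
> >> > >>>             if (ownerXid == null || ownerXid == xid) {
> >> > >>>                 ownerXid = xid;
> >> > >>>                 return true;
> >> > >>>             }
> >> > >>>             waiters.add(xid);       // scenario 2: enqueue and move on
> >> > >>>             return false;
> >> > >>>         }
> >> > >>>
> >> > >>>         /** Releases the lock and promotes the next waiter, if any. */
> >> > >>>         synchronized Long unlock() {
> >> > >>>             ownerXid = waiters.poll();
> >> > >>>             return ownerXid;
> >> > >>>         }
> >> > >>>     }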
> >> > >>>
> >> > >>> This approach doesn't work well for transactional SQL - if we
> >> > >>> update millions of rows in a single transaction, we will simply run
> >> > >>> out of memory. To mitigate the problem, other database vendors keep
> >> > >>> information about locks inside the tuples. I propose to apply a
> >> > >>> similar design, as follows:
> >> > >>>
> >> > >>> 1) No per-entry lock information is stored in memory anymore.
> >> > >>> 2) The list of active transactions is still maintained in memory.
> >> > >>> 3) When a TX locks an entry, it sets a special marker in the tuple
> >> > >>> [3] (see the sketch after this list).
> >> > >>> 4) When a TX meets an already locked entry, it enlists itself in the
> >> > >>> wait queue of the conflicting transaction and suspends.
> >> > >>> 5) When the first transaction releases the conflicting lock, it
> >> > >>> notifies and wakes up the suspended transactions, so they resume
> >> > >>> locking.
> >> > >>> 6) Entry lock data is cleared on transaction commit.
> >> > >>> 7) Entry lock data is not cleared on rollback or node restart;
> >> > >>> instead, we could use the active transactions list to identify
> >> > >>> invalid locks and overwrite them as needed.
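> >> > >>>
> >> > >>> A minimal sketch of step 3, assuming a per-tuple XID word in the
> >> > >>> row header; the offset, the active-TX lookup, and the layout are
> >> > >>> all hypothetical, since they depend on the final MVCC design [3]:
> >> > >>>
> >> > >>>     import java.util.Set;
> >> > >>>     import java.util.concurrent.ConcurrentHashMap;
> >> > >>>     import org.apache.ignite.internal.util.GridUnsafe;
> >> > >>>
> >> > >>>     class RowLockSketch {
> >> > >>>         /** In-memory active transactions list (step 2); sketch. */
> >> > >>>         static final Set<Long> ACTIVE_TXS = ConcurrentHashMap.newKeySet();
> >> > >>>
> >> > >>>         /** Hypothetical offset of the lock XID word within a row. */
> >> > >>>         static final int LOCK_XID_OFF = 8;
> >> > >>>
> >> > >>>         static boolean isActive(long xid) {
> >> > >>>             return ACTIVE_TXS.contains(xid);
> >> > >>>         }
> >> > >>>
> >> > >>>         /** Tries to stamp the row with our XID; 0 means "not locked". */
> >> > >>>         static boolean tryMarkLocked(long pageAddr, int rowOff, long xid) {
> >> > >>>             long addr = pageAddr + rowOff + LOCK_XID_OFF;
> >> > >>>             long cur = GridUnsafe.getLong(addr);
> >> > >>>
> >> > >>>             if (cur != 0 && isActive(cur))  // live conflicting lock: caller
> >> > >>>                 return false;               // enlists in the wait queue (step 4)
> >> > >>>
> >> > >>>             // A stale XID from a rolled-back TX or a restart is
> >> > >>>             // simply overwritten (step 7).
> >> > >>>             return GridUnsafe.compareAndSwapLong(null, addr, cur, xid);
> >> > >>>         }
> >> > >>>     }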
> >> > >>>
> >> > >>> Also we could try employing a tiered approach (sketch below):
> >> > >>> 1) Try to keep everything in-memory to minimize writes to blocks.
> >> > >>> 2) Fall back to persistent lock data if a certain threshold is reached.
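> >> > >>>
> >> > >>> A rough sketch of the tiered idea; the threshold and all names are
> >> > >>> made up for illustration:
> >> > >>>
> >> > >>>     import java.util.HashMap;
> >> > >>>     import java.util.Map;
> >> > >>>
> >> > >>>     /** Keeps lock entries on-heap until a threshold, then spills to pages. */
> >> > >>>     class TieredLockTable {
> >> > >>>         static final int IN_MEMORY_LIMIT = 100_000;     // hypothetical threshold
> >> > >>>
> >> > >>>         private final Map<Long, Long> heapLocks = new HashMap<>(); // rowId -> xid
> >> > >>>
> >> > >>>         synchronized void lock(long rowId, long xid) {
> >> > >>>             if (heapLocks.size() < IN_MEMORY_LIMIT)
> >> > >>>                 heapLocks.put(rowId, xid);              // tier 1: pure in-memory
> >> > >>>             else
> >> > >>>                 writeLockToDataBlock(rowId, xid);       // tier 2: lock data in blocks
> >> > >>>         }
> >> > >>>
> >> > >>>         /** Placeholder for writing the lock marker into the data page. */
> >> > >>>         private void writeLockToDataBlock(long rowId, long xid) { /* ... */ }
> >> > >>>     }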
> >> > >>>
> >> > >>> Thoughts?
> >> > >>>
> >> > >>> [1] https://issues.apache.org/jira/browse/IGNITE-3478
> >> > >>> [2] https://issues.apache.org/jira/browse/IGNITE-4191
> >> > >>> [3] Depends on the final MVCC design - it could be a per-tuple XID,
> >> > >>> undo vectors, per-block transaction lists, etc.
> >> > >>>
> >> > >>> Vladimir.
> >> > >>
> >> > >>
> >> >
> >> >
> >>
> >
> >
>
