ignite-dev mailing list archives

From Gaurav Bajaj <gauravhba...@gmail.com>
Subject Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))
Date Fri, 16 Mar 2018 19:40:54 GMT
Hi,

We also got the exact same error. Our setup runs without Kubernetes. We are
using the Ignite data streamer to put data into caches. After streaming around
500k records, the streamer failed with the exception mentioned in the original email.

Thanks,
Gaurav

On 16-Mar-2018 4:44 PM, "Arseny Kovalchuk" <arseny.kovalchuk@synesis.ru>
wrote:

> Hi Dmitry.
>
> Thanks for your attention to this issue.
>
> I changed the repository to jcenter and set the Ignite version to 2.4.
> Unfortunately the reproducer starts with the same error message in the log
> (see attached).
>
> I cannot say whether the behavior of the whole cluster will change on 2.4,
> that is, whether the cluster can start on corrupted data on 2.4, because we
> have wiped the data and restarted the cluster where the problem occurred.
> We'll move to 2.4 next week and continue testing our software. We are
> moving to production in April/May, and it would be good to get some clue
> about how to deal with such a data situation in the future.
>
>
> Arseny Kovalchuk
>
> Senior Software Engineer at Synesis
> skype: arseny.kovalchuk
> mobile: +375 (29) 666-16-16
> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>
> On 16 March 2018 at 17:03, Dmitry Pavlov <dpavlov.spb@gmail.com> wrote:
>
>> Hi Arseny,
>>
>> I noticed that the reproducer uses
>> ignite_version=2.3.0
>>
>> Could you check whether it is reproducible in our latest release, 2.4.0?
>>
>> I'm not sure about the ticket number, but it is quite possible the issue
>> is already fixed.
>>
>> Sincerely,
>> Dmitriy Pavlov
>>
>> On Thu, 15 Mar 2018 at 19:34, Dmitry Pavlov <dpavlov.spb@gmail.com> wrote:
>>
>>> Hi Alexey,
>>>
>>> It may be a serious issue. Could you recommend an expert here who can
>>> pick this up?
>>>
>>> Sincerely,
>>> Dmitriy Pavlov
>>>
>>> On Thu, 15 Mar 2018 at 19:25, Arseny Kovalchuk <
>>> arseny.kovalchuk@synesis.ru> wrote:
>>>
>>>> Hi, guys.
>>>>
>>>> I've got a reproducer for a problem which is generally reported as
>>>> "Caused by: java.lang.IllegalStateException: Failed to get page IO
>>>> instance (page content is corrupted)". Strictly speaking, it reproduces
>>>> the result: I have no idea how the data got corrupted, but the cluster
>>>> node refuses to start with this data.
>>>>
>>>> We got the issue again when some of the server nodes were restarted several
>>>> times by Kubernetes. I suspect that the data got corrupted during those
>>>> restarts. But the main thing we really want is that the cluster
>>>> DOESN'T HANG during the next restart even if the data is corrupted!
>>>> Anyway, there is no tool that can help correct such data, and as a
>>>> result we wipe all data manually to start the cluster. So the expected
>>>> behavior is warnings about corrupted data in the logs and a cluster that
>>>> keeps working.
>>>>
>>>> How to reproduce:
>>>> 1. Download the data from
>>>> https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200Mb)
>>>> 2. Download and import the Gradle project from
>>>> https://storage.googleapis.com/pub-data-0/project.tar.gz (~100Kb)
>>>> 3. Unpack the data to the home folder, say /home/user1. You should get
>>>> a path like */home/user1/data5*. Inside data5 you should have
>>>> binary_meta, db, marshaller.
>>>> 4. Open *src/main/resources/data-test.xml* and put the absolute path
>>>> of the unpacked data into the *workDirectory* property of the *igniteCfg5*
>>>> bean. In this example it should be */home/user1/data5*. Do not
>>>> edit consistentId! The consistentId is ignite-instance-5, so the real data
>>>> is in the data5/db/ignite_instance_5 folder.
>>>> 5. Start the application from ru.synesis.kipod.DataTestBootApp
>>>> 6. Enjoy
>>>>
>>>> Hope it will help.
>>>>
>>>>
>>>> Arseny Kovalchuk
>>>>
>>>> Senior Software Engineer at Synesis
>>>> skype: arseny.kovalchuk
>>>> mobile: +375 (29) 666-16-16
>>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>>
>>>> On 26 December 2017 at 21:15, Denis Magda <dmagda@apache.org> wrote:
>>>>
>>>>> Cross-posting to the dev list.
>>>>>
>>>>> Ignite persistence maintainers please chime in.
>>>>>
>>>>> —
>>>>> Denis
>>>>>
>>>> On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <
>>>>> arseny.kovalchuk@synesis.ru> wrote:
>>>>>
>>>>> Hi guys.
>>>>>
>>>>> Another issue when using Ignite 2.3 with native persistence enabled.
>>>>> See details below.
>>>>>
>>>>> We deploy Ignite along with our services in Kubernetes (v 1.8) on
>>>>> premises. The Ignite cluster is a StatefulSet of 5 Pods (5 instances) of
>>>>> Ignite version 2.3. Each Pod mounts a PersistentVolume backed by CEPH RBD.
>>>>>
>>>>> We put about 230 events/second into Ignite; 70% of events are ~200KB
>>>>> in size and 30% are 5000KB. Smaller events have indexed fields and we
>>>>> query them via SQL.
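
To put the load figures quoted above in perspective, here is a rough back-of-the-envelope sketch. It assumes the 70/30 split and the event sizes are exact and that 1 MB = 1000 KB; the class and method names are hypothetical and are not part of the reporter's project or of Ignite.

```java
// Rough estimate of the ingest rate implied by the figures in the message.
// Assumptions (not from the original setup): exact 70/30 split, exact sizes.
class IngestRateEstimate {

    /**
     * @param eventsPerSecond total events per second
     * @param smallPercent    percentage of events that are "small"
     * @param smallKb         size of a small event in KB
     * @param largeKb         size of a large event in KB
     * @return approximate payload volume in KB per second
     */
    static long totalKbPerSecond(long eventsPerSecond, long smallPercent,
                                 long smallKb, long largeKb) {
        long smallEvents = eventsPerSecond * smallPercent / 100;
        long largeEvents = eventsPerSecond - smallEvents;
        return smallEvents * smallKb + largeEvents * largeKb;
    }

    public static void main(String[] args) {
        // 230 events/s, 70% at ~200 KB, 30% at 5000 KB
        long kbPerSecond = totalKbPerSecond(230, 70, 200, 5000);
        System.out.println("Approximate ingest rate: " + kbPerSecond
                + " KB/s (~" + (kbPerSecond / 1000) + " MB/s)");
    }
}
```

Under these assumptions the large events account for most of the volume (69 events/s at 5000 KB versus 161 events/s at 200 KB).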
>>>>>
>>>>> The cluster is activated from a client node which also streams events
>>>>> into Ignite from Kafka. We use a custom streamer implementation which
>>>>> uses the cache.putAll() API.
>>>>>
>>>>> We started the cluster from scratch without any persistent data. After a
>>>>> while we got corrupted data with the following error message.
>>>>>
>>>>> [2017-12-26 07:44:14,251] ERROR [sys-#127%ignite-instance-2%]
>>>>> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader:
>>>>> - Partition eviction failed, this can cause grid hang.
>>>>> class org.apache.ignite.IgniteException: Runtime failure on search
>>>>> row: Row@5b1479d6[ key: 171:1513946618964:3008806055072854, val:
>>>>> ru.synesis.kipod.event.KipodEvent [idHash=510912646, hash=-387621419,
>>>>> face_last_name=null, face_list_id=null, channel=171, source=,
>>>>> face_similarity=null, license_plate_number=null, descriptors=null,
>>>>> cacheName=kipod_events, cacheKey=171:1513946618964:3008806055072854,
>>>>> stream=171, alarm=false, processed_at=0, face_id=null, id=3008806055072854,
>>>>> persistent=false, face_first_name=null, license_plate_first_name=null,
>>>>> face_full_name=null, level=0, module=Kpx.Synesis.Outdoor,
>>>>> end_time=1513946624379, params=null, commented_at=0, tags=[vehicle, 0,
>>>>> human, 0, truck, 0, start_time=1513946618964, processed=false,
>>>>> kafka_offset=111259, license_plate_last_name=null, armed=false,
>>>>> license_plate_country=null, topic=MovingObject, comment=,
>>>>> expiration=1514033024000, original_id=null, license_plate_lists=null], ver:
>>>>> GridCacheVersion [topVer=125430590, order=1513955001926, nodeOrder=3] ][
>>>>> 3008806055072854, MovingObject, Kpx.Synesis.Outdoor, 0, , 1513946618964,
>>>>> 1513946624379, 171, 171, FALSE, FALSE, , FALSE, FALSE, 0, 0, 111259,
>>>>> 1514033024000, (vehicle, 0, human, 0, truck, 0), null, null, null, null,
>>>>> null, null, null, null, null, null, null, null ]
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1787)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1578)
>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.remove(H2TreeIndex.java:216)
>>>>> at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.doUpdate(GridH2Table.java:496)
>>>>> at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:423)
>>>>> at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:580)
>>>>> at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2334)
>>>>> at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:461)
>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:1453)
>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.remove(IgniteCacheOffheapManagerImpl.java:1416)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.remove(GridCacheOffheapManager.java:1271)
>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.remove(IgniteCacheOffheapManagerImpl.java:374)
>>>>> at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.removeValue(GridCacheMapEntry.java:3233)
>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheEntry.clearInternal(GridDhtCacheEntry.java:588)
>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:951)
>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.tryEvict(GridDhtLocalPartition.java:809)
>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:593)
>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:580)
>>>>> at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6631)
>>>>> at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>>>>> at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>> at java.lang.Thread.run(Thread.java:748)
>>>>> Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:148)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:102)
>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2RowFactory.getRow(H2RowFactory.java:62)
>>>>> at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:126)
>>>>> at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:36)
>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:123)
>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:40)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.getRow(BPlusTree.java:4372)
>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:200)
>>>>> at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:40)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:4359)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:4279)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1500(BPlusTree.java:81)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:261)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4697)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4682)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:158)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:319)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1823)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
>>>>> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1752)
>>>>> ... 23 more
>>>>>
>>>>>
>>>>> After a restart we also get this error. See *ignite-instance-2.log*.
>>>>>
>>>>> The *cache-config.xml* is used for *server* instances.
>>>>> The *ignite-common-cache-conf.xml* is used for *client* instances
>>>>> which activate the cluster and stream data from Kafka into Ignite.
>>>>>
>>>>> *Is it possible to tune (or implement) native persistence so that it
>>>>> just reports errors or corrupted data, skips the corrupted part, and
>>>>> continues to work without it? That would let the cluster keep
>>>>> operating regardless of errors on storage.*
>>>>>
>>>>>
>>>>> Arseny Kovalchuk
>>>>>
>>>>> Senior Software Engineer at Synesis
>>>>> skype: arseny.kovalchuk
>>>>> mobile: +375 (29) 666-16-16
>>>>> LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>>>>
>>>>> <ignite-instance-0.log><ignite-instance-1.log><ignite-instance-2.log>
>>>>> <ignite-instance-3.log><ignite-instance-4.log><cache-config.xml>
>>>>> <ignite-discovery-kubernetes.xml><ignite-common.xml>
>>>>> <ignite-common-storage.xml><ignite-common-entity.xml>
>>>>>
>>>>>
>
