trafodion-dev mailing list archives

From "Liu, Ming (Ming)" <ming....@esgyn.cn>
Subject RE: Load with log error rows gets Trafodion not work
Date Fri, 09 Sep 2016 05:20:55 GMT
Hi,

Qiao’s HBase log shows errors when HBase tried to open the table regions under the “_MD_”
schema; the error stack looks like this:

2016-09-08 16:44:36,327 ERROR [RS_OPEN_REGION-hadoop2slave7:60020-0] handler.OpenRegionHandler:
Failed open of region=TRAFODION._MD_.COLUMNS,,1471946223350.b6191867e73d4203d3ac6fad3c860138.,
starting to roll back the global memstore size.
org.apache.hadoop.hbase.DroppedSnapshotException: region: TRAFODION._MD_.COLUMNS,,1471946223350.b6191867e73d4203d3ac6fad3c860138.
                at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2243)
                at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1972)
                at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3826)
                at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:969)
                at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:841)
                at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:814)
                at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5828)
                at org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion.openHRegion(TransactionalRegion.java:101)
                at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5794)
                at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5765)
                at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5721)
                at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5672)
                at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:356)
                at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:126)
                at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.AssertionError: Key \xB9"b*M3c\x00ADMCKID.../#1:\x01/1473306352163/Put/vlen=8/seqid=1749
followed by a smaller key \xB9"b*M3c\x00ADMCKID.../#1:\x01/1473306352163/Put/vlen=8/seqid=4003 in cf #1
                at org.apache.hadoop.hbase.regionserver.StoreScanner.checkScanOrder(StoreScanner.java:699)
                at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:493)
                at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:115)
                at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71)
                at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:940)
                at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2217)
                at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2197)
                ... 17 more

I am not sure why this happened; HBase itself otherwise works well. The assertion comes from
StoreScanner.checkScanOrder() during the flush, which means the scanner saw two otherwise-identical
cells out of sequence-id order while replaying recovered edits, suggesting the recovered edits for
that region were inconsistent. Since the metadata was not available and Qiao’s data is just test
data, he reinitialized Trafodion and it recovered.
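
For reference, the recovery amounts to the following two statements from trafci (a destructive
step: "initialize trafodion, drop" removes all Trafodion metadata and user tables, so it is only
an option for disposable test data):

initialize trafodion, drop;
initialize trafodion;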
I don’t have enough information to identify the root cause yet. The error above happens during
HBase startup, and there are some HDFS errors before the HBase abort:

---------------------------------------------------------------------------------
2016-09-07 22:34:21,228 ERROR [regionserver/hadoop2slave7/10.1.1.22:60020] wal.ProtobufLogWriter:
Got IOException while writing trailer
java.nio.channels.ClosedChannelException
                at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1635)
                at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:104)
                at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
                at java.io.DataOutputStream.write(DataOutputStream.java:107)
                at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
                at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
                at com.google.protobuf.AbstractMessageLite.writeTo(AbstractMessageLite.java:80)
                at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.writeWALTrailer(ProtobufLogWriter.java:157)
                at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.close(ProtobufLogWriter.java:130)
                at org.apache.hadoop.hbase.regionserver.wal.FSHLog.shutdown(FSHLog.java:1149)
                at org.apache.hadoop.hbase.wal.DefaultWALProvider.shutdown(DefaultWALProvider.java:114)
                at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:215)
                at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1248)
                at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1003)
                at java.lang.Thread.run(Thread.java:745)


And

2016-09-07 22:34:20,765 ERROR [sync.4] wal.FSHLog: Error syncing, request close of wal
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK]
are bad. Aborting...
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,767 WARN  [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] wal.FSHLog:
Failed last sync but no outstanding unsync edits so falling through to close; java.io.IOException:
All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK]
are bad. Aborting...
2016-09-07 22:34:20,767 ERROR [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] wal.ProtobufLogWriter:
Got IOException while writing trailer
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK]
are bad. Aborting...
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,767 ERROR [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] wal.FSHLog:
Failed close of WAL writer hdfs://hadoop2slave7:8020/hbase/WALs/hadoop2slave7,60020,1473040797512/hadoop2slave7%2C60020%2C1473040797512..meta.1473255260637.meta,
unflushedEntries=0
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK]
are bad. Aborting...
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,767 FATAL [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] regionserver.HRegionServer:
ABORTING region server hadoop2slave7,60020,1473040797512: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: hdfs://hadoop2slave7:8020/hbase/WALs/hadoop2slave7,60020,1473040797512/hadoop2slave7%2C60020%2C1473040797512..meta.1473255260637.meta,
unflushedEntries=0
                at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:978)
                at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:716)
                at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:137)
                at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK]
are bad. Aborting...
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,768 FATAL [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] regionserver.HRegionServer:
RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.AggregateImplementation,
org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint, org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionObserver,
org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint]
------------------------------------------------------------------------------------


I am not sure whether this log information helps to find the root cause of the metadata
corruption; I am still investigating.
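
Since the "All datanodes ... are bad" errors point at the HDFS write pipeline, one thing worth
checking (standard HDFS commands; a suggestion on my part, not something taken from Qiao's logs)
is the datanode and block health on that cluster:

sudo su hdfs --command "hdfs dfsadmin -report"
sudo su hdfs --command "hdfs fsck /hbase -openforwrite"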

Thanks,
Ming

From: 乔彦克 [mailto:qyanke@gmail.com]
Sent: Friday, September 09, 2016 11:27 AM
To: dev@trafodion.incubator.apache.org; user@trafodion.incubator.apache.org
Cc: Amanda Moran <amanda.moran@esgyn.com>; Selva Govindarajan <selva.govindarajan@esgyn.com>;
Liu, Ming (Ming) <ming.liu@esgyn.cn>
Subject: Re: Load with log error rows gets Trafodion not work

Thanks to Selva and Amanda. I loaded three data sets from Hive into Trafodion yesterday; the
first two succeeded and the last one hit the error.
This error left me unable to execute any query from trafci except "initialize trafodion,
drop" (thanks to @Liuming for telling me to do so). Ming analyzed the HBase log and found that
the data regions belonging to Trafodion could not be opened.
After I initialized Trafodion again, I reloaded the three data sets and everything went well.
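
(The loads are of the general form below; the table names here are placeholders, and per Selva's
suggestion below, "log error rows" can be swapped for "continue on error":)

load with log error rows into trafodion.seabase.target_table select * from hive.hive.source_table;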

@Selva, Trafodion and HBase are running normally, and below is the result of 'sqvers -u':
       perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
            LANGUAGE = (unset),
            LC_ALL = (unset),
            LC_CTYPE = "UTF-8",
            LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
cat: /opt/hptc/pdsh/nodes: No such file or directory
MY_SQROOT=/home/trafodion/apache-trafodion_server-2.0.1-incubating
who@host=trafodion@hadoop2slave7
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
linux=2.6.32-220.el6.x86_64
redhat=6.2
NO patches
Most common Apache_Trafodion Release 2.0.1 (Build release [DEV], branch -, date 24Jun16)
UTT count is 2
[8]        Apache_Trafodion Release 2.0.1 (Build release [DEV], branch release2.0, date 24Jun16)
             export/lib/hbase-trx-apache1_0_2-2.0.1.jar
             export/lib/hbase-trx-hdp2_3-2.0.1.jar
             export/lib/sqmanvers.jar
             export/lib/trafodion-dtm-apache1_0_2-2.0.1.jar
             export/lib/trafodion-dtm-hdp2_3-2.0.1.jar
             export/lib/trafodion-sql-apache1_0_2-2.0.1.jar
             export/lib/trafodion-sql-hdp2_3-2.0.1.jar
             export/lib/trafodion-utility-2.0.1.jar
[3]        Release 2.0.1 (Build release [DEV], branch release2.0, date 24Jun16)
             export/lib/jdbcT2.jar
             export/lib/jdbcT4.jar
             export/lib/lib_mgmt.jar

@Amanda:
The HDFS /user directory does not contain a trafodion user directory, just root and hive. But I
can load and insert data into Trafodion, so I don't think the problem is there.
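
(If the trafodion directory does turn out to be needed, it can be created as the hdfs superuser;
the ownership below is an assumption, adjust for the install:)

sudo su hdfs --command "hadoop fs -mkdir -p /user/trafodion"
sudo su hdfs --command "hadoop fs -chown trafodion:trafodion /user/trafodion"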

Thank you for your replies.
Many thanks again,
Qiao



Amanda Moran <amanda.moran@esgyn.com> wrote on Friday, September 9, 2016 at 1:03 AM:
Please run this command:

sudo su hdfs --command "hadoop fs -ls /user"

Please verify that the trafodion user is listed there.

Thanks!

Amanda

On Thu, Sep 8, 2016 at 8:08 AM, Selva Govindarajan <selva.govindarajan@esgyn.com> wrote:

> Hi Qiao,
>
>
>
> The JIRA you mentioned in the message is already fixed and merged to
> Trafodion on July 20th. It is unfortunate that this JIRA wasn’t marked
> as resolved. I have marked it as resolved now. This JIRA deals with the
> issue of the Trafodion process aborting when there is an error while logging
> the error rows. The error rows are logged directly in HDFS. Most likely
> the “Trafodion” user has no write permission to the HDFS directory where
> the error rows are logged.
>
>
>
> You can try the “Load with continue on error … ” command instead and check if
> it works.
>
>
>
> Can you also please send the output of the command below to confirm if the
> version installed has the above fix.
>
>
>
> sqvers -u
>
>
>
> Can you also issue the following commands to confirm that Trafodion and
> HBase started successfully.
>
>
>
> hbcheck
>
> sqcheck
>
>
>
>
>
> Selva
>
> *From:* 乔彦克 [mailto:qyanke@gmail.com]
> *Sent:* Thursday, September 8, 2016 12:20 AM
> *To:* user@trafodion.incubator.apache.org; dev@trafodion.incubator.apache.org
> *Subject:* Load with log error rows gets Trafodion not work
>
>
>
> Hi, all,
>
>    I used load with log error rows to load data from hive, and got the
> following error:
>
> [image: loaderr.png]
>
> which led to the HBase region server crashing.
>
> I restarted the HBase region server and Trafodion, but queries in Trafodion get no
> response, even the simplest ones like "get tables;" or "get schemas;".
>
> Can someone help me get Trafodion back to normal?
>
> https://issues.apache.org/jira/browse/TRAFODION-2109, this JIRA describes
> the same problem.
>
>
>
> Any reply is appreciated.
>
> Thank you
>
> Qiao
>



--
Thanks,

Amanda Moran