From dev-return-6624-apmail-hawq-dev-archive=hawq.apache.org@hawq.incubator.apache.org Fri Mar 24 14:17:47 2017 Return-Path: X-Original-To: apmail-hawq-dev-archive@minotaur.apache.org Delivered-To: apmail-hawq-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2683E193A5 for ; Fri, 24 Mar 2017 14:17:47 +0000 (UTC) Received: (qmail 36453 invoked by uid 500); 24 Mar 2017 14:17:47 -0000 Delivered-To: apmail-hawq-dev-archive@hawq.apache.org Received: (qmail 36387 invoked by uid 500); 24 Mar 2017 14:17:47 -0000 Mailing-List: contact dev-help@hawq.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hawq.incubator.apache.org Delivered-To: mailing list dev@hawq.incubator.apache.org Received: (qmail 36233 invoked by uid 99); 24 Mar 2017 14:17:46 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Mar 2017 14:17:46 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 93D66C04DE for ; Fri, 24 Mar 2017 14:17:46 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -100.002 X-Spam-Level: X-Spam-Status: No, score=-100.002 tagged_above=-999 required=6.31 tests=[RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id OKgJHVFkRIHB for ; Fri, 24 Mar 2017 14:17:45 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id 9E4E65FBBA for ; Fri, 24 Mar 2017 14:17:44 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 7B61EE0A75 for ; Fri, 24 Mar 2017 14:17:42 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id DE6772406D for ; Fri, 24 Mar 2017 14:17:41 +0000 (UTC) Date: Fri, 24 Mar 2017 14:17:41 +0000 (UTC) From: "Ming LI (JIRA)" To: dev@hawq.incubator.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (HAWQ-1408) PANICs during COPY ... FROM STDIN MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Ming LI created HAWQ-1408: ----------------------------- Summary: PANICs during COPY ... FROM STDIN Key: HAWQ-1408 URL: https://issues.apache.org/jira/browse/HAWQ-1408 Project: Apache HAWQ Issue Type: Bug Components: Core Reporter: Ming LI Assignee: Ed Espino Fix For: 2.1.0.0-incubating We found PANIC (and respective core dumps). From the initial analysis from = the logs and core dump, the query causing this PANIC is a "COPY ... FROM ST= DIN". This query does not always panic. This kind of queries are executed from Java/Scala code (by one of IG Spark = Jobs). Connection to the DB is managed by connection pool (commons-dbcp2) a= nd validated on borrow by =E2=80=9Cselect 1=E2=80=9D validation query. IG i= s using postgresql-9.4-1206-jdbc41 as a java driver to create those connect= ions. I believe they should be using the driver from DataDirect, available = in PivNet; however, I haven't found hard evidence pointing the driver as a = root cause. My initial analysis on the packcore for the master PANIC. Not sure if this = helps or makes sense. This is the backtrace of the packcore for process 466858: {code} (gdb) bt #0 0x00007fd875f906ab in raise () from /data/logs/52280/packcore-core.post= gres.466858/lib64/libpthread.so.0 #1 0x00000000008c0b19 in SafeHandlerForSegvBusIll (postgres_signal_arg=3D1= 1, processName=3D) at elog.c:4519 #2 #3 0x000000000053b9c3 in SetSegnoForWrite (existing_segnos=3D0x4c46ff0, ex= isting_segnos@entry=3D0x0, relid=3Drelid@entry=3D1195061, segment_num=3Dseg= ment_num@entry=3D6, forNewRel=3DforNewRel@entry=3D0 '\000', keepHash=3Dkeep= Hash@entry=3D1 '\001') at appendonlywriter.c:1166 #4 0x000000000053c08f in assignPerRelSegno (all_relids=3Dall_relids@entry= =3D0x2b96d68, segment_num=3D6) at appendonlywriter.c:1212 #5 0x00000000005f79e8 in DoCopy (stmt=3Dstmt@entry=3D0x2b2a3d8, queryStrin= g=3D) at copy.c:1591 #6 0x00000000007ef737 in ProcessUtility (parsetree=3Dparsetree@entry=3D0x2= b2a3d8, queryString=3D0x2c2f550 "COPY mis_data_ig_client_derived_attributes= .client_derived_attributes_src (id, tracking_id, name, value_string, value_= timestamp, value_number, value_boolean, environment, account, channel, devi= ce, feat"..., params=3D0x0, isTopLevel=3DisTopLevel@entry=3D1 '\001', dest=3Ddest@ent= ry=3D0x2b2a7c8, completionTag=3DcompletionTag@entry=3D0x7ffcb5e318e0 "") at= utility.c:1076 #7 0x00000000007ea95e in PortalRunUtility (portal=3Dportal@entry=3D0x2b8ea= b0, utilityStmt=3DutilityStmt@entry=3D0x2b2a3d8, isTopLevel=3DisTopLevel@en= try=3D1 '\001', dest=3Ddest@entry=3D0x2b2a7c8, completionTag=3DcompletionTa= g@entry=3D0x7ffcb5e318e0 "") at pquery.c:1969 #8 0x00000000007ec13e in PortalRunMulti (portal=3Dportal@entry=3D0x2b8eab0= , isTopLevel=3DisTopLevel@entry=3D1 '\001', dest=3Ddest@entry=3D0x2b2a7c8, = altdest=3Daltdest@entry=3D0x2b2a7c8, completionTag=3DcompletionTag@entry=3D= 0x7ffcb5e318e0 "") at pquery.c:2079 #9 0x00000000007ede95 in PortalRun (portal=3Dportal@entry=3D0x2b8eab0, cou= nt=3Dcount@entry=3D9223372036854775807, isTopLevel=3DisTopLevel@entry=3D1 '= \001', dest=3Ddest@entry=3D0x2b2a7c8, altdest=3Daltdest@entry=3D0x2b2a7c8, = completionTag=3DcompletionTag@entry=3D0x7ffcb5e318e0 "") at pquery.c:1596 #10 0x00000000007e5ad9 in exec_simple_query (query_string=3Dquery_string@en= try=3D0x2b29100 "COPY mis_data_ig_client_derived_attributes.client_derived_= attributes_src (id, tracking_id, name, value_string, value_timestamp, value= _number, value_boolean, environment, account, channel, device, feat"..., seqServerHost=3DseqServerHost@entry=3D0x0, seqServerPort=3DseqServerPor= t@entry=3D-1) at postgres.c:1816 #11 0x00000000007e6cb2 in PostgresMain (argc=3D, argv=3D, argv@entry=3D0x29d7820, username=3D0x29d75d0 "mis_ig") at post= gres.c:4840 #12 0x0000000000799540 in BackendRun (port=3D0x29afc50) at postmaster.c:591= 5 #13 BackendStartup (port=3D0x29afc50) at postmaster.c:5484 #14 ServerLoop () at postmaster.c:2163 #15 0x000000000079c309 in PostmasterMain (argc=3D, argv=3D) at postmaster.c:1454 #16 0x00000000004a4209 in main (argc=3D9, argv=3D0x29af010) at main.c:226 {code} Jumping into the frame 3 and running info locals, we found something odd fo= r "status" variable: {code} (gdb) f 3 #3 0x000000000053b9c3 in SetSegnoForWrite (existing_segnos=3D0x4c46ff0, ex= isting_segnos@entry=3D0x0, relid=3Drelid@entry=3D1195061, segment_num=3Dseg= ment_num@entry=3D6, forNewRel=3DforNewRel@entry=3D0 '\000', keepHash=3Dkeep= Hash@entry=3D1 '\001') at appendonlywriter.c:1166 1166=09appendonlywriter.c: No such file or directory. (gdb) info locals status =3D 0x0 [...] {code} This panic comes from this piece of code in "appendonlywritter.c": {code} for (int i =3D 0; i < segment_num; i++) { AOSegfileStatus *status =3D maxSegno4Segment[i]; status->inuse =3D true; status->xid =3D CurrentXid; existing_segnos =3D lappend_int(existing_segnos, status->segno); } {code} So, we are pulling a 0x0 (null ?!) entry from _maxSegno4Segment_... That's = extrange, because earlier in this function we populate this array, and we s= hould not reach this section unless this _maxSegno4Segment_ array completel= y populated. I'm assuming that the variable "segment_num" (used in the for loop stop con= dition) is 6 because we got (4096 MB, 1.00 CORE) x 6 resources, according t= o master log (?) {code} 2017-02-15 00:12:20.121004 GMT,"mis_ig","ig",p466858,th2037315872,"10.33.18= 8.8","35247",2017-02-14 08:41:58 GMT,30177010,con20836,cmd48711,seg-10000,,= ,x30177010,sx1,"LOG","00000","ConnID 42180. Acquired resource from resource= manager, (4096 MB, 1.00 0000 CORE) x 6." {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)