From: Faraz Mateen
Date: Wed, 28 Feb 2018 08:32:34 +0500
Subject: Re: Data loss in spark job
To: user@spark.apache.org

Hi,

I saw the following error message in the executor logs:

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000662f00000, 520093696, 0) failed; error='Cannot allocate memory' (errno=12)

By increasing the RAM of my nodes to 40 GB each, I was able to get rid of the
RPC connection failures. However, the results I am getting after copying the
data are still incorrect. Before termination, the executor logs show this
error message:

ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

I believe the executors are not shutting down gracefully, and that is causing
Spark to lose some data. Can anyone please explain how I can debug this
further?

Thanks,
Faraz

On Mon, Feb 26, 2018 at 4:46 PM, Faraz Mateen <fmateen@an10.io> wrote:

> Hi,
>
> I think I have a situation where Spark is silently failing to write data to
> my Cassandra table. Let me explain my current situation.
>
> I have a source table of around 402 million records and 84 columns. Its
> schema is something like this:
>
> id (text) | datetime (timestamp) | field1 (text) | ..... | field84 (text)
>
> To optimize queries on the data, I am splitting it into multiple tables
> using the Spark job mentioned below. Each separated table must hold the
> data of just one field column from the source table. The new tables have
> the following structure:
>
> id (text) | datetime (timestamp) | day (date) | value (text)
>
> where the "value" column contains the field column from the source table.
> The source table has around 402 million records, which is about 85 GB of
> data distributed over 3 nodes (27 + 32 + 26 GB). Each new table being
> populated is supposed to have the same number of records, but it is missing
> some data.
>
> Initially, I assumed some problem with the data in the source table, so I
> copied one week of data from the source table into another table with the
> same schema. Then I split the data like I did before, but this time the
> field-specific table had the same number of records as the source table. I
> repeated this with another data set from another time period, and again the
> number of records in the field-specific table was equal to the number of
> records in the source table.
>
> This has led me to believe that there is some problem with Spark's handling
> of the large data set. Here is my spark-submit command to separate the data:
>
> ~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://10.128.0.18:7077 \
>   --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
>   --conf spark.cassandra.connection.host="10.128.1.1,10.128.1.2,10.128.1.3" \
>   --conf "spark.storage.memoryFraction=1" --conf spark.local.dir=/media/db/ \
>   --executor-memory 10G --num-executors=6 --executor-cores=3 \
>   --total-executor-cores 18 split_data.py
>
> split_data.py is the name of my pyspark application. It is essentially
> executing the following query:
>
> ("select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, " + field + " as value from data ")
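>
> At its core, split_data.py does something along the lines of the sketch
> below (simplified; the keyspace name, destination table name and the
> hard-coded field are placeholders, not my exact script):
>
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("split_data").getOrCreate()
>
>     field = "field1"  # placeholder; the real job gets the field name from a loop/argument
>
>     # Register the wide source table (read through the Cassandra connector)
>     # so it can be queried with Spark SQL.
>     (spark.read
>         .format("org.apache.spark.sql.cassandra")
>         .options(keyspace="my_keyspace", table="data")
>         .load()
>         .createOrReplaceTempView("data"))
>
>     # Project id, datetime, the derived day and the one field being split off.
>     narrow = spark.sql(
>         "select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "
>         + field + " as value from data ")
>
>     # Write the narrow result into the field-specific table.
>     (narrow.write
>         .format("org.apache.spark.sql.cassandra")
>         .options(keyspace="my_keyspace", table="data_" + field)
>         .mode("append")
>         .save())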
>
> The Spark job does not crash after the errors and warnings shown in the
> gist below. However, when I check the number of records in the new table,
> it is always less than the number of records in the source table. Moreover,
> the number of records in the destination table is not the same after each
> run of the query. I changed the logging level for spark-submit to WARN and
> saw the following WARNINGS and ERRORS on the console:
>
> https://gist.github.com/anonymous/e05f1aaa131348c9a5a9a2db6d141f8c#file-gistfile1-txt
>
> My cluster consists of 3 gcloud VMs. A Spark node and a Cassandra node are
> deployed on each VM. Each VM has 8 CPU cores and 30 GB of RAM. Spark is
> deployed in standalone cluster mode.
> Spark version is 2.1.0.
> I am using the datastax spark-cassandra-connector version 2.0.1.
> Cassandra version is 3.9.
> Each Spark executor is allowed 10 GB of RAM, and there are 2 executors
> running on each node.
>
> Is the problem related to my machine resources? How can I root-cause or fix
> this? Any help will be greatly appreciated.
>
> Thanks,
> Faraz
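P.S. For completeness, the record-count comparison I keep referring to can be
reproduced with something like the sketch below (the keyspace and table names
are placeholders; a plain count in cqlsh would serve the same purpose):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("count_check").getOrCreate()

    def cassandra_table(name):
        # Load a Cassandra table as a DataFrame via the spark-cassandra-connector.
        return (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(keyspace="my_keyspace", table=name)
                .load())

    source_count = cassandra_table("data").count()
    dest_count = cassandra_table("data_field1").count()

    print("source: %d  destination: %d  missing: %d"
          % (source_count, dest_count, source_count - dest_count))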