spark-user mailing list archives

From Faraz Mateen <>
Subject Re: Data loss in spark job
Date Wed, 28 Feb 2018 03:32:34 GMT

I saw the following error message in executor logs:

*Java HotSpot(TM) 64-Bit Server VM warning: INFO:
os::commit_memory(0x0000000662f00000, 520093696, 0) failed; error='Cannot
allocate memory' (errno=12)*

By increasing RAM of my nodes to 40 GB each, I was able to get rid of RPC
connection failures. However, the results I am getting after copying data
are still incorrect.

Before termination, executor logs have this error message:

*ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM*

I believe the executors are not shutting down gracefully and that is
causing spark to lose some data.

Can anyone please explain how I can further debug this?
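
For reference, here is a rough sketch of the per-VM memory budget behind the allocation failure above. It uses only figures from this thread (30 GB and 40 GB of RAM, 2 executors per node, 10 GB per executor); the helper name `headroom_gb` is hypothetical, and the sizes of Cassandra, the OS, and JVM off-heap usage are not known here, so they are not subtracted.

```python
# Back-of-the-envelope memory budget per VM, using only figures from this thread.
MIB = 1024 ** 2

# Size of the allocation that failed in the os::commit_memory warning.
failed_commit_bytes = 520093696
failed_commit_mib = failed_commit_bytes / MIB


def headroom_gb(ram_gb, executors_per_node, executor_heap_gb):
    """RAM left on one VM after reserving the configured executor heaps.

    What remains must cover Cassandra, the OS, and JVM off-heap usage;
    none of those sizes are given in the thread, so they are not subtracted.
    """
    return ram_gb - executors_per_node * executor_heap_gb


if __name__ == "__main__":
    print(f"failed commit: {failed_commit_mib:.0f} MiB")
    print("headroom at 30 GB:", headroom_gb(30, 2, 10), "GB")
    print("headroom at 40 GB:", headroom_gb(40, 2, 10), "GB")
```

The headroom figure is only an upper bound on what is left for everything else running on the node.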


On Mon, Feb 26, 2018 at 4:46 PM, Faraz Mateen <> wrote:

> Hi,
> I think I have a situation where spark is silently failing to write data
> to my Cassandra table. Let me explain my current situation.
> I have a table consisting of around 402 million records. The table
> consists of 84 columns. Table schema is something like this:
> *id (text)  |   datetime (timestamp)  |   field1 (text) | ..... |   field
> 84 (text)*
> To optimize queries on the data, I am splitting it into multiple tables
> using the Spark job described below. Each separated table must have data
> from just one field of the source table. The new table has the following
> structure:
> *id (text)  |   datetime (timestamp)  |   day (date)  |   value (text)*
> where the "value" column contains the field column from the source table.
> Source table has around *402 million* records which is around *85 GB* of
> data distributed on *3 nodes (27 + 32 + 26)*. New table being populated
> is supposed to have the same number of records but it is missing some data.
> Initially, I assumed there was a problem with the data in the source
> table. So, I copied one week of data from the source table into another
> table with the same schema. Then I split the data as before, but this
> time the field-specific table had the same number of records as the
> source table. I repeated this with a data set from another time period,
> and again the number of records in the field-specific table was equal to
> the number of records in the source table.
> This has led me to believe that there is some problem with Spark's
> handling of large data sets. Here is my spark-submit command to separate
> the data:
> *~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master
> spark:// <>  --packages
> datastax:spark-cassandra-connector:2.0.1-s_2.11 --conf
>",," --conf
> "" --conf spark.local.dir=/media/db/
> --executor-memory 10G --num-executors=6 --executor-cores=3
> --total-executor-cores 18*
> ** is the name of my pyspark application. It is essentially
> executing the following query:
> *("select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "+field+"
> as value  from data  " )*
> The Spark job does not crash despite the errors and warnings linked
> below. However, when I check the number of records in the new table, it
> is always less than the number of records in the source table. Moreover,
> the number of records in the destination table differs from one run of
> the query to the next. I changed the logging level for spark-submit to
> WARN and saw the following warnings and errors on the console:
> 141f8c#file-gistfile1-txt
> My cluster consists of *3 gcloud VMs*. A Spark node and a Cassandra node
> are deployed on each VM.
> Each VM has *8 cores* of CPU and *30 GB* of RAM. Spark is deployed in
> standalone cluster mode.
> Spark version is *2.1.0*
> I am using datastax spark cassandra connector version *2.0.1*
> Cassandra Version is *3.9*
> Each spark executor is allowed 10 GB of RAM and there are 2 executors
> running on each node.
> Is the problem related to my machine resources? How can I find the root
> cause or fix this?
> Any help will be greatly appreciated.
> Thanks,
> Faraz
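
One possible way to narrow down where records go missing, given that the small one-week copies matched and the full copy did not, is to compare per-day record counts between the source and the field-specific table. The sketch below is only an illustration: `missing_by_day` is a hypothetical helper, the counts in the example are made up, and it assumes the per-day counts have already been collected into plain dictionaries (for instance from a `groupBy("day").count()` on each DataFrame).

```python
def missing_by_day(source_counts, dest_counts):
    """Return {day: number of records missing in the destination}.

    Both arguments map a day string (yyyy-MM-dd) to a record count.
    Days where the destination has at least as many records as the
    source are left out of the result.
    """
    deficits = {}
    for day, expected in source_counts.items():
        written = dest_counts.get(day, 0)
        if written < expected:
            deficits[day] = expected - written
    return deficits


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    source = {"2018-01-01": 5, "2018-01-02": 3}
    dest = {"2018-01-01": 5, "2018-01-02": 1}
    print(missing_by_day(source, dest))
```

If the deficits cluster on particular days or change between runs, that pattern can be lined up against the executor logs for the same run.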
