I saw the following error message in executor logs:

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000662f00000, 520093696, 0) failed; error='Cannot allocate memory' (errno=12)
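
For a sense of scale, the size in that warning can be converted to MiB. This is only a unit conversion of the number printed in the log line above. It says nothing about which process was holding the rest of the memory.

```python
# Size (in bytes) copied from the os::commit_memory warning above.
failed_commit_bytes = 520093696

MIB = 1024 * 1024
failed_commit_mib = failed_commit_bytes / MIB

# A single commit request of this size was refused by the OS (errno=12).
print(f"JVM could not commit {failed_commit_mib:.0f} MiB")
```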

By increasing the RAM of my nodes to 40 GB each, I was able to get rid of the RPC connection failures. However, the results I get after copying the data are still incorrect.

Before termination, executor logs have this error message:

ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

I believe the executors are not shutting down gracefully, and that this is causing Spark to lose some data.

Can anyone please explain how I can further debug this?


On Mon, Feb 26, 2018 at 4:46 PM, Faraz Mateen <fmateen@an10.io> wrote:

I think I have a situation where Spark is silently failing to write data to my Cassandra table. Let me explain my current situation.

I have a table consisting of around 402 million records. The table consists of 84 columns. Table schema is something like this:

id (text)  |   datetime (timestamp)  |   field1 (text) | ..... |   field 84 (text)

To optimize queries on the data, I am splitting it into multiple tables using the Spark job described below. Each separate table must hold the data of just one field from the source table. Each new table has the following structure:

id (text)  |   datetime (timestamp)  |   day (date)  |   value (text)

where the "value" column contains the selected field column from the source table. The source table has around 402 million records, which is about 85 GB of data distributed across 3 nodes (27 + 32 + 26 GB). The new table being populated is supposed to have the same number of records, but it is missing some data.

Initially, I assumed there was some problem with the data in the source table. So I copied one week of data from the source table into another table with the same schema. Then I split the data as I did before, but this time the field-specific table had the same number of records as the source table. I repeated this with another data set from a different time period, and again the number of records in the field-specific table was equal to the number of records in the source table.
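
One way to narrow down where rows go missing in the full run is to compare record counts per day rather than only the totals. The sketch below is plain Python and assumes the per-day counts have already been collected from both tables by some other means. The function name `missing_per_day` and the sample numbers are hypothetical and only illustrate the comparison.

```python
from typing import Dict


def missing_per_day(source_counts: Dict[str, int],
                    dest_counts: Dict[str, int]) -> Dict[str, int]:
    """Return {day: number of missing rows} for days where the
    destination table has fewer rows than the source table."""
    missing = {}
    for day, expected in source_counts.items():
        actual = dest_counts.get(day, 0)  # a day absent from dest counts as 0 rows
        if actual < expected:
            missing[day] = expected - actual
    return missing


# Made-up counts, purely for illustration.
source = {"2018-02-01": 1000, "2018-02-02": 1200, "2018-02-03": 900}
dest = {"2018-02-01": 1000, "2018-02-02": 1150}

print(missing_per_day(source, dest))
# {'2018-02-02': 50, '2018-02-03': 900}
```

If the missing rows cluster in particular days or change between runs, that may point at failed or killed tasks rather than at bad source data.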

This has led me to believe that there is some problem with Spark's handling of large data sets. Here is the spark-submit command I use to separate the data:

~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://  --packages datastax:spark-cassandra-connector:2.0.1-s_2.11 --conf spark.cassandra.connection.host=",," --conf "spark.storage.memoryFraction=1" --conf spark.local.dir=/media/db/ --executor-memory 10G --num-executors=6 --executor-cores=3 --total-executor-cores 18 split_data.py

split_data.py is the name of my PySpark application. It essentially executes the following query:

("select id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, "+field+" as value  from data  " )
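
For readability, the same query string can be built by a small helper. This is a minimal sketch and not the actual contents of split_data.py. The function name `build_split_query` and the identifier check are assumptions added for illustration.

```python
def build_split_query(field: str) -> str:
    """Build the SQL that projects one source column into the 'value' column."""
    # Assumption: field names are plain identifiers such as "field1".
    if not field.isidentifier():
        raise ValueError(f"unexpected field name: {field!r}")
    return (
        "select id, datetime, "
        "DATE_FORMAT(datetime, 'yyyy-MM-dd') as day, "
        f"{field} as value from data"
    )


print(build_split_query("field1"))
```

The resulting string would be passed to the Spark SQL session in the same way as the original concatenated query.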

I changed the logging level for spark-submit to WARN and saw warnings and errors on the console. The Spark job does not crash after these errors and warnings. However, when I check the number of records in the new table, it is always less than the number of records in the source table. Moreover, the number of records in the destination table is not the same after each run of the query.

My cluster consists of 3 gcloud VMs. A Spark node and a Cassandra node are deployed on each VM.
Each VM has 8 CPU cores and 30 GB of RAM. Spark is deployed in standalone cluster mode.
The Spark version is 2.1.0.
I am using the DataStax Spark Cassandra Connector, version 2.0.1.
The Cassandra version is 3.9.
Each Spark executor is allowed 10 GB of RAM, and there are 2 executors running on each node.
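
The figures above imply the following per-VM budget. This is a rough sketch using only the numbers stated in this message and in the spark-submit command. Cassandra's own heap size is not given here, so it is not modelled.

```python
# Figures from the cluster description and the spark-submit command.
vm_ram_gb = 30
vm_cores = 8
num_vms = 3
executor_heap_gb = 10     # --executor-memory 10G
executor_cores = 3        # --executor-cores=3
total_executor_cores = 18  # --total-executor-cores 18

# Executors across the cluster and per VM.
total_executors = total_executor_cores // executor_cores
executors_per_vm = total_executors // num_vms

# Heap and cores claimed by Spark executors on one VM.
spark_heap_gb = executors_per_vm * executor_heap_gb
spark_cores = executors_per_vm * executor_cores

# What remains for Cassandra, the OS, JVM off-heap overhead
# and the PySpark Python worker processes.
remaining_gb = vm_ram_gb - spark_heap_gb
remaining_cores = vm_cores - spark_cores

print(f"Executors per VM: {executors_per_vm}")
print(f"Executor heap per VM: {spark_heap_gb} GB, left over: {remaining_gb} GB")
print(f"Executor cores per VM: {spark_cores}, left over: {remaining_cores}")
```

Executor heap alone takes two thirds of each VM's RAM, and everything else on the node has to fit in the remainder.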

Is the problem related to my machine resources? How can I find the root cause or fix this?
Any help will be greatly appreciated.