I'm facing a very strange error that occurs partway through long-running Spark SQL jobs:

18/01/12 22:14:30 ERROR Utils: Aborting task
java.io.EOFException: reached end of stream after reading 0 bytes; 96 bytes expected
at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

Since I get this in several jobs, I wonder if it might be a problem in the communication layer.
Has anyone faced a similar problem?

It always happens in a job that shuffles 200GB and then reads it in partitions of ~64MB for a groupBy. What is weird is that it only fails after it has processed over 1000 partitions (16 cores on one node).
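
To give a sense of scale, here is a rough back-of-the-envelope calculation from the numbers above. It is only a sketch: it assumes binary units (1 GB = 1024 MB), treats ~64MB as exactly 64MB, takes 1000 partitions as the failure point, and assumes all 16 cores run tasks concurrently. The variable names are just for illustration.

```python
# Rough sizing of the shuffle described above.
# Assumptions: 1 GB = 1024 MB, partitions are exactly 64 MB,
# and the failure happens right after 1000 partitions.
SHUFFLE_GB = 200      # total shuffle size
PARTITION_MB = 64     # approximate size of each shuffle partition
FAILED_AFTER = 1000   # partitions processed before the failure
CORES = 16            # cores on the single node

# Number of ~64MB partitions needed to cover the whole shuffle.
total_partitions = SHUFFLE_GB * 1024 // PARTITION_MB

# Shuffle data already read when the failure shows up.
read_before_failure_gb = FAILED_AFTER * PARTITION_MB / 1024

# Number of full "waves" of tasks if all cores are busy.
task_waves = FAILED_AFTER / CORES

print(f"total partitions:          {total_partitions}")
print(f"data read before failure:  {read_before_failure_gb:.1f} GB")
print(f"task waves before failure: {task_waves:.1f}")
```

So under these assumptions the job gets through roughly a third of the partitions before the error appears.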

I even tried changing the spark.shuffle.file.buffer config, but that only seems to change the point at which the error occurs.

I would really appreciate some hints on what it could be, what to try or test, and how to debug it, as I feel pretty much blocked here.

Thanks in advance