A lot of things can get funny when you run distributed as opposed to
local, e.g. some jar not making it over. Do you see anything of
interest in the logs on the executor machines? I'm guessing those are
192.168.222.152/192.168.222.164. From the source here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
it seems the warning message is logged after the task fails, but I
wonder if you might see something more useful as to why it failed to
begin with. As an example, we've had cases in HDFS where a small
example would work, but on a larger example we'd hit a bad file. But
the executor log is usually pretty explicit as to what happened...
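
If it helps decide which executor's log to open first, here is a rough sketch (plain Python, nothing Spark-specific) that tallies the "fetch failure" warnings in a driver log by host. The function name and the sample text are just for illustration; the regex assumes the warning looks like the lines quoted below, possibly wrapped across lines or prefixed with ">" by the mail client.

```python
import re
from collections import Counter

# Assumed format, taken from the warnings quoted in this thread:
#   ... Loss was due to fetch failure from
#   BlockManagerId(2, 192.168.222.164, 57185, 0)
# [\s>]* tolerates a line wrap and a mail-quote marker between the two parts.
FETCH_FAILURE = re.compile(
    r"fetch failure from[\s>]*BlockManagerId\(\s*\w+,\s*([\d.]+),\s*\d+"
)

def count_fetch_failures(log_text):
    """Return a Counter mapping executor host -> number of fetch failures."""
    return Counter(FETCH_FAILURE.findall(log_text))

# Hypothetical sample, shaped like the quoted driver log.
sample = """\
14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from
BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from
BlockManagerId(2, 192.168.222.164, 57185, 0)
"""

# Hosts listed from most to fewest fetch failures.
for host, n in count_fetch_failures(sample).most_common():
    print(host, n)
```

To use it on a real run, read the driver log into a string and pass that in place of `sample`. The tally only tells you where the failed fetches were pointed; the actual cause should still be in that executor's own log.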
On Tue, Jul 1, 2014 at 8:57 PM, Mohammed Guller <mohammed@glassbeam.com> wrote:
> I am running Spark 1.0 on a 4-node standalone Spark cluster (1 master + 3
> workers). Our app is fetching data from Cassandra and doing a basic filter,
> map, and countByKey on that data. I have run into a strange problem. Even if
> the number of rows in Cassandra is just 1M, the Spark job seems to go
> into an infinite loop and runs for hours. With a small amount of data (less
> than 100 rows), the job does finish, but takes almost 30-40 seconds and we
> frequently see the messages shown below. If we run the same application on a
> single-node Spark (master local[4]), then we don't see these warnings and
> the task finishes in less than 6-7 seconds. Any idea what could be the cause
> of these problems when we run our application on a standalone 4-node Spark
> cluster?
> 14/06/30 19:30:16 WARN TaskSetManager: Lost TID 25036 (task 6.0:90)
> 14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:18 WARN TaskSetManager: Lost TID 25310 (task 6.1:0)
>
> 14/06/30 19:30:18 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:19 WARN TaskSetManager: Lost TID 25582 (task 6.2:0)
> 14/06/30 19:30:19 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:21 WARN TaskSetManager: Lost TID 25882 (task 6.3:34)
> 14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(0, 192.168.222.142, 39342, 0)
> 14/06/30 19:30:22 WARN TaskSetManager: Lost TID 26152 (task 6.4:0)
> 14/06/30 19:30:22 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(0, 192.168.222.142, 39342, 0)
> 14/06/30 19:30:23 WARN TaskSetManager: Lost TID 26427 (task 6.5:4)
> 14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:25 WARN TaskSetManager: Lost TID 26690 (task 6.6:0)
> 14/06/30 19:30:25 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:26 WARN TaskSetManager: Lost TID 26959 (task 6.7:0)
> 14/06/30 19:30:26 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:28 WARN TaskSetManager: Lost TID 27449 (task 6.8:218)
> 14/06/30 19:30:28 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:30 WARN TaskSetManager: Lost TID 27718 (task 6.9:0)
> 14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:31 WARN TaskSetManager: Lost TID 27991 (task 6.10:1)
> 14/06/30 19:30:31 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:33 WARN TaskSetManager: Lost TID 28265 (task 6.11:0)
> 14/06/30 19:30:33 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:34 WARN TaskSetManager: Lost TID 28550 (task 6.12:0)
> 14/06/30 19:30:34 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:36 WARN TaskSetManager: Lost TID 28822 (task 6.13:0)
> 14/06/30 19:30:36 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:37 WARN TaskSetManager: Lost TID 29093 (task 6.14:0)
> 14/06/30 19:30:37 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:39 WARN TaskSetManager: Lost TID 29366 (task 6.15:0)
> 14/06/30 19:30:39 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:40 WARN TaskSetManager: Lost TID 29648 (task 6.16:9)
> 14/06/30 19:30:40 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:42 WARN TaskSetManager: Lost TID 29924 (task 6.17:0)
> 14/06/30 19:30:42 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:43 WARN TaskSetManager: Lost TID 30193 (task 6.18:0)
> 14/06/30 19:30:43 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(2, 192.168.222.164, 57185, 0)
> 14/06/30 19:30:45 WARN TaskSetManager: Lost TID 30559 (task 6.19:98)
> 14/06/30 19:30:45 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(1, 192.168.222.152, 45896, 0)
> 14/06/30 19:30:46 WARN TaskSetManager: Lost TID 30826 (task 6.20:0)
> 14/06/30 19:30:46 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(1, 192.168.222.152, 45896, 0)
> 14/06/30 19:30:48 WARN TaskSetManager: Lost TID 31098 (task 6.21:0)
> 14/06/30 19:30:48 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(1, 192.168.222.152, 45896, 0)
> 14/06/30 19:30:50 WARN TaskSetManager: Lost TID 31370 (task 6.22:0)
> 14/06/30 19:30:50 WARN TaskSetManager: Loss was due to fetch failure from
> BlockManagerId(1, 192.168.222.152, 45896, 0)
> Thanks.
> Mohammed
