samza-dev mailing list archives

From Will Schneider <wschnei...@tripadvisor.com>
Subject Samza/Yarn cluster having issue with OffsetOutOfRangeException
Date Mon, 20 Aug 2018 13:16:12 GMT
Hello all,

We've recently been experiencing some Kafka/Samza issues we're not quite sure how to tackle.
We've exhausted all our internal expertise and were hoping that someone on the mailing lists
might have seen this before and knows what might cause it:

KafkaSystemConsumer [WARN] While refreshing brokers for [Store_LogParser_RedactedMetadata_RedactedEnvironment,35]:
org.apache.kafka.common.errors.OffsetOutOfRangeException: The requested offset is not within
the range of offsets maintained by the server.. Retrying.

^ (Above repeats indefinitely until we intervene)
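As we understand the message, the broker is saying that the offset the consumer asked for falls outside what that partition currently holds. A minimal sketch of that check follows, assuming the earliest and latest offsets for the partition have already been obtained some other way (for example with Kafka's GetOffsetShell tool). The names `OffsetStatus` and `classify_offset` are made up for illustration and are not part of Samza or Kafka:

```python
from enum import Enum


class OffsetStatus(Enum):
    IN_RANGE = "in_range"
    BELOW_EARLIEST = "below_earliest"  # e.g. the data was removed by retention
    ABOVE_LATEST = "above_latest"      # e.g. the topic was rebuilt or truncated


def classify_offset(requested: int, earliest: int, latest: int) -> OffsetStatus:
    """Classify a requested offset against a partition's current bounds.

    earliest: first offset still available on the broker.
    latest:   log end offset, i.e. the offset the next message will get.
    Both are assumed inputs; this sketch does not talk to Kafka.
    """
    if requested < earliest:
        return OffsetStatus.BELOW_EARLIEST
    if requested > latest:
        return OffsetStatus.ABOVE_LATEST
    # Fetching exactly at the log end offset is valid: the consumer
    # simply waits for new data.
    return OffsetStatus.IN_RANGE
```

Comparing the offset stored in the checkpoint against these bounds for partition 35 would at least tell us which side of the range the request is falling on.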

A bit about our use case:

  *   Versions:
     *   Kafka 1.0.1 (CDH Distribution 3.1.0-1.3.1.0.p0.35)
     *   Samza 0.14.1
     *   Hadoop: 2.6.0-cdh5.12.1
  *   We've seen some manifestation of this error in 4 different environments with minor differences
in configuration, but all running the same versions of the software
     *   Distributed Samza on Yarn (~10 node yarn environment, 3-7 node kafka environment)
     *   Non-distributed virtual test environment (Samza on yarn, but with no network in between)
  *   We have not found a reliable way to reproduce this error
  *   Issue typically presents on process startup. It usually doesn't make a difference if
the application was down for 5 minutes or 5 days before that startup
  *   The LogParser application experiencing this issue is reading and parsing a set of log
files, and supplementing them with metadata stored in the Store topic in question, and cached
locally in RocksDB
  *   The LogParser application has 40-60 running tasks and partitions depending on configuration
  *   There is no discernible pattern for where the error presents itself:
     *   It is not consistent WRT which yarn node hosts tasks with the issue
     *   It is not consistent WRT which kafka node hosts the partitions relevant to the issue
     *   The affected nodes are not the same from one appearance of the error to the
next
     *   This leads us to believe the bug is probably endemic to the whole cluster and not
the result of a random hardware issue
  *   Offsets for the LogParser application are maintained in a samza topic called something
like:
     *   __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1
  *   Upon startup, checkpoints are refreshed from that topic, and we'll see something in
the log similar to:
     *   kafka.KafkaCheckpointManager [INFO] Read 6000 from topic: __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1.
Current offset: 5999
     *   On more than one occasion, we have attempted to repair the job by killing individual
yarn containers and letting samza retry them.
        *   This will occasionally work. More frequently, it will get the partition stuck
in a loop trying to read from the __samza_checkpoint topic forever; we suspect that
the retry loop above is storing offsets one or more times, causing the topic to fill up considerably.
  *   We are aware of only two workarounds:
     *   1- Fully clearing out the data disks on the Kafka servers and rebuilding the topics
always seems to work, at least for a time.
     *   2- We can use a setting like: streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true,
which will necessarily ignore the checkpoint topic, and not bother to validate any offset
on the Store.
        *   This works, but requires us to do a lengthy metadata refresh immediately after
startup, which is less than ideal.
  *   We have also seen this on rare occasion on other, smaller Samza tiers
     *   In those cases, the common thread appears to be that the tier was left down for a
period of time longer than the Kafka retention timeout, and got stuck in the loop upon restart.
Attempts at reproducing it this way have been unsuccessful.
     *   Worth adding that in this case, adding the samza.reset.offset parameter to the configuration
did not seem to have the intended effect
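To test our suspicion that the checkpoint topic keeps growing, one option is to pull the counts out of the KafkaCheckpointManager log lines and compare them across restarts. The sketch below assumes the message appears on a single line in the format quoted above; `CheckpointRead` and `parse_checkpoint_read` are hypothetical helper names, not Samza APIs:

```python
import re
from typing import NamedTuple, Optional


class CheckpointRead(NamedTuple):
    topic: str
    messages_read: int
    current_offset: int


# Matches e.g. "Read 6000 from topic: <topic>. Current offset: 5999".
# The log format is assumed from the single sample line we have.
_CHECKPOINT_READ = re.compile(
    r"Read (\d+) from topic: (\S+)\.\s+Current offset: (\d+)"
)


def parse_checkpoint_read(line: str) -> Optional[CheckpointRead]:
    """Return the parsed counts, or None if the line is not a checkpoint read."""
    match = _CHECKPOINT_READ.search(line)
    if match is None:
        return None
    return CheckpointRead(
        topic=match.group(2),
        messages_read=int(match.group(1)),
        current_offset=int(match.group(3)),
    )
```

If `messages_read` jumps sharply between two startups of the same job, that would support the idea that the retry loop is writing to the checkpoint topic.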

On another possibly-related note, one of our clusters periodically throws an error like this,
but usually recovers without intervention:

KafkaSystemAdmin [WARN] Exception while trying to get offset for SystemStreamPartition [kafka,
Store_LogParser_RedactedMetadata_RedactedEnvironment, 32]: org.apache.kafka.common.errors.NotLeaderForPartitionException:
This server is not the leader for that topic-partition.. Retrying.


  *   We've seen this error message crop up when we've had issues with the network in our
datacenter, but we're not aware of any such issue at the times when we're experiencing the
bigger issue. We're not sure if that might be related or not.

Has anyone seen these errors before? Is there a known workaround or fix for it?

Thanks for your help!

Attached is a copy of the Samza configuration for the job in question, in case it contains
more valuable information I may have missed.

-Will Schneider

