spark-user mailing list archives

From Sergey Zhemzhitsky <szh.s...@gmail.com>
Subject Re: DataFrames :: Corrupted Data
Date Wed, 28 Mar 2018 21:04:35 GMT
I doubt that this issue is connected with the string encoding,
because

- "pr^?files.10056.10040" should be "profiles.10056.10040" and is
defined as constant in the source code
- "profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@"
should not occur in the exception at all, because such strings are
not created within the job
- the corrupted strings are defined within the job itself; there is
no such data in the input
- when YARN restarts the job for the second time after the first
failure, the job completes successfully
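
To gather more evidence for the points above, one option is a catch-all case that reports the length and the code points of any unexpected key instead of failing with scala.MatchError. The following is only a minimal sketch: KeyDiagnostics, escape and classify are hypothetical names, the pattern is an assumed stand-in for ProfileAttrsRegex (the real regex is not shown in this thread), and the returned labels stand in for the updateMetrics(...) calls.

```scala
import scala.util.matching.Regex

object KeyDiagnostics {
  // Assumed stand-in for the job's ProfileAttrsRegex; the real pattern is not shown in the thread.
  val ProfileAttrsRegex: Regex = """profiles\.(\d+)\.(\d+)""".r

  // Render control and non-ASCII characters as <U+XXXX> so that
  // bytes such as NUL (^@) and DEL (^?) become visible in the logs.
  def escape(s: String): String =
    s.map { c =>
      if (c < 0x20 || c >= 0x7f) f"<U+${c.toInt}%04X>" else c.toString
    }.mkString

  // Same shape as the match in the UDAF, plus a catch-all that
  // describes the offending key instead of throwing scala.MatchError.
  def classify(key: String): String = key match {
    case "profiles.total"        => "total"
    case "profiles.biz"          => "biz"
    case ProfileAttrsRegex(a, b) => s"attrs($a,$b)"
    case other                   => s"unexpected(len=${other.length}, chars=${escape(other)})"
  }
}
```

Logging the result of the catch-all branch together with the executor host would show whether the corruption always involves the same characters and offsets, which may help to decide whether the JIRA issues mentioned below apply.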




On Wed, Mar 28, 2018 at 10:31 PM, Jörn Franke <jornfranke@gmail.com> wrote:
> Encoding issue of the data? E.g. Spark uses UTF-8, but the source encoding is different?
>
>> On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky <szh.subs@gmail.com> wrote:
>>
>> Hello guys,
>>
>> I'm using Spark 2.2.0 and from time to time my job fails, printing
>> the following errors to the log
>>
>> scala.MatchError:
>> profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
>> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>> scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
>>
>> The job itself looks like the following and contains a few shuffles and UDAFs
>>
>> val df = spark.read.avro(...).as[...]
>>      .groupBy(...)
>>      .agg(collect_list(...).as(...))
>>      .select(explode(...).as(...))
>>      .groupBy(...)
>>      .agg(sum(...).as(...))
>>      .groupBy(...)
>>      .agg(collectMetrics(...).as(...))
>>
>> The errors occur in the collectMetrics UDAF in the following snippet
>>
>> key match {
>>  case "profiles.total" => updateMetrics(...)
>>  case "profiles.biz" => updateMetrics(...)
>>  case ProfileAttrsRegex(...) => updateMetrics(...)
>> }
>>
>> ... and I'm absolutely ok with scala.MatchError because there is no
>> "catch all" case in the pattern matching expression, but the strings
>> containing corrupted characters seem to be very strange.
>>
>> I've found the following JIRA issues, but it's difficult to say
>> whether they are related to my case:
>> - https://issues.apache.org/jira/browse/SPARK-22092
>> - https://issues.apache.org/jira/browse/SPARK-23512
>>
>> So I'm wondering, has anybody ever seen this kind of behaviour and
>> what could be the problem?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
