spark-user mailing list archives

From Reminia Scarlet <reminia.scar...@gmail.com>
Subject Re: SparkStreming logical plan leaf nodes is not equal pysical plan leaf nodes and streaming metrics cannot be reported.
Date Thu, 24 Oct 2019 06:14:02 GMT
@Jungtaek Lim <kabhwan.opensource@gmail.com>
We join a streaming DataFrame from Event Hubs with static DataFrames loaded
from CSV and Parquet using the plain spark.read.csv / spark.read.parquet methods.
Are you sure this is a bug? I am not that familiar with the Spark code.
I am also forwarding this to the dev mailing list for help.


On Thu, Oct 24, 2019 at 6:11 AM Jungtaek Lim <kabhwan.opensource@gmail.com>
wrote:

> Sorry, I haven't checked the details of SPARK-24050. It looks like it was only
> resolved for DSv2 sources, and some streaming sources still use
> DSv1.
> The file stream source is one of them, so SPARK-24050 may not help here. I
> guess there was a technical reason to deal only with DSv2, so I'm not sure
> there's a good way to deal with this.
>
> The file stream source seems to be migrated to DSv2 in Spark 3.0, so
> hopefully Spark 3.0 will help solve the problem.
>
> On Wed, Oct 23, 2019 at 11:21 PM Reminia Scarlet <
> reminia.scarlet@gmail.com> wrote:
>
>> @Jungtaek
>> I'm using Spark 2.4 (HDI 4.0) in Azure.
>> Maybe there are other corner cases that were not taken into consideration.
>> I will also decompile the Spark jar from Azure to check the source code.
>>
>> On Wed, Oct 23, 2019 at 9:39 PM Jungtaek Lim <
>> kabhwan.opensource@gmail.com> wrote:
>>
>>> Which version of Spark are you using?
>>> I believe there was a relevant issue, SPARK-24050 [1], which was fixed in
>>> Spark 2.4.0, so if you are on a lower version you may want to try the
>>> latest version.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> 1. https://issues.apache.org/jira/browse/SPARK-24050
>>>
>>> On Wed, Oct 23, 2019 at 9:57 PM Reminia Scarlet <
>>> reminia.scarlet@gmail.com> wrote:
>>>
>>>> Hi all:
>>>>  I use a StreamingQueryListener to report each batch's inputRecordsNum as
>>>> a metric.
>>>>  But numInputRows is always 0, and the log from
>>>> MicroBatchExecution.scala says:
>>>>
>>>>  2019-10-23 06:56:05 WARN  MicroBatchExecution:66 - Could not report metrics as number leaves in trigger logical plan did not match that of the execution plan:
>>>>
>>>>  This causes the number of input rows per source to always be 0, because of the
>>>> code below in ProgressReporter.scala, which applies when the number of leaves in
>>>> the logical plan does not match that in the execution plan.
>>>>
>>>> [image: image.png]
>>>> Attached are the leaves of the logical plan and of the physical plan. I think
>>>> there might be a bug: LogicalRDD seems to duplicate Relation in the logical plan,
>>>> so it is counted twice as a leaf. If we remove the LogicalRDD, the leaf counts
>>>> should be the same.
>>>>
>>>> [image: image.png]
>>>> [image: image.png]
>>>>
>>>> Can anyone help? Thx very much.
>>>>
>>>>
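
The leaf-count check described in the original message can be illustrated with a small toy model. This is a sketch in plain Python, not Spark's code: Node, collect_leaves and rows_per_source are hypothetical names, and the plan shapes are made up for illustration. It assumes, as the thread describes, that logical and physical leaves are paired by position and that no per-source counts are reported when the two leaf counts differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical plan node; stands in for both logical and physical operators."""
    name: str
    children: List["Node"] = field(default_factory=list)
    source: Optional[str] = None   # set only on logical leaves backed by a streaming source
    num_output_rows: int = 0       # set only on physical leaves


def collect_leaves(node: Node) -> List[Node]:
    """Return the leaves of the tree in left-to-right order."""
    if not node.children:
        return [node]
    leaves: List[Node] = []
    for child in node.children:
        leaves.extend(collect_leaves(child))
    return leaves


def rows_per_source(logical: Node, physical: Node) -> Dict[str, int]:
    """Pair leaves by position; give up (empty result) if the counts differ."""
    logical_leaves = collect_leaves(logical)
    physical_leaves = collect_leaves(physical)
    if len(logical_leaves) != len(physical_leaves):
        # Assumed to correspond to the "Could not report metrics" warning.
        return {}
    totals: Dict[str, int] = {}
    for l_leaf, p_leaf in zip(logical_leaves, physical_leaves):
        if l_leaf.source is not None:
            totals[l_leaf.source] = totals.get(l_leaf.source, 0) + p_leaf.num_output_rows
    return totals


# Physical plan: one streaming scan and one static file scan.
physical = Node("JoinExec", [
    Node("StreamScan", num_output_rows=100),
    Node("FileScan csv", num_output_rows=5),
])

# Logical plan whose leaves line up one-to-one with the physical leaves.
logical_ok = Node("Join", [
    Node("StreamingRelation", source="eventhub"),
    Node("Relation[csv]"),
])

# Logical plan with one extra leaf, as in the reported LogicalRDD duplication.
logical_dup = Node("Join", [
    Node("StreamingRelation", source="eventhub"),
    Node("Relation[csv]"),
    Node("LogicalRDD"),
])

print(rows_per_source(logical_ok, physical))
print(rows_per_source(logical_dup, physical))
```

With matching leaf counts the streaming source gets its row count; with the extra logical leaf the result is empty, which a listener would see as 0 input rows.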
