spark-dev mailing list archives

From Jungtaek Lim <kabhwan.opensou...@gmail.com>
Subject Re: SparkStreaming logical plan leaf nodes not equal to physical plan leaf nodes and streaming metrics cannot be reported.
Date Thu, 24 Oct 2019 07:11:15 GMT
What you've seen is the code path taken when at least one DSv1 source is
used in the query; the leaf counts fail to match due to that limitation.

SPARK-24050 describes the "technical limitation" of resolving this when a
DSv1 source is used, so please refer to the description of that issue if
you're interested.
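[Editor's note: a minimal, hypothetical sketch of the behavior discussed in this thread. This is not Spark's actual implementation; function and source names are invented for illustration. The point is the guard described in the warning below: per-source input rows are only attributed when the trigger's logical plan and the execution plan have the same number of leaf nodes, otherwise everything falls back to 0.]

```python
# Hypothetical sketch -- NOT Spark's actual code. Input rows can only be
# attributed to sources when the logical and execution plans have the same
# number of leaves; on mismatch, all sources report 0 input rows.
def associate_input_rows(logical_leaves, physical_row_counts):
    """Return a source -> numInputRows mapping, or all zeros on mismatch."""
    if len(logical_leaves) != len(physical_row_counts):
        # Leaf counts differ: rows cannot be attributed reliably,
        # mirroring the "Could not report metrics" warning.
        return {source: 0 for source in logical_leaves}
    return dict(zip(logical_leaves, physical_row_counts))

print(associate_input_rows(["eventhub"], [42]))
# -> {'eventhub': 42}
print(associate_input_rows(["eventhub", "csv-duplicate"], [42]))
# -> {'eventhub': 0, 'csv-duplicate': 0}
```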


On Thu, Oct 24, 2019 at 3:14 PM Reminia Scarlet <reminia.scarlet@gmail.com>
wrote:

> @Jungtaek Lim <kabhwan.opensource@gmail.com>
> We joined a streaming DataFrame from eventhub with static DataFrames from
> csv and parquet, read with the plain spark.read.csv / spark.read.parquet
> methods.
> Are you sure this is a bug? I am not that familiar with the Spark code.
> Also forwarding to the dev mailing list for help.
>
>
> On Thu, Oct 24, 2019 at 6:11 AM Jungtaek Lim <kabhwan.opensource@gmail.com>
> wrote:
>
>> Sorry, I hadn't checked the details of SPARK-24050. Looks like it was
>> only resolved for DSv2 sources, and there are some streaming sources still
>> using DSv1.
>> The file stream source is one of those cases, so SPARK-24050 may not help
>> here. I guess there was a technical reason it only dealt with DSv2, so I'm
>> not sure there's a good way to handle this.
>>
>> Hopefully the file stream source will be migrated to DSv2 in Spark 3.0,
>> so Spark 3.0 should help solve the problem.
>>
>> On Wed, Oct 23, 2019 at 11:21 PM Reminia Scarlet <
>> reminia.scarlet@gmail.com> wrote:
>>
>>> @Jungtaek
>>> I'm using Spark 2.4 (HDI 4.0) in Azure.
>>> Maybe there are other corner cases not taken into consideration.
>>> I will also decompile the Spark jar from Azure to check the source code.
>>>
>>> On Wed, Oct 23, 2019 at 9:39 PM Jungtaek Lim <
>>> kabhwan.opensource@gmail.com> wrote:
>>>
>>>> Which version of Spark are you using?
>>>> There was a relevant issue, SPARK-24050 [1], which was fixed in
>>>> Spark 2.4.0, so if you're on a lower version you may want to try the
>>>> latest one.
>>>>
>>>> - Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1. https://issues.apache.org/jira/browse/SPARK-24050
>>>>
>>>> On Wed, Oct 23, 2019 at 9:57 PM Reminia Scarlet <
>>>> reminia.scarlet@gmail.com> wrote:
>>>>
>>>>> Hi all:
>>>>>  I use a StreamingQueryListener to report each batch's input record
>>>>> count as a metric, but numInputRows is always 0. The debug log in
>>>>> MicroBatchExecution.scala says:
>>>>>
>>>>>  2019-10-23 06:56:05 WARN  MicroBatchExecution:66 - Could not report
>>>>> metrics as number leaves in trigger logical plan did not match that of
>>>>> the execution plan:
>>>>>
>>>>>  This causes the per-source input row counts to always be 0 in the
>>>>> code below from ProgressReporter.scala, which runs when the number of
>>>>> leaves in the logical plan does not match the execution plan.
>>>>>
>>>>> [image: image.png]
>>>>> Attached are the output logical plan and physical plan leaves. I think
>>>>> there might be a bug: LogicalRDD seems to duplicate the Relation in the
>>>>> logical plan and is counted twice as a leaf. If we remove the
>>>>> LogicalRDD, the leaf counts should be the same.
>>>>>
>>>>> [image: image.png]
>>>>> [image: image.png]
>>>>>
>>>>> Can anyone help? Thx very much.
>>>>>
>>>>>
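[Editor's note: the duplicate-leaf hypothesis in the message above can be illustrated with a toy model. The leaf names here are invented for illustration and are not Spark's actual plan node representation.]

```python
# Toy model of the reported situation; leaf names are hypothetical.
logical_leaves = ["StreamingRelation(eventhub)", "LogicalRDD(csv)", "Relation(csv)"]
physical_leaves = ["MicroBatchScan(eventhub)", "FileScan(csv)"]

# The static csv source shows up twice in the logical plan (LogicalRDD and
# Relation), so the leaf counts differ and metrics fall back to zero.
print(len(logical_leaves) == len(physical_leaves))  # False

# Dropping the duplicated LogicalRDD leaf, as suggested in the message above,
# brings the counts back into agreement.
deduped = [leaf for leaf in logical_leaves if not leaf.startswith("LogicalRDD")]
print(len(deduped) == len(physical_leaves))  # True
```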
