From "Николай Ижиков" <nizhikov....@gmail.com>
Subject Re: Integration of Spark and Ignite. Prototype.
Date Wed, 18 Oct 2017 17:02:57 GMT
Hello, Valentin.

My answers are below.
Dmitry, do we need to move the discussion to Jira?

 > 1. Why do we have org.apache.spark.sql.ignite package in our codebase?

As I mentioned earlier, to implement and override the Spark Catalog one has to use the internal (private)
Spark API.
So I have to use the package `org.apache.spark.sql.***` to get access to private classes and variables.

For example, the SharedState class, which stores the link to the ExternalCatalog, is declared as `private[sql]
class SharedState` - i.e. package-private.
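
To illustrate, here is a minimal sketch of mine (assuming the Spark 2.1 declarations; the object name is just for illustration):

// Scala qualified access: `private[sql]` members are visible to any code
// located inside the org.apache.spark.sql package tree, subpackages included.
package org.apache.spark.sql.ignite

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SharedState

object InternalAccessExample {
  // Compiles here because this file lives under org.apache.spark.sql.*;
  // the same call from org.apache.ignite.spark would not compile.
  def sharedStateOf(session: SparkSession): SharedState =
    session.sharedState
}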

 > Can these classes reside under org.apache.ignite.spark instead?

No, as long as we want to have our own implementation of ExternalCatalog.

 > 2. IgniteRelationProvider contains multiple constants which I guess are some kind of
config options. Can you describe the purpose of each of them?

I have extended the comments for these options.
Please see my commit [1] or the PR HEAD.

 > 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog implementations
and what is the difference?

Good catch, thank you!
After additional research I found that only IgniteExternalCatalog is required.
I will update the PR with IgniteCatalog removed in a few days.

 > 4. IgniteStrategy and IgniteOptimization are currently no-op. What are our plans on
implementing them? Also, what exactly is planned in IgniteOptimization and what is its purpose?

Actually, this is a very good question :)
And I need advice from experienced community members here:

The purpose of `IgniteOptimization` is to modify the query plan created by Spark.
Currently, we have one optimization, described in IGNITE-3084 [2] by you, Valentin :) :

“If there are non-Ignite relations in the plan, we should fall back to native Spark strategies”

I think we can go a little further and reduce a join of two Ignite-backed Data Frames into a single
Ignite SQL query. Currently, this feature is unimplemented.
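
Schematically, such an optimization could be a Catalyst rule like the sketch below. Note that the `IgniteBackedRelation` marker is a placeholder of mine, not a class from the PR, and the rule deliberately returns the join unchanged:

package org.apache.spark.sql.ignite

import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.LogicalRelation
import org.apache.spark.sql.sources.BaseRelation

// Placeholder marker for relations backed by an Ignite SQL table.
trait IgniteBackedRelation extends BaseRelation

object IgniteJoinReduction extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Both children of the join are Ignite-backed relations: this whole
    // subtree could be replaced by a single relation that issues one
    // Ignite SQL query with a JOIN clause instead of joining in Spark.
    case join @ Join(LogicalRelation(_: IgniteBackedRelation, _, _),
                     LogicalRelation(_: IgniteBackedRelation, _, _), _, _) =>
      join // placeholder: keep the plan as-is until the rewrite is implemented
  }
}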

*Do we need it now? Or can we postpone it and concentrate on the basic Data Frame and Catalog
implementation?*

The purpose of `Strategy`, as you correctly mentioned in [2], is to transform a LogicalPlan into physical
operators.
I don’t have ideas on how to use this opportunity. So I think we don’t need IgniteStrategy.

Can you or anyone else suggest some optimization strategy to speed up SQL query execution?

 > 5. I don't like that IgniteStrategy and IgniteOptimization have to be set manually on
SQLContext each time it's created.... Is there any way to automate this and improve usability?

These classes are added to `extraOptimizations` when one uses IgniteSparkSession.
As far as I know, there is no way to automatically add these classes to a regular SparkSession.
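
With a plain SparkSession the registration has to be done by hand on every session, roughly like this (the import path of IgniteOptimization is my assumption based on the PR layout):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.ignite.IgniteOptimization // path assumed from the PR

val spark = SparkSession.builder()
  .appName("manual-rule-registration")
  .master("local")
  .getOrCreate()

// `experimental` is public Spark API; extra optimizer rules are appended
// per session. This is exactly the step IgniteSparkSession automates.
spark.experimental.extraOptimizations =
  spark.experimental.extraOptimizations :+ IgniteOptimization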

 > 6. What is the purpose of IgniteSparkSession? I see it's used in IgniteCatalogExample
but not in IgniteDataFrameExample, which is confusing.

The DataFrame API is a *public* Spark API, so anyone can provide an implementation and plug it into
Spark. That’s why IgniteDataFrameExample doesn’t need any Ignite-specific session.

The Catalog API is an *internal* Spark API. There is no way to plug a custom catalog implementation
into Spark [3]. So we have to use `IgniteSparkSession`, which extends the regular SparkSession and
overrides the link to `ExternalCatalog`.
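
Schematically it looks like the sketch below. This is not the PR's exact code; `IgniteSharedState` stands in for the real Ignite-aware SharedState subclass:

package org.apache.spark.sql.ignite

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SharedState

// Stand-in: a real implementation would override `externalCatalog`
// to return an IgniteExternalCatalog instead of the default catalog.
class IgniteSharedState(sc: SparkContext) extends SharedState(sc)

// Living inside org.apache.spark.sql.* lets the subclass call the
// private[sql] SparkSession constructor and override the private[sql]
// sharedState member.
class IgniteSparkSession(sc: SparkContext) extends SparkSession(sc) {
  override private[sql] lazy val sharedState: SharedState =
    new IgniteSharedState(sc)
}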

 > 7. To create IgniteSparkSession we first create IgniteContext. Is it really needed?
It looks like we can directly provide the configuration file; if IgniteSparkSession really
requires IgniteContext, it can create it by itself under the hood.

Actually, IgniteContext is the base class for the Ignite <-> Spark integration for now, so I
tried to reuse it here. I like the idea of removing the explicit usage of IgniteContext.
I will implement it in a few days.

 > Actually, I think it makes sense to create a builder similar to SparkSession.builder()...

Great idea! I will implement such a builder in a few days.
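
Usage could look something like this; the method names mirror SparkSession.builder() and are hypothetical until the builder exists:

// Hypothetical builder API; igniteConfig() is an assumed name for
// passing the Ignite configuration file.
val igniteSession = IgniteSparkSession.builder()
  .appName("Spark Ignite catalog example")
  .master("local")
  .igniteConfig("/path/to/ignite-config.xml")
  .getOrCreate()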

 > 9. Do I understand correctly that IgniteCacheRelation is for the case when we don't
have SQL configured on Ignite side?

Yes, IgniteCacheRelation is the Data Frame implementation for a key-value cache.

 > I thought we decided not to support this, no? Or this is something else?

My understanding is the following:

1. We can’t support automatic resolution of key-value caches in the *ExternalCatalog*, because there
is no way to reliably detect the key and value classes.

2. We can support key-value caches in the regular Data Frame implementation, because we can require
the user to provide the key and value classes explicitly.
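
Reading such a cache could then look roughly like this, assuming a SparkSession `spark` in scope (the format short name and the option names `cache`, `keyClass`, `valueClass` are illustrative, not necessarily the PR's final ones):

// A key-value cache becomes a Data Frame only when the user names the
// key and value classes explicitly; the catalog cannot detect them.
val personDf = spark.read
  .format("ignite")                           // assumed data source short name
  .option("cache", "testCache")
  .option("keyClass", "java.lang.Long")
  .option("valueClass", "org.example.Person") // assumed FQN of the case class
  .load()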

 > 8. Can you clarify the query syntax in IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?

Key-value cache:

key - java.lang.Long,
value - case class Person(name: String, birthDate: java.util.Date)

The schema of the data frame for this cache is:

key - long
value.name - string
value.birthDate - date

So we can select data from the cache:

SELECT
   key, `value.name`,  `value.birthDate`
FROM
   testCache
WHERE key >= 2 AND `value.name` like '%0'
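
For completeness, running that query through native Spark SQL could look like this (a sketch that reuses the hypothetical `personDf` from above):

// Register the cache-backed Data Frame under the name used in the query,
// then run plain Spark SQL against it.
personDf.createOrReplaceTempView("testCache")

val matched = spark.sql(
  """SELECT key, `value.name`, `value.birthDate`
    |FROM testCache
    |WHERE key >= 2 AND `value.name` LIKE '%0'""".stripMargin)

matched.show()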

[1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6febf417bc59b0519156fd4d09114c8da7
[2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCommentId=15794210&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15794210
[3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCommentId=15543733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15543733


On 18.10.2017 04:39, Dmitriy Setrakyan wrote:
> Val, thanks for the review. Can I ask you to add the same comments to the
> ticket?
> 
> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
> 
>> Nikolay, Anton,
>>
>> I did a high level review of the code. First of all, impressive results!
>> However, I have some questions/comments.
>>
>> 1. Why do we have org.apache.spark.sql.ignite package in our codebase? Can
>> these classes reside under org.apache.ignite.spark instead?
>> 2. IgniteRelationProvider contains multiple constants which I guess are
>> some kind of config options. Can you describe the purpose of each of them?
>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
>> implementations and what is the difference?
>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What are our
>> plans on implementing them? Also, what exactly is planned in
>> IgniteOptimization and what is its purpose?
>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be set
>> manually on SQLContext each time it's created. This seems to be very error
>> prone. Is there any way to automate this and improve usability?
>> 6. What is the purpose of IgniteSparkSession? I see it's used
>> in IgniteCatalogExample but not in IgniteDataFrameExample, which is
>> confusing.
>> 7. To create IgniteSparkSession we first create IgniteContext. Is it really
>> needed? It looks like we can directly provide the configuration file; if
>> IgniteSparkSession really requires IgniteContext, it can create it by
>> itself under the hood. Actually, I think it makes sense to create a builder
>> similar to SparkSession.builder(), it would be good if our APIs here are
>> consistent with Spark APIs.
>> 8. Can you clarify the query syntax
>> in IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>> 9. Do I understand correctly that IgniteCacheRelation is for the case when
>> we don't have SQL configured on Ignite side? I thought we decided not to
>> support this, no? Or this is something else?
>>
>> Thanks!
>>
>> -Val
>>
>> On Tue, Oct 17, 2017 at 4:40 AM, Anton Vinogradov <
>> avinogradov@gridgain.com>
>> wrote:
>>
>>> Sounds awesome.
>>>
>>> I'll try to review API & tests this week.
>>>
>>> Val,
>>> Your review still required :)
>>>
>>> On Tue, Oct 17, 2017 at 2:36 PM, Николай Ижиков <nizhikov.dev@gmail.com>
>>> wrote:
>>>
>>>> Yes
>>>>
>>>> On Oct 17, 2017, 2:34 PM, "Anton Vinogradov" <avinogradov@gridgain.com> wrote:
>>>>
>>>>> Nikolay,
>>>>>
>>>>> So, it will be able to start regular spark and ignite clusters and,
>>>>> using peer classloading via spark-context, perform any DataFrame request,
>>>>> correct?
>>>>>
>>>>> On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <nizhikov.dev@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello, Anton.
>>>>>>
>>>>>> An example you provide is a path to a master *local* file.
>>>>>> These libraries are added to the classpath for each remote node
>>>>>> running submitted job.
>>>>>>
>>>>>> Please, see documentation:
>>>>>>
>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar(java.lang.String)
>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addFile(java.lang.String)
>>>>>>
>>>>>>
>>>>>> 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <avinogradov@gridgain.com>:
>>>>>>
>>>>>>> Nikolay,
>>>>>>>
>>>>>>>> With Data Frame API implementation there are no requirements to have
>>>>>>>> any Ignite files on spark worker nodes.
>>>>>>>
>>>>>>> What do you mean? I see code like:
>>>>>>>
>>>>>>> spark.sparkContext.addJar(MAVEN_HOME +
>>>>>>>   "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-core-2.3.0-SNAPSHOT.jar")
>>>>>>>
>>>>>>> On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <nizhikov.dev@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello, guys.
>>>>>>>>
>>>>>>>> I have created example application to run Ignite Data Frame on
>>>>>>>> standalone Spark cluster.
>>>>>>>> With Data Frame API implementation there are no requirements to have
>>>>>>>> any Ignite files on spark worker nodes.
>>>>>>>>
>>>>>>>> I ran this application on the free dataset: ATP tennis match
>>>>>>>> statistics.
>>>>>>>>
>>>>>>>> data - https://github.com/nizhikov/atp_matches
>>>>>>>> app - https://github.com/nizhikov/ignite-spark-df-example
>>>>>>>>
>>>>>>>> Valentin, do you have a chance to look at my changes?
>>>>>>>>
>>>>>>>>
>>>>>>>> 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <valentin.kulichenko@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi Nikolay,
>>>>>>>>>
>>>>>>>>> Sorry for delay on this, got a little swamped lately. I will do my
>>>>>>>>> best to review the code this week.
>>>>>>>>>
>>>>>>>>> -Val
>>>>>>>>>
>>>>>>>>> On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <nizhikov.dev@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello, Valentin.
>>>>>>>>>>
>>>>>>>>>> Did you have a chance to look at my changes?
>>>>>>>>>>
>>>>>>>>>> Now I think I have done almost all required features.
>>>>>>>>>> I want to make some performance test to ensure my implementation
>>>>>>>>>> work properly with a significant amount of data.
>>>>>>>>>> And I definitely need some feedback for my changes.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <nizhikov.dev@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>
>>>>>>>>>>> Which version of Spark do we want to use?
>>>>>>>>>>>
>>>>>>>>>>> 1. Currently, Ignite depends on Spark 2.1.0.
>>>>>>>>>>>
>>>>>>>>>>>      * Can be run on JDK 7.
>>>>>>>>>>>      * Still supported: 2.1.2 will be released soon.
>>>>>>>>>>>
>>>>>>>>>>> 2. Latest Spark version is 2.2.0.
>>>>>>>>>>>
>>>>>>>>>>>      * Can be run only on JDK 8+
>>>>>>>>>>>      * Released Jul 11, 2017.
>>>>>>>>>>>      * Already supported by huge vendors (Amazon for example).
>>>>>>>>>>>
>>>>>>>>>>> Note that in IGNITE-3084 I implement some internal Spark API.
>>>>>>>>>>> So it will take some effort to switch between Spark 2.1 and 2.2.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <valentin.kulichenko@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>> I will review in the next few days.
>>>>>>>>>>>>
>>>>>>>>>>>> -Val
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dmagda@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Nikolay,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is good news. Finally this capability is coming to Ignite.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Val, Vladimir, could you do a preliminary review?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Answering on your questions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Yardstick should be enough for performance measurements. As a
>>>>>>>>>>>>> Spark user, I will be curious to know what’s the point of this
>>>>>>>>>>>>> integration. Probably we need to compare Spark + Ignite and
>>>>>>>>>>>>> Spark + Hive or Spark + RDBMS cases.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. If Spark community is reluctant let’s include the module in
>>>>>>>>>>>>> ignite-spark integration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> —
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 25, 2017, at 11:14 AM, Николай Ижиков <nizhikov.dev@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently, I’m working on integration between Spark and Ignite [1].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For now, I implement following:
>>>>>>>>>>>>>>     * Ignite DataSource implementation (IgniteRelationProvider)
>>>>>>>>>>>>>>     * DataFrame support for Ignite SQL table.
>>>>>>>>>>>>>>     * IgniteCatalog implementation for a transparent resolving of
>>>>>>>>>>>>>> ignites SQL tables.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Implementation of it can be found in PR [2].
>>>>>>>>>>>>>> It would be great if someone provides feedback for a prototype.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I made some examples in PR so you can see how API suppose to be
>>>>>>>>>>>>>> used [3], [4].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I need some advice. Can you help me?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. How should this PR be tested?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Of course, I need to provide some unit tests. But what about
>>>>>>>>>>>>>> scalability tests, etc.
>>>>>>>>>>>>>> Maybe we need some Yardstick benchmark or similar?
>>>>>>>>>>>>>> What are your thoughts?
>>>>>>>>>>>>>> Which scenarios should I consider in the first place?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Should we provide Spark Catalog implementation inside Ignite
>>>>>>>>>>>>>> codebase?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A current implementation of Spark Catalog based on *internal
>>>>>>>>>>>>>> Spark API*.
>>>>>>>>>>>>>> Spark community seems not interested in making Catalog API public
>>>>>>>>>>>>>> or including Ignite Catalog in Spark code base [5], [6].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Should we include Spark internal API implementation inside Ignite
>>>>>>>>>>>>>> code base?*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or should we consider to include Catalog implementation in some
>>>>>>>>>>>>>> external module? That will be created and released outside Ignite?
>>>>>>>>>>>>>> (we still can support and develop it inside Ignite community).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-3084
>>>>>>>>>>>>>> [2] https://github.com/apache/ignite/pull/2742
>>>>>>>>>>>>>> [3] https://github.com/apache/ignite/pull/2742/files#diff-f4ff509cef3018e221394474775e0905
>>>>>>>>>>>>>> [4] https://github.com/apache/ignite/pull/2742/files#diff-f2b670497d81e780dfd5098c5dd8a89c
>>>>>>>>>>>>>> [5] http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Core-Custom-Catalog-Integration-between-Apache-Ignite-and-Apache-Spark-td22452.html
>>>>>>>>>>>>>> [6] https://issues.apache.org/jira/browse/SPARK-17767
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nikolay Izhikov
>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nikolay Izhikov
>>>>>> NIzhikov.dev@gmail.com
>>>>>>
>>>>>
>>>>
>>>
>>
> 
