[ https://issues.apache.org/jira/browse/DATAFU-148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849318#comment-16849318 ]
Russell Jurney edited comment on DATAFU-148 at 5/28/19 4:16 AM:
----------------------------------------------------------------
[~matterhayes]' code review brings up an important point: the API should decorate the `pyspark.sql.DataFrame` API.
Why not have an `activate()` or `initialize()` method that adds these methods to the `DataFrame` class? [pymongo_spark|https://github.com/mongodb/mongo-hadoop/blob/master/spark/src/main/python/pymongo_spark.py] (part of [mongo-hadoop|https://github.com/mongodb/mongo-hadoop]) does this to add methods like `pyspark.rdd.RDD.saveToMongoDB`, which makes the API consistent with PySpark's.
See: https://github.com/mongodb/mongo-hadoop/tree/master/spark/src/main/python#usage
You use it like this:
{code:python}
import pymongo_spark
pymongo_spark.activate()
...
some_rdd.saveToMongoDB('mongodb://localhost:27017/db.output_collection')
{code}
And internally it looks like this:
{code:python}
def activate():
    """Activate integration between PyMongo and PySpark.

    This function only needs to be called once.
    """
    # Patch methods in rather than extending these classes. Many RDD methods
    # result in the creation of a new RDD, whose exact type is beyond our
    # control. However, we would still like to be able to call any of our
    # methods on the resulting RDDs.
    pyspark.rdd.RDD.saveToMongoDB = saveToMongoDB
{code}
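By analogy, here is a minimal sketch of how datafu-spark could follow the same pattern for `pyspark.sql.DataFrame`. The module name `datafu_spark`, the `activate()` entry point, and the `dedup_with_order` method below are hypothetical placeholders, with a plain-PySpark body standing in for whatever the real DataFu implementation would do:
{code:python}
import pyspark.sql
from pyspark.sql import Window
from pyspark.sql import functions as F


def dedup_with_order(self, group_col, order_col):
    """Hypothetical DataFu method patched onto DataFrame.

    Keeps one row per value of group_col, choosing the row with the
    highest order_col.
    """
    w = Window.partitionBy(group_col).orderBy(F.col(order_col).desc())
    return (self.withColumn('_row_number', F.row_number().over(w))
                .where(F.col('_row_number') == 1)
                .drop('_row_number'))


def activate():
    """Patch DataFu methods onto pyspark.sql.DataFrame.

    As in pymongo_spark, we patch rather than subclass: DataFrame
    transformations return new DataFrames whose exact type is beyond
    our control, so patched methods stay available on every result.
    """
    pyspark.sql.DataFrame.dedup_with_order = dedup_with_order
{code}
A caller would then run `import datafu_spark; datafu_spark.activate()` once and call `df.dedup_with_order("id", "timestamp")` like any other `DataFrame` method.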
> Setup Spark sub-project
> -----------------------
>
> Key: DATAFU-148
> URL: https://issues.apache.org/jira/browse/DATAFU-148
> Project: DataFu
> Issue Type: New Feature
> Reporter: Eyal Allweil
> Assignee: Eyal Allweil
> Priority: Major
> Attachments: patch.diff, patch.diff
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Create a skeleton Spark sub project for Spark code to be contributed to DataFu