spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bryan Cutler <cutl...@gmail.com>
Subject Re: Thoughts on Cloudpickle Update
Date Fri, 19 Jan 2018 18:56:51 GMT
Thanks Holden and Hyukjin.  I agree, let's start doing the work first and
see if it the changes are low risk enough, then we can evaluate how best to
proceed.  I made https://issues.apache.org/jira/browse/SPARK-23159 and will
get started on the update and we can continue to discuss in the PR.

On Fri, Jan 19, 2018 at 1:32 AM, Hyukjin Kwon <gurwls223@gmail.com> wrote:

> Yea, that sounds good to me.
>
> 2018-01-19 18:29 GMT+09:00 Holden Karau <holden@pigscanfly.ca>:
>
>> So it is pretty core, but its one of the better indirectly tested
>> components. I think probably the most reasonable path is to see what the
>> diff ends up looking like and make a call at that point for if we want it
>> to go to master or master & branch-2.3?
>>
>> On Fri, Jan 19, 2018 at 12:30 AM, Hyukjin Kwon <gurwls223@gmail.com>
>> wrote:
>>
>>> > So given that it fixes some real world bugs, any particular reason
>>> why? Would you be comfortable with doing it in 2.3.1?
>>>
>>> Ah, I don't feel strongly about this but RC2 will be running on and
>>> cloudpickle's quite core fix to PySpark. Just thought we might want to have
>>> enough time with it.
>>>
>>> One worry is, upgrading it includes a fix about namedtuple too where
>>> PySpark has a custom fix.
>>> I would like to check few things about this.
>>>
>>> So, yea, it's vague. I wouldn't stay against if you'd prefer.
>>>
>>>
>>>
>>>
>>> 2018-01-19 16:42 GMT+09:00 Holden Karau <holden@pigscanfly.ca>:
>>>
>>>>
>>>>
>>>> On Jan 19, 2018 7:28 PM, "Hyukjin Kwon" <gurwls223@gmail.com> wrote:
>>>>
>>>> > Is it an option to match the latest version of cloudpickle and still
>>>> set protocol level 2?
>>>>
>>>> IMHO, I think this can be an option but I am not fully sure yet if we
>>>> should/could go ahead for it within Spark 2.X. I need some
>>>> investigations including things about Pyrolite.
>>>>
>>>> Let's go ahead with matching it to 0.4.2 first. I am quite clear on
>>>> matching it to 0.4.2 at least.
>>>>
>>>> So given that there is a follow up on which fixes a regression if we're
>>>> not comfortable doing the latest version let's double-check that the
>>>> version we do upgrade to doesn't have that regression.
>>>>
>>>>
>>>>
>>>> > I agree that upgrading to try and match version 0.4.2 would be a
>>>> good starting point. Unless no one objects, I will open up a JIRA and try
>>>> to do this.
>>>>
>>>> Yup but I think we shouldn't make this into Spark 2.3.0 to be clear.
>>>>
>>>> So given that it fixes some real world bugs, any particular reason why?
>>>> Would you be comfortable with doing it in 2.3.1?
>>>>
>>>>
>>>>
>>>> > Also lets try to keep track in our commit messages which version of
>>>> cloudpickle we end up upgrading to.
>>>>
>>>> +1: PR description, commit message or any unit to identify each will be useful.
>>>> It should be easier once we have a matched version.
>>>>
>>>>
>>>>
>>>> 2018-01-19 12:55 GMT+09:00 Holden Karau <holden@pigscanfly.ca>:
>>>>
>>>>> So if there are different version of Python on the cluster machines I
>>>>> think that's already unsupported so I'm not worried about that.
>>>>>
>>>>> I'd suggest going to the highest released version since there appear
>>>>> to be some useful fixes between 0.4.2 & 0.5.2
>>>>>
>>>>> Also lets try to keep track in our commit messages which version of
>>>>> cloudpickle we end up upgrading to.
>>>>>
>>>>> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler <cutlerb@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for all the details and background Hyukjin! Regarding the
>>>>>> pickle protocol change, if I understand correctly, it is currently
at level
>>>>>> 2 in Spark which is good for backwards compatibility for all of Python
2.
>>>>>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0
and
>>>>>> above, will pick a level determined by your Python version. So is
the
>>>>>> concern here for Spark if someone has different versions of Python
in their
>>>>>> cluster, like 3.5 and 3.3, then different protocols will be used
and
>>>>>> deserialization might fail?  Is it an option to match the latest
version of
>>>>>> cloudpickle and still set protocol level 2?
>>>>>>
>>>>>> I agree that upgrading to try and match version 0.4.2 would be a
good
>>>>>> starting point. Unless no one objects, I will open up a JIRA and
try to do
>>>>>> this.
>>>>>>
>>>>>> Thanks,
>>>>>> Bryan
>>>>>>
>>>>>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon <gurwls223@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Bryan,
>>>>>>>
>>>>>>> Yup, I support to match the version. I pushed it forward before
to
>>>>>>> match it with https://github.com/cloudpipe/cloudpickle
>>>>>>> before few times in Spark's copy and also cloudpickle itself
with
>>>>>>> few fixes. I believe our copy is closest to 0.4.1.
>>>>>>>
>>>>>>> I have been trying to follow up the changes in cloudpipe/cloudpickle
>>>>>>> for which version we should match, I think we should match
>>>>>>> it with 0.4.2 first (I need to double check) because IMHO they
have
>>>>>>> been adding rather radical changes from 0.5.0, including
>>>>>>> pickle protocol change (by default).
>>>>>>>
>>>>>>> Personally, I would like to match it with the latest because
there
>>>>>>> have been some important changes. For
>>>>>>> example, see this too - https://github.com/cloudpipe
>>>>>>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually
>>>>>>> but 0.4.2 should be
>>>>>>> a good start point.
>>>>>>>
>>>>>>> For the strategy, I think we can match it and follow 0.4.x within
>>>>>>> Spark for the conservative and safe choice + minimal cost.
>>>>>>>
>>>>>>>
>>>>>>> I tried to leave few explicit answers to the questions from you,
>>>>>>> Bryan:
>>>>>>>
>>>>>>> > Spark is currently using a forked version and it seems like
>>>>>>> updates are made every now and then when
>>>>>>> > needed, but it's not really clear where the current state
is and
>>>>>>> how much it has diverged.
>>>>>>>
>>>>>>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>>>>>>>
>>>>>>>
>>>>>>> > Are there any known issues with recent changes from those
that
>>>>>>> follow cloudpickle dev?
>>>>>>>
>>>>>>> I am technically involved in cloudpickle dev although less active.
>>>>>>> They changed default pickle protocol (https://github.com/cloudpipe/
>>>>>>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double
>>>>>>> check
>>>>>>> the potential compatibility issue, or fix the protocol, which
I
>>>>>>> believe is introduced from 0.5.x.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler <cutlerb@gmail.com>:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I've seen a couple issues lately related to cloudpickle,
notably
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-22674, and would
like
>>>>>>>> to get some feedback on updating the version in PySpark which
should fix
>>>>>>>> these issues and allow us to remove some workarounds.  Spark
is currently
>>>>>>>> using a forked version and it seems like updates are made
every now and
>>>>>>>> then when needed, but it's not really clear where the current
state is and
>>>>>>>> how much it has diverged.  This makes back-porting fixes
difficult.  There
>>>>>>>> was a previous discussion on moving it to a dependency here
>>>>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>,
>>>>>>>> but given the status right now I think it would be best to
do another
>>>>>>>> update and bring things closer to upstream before we talk
about completely
>>>>>>>> moving it outside of Spark.  Before starting another update,
it might be
>>>>>>>> good to discuss the strategy a little.  Should the version
in Spark be
>>>>>>>> derived from a release or at least tied to a specific commit?
 It would
>>>>>>>> also be good if we can document where it has diverged.  Are
there any known
>>>>>>>> issues with recent changes from those that follow cloudpickle
dev?  Any
>>>>>>>> other thoughts or concerns?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Bryan
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>

Mime
View raw message