spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server
Date Mon, 06 May 2019 16:58:54 GMT
Hence, what I mentioned initially does sound correct?

On Mon, May 6, 2019 at 5:43 PM Andrew Melo <andrew.melo@gmail.com> wrote:

> Hi,
>
> On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy
> <pmccarthy@dstillery.com.invalid> wrote:
> >
> > Thanks Gourav.
> >
> > Incidentally, since the regular UDF is row-wise, we could optimize that
> a bit by taking the convert() closure and simply making that the UDF.
> >
> > Since there's that MGRS object that we have to create too, we could
> probably optimize it further by applying the UDF via rdd.mapPartitions,
> which would allow the UDF to instantiate objects once per-partition instead
> of per-row and then iterate element-wise through the rows of the partition.
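The per-partition idea described above can be sketched as follows. This is a minimal illustration and not code from the thread: `StubConverter` and `convert_partition` are hypothetical names, and the stub stands in for `mgrs.MGRS()` so the snippet runs without Spark or the mgrs package installed.

```python
class StubConverter:
    """Hypothetical stand-in for an object that is costly to build, e.g. mgrs.MGRS()."""
    instances = 0  # counts constructions so the once-per-partition behaviour is visible

    def __init__(self):
        StubConverter.instances += 1

    def convert(self, lat, lon):
        # Placeholder output; a real converter would return an MGRS string.
        return "{:.2f}|{:.2f}".format(lat, lon)


def convert_partition(rows):
    """Build the converter once, then walk the rows of the partition lazily."""
    converter = StubConverter()  # created once per partition, not once per row
    for ll_str in rows:
        lat, lon = ll_str.split("_")
        yield converter.convert(float(lat), float(lon))


# Local check: a plain iterator stands in for one partition of "lat_lon" strings.
partition = iter(["40.74_-73.98", "51.50_-0.12", "35.68_139.69"])
results = list(convert_partition(partition))
```

On a cluster, a function of this shape would be passed to `rdd.mapPartitions(convert_partition)`, assuming the RDD holds the `lat_lon` strings directly.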
> >
> > All that said, having done the above on prior projects I find the pandas
> abstractions to be very elegant and friendly to the end-user so I haven't
> looked back :)
> >
> > (The common memory model via Arrow is a nice boost too!)
>
> And some tentative SPIPs that want to use columnar representations
> internally in Spark should also add some good performance in the
> future.
>
> Cheers
> Andrew
>
> >
> > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <
> gourav.sengupta@gmail.com> wrote:
> >>
> >> The proof is in the pudding
> >>
> >> :)
> >>
> >>
> >>
> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <
> gourav.sengupta@gmail.com> wrote:
> >>>
> >>> Hi Patrick,
> >>>
> >>> super duper, thanks a ton for sharing the code. Can you please confirm
> that this runs faster than the regular UDFs?
> >>>
> >>> Interestingly, I am also running the same transformations using another
> geospatial library in Python, where I am passing two fields and getting back
> an array.
> >>>
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <
> pmccarthy@dstillery.com> wrote:
> >>>>
> >>>> Human time is considerably more expensive than computer time, so in
> that regard, yes :)
> >>>>
> >>>> This took me one minute to write and ran fast enough for my needs. If
> you're willing to provide a comparable scala implementation I'd be happy to
> compare them.
> >>>>
> >>>> from pyspark.sql import functions as F, types as T
> >>>>
> >>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
> >>>> def generate_mgrs_series(lat_lon_str, level):
> >>>>     import mgrs  # imported inside the function, so the import runs on the executors
> >>>>     m = mgrs.MGRS()
> >>>>
> >>>>     # Map the level value to an MGRS precision (2 = 1 km squares, 3 = 100 m squares).
> >>>>     precision_level = 0
> >>>>     levelval = level.iloc[0]
> >>>>     if levelval == 1000:
> >>>>         precision_level = 2
> >>>>     if levelval == 100:
> >>>>         precision_level = 3
> >>>>
> >>>>     def convert(ll_str):
> >>>>         # split() yields strings, so cast to float before converting.
> >>>>         lat, lon = ll_str.split('_')
> >>>>         return m.toMGRS(float(lat), float(lon),
> >>>>                         MGRSPrecision=precision_level)
> >>>>
> >>>>     return lat_lon_str.apply(convert)
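The two `if` branches in the quoted UDF map a level of 1000 to precision 2 and a level of 100 to precision 3, which matches the MGRS convention where each extra precision digit shrinks the grid square by a factor of ten. A lookup table is one way to make that mapping explicit. The sketch below assumes the level is a grid size in metres; `PRECISION_BY_METRES` and `precision_for` are hypothetical names, not part of the mgrs package.

```python
# MGRS precision digit for each grid square size in metres (0 = 100 km ... 5 = 1 m).
PRECISION_BY_METRES = {100000: 0, 10000: 1, 1000: 2, 100: 3, 10: 4, 1: 5}


def precision_for(level_metres):
    """Return the MGRS precision for a grid size, or raise on an unknown size."""
    try:
        return PRECISION_BY_METRES[int(level_metres)]
    except KeyError:
        raise ValueError("unsupported grid size: {!r}".format(level_metres))


km_precision = precision_for(1000)
hectometre_precision = precision_for(100)
```

Unlike the quoted code, which silently falls back to precision 0 for any other level, this sketch raises on an unrecognised value; which behaviour is preferable depends on the data.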
> >>>>
> >>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <
> gourav.sengupta@gmail.com> wrote:
> >>>>>
> >>>>> And you found the Pandas UDF more performant? Can you share your
> code and prove it?
> >>>>>
> >>>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <
> pmccarthy@dstillery.com> wrote:
> >>>>>>
> >>>>>> I disagree that it's hype. Perhaps not 1:1 with pure scala
> performance-wise, but for python-based data scientists or others with a lot
> of python expertise it allows one to do things that would otherwise be
> infeasible at scale.
> >>>>>>
> >>>>>> For instance, I recently had to convert latitude / longitude pairs
> to MGRS strings (
> https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing a
> pandas UDF (and putting the mgrs python package into a conda environment)
> was _significantly_ easier than any alternative I found.
> >>>>>>
> >>>>>> @Rishi - depending on how your network is constructed, some lag could
> come from just uploading the conda environment. If you load it from hdfs
> with --archives does it improve?
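As a rough illustration of the `--archives` suggestion, the sketch below assembles a `spark-submit` command line for YARN that ships a zipped conda environment and points the Python workers at it. The HDFS path, the `env` alias, the script name and the helper `build_submit_command` are all hypothetical; only the flags and configuration keys are standard Spark-on-YARN options.

```python
import shlex


def build_submit_command(archive_uri, alias, script):
    """Assemble spark-submit arguments that ship a zipped conda env via --archives."""
    # YARN unpacks the archive into a directory named after the alias following '#'.
    python_path = "./{}/bin/python".format(alias)
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--archives", "{}#{}".format(archive_uri, alias),
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=" + python_path,
        "--conf", "spark.executorEnv.PYSPARK_PYTHON=" + python_path,
        script,
    ]


# Hypothetical archive location and job script, for illustration only.
cmd = build_submit_command("hdfs:///user/example/conda_env.zip", "env", "job.py")
print(" ".join(shlex.quote(part) for part in cmd))
```

Because the archive already sits on HDFS, it does not have to be uploaded from the submitting machine on every run, which is the lag the message above is asking about.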
> >>>>>>
> >>>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <
> gourav.sengupta@gmail.com> wrote:
> >>>>>>>
> >>>>>>> hi,
> >>>>>>>
> >>>>>>> Pandas UDF is a bit of hype. One of their blogs shows the use
> case of adding 1 to a field using Pandas UDF which is pretty much
> pointless. So you go beyond the blog and realise that your actual use case
> is more than adding one :) and the reality hits you
> >>>>>>>
> >>>>>>> Pandas UDF in certain scenarios is actually slow, try using apply
> for a custom or pandas function. In fact in certain scenarios I have found
> general UDFs work much faster and use much less memory. Therefore test out
> your use case (with at least 30 million records) before trying to use the
> Pandas UDF option.
> >>>>>>>
> >>>>>>> And when you start using GroupMap then you realise after reading
> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
> that "Oh!! now I can run into random OOM errors and the maxrecords option
> does not help at all"
> >>>>>>>
> >>>>>>> Excerpt from the above link:
> >>>>>>> Note that all data for a group will be loaded into memory before
> the function is applied. This can lead to out of memory exceptions,
> especially if the group sizes are skewed. The configuration for
> maxRecordsPerBatch is not applied on groups and it is up to the user to
> ensure that the grouped data will fit into the available memory.
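Since maxRecordsPerBatch does not cap group sizes, one precaution is to look at how many rows each group holds before applying a grouped map function. The sketch below does this on a plain list of keys; `oversized_groups` and the threshold are hypothetical, and in Spark the counts would more likely come from something like `df.groupBy(key).count()`.

```python
from collections import Counter


def oversized_groups(keys, max_rows):
    """Return {group_key: row_count} for every group larger than max_rows."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > max_rows}


# Toy data: group "a" is skewed relative to the others.
keys = ["a"] * 5 + ["b"] * 2 + ["c"]
big = oversized_groups(keys, max_rows=3)
```

Any group reported here would be loaded into memory whole, so it is a candidate for splitting under a finer-grained key or handling separately.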
> >>>>>>>
> >>>>>>> Let me know about your use case if possible
> >>>>>>>
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gourav
> >>>>>>>
> >>>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <
> rishishah.star@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Thanks Patrick! I tried to package it according to these
> instructions, and it got distributed on the cluster. However, the same Spark
> program that takes 5 mins without pandas UDF has started to take 25 mins...
> >>>>>>>>
> >>>>>>>> Have you experienced anything like this? Also is Pyarrow 0.12
> supported with Spark 2.3 (according to documentation, it should be fine)?
> >>>>>>>>
> >>>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <
> pmccarthy@dstillery.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Rishi,
> >>>>>>>>>
> >>>>>>>>> I've had success using the approach outlined here:
> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
> >>>>>>>>>
> >>>>>>>>> Does this work for you?
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <
> rishishah.star@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> modified the subject & would like to clarify that I am looking
> to create an anaconda parcel with pyarrow and other libraries, so that I
> can distribute it on the cloudera cluster..
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <
> rishishah.star@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi All,
> >>>>>>>>>>>
> >>>>>>>>>>> I have been trying to figure out a way to build anaconda
> parcel with pyarrow included for my cloudera managed server for
> distribution but this doesn't seem to work right. Could someone please help?
> >>>>>>>>>>>
> >>>>>>>>>>> I have tried to install anaconda on one of the management
> nodes on cloudera cluster... tarred the directory, but this directory
> doesn't include all the packages to form a proper parcel for distribution.
> >>>>>>>>>>>
> >>>>>>>>>>> Any help is much appreciated!
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Regards,
> >>>>>>>>>>>
> >>>>>>>>>>> Rishi Shah
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Regards,
> >>>>>>>>>>
> >>>>>>>>>> Rishi Shah
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>>
> >>>>>>>>> Patrick McCarthy
> >>>>>>>>>
> >>>>>>>>> Senior Data Scientist, Machine Learning Engineering
> >>>>>>>>>
> >>>>>>>>> Dstillery
> >>>>>>>>>
> >>>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Regards,
> >>>>>>>>
> >>>>>>>> Rishi Shah
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>>
> >>>>>> Patrick McCarthy
> >>>>>>
> >>>>>> Senior Data Scientist, Machine Learning Engineering
> >>>>>>
> >>>>>> Dstillery
> >>>>>>
> >>>>>> 470 Park Ave South, 17th Floor, NYC 10016
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> Patrick McCarthy
> >>>>
> >>>> Senior Data Scientist, Machine Learning Engineering
> >>>>
> >>>> Dstillery
> >>>>
> >>>> 470 Park Ave South, 17th Floor, NYC 10016
> >
> >
> >
> > --
> >
> > Patrick McCarthy
> >
> > Senior Data Scientist, Machine Learning Engineering
> >
> > Dstillery
> >
> > 470 Park Ave South, 17th Floor, NYC 10016
>
