spark-dev mailing list archives

From slcclimber <anant.a...@gmail.com>
Subject Re: [MLlib] Contributing Algorithm for Outlier Detection
Date Tue, 11 Nov 2014 18:16:04 GMT
Mayur,
LibSVM format sounds good to me. I could work on writing the tests, if that
would help you.
Anant
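
For readers unfamiliar with the LibSVM text format discussed in this thread: each line holds a label followed by sparse `index:value` pairs, with 1-based indices and unlisted features treated as zero. The following is a minimal plain-Python sketch of turning one such line into a dense vector of doubles. It is not Spark code and not the code from the linked repository; the function name `parse_libsvm_line` and the `num_features` parameter are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def parse_libsvm_line(line: str, num_features: int) -> Tuple[float, List[float]]:
    """Parse one LibSVM-style line: '<label> <index>:<value> ...' (1-based indices)."""
    tokens = line.split()
    label = float(tokens[0])
    # Features that are not listed on the line are implicitly zero.
    dense = [0.0] * num_features
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if not 1 <= index <= num_features:
            raise ValueError(f"feature index {index} out of range 1..{num_features}")
        # Convert the 1-based LibSVM index to a 0-based list position.
        dense[index - 1] = float(value_text)
    return label, dense


# Hypothetical sample line: label 1, feature 1 = 0.5, feature 3 = 2.
label, features = parse_libsvm_line("1 1:0.5 3:2", num_features=4)
print(label, features)
```

Every value ends up as a floating-point number, which matches the 'Double'-only restriction raised in the quoted message below; categorical attributes would presumably need a numeric encoding before being written in this format.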
On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" <
ml-node+s1001551n9286h79@n3.nabble.com> wrote:

>  Hi Mayur,
>
> Vector data types are implemented using the Breeze library; they can be found at
>
> .../org/apache/spark/mllib/linalg
>
>
>  Anant,
>
> One restriction I found is that a vector can only hold 'Double' values, so
> it actually restricts the user.
>
> What are your thoughts on the LibSVM format?
>
> Thanks for the comments. I was just trying to get away from those
> increment/decrement functions; they look ugly. Points are noted. I'll try
> to fix them soon. Tests are also required for the code.
>
>
>  Regards,
>
> Ashutosh
>
>
>  ------------------------------
> *From:* Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]>
> *Sent:* Saturday, November 8, 2014 12:52 PM
> *To:* Ashutosh Trivedi (MT2013030)
> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>
>  >
> > We should take a vector instead giving the user flexibility to decide
> > data source/ type
>
> What do you mean by vector datatype exactly?
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote:
>
> > Ashutosh,
> > I still see a few issues.
> > 1. On line 112 you are counting using a counter. Since this will happen in
> > an RDD, the counter will cause issues. Also, using a filter function with
> > a side effect is not good functional style. You could use randomSplit
> > instead. This does the same thing without the side effect.
> > 2. Similar shared usage of j in line 102 is going to be an issue as well.
> > Also, the hash seed does not need to be sequential; it could be randomly
> > generated or hashed on the values.
> > 3. The compute function and trim scores still run on a comma-separated
> > RDD. We should take a vector instead, giving the user flexibility to
> > decide the data source/type. What if we want data from Hive tables or
> > Parquet, JSON, or Avro formats? This is a very restrictive format. With
> > vectors the user has the choice of taking in whatever data format and
> > converting it to vectors, instead of reading JSON files, creating a CSV
> > file, and then working on that.
> > 4. Similar use of counters in 54 and 65 is an issue.
> > Basically, the shared-state counters are a huge issue that does not scale,
> > since the processing of RDDs is distributed and the value j lives on the
> > master.
> >
> > Anant
> >
> >
> >
> > On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List] <[hidden email]> wrote:
> >
> > >  Anant,
> > >
> > > I got rid of those increment/decrement functions and now the code is
> > > much cleaner. Please check. All your comments have been addressed.
> > >
> > >
> > >
> > > https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
> > >
> > >
> > >  _Ashu
> > >
> > >
> > >  ------------------------------
> > > *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
> > > *Sent:* Friday, October 31, 2014 10:09 AM
> > > *To:* Ashutosh Trivedi (MT2013030)
> > > *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
> > >
> > >
> > > You should create a JIRA ticket to go with it as well.
> > > Thanks
> > > On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:
> > >
> > >> Okay. I'll try it and post it soon with a test case. After that I think
> > >> we can go ahead with the PR.
> > >>  ------------------------------
> > >> *From:* slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
> > >> *Sent:* Friday, October 31, 2014 10:03 AM
> > >> *To:* Ashutosh Trivedi (MT2013030)
> > >> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
> > >>
> > >>
> > >> Ashutosh,
> > >> A vector would be a good idea; vectors are used very frequently.
> > >> Test data is usually stored in the spark/data/mllib folder.
> > >> On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]" <[hidden email]> wrote:
> > >>
> > >>> Hi Anant,
> > >>> sorry for my late reply. Thank you for taking the time to review it.
> > >>>
> > >>> I have a few comments on the first issue.
> > >>>
> > >>> You are correct on the string (CSV) part. But we cannot take input of
> > >>> the type you mentioned. We calculate frequency in our function;
> > >>> otherwise the user has to do all this computation. I realize that
> > >>> taking an RDD[Vector] would be general enough for all. What do you say?
> > >>>
> > >>> I agree on all the remaining issues. I will correct them soon and post
> > >>> it. I have a question about test cases. Where should I put data while
> > >>> providing test scripts? Or should I generate synthetic data for testing
> > >>> within the scripts? How does this work?
> > >>>
> > >>> Regards,
> > >>> Ashutosh
> > >>>
> > >>> ------------------------------
> > >>>  If you reply to this email, your message will be added to the
> > >>> discussion below:
> > >>>
> > >>>
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9034.html
> > >>>  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> > >>> Detection, click here.
> > >>> NAML
> > >>> <
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> > >
> > >>>
> >
> > --
> > View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9095.html
> > Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> >
>
>
>




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9287.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.