spark-dev mailing list archives

From slcclimber <anant.a...@gmail.com>
Subject Re: [MLlib] Contributing Algorithm for Outlier Detection
Date Wed, 05 Nov 2014 01:15:16 GMT
Ashutosh,
I still see a few issues.
1. On line 112 you are counting with a counter. Since this will run inside
an RDD operation, the counter will cause issues. It is also not good
functional style to use a filter function with a side effect.
You could use randomSplit instead. It does the same thing without the
side effect.
2. The similar shared use of j on line 102 is going to be an issue as well.
Also, the hash seed does not need to be sequential; it could be randomly
generated or hashed on the values.
3. The compute function and trim scores still run on a comma-separated
RDD. We should take a vector instead, giving the user the flexibility to
decide the data source and type. What if we want data from Hive tables, or
from Parquet, JSON, or Avro formats? This is a very restrictive format. With
vectors the user can take in whatever data format they have and convert it
to vectors, instead of reading JSON files, creating a CSV file, and then
working on that.
4. The similar use of counters on lines 54 and 65 is an issue.
Basically, shared-state counters are a huge issue that does not scale,
since the processing of RDDs is distributed and the value j lives on the
master.
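
To illustrate the points above, here is a minimal sketch, not the code from
the repository. It assumes the input is an RDD[Vector] and that what is
needed is a count for each (column, value) pair; the names CounterFreeSketch,
valueFrequencies, and sample are made up for this example. The column index
comes from zipWithIndex on each row's own array, and the random subset comes
from randomSplit with an explicit seed, so no closure touches a shared
variable.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.x
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical names throughout; a sketch of the pattern, not the AVF model.
object CounterFreeSketch {

  // Count how often each (column, value) pair occurs.
  // The column index is produced locally per row by zipWithIndex,
  // so there is no counter such as j captured from the driver.
  def valueFrequencies(data: RDD[Vector]): RDD[((Int, Double), Long)] =
    data
      .flatMap(v => v.toArray.zipWithIndex.map { case (x, col) => ((col, x), 1L) })
      .reduceByKey(_ + _)

  // Take a random subset without a side-effecting filter.
  // The seed is an ordinary Long; it does not have to be sequential.
  def sample(data: RDD[Vector], fraction: Double, seed: Long): RDD[Vector] =
    data.randomSplit(Array(fraction, 1.0 - fraction), seed)(0)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("counter-free-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // The caller decides where the vectors come from (Hive, Parquet, JSON, ...).
      val rows = sc.parallelize(Seq(
        Vectors.dense(1.0, 2.0),
        Vectors.dense(1.0, 3.0),
        Vectors.dense(4.0, 2.0)), 2)

      valueFrequencies(rows).collect().foreach(println)
      println("sampled rows: " + sample(rows, 0.5, seed = 42L).count())
    } finally {
      sc.stop()
    }
  }
}
```

Because the functions only take and return RDDs, they can be unit tested with
a small parallelized collection like the one in main.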

Anant



On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
<ml-node+s1001551n9083h86@n3.nabble.com> wrote:

>  Anant,
>
> I got rid of those increment/decrement functions and now the code is much
> cleaner. Please check. All your comments have been addressed.
>
>
> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>
>
>  _Ashu
>
>  ------------------------------
> *From:* slcclimber [via Apache Spark Developers List] <[hidden email]>
> *Sent:* Friday, October 31, 2014 10:09 AM
> *To:* Ashutosh Trivedi (MT2013030)
> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>
>
> You should create a jira ticket to go with it as well.
> Thanks
> On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
> <[hidden email]> wrote:
>
>> Okay. I'll try it and post it soon with a test case. After that I think
>> we can go ahead with the PR.
>>  ------------------------------
>> *From:* slcclimber [via Apache Spark Developers List] <[hidden email]>
>> *Sent:* Friday, October 31, 2014 10:03 AM
>> *To:* Ashutosh Trivedi (MT2013030)
>> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
>>
>>
>> Ashutosh,
>> A vector would be a good idea; vectors are used very frequently.
>> Test data is usually stored in the spark/data/mllib folder.
>> On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
>> <[hidden email]> wrote:
>>
>>> Hi Anant,
>>> Sorry for my late reply. Thank you for taking the time to review it.
>>>
>>> I have a few comments on the first issue.
>>>
>>> You are correct on the string (CSV) part. But we cannot take input of
>>> the type you mentioned. We calculate frequency in our function; otherwise
>>> the user has to do all this computation. I realize that taking an
>>> RDD[Vector] would be general enough for all. What do you say?
>>>
>>> I agree on all the remaining issues. I will correct them soon and post it.
>>> I have a question about test cases. Where should I put the data when
>>> providing test scripts? Or should I generate synthetic data for testing
>>> within the scripts? How does this work?
>>>
>>> Regards,
>>> Ashutosh
>>>
>>> ------------------------------
>>>  If you reply to this email, your message will be added to the
>>> discussion below:
>>>
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9034.html
>>>  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
>>> Detection, click here.
>>> NAML
>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>>
>>
>




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9095.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.