spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ajay Chander <itsche...@gmail.com>
Subject Re: UseCase_Design_Help
Date Wed, 05 Oct 2016 14:46:18 GMT
+ user@spark.apache.org

Hi Daniel, I will try this one out and let you know. Thank you.

On Wed, Oct 5, 2016 at 9:50 AM, Daniel Siegmann <
dsiegmann@securityscorecard.io> wrote:

> I think it's fine to read animal types locally because there are only 70
> of them. It's just that you want to execute the Spark actions in parallel.
> The easiest way to do that is to have only a single action.
>
> Instead of grabbing the result right away, I would just add a column for
> the animal type and union the datasets for the animal types. Something like
> this (not sure if the syntax is correct):
>
> val animalCounts: DataFrame = animalTypes.map { anmtyp =>
>     sqlContext.sql("select lit("+anmtyp+") as animal_type,
> count(distinct("+anmtyp+")) from TEST1 ")
> }.reduce(_.union(_))
>
> animalCounts.foreach( /* print the output */ )
>
> On Wed, Oct 5, 2016 at 12:42 AM, Daniel <daniel.tizon@gmail.com> wrote:
>
>> First of all, if you want to read a txt file in Spark, you should use
>> sc.textFile, because you are using "Source.fromFile", so you are reading it
>> with Scala standard api, so it will be read sequentially.
>>
>> Furthermore you are going to need create a schema if you want to use
>> dataframes.
>>
>> El 5/10/2016 1:53, "Ajay Chander" <itschevva@gmail.com> escribió:
>>
>>> Right now, I am doing it like below,
>>>
>>> import scala.io.Source
>>>
>>> val animalsFile = "/home/ajay/dataset/animal_types.txt"
>>> val animalTypes = Source.fromFile(animalsFile).getLines.toArray
>>>
>>> for ( anmtyp <- animalTypes ) {
>>>       val distinctAnmTypCount = sqlContext.sql("select
>>> count(distinct("+anmtyp+")) from TEST1 ")
>>>       println("Calculating Metrics for Animal Type: "+anmtyp)
>>>       if( distinctAnmTypCount.head().getAs[Long](0) <= 10 ){
>>>         println("Animal Type: "+anmtyp+" has <= 10 distinct values")
>>>       } else {
>>>         println("Animal Type: "+anmtyp+" has > 10 distinct values")
>>>       }
>>>     }
>>>
>>> But the problem is it is running sequentially.
>>>
>>> Any inputs are appreciated. Thank you.
>>>
>>>
>>> Regards,
>>> Ajay
>>>
>>>
>>> On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <itschevva@gmail.com>
>>> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> I have a use-case where I have two Dataframes like below,
>>>>
>>>> 1) First Dataframe(DF1) contains,
>>>>
>>>> *    ANIMALS    *
>>>> Mammals
>>>> Birds
>>>> Fish
>>>> Reptiles
>>>> Amphibians
>>>>
>>>> 2) Second Dataframe(DF2) contains,
>>>>
>>>> *    ID, Mammals, Birds, Fish, Reptiles, Amphibians    *
>>>> 1,      Dogs,      Eagle,      Goldfish,      NULL,      Frog
>>>> 2,      Cats,      Peacock,      Guppy,     Turtle,      Salamander
>>>> 3,      Dolphins,      Eagle,      Zander,      NULL,      Frog
>>>> 4,      Whales,      Parrot,      Guppy,      Snake,      Frog
>>>> 5,      Horses,      Owl,      Guppy,      Snake,      Frog
>>>> 6,      Dolphins,      Kingfisher,      Zander,      Turtle,      Frog
>>>> 7,      Dogs,      Sparrow,      Goldfish,      NULL,      Salamander
>>>>
>>>> Now I want to take each row from DF1 and find out its distinct count in
>>>> DF2. Example, pick Mammals from DF1 then find out count(distinct(Mammals))
>>>> from DF2 i.e. 5
>>>>
>>>> DF1 has 70 distinct rows/Animal types
>>>> DF2 has some million rows
>>>>
>>>> Whats the best way to achieve this efficiently using parallelism ?
>>>>
>>>> Any inputs are helpful. Thank you.
>>>>
>>>> Regards,
>>>> Ajay
>>>>
>>>>
>>>
>

Mime
View raw message