spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel <daniel.ti...@gmail.com>
Subject Re: UseCase_Design_Help
Date Wed, 05 Oct 2016 04:42:46 GMT
First of all, if you want to read a txt file in Spark, you should use
sc.textFile, because you are using "Source.fromFile", so you are reading it
with Scala standard api, so it will be read sequentially.

Furthermore you are going to need create a schema if you want to use
dataframes.

El 5/10/2016 1:53, "Ajay Chander" <itschevva@gmail.com> escribió:

> Right now, I am doing it like below,
>
> import scala.io.Source
>
> val animalsFile = "/home/ajay/dataset/animal_types.txt"
> val animalTypes = Source.fromFile(animalsFile).getLines.toArray
>
> for ( anmtyp <- animalTypes ) {
>       val distinctAnmTypCount = sqlContext.sql("select
> count(distinct("+anmtyp+")) from TEST1 ")
>       println("Calculating Metrics for Animal Type: "+anmtyp)
>       if( distinctAnmTypCount.head().getAs[Long](0) <= 10 ){
>         println("Animal Type: "+anmtyp+" has <= 10 distinct values")
>       } else {
>         println("Animal Type: "+anmtyp+" has > 10 distinct values")
>       }
>     }
>
> But the problem is it is running sequentially.
>
> Any inputs are appreciated. Thank you.
>
>
> Regards,
> Ajay
>
>
> On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <itschevva@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> I have a use-case where I have two Dataframes like below,
>>
>> 1) First Dataframe(DF1) contains,
>>
>> *    ANIMALS    *
>> Mammals
>> Birds
>> Fish
>> Reptiles
>> Amphibians
>>
>> 2) Second Dataframe(DF2) contains,
>>
>> *    ID, Mammals, Birds, Fish, Reptiles, Amphibians    *
>> 1,      Dogs,      Eagle,      Goldfish,      NULL,      Frog
>> 2,      Cats,      Peacock,      Guppy,     Turtle,      Salamander
>> 3,      Dolphins,      Eagle,      Zander,      NULL,      Frog
>> 4,      Whales,      Parrot,      Guppy,      Snake,      Frog
>> 5,      Horses,      Owl,      Guppy,      Snake,      Frog
>> 6,      Dolphins,      Kingfisher,      Zander,      Turtle,      Frog
>> 7,      Dogs,      Sparrow,      Goldfish,      NULL,      Salamander
>>
>> Now I want to take each row from DF1 and find out its distinct count in
>> DF2. Example, pick Mammals from DF1 then find out count(distinct(Mammals))
>> from DF2 i.e. 5
>>
>> DF1 has 70 distinct rows/Animal types
>> DF2 has some million rows
>>
>> Whats the best way to achieve this efficiently using parallelism ?
>>
>> Any inputs are helpful. Thank you.
>>
>> Regards,
>> Ajay
>>
>>
>

Mime
View raw message