Right now I am doing it like this:

import scala.io.Source

// one animal type (i.e. one DF2 column name) per line
val animalsFile = "/home/ajay/dataset/animal_types.txt"
val animalTypes = Source.fromFile(animalsFile).getLines.toArray

// one count(distinct(...)) query per column, executed one after another
for (anmtyp <- animalTypes) {
  val distinctAnmTypCount = sqlContext.sql("select count(distinct(" + anmtyp + ")) from TEST1")
  println("Calculating Metrics for Animal Type: " + anmtyp)
  if (distinctAnmTypCount.head().getAs[Long](0) <= 10) {
    println("Animal Type: " + anmtyp + " has <= 10 distinct values")
  } else {
    println("Animal Type: " + anmtyp + " has > 10 distinct values")
  }
}

But the problem is that it runs sequentially, one query per animal type.
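
One alternative I have been considering, but have not tested, is to compute all the distinct counts in a single aggregation, so Spark scans TEST1 once instead of running 70 separate queries. A rough sketch, using countDistinct from org.apache.spark.sql.functions and assuming the same animalTypes array and TEST1 table as above:

import org.apache.spark.sql.functions.countDistinct

// build one countDistinct expression per animal-type column
val countExprs = animalTypes.map(c => countDistinct(c).alias(c))

// single aggregation over TEST1: all distinct counts come back in one Row
val counts = sqlContext.table("TEST1")
  .agg(countExprs.head, countExprs.tail: _*)
  .head()

for (anmtyp <- animalTypes) {
  if (counts.getAs[Long](anmtyp) <= 10) {
    println("Animal Type: " + anmtyp + " has <= 10 distinct values")
  } else {
    println("Animal Type: " + anmtyp + " has > 10 distinct values")
  }
}

Is that a reasonable direction, or is there a better way?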

Any inputs are appreciated. Thank you.


Regards,
Ajay


On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <itschevva@gmail.com> wrote:
Hi Everyone,

I have a use case involving two DataFrames, like below.

1) First DataFrame (DF1) contains:

    ANIMALS    
Mammals
Birds
Fish
Reptiles
Amphibians

2) Second DataFrame (DF2) contains:

ID, Mammals,  Birds,      Fish,     Reptiles, Amphibians
1,  Dogs,     Eagle,      Goldfish, NULL,     Frog
2,  Cats,     Peacock,    Guppy,    Turtle,   Salamander
3,  Dolphins, Eagle,      Zander,   NULL,     Frog
4,  Whales,   Parrot,     Guppy,    Snake,    Frog
5,  Horses,   Owl,        Guppy,    Snake,    Frog
6,  Dolphins, Kingfisher, Zander,   Turtle,   Frog
7,  Dogs,     Sparrow,    Goldfish, NULL,     Salamander

Now I want to take each value from DF1 and find its distinct count in the corresponding column of DF2. For example, pick Mammals from DF1, then compute count(distinct(Mammals)) on DF2, which gives 5.
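
To make that concrete, for a single animal type I mean something like the following (assuming DF2 is available as a DataFrame named df2; the name is just for illustration):

import org.apache.spark.sql.functions.countDistinct

// distinct non-NULL values in the Mammals column of DF2
val mammalsCount = df2.agg(countDistinct("Mammals")).head().getLong(0)
// gives 5 for the sample data above (Dogs, Cats, Dolphins, Whales, Horses)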

DF1 has 70 distinct rows/animal types.
DF2 has a few million rows.

What's the best way to achieve this efficiently using parallelism?

Any inputs are helpful. Thank you.

Regards,
Ajay