spark-user mailing list archives

From YaoPau <jonrgr...@gmail.com>
Subject Building a hash table from a csv file using yarn-cluster, and giving it to each executor
Date Thu, 13 Nov 2014 15:34:37 GMT
I built my Spark Streaming app on my local machine, and an initial step in
log processing is filtering out rows with spam IPs.  I use the following
code which works locally:

    // Builds a HashSet of bad IPs read in from file
    import scala.collection.mutable.HashSet

    val badIpSource = scala.io.Source.fromFile("wrongIPlist.csv")
    val ipLines = badIpSource.getLines()

    val set = new HashSet[String]()
    val badIpSet = set ++ ipLines
    badIpSource.close()

    def isGoodIp(ip: String): Boolean = !badIpSet.contains(ip)

But when I try this using "--master yarn-cluster" I get "Exception in thread
"Thread-4" java.lang.reflect.InvocationTargetException ... Caused by:
java.io.FileNotFoundException: wrongIPlist.csv (No such file or directory)". 
The file is there (I wasn't sure which directory it was accessing so it's in
both my current client directory and my HDFS home directory), so now I'm
wondering if reading a file in parallel is just not allowed in general and
that's why I'm getting the error.
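(For what it's worth: in yarn-cluster mode the driver runs on a cluster node, so a bare relative path is resolved on that node, not on the submitting client. One common remedy, sketched below with hypothetical paths, is to ship the file with `spark-submit --files wrongIPlist.csv` and resolve its local copy via `org.apache.spark.SparkFiles.get("wrongIPlist.csv")`. The reading logic itself is the same either way; here a temp file stands in for the shipped CSV so the snippet is self-contained:)

```scala
import java.nio.file.Files

// Stand-in for the shipped CSV: write a small file, then read it back the
// same way the original code does. In a real job the path would come from
// SparkFiles.get("wrongIPlist.csv") after submitting with --files.
val tmp = Files.createTempFile("wrongIPlist", ".csv")
Files.write(tmp, "1.2.3.4\n5.6.7.8\n".getBytes("UTF-8"))

val src = scala.io.Source.fromFile(tmp.toFile)
val badIpSet: Set[String] =
  try src.getLines().map(_.trim).filter(_.nonEmpty).toSet
  finally src.close()
```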

I'd like each executor to have access to this HashSet (not a huge file,
about 3000 IPs) instead of having to do a more expensive JOIN.  Any
recommendations on a better way to handle this?  
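(A minimal sketch of the usual alternative to a join, assuming the list fits in driver memory, which ~3000 IPs easily does: build the set once on the driver and hand it to executors as a broadcast variable. The Spark-specific calls are shown in comments since they need a live SparkContext, and `ipOf` is a hypothetical line-parsing helper:)

```scala
// With a SparkContext sc, the broadcast pattern would look like (not run here):
//   val badIpBC = sc.broadcast(badIpSet)
//   dstream.filter(line => !badIpBC.value.contains(ipOf(line)))
// The predicate itself, matching the original isGoodIp:
val badIpSet: Set[String] = Set("1.2.3.4", "5.6.7.8")  // stand-in contents
def isGoodIp(ip: String): Boolean = !badIpSet.contains(ip)
```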





