spark-user mailing list archives

From ponkin <alexey.pon...@ya.ru>
Subject [Spark2] huge BloomFilters
Date Wed, 02 Nov 2016 10:27:26 GMT
Hi,
I need to build a huge BloomFilter with 150 million or even more insertions:
import org.apache.spark.util.sketch.BloomFilter
// .avro(...) presumably comes from the spark-avro package (import com.databricks.spark.avro._)
val bf = spark.read.avro("/hdfs/path").filter("some == 1").stat.bloomFilter("id", 150000000, 0.01)

If I use Kryo for serialization:
implicit val bfEncoder = org.apache.spark.sql.Encoders.kryo[BloomFilter]
and then try to save this filter to HDFS, the size of the bloom filter is more than 1 GB.
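
For comparison, here is a rough sketch of the theoretical size of a filter with these parameters, using the textbook formula m = -n * ln(p) / (ln 2)^2 for the number of bits. The object and method names below are made up for illustration and are not part of the Spark API; the actual in-memory and serialized sizes depend on the implementation and on the serializer.

```scala
// Hypothetical helper (not a Spark API): textbook Bloom filter sizing.
object BloomSizing {
  /** Theoretical optimal number of bits for n insertions at false-positive rate p. */
  def optimalNumOfBits(n: Long, p: Double): Long =
    math.ceil(-n * math.log(p) / (math.log(2) * math.log(2))).toLong

  /** Theoretical optimal number of hash functions for n insertions and m bits. */
  def optimalNumOfHashFunctions(n: Long, m: Long): Int =
    math.max(1, math.round(m.toDouble / n * math.log(2)).toInt)

  def main(args: Array[String]): Unit = {
    val n = 150000000L // expected insertions, as in the message
    val p = 0.01       // target false-positive probability, as in the message
    val bits = optimalNumOfBits(n, p)
    val k = optimalNumOfHashFunctions(n, bits)
    println(f"bits = $bits, approx. size = ${bits / 8 / 1e6}%.1f MB, hash functions = $k")
  }
}
```

Under these assumptions the bit array alone comes to roughly 1.44 billion bits, or about 180 MB, so a saved size above 1 GB would suggest overhead from the serialization path rather than from the filter itself.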

Is there any way to compress a BloomFilter?
Does anybody have experience with such huge bloom filters?
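
One thing that could be tried is a minimal sketch like the following, which bypasses the Kryo encoder and writes the filter through its own `writeTo` method wrapped in a gzip stream. It assumes the spark-sketch classes are on the classpath; the object and helper names are hypothetical, and the small filter in `main` is only there to exercise the round trip. A well-filled Bloom filter has close to half of its bits set, so gzip may save very little on a filter that is near capacity.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical helpers (not part of Spark): gzip round trip for a BloomFilter.
object BloomFilterGzip {
  /** Serializes the filter with its own binary format, wrapped in gzip. */
  def toGzipBytes(bf: BloomFilter): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    try bf.writeTo(gzip) finally gzip.close() // close() finishes the gzip stream
    buffer.toByteArray
  }

  /** Restores a filter written by toGzipBytes. */
  def fromGzipBytes(bytes: Array[Byte]): BloomFilter = {
    val gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))
    try BloomFilter.readFrom(gzip) finally gzip.close()
  }

  def main(args: Array[String]): Unit = {
    // Small, sparsely filled filter purely for demonstration.
    val bf = BloomFilter.create(1000000L, 0.01)
    (0L until 1000L).foreach(i => bf.putLong(i))

    val raw = new ByteArrayOutputStream()
    bf.writeTo(raw)
    val packed = toGzipBytes(bf)
    println(s"raw bytes = ${raw.size}, gzip bytes = ${packed.length}")

    val restored = fromGzipBytes(packed)
    assert((0L until 1000L).forall(i => restored.mightContainLong(i)))
  }
}
```

For HDFS, the same idea would apply with the output stream obtained from the Hadoop FileSystem API in place of the in-memory buffer.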

In general, I need to check some condition in a Spark Streaming application,
and I was thinking of using BloomFilters for that.
