spark-user mailing list archives

From Daniel Chalef <daniel.cha...@sparkpost.com.INVALID>
Subject [Spark Core] Vectorizing very high-dimensional data sourced in long format
Date Fri, 30 Oct 2020 02:18:50 GMT
Hello,

I have a very large long-format dataframe (several billion rows) that I'd
like to pivot and vectorize (using VectorAssembler), with the aim of then
reducing dimensionality using something akin to TF-IDF. Once pivoted, the
dataframe will have ~130 million columns.

The source, long-format schema looks as follows:

root
 |-- entity_id: long (nullable = false)
 |-- attribute_id: long (nullable = false)
 |-- event_count: integer (nullable = true)
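
For concreteness, a toy frame with this shape could be built like the
following (the values and the local session variable are just for
illustration; the real table has several billion rows and ~130 million
distinct attribute_ids):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy stand-in for the real long_frame, just to show the shape
long_frame = spark.createDataFrame(
    [(1, 10, 3), (1, 42, 1), (2, 10, 7)],
    schema="entity_id long, attribute_id long, event_count int",
)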

Pivoting as per the following fails, exhausting executor and driver memory.
I am unsure whether increasing memory limits would help, as my sense is that
pivoting and then using VectorAssembler isn't the right approach to this
problem.

from pyspark.sql import functions as F

# attempted pivot: one column per distinct attribute_id
wide_frame = (
    long_frame.groupBy("entity_id")
    .pivot("attribute_id")
    .agg(F.first("event_count"))
)

Are there other Spark patterns that I should attempt in order to achieve my
end goal of a vector of attributes for every entity?
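
One alternative I have been sketching is to skip the pivot entirely and build
sparse ML vectors directly from the long format, roughly along these lines
(num_attributes, and treating attribute_id as a 0-based vector index with at
most one row per entity/attribute pair, are assumptions on my part):

from pyspark.sql import functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

num_attributes = 130_000_000  # assumed upper bound on attribute_id + 1

def to_sparse(pairs):
    # pairs: list of (attribute_id, event_count) structs for one entity;
    # assumes attribute_id is a valid 0-based index < num_attributes and
    # appears at most once per entity
    pairs = sorted(
        (p for p in pairs if p["event_count"] is not None),
        key=lambda p: p["attribute_id"],
    )
    indices = [int(p["attribute_id"]) for p in pairs]
    values = [float(p["event_count"]) for p in pairs]
    return SparseVector(num_attributes, indices, values)

to_sparse_udf = F.udf(to_sparse, VectorUDT())

entity_vectors = (
    long_frame.groupBy("entity_id")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("pairs"))
    .withColumn("features", to_sparse_udf("pairs"))
    .drop("pairs")
)

I don't know whether collecting all attribute/count pairs per entity is
itself viable at this scale, which is partly why I'm asking.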

Thanks, Daniel
