spark-user mailing list archives

From Gautham Acharya <>
Subject [Beginner] Run compute on large matrices and return the result in seconds?
Date Tue, 09 Jul 2019 23:22:28 GMT
This is my first email to this mailing list, so I apologize if I made any errors.

My team is going to build an application, and I'm investigating options for distributed
compute systems. We want to run computations on large matrices.

The requirements are as follows:

1.     The matrices can be expected to be up to 50,000 columns x 3 million rows. The values
are all integers (except for the row/column headers).

2.     The application needs to select a specific row and calculate the correlation coefficient
against every other row. This means up to 3 million different calculations.

3.     A sorted list of the correlation coefficients and their corresponding row keys needs
to be returned in under 5 seconds.

4.     Users will eventually request random row/column subsets to run calculations on, so
precomputing our coefficients is not an option. This needs to be done on request.
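
To make requirements 2 and 3 concrete, here is a minimal single-machine sketch in Python with NumPy. It assumes the Pearson correlation coefficient is the one intended, which the message does not state. The function name `correlate_row_against_all` and the optional `row_keys` argument are hypothetical, introduced only for illustration. It shows the shape of the computation on a small in-memory matrix and is not a distributed solution.

```python
import numpy as np


def correlate_row_against_all(matrix, row_index, row_keys=None):
    """Correlate one row against every other row (Pearson, by assumption).

    Returns a list of (row_key, coefficient) pairs sorted by descending
    coefficient. Rows with zero variance yield NaN and sort last.
    """
    data = np.asarray(matrix, dtype=np.float64)
    n_rows = data.shape[0]
    if row_keys is None:
        # Hypothetical default: use positional indices as row keys.
        row_keys = list(range(n_rows))

    # Center each row on its own mean, then take each row's Euclidean norm.
    centered = data - data.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered ** 2).sum(axis=1))

    # One matrix-vector product gives every numerator at once.
    target = centered[row_index]
    with np.errstate(divide="ignore", invalid="ignore"):
        coeffs = (centered @ target) / (norms * norms[row_index])

    # Drop the selected row itself, then sort the rest in descending order.
    others = np.delete(np.arange(n_rows), row_index)
    order = others[np.argsort(-coeffs[others])]
    return [(row_keys[i], float(coeffs[i])) for i in order]


if __name__ == "__main__":
    sample = [
        [1, 2, 3, 4],
        [2, 4, 6, 8],
        [4, 3, 2, 1],
        [1, 3, 2, 4],
    ]
    for key, coeff in correlate_row_against_all(sample, 0):
        print(key, round(coeff, 3))
```

Because the whole request reduces to one matrix-vector product plus a sort, a column subset (requirement 4) could be handled by slicing the columns before centering, at the cost of recomputing the means and norms per request.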

I've been looking at many compute solutions, but I'd consider Spark first due to its widespread
use and community. I currently have my data loaded into Apache HBase for a different scenario
(random access of rows/columns). I've naively tried loading a dataframe from the CSV using
a Spark instance hosted on AWS EMR, but getting the results for even a single correlation
takes over 20 seconds.
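
For a sense of scale, the following back-of-envelope calculation in Python works out what the stated dimensions and the 5-second limit imply. The 4 bytes per value is an assumption (32-bit integers); the message only says the values are integers. The figure covers a full-matrix request with no column subsetting.

```python
ROWS = 3_000_000       # from requirement 1
COLUMNS = 50_000       # from requirement 1
BYTES_PER_VALUE = 4    # assumption: values stored as 32-bit integers
DEADLINE_SECONDS = 5   # from requirement 3

total_values = ROWS * COLUMNS
total_bytes = total_values * BYTES_PER_VALUE
# Aggregate read rate needed to touch every value once within the deadline.
required_bytes_per_second = total_bytes / DEADLINE_SECONDS

print(f"values in matrix:    {total_values:.2e}")
print(f"raw size:            {total_bytes / 1e9:.0f} GB")
print(f"required throughput: {required_bytes_per_second / 1e9:.0f} GB/s")
```

Under that assumption the full matrix is about 600 GB, so a request over all columns would need roughly 120 GB/s of aggregate scan throughput across the cluster, before any parsing or sorting cost.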

Thank you!

