Hi Gautham,

I am a beginner Spark user too, and I may not have a complete understanding of your question, but I thought I would start a discussion anyway.

Have you looked into using Spark's built-in Correlation function? (https://spark.apache.org/docs/latest/ml-statistics.html) This might get you what you want (per-row correlation against the same matrix) without having to parallelize the computation yourself.

Also, I think how quickly you can get your results is largely a data-access question rather than a "how fast is Spark" question. As long as you can exploit data parallelism (i.e. you can partition your data), Spark will give you a speedup. But consider that on a large machine with many cores and ~100 GB of RAM (e.g. an m5.12xlarge EC2 instance), your problem could fit in main memory and the computation could run with thread-based parallelism, which might get you results relatively quickly. For a dedicated application with well-constrained memory and compute requirements, doing everything on one machine is not a bad option either: accessing an external database and distributing work over a large number of machines adds overhead that may be out of your control.

Thanks,
Steven

On Thu, Jul 11, 2019 at 9:24 AM Gautham Acharya <email@example.com> wrote:
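[As a rough sketch of the in-memory, single-machine approach mentioned above: per-row Pearson correlation of a matrix against one target vector, vectorized with NumPy. The helper name `rowwise_pearson` and the toy sizes are illustrative only, not from the thread; the real matrices discussed here are far larger and would need to fit in RAM for this to apply.]

```python
import numpy as np

def rowwise_pearson(matrix, target):
    """Pearson correlation of each row of `matrix` against `target`.

    Vectorized: centers rows and the target, then computes the
    normalized dot products in one shot. Rows with zero variance
    yield NaN (division by zero in the denominator).
    """
    m = matrix - matrix.mean(axis=1, keepdims=True)   # center each row
    t = target - target.mean()                        # center the target
    num = m @ t                                       # covariance numerators, shape (n_rows,)
    denom = np.sqrt((m ** 2).sum(axis=1) * (t ** 2).sum())
    return num / denom

# Toy example: 4 rows x 5 integer columns (the real case is millions of rows).
rng = np.random.default_rng(0)
mat = rng.integers(0, 100, size=(4, 5)).astype(float)
vec = rng.integers(0, 100, size=5).astype(float)

coeffs = rowwise_pearson(mat, vec)
order = np.argsort(coeffs)[::-1]   # row indices sorted by coefficient, descending
```

On a machine with enough RAM this is a single pass over the matrix, so it can serve as a baseline to compare against a distributed Spark job before paying the cluster's coordination overhead.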
Ping? I would really appreciate advice on this! Thank you!
This is my first email to this mailing list, so I apologize if I made any errors.
My team is going to be building an application, and I'm investigating some options for distributed compute systems. We want to perform computations on large matrices.
The requirements are as follows:
1. The matrices can be expected to be up to 50,000 columns x 3 million rows. The values are all integers (except for the row/column headers).
3. A sorted list of the correlation coefficients and their corresponding row keys needs to be returned in under 5 seconds.
4. Users will eventually request random row/column subsets to run calculations on, so precomputing our coefficients is not an option. This needs to be done on request.
I've been looking at many compute solutions, but I'd consider Spark first due to its widespread use and community. I currently have my data loaded into Apache HBase for a different scenario (random access of rows/columns). I've naively tried loading a dataframe from the CSV using a Spark instance hosted on AWS EMR, but getting the results for even a single correlation takes over 20 seconds.