Hi Gautham,
I am a beginner spark user too and I may not have a complete understanding
of your question, but I thought I would start a discussion anyway. Have you
looked into using Spark's built in Correlation function? (
https://spark.apache.org/docs/latest/mlstatistics.html) This might let you
get what you want (perrow correlation against the same matrix) without
having to deal with parallelizing the computation yourself. Also, I think
the question of how quick you can get your results is largely a data access
question vs how fast is Spark question. As long as you can exploit data
parallelism (i.e. you can partition up your data), Spark will give you a
speedup. You can imagine that if you had a large machine with many cores
and ~100 GB of RAM (e.g. a m5.12xlarge EC2 instance), you could fit your
problem in main memory and perform your computation with thread based
parallelism. This might get your result relatively quickly. For a dedicated
application with well constrained memory and compute requirements, it might
not be a bad option to do everything on one machine as well. Accessing an
external database and distributing work over a large number of computers
can add overhead that might be out of your control.
Thanks,
Steven
On Thu, Jul 11, 2019 at 9:24 AM Gautham Acharya <gauthama@alleninstitute.org>
wrote:
> Ping? I would really appreciate advice on this! Thank you!
>
>
>
> *From:* Gautham Acharya
> *Sent:* Tuesday, July 9, 2019 4:22 PM
> *To:* user@spark.apache.org
> *Subject:* [Beginner] Run compute on large matrices and return the result
> in seconds?
>
>
>
> This is my first email to this mailing list, so I apologize if I made any
> errors.
>
>
>
> My team's going to be building an application and I'm investigating some
> options for distributed compute systems. We want to be performing computes
> on large matrices.
>
>
>
> The requirements are as follows:
>
>
>
> 1. The matrices can be expected to be up to 50,000 columns x 3
> million rows. The values are all integers (except for the row/column
> headers).
>
> 2. The application needs to select a specific row, and calculate the
> correlation coefficient (
> https://pandas.pydata.org/pandasdocs/stable/reference/api/pandas.DataFrame.corr.html
)
> against every other row. This means up to 3 million different calculations.
>
> 3. A sorted list of the correlation coefficients and their
> corresponding row keys need to be returned in under 5 seconds.
>
> 4. Users will eventually request random row/column subsets to run
> calculations on, so precomputing our coefficients is not an option. This
> needs to be done on request.
>
>
>
> I've been looking at many compute solutions, but I'd consider Spark first
> due to the widespread use and community. I currently have my data loaded
> into Apache Hbase for a different scenario (random access of rows/columns).
> I’ve naively tired loading a dataframe from the CSV using a Spark instance
> hosted on AWS EMR, but getting the results for even a single correlation
> takes over 20 seconds.
>
>
>
> Thank you!
>
>
>
>
>
> gautham
>
>
>
