spark-user mailing list archives

From Gautham Acharya <gauth...@alleninstitute.org>
Subject RE: [Beginner] Run compute on large matrices and return the result in seconds?
Date Wed, 17 Jul 2019 21:11:06 GMT
Users can also request random rows in those columns. So a user can request a subset of the
matrix (N rows and N columns) which would change the value of the correlation coefficient.

From: Jerry Vinokurov [mailto:grapesmoker@gmail.com]
Sent: Wednesday, July 17, 2019 1:27 PM
To: user@spark.apache.org
Subject: Re: [Beginner] Run compute on large matrices and return the result in seconds?

CAUTION: This email originated from outside the Allen Institute. Please do not click links
or open attachments unless you've validated the sender and know the content is safe.
________________________________
Maybe I'm not understanding something about this use case, but why is precomputation not an
option? Is it because the matrices themselves change? Because if the matrices are constant,
then I think precomputation would work for you even if the users request random correlations.
You can just store the resulting column with the matrix id, row, and column as the key for
retrieval.

My general impression is that while you could do this in Spark, it's probably not the correct
framework for carrying out this kind of operation. This feels more like a job for something
like OpenMP than for Spark.
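A minimal sketch of that precompute-then-lookup idea (the plain dict here is a hypothetical stand-in for whatever key-value store actually backs retrieval):

```python
import numpy as np

def precompute_correlations(matrix_id, matrix, store):
    """Correlate every row against every other row once, offline, and
    store each row's coefficient vector under a (matrix_id, row) key."""
    corr = np.corrcoef(matrix)  # full row-by-row Pearson correlation matrix
    for i, row_corrs in enumerate(corr):
        store[(matrix_id, i)] = row_corrs

store = {}  # stand-in for a real key-value store (Redis, HBase, ...)
rng = np.random.default_rng(0)
m = rng.integers(0, 100, size=(5, 10)).astype(float)
precompute_correlations("matrix-a", m, store)
lookup = store[("matrix-a", 2)]  # every coefficient for row 2, one O(1) lookup
```

At the scale in this thread the full correlation matrix would be 3M x 3M, so in practice each row's vector would be streamed to the store as it is computed rather than materialized in memory the way np.corrcoef does here.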


On Wed, Jul 17, 2019 at 3:42 PM Gautham Acharya <gauthama@alleninstitute.org>
wrote:
As I said in my initial message, precomputing is not an option.

Retrieving only the top/bottom N most correlated is an option – would that speed up the
results?

Our SLAs are soft – slight variations (+- 15 seconds) will not cause issues.

--gautham
From: Patrick McCarthy [mailto:pmccarthy@dstillery.com]
Sent: Wednesday, July 17, 2019 12:39 PM
To: Gautham Acharya <gauthama@alleninstitute.org>
Cc: Bobby Evans <revans2@gmail.com>; Steven Stetzler <steven.stetzler@gmail.com>; user@spark.apache.org
Subject: Re: [Beginner] Run compute on large matrices and return the result in seconds?

Do you really need the results of all 3MM computations, or only the top- and bottom-most correlation
coefficients? Could correlations be computed on a sample and from that estimate a distribution
of coefficients? Would it make sense to precompute offline and instead focus on fast key-value
retrieval, like ElasticSearch or ScyllaDB?

Spark is a compute framework rather than a serving backend; I don't think it's designed with
retrieval SLAs in mind, and you may find those SLAs difficult to maintain.
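If only the extremes are needed, a partial sort avoids ranking all 3MM coefficients. A sketch with NumPy's argpartition, assuming the matrix fits in memory on one box:

```python
import numpy as np

def top_bottom_correlated(matrix, row_idx, n):
    """Pearson-correlate one row against every row, then return the n most
    and n least correlated row indices via argpartition (a partial sort)
    instead of fully sorting millions of coefficients."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    t = centered[row_idx]
    corrs = centered @ t / (np.linalg.norm(centered, axis=1) * np.linalg.norm(t))
    top = np.argpartition(corrs, -n)[-n:]
    bottom = np.argpartition(corrs, n)[:n]
    # Only the 2n survivors get fully sorted.
    return top[np.argsort(corrs[top])[::-1]], bottom[np.argsort(corrs[bottom])]

m = np.random.default_rng(1).normal(size=(1000, 50))
top, bottom = top_bottom_correlated(m, 0, 5)  # row 0 itself leads `top`
```

The same argpartition trick works per-partition in Spark: take the top/bottom n within each partition, then merge the small survivor lists on the driver.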

On Wed, Jul 17, 2019 at 3:14 PM Gautham Acharya <gauthama@alleninstitute.org>
wrote:
Thanks for the reply, Bobby.

I’ve received notice that we can probably tolerate response times of up to 30 seconds. Would
this be more manageable? 5 seconds was an initial ask, but 20-30 seconds is also a reasonable
response time for our use case.

With the new SLA, do you think that we can easily perform this computation in spark?
--gautham

From: Bobby Evans [mailto:revans2@gmail.com]
Sent: Wednesday, July 17, 2019 7:06 AM
To: Steven Stetzler <steven.stetzler@gmail.com>
Cc: Gautham Acharya <gauthama@alleninstitute.org>; user@spark.apache.org
Subject: Re: [Beginner] Run compute on large matrices and return the result in seconds?

Let's do a few quick rules of thumb to get an idea of what kind of processing power you will
need in general to do what you want.

You have 3,000,000 rows of 50,000 ints each.  Each int is 4 bytes, so that ends up being about
600 GB (roughly 560 GiB) that you need to fully process in 5 seconds.

If you are reading this from spinning disks (which average about 80 MB/s) you would need at
least 1,450 disks to just read the data in 5 seconds (that number can vary a lot depending
on the storage format and your compression ratio).
If you are reading the data over a network (let's say 10GigE, even though in practice you cannot
get that in the cloud easily) you would need about 90 NICs just to read the data in 5 seconds;
again, depending on the compression ratio this may be lower.
If you assume you have a cluster where it all fits in main memory and have cached all of the
data in memory (which in practice I have seen on most modern systems at somewhere between
12 and 16 GB/sec) you would need between 7 and 10 machines just to read through the data once
in 5 seconds.  Spark also stores cached data compressed so you might need less as well.

All the numbers fit with things that Spark should be able to handle, but a 5-second SLA is
very tight for this amount of data.
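Those rules of thumb are easy to sanity-check in a few lines (the figures are approximate and shift a little depending on GB-versus-GiB conventions and rounding):

```python
# Back-of-the-envelope check of the throughput estimates above.
# Real numbers depend on storage format and compression ratio.
n_bytes = 3_000_000 * 50_000 * 4       # int32 matrix: ~600 GB raw
gb = n_bytes / 1e9
sla_s = 5                              # target: read it all in 5 seconds

disks = n_bytes / 80e6 / sla_s         # spinning disks at ~80 MB/s each
nics = n_bytes * 8 / 10e9 / sla_s      # 10 GigE NICs at line rate
mem_lo = gb / 16 / sla_s               # machines at 16 GB/s memory bandwidth
mem_hi = gb / 12 / sla_s               # machines at 12 GB/s memory bandwidth

print(round(disks), round(nics), mem_lo, mem_hi)
```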

Can you make this work with Spark?  Probably. Does Spark have something built in that will
make this fast and simple for you?  I doubt it; you have some very tight requirements and will
likely have to write something custom to make it work the way you want.


On Thu, Jul 11, 2019 at 4:12 PM Steven Stetzler <steven.stetzler@gmail.com>
wrote:
Hi Gautham,

I am a beginner spark user too and I may not have a complete understanding of your question,
but I thought I would start a discussion anyway. Have you looked into using Spark's built
in Correlation function? (https://spark.apache.org/docs/latest/ml-statistics.html)
This might let you get what you want (per-row correlation against the same matrix) without
having to deal with parallelizing the computation yourself. Also, I think the question of
how quickly you can get your results is largely a data-access question rather than a
how-fast-is-Spark question. As long as you can exploit data parallelism (i.e. you can partition up your data),
Spark will give you a speedup. You can imagine that if you had a large machine with many cores
and ~100 GB of RAM (e.g. a m5.12xlarge EC2 instance), you could fit your problem in main memory
and perform your computation with thread based parallelism. This might get your result relatively
quickly. For a dedicated application with well constrained memory and compute requirements,
it might not be a bad option to do everything on one machine as well. Accessing an external
database and distributing work over a large number of computers can add overhead that might
be out of your control.
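As an illustration of the thread-based route on one large machine, here is a sketch that partitions the rows into chunks and correlates each chunk against the target row on its own thread (NumPy releases the GIL inside its BLAS calls, so the threads genuinely run in parallel); sizes are scaled down from the 3M x 50K matrix in the thread:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def corr_chunk(chunk, t, t_norm):
    """Pearson correlation of every row in `chunk` against centered target `t`."""
    centered = chunk - chunk.mean(axis=1, keepdims=True)
    return centered @ t / (np.linalg.norm(centered, axis=1) * t_norm)

def corr_one_vs_all(matrix, row_idx, n_workers=4):
    """Split the rows into n_workers chunks and correlate each chunk
    against matrix[row_idx] on its own thread."""
    t = matrix[row_idx] - matrix[row_idx].mean()
    t_norm = np.linalg.norm(t)
    chunks = np.array_split(matrix, n_workers)
    with ThreadPoolExecutor(n_workers) as pool:
        parts = pool.map(lambda c: corr_chunk(c, t, t_norm), chunks)
    return np.concatenate(list(parts))

# Scaled-down stand-in for the 3M-row matrix.
m = np.random.default_rng(2).integers(0, 100, size=(10_000, 500)).astype(np.float64)
c = corr_one_vs_all(m, 42)  # c[42] is the self-correlation, ~1.0
```

The chunking mirrors what Spark partitions would do, but without serialization or network overhead; the trade-off is that the full matrix has to fit in one machine's RAM.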

Thanks,
Steven

On Thu, Jul 11, 2019 at 9:24 AM Gautham Acharya <gauthama@alleninstitute.org>
wrote:
Ping? I would really appreciate advice on this! Thank you!

From: Gautham Acharya
Sent: Tuesday, July 9, 2019 4:22 PM
To: user@spark.apache.org
Subject: [Beginner] Run compute on large matrices and return the result in seconds?


This is my first email to this mailing list, so I apologize if I made any errors.



My team's going to be building an application and I'm investigating some options for distributed
compute systems. We want to be performing computes on large matrices.



The requirements are as follows:



1.     The matrices can be expected to be up to 50,000 columns x 3 million rows. The values
are all integers (except for the row/column headers).

2.     The application needs to select a specific row, and calculate the correlation coefficient
( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
) against every other row. This means up to 3 million different calculations.

3.     A sorted list of the correlation coefficients and their corresponding row keys need
to be returned in under 5 seconds.

4.     Users will eventually request random row/column subsets to run calculations on, so
precomputing our coefficients is not an option. This needs to be done on request.



I've been looking at many compute solutions, but I'd consider Spark first due to the widespread
use and community. I currently have my data loaded into Apache Hbase for a different scenario
(random access of rows/columns). I’ve naively tried loading a dataframe from the CSV using
a Spark instance hosted on AWS EMR, but getting the results for even a single correlation
takes over 20 seconds.



Thank you!


--gautham



--

Patrick McCarthy

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


--
http://www.google.com/profiles/grapesmoker