spark-user mailing list archives

From Felix Cheung <felixcheun...@hotmail.com>
Subject Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")
Date Thu, 25 Aug 2016 18:00:32 GMT
The reason your second example works is closure capture: free variables referenced inside the function are serialized and shipped to the executors along with it. That should be fine for a small amount of data.
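To illustrate, here is a minimal sketch of that capture behavior (the data and the arithmetic are made up for the example):

library(SparkR)
sparkR.session()

dat <- head(iris, 100)  # small object in the driver's environment

scoreModel <- function(parameters) {
  # `dat` is a free variable here, so it is captured in the closure
  # and serialized to each executor together with the function
  nrow(dat) * parameters
}

modelScores <- spark.lapply(1:4, scoreModel)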

You could also use SparkR:::broadcast, but please keep in mind that it is not a public API we actively
support.
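A rough sketch of what that could look like - hedged, since broadcast and value are internal APIs that may change, and score() and dat are assumed from your code:

sc <- SparkR:::getSparkContext()
bcDat <- SparkR:::broadcast(sc, dat)  # ship the dataset once per executor

scoreModel <- function(parameters) {
  dat <- SparkR:::value(bcDat)  # read the broadcast value on the worker
  score(dat, parameters)
}

modelScores <- spark.lapply(parameterList, scoreModel)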

Thank you for the information on the formula - I will test that out. Please note that the SparkR code
now lives at

https://github.com/apache/spark/tree/master/R
_____________________________
From: Cinquegrana, Piero <piero.cinquegrana@neustar.biz>
Sent: Thursday, August 25, 2016 6:08 AM
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")
To: <user@spark.apache.org>, Felix Cheung <felixcheung_m@hotmail.com>


I tested both in local and cluster mode and the '<<-' seemed to work, at least for
small data. Or am I missing something? Is there a way for me to test? If that does not work,
can I use something like this?

sc <- SparkR:::getSparkContext()
bcStack <- SparkR:::broadcast(sc, stack)

I realized that the error

Error in writeBin(batch, con, endian = "big")

was due to an object within the 'parameters' list which was an R formula.

When the spark.lapply method calls the parallelize method, it splits the list and calls the
SparkR:::writeRaw method, which tries to serialize the formula to binary, exploding the size
of the object being passed (presumably past the ~2 GB cap on an R raw vector, which is what
writeBin complains about).

https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/R/serialize.R
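For what it's worth, here is a small standalone sketch of why a formula can explode in serialized size: a formula captures its enclosing environment, and serialize() includes the contents of any non-global environment it references (sizes are illustrative):

makeFormula <- function() {
  big <- rnorm(1e6)  # ~8 MB object living in the enclosing environment
  y ~ x              # the formula captures that environment
}

f <- makeFormula()
length(serialize(f, NULL))     # large: the captured `big` comes along
environment(f) <- globalenv()  # detach the captured environment
length(serialize(f, NULL))     # small: just the formula itself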

From: Felix Cheung [mailto:felixcheung_m@hotmail.com]
Sent: Thursday, August 25, 2016 2:35 PM
To: Cinquegrana, Piero <Piero.Cinquegrana@neustar.biz>; user@spark.apache.org
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

Hmm, '<<-' wouldn't work in cluster mode. Are you running Spark in local mode?

In any case, I tried running your earlier code and it worked for me on a 250 MB CSV:

scoreModel <- function(parameters){
               library(data.table)  # I assume this should be data.table, since fread() comes from there
               dat <- data.frame(fread("file.csv"))
               score(dat, parameters)
}
parameterList <- lapply(1:100, function(i) getParameters(i))
modelScores <- spark.lapply(parameterList, scoreModel)

Could you provide more information on your actual code?

_____________________________
From: Cinquegrana, Piero <piero.cinquegrana@neustar.biz>
Sent: Wednesday, August 24, 2016 10:37 AM
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")
To: Cinquegrana, Piero <piero.cinquegrana@neustar.biz>, Felix Cheung <felixcheung_m@hotmail.com>, <user@spark.apache.org>



Hi Spark experts,

I was able to get around the broadcast issue by using the global assignment operator '<<-'
instead of reading the data locally. However, I still get the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector


Pseudo code:

scoreModel <- function(parameters){
               library(score)  # package assumed to provide score()
               score(dat, parameters)
}

dat <<- read.csv("file.csv")
modelScores <- spark.lapply(parameterList, scoreModel)

From: Cinquegrana, Piero [mailto:Piero.Cinquegrana@neustar.biz]
Sent: Tuesday, August 23, 2016 2:39 PM
To: Felix Cheung <felixcheung_m@hotmail.com>; user@spark.apache.org
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

The output from score() is very small, just a float. The input, however, could be as big as
several hundred MB. I would like to broadcast the dataset to all executors.

Thanks,
Piero

From: Felix Cheung [mailto:felixcheung_m@hotmail.com]
Sent: Monday, August 22, 2016 10:48 PM
To: Cinquegrana, Piero <Piero.Cinquegrana@neustar.biz>; user@spark.apache.org
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

How big is the output from score()?

Also could you elaborate on what you want to broadcast?


On Mon, Aug 22, 2016 at 11:58 AM -0700, "Cinquegrana, Piero" <Piero.Cinquegrana@neustar.biz>
wrote:
Hello,

I am using the new R API in SparkR, spark.lapply (Spark 2.0). I am defining a complex function
to be run across executors and I have to send the entire dataset, but there is not (that I
could find) a way to broadcast the variable in SparkR. I am thus reading the dataset in each
executor from disk, but I am getting the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector

Any idea why this is happening?

Pseudo code:

scoreModel <- function(parameters){
               library(read.table)
               dat <- data.frame(fread("file.csv"))
               score(dat, parameters)
}

parameterList <- lapply(1:numModels, function(i) getParameters(i))

modelScores <- spark.lapply(parameterList, scoreModel)


Piero Cinquegrana
MarketShare: A Neustar Solution / Data Science







