spark-user mailing list archives

From Peter Wolf <opus...@gmail.com>
Subject Re: Spark and Speech Recognition
Date Thu, 30 Jul 2015 13:15:43 GMT
Oh...  That was embarrassingly easy!

Thank you that was exactly the understanding of partitions that I needed.

P

On Thu, Jul 30, 2015 at 6:35 AM, Simon Elliston Ball <
simon@simonellistonball.com> wrote:

> You might also want to consider broadcasting the models, so that each
> executor JVM holds one instance shared by all of its cores. Otherwise the
> models are serialised with every task and you end up with a copy per task
> (roughly one per core in this instance).
>
> Simon
>
> On 30 Jul 2015, at 10:14, Akhil Das <akhil@sigmoidanalytics.com> wrote:
>
> Like this?
>
> sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls =>
> speechRecognizer(urls))
>
> Here 24 is the minimum number of partitions; set it to the total number of
> cores that you have across all the workers.
>
> Thanks
> Best Regards
>
> On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf <opus111@gmail.com> wrote:
>
>> Hello, I am writing a Spark application to use speech recognition to
>> transcribe a very large number of recordings.
>>
>> I need some help configuring Spark.
>>
>> My app is basically a transformation with no side effects: recording URL
>> --> transcript.  The input is a huge file with one URL per line, and the
>> output is a huge file of transcripts.
>>
>> The speech recognizer is written in Java (Sphinx4), so it can be packaged
>> as a JAR.
>>
>> The recognizer is very processor intensive, so you can't run too many on
>> one machine, perhaps one recognizer per core.  The recognizer is also
>> big, maybe 1 GB.  But most of the recognizer is immutable acoustic and
>> language models that can be shared with other instances of the recognizer.
>>
>> So I want to run about one recognizer per core of each machine in my
>> cluster.  I want all recognizers on one machine to run within the same JVM
>> and share the same models.
>>
>> How does one configure Spark for this sort of application?  How does one
>> control how Spark deploys the stages of the process?  Can someone point me
>> to an appropriate doc or keywords I should Google?
>>
>> Thanks
>> Peter
>>
>
>
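
For readers of the archive, the two suggestions in this thread (one partition per core, plus a broadcast of the shared models) can be combined roughly as follows. This is only a sketch under assumptions: Models and Recognizer are hypothetical stand-ins for the Sphinx4 classes, transcribe is a placeholder, and the input, output and model paths are taken from made-up command-line arguments. It uses mapPartitions rather than foreachPartition so that the transcripts come back as an RDD that can be written out.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for the immutable acoustic and language models.
// It must be Serializable so that Spark can broadcast it.
final class Models(val data: Array[Byte]) extends Serializable

// Hypothetical stand-in for a Sphinx4 recognizer that reuses shared models.
final class Recognizer(models: Models) {
  // Placeholder: a real implementation would fetch and decode the audio.
  def transcribe(url: String): String = s"transcript of $url"
}

object TranscribeJob {
  def main(args: Array[String]): Unit = {
    // Assumed arguments: input file of URLs, output directory, model file.
    val Array(input, output, modelPath) = args
    val sc = new SparkContext(new SparkConf().setAppName("transcribe"))

    // Load the models once on the driver and broadcast them; each
    // executor JVM then holds a single copy shared by all of its tasks.
    val modelsBc = sc.broadcast(new Models(Files.readAllBytes(Paths.get(modelPath))))

    // Total number of cores across the workers, as suggested in the thread.
    // textFile treats this as a minimum, so there may be more partitions.
    val numPartitions = 24

    val transcripts = sc.textFile(input, numPartitions).mapPartitions { urls =>
      // One recognizer per partition (task), built around the shared models.
      val recognizer = new Recognizer(modelsBc.value)
      urls.map(url => recognizer.transcribe(url))
    }

    transcripts.saveAsTextFile(output)
    sc.stop()
  }
}
```

Because the recognizer is created inside mapPartitions, it is never serialised; only the broadcast handle travels with each task. If the real models cannot be made serialisable, the broadcast would not work as written and they would have to be loaded on each executor instead.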
