spark-user mailing list archives

From boci <boci.b...@gmail.com>
Subject Re: ElasticSearch enrich
Date Thu, 26 Jun 2014 07:04:33 GMT
That's okay, but Hadoop has ES integration. What happens if I run
saveAsHadoopFile without Hadoop? (Or do I need to start Hadoop
programmatically, if that is even possible?)

b0c1

----------------------------------------------------------------------------------------------------------------------------------
Skype: boci13, Hangout: boci.boci@gmail.com


On Thu, Jun 26, 2014 at 1:20 AM, Holden Karau <holden@pigscanfly.ca> wrote:

>
>
> On Wed, Jun 25, 2014 at 4:16 PM, boci <boci.boci@gmail.com> wrote:
>
>> Hi guys, thanks for the direction. Now I have some problems/questions:
>> - In local (test) mode I want to use ElasticClient.local to create the ES
>> connection, but in production I want to use ElasticClient.remote. For this,
>> should I pass the ElasticClient to mapPartitions, or what is the best
>> practice?
>>
> In this case you probably want to create the ElasticClient inside of
> mapPartitions (since it isn't serializable), and if you want to use a
> different client in local mode, just have a flag that controls what type of
> client you create.
>
>> - My stream output is written to Elasticsearch. How can I
>> test output.saveAsHadoopFile[ESOutputFormat]("-") in a local environment?
>>
>> - After storing the enriched data in ES, I want to generate aggregated
>> data (EsInputFormat). How can I test it locally?
>>
> I think the simplest thing to do would be to use the same client in both
> modes and just start a single-node Elasticsearch cluster.
>
>>
>> Thanks guys
>>
>> b0c1
>>
>>
>>
>>
>> ----------------------------------------------------------------------------------------------------------------------------------
>> Skype: boci13, Hangout: boci.boci@gmail.com
>>
>>
>> On Wed, Jun 25, 2014 at 1:33 AM, Holden Karau <holden@pigscanfly.ca>
>> wrote:
>>
>>> So I'm giving a talk at the Spark summit on using Spark & ElasticSearch,
>>> but for now if you want to see a simple demo which uses elasticsearch for
>>> geo input you can take a look at my quick & dirty implementation with
>>> TopTweetsInALocation (
>>> https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/TopTweetsInALocation.scala
>>> ). This approach uses the ESInputFormat which avoids the difficulty of
>>> having to manually create ElasticSearch clients.
>>>
>>> This approach might not work for your data, e.g. if you need to create a
>>> query for each record in your RDD. If this is the case, you could instead
>>> look at using mapPartitions and setting up your Elasticsearch connection
>>> inside of that, so you could then re-use the client for all of the queries
>>> on each partition. This approach will avoid having to serialize the
>>> Elasticsearch connection because it will be local to your function.
>>>
>>> Hope this helps!
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>>
>>> On Tue, Jun 24, 2014 at 4:28 PM, Mayur Rustagi <mayur.rustagi@gmail.com>
>>> wrote:
>>>
>>>> It's not used as the default serializer because of some compatibility
>>>> issues & the requirement to register the classes.
>>>>
>>>> Which part are you getting as non-serializable? You need to make that
>>>> class serializable if you are sending it to Spark workers inside a map,
>>>> reduce, mapPartitions, or any of the other operations on an RDD.
>>>>
>>>>
>>>> Mayur Rustagi
>>>> Ph: +1 (760) 203 3257
>>>> http://www.sigmoidanalytics.com
>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>
>>>>
>>>>
>>>> On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng <pc175@uow.edu.au> wrote:
>>>>
>>>>> I'm afraid persisting a connection across two tasks is dangerous, as they
>>>>> can't be guaranteed to be executed on the same machine. Your ES server may
>>>>> think it's a man-in-the-middle attack!
>>>>>
>>>>> I think it's possible to invoke a static method that gives you a
>>>>> connection from a local 'pool', so nothing will sneak into your closure,
>>>>> but it's too complex and there should be a better option.
>>>>>
>>>>> I've never used Kryo before; if it's that good, perhaps we should use it
>>>>> as the default serializer.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/ElasticSearch-enrich-tp8209p8222.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>>
>>
>>
>
>
> --
> Cell : 425-233-8271
>
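
The advice quoted in this thread (create the client inside mapPartitions, and let a flag decide between a local and a remote client) might look roughly like the sketch below. It deliberately avoids any Spark or Elasticsearch dependency so it can run on its own: EsClient, EsClients.create and Enrich.enrichPartition are hypothetical names invented for illustration and only stand in for a real, non-serializable client such as the ElasticClient mentioned above. The partition function has the Iterator => Iterator shape that RDD.mapPartitions takes.

```scala
// Hypothetical stand-in for a non-serializable Elasticsearch client.
trait EsClient {
  def lookup(key: String): String
  def close(): Unit
}

object EsClients {
  // Hypothetical factory: the flag decides which kind of client is built.
  // A real version would create a local or a remote client here.
  def create(useLocal: Boolean): EsClient = new EsClient {
    private val kind = if (useLocal) "local" else "remote"
    def lookup(key: String): String = s"$kind:$key"
    def close(): Unit = ()
  }
}

object Enrich {
  // Only the Boolean flag is captured from the caller; the client itself is
  // created inside the function, so it never has to be serialized.
  def enrichPartition(useLocal: Boolean)(records: Iterator[String]): Iterator[String] = {
    val client = EsClients.create(useLocal)
    try {
      // Iterators are lazy, so the results are materialized before close();
      // otherwise the lookups would run after the client was closed.
      records.map(client.lookup).toList.iterator
    } finally client.close()
  }
}

object Demo extends App {
  // With Spark, the call would be along the lines of:
  //   rdd.mapPartitions(it => Enrich.enrichPartition(useLocal = true)(it))
  val out = Enrich.enrichPartition(useLocal = true)(Iterator("a", "b")).toList
  println(out) // prints List(local:a, local:b)
}
```

One client is created per partition and reused for every record in it. Materializing the results with toList keeps the sketch simple but buffers a whole partition in memory, which may not suit very large partitions.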
