flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Flink Mongodb
Date Tue, 04 Nov 2014 16:04:36 GMT
Ok, thanks for the explanation!
That was more or less like I thought it should be but there are still
points I'd like to clarify:

1 - What if a region is very big and there are other regions very small..?
There will be one slot that takes a very long time while the others will
stay inactive..
2 - Do you think it is possible to implement this in an adaptive way (stop
processing of huge region if it worth it and assign remaining data to
inactive task managers)?

On Tue, Nov 4, 2014 at 4:37 PM, Fabian Hueske <fhueske@apache.org> wrote:

> Local split assignment preferably assigns input split to workers that can
> locally read the data of an input split.
> For example, HDFS stores file chunks (blocks) distributed over the cluster
> and gives access to these chunks to every worker via network transfer.
> However, if a chunk is read from a process that runs on the same node as
> the chunk is stored, the read operation directly accesses the local file
> system without going over the network. Hence, it is essential to assign
> input splits based on the locality of their data if you want to have
> reasonably performance. We call this local split assignment. This is a
> general concept of all data parallel systems including Hadoop, Spark, and
> Flink.
> This issue is not related to serializability of input formats.
> I assume that the wrapped MongoIF is also not capable of local split
> assignment.
> Am Dienstag, 4. November 2014 schrieb Flavio Pompermaier :
>> What do you mean for  "might lack support for local split assignment"? You
>> mean that InputFormat is not serializable? This instead is not true for
>> Mongodb?
>> On Tue, Nov 4, 2014 at 10:00 AM, Fabian Hueske <fhueske@apache.org>
>> wrote:
>>> There's a page about Hadoop Compatibility that shows how to use the
>>> wrapper.
>>> The HBase format should work as well, but might lack support for local
>>> split assignment. In that case performance would suffer a lot.
>>> Am Dienstag, 4. November 2014 schrieb Flavio Pompermaier :
>>>> Should I start from
>>>> http://flink.incubator.apache.org/docs/0.7-incubating/example_connectors.html
>>>> ? Is it ok?
>>>> Thus, in principle, also the TableInputFormat of HBase could be used in
>>>> a similar way..isn't it?
>>>> On Tue, Nov 4, 2014 at 9:42 AM, Fabian Hueske <fhueske@apache.org>
>>>> wrote:
>>>>> Hi,
>>>>> the blog post uses Flinks wrapper for Hadoop InputFormats.
>>>>> This has been ported to the new API and is described in the
>>>>> documentation.
>>>>> So you just need to take Mongos Hadoop IF and plug it into the new IF
>>>>> wrapper. :-)
>>>>> Fabian
>>>>> Am Dienstag, 4. November 2014 schrieb Flavio Pompermaier :
>>>>> Hi to all,
>>>>>> I saw this post
>>>>>> https://flink.incubator.apache.org/news/2014/01/28/querying_mongodb.html
>>>>>> but it use the old APIs (HadoopDataSource instead of DataSource).
>>>>>> How can I use Mongodb with the new Flink APIs?
>>>>>> Best,
>>>>>> Flavio

View raw message