flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Runtime generated (source) datasets
Date Wed, 21 Jan 2015 19:21:46 GMT
There is common confusion between the "compile" phase of the
Java/Scala compiler (which does not generate the Flink plan) and Flink's
"compile/optimize" phase (which happens when env.execute() is called).

The Flink compile/optimize phase is not a compile phase in the sense that
source code is parsed and translated to byte code. It is only a set of
transformations on the program's data flow.

We should probably stop calling the Flink phase "compile" and simply call
it "pre-flight", "optimize", or "prepare". Otherwise it keeps causing
confusion...
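The lazy, pre-flight behavior described above can be mimicked in a few lines of plain Java. This is a toy illustration only, not Flink code (all class and method names here are made up): registering a sink merely records it in a plan, and nothing "runs" until execute() is called.

```java
import java.util.ArrayList;
import java.util.List;

public class LazyPlanDemo {
    // The "plan": just a list of sink descriptions collected as the program runs.
    private final List<String> sinks = new ArrayList<>();
    private boolean executed = false;

    // Declaring a sink only records it; no work happens here.
    public LazyPlanDemo addSink(String description) {
        sinks.add(description);
        return this;
    }

    // Only here is the collected plan "prepared" and run, all sinks at once.
    public int execute() {
        executed = true;
        return sinks.size(); // stand-in for submitting every registered sink
    }

    public boolean isExecuted() {
        return executed;
    }

    public static void main(String[] args) {
        LazyPlanDemo env = new LazyPlanDemo();
        env.addSink("csv-sink").addSink("hbase-sink");
        // Nothing has run yet; execute() triggers all registered sinks at once.
        System.out.println(env.execute()); // prints 2
    }
}
```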

On Wed, Jan 21, 2015 at 6:05 AM, Flavio Pompermaier <pompermaier@okkam.it>
wrote:

> Thanks Fabian, that makes a lot of sense :)
>
> Best,
> Flavio
>
> On Wed, Jan 21, 2015 at 2:41 PM, Fabian Hueske <fhueske@gmail.com> wrote:
>
>> The program is compiled when the ExecutionEnvironment.execute() method is
>> called. At that moment, the ExecutionEnvironment collects all data sources
>> that were previously created and traverses them towards connected data
>> sinks. All sinks found this way are remembered and treated as execution
>> targets. The sinks and all connected operators and data sources are given
>> to the optimizer, which analyzes the plan, compiles an execution plan, and
>> submits it to the execution system that the ExecutionEnvironment refers to
>> (local, remote, ...).
>>
>> Therefore, your code can build arbitrary data flows with as many sources
>> as you like. Once you call ExecutionEnvironment.execute(), all data sources
>> and operators that are required to compute the results of all data sinks
>> are executed.
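The two-step process described above can be sketched with Flink's DataSet API (this snippet assumes Flink on the classpath; the paths and job name are made up): two independent sinks are merely declared, and both only run when execute() walks back from them and submits a single plan.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> lines = env.fromElements("a", "b", "c");
lines.filter(s -> !s.equals("b")).writeAsText("/tmp/filtered"); // sink 1
lines.map(String::toUpperCase).writeAsText("/tmp/upper");       // sink 2

// Both sinks become execution targets of this one call.
env.execute("two-sink job");
```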
>>
>>
>> 2015-01-21 14:26 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>
>>> Great! Could you explain a little bit the internals of how and when
>>> Flink generates the plan, and how the execution environment is involved
>>> in this phase?
>>> Just to better understand this step!
>>>
>>> Thanks again,
>>> Flavio
>>>
>>>
>>> On Wed, Jan 21, 2015 at 2:14 PM, Till Rohrmann <trohrmann@apache.org>
>>> wrote:
>>>
>>>> Yes this will also work. You only have to make sure that the list of
>>>> data sets is processed properly later on in your code.
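One way to "process the list properly later on", as suggested above, is to fold the runtime-generated sources into a single DataSet via union (a sketch assuming Flink's DataSet API on the classpath; getInput and the output path are hypothetical):

```java
import java.util.List;
import org.apache.flink.api.java.DataSet;

List<DataSet<ElementType>> inputs = getInput(args, env);

// Union all runtime-generated sources so downstream operators
// don't need to know how many there were.
DataSet<ElementType> unioned = null;
for (DataSet<ElementType> ds : inputs) {
    unioned = (unioned == null) ? ds : unioned.union(ds);
}

unioned.writeAsText("/tmp/out"); // hypothetical sink
env.execute();
```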
>>>>
>>>> On Wed, Jan 21, 2015 at 2:09 PM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> Hi Till,
>>>>> thanks for the reply. However my problem is that I'll have something
>>>>> like:
>>>>>
>>>>> List<DataSet<ElementType>> getInput(String[] args,
>>>>> ExecutionEnvironment env) {....}
>>>>>
>>>>> So I don't know in advance how many of them I'll have at runtime. Does
>>>>> it still work?
>>>>>
>>>>> On Wed, Jan 21, 2015 at 1:55 PM, Till Rohrmann <trohrmann@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Flavio,
>>>>>>
>>>>>> if your question was whether you can write a Flink job which can read
>>>>>> input from different sources, depending on the user input, then the
>>>>>> answer is yes. The Flink job plans are actually generated at runtime,
>>>>>> so you can easily write a method which generates a user-dependent
>>>>>> input/data set.
>>>>>>
>>>>>> You could do something like this:
>>>>>>
>>>>>> DataSet<ElementType> getInput(String[] args, ExecutionEnvironment
>>>>>> env) {
>>>>>>   if (args[0].equals("csv")) {
>>>>>>     return env.readCsvFile(...);
>>>>>>   } else {
>>>>>>     return env.createInput(new AvroInputFormat<ElementType>(...));
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> as long as the element type of the data set is the same for all
>>>>>> possible data sources. I hope that I understood your problem correctly.
>>>>>>
>>>>>> Greets,
>>>>>>
>>>>>> Till
>>>>>>
>>>>>> On Wed, Jan 21, 2015 at 11:45 AM, Flavio Pompermaier <
>>>>>> pompermaier@okkam.it> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I have a big question for you about how Flink handles a job's plan
>>>>>>> generation: let's suppose that I want to write a job that takes as
>>>>>>> input a description of a set of datasets that I want to work on (for
>>>>>>> example a csv file and its path, 2 HBase tables, 1 Parquet directory
>>>>>>> and its path, etc.). From what I know, Flink generates the job's plan
>>>>>>> at compile time, so I was wondering whether this is possible right
>>>>>>> now or not..
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Flavio
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>
