samoa-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SAMOA-58) Samoa AvroFileStream from HDFSFileStreamSource stops at end of first file
Date Mon, 22 Feb 2016 14:43:18 GMT

    [ https://issues.apache.org/jira/browse/SAMOA-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157073#comment-15157073 ]

ASF GitHub Bot commented on SAMOA-58:
-------------------------------------

GitHub user edi-bice opened a pull request:

    https://github.com/apache/incubator-samoa/pull/48

    Patch for SAMOA-58 (Samoa AvroFileStream from HDFSFileStreamSource stops at end of first
file)

    FileStreamSource appeared to support multiple files, but in my testing it did not:
Samoa AvroFileStream from HDFSFileStreamSource stops at the end of the first file. I had
to change AvroFileStream, ArffFileStream and their parent FileStream to make this
work.
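
For illustration only, and not the code in this pull request: a minimal Java sketch of the behaviour described above, where a stream advances to the next source once the current one is exhausted and rebuilds its per-file reader at the same time. The class name MultiSourceLineStream, the use of in-memory strings as stand-ins for files, and the line-per-record format are all assumptions made to keep the example self-contained; SAMOA's FileStream, AvroFileStream and ArffFileStream are not reproduced here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration; this is not SAMOA's FileStream API. */
final class MultiSourceLineStream {
    // Each string stands in for the contents of one file in the input directory.
    private final Iterator<String> sources;
    // Per-source reader (the "loader"); rebuilt every time we roll over.
    private BufferedReader loader;

    MultiSourceLineStream(List<String> sourceContents) {
        this.sources = sourceContents.iterator();
    }

    /** Returns the next record, or null only once every source is exhausted. */
    String next() {
        try {
            while (true) {
                if (loader == null) {
                    if (!sources.hasNext()) {
                        return null; // end of the whole stream, not just one source
                    }
                    // Opening the next source also replaces the loader, so a reader
                    // bound to the exhausted source is never asked for more records.
                    loader = new BufferedReader(new StringReader(sources.next()));
                }
                String line = loader.readLine();
                if (line != null) {
                    return line;
                }
                // Current source is exhausted: drop its loader and try the next one.
                loader.close();
                loader = null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch an empty source is simply skipped, and null is returned only after the last source has been drained.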
    
    See following JIRA for additional detail:
    
    https://issues.apache.org/jira/browse/SAMOA-58
    
    Additionally, I modified bin/samoa, pom.xml, SystemUtils (as well as added a resource)
to fix reading from HDFS on my cluster.
    
    A seemingly unrelated change is the explicit test for supported Avro types so as to filter
out any fields that are not supported instead of assuming all non-nominal (non-enum) fields
are numeric and failing during reading.
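
A rough sketch of the kind of explicit type test described above, under stated assumptions: the FieldType enum is a local stand-in whose constant names mirror a subset of Avro's org.apache.avro.Schema.Type, and SupportedFieldFilter and the field names in the usage note are hypothetical. The supported set (double, float, long, int and enum) is taken from the first commit message in this pull request; the actual patch may structure the check differently.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration; not the code from the pull request. */
final class SupportedFieldFilter {
    /** Local stand-in for a subset of Avro schema types (assumption, see lead-in). */
    enum FieldType { DOUBLE, FLOAT, LONG, INT, ENUM, STRING, BYTES, RECORD }

    // Types named as supported in the commit message: double, float, long, int and enum.
    private static final Set<FieldType> SUPPORTED = EnumSet.of(
            FieldType.DOUBLE, FieldType.FLOAT, FieldType.LONG, FieldType.INT, FieldType.ENUM);

    /**
     * Keeps only fields whose type is explicitly supported, instead of treating
     * every non-enum field as numeric and failing later when a value is parsed.
     * Iteration order follows the order of the supplied map.
     */
    static List<String> supportedFieldNames(Map<String, FieldType> schemaFields) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, FieldType> field : schemaFields.entrySet()) {
            if (SUPPORTED.contains(field.getValue())) {
                kept.add(field.getKey());
            }
        }
        return kept;
    }
}
```

For example, given fields price (DOUBLE), note (STRING), label (ENUM) and count (INT) in that order, the sketch keeps price, label and count and drops note before any value is read.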

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/edi-bice/incubator-samoa master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-samoa/pull/48.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #48
    
----
commit 5cbbcfab94db47732ab44b3b9d752c45f02e2f30
Author: edi_bice <edi_bice@yahoo.com>
Date:   2016-02-17T15:45:07Z

    Only add fields of supported types (double, float, long, int and enum) rather than adding
and defaulting all non-enum to numeric and failing at value parse time

commit d5a055f5c5ff0c6787beaa03234375cdcbb89cb5
Author: edi_bice <edi_bice@yahoo.com>
Date:   2016-02-17T21:53:02Z

    until we change samza to produce files with .avro extension

commit ba73bb24d9477207e8dfd85fbf478be1e3877c7d
Author: edi_bice <edi_bice@yahoo.com>
Date:   2016-02-18T22:06:12Z

    A tentative solution to issue described in:
    
    https://issues.apache.org/jira/browse/SAMOA-58

commit 29e0379949eb7847ea46bfe432d98d90dff993e9
Author: edi_bice <edi_bice@yahoo.com>
Date:   2016-02-19T16:55:03Z

    Issue described in https://issues.apache.org/jira/browse/SAMOA-58 was apparently more
complicated than what was expected in previous commit. While we did succeed in replacing the
first exhausted file stream with a new one, the loader was not changed and would return null.
This rework of AvroFileStream, FileStream and ArffFileStream hopefully cleans things up a
bit and allows multi-file streams of either (Avro or Arff) type.

commit fe093240a248e26be84ded4d378acc1d5c81d599
Author: edi_bice <edi_bice@yahoo.com>
Date:   2016-01-25T17:02:22Z

    configure don't code

commit 99f04bb4396190e92af2a43e56d005cb502357ca
Author: Edi Bice <edi_bice@yahoo.com>
Date:   2016-02-22T14:25:43Z

    cherry-picked from faf branch - changes needed to be able to read from HDFS on a YARN
2.7.1 cluster

----


> Samoa AvroFileStream from HDFSFileStreamSource stops at end of first file
> -------------------------------------------------------------------------
>
>                 Key: SAMOA-58
>                 URL: https://issues.apache.org/jira/browse/SAMOA-58
>             Project: SAMOA
>          Issue Type: Bug
>          Components: SAMOA-Instances
>         Environment: RHEL 6.6, java 1.8.0_72
>            Reporter: Edi Bice
>
> It appears Samoa is capable of streaming a collection of files as a single stream, effectively
concatenating the files. However, when using Samoa AvroFileStream with HDFSFileStreamSource,
the stream appears to stop at the end of the first file:
> bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation
-i -1 -l (classifiers.ensemble.Bagging -s 100) -s (AvroFileStream -s HDFSFileStreamSource
-f /tmp/order_and_feats_flat_avro/2016_02_18/ -c 1 -e binary) -f 10000"
> 2016-02-18 20:43:20,991 [main] INFO  org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:183)
- last event is received!
> 2016-02-18 20:43:20,991 [main] INFO  org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:184)
- total count: 262144
> ...
> 2016-02-18 20:43:20,993 [main] INFO  org.apache.samoa.evaluation.EvaluatorProcessor (EvaluatorProcessor.java:191)
- total evaluation time: 34 seconds for 262144 instances
> bash-4.1$ hadoop fs -ls /tmp/order_and_feats_flat_avro/2016_02_18 | more
> Found 70 items
> -rw-r--r--   3 yarn hdfs  230855335 2016-02-18 16:01 /tmp/order_and_feats_flat_avro/2016_02_18/hdfs-1a238673-c4ec-4462-be67-78d573efa790-00001
> -rw-r--r--   3 yarn hdfs  229800273 2016-02-18 16:04 /tmp/order_and_feats_flat_avro/2016_02_18/hdfs-1a238673-c4ec-4462-be67-78d573efa790-00002
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
