beam-commits mailing list archives

From "Kenneth Knowles (JIRA)" <>
Subject [jira] [Commented] (BEAM-1439) Beam Example(s) exploring public document datasets
Date Tue, 28 Mar 2017 03:36:41 GMT


Kenneth Knowles commented on BEAM-1439:

And also, please engage with the Beam community early - before applications are reviewed!

Here are some ideas for getting engaged:

# Work through Beam's "getting started" materials such as
#* Especially get as familiar as you can with the runner that you are interested in
# Subscribe to and/or
# You are welcome to share your applications for early commentary on to
get early feedback and mentorship (this is quite normal for GSoC+Apache; even if you don't
get selected by GSoC you will learn and make new acquaintances)
# Pick up starter bugs to get familiar with the codebase beyond our getting started material

> Beam Example(s) exploring public document datasets
> --------------------------------------------------
>                 Key: BEAM-1439
>                 URL:
>             Project: Beam
>          Issue Type: Wish
>          Components: examples-java
>            Reporter: Kenneth Knowles
>            Assignee: Kenneth Knowles
>            Priority: Minor
>              Labels: gsoc2017, java, mentor, python
> In Beam, we have examples illustrating counting the occurrences of words and performing
a basic TF-IDF analysis on the works of Shakespeare (or whatever you point it at). It would
be even cooler to do these analyses, and more, on a much larger data set that is really the
subject of current investigations.
> In chatting with professors at the University of Washington, I've learned that scholars
of many fields would really like to explore new and highly customized ways of processing the
growing body of publicly-available scholarly documents, such as PubMed Central. An example query
is "show me documents where chemical compounds X and Y were both used in the 'method' section".
> So I propose a Google Summer of Code project wherein a student writes some large-scale
Beam pipelines to perform analyses such as term frequency, bigram frequency, etc.
> Skills required:
>  - Java or Python
>  - (nice to have) Working through the Beam getting started materials
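
To make the "bigram frequency" analysis mentioned in the description concrete, here is a minimal plain-Python sketch of the computation. It is not a Beam pipeline and not code from the Beam examples; the function names (`tokenize`, `bigrams`, `bigram_counts`) and the sample documents are hypothetical, chosen only for illustration. In a real pipeline the tokenizing and pairing steps would presumably become per-element transforms and the counting step a distributed combine.

```python
import re
from collections import Counter
from itertools import chain


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bigrams(tokens):
    """Return adjacent token pairs from a single document."""
    return zip(tokens, tokens[1:])


def bigram_counts(documents):
    """Count bigrams across an iterable of document strings.

    Bigrams never span document boundaries, because each document
    is tokenized and paired on its own before counting.
    """
    return Counter(
        chain.from_iterable(bigrams(tokenize(doc)) for doc in documents)
    )


# Hypothetical sample input, standing in for a large document collection.
docs = ["To be or not to be", "Not to be outdone"]
counts = bigram_counts(docs)
```

Here `counts[("to", "be")]` is 3, since the pair occurs twice in the first sample document and once in the second. Because the counting is a sum over per-document results, it should parallelize naturally across a large corpus.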

