spark-user mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: example of non-line oriented input data?
Date Mon, 17 Mar 2014 17:08:22 GMT
Hmm, so I lucked out with my data source in that it comes to me as
line-delimited JSON, so I didn't have to write code to massage it into that
format.

If you are prepared to make several assumptions about your data (let's say
it's JSON), it should be straightforward to write a pre-processor that puts
each object on its own line just by counting and matching opening and
closing braces. You'll have to assume, for example, that your JSON is well
formed and that string values don't themselves contain braces, but that may
be okay for your purposes. If you want something more robust, there's this
post <http://stackoverflow.com/a/7795029/877069> on "lazily" reading JSON
objects from a file stream.
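
To make the brace-counting idea concrete, here is a rough sketch in Python. It is not code from a real project; the function name split_json_objects is made up for illustration, and it relies on exactly the assumptions above: well-formed JSON and no braces inside string values. It works on any iterable of characters, so the whole document never has to be parsed at once.

```python
import json


def split_json_objects(chars):
    """Yield each top-level {...} object as a single-line string.

    Assumes well-formed JSON with no braces inside string values.
    (Hypothetical helper, for illustration only.)
    """
    depth = 0   # current brace nesting level
    buf = []    # characters of the object being collected
    for ch in chars:
        if ch == "{":
            depth += 1
        if depth > 0:
            # Raw newlines are only formatting in valid JSON, so swap
            # them for spaces to keep each object on one line.
            buf.append(" " if ch in "\r\n" else ch)
        if ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                # A top-level object just closed; emit it.
                yield "".join(buf)
                buf = []


# Small in-memory sample standing in for a large file.
sample = '[{"a": 1,\n  "b": {"c": 2}}, {"d": 3}]'
lines = list(split_json_objects(sample))

# Each line can now be deserialized on its own, e.g. with map().
records = list(map(json.loads, lines))
```

For a real file you would feed it characters lazily (for instance by reading fixed-size chunks) and write each yielded string followed by a newline. The same json.loads step could then be applied per line in Spark's map().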

And if all of this is too much, just stick with deserializing the entire
document at once and take it from there.
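
As a sketch of that fallback, assuming the document is small enough to fit in memory and is either a single JSON object or an array of them, you could deserialize it in one go and re-emit one object per line. The name to_json_lines is hypothetical.

```python
import json


def to_json_lines(document_text):
    """Parse a whole JSON document and return one JSON string per record.

    Assumes the document fits in memory. (Hypothetical helper.)
    """
    records = json.loads(document_text)
    if not isinstance(records, list):
        # A single top-level object becomes a one-record list.
        records = [records]
    # json.dumps emits no newlines by default, so each string is one line.
    return [json.dumps(record) for record in records]


json_lines = to_json_lines('[{"a": 1},\n {"b": [2, 3]}]')
```

Writing those strings out one per line gives you the same line-delimited format as the pre-processor approach.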

Nick


On Mon, Mar 17, 2014 at 11:56 AM, Diana Carroll <dcarroll@cloudera.com> wrote:

> I don't actually have any data.  I'm writing a course that teaches
> students how to do this sort of thing and am interested in looking at a
> variety of real life examples of people doing things like that.  I'd love
> to see some working code implementing the "obvious work-around" you
> mention...do you have any to share?  It's an approach that makes a lot of
> sense, and as I said, I'd love to not have to re-invent the wheel if
> someone else has already written that code.  Thanks!
>
> Diana
>
>
> On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas <
> nicholas.chammas@gmail.com> wrote:
>
>> There was a previous discussion about this here:
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html
>>
>> How big are the XML or JSON files you're looking to deal with?
>>
>> It may not be practical to deserialize the entire document at once. In
>> that case an obvious work-around would be to have some kind of
>> pre-processing step that separates XML nodes/JSON objects with newlines so
>> that you *can* analyze the data with Spark in a "line-oriented format".
>> Your preprocessor wouldn't have to parse/deserialize the massive document;
>> it would just have to track open/closed tags/braces to know when to insert
>> a newline.
>>
>> Then you'd just open the line-delimited result and deserialize the
>> individual objects/nodes with map().
>>
>> Nick
>>
>>
>> On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll <dcarroll@cloudera.com> wrote:
>>
>>> Has anyone got a working example of a Spark application that analyzes
>>> data in a non-line-oriented format, such as XML or JSON?  I'd like to do
>>> this without re-inventing the wheel...anyone care to share?  Thanks!
>>>
>>> Diana
>>>
>>
>>
>
