spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: reading large XML files
Date Tue, 20 May 2014 17:38:08 GMT
You can search for XMLInputFormat on Google. There are some
implementations that allow you to specify the <tag> to split on, e.g.:
https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java
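
To illustrate the idea behind such tag-based input formats, here is a minimal
single-process sketch in Python, not the Cloud9 code itself. It streams the
file in fixed-size chunks and yields each record between a start tag and an
end tag, so the whole file is never held in memory. The function name
iter_tag_records, the chunk size, and the "<node " / "</node>" tags are
assumptions made for the example. It does plain byte matching, so it would
miss self-closing elements, and it does not handle Hadoop split boundaries
the way a real InputFormat has to.

```python
import io


def iter_tag_records(stream, start_tag, end_tag, chunk_size=64 * 1024):
    """Yield each byte span from start_tag through end_tag.

    Hypothetical helper for illustration: reads `stream` (a binary file
    object) chunk by chunk and keeps only the unfinished record buffered.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of input; any unfinished record is dropped
        buf += chunk
        while True:
            start = buf.find(start_tag)
            if start == -1:
                # No start tag yet: keep only a tail that could hold the
                # beginning of a start tag cut off by the chunk boundary.
                keep = len(start_tag) - 1
                buf = buf[-keep:] if keep > 0 else b""
                break
            end = buf.find(end_tag, start + len(start_tag))
            if end == -1:
                # Record is still open: discard the text before it and
                # wait for more data.
                buf = buf[start:]
                break
            end += len(end_tag)
            yield buf[start:end]
            buf = buf[end:]


if __name__ == "__main__":
    # Tiny in-memory stand-in for a large GraphML file.
    sample = (
        b'<graphml><graph>'
        b'<node id="n0"><data key="d0">a</data></node>\n'
        b'<node id="n1"><data key="d0">b</data></node>'
        b'</graph></graphml>'
    )
    for record in iter_tag_records(io.BytesIO(sample), b"<node ", b"</node>"):
        print(record.decode("utf-8"))
```

An XMLInputFormat applies the same scan within each input split, so every
record reaches Spark as its own element instead of the file arriving as one
string.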

On Tue, May 20, 2014 at 10:31 AM, Nathan Kronenfeld
<nkronenfeld@oculusinfo.com> wrote:
> Unfortunately, I don't have a bunch of moderately big xml files; I have one,
> really big file - big enough that reading it into memory as a single string
> is not feasible.
>
>
> On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng <mengxr@gmail.com> wrote:
>>
>> Try sc.wholeTextFiles(). It reads the entire file into a string
>> record. -Xiangrui
>>
>> On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld
>> <nkronenfeld@oculusinfo.com> wrote:
>> > We are trying to read some large GraphML files to use in spark.
>> >
>> > Is there an easy way to read XML-based files like this that accounts for
>> > partition boundaries and the like?
>> >
>> >              Thanks,
>> >              Nathan
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone:  +1-416-203-3003 x 238
> Email:  nkronenfeld@oculusinfo.com
