spark-dev mailing list archives

From Ewan Higgs <ewan.hi...@ugent.be>
Subject Re: Multi-Line JSON in SparkSQL
Date Tue, 05 May 2015 07:51:34 GMT
FWIW, CSV has the same problem: RFC 4180 quoting makes it immune to naive newline-based partitioning.

Consider the following RFC 4180 compliant record:

1,2,"
all,of,these,are,just,one,field
",4,5
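For concreteness, a quick stdlib-Python sketch (not Spark code) of how a naive newline split miscounts that record, while an RFC 4180-aware parser keeps it whole:

```python
# Sketch: why splitting this record on newlines breaks it.
import csv
import io

record = '1,2,"\nall,of,these,are,just,one,field\n",4,5\n'

# Naive partitioning: treat every newline as a record boundary.
naive_rows = record.strip().split("\n")  # yields 3 bogus "records"

# RFC 4180-aware parsing: the quoted field spans the newlines.
rows = list(csv.reader(io.StringIO(record)))

print(len(naive_rows))  # 3
print(rows)  # [['1', '2', '\nall,of,these,are,just,one,field\n', '4', '5']]
```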

Now, it's probably a terrible idea to give a file system awareness of 
actual file types, but couldn't HDFS handle this nearer the replication 
level? XML, JSON, and CSV are so pervasive that it almost seems 
appropriate, *if* enormous JSON files are considered enough of an issue 
that basic ETL becomes a non-viable solution.

-Ewan

On 05/05/15 09:37, Joe Halliwell wrote:
> @reynold, I'll raise a JIRA today. @olivier, let's discuss on the ticket?
>
> I suspect the algorithm is going to be a bit fiddly and would definitely
> benefit from multiple heads. If possible, I think we should handle
> pathological cases like {":":":",{"{":"}"}} correctly, rather than bailing out.
>
> The JSON grammar is simple enough that this feels tractable. (I wonder if
> there's research on "start anywhere" languages/parsers in general...)
>
> Cheers,
> Joe
>
> http://www.joehalliwell.com
> @joehalliwell
>
> On Mon, May 4, 2015 at 10:07 PM, Olivier Girardot
> <o.girardot@lateral-thoughts.com> wrote:
>
>> @joe, I'd be glad to help if you need.
>> On Mon., May 4, 2015 at 20:06, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>>> I don't know whether this is common, but we might also allow another
>>> separator for JSON objects, such as two blank lines.
>>>
>>> Matei
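A minimal sketch of what that separator convention could look like at the parser level (plain Python rather than a Hadoop InputFormat; the `"\n\n\n"` separator below is an assumption, standing in for "each object's final newline followed by two blank lines"):

```python
# Hypothetical sketch: objects separated by two blank lines.
import json

blob = '{"a": 1,\n "b": "x"}\n\n\n{"a": 2,\n "b": "y"}\n'

# Two consecutive blank lines never occur inside a JSON text that is
# serialized without blank lines, so they make a cheap, safe split point.
objects = [json.loads(chunk) for chunk in blob.split("\n\n\n") if chunk.strip()]
print(objects)  # [{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}]
```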
>>>
>>>> On May 4, 2015, at 2:28 PM, Reynold Xin <rxin@databricks.com> wrote:
>>>>
>>>> Joe - I think that's a legit and useful thing to do. Do you want to give
>>> it
>>>> a shot?
>>>>
>>>> On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell <joe.halliwell@gmail.com>
>>>> wrote:
>>>>
>>>>> I think Reynold's argument shows the impossibility of the general case.
>>>>>
>>>>> But a "maximum object depth" hint could enable a new input format to do
>>>>> its job both efficiently and correctly in the common case where the
>>>>> input is an array of similarly structured objects! I'd certainly be
>>>>> interested in an implementation along those lines.
>>>>>
>>>>> Cheers,
>>>>> Joe
>>>>>
>>>>> http://www.joehalliwell.com
>>>>> @joehalliwell
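One shape such a resync could take, sketched as a trial-parse loop in plain Python (hypothetical; a string field that happens to contain a complete, valid JSON object can still fool it, which is where a depth or structure hint would have to come in):

```python
# Hypothetical trial-parse resync using json.JSONDecoder.raw_decode:
# from an arbitrary offset, try each '{' until a whole object parses.
import json

decoder = json.JSONDecoder()
blob = '[{"k": "x { y"}, {"k": 2}]'

def next_object(text, offset):
    """Return (obj, end) for the first parseable object at/after offset."""
    i = text.find("{", offset)
    while i != -1:
        try:
            return decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            # That '{' was inside a string literal; try the next one.
            i = text.find("{", i + 1)
    return None, len(text)

obj, end = next_object(blob, 5)  # offset 5 lands mid-record
print(obj)  # {'k': 2}
```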
>>>>>
>>>>>
>>>>> On Mon, May 4, 2015 at 7:55 AM, Reynold Xin <rxin@databricks.com>
>>> wrote:
>>>>>> I took a quick look at that implementation. I'm not sure if it actually
>>>>>> handles JSON correctly, because it attempts to find the first { starting
>>>>>> from a random point. However, that random point could be in the middle
>>>>>> of a string, and thus the first { might just be part of a string rather
>>>>>> than a real JSON object starting position.
>>>>>>
>>>>>>
>>>>>> On Sun, May 3, 2015 at 11:13 PM, Emre Sevinc <emre.sevinc@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> You can check out the following library:
>>>>>>>
>>>>>>> https://github.com/alexholmes/json-mapreduce
>>>>>>>
>>>>>>> --
>>>>>>> Emre Sevinç
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot <
>>>>>>> o.girardot@lateral-thoughts.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>> Is there any way in Spark SQL to load multi-line JSON data
>>>>>>>> efficiently? I think there was in the mailing list a reference to
>>>>>>>> http://pivotal-field-engineering.github.io/pmr-common/ for its
>>>>>>>> JSONInputFormat.
>>>>>>>>
>>>>>>>> But it's rather inaccessible considering the dependency is not
>>>>>>>> available in any public maven repo (if you know of one, I'd be
>>>>>>>> glad to hear it).
>>>>>>>>
>>>>>>>> Is there any plan to address this, or any public recommendation?
>>>>>>>> (The documentation clearly states that sqlContext.jsonFile will
>>>>>>>> not work for multi-line JSON.)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Olivier.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

