From: Umesh Kacha
Date: Mon, 10 Aug 2015 23:05:57 +0530
Subject: Re: How to create DataFrame from a binary file?
To: bo yang
Cc: user@spark.apache.org

Hi Bo, thanks much. Let me explain; please see the following code:

JavaPairRDD<String, PortableDataStream> pairRdd =
    javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
JavaRDD<PortableDataStream> javardd = pairRdd.values();
DataFrame binDataFrame = sqlContext.createDataFrame(javardd, PortableDataStream.class);
binDataFrame.show(); // shows just one row with the above file path /hdfs/path/to/binfile

I want the binary data from the above file in a DataFrame so that I can do analytics on it directly. My data is binary, so I can't use a StructType with primitive data types, right, since everything is binary/bytes. My custom binary data format is similar to Parquet, and I did not find any good example of where/how Parquet is read into a DataFrame. Please guide.
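[Editor's sketch] Whatever the real schema ends up being, the binary bytes have to be decoded into typed fields at some point before `createDataFrame` can produce usable columns. The record layout below is a hypothetical stand-in (the actual Parquet-like format is not shown in the thread); this is a minimal plain-Java sketch of the per-record decode that would run inside a `map` over a `JavaRDD<byte[]>`, producing values for a matching two-column StructType:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class RecordDecoder {
    // Hypothetical fixed-width layout: 8-byte long id, 8-byte double value, big-endian.
    static final int RECORD_SIZE = 16;

    // Decode one record into its field values. In Spark this result would be
    // wrapped in a Row (e.g. RowFactory.create(id, value)) before createDataFrame.
    static Object[] decode(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.BIG_ENDIAN);
        long id = buf.getLong();
        double value = buf.getDouble();
        return new Object[] { id, value };
    }

    public static void main(String[] args) {
        // Encode a sample record, then decode it back.
        ByteBuffer buf = ByteBuffer.allocate(RECORD_SIZE).order(ByteOrder.BIG_ENDIAN);
        buf.putLong(42L).putDouble(3.5);
        Object[] fields = decode(buf.array());
        System.out.println(fields[0] + "," + fields[1]); // prints 42,3.5
    }
}
```

The point is that a DataFrame never holds opaque binary directly as queryable columns: either decode into typed fields like this, or keep a single `BinaryType` column and decode later.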
On Sun, Aug 9, 2015 at 11:52 PM, bo yang wrote:
> Well, my post uses a raw text JSON file to show how to create a data frame
> with a custom data schema. The key idea is to show the flexibility to deal
> with any format of data by using your own schema. Sorry if I did not make
> myself clear.
>
> Anyway, let us know once you figure out your problem.
>
> On Sun, Aug 9, 2015 at 11:10 AM, Umesh Kacha wrote:
>> Hi Bo, I know how to create a DataFrame; my question is how to create a
>> DataFrame from binary files, and in your blog it is raw text JSON files.
>> Please read my question properly. Thanks.
>>
>> On Sun, Aug 9, 2015 at 11:21 PM, bo yang wrote:
>>> You can create your own data schema (StructType in Spark) and use the
>>> following method to create a data frame with your own data schema:
>>>
>>> sqlContext.createDataFrame(yourRDD, structType);
>>>
>>> I wrote a post on how to do it. You can also get the sample code there:
>>>
>>> Light-Weight Self-Service Data Query through Spark SQL:
>>> https://www.linkedin.com/pulse/light-weight-self-service-data-query-through-spark-sql-bo-yang
>>>
>>> Take a look and feel free to let me know if you have any questions.
>>>
>>> Best,
>>> Bo
>>>
>>> On Sat, Aug 8, 2015 at 1:42 PM, unk1102 wrote:
>>>> Hi, how do we create a DataFrame from a binary file stored in HDFS? I was
>>>> thinking of using:
>>>>
>>>> JavaPairRDD<String, PortableDataStream> pairRdd =
>>>>     javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
>>>> JavaRDD<PortableDataStream> javardd = pairRdd.values();
>>>>
>>>> I can see that PortableDataStream has a method called toArray which can
>>>> convert it into a byte array. I was thinking, if I have a JavaRDD<byte[]>,
>>>> can I call the following and get a DataFrame?
>>>>
>>>> DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd, Byte.class);
>>>>
>>>> Please guide; I am new to Spark. I have my own custom format, which is a
>>>> binary format, and I was thinking that if I can convert my custom format
>>>> into a DataFrame using binary operations, then I don't need to create my
>>>> own custom Hadoop input format. Am I on the right track? Will reading
>>>> binary data into a DataFrame scale?
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-DataFrame-from-a-binary-file-tp24179.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: user-help@spark.apache.org
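[Editor's sketch] On the `toArray` idea in the original question: `PortableDataStream.toArray()` returns the whole file as one `byte[]`, so that array still has to be split into records before it can become rows (and `createDataFrame(javaBinRdd, Byte.class)` would not do that). The splitting step itself is plain Java; a minimal sketch assuming the same hypothetical fixed-width records, with the Spark-side wiring (a `flatMap` over the values of the pair RDD) omitted:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordSplitter {
    // Split a whole-file byte[] (as returned by PortableDataStream.toArray())
    // into fixed-width records; a trailing partial record is dropped.
    static List<byte[]> split(byte[] file, int recordSize) {
        List<byte[]> records = new ArrayList<>();
        for (int off = 0; off + recordSize <= file.length; off += recordSize) {
            records.add(Arrays.copyOfRange(file, off, off + recordSize));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = new byte[35];         // 35 bytes = 2 full 16-byte records + 3 leftover
        List<byte[]> records = split(file, 16);
        System.out.println(records.size()); // prints 2
    }
}
```

Note the scalability caveat implied by this approach: `binaryFiles` materializes each whole file in memory on one executor, so it suits many small-to-medium files rather than one huge file.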