sqoop-dev mailing list archives

From "Abraham Fine" <...@abrahamfine.com>
Subject Re: Review Request 42327: SQOOP-2788: Parquet support for HdfsConnector
Date Fri, 22 Jan 2016 01:04:32 GMT


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java, line 252
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204487#file1204487line252>
> >
> >     Why not use the existing Hadoop class? That's what we did for Kite: https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/filesystem/InputFormatReader.java#L69
> >     
> >     Then you don't need to include your own anonymous class. There's a bit of overhead to get the right constructor, but I think that's a better strategy.

I agree that the end result is prettier, but I'm not sure the added complexity is worth it in this case.

I would like to get a committer's view on this one.
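
For reference, my reading of the suggestion is something like the sketch below: construct the Hadoop InputFormat and let it hand back the RecordReader, instead of defining our own anonymous reader. All names and the overall shape are illustrative, not code from this patch:

```
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;
import org.apache.parquet.avro.AvroParquetInputFormat;

// Let the existing InputFormat build its own RecordReader rather than
// subclassing one anonymously.
void readSplit(Configuration conf, InputSplit split) throws Exception {
  TaskAttemptContext context = new TaskAttemptContextImpl(conf, new TaskAttemptID());
  AvroParquetInputFormat<GenericRecord> format = new AvroParquetInputFormat<GenericRecord>();
  RecordReader<Void, GenericRecord> reader = format.createRecordReader(split, context);
  reader.initialize(split, context);
  try {
    while (reader.nextKeyValue()) {
      GenericRecord record = reader.getCurrentValue();
      // hand the record to the extractor's data writer here
    }
  } finally {
    reader.close();
  }
}
```

The constructor overhead Ryan mentions is presumably the TaskAttemptContext construction, which differs across Hadoop versions.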


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/pom.xml, line 88
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204486#file1204486line88>
> >
> >     Does the exclusion from dependencyManagement apply here? I don't think you need both.

Forgot to remove that. Thanks!


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java, line 506
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204487#file1204487line506>
> >
> >     I think a better and cheaper check is to read the first 4 bytes (and/or the last 4 bytes) and make sure they are "PAR1". The first 4 and the last 4 for all valid parquet files contain that magic pattern.

we "could" run into an issue where we run into a non-parquet file that starts/ends with that
magic pattern right? isn't this approach more complete and will prevent false-positives?
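
For the archive, the check being discussed would look roughly like this minimal sketch (looksLikeParquet and its structure are mine, assuming an open FileSystem handle; this is not code from the patch):

```
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

private static final byte[] PARQUET_MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

// Hypothetical helper: true if the file both starts and ends with the
// 4-byte parquet magic "PAR1".
static boolean looksLikeParquet(FileSystem fs, Path path) throws IOException {
  long length = fs.getFileStatus(path).getLen();
  if (length < 8) {
    return false; // too short to hold both magic markers
  }
  byte[] head = new byte[4];
  byte[] tail = new byte[4];
  try (FSDataInputStream in = fs.open(path)) {
    in.readFully(0, head);          // positioned read of the first 4 bytes
    in.readFully(length - 4, tail); // positioned read of the last 4 bytes
  }
  return Arrays.equals(head, PARQUET_MAGIC) && Arrays.equals(tail, PARQUET_MAGIC);
}
```

Even with both ends checked, a non-parquet file that happens to carry those bytes at both ends would still slip through, which is the false-positive concern above.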


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java, line 53
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204492#file1204492line53>
> >
> >     Does this make the default UNCOMPRESSED? I think I'd rather see SNAPPY by default.

For our other formats we default to not compressing. I think it may be strange for one format to use compression while the others do not. Do you feel strongly about this?


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java, line 58
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204492#file1204492line58>
> >
> >     Why use the OutputFormat and RecordWriter here instead of an AvroParquetWriter? That is more straightforward.

Good idea. Thanks!
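
For anyone following along, the AvroParquetWriter route would look roughly like this (the codec and size arguments are placeholders, not necessarily what the updated patch will use):

```
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Write avro GenericRecords straight to a parquet file; no
// OutputFormat/RecordWriter plumbing required.
void writeRecords(Path destination, Schema schema, Iterable<GenericRecord> records)
    throws Exception {
  AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(
      destination,
      schema,
      CompressionCodecName.UNCOMPRESSED,  // our current default, per the thread above
      ParquetWriter.DEFAULT_BLOCK_SIZE,
      ParquetWriter.DEFAULT_PAGE_SIZE);
  try {
    for (GenericRecord record : records) {
      writer.write(record);
    }
  } finally {
    writer.close();
  }
}
```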


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java, line 165
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204495#file1204495line165>
> >
> >     Should this be assertTrue instead of assert?

Thanks!
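
(Worth noting for the archive: a bare assert statement is a no-op unless the JVM runs with -ea, while the test framework's assertTrue always executes. The condition below is made up for illustration:)

```
assert rowCount == 4;                        // silently skipped without -ea
org.testng.Assert.assertTrue(rowCount == 4); // always evaluated
```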


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > test/src/test/java/org/apache/sqoop/integration/connector/hdfs/ParquetTest.java, line 96
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204500#file1204500line96>
> >
> >     Why is this a multiset?

It is true that setLines does not necessarily need to be a multiset. I made it one to match the assertMapReduceOutput method from HdfsAsserts, which inspired this chunk of code. Even though we do not have any duplicates in our output here, I think the multiset reflects the constraints we are testing for (unordered, may contain duplicates), so I kept it.
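
To make the constraint concrete, a multiset comparison ignores order but still counts duplicates, e.g. with Guava (the values here are made up):

```
import static org.testng.Assert.assertEquals;

import java.util.Arrays;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

// Same elements with the same multiplicities, in a different order: equal.
Multiset<String> expected = HashMultiset.create(Arrays.asList("a", "b", "b"));
Multiset<String> actual = HashMultiset.create(Arrays.asList("b", "a", "b"));
assertEquals(actual, expected); // passes; would fail if a duplicate went missing
```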


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java, line 197
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204495#file1204495line197>
> >
> >     Why is this value changed?

Reverted.


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java, line 188
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204495#file1204495line188>
> >
> >     Why is default compression not supported?

Looking at Parquet's CompressionCodecName enum:

```
public enum CompressionCodecName {
  UNCOMPRESSED(null, CompressionCodec.UNCOMPRESSED, ""),
  SNAPPY("parquet.hadoop.codec.SnappyCodec", CompressionCodec.SNAPPY, ".snappy"),
  GZIP("org.apache.hadoop.io.compress.GzipCodec", CompressionCodec.GZIP, ".gz"),
  LZO("com.hadoop.compression.lzo.LzoCodec", CompressionCodec.LZO, ".lzo");
```


ToCompression.Default maps to org.apache.hadoop.io.compress.DefaultCodec, which I believe is zlib; zlib does not appear in the list above, so it is not supported here.
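
Put differently, the writer-side mapping has nowhere to send the default codec. A hedged sketch of the idea (the ToCompression constant names follow this thread and may not match the patch exactly):

```
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Hypothetical mapping from Sqoop's compression option onto parquet's
// codec enum; Default (DefaultCodec, i.e. zlib) has no counterpart.
static CompressionCodecName toParquetCodec(ToCompression compression) {
  switch (compression) {
    case SNAPPY:
      return CompressionCodecName.SNAPPY;
    case GZIP:
      return CompressionCodecName.GZIP;
    case NONE:
      return CompressionCodecName.UNCOMPRESSED;
    default: // DEFAULT and anything else parquet cannot express
      throw new UnsupportedOperationException(
          "Parquet has no codec for " + compression);
  }
}
```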


> On Jan. 21, 2016, 10:49 p.m., Ryan Blue wrote:
> > pom.xml, line 720
> > <https://reviews.apache.org/r/42327/diff/8/?file=1204498#file1204498line720>
> >
> >     Why exclude Avro? The dependency version should be managed by maven to match the version depended on by Sqoop directly.

This is an issue that I believe is related to our connector classloader.

Not excluding Avro causes the following exception:

```
java.lang.LinkageError: loader constraint violation: when resolving method "org.apache.sqoop.connector.idf.AVROIntermediateDataFormat.getAvroSchema()Lorg/apache/avro/Schema;"
the class loader (instance of org/apache/sqoop/classloader/ConnectorClassLoader) of the current
class, org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter, and the class loader
(instance of sun/misc/Launcher$AppClassLoader) for the method's defining class, org/apache/sqoop/connector/idf/AVROIntermediateDataFormat,
have different Class objects for the type org/apache/avro/Schema used in the signature
	at org.apache.sqoop.connector.hdfs.hdfsWriter.HdfsParquetWriter.initialize(HdfsParquetWriter.java:60)
	at org.apache.sqoop.connector.hdfs.HdfsLoader$1.run(HdfsLoader.java:95)
	at org.apache.sqoop.connector.hdfs.HdfsLoader$1.run(HdfsLoader.java:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.sqoop.connector.hdfs.HdfsLoader.load(HdfsLoader.java:61)
	at org.apache.sqoop.connector.hdfs.HdfsLoader.load(HdfsLoader.java:46)
	at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread$1.call(SqoopOutputFormatLoadExecutor.java:276)
	at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread$1.call(SqoopOutputFormatLoadExecutor.java:257)
	at org.apache.sqoop.utils.ClassUtils.executeWithClassLoader(ClassUtils.java:281)
	at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread.run(SqoopOutputFormatLoadExecutor.java:256)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

This occurs within the individual mappers: as the LinkageError describes, the connector classloader and the application classloader each end up with their own org.apache.avro.Schema class, and the JVM treats those as distinct types.
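
A minimal demonstration of the underlying mechanism, independent of Sqoop (the jar path is hypothetical):

```
import java.net.URL;
import java.net.URLClassLoader;

public class LoaderDemo {
  public static void main(String[] args) throws Exception {
    // Same class bytes loaded by two different loaders yield two distinct
    // Class objects, which is exactly what the LinkageError above reports.
    URL[] urls = new URL[] { new URL("file:/path/to/avro.jar") }; // hypothetical path
    ClassLoader isolated = new URLClassLoader(urls, null);        // null parent: no delegation
    Class<?> fromIsolated = isolated.loadClass("org.apache.avro.Schema");
    Class<?> fromApp = Class.forName("org.apache.avro.Schema");
    System.out.println(fromIsolated == fromApp); // false
  }
}
```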


- Abraham


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42327/#review115691
-----------------------------------------------------------


On Jan. 21, 2016, 7:06 p.m., Abraham Fine wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42327/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 7:06 p.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Bugs: SQOOP-2788
>     https://issues.apache.org/jira/browse/SQOOP-2788
> 
> 
> Repository: sqoop-sqoop2
> 
> 
> Description
> -------
> 
> read and write parquet in hdfsconnector
> 
> 
> Diffs
> -----
> 
>   connector/connector-hdfs/pom.xml 599631418ca63cc43d645c1ee1e7a73dc824b313 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java
a813c479a07d68e14ed49936f642e762e5b37437 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsLoader.java
774221aaf5c8cdb8d26ca108fae239598b42229b 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsPartition.java
644de60581faf90ceb2fcef8d3e0544067791fcc 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/configuration/ToFormat.java
27d121f529ecb4d5bd79e2b8c74ab8f7cc15fb10 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/GenericHdfsWriter.java
2ccccc4a94a582c8b47ccdefa523d1fd1632e627 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java
PRE-CREATION 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsSequenceWriter.java
75c2e7ef192d7d9628e622cc3c5ef176e33a73d0 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsTextWriter.java
78cf9732fdb89689b04d43e4af70ca5a43732dbf 
>   connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java
11fcef2a38209c79928f582cf8aa03e889247f22 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/common/SqoopAvroUtils.java
985149cbb0d28b55a19d17076d996364d7f2ae90 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/AVROIntermediateDataFormat.java
d78fa8b72ecfe62eeec240e01597e7f2a7e4dd76 
>   pom.xml cb8a973abc96af1de905cebd80d30177cbaf1cb4 
>   test/pom.xml 644a9c7dbc746d4a3268532bdcf0babd4faaafba 
>   test/src/test/java/org/apache/sqoop/integration/connector/hdfs/ParquetTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/42327/diff/
> 
> 
> Testing
> -------
> 
> integration tests pass
> 
> 
> Thanks,
> 
> Abraham Fine
> 
>

