nifi-users mailing list archives

From Austin Heyne <ahe...@ccri.com>
Subject Re: GetHDFS from Azure Blob
Date Tue, 28 Mar 2017 22:11:22 GMT
Thanks Bryan,

We're only working with one account here, but with multiple root-level 
containers, e.g.

wasb://csv@accountName.blob.core.windows.net/
wasb://xml@accountName.blob.core.windows.net/
wasb://json@accountName.blob.core.windows.net/

What stands out to me the most is why the defaultFS would need to be 
set at all if we're always providing complete wasb://... paths. 
It almost seems like a bug or an oversight.
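To illustrate the point: a fully qualified wasb URI already names its filesystem completely, so in principle nothing should need to fall back to the defaultFS. A quick sketch of the URI anatomy (plain Python, just for illustration; this is not Hadoop's actual resolution code):

```python
from urllib.parse import urlparse

def split_wasb(uri):
    # A wasb URI's authority is <container>@<account>.blob.core.windows.net,
    # so the filesystem is fully identified without any defaultFS fallback.
    parsed = urlparse(uri)
    container, _, host = parsed.netloc.partition("@")
    account = host.split(".")[0]
    return container, account, parsed.path or "/"

# The three root-level containers above, all on one storage account:
for uri in ("wasb://csv@accountName.blob.core.windows.net/",
            "wasb://xml@accountName.blob.core.windows.net/",
            "wasb://json@accountName.blob.core.windows.net/"):
    print(split_wasb(uri))
```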

If anyone has any input on how we could work around this please let me know.

Thanks for your help,
Austin

On 03/28/2017 04:39 PM, Bryan Bende wrote:
> Austin,
>
> I think you are correct that it's <containername>@<accountname>, I
> hadn't looked at this config in a long time and was reading too
> quickly before :)
>
> That would line up with the other property
> fs.azure.account.key.<accountname>.blob.core.windows.net where you
> specify the key for that account.
>
> I have no idea if this will work, but let's say you had three different
> WASB file systems, presumably each with their own account name and
> key, you might be able to define these in core-site.xml:
>
>   <property>
>     <name>fs.azure.account.key.ACCOUNT1.blob.core.windows.net</name>
>     <value>KEY1</value>
>   </property>
>
>   <property>
>     <name>fs.azure.account.key.ACCOUNT2.blob.core.windows.net</name>
>     <value>KEY2</value>
>   </property>
>
>   <property>
>     <name>fs.azure.account.key.ACCOUNT3.blob.core.windows.net</name>
>     <value>KEY3</value>
>   </property>
>
> Then in your HDFS processor in NiFi you point at this core-site.xml
> and use a specific directory like
> wasb://container@ACCOUNT3.blob.core.windows.net/<path> and I'm hoping
> it would know how to use the key for ACCOUNT3.
>
> Not really sure if that helps your situation.
>
> -Bryan
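To make the lookup convention in Bryan's suggestion concrete: the key property name is derived purely from the account host in the URI, so each fully qualified path selects its own key. A rough illustration in Python — the property-name convention is from the hadoop-azure docs; the helper itself is hypothetical:

```python
from urllib.parse import urlparse

def account_key_property(wasb_uri):
    # Per the fs.azure.account.key.<account host> convention, this is the
    # core-site.xml property holding the key for this URI's account.
    host = urlparse(wasb_uri).netloc.partition("@")[2]
    return "fs.azure.account.key." + host

print(account_key_property("wasb://container@ACCOUNT3.blob.core.windows.net/some/path"))
# -> fs.azure.account.key.ACCOUNT3.blob.core.windows.net
```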
>
>
> On Tue, Mar 28, 2017 at 4:14 PM, Austin Heyne <aheyne@ccri.com> wrote:
>> Bryan,
>>
>> So I initially didn't think much of it (assumed it was a typo, etc.),
>> but you've said that the access url for wasb that you've been using is
>> wasb://YOUR_USER@YOUR_HOST/. However, this has never worked for us and
>> I'm wondering if we have a different configuration somewhere. What we
>> have to use is
>> wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>,
>> which seems to be in line with the Azure blob storage GUI and is what
>> is outlined here [1a][1b]. Is there some other way this connector is
>> being set up? It would make much more sense using your access pattern,
>> as then each container wouldn't need to have its own core-site.xml.
>>
>> Thanks,
>> Austin
>>
>> [1a]
>> https://hadoop.apache.org/docs/current/hadoop-azure/index.html#Accessing_wasb_URLs
>> [1b]
>> https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage
>>
>>
>>
>>
>> On 03/28/2017 03:55 PM, Bryan Bende wrote:
>>> Austin,
>>>
>>> I believe the default FS is only used when you write to a path that
>>> doesn't specify the filesystem. Meaning, if you set the directory of
>>> PutHDFS to /data then it will use the default FS, but if you specify
>>> wasb://user@wasb2/data then it will go to /data in a different
>>> filesystem.
>>>
>>> The problem here is that I don't see a way to specify different keys
>>> for each WASB filesystem in the core-site.xml.
>>>
>>> Admittedly I have never tried to setup something like this with many
>>> different filesystems.
>>>
>>> -Bryan
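Bryan's description of the default FS matches the usual resolution rule: a bare path falls back to fs.defaultFS, while a fully qualified URI names its own filesystem. A toy sketch of that fallback (illustrative only, not Hadoop's implementation; the defaultFS value here is made up):

```python
from urllib.parse import urlparse

DEFAULT_FS = "hdfs://namenode:8020"  # hypothetical fs.defaultFS value

def resolve_filesystem(path):
    # A bare path like /data has no scheme and falls back to the default
    # FS; a fully qualified URI carries its own scheme and authority.
    parsed = urlparse(path)
    if not parsed.scheme:
        return DEFAULT_FS
    return f"{parsed.scheme}://{parsed.netloc}"

print(resolve_filesystem("/data"))
print(resolve_filesystem("wasb://csv@account.blob.core.windows.net/data"))
```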
>>>
>>>
>>> On Tue, Mar 28, 2017 at 3:50 PM, Austin Heyne <aheyne@ccri.com> wrote:
>>>> Hi Andre,
>>>>
>>>> Yes, I'm aware of that configuration property; it's what I have been
>>>> using to set the core-site.xml and hdfs-site.xml. For testing this I
>>>> didn't modify the core-site located in the HADOOP_CONF_DIR, but
>>>> rather copied and modified it and then pointed the processor to the
>>>> copy. The problem with this is that we'll end up with a large number
>>>> of core-site.xml copies that will all have to be maintained
>>>> separately. Ideally we'd be able to specify the defaultFS in the
>>>> processor config, or have the processor behave like the hdfs command
>>>> line tools, which don't require the defaultFS to be set to a wasb url
>>>> in order to use wasb urls.
>>>>
>>>> The key idea here is long-term maintainability and using Ambari to
>>>> maintain the configuration. If we need to change any other setting in
>>>> the core-site.xml, we'd have to change it in a bunch of different
>>>> files manually.
>>>>
>>>> Thanks,
>>>> Austin
>>>>
>>>>
>>>> On 03/28/2017 03:34 PM, Andre wrote:
>>>>
>>>> Austin,
>>>>
Perhaps that wasn't explicit, but the settings don't need to be
system-wide; instead, the defaultFS may be changed just for a particular
processor, while the others may use the default configuration.
>>>>
The *HDFS processor documentation mentions it allows you to set
particular hadoop configurations:

"A file or comma separated list of files which contains the Hadoop file
system configuration. Without this, Hadoop will search the classpath
for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a
default configuration"
>>>>
>>>> Have you tried using this field to point to a file as described by Bryan?
>>>>
>>>> Cheers
>>>>
>>>> On 29 Mar 2017 05:21, "Austin Heyne" <aheyne@ccri.com> wrote:
>>>>
>>>> Thanks Bryan,
>>>>
Working with the configuration you sent, what I needed to change was
the fs.defaultFS, setting it to the wasb url that we're working from.
Unfortunately this is a less than ideal solution, since we'll be
pulling files from multiple wasb urls and ingesting them into an
Accumulo datastore. I'm pretty certain changing the defaultFS would
mess with our local HDFS/Accumulo install. In addition, we're trying to
maintain all of this configuration with Ambari, which from what I can
tell only supports one core-site configuration file.

Is the only solution here to maintain multiple core-site.xml files, or
is there another way we can configure this?
>>>>
>>>> Thanks,
>>>>
>>>> Austin
>>>>
>>>>
>>>>
>>>> On 03/28/2017 01:41 PM, Bryan Bende wrote:
>>>>> Austin,
>>>>>
>>>>> Can you provide the full error message and stack trace for the
>>>>> IllegalArgumentException from nifi-app.log?
>>>>>
>>>>> When you start the processor it creates a FileSystem instance based on
>>>>> the config files provided to the processor, which in turn causes all
>>>>> of the corresponding classes to load.
>>>>>
>>>>> I'm not that familiar with Azure, but if "Azure blob store" is WASB,
>>>>> then I have successfully done the following...
>>>>>
>>>>> In core-site.xml:
>>>>>
>>>>> <configuration>
>>>>>
>>>>>        <property>
>>>>>          <name>fs.defaultFS</name>
>>>>>          <value>wasb://YOUR_USER@YOUR_HOST/</value>
>>>>>        </property>
>>>>>
>>>>>        <property>
>>>>>          <name>fs.azure.account.key.nifi.blob.core.windows.net</name>
>>>>>          <value>YOUR_KEY</value>
>>>>>        </property>
>>>>>
>>>>>        <property>
>>>>>          <name>fs.AbstractFileSystem.wasb.impl</name>
>>>>>          <value>org.apache.hadoop.fs.azure.Wasb</value>
>>>>>        </property>
>>>>>
>>>>>        <property>
>>>>>          <name>fs.wasb.impl</name>
>>>>>          <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
>>>>>        </property>
>>>>>
>>>>>        <property>
>>>>>          <name>fs.azure.skip.metrics</name>
>>>>>          <value>true</value>
>>>>>        </property>
>>>>>
>>>>> </configuration>
>>>>>
>>>>> In Additional Resources property of an HDFS processor, point to a
>>>>> directory with:
>>>>>
>>>>> azure-storage-2.0.0.jar
>>>>> commons-codec-1.6.jar
>>>>> commons-lang3-3.3.2.jar
>>>>> commons-logging-1.1.1.jar
>>>>> guava-11.0.2.jar
>>>>> hadoop-azure-2.7.3.jar
>>>>> httpclient-4.2.5.jar
>>>>> httpcore-4.2.4.jar
>>>>> jackson-core-2.2.3.jar
>>>>> jsr305-1.3.9.jar
>>>>> slf4j-api-1.7.5.jar
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Bryan
>>>>>
>>>>>
>>>>> On Tue, Mar 28, 2017 at 1:15 PM, Austin Heyne <aheyne@ccri.com> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Thanks for all the help you've given me so far. Today I'm trying to
>>>>>> pull files from an Azure blob store. I've done some reading on
>>>>>> this, and from previous tickets [1] and guides [2] it seems the
>>>>>> recommended approach is to place the jars required for the HDFS
>>>>>> Azure protocol in 'Additional Classpath Resources' and the hadoop
>>>>>> core-site and hdfs-site configs into the 'Hadoop Configuration
>>>>>> Resources'. I have my local HDFS properly configured to access wasb
>>>>>> urls; I'm able to ls, copy to and from, etc. without problem. Using
>>>>>> the same HDFS config files, and trying both all the jars in my
>>>>>> hadoop-client/lib directory (HDP) and the jars recommended in [1],
>>>>>> I'm still seeing the "java.lang.IllegalArgumentException: Wrong FS: "
>>>>>> error in my NiFi logs and am unable to pull files from Azure blob
>>>>>> storage.
>>>>>>
>>>>>> Interestingly, it seems the processor is spinning up way too fast:
>>>>>> the errors appear in the log as soon as I start the processor. I'm
>>>>>> not sure how it could be loading all of those jars that quickly.
>>>>>>
>>>>>> Does anyone have any experience with this or recommendations to try?
>>>>>>
>>>>>> Thanks,
>>>>>> Austin
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/NIFI-1922
>>>>>> [2] https://community.hortonworks.com/articles/71916/connecting-to-azure-data-lake-from-a-nifi-dataflow.html
>>>>

