spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Local Spark talking to remote HDFS?
Date Tue, 25 Aug 2015 17:11:16 GMT
I wouldn't try to play with forwarding & tunnelling; it's always hard to work out which ports
get used everywhere, and the services expect the hostname they know themselves by to match
the one in your URLs and paths.

Can't you just set up an entry in the Windows hosts file
(C:\Windows\System32\drivers\etc\hosts)? It's what I do (on Unix, in /etc/hosts) to
talk to VMs
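
For example, a minimal sketch (hypothetical address and hostname; find the guest's real
address with ifconfig on the guest, and use whatever hostname the VM reports):

  # C:\Windows\System32\drivers\etc\hosts on the Windows host
  192.168.56.101   sandbox.hortonworks.com

  // then address HDFS by that name from the local Spark app:
  val words = sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/people.txt")
  words.count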


> On 25 Aug 2015, at 04:49, Dino Fancellu <dino@felstar.com> wrote:
> 
> Tried adding 50010, 50020 and 50090. Still no difference.
> 
> I can't imagine I'm the only person on the planet wanting to do this.
> 
> Anyway, thanks for trying to help.
> 
> Dino.
> 
> On 25 August 2015 at 08:22, Roberto Congiu <roberto.congiu@gmail.com> wrote:
>> Port 8020 is not the only port you need tunnelled for HDFS to work. If you
>> only list the contents of a directory, port 8020 is enough... for instance,
>> using something like
>> 
>> // a directory listing is pure metadata, so it only talks to the NameNode on 8020
>> val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/")
>> val fs = p.getFileSystem(sc.hadoopConfiguration)
>> fs.listStatus(p)
>> 
>> you should see the file list.
>> But then, when accessing a file, the client needs to actually fetch its
>> blocks, so it has to connect directly to the DataNode.
>> The error 'could not obtain block' means it can't get that block from the
>> DataNode.
>> Refer to
>> http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.2.1/bk_reference/content/reference_chap2_1.html
>> to see the complete list of ports that also need to be tunnelled.
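>>
>> A minimal sketch of the difference (assuming the spark-shell, so sc exists,
>> and the /tmp/people.txt path from your example):
>>
>> val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/tmp/people.txt")
>> val fs = p.getFileSystem(sc.hadoopConfiguration)
>> fs.listStatus(p.getParent)    // metadata only: NameNode on 8020, works through the tunnel
>> val in = fs.open(p)
>> in.read(new Array[Byte](16))  // block read: must reach a DataNode (50010 by default)
>> in.close()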
>> 
>> 
>> 
>> 2015-08-24 13:10 GMT-07:00 Dino Fancellu <dino@felstar.com>:
>>> 
>>> Changing the IP to the guest IP address just never connects.
>>> 
>>> The VM has port tunnelling, and it passes all the main ports, 8020
>>> included, through to the host.
>>> 
>>> You can tell that it was talking to the guest VM before, simply
>>> because it reported when a file was not found.
>>> 
>>> Error is:
>>> 
>>> Exception in thread "main" org.apache.spark.SparkException: Job
>>> aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most
>>> recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
>>> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
>>> BP-452094660-10.0.2.15-1437494483194:blk_1073742905_2098
>>> file=/tmp/people.txt
>>> 
>>> but I have no idea what it means by that. It certainly can find the
>>> file and knows it exists.
>>> 
>>> 
>>> 
>>> On 24 August 2015 at 20:43, Roberto Congiu <roberto.congiu@gmail.com>
>>> wrote:
>>>> When you launch your HDP guest VM, it most likely gets launched with
>>>> NAT and an address on a private network (192.168.x.x), so on your
>>>> Windows host you should use that address (you can find it with ifconfig
>>>> on the guest OS).
>>>> I usually add an entry to my /etc/hosts for VMs that I use often... if
>>>> you use Vagrant, there's also a Vagrant plugin that can do that
>>>> automatically.
>>>> Also, I am not sure how the default HDP VM is set up, that is, whether
>>>> it binds HDFS only to 127.0.0.1 or to all addresses. You can check that
>>>> with netstat -a.
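>>>>
>>>> For instance, a quick check on the guest (assuming the stock HDP ports,
>>>> 8020 for the NameNode and 50010 for the DataNode):
>>>>
>>>>   netstat -an | grep -E ':(8020|50010)'
>>>>
>>>> A local address of 0.0.0.0:8020 means it listens on all interfaces;
>>>> 127.0.0.1:8020 means loopback only, i.e. not reachable from outside the
>>>> VM.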
>>>> 
>>>> R.
>>>> 
>>>> 2015-08-24 11:46 GMT-07:00 Dino Fancellu <dino@felstar.com>:
>>>>> 
>>>>> I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM.
>>>>> 
>>>>> If I go into the guest spark-shell and refer to the file thus, it works
>>>>> fine
>>>>> 
>>>>>  val words=sc.textFile("hdfs:///tmp/people.txt")
>>>>>  words.count
>>>>> 
>>>>> However, if I try to access it from a local Spark app on my Windows
>>>>> host, it doesn't work:
>>>>> 
>>>>>  val conf = new SparkConf().setMaster("local").setAppName("My App")
>>>>>  val sc = new SparkContext(conf)
>>>>> 
>>>>>  val words=sc.textFile("hdfs://localhost:8020/tmp/people.txt")
>>>>>  words.count
>>>>> 
>>>>> Emits
>>>>> 
>>>>>  org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
>>>>>  BP-452094660-10.0.2.15-1437494483194:blk_1073742905_2098
>>>>>  file=/tmp/people.txt
>>>>> 
>>>>> The port 8020 is open, and if I choose the wrong file name, it will
>>>>> tell me:
>>>>> 
>>>>> 
>>>>> 
>>>>> My pom has
>>>>> 
>>>>>   <dependency>
>>>>>     <groupId>org.apache.spark</groupId>
>>>>>     <artifactId>spark-core_2.11</artifactId>
>>>>>     <version>1.4.1</version>
>>>>>     <scope>provided</scope>
>>>>>   </dependency>
>>>>> 
>>>>> Am I doing something wrong?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

