spark-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: Spark Security
Date Mon, 01 Jun 2020 13:59:27 GMT
spark_read_csv() does not read the file locally in R; again, it uses Spark
to read it.

If you are literally running Spark in local mode on your own machine, then
all of that happens on your machine via Spark, because the driver and
executors are one local process.
Otherwise, it runs wherever the Spark cluster is running: some machines
within your org, or in the cloud, or wherever it was started. In that case
the driver process would be running somewhere else too.
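
As a rough sketch of the distinction above (not from the original thread), the
master URL passed to spark_connect() is what decides where the read happens.
The helper names is_local_master() and read_via_spark(), the table name, and
the example host and paths are all hypothetical; the second function assumes
the sparklyr package and a Spark installation are available.

```r
# TRUE means local mode: the driver and executors run in a single process
# on the machine that calls spark_connect().
is_local_master <- function(master) {
  grepl("^local(\\[[^]]+\\])?$", master)
}

# Hypothetical helper: connect, read a delimited file through Spark, return
# the row count, and disconnect. The path is resolved by Spark, so it must be
# readable from wherever Spark runs, not necessarily from the R session.
read_via_spark <- function(master, path, delimiter = "\t") {
  where <- if (is_local_master(master)) "on this machine" else "on the cluster"
  message("Spark will read '", path, "' ", where)
  sc <- sparklyr::spark_connect(master = master)
  on.exit(sparklyr::spark_disconnect(sc))
  tbl <- sparklyr::spark_read_csv(sc, name = "test_data", path = path,
                                  delimiter = delimiter)
  sparklyr::sdf_nrow(tbl)
}

# Example calls (not run here; host and paths are placeholders):
# read_via_spark("local", "test.tsv")
# read_via_spark("spark://spark-master.example.com:7077", "hdfs:///data/test.tsv")
```

With master = "local" nothing leaves your machine; with a cluster URL the same
spark_read_csv() call runs on the cluster's machines.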

Yes, what is relevant is the network firewalls on the machines where Spark
runs (and potentially enabling auth in Spark itself).
Of course it also matters where the data is. Spark has nothing to say about
how the data is stored.



On Mon, Jun 1, 2020 at 7:20 AM Wilbert S. <wilbertseoane@gmail.com> wrote:

> Hello,
>
> This is what happens when I load the data using sparklyr::spark_read_csv()
> in R. It creates a "derby.log" file that says something along the lines of:
>
> Sun May 31 14:17:02 EDT 2020:
> Booting Derby version The Apache Software Foundation - Apache Derby -
> 10.12.1.1 - (1704137): instance xxxxxxx
> on database directory memory:C:\Users\wseoane\2020-05-31 sparklyr on three
> rows\databaseName=metastore_db with class loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$xxxxxxxxx
> Loaded from
> file:/C:/Users/wseoane/AppData/Local/spark/spark-2.4.3-bin-hadoop2.7/jars/derby-10.12.1.1.jar
> java.vendor=Oracle Corporation
> java.runtime.version=1.8.0_241-b07
> user.dir=C:\Users\wseoane\2020-05-31 sparklyr on three rows
> os.name=Windows 10
> os.arch=xxxxx
> os.version=10.0
> derby.system.home=null
> Database Class Loader started - derby.database.classpath=''
>
>
> I can then click to view details about the Spark connection in my browser
> while I have the Spark connection in sparklyr. Here are the results from a
> test .tsv file:
> Jobs:
> [image: Jobs 2020-05-31 142103.png]
> SQL:
> [image: SQL 2020-05-31 142217.png]
> Stages:
> [image: Stages 2020-05-31 142217.png]
> Storage:
> [image: Storage 2020-05-31 142217.png]
>
> So, since sparklyr::spark_read_csv() reads in the data locally and not in
> the cloud, security is determined by my company's IT department, correct
> (i.e. the firewalls that the IT department has in place in the network,
> the antivirus software they have installed on my computer, etc.)? If it
> were on the cloud, the cloud would need its own layer of security ("up to
> whoever runs the cluster"), but that is not relevant here since I am using
> sparklyr::spark_read_csv(), correct?
>
>
> Thanks,
>
> Wilbert Seoane
>
