This is Apache Spark with modifications to run security-sensitive code inside Intel SGX enclaves. The implementation leverages sgx-lkl, a library OS that allows running Java-based applications inside SGX enclaves.
This guide shows how to run Sgx-Spark in a few simple steps using Docker. Most of the setup and deployment is wrapped in Docker containers, so compilation and deployment should be smooth.
Clone this Sgx-Spark repository
Build the Sgx-Spark base image. The resulting Docker image is named sgxspark (the name passed to -t below). This process might take a while (30-60 mins):
sgx-spark/dockerfiles$ docker build -t sgxspark .
Prepare the disk image required by sgx-lkl. Due to restrictions of Docker, this step cannot currently be performed as part of the above Docker build process and is therefore platform-dependent. The process has been tested successfully on Ubuntu 16.04 and Arch Linux:
sgx-spark/lkl$ make prepare-image
Create a Docker network that the containers will use to communicate. On a user-defined network, Docker provides an embedded DNS server, so workers can find the Spark master by name.
sgx-spark$ docker network create sgxsparknet
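Running `docker network create` a second time for a network that already exists fails, so a setup script may want to check first. The following is a minimal sketch, assuming the network name sgxsparknet used above; the function name `ensure_network` and the `DOCKER` override variable are illustrative and not part of Sgx-Spark.

```shell
#!/bin/sh
# Hypothetical helper (not part of Sgx-Spark): create a Docker network only if
# it does not exist yet. DOCKER can be overridden to point at another binary.
DOCKER="${DOCKER:-docker}"

ensure_network() {
    net="$1"
    # `docker network inspect` exits non-zero when the network does not exist.
    if "$DOCKER" network inspect "$net" >/dev/null 2>&1; then
        echo "network $net already exists"
    else
        "$DOCKER" network create "$net" >/dev/null && echo "network $net created"
    fi
}
```

Calling `ensure_network sgxsparknet` then has the same effect as the command above on a fresh host and does nothing on a host where the network is already present.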
From within directory sgx-spark/dockerfiles, run the Sgx-Spark master node, the Sgx-Spark worker node, and the actual Sgx-Spark job as follows.
Run the Sgx-Spark master node:
sgx-spark/dockerfiles$ docker run \
  --user user \
  --env-file $(pwd)/docker-env \
  --net sgxsparknet \
  --name sgxspark-docker-master \
  -p 7077:7077 \
  -p 8082:8082 \
  -ti sgxspark /sgx-spark/
Run the Sgx-Spark worker node:
sgx-spark/dockerfiles$ docker run \
  --user user \
  --memory="4g" \
  --shm-size="8g" \
  --env-file $(pwd)/docker-env \
  --net sgxsparknet \
  --privileged \
  -v $(pwd)/../lkl:/spark-image:ro \
  -ti sgxspark /sgx-spark/
Run the Sgx-Spark job as follows.
As of writing, the three jobs below are known to be fully supported:
sgx-spark/dockerfiles$ docker run \
  --user user \
  --memory="4g" \
  --shm-size="8g" \
  --env-file $(pwd)/docker-env \
  --net sgxsparknet \
  --privileged \
  -v $(pwd)/../lkl:/spark-image:ro \
  -e SPARK_JOB_CLASS=org.apache.spark.examples.MyWordCount \
  -e SPARK_JOB_NAME=WordCount \
  -e \
  -e SPARK_JOB_ARG1=output \
  -ti sgxspark /sgx-spark/
sgx-spark/dockerfiles$ docker run \
  --user user \
  --memory="4g" \
  --shm-size="8g" \
  --env-file $(pwd)/docker-env \
  --net sgxsparknet \
  --privileged \
  -v $(pwd)/../lkl:/spark-image:ro \
  -e SPARK_JOB_CLASS=org.apache.spark.examples.mllib.KMeansExample \
  -e SPARK_JOB_NAME=KMeans \
  -e SPARK_JOB_ARG0=data/mllib/kmeans_data.txt \
  -ti sgxspark /sgx-spark/
sgx-spark/dockerfiles$ docker run \
  --user user \
  --memory="4g" \
  --shm-size="8g" \
  --env-file $(pwd)/docker-env \
  --net sgxsparknet \
  --privileged \
  -v $(pwd)/../lkl:/spark-image:ro \
  -e SPARK_JOB_CLASS=org.apache.spark.examples.LineCount \
  -e SPARK_JOB_NAME=LineCount \
  -e \
  -ti sgxspark /sgx-spark/
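The three job commands share every flag except the SPARK_JOB_* variables, so they can be generated from one template. The sketch below only prints the command line rather than running it. The function name `job_cmd` is hypothetical, the numbering of arguments as SPARK_JOB_ARG0, SPARK_JOB_ARG1, ... is inferred from the examples above, and the in-container command must be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper (not part of Sgx-Spark): print a `docker run` command for
# a job. Usage: job_cmd <job class> <job name> <in-container command> [args...]
# Intended to be called from within sgx-spark/dockerfiles.
job_cmd() {
    class="$1"; name="$2"; entry="$3"
    shift 3
    printf 'docker run --user user --memory=4g --shm-size=8g'
    printf ' --env-file %s/docker-env --net sgxsparknet --privileged' "$PWD"
    printf ' -v %s/../lkl:/spark-image:ro' "$PWD"
    printf ' -e SPARK_JOB_CLASS=%s -e SPARK_JOB_NAME=%s' "$class" "$name"
    # Remaining arguments become SPARK_JOB_ARG0, SPARK_JOB_ARG1, ...
    i=0
    for arg in "$@"; do
        printf ' -e SPARK_JOB_ARG%d=%s' "$i" "$arg"
        i=$((i + 1))
    done
    printf ' -ti sgxspark %s\n' "$entry"
}
```

Printing instead of executing keeps the helper safe to experiment with; the output can be inspected and then pasted into a shell.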
To run Sgx-Spark natively, proceed as follows.
Install all required dependencies. For Ubuntu 16.04, these can be installed as follows:
$ sudo apt-get update
$ sudo apt-get install -y --no-install-recommends scala libtool autoconf curl xutils-dev git build-essential maven openjdk-8-jdk ssh bc python autogen wget autotools-dev sudo automake
Hadoop, and thus Spark, depends on Google Protocol Buffers (GPB) in version 2.5.0:
Make sure to uninstall any other versions of GPB
Install GPB v2.5.0. Instructions for Ubuntu 16.04 are as follows:
$ cd /tmp
/tmp$ wget
/tmp$ tar xvf protobuf-2.5.0.tar.gz
/tmp$ cd protobuf-2.5.0
/tmp/protobuf-2.5.0$ ./ && ./configure && make && sudo make install
/tmp/protobuf-2.5.0$ sudo apt-get install -y --no-install-recommends libprotoc-dev
Instructions for Arch Linux are available at
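Because other GPB versions must not be present, it can help to confirm which `protoc` ends up on the PATH after installation. A minimal sketch, relying on `protoc --version` printing a line of the form `libprotoc <version>`; the function names are illustrative only.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Sgx-Spark).

# Succeeds only if the given `protoc --version` output reports version 2.5.0.
is_protobuf_2_5_0() {
    [ "$1" = "libprotoc 2.5.0" ]
}

# Checks the protoc found on PATH and reports the result.
check_protoc() {
    if ! command -v protoc >/dev/null 2>&1; then
        echo "protoc not found on PATH"
        return 1
    fi
    version_line="$(protoc --version)"
    if is_protobuf_2_5_0 "$version_line"; then
        echo "OK: $version_line"
    else
        echo "unexpected version: $version_line"
        return 1
    fi
}
```

Running `check_protoc` before starting the Maven build catches a stray newer GPB installation early.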
As Sgx-Spark uses sgx-lkl, the latter must be downloaded and compiled successfully first. As of this writing (June 14, 2018), it should be compiled using branch cleanup-musl. Please follow the documentation of sgx-lkl and ensure that your installation of sgx-lkl executes simple Java applications successfully.
sgx-spark$ build/mvn -DskipTests package
As part of this compilation process, a modified Hadoop library has been compiled. Copy the Hadoop JAR file into the Sgx-Spark jars directory:
sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
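If the Maven build failed part-way, the JAR may be missing or stale, and a plain `cp` error is easy to overlook in a longer script. The sketch below wraps the copy with explicit checks. The paths are the ones from the command above; the function name `copy_hadoop_jar` and its optional checkout-root argument are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Sgx-Spark): copy the rebuilt Hadoop JAR into
# the Sgx-Spark jars directory, failing loudly if either side is missing.
HADOOP_JAR_REL="hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar"
SPARK_JARS_REL="assembly/target/scala-2.11/jars"

copy_hadoop_jar() {
    root="${1:-.}"   # path to the sgx-spark checkout, defaults to the current directory
    src="$root/$HADOOP_JAR_REL"
    dst="$root/$SPARK_JARS_REL"
    if [ ! -f "$src" ]; then
        echo "missing $src (did the Maven build succeed?)" >&2
        return 1
    fi
    if [ ! -d "$dst" ]; then
        echo "missing directory $dst" >&2
        return 1
    fi
    cp "$src" "$dst/"
}
```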
Sgx-Spark ships with a native C library that enables shared-memory-based communication between two JVMs. Compile it as follows:
sgx-spark/C$ make install
Adjust file
for your environment: Variable
must point to your sgx-lkl
directory (see Prerequisites).
Build the Sgx-Spark disk image required for sgx-lkl:
sgx-spark/lkl$ make clean all
Finally, we are ready to run (i) the Sgx-Spark master node, (ii) the Sgx-Spark worker node, (iii) the worker's enclave, (iv) the Sgx-Spark client, and (v) the client's enclave. In the following commands, replace <hostname> with the master node's actual hostname and <sgx-lkl> with the path to your sgx-lkl directory.
Note: After running each example, make sure to (i) restart all processes and (ii) delete all shared memory files in /dev/shm.
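Cleaning /dev/shm by hand between runs is easy to forget, so it may be worth scripting. The sketch below removes only regular files matching a caller-supplied glob. The function name `clean_shm` is hypothetical, and the glob is an assumption: the actual shared memory file names depend on your configuration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Sgx-Spark): remove regular files matching a
# glob from a directory and report how many were removed.
# Usage: clean_shm <directory> <glob>
clean_shm() {
    dir="$1"
    pattern="$2"
    removed=0
    # $pattern is intentionally unquoted so that the shell expands the glob.
    for f in "$dir"/$pattern; do
        # Skips the literal pattern when nothing matches, and non-regular files.
        [ -f "$f" ] || continue
        rm -f -- "$f"
        removed=$((removed + 1))
    done
    echo "removed $removed file(s) from $dir"
}
```

For example, `clean_shm /dev/shm 'my-shmem-*'` would delete files with that (made-up) prefix and leave everything else in /dev/shm untouched, which is safer than wiping the whole directory on a shared machine.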
If you run all the nodes locally, you need to add the following line to
Run the Master node
sgx-spark$ ./
Run the Worker node
sgx-spark$ ./
Run the enclave for the Worker node
sgx-spark$ ./
Run the enclave for the driver program. This is the process that will output the job results!
sgx-spark$ ./
Finally, submit a Spark job. The result will be output in the process we started just before.
sgx-spark$ ./
sgx-spark$ ./
sgx-spark$ ./
In order to run the above installation without SGX, start your environment as follows:
Start the Master node as above
Start the Worker node as above, but change environment variable
Do not start the enclaves
Submit the Spark job as above, but change environment variable
There are a few important things to keep in mind when developing Sgx-Spark:
Whenever you change parts of the code, you must recompile the Spark code:
sgx-spark$ mvn package -DskipTests
In some situations that are hard to pin down, the above command did not recompile all of the changed files. In this case, issue:
sgx-spark$ mvn clean package -DskipTests
After making changes to the Sgx-Spark code and after compiling the Java/Scala code (see above), you always need to rebuild the lkl image that will be used by sgx-lkl:
sgx-spark/lkl$ make clean all
If you changed parts of the Hadoop code (in directory
), you will also need to copy the resulting JAR file:
sgx-spark$ cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar assembly/target/scala-2.11/jars/
Lastly, do not forget to remove all related shared memory files in
before running your next experiment!
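The rebuild steps listed above have to happen in a fixed order, which makes them a natural fit for a small script. The following is a sketch only: the function names and the DRY_RUN switch are assumptions, the JAR copy is performed unconditionally (it is only required after Hadoop changes), and `make -C lkl clean all` stands in for running `make clean all` inside sgx-spark/lkl.

```shell
#!/bin/sh
# Hypothetical helper (not part of Sgx-Spark). Run from the sgx-spark checkout.
# With DRY_RUN set to a non-empty value, commands are printed instead of run.
DRY_RUN="${DRY_RUN:-}"

run() {
    if [ -n "$DRY_RUN" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_all() {
    # 1. Recompile the Java/Scala code.
    run mvn package -DskipTests || return 1
    # 2. Copy the Hadoop JAR (only strictly needed if Hadoop code changed).
    run cp hadoop-2.6.5-src/hadoop-common-project/hadoop-common/target/hadoop-common-2.6.5.jar \
        assembly/target/scala-2.11/jars/ || return 1
    # 3. Rebuild the lkl disk image used by sgx-lkl.
    run make -C lkl clean all || return 1
}
```

Setting `DRY_RUN=1` before calling `rebuild_all` prints the three commands, which is a quick way to review them before a long build.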
Development with sgx-lkl can be tedious. For development purposes, a special flag allows running the
enclave side of Sgx-Spark in a regular JVM rather than on top of sgx-lkl. To make use of this feature,
run the enclave JVMs using scripts
Under the hood, these scripts set environment variable DEBUG_IS_ENCLAVE_REAL=false (defaults to true) and
provide the JVM with a value for environment variable SGXLKL_SHMEM_FILE. Note that the value of SGXLKL_SHMEM_FILE
must be the same as the one provided for the corresponding Worker (
) and Driver (