Using SparkR
Prerequisites:
- R (download instructions)
- Spark 2.x
- EMC ECS 3.x
Launch the SparkR shell with an additional package, the Spark ECS library:
$ cd $SPARK_HOME
$ bin/sparkR --packages com.emc.ecs:spark-ecs-s3_2.11:1.4.1
The Spark ECS library works with SparkR by providing a data source that produces a SparkDataFrame. There are a few ways to create a data frame for a given bucket, discussed in the following sections.
The read.df function reads a dataset from a data source into a SparkDataFrame. Use the s3 data source to access a bucket's metadata. For example:
# Create a SparkDataFrame for the bucket
df <- read.df(NULL, "s3", bucket = "<name>", endpoint = "http://...", identity = "...", secretKey = "...")
# Print the schema of the frame for illustration purposes
printSchema(df)
# Filter the bucket to show objects with metadata 'myinteger' greater than 0
showDF(filter(df, "myinteger > 0"))
More information is available in the SparkR documentation.
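Once created, the frame supports the usual SparkR operations. A minimal sketch, assuming `df` is the frame created above and that the bucket's objects carry the searchable `myinteger` metadata key from the example:

```r
# Count objects whose 'myinteger' metadata exceeds zero
n <- count(filter(df, "myinteger > 0"))

# Bring a small sample back into a local R data.frame
local <- collect(limit(df, 10))
head(local)
```

Note that collect() materializes the results on the driver, so limit the frame first when the bucket holds many objects.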
Spark SQL supports DDL commands to register a temporary table backed by a data source. For example, the statement below registers a bucket as a table called bucket1. The table may then be used in subsequent SQL statements.
sql("
CREATE TEMPORARY VIEW bucket1
USING s3
OPTIONS (
bucket '...',
endpoint '...',
identity '...',
secretKey '...'
)")
df <- sql("SELECT * FROM bucket1 WHERE myinteger > 0")
The tableToDF function may be used to create a SparkDataFrame directly from a table name:
df <- tableToDF("bucket1")
Conversely, the createOrReplaceTempView function may be used to register an already-defined SparkDataFrame as a temporary view.
createOrReplaceTempView(df, "bucket1")
sql("SELECT * FROM bucket1 WHERE ...")
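Putting the pieces together, a complete session might look like the sketch below. The bucket name, endpoint, and credentials are placeholder values to be replaced with your own, and `myinteger` is assumed to be a searchable metadata key on the bucket:

```r
# Create a SparkDataFrame over the bucket's metadata
# (bucket, endpoint, and credentials are placeholders)
df <- read.df(NULL, "s3",
              bucket = "mybucket",
              endpoint = "http://ecs.example.com:9020",
              identity = "myuser",
              secretKey = "mysecret")

# Register the frame as a temporary view and query it with SQL
createOrReplaceTempView(df, "bucket1")
result <- sql("SELECT * FROM bucket1 WHERE myinteger > 0")
showDF(result)
```

This pattern (read.df followed by createOrReplaceTempView) is equivalent to the CREATE TEMPORARY VIEW statement shown earlier; choose whichever fits your workflow.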