
Using SparkR


Getting Started

Prerequisites: a Spark installation with SparkR support (the examples below use the Spark 2.x API).

Launch the SparkR shell with an additional package, the Spark ECS library:

$ cd $SPARK_HOME
$ bin/sparkR --packages com.emc.ecs:spark-ecs-s3_2.11:1.4.1

Usage

The Spark ECS library works with SparkR by providing a data source that produces a SparkDataFrame. There are a few ways to create a data frame for a given bucket, discussed in the following sections.

Using read.df

The read.df function reads a dataset from a data source into a SparkDataFrame. Use the s3 data source to access a bucket's metadata. For example:

# create a SparkDataFrame for the bucket
df <- read.df(NULL, "s3", bucket = "<name>", endpoint = "http://...", identity = "...", secretKey = "...")

# print the schema of the frame for illustration purposes
printSchema(df)

# filter the bucket to show objects with metadata 'myinteger' greater than 0
showDF(filter(df, "myinteger > 0"))
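
The result of filter is itself a SparkDataFrame, so the usual SparkR operations apply. As a minimal sketch, assuming the bucket exposes the same user metadata field 'myinteger' used above:

# project just the metadata column of interest
showDF(select(df, "myinteger"))

# count the objects matching the filter
count(filter(df, df$myinteger > 0))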

More information is available in the SparkR documentation.

Using Spark SQL

Spark SQL supports DDL statements that register a temporary view backed by a data source. For example, the following registers a bucket as a view named bucket1; the view may then be used in subsequent SQL statements.

sql("
  CREATE TEMPORARY VIEW bucket1
  USING s3
  OPTIONS (
    bucket '...',
    endpoint '...',
    identity '...',
    secretKey '...'
  )")

df <- sql("SELECT * FROM bucket1 WHERE myinteger > 0")
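
The query returns another SparkDataFrame. As a sketch, a small result set may be brought into a local R data.frame for inspection:

# materialize the (small) result set locally
local <- collect(df)
head(local)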

The tableToDF function may be used to create a SparkDataFrame directly from a table name:

df <- tableToDF("bucket1")
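
The returned frame behaves like any other SparkDataFrame; for example, to preview a few rows:

# show the first ten objects from the registered bucket
showDF(limit(df, 10))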

Conversely, the createOrReplaceTempView function may be used to register an already-defined SparkDataFrame as a temporary view.

createOrReplaceTempView(df, "bucket1")
sql("SELECT * FROM bucket1 WHERE ...")