Sparpy: A Spark entrypoint for Python

Changelog

v0.5.5

  • Added --proxy option to set a proxy for accessing Python package repositories.

v0.5.4

  • Added plugin-env section to the configuration file, making it possible to set environment variables for the plugin download process.
  • Added --plugin-env option (and its associated environment variable SPARPY_PLUGIN_ENVVARS) to set environment variables for the plugin download process. This can be necessary in some cases when using conda environments (see the sketch after this list).
  • Added environment variable SPARPY_CONFIG for the --config option.
  • Added environment variable SPARPY_DEBUG for the --debug option.
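
For example, a proxy needed only while plugins are being downloaded might be passed like this (a sketch: the KEY=VALUE syntax for --plugin-env is an assumption based on the option's description, not confirmed syntax):

$ sparpy --plugin-env http_proxy=http://proxy.example.com:3128 \
         --plugin "mypackage>=0.1" my_plugin_command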

v0.5.3

  • Fix isparpy.

v0.5.2

  • Fixed a bug that ignored all packages when the exclude packages list was empty.

v0.5.1

  • Fix Python package regex.
  • Fix download script.

v0.5.0

  • Added --exclude-python-packages option to exclude Python packages (see the sketch after this list).
  • Better parsing of plugin names.
  • Added --exclude-packages option to exclude Spark packages.
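
A sketch combining both exclude options (the argument formats, a Python package name and Maven coordinates, are assumptions based on the option descriptions):

$ sparpy --exclude-python-packages pyspark \
         --exclude-packages com.example:artifact \
         --plugin "mypackage>=0.1" my_plugin_command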

v0.4.5

  • Fixed isparpy entry point; the --class parameter is now allowed.
  • Allow setting constraints files.

v0.4.4

  • Don't set default values for master and deploy-mode.

v0.4.3

  • Fix sparpy-submit entrypoint.
  • Fix --property-file option.
  • Fix --class option.

v0.4.2

  • Environment variables can now be used for most options.

v0.4.1

  • Support setting pip options via Spark configuration using --conf sparpy.config-key=value, allowing sparpy-submit to be used in EMR-on-EKS images.
  • Allow --class, also for using sparpy-submit in EMR-on-EKS images.
  • Allow --property-file, also for using sparpy-submit in EMR-on-EKS images (see the sketch after this list).
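
A sketch of a sparpy-submit invocation in that style (only the --conf sparpy.config-key=value pattern comes from the changelog; the concrete key sparpy.extra-index-urls mirrors the [plugins] configuration key shown later and is an assumption, as are the --class and --property-file values):

$ sparpy-submit --conf sparpy.extra-index-urls=https://my-pypi-repository.com/simple \
                --class com.example.Main \
                --property-file /path/to/spark.properties \
                my_plugin_command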

v0.4.0

  • Added --pre option to allow pre-release packages.
  • Added --env option to set environment variables for the Spark process.
  • Added spark-env config section to set environment variables for the Spark process.
  • Write pip output when pip fails.
  • Fixed problems with interactive sparpy.
  • Fixed no-self option in config file.
  • Allow plugins that don't use click: they must be callable with a single argument of type Sequence[str] that receives their command-line arguments (see the sketch after this list).
  • Added --version option to print the Sparpy version.
  • Fixed an error raised when a plugin requires a package that is already installed but whose version does not satisfy the requirement.
  • Sparpy no longer prints an error traceback when a subprocess fails.
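
A minimal sketch of such a click-free plugin (the function name is illustrative; it is registered through the same sparpy.cli_plugins entry point described below):

from typing import Sequence

def my_command(args: Sequence[str]) -> None:
    """A Sparpy plugin that does not use click: any callable
    taking the raw argument list works."""
    print(f'my_command called with arguments: {list(args)}')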

v0.3.0

  • Enable --force-download option.
  • Added --find-links option to use a directory as a package repository.
  • Added --no-index option to avoid using external package repositories (see the sketch after this list).
  • Added --queue option to set the YARN queue.
  • Ensure the driver's Python executable is the same Python as Sparpy's.
  • Added new entry point sparpy-download to download packages to a specific directory.
  • Added new entry point isparpy to start an interactive session.
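
A sketch resolving plugins only from a local directory, combining options that appear elsewhere in this README (combining them like this is an assumption):

$ sparpy --no-index --find-links /path/to/dir/with/packages \
         --plugin "mypackage>=0.1" my_plugin_command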

v0.2.1

  • Force the PySpark Python executable to be the same as Sparpy's.
  • Fix unrecognized options.
  • Fix default configuration file names.

v0.2.0

  • Added configuration file option.
  • Added --debug option.

How to build a Sparpy plugin

In your package's setup.py, an entry point should be configured for Sparpy:

setup(
    name='yourpackage',
    ...

    entry_points={
        ...
        'sparpy.cli_plugins': [
            'my_command_1=yourpackage.module:command_1',
            'my_command_2=yourpackage.module:command_2',
        ]
    }
)
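
The referenced callables are typically click commands. A minimal sketch of yourpackage/module.py (the --myparam option is illustrative, matching the usage example below):

import click

@click.command()
@click.option('--myparam', type=int, default=0, help='Example option.')
def command_1(myparam):
    """Example Sparpy plugin command."""
    click.echo(f'command_1 running with myparam={myparam}')

@click.command()
def command_2():
    """A second example command."""
    click.echo('command_2 running')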

Note

Avoid using PySpark as a requirement so that the package is not downloaded from PyPI.

Install

It must be installed on a Spark edge node.

$ pip install sparpy[base]

How to use

Using default Spark submit parameters:

$ sparpy --plugin "mypackage>=0.1" my_plugin_command --myparam 1
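
An interactive session can be started the same way through isparpy (a sketch: the isparpy entry point comes from the changelog above, but whether it accepts exactly these options is an assumption):

$ isparpy --plugin "mypackage>=0.1"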

Configuration files

sparpy and sparpy-submit accept the --config parameter, which sets a configuration file. If it is not given, Sparpy tries $HOME/.sparpyrc; if that does not exist, it falls back to /etc/sparpy.conf.

Format:

[spark]

master=yarn
deploy-mode=client

queue=my_queue

spark-executable=/path/to/my-spark-submit
conf=
    spark.conf.1=value1
    spark.conf.2=value2

packages=
    maven:package_1:0.1.1
    maven:package_2:0.6.1

repositories=
    https://my-maven-repository-1.com/mvn
    https://my-maven-repository-2.com/mvn

reqs_paths=
    /path/to/dir/with/python/packages_1
    /path/to/dir/with/python/packages_2

[spark-env]

MY_ENV_VAR=value

[plugins]

extra-index-urls=
    https://my-pypi-repository-1.com/simple
    https://my-pypi-repository-2.com/simple

cache-dir=/path/to/cache/dir

plugins=
    my-package1
    my-package2==0.1.2

requirements-files=
    /path/to/requirement-1.txt
    /path/to/requirement-2.txt

find-links=
    /path/to/directory/with/packages_1
    /path/to/directory/with/packages_2

download-dir-prefix=my_prefix_

no-index=false
no-self=false
force-download=true

[plugin-env]

MY_ENV_VAR=value

[interactive]

pyspark-executable=/path/to/pyspark
python-interactive-driver=/path/to/interactive/driver
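
Most options can also be driven by environment variables (see v0.4.2 and v0.5.4 above). A sketch (the variable names come from the changelog; the value 1 for SPARPY_DEBUG is an assumption):

$ export SPARPY_CONFIG=/path/to/custom/sparpy.conf
$ export SPARPY_DEBUG=1
$ sparpy my_plugin_command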
