"Mohair" is a project to prototype the use of Substrait and Arrow to push down partial queries to remote storage. Initially, this project will be largely inspired by Skytether, an extension of SkyhookDM for single-cell gene expression analysis and computational storage. However, work on Skytether began shortly before Arrow Flight, Acero, and Substrait were started. As a result, much of Skytether is low-level and specific, whereas Mohair will try to be higher-level and more generic.
The overall goal is to be able to delegate part of a query plan to remote storage, execute the remaining query plan on the intermediate results (from remote storage), and then return the final results to the client. It is expected that this will require some amount of:
- splitting a query plan
- implementing a flight service
- communicating with a remote flight service
- communicating with flight services using substrait
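As a rough illustration of the "splitting a query plan" item, here is a minimal sketch in plain Python. It does not use Substrait or Arrow, and `PlanNode`, `split_plan`, and the `remote_result[...]` placeholder are hypothetical names invented for this example, not part of Mohair. It assumes a subtree can be pushed down only when every operator in it is supported by remote storage.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class PlanNode:
    """A toy logical operator (hypothetical; not a Substrait relation)."""
    op: str
    inputs: List["PlanNode"] = field(default_factory=list)


def split_plan(root: PlanNode, pushable: Set[str]) -> Tuple[PlanNode, List[PlanNode]]:
    """Cut out maximal subtrees that consist only of pushable operators.

    Returns the rewritten local plan, in which each cut subtree is replaced
    by a "remote_result[i]" placeholder, and the list of cut subtrees.
    """
    pushed: List[PlanNode] = []

    def fully_pushable(node: PlanNode) -> bool:
        return node.op in pushable and all(fully_pushable(c) for c in node.inputs)

    def rewrite(node: PlanNode) -> PlanNode:
        if fully_pushable(node):
            pushed.append(node)
            return PlanNode(f"remote_result[{len(pushed) - 1}]")
        return PlanNode(node.op, [rewrite(c) for c in node.inputs])

    return rewrite(root), pushed


# aggregate <- filter <- read, where storage is assumed to support read and filter
plan = PlanNode("aggregate", [PlanNode("filter", [PlanNode("read")])])
local_plan, pushed = split_plan(plan, {"read", "filter"})
```

Here `pushed[0]` is the `filter <- read` subtree that would be sent to remote storage, and `local_plan` is the `aggregate` operator reading from the placeholder.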
To informally track progress, we list some milestones here (to be updated as appropriate).
- Execute a single query without splitting
  - Submit query as a substrait plan
  - Submit query to a flight service
  - Implement flight service as a very simple data server (probably using a file system for now)
- Execute a single query with a simple split
  - Submit query as a substrait plan
  - Split plan into 2 pieces
  - Execute the 2nd piece on the intermediate results of the 1st piece
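The last milestone can be pictured with a small, self-contained sketch. It models a plan as an ordered list of Python functions over rows; this is an assumption made purely for illustration (real plans would be Substrait messages executed by an engine), and all operator names and data are made up.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]
Operator = Callable[[List[Row]], List[Row]]


def run(ops: List[Operator], rows: List[Row]) -> List[Row]:
    """Apply each operator, in order, to the rows."""
    for op in ops:
        rows = op(rows)
    return rows


# Hypothetical operators standing in for plan relations.
def keep_large(rows: List[Row]) -> List[Row]:
    return [r for r in rows if r["count"] > 10]


def gene_and_count(rows: List[Row]) -> List[Row]:
    return [{"gene": r["gene"], "count": r["count"]} for r in rows]


def total_count(rows: List[Row]) -> List[Row]:
    return [{"total": sum(r["count"] for r in rows)}]


table = [
    {"gene": 1, "cell": 1, "count": 5},
    {"gene": 2, "cell": 1, "count": 20},
    {"gene": 3, "cell": 2, "count": 15},
]
plan = [keep_large, gene_and_count, total_count]

# Simple split: the 1st piece would run on remote storage, the 2nd piece locally.
first_piece, second_piece = plan[:2], plan[2:]
intermediate = run(first_piece, table)
final = run(second_piece, intermediate)
```

A useful sanity check for any split is that `final` equals the result of running the unsplit `plan` on the same `table`.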
This section is a bootstrap guide for trying out the code in this repository. Here, I will try to highlight the types of interactions I am trying to support and where in the code they are implemented (so that it's possible to further explore the code if you're interested).
Note that I may have forgotten to include some steps necessary for installation. If this is the case, let me know or file an issue in the mohair issue tracker.
The C++ code in this repository depends on Arrow, Substrait, and DuckDB. I am trying to simplify installation of dependencies, but for now this is only done for macOS using Homebrew.
I created a Homebrew tap, which is located at drin/homebrew-hatchery:
# Opening my tap is optional
brew tap drin/hatchery
brew install apache-arrow-substrait
# In case my tap isn't tapped
# brew install drin/hatchery/apache-arrow-substrait
# and then the other formulas
brew install duckdb-substrait
# this is not yet working
# brew install skytether-mohair
To build C++ code, I use meson. To manage Python code, I use poetry.
To build the C++ code:
brew install meson ninja git-lfs
git clone https://github.com/drin/mohair.git
pushd mohair
# Optional: these submodules are only needed for regenerating protobuf code
# git submodule init -- submodules/substrait-proto
# git submodule update -- submodules/substrait-proto
# git submodule init -- submodules/mohair-proto
# git submodule update -- submodules/mohair-proto
# NOTE: to regenerate, refer to the `Compiling Protobuf Wrappers` section
# Optional: git-lfs is only really needed for getting examples and such
# git lfs install --local
# git lfs pull
# "build-dir" is the name I use for my build directory
meson setup build-dir
meson compile -C build-dir
Although the formula itself doesn't work (I'm not yet sure why), the formula logic should be helpful as a reference for what commands to use: drin/homebrew-hatchery/skytether-mohair.
To build the Python code:
brew install poetry
# poetry commands assume you're in the repository root
poetry install
Full instructions for compiling the protobuf wrappers will be written at a future date (it shouldn't be important now).
The short story is:
buf generate --template buf.gen.yaml submodules/substrait-proto
buf generate --template buf.gen.yaml submodules/mohair-proto
# NOTE: the fixes below don't accommodate the extensions.proto includes because of the
# extra nesting
# NOTE: `(grep ...)` is fish-shell command substitution; in bash or zsh, use `$(grep ...)`
# For fixing the includes in the C++ code
# sed -i '' 's/include ["]substrait[/]/include "..\/substrait\//' (grep -Rl "include \"substrait" src/cpp/query/substrait/)
# sed -i '' 's/include ["]mohair[/]/include "..\/mohair\//' (grep -Rl "include \"mohair" src/cpp/query/mohair/)
# For fixing the imports in the python code
# sed -i '' 's/from substrait/from mohair.substrait/' (grep -Rl "from substrait" src/python/mohair/substrait/)
# sed -i '' 's/from substrait/from mohair.substrait/' (grep -Rl "from substrait" src/python/mohair/mohair/)
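For the Python import fix, a shell-independent alternative to the commented `sed` commands could look like the sketch below. The function name `prefix_imports` is hypothetical and not part of this repository. Unlike the `sed` version, it only rewrites `from` statements at the start of a line and only touches `*.py` files.

```python
import re
from pathlib import Path


def prefix_imports(root: Path, package: str, prefix: str) -> int:
    """Rewrite `from <package>...` to `from <prefix>.<package>...` under root.

    Returns the number of files that were changed.
    """
    pattern = re.compile(rf"^(\s*from\s+){re.escape(package)}\b", re.MULTILINE)
    changed = 0
    for path in root.rglob("*.py"):
        text = path.read_text()
        updated = pattern.sub(rf"\g<1>{prefix}.{package}", text)
        if updated != text:
            path.write_text(updated)
            changed += 1
    return changed
```

Assuming the directory layout used in the commands above, a call such as `prefix_imports(Path("src/python/mohair/substrait"), "substrait", "mohair")` would play the role of the first Python `sed` command. Running it a second time changes nothing, because already-prefixed imports no longer match the pattern.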