
deepseek-ai/smallpond

Latest commit: 52ecc5e · Mar 5, 2025


smallpond


A lightweight data processing framework built on DuckDB and 3FS.

Features

  • 🚀 High-performance data processing powered by DuckDB
  • 🌍 Scalable to handle PB-scale datasets
  • 🛠️ Easy operations with no long-running services

Installation

Python 3.8 to 3.12 is supported.

pip install smallpond

Quick Start

# Download example data
wget https://duckdb.org/data/prices.parquet

import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")

# Show results
print(df.to_pandas())
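
The Quick Start hash-partitions rows by ticker before running a per-partition GROUP BY. The sketch below illustrates, in plain Python and without smallpond, why that ordering matters: once every row for a given ticker sits in the same partition, the per-partition results can simply be combined to give the global answer. The helper names (`hash_partition`, `min_max_by_ticker`), the sample rows, and the use of `zlib.crc32` as the hash are assumptions made for illustration; this is not how smallpond or DuckDB implement partitioning.

```python
import zlib


def hash_partition(rows, npartitions, key):
    """Assign each row to a partition by hashing its key column (illustrative only)."""
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        idx = zlib.crc32(row[key].encode("utf-8")) % npartitions
        partitions[idx].append(row)
    return partitions


def min_max_by_ticker(rows):
    """Plain-Python analogue of:
    SELECT ticker, min(price), max(price) FROM rows GROUP BY ticker
    """
    stats = {}
    for row in rows:
        price = row["price"]
        lo, hi = stats.get(row["ticker"], (price, price))
        stats[row["ticker"]] = (min(lo, price), max(hi, price))
    return stats


# Hypothetical sample data standing in for prices.parquet.
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 20.0},
    {"ticker": "AAA", "price": 12.5},
    {"ticker": "CCC", "price": 5.0},
    {"ticker": "BBB", "price": 18.0},
    {"ticker": "DDD", "price": 7.5},
    {"ticker": "AAA", "price": 9.0},
    {"ticker": "DDD", "price": 8.0},
]

partitions = hash_partition(rows, 3, key="ticker")

# Aggregate each partition independently, then merge the results.
# No ticker spans two partitions, so a plain dict update is enough.
combined = {}
for part in partitions:
    combined.update(min_max_by_ticker(part))

print(combined == min_max_by_ticker(rows))
```

If the rows were split arbitrarily instead of by ticker, a ticker could appear in several partitions and the per-partition minima and maxima would need a second merge step.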

Documentation

For detailed guides and the API reference, see the project documentation.

Performance

We evaluated smallpond using the GraySort benchmark (script) on a cluster of 50 compute nodes and 25 storage nodes running 3FS. The benchmark sorted 110.5 TiB of data in 30 minutes and 14 seconds, for an average throughput of 3.66 TiB/min.

Details can be found in 3FS - Gray Sort.
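
As a quick sanity check on the figures above, the average throughput can be recomputed from the quoted data size and wall-clock time. The per-node number in this sketch is a derived estimate that assumes the work was spread evenly over the 50 compute nodes; it is not a figure reported by the benchmark. Because the inputs are rounded, the recomputed value presumably differs slightly from the quoted 3.66 TiB/min.

```python
# Figures quoted in the Performance section.
data_tib = 110.5             # data sorted, in TiB
elapsed_min = 30 + 14 / 60   # 30 minutes 14 seconds, in minutes
compute_nodes = 50

# Average cluster throughput in TiB per minute.
throughput_tib_per_min = data_tib / elapsed_min

# Derived estimate (assumes an even split across compute nodes), in GiB per minute.
per_node_gib_per_min = throughput_tib_per_min * 1024 / compute_nodes

print(f"cluster: {throughput_tib_per_min:.3f} TiB/min")
print(f"per compute node: {per_node_gib_per_min:.1f} GiB/min")
```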

Development

pip install .[dev]

# run unit tests
pytest -v tests/test*.py

# build documentation
pip install .[docs]
cd docs
make html
python -m http.server --directory build/html

License

This project is licensed under the MIT License.
