Skip to content
/ splink Public

A streamlined workflow for probabilistic entity resolution using Splink.

Notifications You must be signed in to change notification settings

enginux/splink

Repository files navigation

πŸš€ Splink Workflow Guide

Welcome to the Splink Workflow Guide!

This repository provides a structured approach to entity resolution using Splink, enabling efficient record linkage and deduplication at scale.

πŸ“Œ Overview

Splink is a Python package for probabilistic record linkage (entity resolution) that allows you to deduplicate and link records from datasets without unique identifiers.

This workflow demonstrates best practices to:

  • Clean and preprocess data for entity resolution.
  • Configure Splink for optimal matching.
  • Perform scalable and efficient record linkage.
  • Evaluate and fine-tune results for accuracy.

πŸ—οΈ Installation

To get started, install the required dependencies with my chosen backend - spark:

!pip install 'splink[spark]'

🚦 Getting Started

Follow these steps to use this workflow:

  1. Prepare your data – Ensure datasets are cleaned and formatted properly.
  2. Define linkage rules – Configure comparison and scoring functions.
  3. Run Splink – Execute entity resolution with your chosen backend (Spark, DuckDB, etc.).
  4. Evaluate results – Analyze and refine matches for accuracy.

πŸš€ Quickstart

To get a basic Splink model up and running, use the following code. It demonstrates how to:

Estimate the parameters of a deduplication model Use the parameter estimates to identify duplicate records Use clustering to generate an estimated unique person ID.

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ]
)

linker = Linker(df, settings, db_api)

linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")],
    recall=0.7,
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("email"))

pairwise_predictions = linker.inference.predict(threshold_match_weight=-5)

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.95
)

df_clusters = clusters.as_pandas_dataframe(limit=10)

πŸ”§ Configuration

You can tweak the following settings for better results:

  • Blocking rules: Define rules to limit the number of comparisons.
  • Scoring functions: Adjust parameters for similarity calculations.
  • Threshold tuning: Optimize match acceptance criteria.

πŸ“‚ Project Structure

πŸ“ splink/
β”œβ”€β”€ πŸ“œ README.md
β”œβ”€β”€ πŸ“¦ requirements.txt
β”œβ”€β”€ πŸ“ data/         # Sample datasets
β”œβ”€β”€ πŸ“ notebooks/    # Jupyter notebooks  
β”‚   β”œβ”€β”€ πŸ“ tutorials/  # Step-by-step guides and explanations  
β”‚   β”œβ”€β”€ πŸ“ examples/   # Practical use cases and sample workflows  
β”œβ”€β”€ πŸ“ scripts/      # Python scripts for automation  
└── πŸ“ results/      # Output match results  

🎯 Use Cases

This workflow is useful for:

  • Customer data deduplication
  • Fraud detection through identity matching
  • Merging datasets across different sources
  • etc. imagination is the limit

🀝 Contributing

Contributions are welcome! Feel free to submit issues, suggestions, or pull requests.


Happy Linking! πŸš€

About

A streamlined workflow for probabilistic entity resolution using Splink.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published