url-snapshotter
is a powerful command-line tool built in Python for monitoring the health and content of web services over time. It works by creating "snapshots" that capture both the HTTP status codes and the content of specified URLs at particular moments. This tool allows you to create, view, and compare these snapshots to efficiently identify changes, regressions, or anomalies in your services.
With url-snapshotter
, you can ensure the reliability of your web platforms by regularly auditing the status and content of your URLs. Whether you want to track updates, verify expected changes after deployments, or spot unexpected issues, url-snapshotter
helps you gain a deeper understanding of how your services evolve and ensures they function as expected.
- url-snapshotter
-
Pre-Maintenance Verification for Web Services: Ensure smooth upgrades and maintenance by using
url-snapshotter
to validate your services before and after any infrastructure changes. Whether you're managing websites through Kubernetes (like with an Ingress Controller), or handling broader infrastructure updates such as cluster upgrades,url-snapshotter
helps ensure your services remain consistent and reliable throughout the maintenance process.- Create a pre-upgrade snapshot to capture the baseline state of your services, including HTTP status codes and content.
- After the upgrade, take another snapshot and use
url-snapshotter
to compare both snapshots side-by-side, allowing you to quickly identify any discrepancies. - Instantly detect potential issues to confirm successful maintenance, reducing the risk of unnoticed regressions or service outages.
-
Periodic Monitoring and Change Tracking: Proactively monitor the state of your web services over time by using
url-snapshotter
to regularly capture snapshots of key URLs. This approach is invaluable for identifying unexpected issues, verifying deployments, and validating intended changes in your applications.- Schedule periodic snapshots to keep an evolving record of the content and status codes of your services.
- Track unexpected changes, such as unauthorized modifications or emerging errors, and verify that new deployments function as intended.
- Validate and document changes over time to ensure alignment with your development and deployment goals, ultimately maintaining service quality.
-
Efficient Asynchronous URL Fetching: Speed up the process of monitoring large sets of URLs using asynchronous requests to fetch content concurrently.
url-snapshotter
makes use of async I/O to handle multiple URLs at once, resulting in significantly improved performance, especially when dealing with long URL lists. -
Comprehensive Snapshot Comparison: Easily compare two snapshots to identify any changes in HTTP status codes or page content. Detect regression, monitor modifications, or spot unexpected issues in your web services. By highlighting differences between snapshots,
url-snapshotter
helps you understand what has changed at a glance, empowering you to take corrective actions swiftly if necessary. -
Customizable Content Cleaning with Patterns: Automatically clean up dynamic elements that frequently change, such as CSRF tokens, script nonces, or session identifiers, using custom regex patterns. This ensures that the comparisons are focused only on meaningful content changes. You can add, modify, or remove cleaning patterns to suit the specific needs of your web applications.
-
Interactive Dynamic Snapshot Viewing: View the details of any previously captured snapshot directly from the command line, including HTTP status codes and content hashes. The dynamic viewing capability helps you quickly verify the current state of URLs or investigate individual snapshots without having to dig through raw data.
-
Adjustable Concurrency Control: Set the level of concurrency during URL fetching to suit your needs—whether you want rapid fetching for small lists or more controlled fetching to reduce the load on your servers.
-
Intuitive CLI Integration: Execute all operations from a single command-line interface (
url-snapshotter
). Whether you're creating a new snapshot, comparing snapshots, or viewing snapshot details, the easy-to-use CLI streamlines these tasks.
The easiest way to run url-snapshotter
is with Docker. This avoids the need to set up Python or dependencies on your machine:
docker run -it --rm -v $(pwd):/app -w /app python:3.12-slim-bookworm sh -c "pip install --no-cache-dir -r requirements.txt && python -m url_snapshotter.cli"
This command:
- Uses a Python Docker image
- Mounts the current directory to the Docker container.
- Installs dependencies and starts the
url-snapshotter
CLI prompts.
The tool can be run using Poetry, which is used to manage dependencies and the virtual environment. Once installed, you can run url-snapshotter
with:
url-snapshotter [OPTIONS]
Alternatively, via Python (requires dependencies installed and Python 3.12+):
python -m url_snapshotter.cli [OPTIONS]
create
: Create a new snapshot of URLs from a file.view
: View the details of an existing snapshot.compare
: Compare two snapshots to see any differences.-f, --file [PATH]
: Specify the file containing URLs (one URL per line).-c, --concurrent [NUM]
: Number of concurrent requests for async fetching (default is 4).--debug
: Enable debug logging for more details.
Note that if you don't give any options, by default the interactive CLI will start.
The examples assume you have Poetry setup properly and added to your path.
-
Create a Snapshot
url-snapshotter create -f urls.txt
-
Compare Two Snapshots
url-snapshotter compare
-
View a Snapshot
url-snapshotter view --snapshot-id 1
-
Adjust Concurrent Requests
url-snapshotter create -f urls.txt -c 10
The urls.txt
file (or any name you choose) is a simple text file containing the list of URLs you want to monitor. Each URL should be on a separate line without extra spaces or characters. Example:
https://example.com
https://another-example.com
https://api.example.com/subpage/endpoint
If you're planning to contribute or extend url-snapshotter
, the best way to set up your environment is using pyenv and Poetry.
-
Install Pyenv Follow the instructions in the pyenv GitHub repository.
-
Install the Correct Python Version Use pyenv to install the version specified in the
pyproject.toml
:pyenv install 3.12.7
-
Install Poetry Install Poetry for dependency management:
curl -sSL https://install.python-poetry.org | python3 -
-
Create a Virtual Environment and Install Dependencies
pyenv virtualenv 3.12.7 url-snapshotter-env pyenv activate url-snapshotter-env poetry install
-
Run the Application To run the CLI directly after installing the dependencies:
poetry run url-snapshotter
The patterns.py
file contains regex patterns that are used to clean up dynamic elements from the content of the fetched URLs. This is crucial for removing elements that may change frequently (e.g., CSRF tokens, timestamps, session identifiers), allowing you to focus only on meaningful content changes.
Each pattern in patterns.py
consists of a regex pattern and an optional message to describe what is being cleaned up.
To add a new pattern, simply append it to the list in the get_patterns
function in url_snapshotter/patterns.py
. Each entry should be a dictionary containing a compiled regex pattern and an optional message for logging purposes.
For example, to add a new pattern to remove a dynamic authentication token, you can add:
{
"pattern": re.compile(r'auth_token="[^"]+"'),
"message": "Authentication token detected and removed."
}
Make sure the regex patterns are well-tested to avoid removing unintended content.
This assumes you use Contour with httpproxy
crds and also the regular ingress
objects.
(
# Get HTTPProxies URLs
kubectl get httpproxies --all-namespaces -o jsonpath="{range .items[*]}https://{.spec.virtualhost.fqdn}{'\n'}{end}"
# Get Ingresses URLs and ensure only the first URL (if multiple) is grabbed
kubectl get ingresses --all-namespaces -o jsonpath="{range .items[*]}https://{.spec.rules[*].host}{'\n'}{end}" \
| awk '{print $1}' # This will grab only the first URL in case there are multiple
) | sort | uniq > urls.txt
This command is designed to list all fully qualified domain names (FQDNs) from both HTTPProxy and Ingress resources in a Kubernetes cluster, ensuring there are no duplicate URLs in the final output.
Note It's common that multiple virtualhosts point to the same deployment; you might want to filter the urls.txt
and adjust accordingly.
Just using ingress
objects? Awesome, use this snippet.
(
# Get Ingresses URLs and ensure only the first URL (if multiple) is grabbed
kubectl get ingresses --all-namespaces -o jsonpath="{range .items[*]}https://{.spec.rules[*].host}{'\n'}{end}" \
| awk '{print $1}' # This will grab only the first URL in case there are multipe
) | sort | uniq > urls.txt
Using the somewhat new gateway
object or httproutes
? Got you covered.
(
# Get Gateways hostnames
kubectl get gateways --all-namespaces -o jsonpath="{range .items[*]}https://{.spec.listeners[*].hostname}{'\n'}{end}"
# Get HTTPRoute hostnames
kubectl get httproutes --all-namespaces -o jsonpath="{range .items[*]}https://{.spec.hostnames[*]}{'\n'}{end}" \
| awk '{print $1}' # Only take the first hostname if multiple
) | sort | uniq > urls.txt
Make sure you run url-snapshotter
from a machine that has access to the URLs. If you can curl
or wget
a URL, then url-snapshotter
should work fine.
This project is licensed under the MIT License - see the LICENSE file for details.