
Prerelease documentation #77

Open

intarga wants to merge 19 commits into trunk from prerelease_documentation
Conversation

@intarga (Member) commented Mar 6, 2025

This PR includes

  • A README for the project
  • A set of markdown documents in /docs covering the architecture of lard's components (with the exception of api [now egress] because its design is not mature yet and won't be before the beta)
  • Renaming of api to egress as api is a rather redundant and non-descriptive name, especially since ingestion technically contains an API. I feel the name egress better reflects its purpose.
  • Removal of fake_data_generator as it is disused and not really maintained

@intarga added the documentation (Improvements or additions to documentation) label Mar 6, 2025
@intarga force-pushed the prerelease_documentation branch 3 times, most recently from 60d7a91 to 5072955 on March 12, 2025 12:06
@intarga force-pushed the prerelease_documentation branch from e8658b1 to 112644c on March 20, 2025 15:19
intarga added 6 commits March 20, 2025 19:34
We no longer have a need for it and aren't effectively maintaining it
api seems a bit too generic of a name, especially since ingestion
technically contains an API. The name egress better reflects what the
api is for
@intarga force-pushed the prerelease_documentation branch from b584cc0 to 603c168 on March 24, 2025 11:22
@intarga marked this pull request as ready for review March 24, 2025 11:30
@intarga requested review from Lun4m and jo-asplin-met-no and removed the request for Lun4m March 24, 2025 11:30
@intarga linked an issue Mar 24, 2025 that may be closed by this pull request

Lard is built around a Postgres database with two services that interact with it, one focused on ingestion, and one providing an API to access the data.

![Diagram of the architecture on a single node](/docs/images/single-arch.svg)
Collaborator:

Can the SVGs have a solid white background? They are unreadable in dark mode 😢 And maybe this first one could be horizontal (it takes up a lot of screen space)? Otherwise we could condense them into a single image (left: single node vs. right: horizontal scaling).


Here, one node takes responsibility for ingestion, using [Postgres replication](https://www.postgresql.org/docs/current/high-availability.html) to sync the others. Meanwhile, the others focus on serving read-only requests from the API service, allowing read throughput to scale linearly with the number of replicas. Replicas are also able to take over from the primary in case of outages, minimising downtime.

In addition to read throughput, previous experience with database systems at Met has taught us that as our dataset grows (think past 1 billion observations) write throughput begins to slow to a problematic degree. This happens because the [indexes](https://www.postgresql.org/docs/current/indexes.html) (structures needed speed up queries on large tables) become resource intensive to maintain as they grow larger. Particularly the BTree indices we use to represent time need to remain a balanced, but as we always add data on one side of the tree (the present is one extreme of the time range our dataset covers), we are constantly unbalancing it, and the expense of balancing a tree scales with its size.
Collaborator:

Suggested change
In addition to read throughput, previous experience with database systems at Met has taught us that as our dataset grows (think past 1 billion observations) write throughput begins to slow to a problematic degree. This happens because the [indexes](https://www.postgresql.org/docs/current/indexes.html) (structures needed speed up queries on large tables) become resource intensive to maintain as they grow larger. Particularly the BTree indices we use to represent time need to remain a balanced, but as we always add data on one side of the tree (the present is one extreme of the time range our dataset covers), we are constantly unbalancing it, and the expense of balancing a tree scales with its size.
In addition to read throughput, previous experience with database systems at MET has taught us that as our dataset grows (think past 1 billion observations) write throughput begins to slow down to a problematic degree. This happens because the [indexes](https://www.postgresql.org/docs/current/indexes.html) (structures needed to speed up queries on large tables) become resource intensive to maintain as they grow larger. Particularly the BTree indices we use to represent time need to remain balanced, but as we always add data on one side of the tree (the present is one extreme of the time range our dataset covers), we are constantly unbalancing it, and the expense of balancing a tree scales with its size.

Collaborator:

Slow down? And should it be Met or MET?

@@ -0,0 +1,64 @@
# Database

The database is defined by a set of schemas defined in [/db](/db) each prefixed by a number defining the order in which they should be applied.
Collaborator:

Suggested change
The database is defined by a set of schemas defined in [/db](/db) each prefixed by a number defining the order in which they should be applied.
The database is defined by a set of schemas found in [/db](/db), each prefixed by a number representing the order in which they should be applied.
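
As a rough illustration of that ordering (a hedged sketch only: tokio_postgres, the hard-coded `db` path, and the `apply_schemas` helper are assumptions for the example, not how the project actually applies its schemas):

```rust
use tokio_postgres::NoTls;

// Illustrative sketch only: apply the numbered schema files in /db in order.
// The connection string, the "db" path, and this helper itself are assumptions
// for the example, not part of the project.
async fn apply_schemas(db_url: &str) -> Result<(), Box<dyn std::error::Error>> {
    let (client, connection) = tokio_postgres::connect(db_url, NoTls).await?;
    // The connection object drives the socket and must be polled separately.
    tokio::spawn(connection);

    let mut paths: Vec<_> = std::fs::read_dir("db")?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|path| path.extension().is_some_and(|ext| ext == "sql"))
        .collect();
    // Lexical sort: works because each file name starts with its numeric prefix.
    paths.sort();

    for path in paths {
        let sql = std::fs::read_to_string(&path)?;
        client.batch_execute(&sql).await?;
    }
    Ok(())
}
```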


## Kafka and legacy

Right now as we aren't ready with confident QC, we are ingesting from KvKafka into the `legacy.data` table that has a different structure. As a result, that ingestor has it's own equivalent to `Datum` and `insert_data` that match `legacy.data`.
Collaborator:

Suggested change
Right now as we aren't ready with confident QC, we are ingesting from KvKafka into the `legacy.data` table that has a different structure. As a result, that ingestor has it's own equivalent to `Datum` and `insert_data` that match `legacy.data`.
Right now as we aren't ready with confident QC, we are ingesting from KvKafka into the `legacy.data` table that has a different structure. As a result, that ingestor has its own equivalent to `Datum` and `insert_data` that match `legacy.data`.
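
For illustration only, those legacy counterparts might be shaped roughly like the sketch below; the struct fields, column names, and `insert_legacy_data` helper are hypothetical (assuming tokio_postgres with its chrono integration), not the project's real definitions:

```rust
use chrono::{DateTime, Utc};
use tokio_postgres::Client;

// Hypothetical stand-in for the legacy equivalent of `Datum`; the real struct
// and the `legacy.data` columns may differ.
struct LegacyDatum {
    timeseries_id: i32,
    obstime: DateTime<Utc>,
    obsvalue: f32,
}

// Hypothetical equivalent of `insert_data` targeting `legacy.data`.
async fn insert_legacy_data(
    data: &[LegacyDatum],
    conn: &Client,
) -> Result<(), tokio_postgres::Error> {
    let stmt = conn
        .prepare("INSERT INTO legacy.data (timeseries, obstime, obsvalue) VALUES ($1, $2, $3)")
        .await?;
    for datum in data {
        conn.execute(&stmt, &[&datum.timeseries_id, &datum.obstime, &datum.obsvalue])
            .await?;
    }
    Ok(())
}
```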

Comment on lines +30 to +95
```rust
async fn handle_kldata(
State(pools): State<DbPools>,
State(param_conversions): State<ParamConversions>,
State(permit_table): State<Arc<RwLock<(ParamPermitTable, StationPermitTable)>>>,
State(rove_connector): State<Arc<rove_connector::Connector>>,
State(qc_pipelines): State<Arc<HashMap<(i32, RelativeDuration), rove::Pipeline>>>,
body: String,
) -> Json<KldataResp> {
metrics::counter!(KLDATA_MESSAGES_RECEIVED).increment(1);

let result: Result<usize, Error> = async {
let mut open_conn = pools.open.get().await?;
let mut restricted_conn = pools.restricted.get().await?;

let (message_id, obsinn_chunk) = parse_kldata(&body, param_conversions.clone())?;

let (mut open_data, mut restricted_data) = filter_and_label_kldata(
obsinn_chunk,
&mut open_conn,
&mut restricted_conn,
param_conversions,
permit_table,
)
.await?;

qc_and_insert_data(
&mut open_data,
&rove_connector,
&qc_pipelines,
&mut open_conn,
)
.await?;
qc_and_insert_data(
&mut restricted_data,
&rove_connector,
&qc_pipelines,
&mut restricted_conn,
)
.await?;

Ok(message_id)
}
.await;

match result {
Ok(message_id) => Json(KldataResp {
message: "".into(),
message_id,
res: 0,
retry: false,
}),
Err(e) => {
metrics::counter!(KLDATA_FAILURES).increment(1);
error!("failed to ingest kldata message: {}, body: {}", e, body);
// TODO: log errors?
Json(KldataResp {
message: e.to_string(),
message_id: 0, // TODO: some clever way to get the message id still if possible?
res: 1,
retry: !matches!(e, Error::Parse(_)),
})
}
}
}
```
Collaborator:

Should this just be a link to the lines in the file? Otherwise we'd have to fix it here too whenever the underlying code changes. But I also kinda like seeing the code directly in the README 🤔
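
For reference, the handler above is driven by a plain HTTP POST with the obsinn message as the request body; a minimal, hypothetical client-side sketch (the address, route, and elided payload below are placeholders, not the real interface):

```rust
// Hypothetical client sketch: the URL and route are placeholders, and the
// obsinn kldata payload is deliberately elided rather than invented.
async fn post_example_kldata() -> Result<(), reqwest::Error> {
    let body = "<obsinn kldata message goes here>";
    let resp = reqwest::Client::new()
        .post("http://localhost:3001/kldata") // placeholder address and route
        .body(body)
        .send()
        .await?;
    println!("ingestor responded with status {}", resp.status());
    Ok(())
}
```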

// with them
let client = reqwest::Client::new();

// Use helper fucntion `ingest_data` to send a request with the obsinn message generated
Collaborator:

Suggested change
// Use helper fucntion `ingest_data` to send a request with the obsinn message generated
// Use helper function `ingest_data` to send a request with the obsinn message generated

}
}
```
Here we expect the test_result to complete first, as the server tasks should run indefinitely unless they crash or until we tell them to shut down, so we panic and fail the test if this does not happen.
Collaborator:

Suggested change
Here we expect the test_result to complete first, as the server tasks should run indefinitely unless they crash or until we tell them to shut down, so we panic and fail the test if this does not happen.
Here we expect the `test_result` to complete first, as the server tasks should run indefinitely unless they crash or until we tell them to shut down, so we panic and fail the test if this does not happen.


In the handler for the test closure, we make sure to clean up the database before moving on, so the next test can start from a blank slate. We also catch any panics, so this cleanup is performed even if the closure fails.
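
A minimal sketch of this pattern (mirroring the role of `e2e_test_wrapper`, but with an illustrative signature, spawned task, and cleanup helper; it assumes tokio and the futures crate):

```rust
use std::panic::AssertUnwindSafe;

use futures::FutureExt;
use tokio::task::JoinSet;

// Illustrative wrapper: run server tasks alongside a test future, panic if a
// server task ever finishes first, and always clean up the database afterwards.
async fn test_wrapper<T: std::future::Future<Output = ()>>(test: T) {
    let mut servers = JoinSet::new();
    servers.spawn(async {
        // placeholder: the ingestion/API servers would be started here
    });

    tokio::select! {
        // Server tasks should run indefinitely, so any completion is a failure.
        _ = servers.join_next() => panic!("a server task terminated unexpectedly"),
        // Catch panics from the test body so the cleanup below always runs.
        result = AssertUnwindSafe(test).catch_unwind() => {
            cleanup_database().await;
            if let Err(panic) = result {
                std::panic::resume_unwind(panic);
            }
        }
    }
}

async fn cleanup_database() {
    // placeholder: truncate the tables the test touched
}
```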

It's worth noting that e2e_test_wrapper does not set up a postgres instance, just connects to it, so a postgres instance must be available to run these tests. To help with this, we've set up a justfile to use instead of running `cargo test` directly, which sets up a postgres instance in a docker container.
Collaborator:

Suggested change
It's worth noting that e2e_test_wrapper does not set up a postgres instance, just connects to it, so a postgres instance must be available to run these tests. To help with this, we've set up a justfile to use instead of running `cargo test` directly, which sets up a postgres instance in a docker container.
It's worth noting that `e2e_test_wrapper` does not set up a postgres instance, just connects to it, so a postgres instance must be available to run these tests. To help with this, we've set up a justfile to use instead of running `cargo test` directly, which sets up a postgres instance in a docker container.

@Lun4m (Collaborator) commented Mar 24, 2025

Maybe we can also remove the README.md inside the integration_tests directory?

Labels: documentation (Improvements or additions to documentation)
Projects: None yet

Successfully merging this pull request may close these issues: Write pre-release documentation

2 participants