
Prerelease documentation #77

Open

intarga wants to merge 19 commits into trunk from prerelease_documentation
Conversation

@intarga (Member) commented Mar 6, 2025

This PR includes

  • A README for the project
  • A set of markdown documents in /docs covering the architecture of lard's components (with the exception of api [now egress] because its design is not mature yet and won't be before the beta)
  • Renaming of api to egress as api is a rather redundant and non-descriptive name, especially since ingestion technically contains an API. I feel the name egress better reflects its purpose.
  • Removal of fake_data_generator as it is disused and not really maintained

@intarga added the documentation (Improvements or additions to documentation) label Mar 6, 2025
@intarga force-pushed the prerelease_documentation branch 3 times, most recently from 60d7a91 to 5072955 on March 12, 2025 12:06
@intarga force-pushed the prerelease_documentation branch from e8658b1 to 112644c on March 20, 2025 15:19
intarga added 6 commits March 20, 2025 19:34
We no longer have a need for it and aren't effectively maintaining it
api seems a bit too generic of a name, especially since ingestion
technically contains an API. The name egress better reflects what the
api is for
@intarga force-pushed the prerelease_documentation branch from b584cc0 to 603c168 on March 24, 2025 11:22
@intarga marked this pull request as ready for review March 24, 2025 11:30
@intarga requested review from Lun4m and jo-asplin-met-no and removed the request for Lun4m March 24, 2025 11:30
@intarga linked an issue Mar 24, 2025 that may be closed by this pull request

Lard is built around a Postgres database with two services that interact with it, one focused on ingestion, and one providing an API to access the data.

![Diagram of the architecture on a single node](/docs/images/single-arch.svg)
Collaborator:

Can the SVGs have a solid white background? They are unreadable in dark mode 😢 And maybe this first one could be horizontal (it takes up a lot of screen space)? Otherwise we could condense them into a single image (left: single node vs. right: horizontal scaling).


Here, one node takes responsibility for ingestion, using [Postgres replication](https://www.postgresql.org/docs/current/high-availability.html) to sync the others. Meanwhile, the others focus on serving read-only requests from the API service, allowing read throughput to scale linearly with the number of replicas. Replicas are also able to take over from the primary in case of outages, minimising downtime.

In addition to read throughput, previous experience with database systems at Met has taught us that as our dataset grows (think past 1 billion observations) write throughput begins to slow to a problematic degree. This happens because the [indexes](https://www.postgresql.org/docs/current/indexes.html) (structures needed speed up queries on large tables) become resource intensive to maintain as they grow larger. Particularly the BTree indices we use to represent time need to remain a balanced, but as we always add data on one side of the tree (the present is one extreme of the time range our dataset covers), we are constantly unbalancing it, and the expense of balancing a tree scales with its size.
Collaborator:

Suggested change
In addition to read throughput, previous experience with database systems at Met has taught us that as our dataset grows (think past 1 billion observations) write throughput begins to slow to a problematic degree. This happens because the [indexes](https://www.postgresql.org/docs/current/indexes.html) (structures needed speed up queries on large tables) become resource intensive to maintain as they grow larger. Particularly the BTree indices we use to represent time need to remain a balanced, but as we always add data on one side of the tree (the present is one extreme of the time range our dataset covers), we are constantly unbalancing it, and the expense of balancing a tree scales with its size.
In addition to read throughput, previous experience with database systems at MET has taught us that as our dataset grows (think past 1 billion observations) write throughput begins to slow down to a problematic degree. This happens because the [indexes](https://www.postgresql.org/docs/current/indexes.html) (structures needed to speed up queries on large tables) become resource intensive to maintain as they grow larger. Particularly the BTree indices we use to represent time need to remain balanced, but as we always add data on one side of the tree (the present is one extreme of the time range our dataset covers), we are constantly unbalancing it, and the expense of balancing a tree scales with its size.

Collaborator:

Slow down? And should it be Met or MET?

@@ -0,0 +1,64 @@
# Database

The database is defined by a set of schemas defined in [/db](/db) each prefixed by a number defining the order in which they should be applied.
Collaborator:

Suggested change
The database is defined by a set of schemas defined in [/db](/db) each prefixed by a number defining the order in which they should be applied.
The database is defined by a set of schemas found in [/db](/db), each prefixed by a number representing the order in which they should be applied.
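
As a rough illustration of that ordering (a hedged sketch only: tokio_postgres, the hard-coded `db` path, and the `apply_schemas` helper are assumptions for the example, not how the project actually applies its schemas):

```rust
use tokio_postgres::NoTls;

// Illustrative sketch only: apply the numbered schema files in /db in order.
// The connection string, the "db" path, and this helper itself are assumptions
// for the example, not part of the project.
async fn apply_schemas(db_url: &str) -> Result<(), Box<dyn std::error::Error>> {
    let (client, connection) = tokio_postgres::connect(db_url, NoTls).await?;
    // The connection object drives the socket and must be polled separately.
    tokio::spawn(connection);

    let mut paths: Vec<_> = std::fs::read_dir("db")?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|path| path.extension().is_some_and(|ext| ext == "sql"))
        .collect();
    // Lexical sort: works because each file name starts with its numeric prefix.
    paths.sort();

    for path in paths {
        let sql = std::fs::read_to_string(&path)?;
        client.batch_execute(&sql).await?;
    }
    Ok(())
}
```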


## Kafka and legacy

Right now as we aren't ready with confident QC, we are ingesting from KvKafka into the `legacy.data` table that has a different structure. As a result, that ingestor has it's own equivalent to `Datum` and `insert_data` that match `legacy.data`.
Collaborator:

Suggested change
Right now as we aren't ready with confident QC, we are ingesting from KvKafka into the `legacy.data` table that has a different structure. As a result, that ingestor has it's own equivalent to `Datum` and `insert_data` that match `legacy.data`.
Right now as we aren't ready with confident QC, we are ingesting from KvKafka into the `legacy.data` table that has a different structure. As a result, that ingestor has its own equivalent to `Datum` and `insert_data` that match `legacy.data`.
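
For illustration only, those legacy counterparts might be shaped roughly like the sketch below; the struct fields, column names, and `insert_legacy_data` helper are hypothetical (assuming tokio_postgres with its chrono integration), not the project's real definitions:

```rust
use chrono::{DateTime, Utc};
use tokio_postgres::Client;

// Hypothetical stand-in for the legacy equivalent of `Datum`; the real struct
// and the `legacy.data` columns may differ.
struct LegacyDatum {
    timeseries_id: i32,
    obstime: DateTime<Utc>,
    obsvalue: f32,
}

// Hypothetical equivalent of `insert_data` targeting `legacy.data`.
async fn insert_legacy_data(
    data: &[LegacyDatum],
    conn: &Client,
) -> Result<(), tokio_postgres::Error> {
    let stmt = conn
        .prepare("INSERT INTO legacy.data (timeseries, obstime, obsvalue) VALUES ($1, $2, $3)")
        .await?;
    for datum in data {
        conn.execute(&stmt, &[&datum.timeseries_id, &datum.obstime, &datum.obsvalue])
            .await?;
    }
    Ok(())
}
```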

Comment on lines +30 to +95
```rust
async fn handle_kldata(
State(pools): State<DbPools>,
State(param_conversions): State<ParamConversions>,
State(permit_table): State<Arc<RwLock<(ParamPermitTable, StationPermitTable)>>>,
State(rove_connector): State<Arc<rove_connector::Connector>>,
State(qc_pipelines): State<Arc<HashMap<(i32, RelativeDuration), rove::Pipeline>>>,
body: String,
) -> Json<KldataResp> {
metrics::counter!(KLDATA_MESSAGES_RECEIVED).increment(1);

let result: Result<usize, Error> = async {
let mut open_conn = pools.open.get().await?;
let mut restricted_conn = pools.restricted.get().await?;

let (message_id, obsinn_chunk) = parse_kldata(&body, param_conversions.clone())?;

let (mut open_data, mut restricted_data) = filter_and_label_kldata(
obsinn_chunk,
&mut open_conn,
&mut restricted_conn,
param_conversions,
permit_table,
)
.await?;

qc_and_insert_data(
&mut open_data,
&rove_connector,
&qc_pipelines,
&mut open_conn,
)
.await?;
qc_and_insert_data(
&mut restricted_data,
&rove_connector,
&qc_pipelines,
&mut restricted_conn,
)
.await?;

Ok(message_id)
}
.await;

match result {
Ok(message_id) => Json(KldataResp {
message: "".into(),
message_id,
res: 0,
retry: false,
}),
Err(e) => {
metrics::counter!(KLDATA_FAILURES).increment(1);
error!("failed to ingest kldata message: {}, body: {}", e, body);
// TODO: log errors?
Json(KldataResp {
message: e.to_string(),
message_id: 0, // TODO: some clever way to get the message id still if possible?
res: 1,
retry: !matches!(e, Error::Parse(_)),
})
}
}
}
```
Collaborator:

Should this just be a link to the lines in the file? Otherwise we'd have to fix it here too whenever the underlying code changes. But I also kinda like seeing the code directly in the README 🤔
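
For reference, the handler above is driven by a plain HTTP POST with the obsinn message as the request body; a minimal, hypothetical client-side sketch (the address, route, and elided payload below are placeholders, not the real interface):

```rust
// Hypothetical client sketch: the URL and route are placeholders, and the
// obsinn kldata payload is deliberately elided rather than invented.
async fn post_example_kldata() -> Result<(), reqwest::Error> {
    let body = "<obsinn kldata message goes here>";
    let resp = reqwest::Client::new()
        .post("http://localhost:3001/kldata") // placeholder address and route
        .body(body)
        .send()
        .await?;
    println!("ingestor responded with status {}", resp.status());
    Ok(())
}
```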

// with them
let client = reqwest::Client::new();

// Use helper fucntion `ingest_data` to send a request with the obsinn message generated
Collaborator:

Suggested change
// Use helper fucntion `ingest_data` to send a request with the obsinn message generated
// Use helper function `ingest_data` to send a request with the obsinn message generated

}
}
```
Here we expect the test_result to complete first, as the server tasks should run indefinitely unless they crash or until we tell them to shut down, so we panic and fail the test if this does not happen.
Collaborator:

Suggested change
Here we expect the test_result to complete first, as the server tasks should run indefinitely unless they crash or until we tell them to shut down, so we panic and fail the test if this does not happen.
Here we expect the `test_result` to complete first, as the server tasks should run indefinitely unless they crash or until we tell them to shut down, so we panic and fail the test if this does not happen.


In the handler for the test closure, we make sure to clean up the database before moving on, so the next test can start from a blank slate. We also catch any panics, so this cleanup is performed even if the closure fails.
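
A minimal sketch of this pattern (mirroring the role of `e2e_test_wrapper`, but with an illustrative signature, spawned task, and cleanup helper; it assumes tokio and the futures crate):

```rust
use std::panic::AssertUnwindSafe;

use futures::FutureExt;
use tokio::task::JoinSet;

// Illustrative wrapper: run server tasks alongside a test future, panic if a
// server task ever finishes first, and always clean up the database afterwards.
async fn test_wrapper<T: std::future::Future<Output = ()>>(test: T) {
    let mut servers = JoinSet::new();
    servers.spawn(async {
        // placeholder: the ingestion/API servers would be started here
    });

    tokio::select! {
        // Server tasks should run indefinitely, so any completion is a failure.
        _ = servers.join_next() => panic!("a server task terminated unexpectedly"),
        // Catch panics from the test body so the cleanup below always runs.
        result = AssertUnwindSafe(test).catch_unwind() => {
            cleanup_database().await;
            if let Err(panic) = result {
                std::panic::resume_unwind(panic);
            }
        }
    }
}

async fn cleanup_database() {
    // placeholder: truncate the tables the test touched
}
```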

It's worth noting that e2e_test_wrapper does not set up a postgres instance, just connects to it, so a postgres instance must be available to run these tests. To help with this, we've set up a justfile to use instead of running `cargo test` directly, which sets up a postgres instance in a docker container.
Collaborator:

Suggested change
It's worth noting that e2e_test_wrapper does not set up a postgres instance, just connects to it, so a postgres instance must be available to run these tests. To help with this, we've set up a justfile to use instead of running `cargo test` directly, which sets up a postgres instance in a docker container.
It's worth noting that `e2e_test_wrapper` does not set up a postgres instance, just connects to it, so a postgres instance must be available to run these tests. To help with this, we've set up a justfile to use instead of running `cargo test` directly, which sets up a postgres instance in a docker container.

@Lun4m (Collaborator) commented Mar 24, 2025

Maybe we can also remove the README.md inside the integration_tests directory?

Labels: documentation (Improvements or additions to documentation)
Projects: None yet

Successfully merging this pull request may close these issues: Write pre-release documentation

2 participants