Commit 303fc31

docs: add documentations for udsource (numaproj#1142)

Signed-off-by: Keran Yang <[email protected]>

1 parent a7adee1

16 files changed, +82 -63 lines changed

docs/core-concepts/watermarks.md (+4 -4)

````diff
@@ -40,9 +40,9 @@ an API. Watermark API is supported in all our client SDKs.
 
 ```go
 // Go
-func handle(ctx context.Context, key string, data funcsdk.Datum) funcsdk.Messages {
-    _ = data.EventTime() // Event time
-    _ = data.Watermark() // Watermark
-    ... ...
+func mapFn(context context.Context, keys []string, d mapper.Datum) mapper.Messages {
+    _ = d.EventTime() // Event time
+    _ = d.Watermark() // Watermark
+    ... ...
 }
 ```
````
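For context on the new signature in this hunk, a complete handler in the restructured SDK might look like the sketch below. It assumes the numaflow-go `mapper` package bootstrap (`NewServer`, `MapperFunc`); verify the names against the SDK version you use.

```go
package main

import (
	"context"
	"log"

	"github.com/numaproj/numaflow-go/pkg/mapper"
)

// mapFn inspects the event time and watermark on the datum and forwards
// the payload unchanged, keeping the original keys.
func mapFn(ctx context.Context, keys []string, d mapper.Datum) mapper.Messages {
	_ = d.EventTime() // Event time
	_ = d.Watermark() // Watermark
	return mapper.MessagesBuilder().Append(mapper.NewMessage(d.Value()).WithKeys(keys))
}

func main() {
	// Assumption: NewServer/MapperFunc is the mapper bootstrap in numaflow-go.
	if err := mapper.NewServer(mapper.MapperFunc(mapFn)).Start(context.Background()); err != nil {
		log.Panic(err)
	}
}
```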

docs/development/development.md (+1 -1)

````diff
@@ -47,7 +47,7 @@ kind export kubeconfig
   Build container image, and import it to `k3d`, `kind`, or `minikube` cluster if corresponding `KUBECONFIG` is sourced.
 
 - `make docs`
-  Convert the docs to Github pages, check if there's any error.
+  Convert the docs to GitHub pages, check if there's any error.
 
 - `make docs-serve`
   Start [an HTTP server](http://127.0.0.1:8000/) on your local to host the docs generated Github pages.
````

docs/development/releasing.md (+2 -2)

````diff
@@ -7,11 +7,11 @@ Always create a release branch for the releases, for example branch `release-0.5
 ## Release Steps
 
 1. Cherry-pick fixes to the release branch, skip this step if it's the first release in the branch.
-1. Run `make test` to make sure all test test cases pass locally.
+1. Run `make test` to make sure all test cases pass locally.
 1. Push to remote branch, and make sure all the CI jobs pass.
 1. Run `make prepare-release VERSION=v{x.y.z}` to update version in manifests, where `x.y.x` is the expected new version.
 1. Follow the output of last step, to confirm if all the changes are expected, and then run `make release VERSION=v{x.y.z}`.
-1. Follow the output, push a new tag to the release branch, Github actions will automatically build and publish the new release, this will take around 10 minutes.
+1. Follow the output, push a new tag to the release branch, GitHub actions will automatically build and publish the new release, this will take around 10 minutes.
 1. Test the new release, make sure everything is running as expected, and then recreate a `stable` tag against the latest release.
 ```shell
 git tag -d stable
````

docs/operations/installation.md (+1 -1)

````diff
@@ -74,7 +74,7 @@ To do managed namespace installation, besides `--namespaced`, add `--managed-nam
 
 By default, the Numaflow controller is installed with `Active-Passive` HA strategy enabled, which means you can run the controller with multiple replicas (defaults to 1 in the manifests).
 
-To turn off HA, add following environment variable to the deployment spec.
+To turn off HA, add the following environment variable to the deployment spec.
 
 ```
 name: NUMAFLOW_LEADER_ELECTION_DISABLED
````
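As a sketch of where that variable lands in the controller Deployment, the entry might look like the following; the container name and the `value` of `"true"` are assumptions, not taken from this diff.

```yaml
spec:
  template:
    spec:
      containers:
        - name: controller-manager   # assumed container name
          env:
            - name: NUMAFLOW_LEADER_ELECTION_DISABLED
              value: "true"          # assumed value to disable HA
```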

docs/specifications/edges-buffers-buckets.md (+3 -3)

````diff
@@ -6,7 +6,7 @@
 
 `Edge` is the connection between the vertices, specifically, `edge` is defined in the pipeline spec under `.spec.edges`. No matter if the `to` vertex is a Map, or a Reduce with multiple partitions, it is considered as one edge.
 
-In the following pipeline , there are 3 edges defined (`in` - `aoti`, `aoti` - `compute-sum`, `compute-sum` - `out`).
+In the following pipeline, there are 3 edges defined (`in` - `aoti`, `aoti` - `compute-sum`, `compute-sum` - `out`).
 
 ```yaml
 apiVersion: numaflow.numaproj.io/v1alpha1
@@ -54,9 +54,9 @@ Each `edge` could have a name for internal usage, the naming convention is `{pip
 
 `Buffer` is `InterStepBuffer`. Each buffer has an owner, which is the vertex who reads from it. Each `udf` and `sink` vertex in a pipeline owns a group of partitioned buffers. Each buffer has a name with the naming convention `{pipeline-name}-{vertex-name}-{index}`, where the `index` is the partition index, starting from 0. This naming convention applies to the buffers of both map and reduce udf vertices.
 
-When multiple vertices connecting to the same vertex, if the `to` vertex is a Map, the data from all the from vertices will be forwarded to the group of partitoned buffers round-robinly. If the `to` vertex is a Reduce, the data from all the from vertices will be forwarded to the group of partitoned buffers based on the partitioning key.
+When multiple vertices connecting to the same vertex, if the `to` vertex is a Map, the data from all the from vertices will be forwarded to the group of partitoned buffers round-robinly. If the `to` vertex is a Reduce, the data from all the from vertices will be forwarded to the group of partitioned buffers based on the partitioning key.
 
-A Source vertex does not have any owned buffers. But a pipeline may have multiple Source vertices, followed by one vertex. Same as above, if the following vertex is a map, the data from all the Source vertices will be forwarded to the group of partitoned buffers round-robinly. If it is a reduce, the data from all the Source vertices will be forwarded to the group of partitoned buffers based on the partitioning key.
+A Source vertex does not have any owned buffers. But a pipeline may have multiple Source vertices, followed by one vertex. Same as above, if the following vertex is a map, the data from all the Source vertices will be forwarded to the group of partitioned buffers round-robinly. If it is a reduce, the data from all the Source vertices will be forwarded to the group of partitioned buffers based on the partitioning key.
 
 ## Buckets
 
````
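To make the buffer naming convention above concrete, here is a small sketch based on the example pipeline in this diff; the pipeline name `my-pipeline` and the partition count of 2 are assumptions for illustration.

```yaml
# The three edges of the example pipeline, as defined under .spec.edges.
edges:
  - from: in
    to: aoti
  - from: aoti
    to: compute-sum
  - from: compute-sum
    to: out
# Per the buffer naming convention {pipeline-name}-{vertex-name}-{index},
# if this pipeline were named my-pipeline and compute-sum had 2 partitions,
# compute-sum would own my-pipeline-compute-sum-0 and my-pipeline-compute-sum-1.
```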

docs/specifications/overview.md (+7 -7)

```diff
@@ -111,7 +111,7 @@ Logic:
 **Matrix of Operations**
 
 | | Source | Processor | Sink |
-| -------------- |------------------| ------------ |---------------|
+|----------------|------------------|--------------|---------------|
 | ReadFromBuffer | Read From Source | Generic | Generic |
 | CallUDF | Void | User Defined | Void |
 | Forward | Generic | Generic | Write To Sink |
@@ -124,7 +124,7 @@ Logic:
 - Numaflow is restartable if aborted or steps fail while preserving
   exactly-once semantics.
 - Do not generate more output than can be used by the next stage in a
-  reasonable amount of time, i.e. the size of buffers between steps
+  reasonable amount of time, i.e., the size of buffers between steps
   should be limited, (aka backpressure).
 - User code should be isolated from offset management, restart, exactly once, backpressure, etc.
 - Streaming process systems inherently require a concept of time, this
@@ -144,7 +144,7 @@ Logic:
   ![Tree Dag](../assets/tree_dag.png)
 - Diamond (In Future)
   ![Diamond Dag](../assets/diamond_dag.png)
-- Multiple Sources with same schema (In Future)
+- Multiple Sources with the same schema (In Future)
   ![Multi Source Dag](../assets/multi_source_dag.png)
 
 ## Non-Requirements
@@ -160,7 +160,7 @@ Logic:
 
 - In order to be able to support various buffering technologies, we
   will persist and manage stream "offsets" rather than relying on
-  the buffering technology (e.g. Kafka)
+  the buffering technology (e.g., Kafka)
 - Each processor may persist state associated with their processing
   no distributed transactions are needed for checkpointing
 - If we have a tree DAG, how will we manage acknowledgments? We
@@ -217,14 +217,14 @@ To detect duplicates, make sure the delivery is Exactly-Once:
 ### Unique Identifier for Message
 
 To detect duplicates, we first need to uniquely identify each message.
-We will be relying on the "identifier" available (eg, "offset" in Kafka)
+We will be relying on the "identifier" available (e.g., "offset" in Kafka)
 in the buffer to uniquely identify each message. If such an identifier
-is not available, we will be creating an unique identifier (sequence
+is not available, we will be creating a unique identifier (sequence
 numbers are tough because there are multiple readers). We can use this
 unique identifier to ensure that we forward only if the message has not
 been forwarded yet. We will only look back for a fixed window of time
 since this is a stream processing application on an unbounded stream of
-data and we do not have infinite resources.
+data, and we do not have infinite resources.
 
 The same offset will not be used across all the steps in Numaflow, but
 we will be using the current offset only while forwarding to the next
```
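The fixed-lookback dedup idea in the last hunk can be pictured with a short sketch. This is illustrative only, not Numaflow's implementation; the type and method names are invented.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// dedup remembers message identifiers (e.g., Kafka offsets) for a fixed
// lookback window, mirroring the idea described above.
type dedup struct {
	mu       sync.Mutex
	seen     map[string]time.Time
	lookback time.Duration
}

func newDedup(lookback time.Duration) *dedup {
	return &dedup{seen: make(map[string]time.Time), lookback: lookback}
}

// alreadyForwarded reports whether id was forwarded within the window and
// records it otherwise; entries older than the window are evicted so the
// state stays bounded on an unbounded stream.
func (d *dedup) alreadyForwarded(id string, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	for k, t := range d.seen {
		if now.Sub(t) > d.lookback {
			delete(d.seen, k)
		}
	}
	if _, ok := d.seen[id]; ok {
		return true
	}
	d.seen[id] = now
	return false
}

func main() {
	d := newDedup(time.Minute)
	fmt.Println(d.alreadyForwarded("offset-42", time.Now())) // false, first sighting
	fmt.Println(d.alreadyForwarded("offset-42", time.Now())) // true, duplicate
}
```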

docs/user-guide/reference/conditional-forwarding.md (+1 -1)

```diff
@@ -1,7 +1,7 @@
 # Conditional Forwarding
 
 After processing the data, conditional forwarding is doable based on the `Tags` returned in the result.
-Below is list of different logic operations that can be done on tags.
+Below is a list of different logic operations that can be done on tags.
 - **and** - forwards the message if all the tags specified are present in Message's tags.
 - **or** - forwards the message if one of the tags specified is present in Message's tags.
 - **not** - forwards the message if all the tags specified are not present in Message's tags.
```
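For a sense of how tags get onto a message in the first place, a UDF might attach them as in this sketch. It assumes the numaflow-go `mapper` API (`WithTags` on a message); the `even`/`odd` tags are made up for illustration.

```go
package main

import (
	"context"
	"log"

	"github.com/numaproj/numaflow-go/pkg/mapper"
)

// mapFn tags each message "even" or "odd" by payload length so a
// downstream edge condition (and/or/not) can route on those tags.
func mapFn(ctx context.Context, keys []string, d mapper.Datum) mapper.Messages {
	tag := "odd"
	if len(d.Value())%2 == 0 {
		tag = "even"
	}
	return mapper.MessagesBuilder().
		Append(mapper.NewMessage(d.Value()).WithKeys(keys).WithTags([]string{tag}))
}

func main() {
	if err := mapper.NewServer(mapper.MapperFunc(mapFn)).Start(context.Background()); err != nil {
		log.Panic(err)
	}
}
```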

docs/user-guide/reference/multi-partition.md (+1 -1)

````diff
@@ -7,7 +7,7 @@ that the JetStream is provisioned with more nodes to support higher throughput.
 Since partitions are owned by the vertex reading the data, to create a multi-partitioned edge
 we need to configure the vertex reading the data (to-vertex) to have multiple partitions.
 
-The following code snippet provides an example of how to configure a vertex (in this case, the `cat` vertex) to have multiple partitions, which enables it (`cat` vertex) to read at a higher throughput.
+The following code snippet provides an example of how to configure a vertex (in this case, the `cat` vertex) to have multiple partitions, which enables it (`cat` vertex) to read at a higher throughput.
 
 ```yaml
 - name: cat
````
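The snippet the doc refers to presumably continues along these lines; the partition count of 3 and the builtin `cat` UDF are assumptions for illustration.

```yaml
- name: cat
  partitions: 3   # number of partitions owned by this (to-)vertex; assumed value
  udf:
    builtin:
      name: cat   # assumed: the builtin pass-through UDF
```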

docs/user-guide/reference/side-inputs.md (+1 -3)

````diff
@@ -3,8 +3,6 @@
 For an unbounded pipeline in Numaflow that never terminates, there are many cases where users want to update a configuration of the UDF without restarting the pipeline. Numaflow enables it by the `Side Inputs` feature where we can broadcast changes to vertices automatically.
 The `Side Inputs` feature achieves this by allowing users to write custom UDFs to broadcast changes to the vertices that are listening in for updates.
 
-
-
 ### Using Side Inputs in Numaflow
 The Side Inputs are updated based on a cron-like schedule,
 specified in the pipeline spec with a trigger field.
@@ -74,7 +72,7 @@ func handle(_ context.Context) sideinputsdk.Message {
     return sideinputsdk.BroadcastMessage([]byte(val))
 }
 ```
-Similarly, this can be written in [Python](https://github.com/numaproj/numaflow-python/blob/main/examples/sideinput/simple-sideinput/example.py)
+Similarly, this can be written in [Python](https://github.com/numaproj/numaflow-python/blob/main/examples/sideinput/simple-sideinput/example.py)
 and [Java](https://github.com/numaproj/numaflow-java/blob/main/examples/src/main/java/io/numaproj/numaflow/examples/sideinput/simple/SimpleSideInput.java) as well.
 
 After performing the retrieval/update, the side input value is then broadcasted to all vertices that use the side input.
````
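A fuller version of the Go handler shown in the second hunk might be bootstrapped as below; the `NewSideInputServer`/`RetrieveFunc` names are assumed from the numaflow-go side input package and should be checked against the SDK you use.

```go
package main

import (
	"context"
	"log"
	"time"

	sideinputsdk "github.com/numaproj/numaflow-go/pkg/sideinput"
)

// handle runs on every trigger tick and broadcasts the new side input
// value to the vertices that subscribe to it.
func handle(_ context.Context) sideinputsdk.Message {
	val := time.Now().Format(time.RFC3339) // e.g., a freshly fetched config value
	return sideinputsdk.BroadcastMessage([]byte(val))
}

func main() {
	// Assumption: this is the side input server bootstrap in numaflow-go.
	if err := sideinputsdk.NewSideInputServer(sideinputsdk.RetrieveFunc(handle)).Start(context.Background()); err != nil {
		log.Panic(err)
	}
}
```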

docs/user-guide/sinks/overview.md (+3 -3)

```diff
@@ -3,7 +3,7 @@
 The Sink serves as the endpoint for processed data that has been outputted from the platform,
 which is then sent to an external system or application. The purpose of the Sink is to deliver
 the processed data to its ultimate destination, such as a database, data warehouse, visualization
-tool, or alerting system. It's the opposite of the Source vettex, which receives input data into the platform.
+tool, or alerting system. It's the opposite of the Source vertex, which receives input data into the platform.
 Sink vertex may require transformation or formatting of data prior to sending it to the target system. Depending on the
 target system's needs, this transformation can be simple or complex.
 
@@ -18,7 +18,7 @@ Numaflow currently supports the following Sinks
 
 A user-defined sink is a custom Sink that a user can write using Numaflow SDK when
 the user needs to output the processed data to a system or using a certain transformation that is not
-supported by the platform's built-in sinks. As an example, once we have processed the input messages,
+supported by the platform's built-in sinks. As an example, once we have processed the input messages,
 we can use Elasticsearch as a User defined sink to store the processed data and enable search and
-analysis on the data.
+analysis on the data.
 
```

docs/user-guide/sinks/user-defined-sinks.md (+1 -1)

```diff
@@ -1,6 +1,6 @@
 # User Defined Sinks
 
-A `Pipeline` may have multiple Sinks, those sinks could either be a pre-defined sink such as `kafka`, `log`, etc, or a `User Defined Sink`.
+A `Pipeline` may have multiple Sinks, those sinks could either be a pre-defined sink such as `kafka`, `log`, etc., or a `User Defined Sink`.
 
 A pre-defined sink vertex runs single-container pods, a user defined sink runs two-container pods.
 
```
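For reference, the user-defined half of such a two-container pod could be a sink like the sketch below, assuming the numaflow-go `sinker` package (`Sink` over a datum channel, `ResponseOK`, `NewServer`); names may differ by SDK version.

```go
package main

import (
	"context"
	"log"

	sinksdk "github.com/numaproj/numaflow-go/pkg/sinker"
)

type logSink struct{}

// Sink consumes a batch of datums and acknowledges each one; a real sink
// would write to Elasticsearch, a database, etc. instead of logging.
func (s *logSink) Sink(ctx context.Context, datumStreamCh <-chan sinksdk.Datum) sinksdk.Responses {
	responses := sinksdk.ResponsesBuilder()
	for d := range datumStreamCh {
		log.Println(string(d.Value()))
		responses = responses.Append(sinksdk.ResponseOK(d.ID()))
	}
	return responses
}

func main() {
	if err := sinksdk.NewServer(&logSink{}).Start(context.Background()); err != nil {
		log.Panic(err)
	}
}
```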

docs/user-guide/sources/overview.md (+6 -2)

```diff
@@ -1,12 +1,16 @@
 # Sources
 
 Source vertex is responsible for reliable reading data from an unbounded source into Numaflow.
+Source vertex may require [transformation](./transformer/overview.md) or formatting of data prior to sending it to the output buffers.
+Source Vertex also does [Watermark](../../core-concepts/watermarks.md) tracking and late data detection.
 
-In Numaflow, we currently support the following builtin sources
+In Numaflow, we currently support the following sources
 
 * [Kafka](./kafka.md)
 * [HTTP](./http.md)
 * [Ticker](./generator.md)
 * [Nats](./nats.md)
+* [User Defined Source](./user-defined-sources.md)
 
-Source Vertex also does [Watermark](../../core-concepts/watermarks.md) tracking and late data detection.
+A user defined source is a custom source that a user can write using Numaflow SDK when
+the user needs to read data from a system that is not supported by the platform's built-in sources.
```

docs/user-guide/sources/user-defined-sources.md (new file, +36)

````diff
@@ -0,0 +1,36 @@
+# User Defined Sources
+
+A `Pipeline` may have multiple Sources, those sources could either be a pre-defined source such as `kafka`, `http`, etc., or a `User Defined Source`.
+
+With no source data transformer, A pre-defined source vertex runs single-container pods; a user-defined source runs two-container pods.
+
+## Build Your Own User Defined Sources
+
+You can build your own user defined sources in multiple languages.
+
+Check the links below to see the examples for different languages.
+
+- [Golang](https://github.com/numaproj/numaflow-go/tree/main/pkg/sourcer/examples/simple_source/)
+- [Java](https://github.com/numaproj/numaflow-java/tree/main/examples/src/main/java/io/numaproj/numaflow/examples/source/simple/)
+
+After building a docker image for the written user-defined source, specify the image as below in the vertex spec.
+
+```yaml
+spec:
+  vertices:
+    - name: input
+      source:
+        udsource:
+          container:
+            image: my-source:latest
+```
+
+## Available Environment Variables
+
+Some environment variables are available in the user defined source container:
+
+- `NUMAFLOW_NAMESPACE` - Namespace.
+- `NUMAFLOW_POD` - Pod name.
+- `NUMAFLOW_REPLICA` - Replica index.
+- `NUMAFLOW_PIPELINE_NAME` - Name of the pipeline.
+- `NUMAFLOW_VERTEX_NAME` - Name of the vertex.
````
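Those variables are plain environment variables, so a source container can read them directly; a trivial sketch:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Injected into the user-defined source container per the list above.
	for _, k := range []string{
		"NUMAFLOW_NAMESPACE",
		"NUMAFLOW_POD",
		"NUMAFLOW_REPLICA",
		"NUMAFLOW_PIPELINE_NAME",
		"NUMAFLOW_VERTEX_NAME",
	} {
		fmt.Printf("%s=%q\n", k, os.Getenv(k))
	}
}
```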

docs/user-guide/user-defined-functions/map/map.md (+10 -30)

````diff
@@ -8,33 +8,13 @@ There are some [Built-in Functions](builtin-functions/README.md) that can be use
 
 ## Build Your Own UDF
 
-You can build your own UDF in multiple languages. A User Defined Function could be as simple as below in Golang.
-
-```golang
-package main
-
-import (
-    "context"
-
-    functionsdk "github.com/numaproj/numaflow-go/pkg/function"
-    "github.com/numaproj/numaflow-go/pkg/function/server"
-)
-
-func mapHandle(_ context.Context, keys []string, d functionsdk.Datum) functionsdk.Messages {
-    // Directly forward the input to the output
-    return functionsdk.MessagesBuilder().Append(functionsdk.NewMessage(d.Value()).WithKeys(keys))
-}
-
-func main() {
-    server.New().RegisterMapper(functionsdk.MapFunc(mapHandle)).Start(context.Background())
-}
-```
+You can build your own UDF in multiple languages.
 
 Check the links below to see the UDF examples for different languages.
 
-- [Python](https://github.com/numaproj/numaflow-python/tree/main/examples/function)
-- [Golang](https://github.com/numaproj/numaflow-go/tree/main/pkg/function/examples)
-- [Java](https://github.com/numaproj/numaflow-java/tree/main/examples/src/main/java/io/numaproj/numaflow/examples/function)
+- [Python](https://github.com/numaproj/numaflow-python/tree/main/examples/map/)
+- [Golang](https://github.com/numaproj/numaflow-go/tree/main/pkg/mapper/examples/)
+- [Java](https://github.com/numaproj/numaflow-java/tree/main/examples/src/main/java/io/numaproj/numaflow/examples/map/)
 
 After building a docker image for the written UDF, specify the image as below in the vertex spec.
 
@@ -49,8 +29,8 @@ spec:
 
 ### Streaming Mode
 
-In cases the map function generates more than one outputs (e.g. flat map), the UDF can be
-configured to run in a streaming mode instead of batching which is the default mode.
+In cases the map function generates more than one output (e.g., flat map), the UDF can be
+configured to run in a streaming mode instead of batching, which is the default mode.
 In streaming mode, the messages will be pushed to the downstream vertices once generated
 instead of in a batch at the end. The streaming mode can be enabled by setting the annotation
 `numaflow.numaproj.io/map-stream` to `true` in the vertex spec.
@@ -68,13 +48,13 @@ spec:
 
 Check the links below to see the UDF examples in streaming mode for different languages.
 
-- [Python](https://github.com/numaproj/numaflow-python/tree/main/examples/function/flatmap_stream)
-- [Golang](https://github.com/numaproj/numaflow-go/tree/main/pkg/function/examples/flatmap_stream)
-- [Java](https://github.com/numaproj/numaflow-java/tree/main/examples/src/main/java/io/numaproj/numaflow/examples/function/map/flatmapstream)
+- [Python](https://github.com/numaproj/numaflow-python/tree/main/examples/mapstream/flatmap_stream/)
+- [Golang](https://github.com/numaproj/numaflow-go/tree/main/pkg/mapstreamer/examples/flatmap_stream/)
+- [Java](https://github.com/numaproj/numaflow-java/tree/main/examples/src/main/java/io/numaproj/numaflow/examples/mapstream/flatmapstream/)
 
 ### Available Environment Variables
 
-Some environment variables are available in the user defined function container, they might be useful in you own UDF implementation.
+Some environment variables are available in the user defined function container, they might be useful in your own UDF implementation.
 
 - `NUMAFLOW_NAMESPACE` - Namespace.
 - `NUMAFLOW_POD` - Pod name.
````
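To illustrate the streaming (flat map) mode the new links point at, a handler might emit several messages per input as in this sketch; it assumes the numaflow-go `mapstreamer` package (`MapStreamerFunc`, a `chan<- Message` parameter, `NewServer`), so verify against the SDK version you use.

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/numaproj/numaflow-go/pkg/mapstreamer"
)

// mapStreamFn splits the payload on commas and pushes each part
// downstream as soon as it is produced, instead of batching.
func mapStreamFn(ctx context.Context, keys []string, d mapstreamer.Datum, messageCh chan<- mapstreamer.Message) {
	defer close(messageCh) // signal that this datum produced all its outputs
	for _, s := range strings.Split(string(d.Value()), ",") {
		messageCh <- mapstreamer.NewMessage([]byte(s)).WithKeys(keys)
	}
}

func main() {
	if err := mapstreamer.NewServer(mapstreamer.MapStreamerFunc(mapStreamFn)).Start(context.Background()); err != nil {
		log.Panic(err)
	}
}
```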

docs/user-guide/user-defined-functions/reduce/reduce.md (+4 -4)

```diff
@@ -4,12 +4,12 @@
 
 Reduce is one of the most commonly used abstractions in a stream processing pipeline to define
 aggregation functions on a stream of data. It is the reduce feature that helps us solve problems like
-"performs a summary operation(such as counting the number of occurrence of a key, yielding user login
-frequencies), etc."Since the input an unbounded stream (with infinite entries), we need an additional
+"performs a summary operation(such as counting the number of occurrences of a key, yielding user login
+frequencies), etc. "Since the input is an unbounded stream (with infinite entries), we need an additional
 parameter to convert the unbounded problem to a bounded problem and provide results on that. That
 bounding condition is "time", eg, "number of users logged in per minute". So while processing an
 unbounded stream of data, we need a way to group elements into finite chunks using time. To build these
-chunks the reduce function is applied to the set of records produced using the concept of [windowing](./windowing/windowing.md).
+chunks, the reduce function is applied to the set of records produced using the concept of [windowing](./windowing/windowing.md).
 
 ## Reduce Pseudo code
 
@@ -63,7 +63,7 @@ The reduce supports parallelism processing by defining a `partitions` in the ver
 
 It is wrong to give a `partitions` > `1` if it is a _non-keyed_ vertex (`keyed: false`).
 
-There are a couple of [examples](examples.md) that demonstrates Fixed windows, Sliding windows,
+There are a couple of [examples](examples.md) that demonstrate Fixed windows, Sliding windows,
 chaining of windows, keyed streams, etc.
 
 ## Time Characteristics
```
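The "number of users logged in per minute" example reads naturally as a worked sketch. This is plain Go illustrating the fixed-window idea only, not the Numaflow reduce SDK; in Numaflow the window comes from the vertex spec and the reduce UDF sees one window's records at a time.

```go
package main

import (
	"fmt"
	"time"
)

// event is one element of the unbounded stream.
type event struct {
	key string
	t   time.Time
}

// countPerWindow buckets events into fixed windows by event time and
// counts occurrences per key, e.g. logins per user per minute.
func countPerWindow(events []event, window time.Duration) map[time.Time]map[string]int {
	out := make(map[time.Time]map[string]int)
	for _, e := range events {
		start := e.t.Truncate(window) // start of the window this event falls into
		if out[start] == nil {
			out[start] = make(map[string]int)
		}
		out[start][e.key]++
	}
	return out
}

func main() {
	now := time.Now().Truncate(time.Minute)
	evs := []event{
		{"alice", now}, {"bob", now.Add(10 * time.Second)},
		{"alice", now.Add(90 * time.Second)}, // lands in the next window
	}
	for start, counts := range countPerWindow(evs, time.Minute) {
		fmt.Println(start.Format("15:04"), counts)
	}
}
```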

mkdocs.yml (+1 -0)

```diff
@@ -51,6 +51,7 @@ nav:
       - user-guide/sources/http.md
       - user-guide/sources/kafka.md
       - user-guide/sources/nats.md
+      - user-guide/sources/user-defined-sources.md
       - Data Transformer:
          - Overview: "user-guide/sources/transformer/overview.md"
          - Built-in Transformers:
```
