Workshop #3: Machine Learning and Data Streaming

Made by Martín García (@mitgar14).

Overview ✨

This workshop uses the World Happiness Report dataset, which comprises four CSV files with data from 2015 to 2019. A streaming data pipeline is implemented with Apache Kafka: once processed, the data is fed into a Random Forest regression model that estimates the Happiness Score from the other scores in the dataset. The predictions are then loaded into a database, where they are analyzed to assess their accuracy and the insights they provide.

The tools used are:

  • Python
  • Apache Kafka
  • scikit-learn
  • PostgreSQL
  • Docker
  • Poetry

The dependencies needed for Python are:

  • python-dotenv
  • kafka-python-ng
  • country-converter
  • pandas
  • matplotlib
  • seaborn
  • plotly
  • nbformat
  • scikit-learn
  • sqlalchemy
  • psycopg2-binary

These libraries are declared in the Poetry project config file (pyproject.toml). The step-by-step installation is described later.


The images used in Docker are:

  • confluentinc/cp-zookeeper
  • confluentinc/cp-kafka

The configuration and startup of these images are handled by the Docker Compose config file (docker-compose.yml). How to use them is explained later.

Dataset Information

After performing several transformations on the data, the columns to be analyzed in this workshop are as follows:

| Column | Description | Data Type |
|--------|-------------|-----------|
| country | The country name, representing each nation | Object |
| continent | The continent to which each country belongs | Object |
| year | The year the data was recorded | Integer |
| economy | A measure of each country's economic status | Float |
| health | Health index indicating general well-being | Float |
| social_support | Perceived social support within each country | Float |
| freedom | Citizens' perception of freedom | Float |
| corruption_perception | Level of corruption as perceived by citizens | Float |
| generosity | Level of generosity within the country | Float |
| happiness_rank | Global ranking based on happiness score | Integer |
| happiness_score | Overall happiness score for each country | Float |
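
To illustrate how these columns feed the model, here is a minimal scikit-learn sketch that trains a Random Forest regressor on the numeric features to predict happiness_score. The column names follow the table above, but the file path, split ratio and hyperparameters are assumptions for illustration, not the exact notebook code.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    # Hypothetical path to the transformed dataset produced during EDA.
    df = pd.read_csv("data/whr_transformed.csv")

    features = ["economy", "health", "social_support", "freedom",
                "corruption_perception", "generosity"]
    X = df[features]
    y = df["happiness_score"]

    # Hold out a test split so the predictions can be evaluated later.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    print("R² on the test split:", r2_score(y_test, model.predict(X_test)))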

Data flow

Data flow diagram (Workshop #3)

Run the project

🛠️ Clone the repository

Execute the following command to clone the repository:

  git clone https://github.com/mitgar14/etl-workshop-3.git

Demonstration of the process


🌍 Environment variables

This project uses several environment variables stored in a file named .env. Create it as follows:

  1. Create a directory named env inside the cloned repository.

  2. Inside it, create a file called .env.

  3. In that file, declare the following 5 environment variables. Note that the values are written without double quotes (no string notation):

# PostgreSQL Variables

# PG_HOST: Specifies the hostname or IP address of the PostgreSQL server.
PG_HOST = # db-server.example.com

# PG_PORT: Defines the port used to connect to the PostgreSQL database.
PG_PORT = # 5432 (default PostgreSQL port)

# PG_USER: The username for authenticating with the PostgreSQL database.
PG_USER = # your-postgresql-username

# PG_PASSWORD: The password for authenticating with the PostgreSQL database.
PG_PASSWORD = # your-postgresql-password

# PG_DATABASE: The name of the PostgreSQL database to connect to.
PG_DATABASE = # your-database-name

Demonstration of the process
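
For reference, here is a minimal sketch (not the project's exact code) of how these variables can be loaded with python-dotenv and used to open a SQLAlchemy connection to PostgreSQL. The env/.env path matches the steps above; everything else is illustrative.

    import os
    from dotenv import load_dotenv
    from sqlalchemy import create_engine

    # Load the variables declared in env/.env.
    load_dotenv("env/.env")

    # Build the PostgreSQL connection string from the environment variables.
    engine = create_engine(
        f"postgresql+psycopg2://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
        f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DATABASE')}"
    )

    # Quick connectivity check.
    with engine.connect() as conn:
        print("Connected to:", conn.engine.url.database)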


📦 Installing the dependencies with Poetry

To install Poetry, follow the official installation guide: https://python-poetry.org/docs/#installation.

  1. Enter the Poetry shell with poetry shell.

  2. Once the virtual environment is created, execute poetry install to install the dependencies. If you run into an error related to the .lock file, execute poetry lock to regenerate it.

  3. Now you can execute the notebooks!

Demonstration of the process


📔 Running the notebooks

Execute the 3 notebooks in the following order; you can run each one by pressing the "Execute All" button (a sketch of the kind of evaluation performed in 03-metrics.ipynb is shown at the end of this section):

  1. 01-EDA.ipynb
  2. 02-model_training.ipynb
  3. 03-metrics.ipynb


Remember to select the correct Python kernel (the Poetry virtual environment) when running the notebooks.

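As mentioned above, 03-metrics.ipynb evaluates the predictions stored in the database. A minimal sketch of that kind of check, assuming a hypothetical predictions table with happiness_score and predicted_happiness_score columns (the real table and column names live in the notebook), could look like this:

    import os
    import pandas as pd
    from dotenv import load_dotenv
    from sqlalchemy import create_engine
    from sklearn.metrics import mean_absolute_error, r2_score

    load_dotenv("env/.env")
    engine = create_engine(
        f"postgresql+psycopg2://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
        f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DATABASE')}"
    )

    # Hypothetical table written by the Kafka consumer.
    query = "SELECT happiness_score, predicted_happiness_score FROM predictions"
    df = pd.read_sql(query, engine)

    print("R²: ", r2_score(df["happiness_score"], df["predicted_happiness_score"]))
    print("MAE:", mean_absolute_error(df["happiness_score"], df["predicted_happiness_score"]))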


☁ Deploy the Database to a Cloud Provider

To perform the data extraction and loading tasks, we recommend using a cloud database service, such as a managed PostgreSQL instance from a provider of your choice, and pointing the .env variables at that instance.


🐳 Run Kafka in Docker

Important

Make sure that Docker is installed on your system.

To set up Kafka using Docker and run your producer.py and consumer.py scripts located in the ./kafka directory, follow these steps:

  1. 🚀 Start Kafka and Zookeeper Services

    Open your terminal or command prompt and navigate to the root directory of your cloned repository:

    cd etl-workshop-3

    Use the provided docker-compose.yml file to start the Kafka and Zookeeper services:

    docker-compose up -d

    This command will start the services in detached mode. Docker will pull the necessary images if they are not already available locally.

    Check if the Kafka and Zookeeper containers are up and running:

    docker ps

    You should see kafka_docker and zookeeper_docker in the list of running containers.

    Demonstration of the process

  2. 📌 Create a Kafka Topic

    Create a Kafka topic for the producer and consumer to use. Make sure to name it whr_kafka_topic so it matches the topic name expected by the Python scripts:

    docker exec -it kafka_docker kafka-topics --create --topic whr_kafka_topic --bootstrap-server localhost:9092

    List the available topics to confirm that the whr_kafka_topic has been created:

    docker exec -it kafka_docker kafka-topics --list --bootstrap-server localhost:9092


  3. 🏃 Run the Producer Script

    In Visual Studio Code, navigate to the ./kafka directory and run the producer.py script in a dedicated terminal. The producer will start sending messages to whr_kafka_topic (simplified sketches of the producer and consumer are included after these steps).


  4. 👂 Run the Consumer Script

    Now, in another dedicated terminal in the ./kafka directory, run the consumer.py script. You should see the consumer receiving the messages in real time.


  5. 🛑 Shut Down the Services

    When you're finished, you can stop and remove the Kafka and Zookeeper containers:

    docker-compose down

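For reference, here is a simplified sketch of what a kafka-python-ng producer for this topic can look like. The actual producer.py in ./kafka contains the project's real logic; the file path and payload handling below are assumptions for illustration.

    # Simplified producer sketch (not the repository's producer.py).
    import json
    import pandas as pd
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        # default=str is a safety net for values that are not JSON serializable.
        value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
    )

    # Hypothetical transformed dataset; each row is sent as one message.
    df = pd.read_csv("data/whr_transformed.csv")
    for record in df.to_dict(orient="records"):
        producer.send("whr_kafka_topic", value=record)

    producer.flush()

And a minimal consumer sketch; per the overview, the project's consumer.py also predicts happiness_score with the trained model and loads the results into PostgreSQL, while this sketch only prints the incoming messages.

    # Simplified consumer sketch (not the repository's consumer.py).
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "whr_kafka_topic",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    # Each message is one record streamed by the producer.
    for message in consumer:
        print(message.value)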

Thank you! 💕

Thanks for visiting my project. Any suggestions or contributions are always welcome 🐍.
