Workshop #3: Machine Learning and Data Streaming

Made by Martín García (@mitgar14).

Overview ✨

This workshop uses the World Happiness Report dataset, which comprises four CSV files with data from 2015 to 2019. A streaming data pipeline is implemented with Apache Kafka: once processed, the data is fed into a Random Forest regression model that estimates the Happiness Score from the other scores in the dataset. The predictions are then loaded into a database, where they are analyzed to assess their accuracy and the insights they provide.

The tools used are:

  • Python
  • Apache Kafka
  • scikit-learn
  • PostgreSQL
  • Docker
  • Poetry

The dependencies needed for Python are:

  • python-dotenv
  • kafka-python-ng
  • country-converter
  • pandas
  • matplotlib
  • seaborn
  • plotly
  • nbformat
  • scikit-learn
  • sqlalchemy
  • psycopg2-binary

These libraries are declared in the Poetry project config file (pyproject.toml). The step-by-step installation is described later.


The images used in Docker are:

  • confluentinc/cp-zookeeper
  • confluentinc/cp-kafka

The configuration and startup of these images are handled by the Docker Compose config file (docker-compose.yml). How to use them is explained later.

Dataset Information

After performing several transformations on the data, the columns to be analyzed in this workshop are as follows:

| Column | Description | Data Type |
|--------|-------------|-----------|
| country | The country name, representing each nation | Object |
| continent | The continent to which each country belongs | Object |
| year | The year the data was recorded | Integer |
| economy | A measure of each country's economic status | Float |
| health | Health index indicating general well-being | Float |
| social_support | Perceived social support within each country | Float |
| freedom | Citizens' perception of freedom | Float |
| corruption_perception | Level of corruption as perceived by citizens | Float |
| generosity | Level of generosity within the country | Float |
| happiness_rank | Global ranking based on happiness score | Integer |
| happiness_score | Overall happiness score for each country | Float |
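
To illustrate how these columns feed the model, here is a minimal scikit-learn sketch that trains a Random Forest regressor on the numeric features to predict happiness_score. The column names follow the table above, but the file path, split ratio and hyperparameters are assumptions for illustration, not the exact notebook code.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    # Hypothetical path to the transformed dataset produced during EDA.
    df = pd.read_csv("data/whr_transformed.csv")

    features = ["economy", "health", "social_support", "freedom",
                "corruption_perception", "generosity"]
    X = df[features]
    y = df["happiness_score"]

    # Hold out a test split so the predictions can be evaluated later.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    print("R² on the test split:", r2_score(y_test, model.predict(X_test)))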

Data flow

Data flow diagram (Workshop #3)

Run the project

🛠️ Clone the repository

Execute the following command to clone the repository:

  git clone https://github.com/mitgar14/etl-workshop-3.git

Demonstration of the process


🌍 Environment variables

This project uses several environment variables stored in a file named .env. Create it as follows:

  1. Create a directory named env inside the cloned repository.

  2. Inside it, create a file called .env.

  3. In that file, declare the following 5 environment variables. Note that the values are written without double quotes (no string notation):

# PostgreSQL Variables

# PG_HOST: Specifies the hostname or IP address of the PostgreSQL server.
PG_HOST = # db-server.example.com

# PG_PORT: Defines the port used to connect to the PostgreSQL database.
PG_PORT = # 5432 (default PostgreSQL port)

# PG_USER: The username for authenticating with the PostgreSQL database.
PG_USER = # your-postgresql-username

# PG_PASSWORD: The password for authenticating with the PostgreSQL database.
PG_PASSWORD = # your-postgresql-password

# PG_DATABASE: The name of the PostgreSQL database to connect to.
PG_DATABASE = # your-database-name

Demonstration of the process
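
For reference, here is a minimal sketch (not the project's exact code) of how these variables can be loaded with python-dotenv and used to open a SQLAlchemy connection to PostgreSQL. The env/.env path matches the steps above; everything else is illustrative.

    import os
    from dotenv import load_dotenv
    from sqlalchemy import create_engine

    # Load the variables declared in env/.env.
    load_dotenv("env/.env")

    # Build the PostgreSQL connection string from the environment variables.
    engine = create_engine(
        f"postgresql+psycopg2://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
        f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DATABASE')}"
    )

    # Quick connectivity check.
    with engine.connect() as conn:
        print("Connected to:", conn.engine.url.database)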


📦 Installing the dependencies with Poetry

To install Poetry, follow the official installation guide: https://python-poetry.org/docs/#installation.

  1. Enter the Poetry shell with poetry shell.

  2. Once the virtual environment is created, execute poetry install to install the dependencies. If you run into an error related to the .lock file, execute poetry lock to regenerate it.

  3. Now you can execute the notebooks!

Demonstration of the process


📔 Running the notebooks

Execute the 3 notebooks in the following order; you can run each one by pressing the "Execute All" button (a sketch of the kind of evaluation performed in 03-metrics.ipynb is shown at the end of this section):

  1. 01-EDA.ipynb
  2. 02-model_training.ipynb
  3. 03-metrics.ipynb


Remember to select the correct Python kernel (the Poetry virtual environment) when running the notebooks.

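As mentioned above, 03-metrics.ipynb evaluates the predictions stored in the database. A minimal sketch of that kind of check, assuming a hypothetical predictions table with happiness_score and predicted_happiness_score columns (the real table and column names live in the notebook), could look like this:

    import os
    import pandas as pd
    from dotenv import load_dotenv
    from sqlalchemy import create_engine
    from sklearn.metrics import mean_absolute_error, r2_score

    load_dotenv("env/.env")
    engine = create_engine(
        f"postgresql+psycopg2://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
        f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}/{os.getenv('PG_DATABASE')}"
    )

    # Hypothetical table written by the Kafka consumer.
    query = "SELECT happiness_score, predicted_happiness_score FROM predictions"
    df = pd.read_sql(query, engine)

    print("R²: ", r2_score(df["happiness_score"], df["predicted_happiness_score"]))
    print("MAE:", mean_absolute_error(df["happiness_score"], df["predicted_happiness_score"]))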


☁ Deploy the Database to a Cloud Provider

To perform the data extraction and loading tasks, we recommend using a cloud database service, such as a managed PostgreSQL instance from a provider of your choice, and pointing the .env variables at that instance.


🐳 Run Kafka in Docker

Important

Make sure that Docker is installed on your system.

To set up Kafka using Docker and run your producer.py and consumer.py scripts located in the ./kafka directory, follow these steps:

  1. 🚀 Start Kafka and Zookeeper Services

    Open your terminal or command prompt and navigate to the root directory of your cloned repository:

    cd etl-workshop-3

    Use the provided docker-compose.yml file to start the Kafka and Zookeeper services:

    docker-compose up -d

    This command will start the services in detached mode. Docker will pull the necessary images if they are not already available locally.

    Check if the Kafka and Zookeeper containers are up and running:

    docker ps

    You should see kafka_docker and zookeeper_docker in the list of running containers.

    Demonstration of the process

  2. 📌 Create a Kafka Topic

    Create a Kafka topic for the producer and consumer to use. Make sure to name it whr_kafka_topic so it matches the topic name expected by the Python scripts:

    docker exec -it kafka_docker kafka-topics --create --topic whr_kafka_topic --bootstrap-server localhost:9092

    List the available topics to confirm that the whr_kafka_topic has been created:

    docker exec -it kafka_docker kafka-topics --list --bootstrap-server localhost:9092


  3. 🏃 Run the Producer Script

    In Visual Studio Code, navigate to the ./kafka directory and run the producer.py script in a dedicated terminal. The producer will start sending messages to whr_kafka_topic (simplified sketches of the producer and consumer are included after these steps).


  4. 👂 Run the Consumer Script

    Now, in another dedicated terminal in the ./kafka directory, run the consumer.py script. You should see the consumer receiving the messages in real time.


  5. 🛑 Shut Down the Services

    When you're finished, you can stop and remove the Kafka and Zookeeper containers:

    docker-compose down

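For reference, here is a simplified sketch of what a kafka-python-ng producer for this topic can look like. The actual producer.py in ./kafka contains the project's real logic; the file path and payload handling below are assumptions for illustration.

    # Simplified producer sketch (not the repository's producer.py).
    import json
    import pandas as pd
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        # default=str is a safety net for values that are not JSON serializable.
        value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
    )

    # Hypothetical transformed dataset; each row is sent as one message.
    df = pd.read_csv("data/whr_transformed.csv")
    for record in df.to_dict(orient="records"):
        producer.send("whr_kafka_topic", value=record)

    producer.flush()

And a minimal consumer sketch; per the overview, the project's consumer.py also predicts happiness_score with the trained model and loads the results into PostgreSQL, while this sketch only prints the incoming messages.

    # Simplified consumer sketch (not the repository's consumer.py).
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "whr_kafka_topic",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    # Each message is one record streamed by the producer.
    for message in consumer:
        print(message.value)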

Thank you! 💕

Thanks for visiting my project. Any suggestions or contributions are always welcome 🐍.
