Made by Martín García (@mitgar14).
In this workshop, the World Happiness Report dataset will be used, comprising five CSV files with data from 2015 to 2019. A streaming data pipeline will be implemented using Apache Kafka. Once processed, the data will be fed into a Random Forest regression model to estimate the Happiness Score from the other scores in the dataset. The results will then be uploaded to a database, where they will be analyzed to assess the accuracy and insights of the predictions.
The tools used are:
- Python 3.10 – Download site
- Jupyter Notebook – VS Code tool for using notebooks
- Docker – Download site for Docker Desktop
- PostgreSQL – Download site
- Power BI (Desktop version) – Download site
The dependencies needed for Python are:
- python-dotenv
- kafka-python-ng
- country-converter
- pandas
- matplotlib
- seaborn
- plotly
- nbformat
- scikit-learn
- sqlalchemy
- psycopg2-binary
These libraries are listed in the Poetry project configuration file (`pyproject.toml`). The step-by-step installation is described later.
The images used in Docker are:
- confluentinc/cp-zookeeper
- confluentinc/cp-kafka
The configuration and installation of these images are handled by the Docker Compose file (`docker-compose.yml`). How these images are used is explained later.
After performing several transformations on the data, the columns to be analyzed in this workshop are as follows:
| Column | Description | Data Type |
|---|---|---|
| country | The country name, representing each nation | Object |
| continent | The continent to which each country belongs | Object |
| year | The year the data was recorded | Integer |
| economy | A measure of each country's economic status | Float |
| health | Health index indicating general well-being | Float |
| social_support | Perceived social support within each country | Float |
| freedom | Citizens' perception of freedom | Float |
| corruption_perception | Level of corruption as perceived by citizens | Float |
| generosity | Level of generosity within the country | Float |
| happiness_rank | Global ranking based on happiness score | Integer |
| happiness_score | Overall happiness score for each country | Float |
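As a rough illustration of what these transformations involve, here is a hedged sketch in pandas that concatenates the yearly CSVs and harmonizes the column names. The file paths and the rename map are assumptions (each year's CSV uses slightly different headers); the actual cleaning lives in the EDA notebook.

```python
# Hedged sketch of the kind of cleaning done in 01-EDA.ipynb.
# File paths and the rename map are assumptions: each year's CSV uses
# slightly different headers, so the real notebook maps them per file.
import pandas as pd
import country_converter as coco

RENAMES = {  # 2015-style headers (illustrative only)
    "Country": "country",
    "Happiness Rank": "happiness_rank",
    "Happiness Score": "happiness_score",
    "Economy (GDP per Capita)": "economy",
    "Family": "social_support",
    "Health (Life Expectancy)": "health",
    "Freedom": "freedom",
    "Trust (Government Corruption)": "corruption_perception",
    "Generosity": "generosity",
}

frames = []
for year in range(2015, 2020):
    df = pd.read_csv(f"data/{year}.csv").rename(columns=RENAMES)
    df["year"] = year
    frames.append(df)

whr = pd.concat(frames, ignore_index=True)

# country-converter derives the continent column from the country name.
whr["continent"] = coco.convert(names=whr["country"].tolist(), to="continent")
```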
Execute the following command to clone the repository:
`git clone https://github.com/mitgar14/etl-workshop-3.git`
This project uses several environment variables stored in a file named `.env`. Create it as follows:
1. Create a directory named `env` inside the cloned repository.
2. Inside it, create a file called `.env`.
3. In that file, declare the following 5 environment variables. Note that the values are written without the string notation, i.e. without double quotes (`"`):
# PostgreSQL Variables
# PG_HOST: Specifies the hostname or IP address of the PostgreSQL server.
PG_HOST = # db-server.example.com
# PG_PORT: Defines the port used to connect to the PostgreSQL database.
PG_PORT = # 5432 (default PostgreSQL port)
# PG_USER: The username for authenticating with the PostgreSQL database.
PG_USER = # your-postgresql-username
# PG_PASSWORD: The password for authenticating with the PostgreSQL database.
PG_PASSWORD = # your-postgresql-password
# PG_DATABASE: The name of the PostgreSQL database to connect to.
PG_DATABASE = # your-database-name
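For reference, here is a minimal sketch of how these variables can be read in Python with python-dotenv (which is in the dependency list). The `env/.env` path follows the steps above; the variable handling in the actual scripts may differ.

```python
# Minimal sketch, assuming the .env file lives at env/.env as created above.
import os
from dotenv import load_dotenv

load_dotenv(dotenv_path="env/.env")

PG_HOST = os.getenv("PG_HOST")
PG_PORT = os.getenv("PG_PORT")
PG_USER = os.getenv("PG_USER")
PG_PASSWORD = os.getenv("PG_PASSWORD")
PG_DATABASE = os.getenv("PG_DATABASE")
```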
To install Poetry, follow the instructions in its official documentation.
1. Enter the Poetry shell with `poetry shell`.
2. Once the virtual environment is created, execute `poetry install` to install the dependencies. If you run into an error with the `.lock` file, execute `poetry lock` to fix it.
3. Now you can run the notebooks!
Run the three notebooks in the following order (you can run each one by pressing the "Execute All" button):
- `01-EDA.ipynb`
- `02-model_training.ipynb`
- `03-metrics.ipynb`
Remember to choose the right Python kernel when running each notebook.
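To give an idea of what the model-training notebook does, here is a hedged sketch of a Random Forest regression on the features listed earlier. The actual feature selection, split, and hyperparameters in `02-model_training.ipynb` may differ; `whr` refers to the cleaned dataframe from the EDA step.

```python
# Hedged sketch of the regression step; the real notebook's choices may differ.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

FEATURES = ["economy", "health", "social_support",
            "freedom", "corruption_perception", "generosity"]

X = whr[FEATURES]            # `whr` is the cleaned dataframe from the EDA sketch
y = whr["happiness_score"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42   # 70/30 split is an assumption
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```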
To perform the Data Extraction and Loading tasks, we recommend using a cloud database service and deploying your PostgreSQL database in the cloud.
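Once the connection details are in `.env`, loading results into PostgreSQL can look like the following sketch with SQLAlchemy and psycopg2 (both in the dependency list). The table name `whr_predictions` and the contents of `predictions_df` are illustrative assumptions, not the project's actual schema.

```python
# Illustrative sketch only: the engine URL is built from the .env variables,
# and "whr_predictions" is a hypothetical table name.
import os
import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv(dotenv_path="env/.env")

engine = create_engine(
    "postgresql+psycopg2://{user}:{password}@{host}:{port}/{db}".format(
        user=os.getenv("PG_USER"),
        password=os.getenv("PG_PASSWORD"),
        host=os.getenv("PG_HOST"),
        port=os.getenv("PG_PORT"),
        db=os.getenv("PG_DATABASE"),
    )
)

# Stand-in for the real model output dataframe.
predictions_df = pd.DataFrame(
    {"country": ["Finland"], "year": [2019],
     "happiness_score": [7.769], "predicted_score": [7.61]}
)

predictions_df.to_sql("whr_predictions", engine, if_exists="replace", index=False)
```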
> [!IMPORTANT]
> Make sure that Docker is installed on your system.
To set up Kafka using Docker and run the `producer.py` and `consumer.py` scripts located in the `./kafka` directory, follow these steps:
1. **Start the Kafka and Zookeeper services**

   Open your terminal or command prompt and navigate to the root directory of your cloned repository:

   `cd etl-workshop-3`

   Use the provided `docker-compose.yml` file to start the Kafka and Zookeeper services:

   `docker-compose up -d`

   This command starts the services in detached mode. Docker will pull the necessary images if they are not already available locally.

   Check that the Kafka and Zookeeper containers are up and running:

   `docker ps`

   You should see `kafka_docker` and `zookeeper_docker` in the list of running containers.
2. **Create a Kafka topic**

   Create the topic that your producer and consumer will use. Name it `whr_kafka_topic` so that it matches the name used in the Python scripts:

   `docker exec -it kafka_docker kafka-topics --create --topic whr_kafka_topic --bootstrap-server localhost:9092`

   List the available topics to confirm that `whr_kafka_topic` has been created:

   `docker exec -it kafka_docker kafka-topics --list --bootstrap-server localhost:9092`
3. **Run the producer script**

   In Visual Studio Code, navigate to the `./kafka` directory and run the `producer.py` script in a dedicated terminal. The producer will start sending messages to `whr_kafka_topic` (a minimal sketch of both scripts appears after these steps).
. -
π Run the Consumer Script
Now navigate to the
./kafka
directory, and run theconsumer.py
script in a dedicated terminal. You should now see the consumer receiving it in real-time. -
5. **Shut down the services**

   When you're finished, you can stop and remove the Kafka and Zookeeper containers:

   `docker-compose down`
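For orientation, here is a hedged, minimal skeleton of what `producer.py` and `consumer.py` do with kafka-python-ng. The real scripts add the dataframe streaming, prediction, and database-loading logic on top of this, so treat it only as the messaging core; the function names here are illustrative.

```python
# Minimal messaging skeleton, assuming kafka-python-ng and the topic created
# above. The real producer.py/consumer.py wrap this with the pipeline logic.
import json
from kafka import KafkaConsumer, KafkaProducer

TOPIC = "whr_kafka_topic"

def produce(rows):
    """Send each row (a dict) to the topic as JSON."""
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for row in rows:
        producer.send(TOPIC, value=row)
    producer.flush()

def consume():
    """Read messages from the topic and yield the decoded payloads."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    for message in consumer:
        yield message.value
```

Calling `produce(...)` from one terminal and iterating over `consume()` in another mirrors steps 3 and 4 above.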
Thanks for visiting my project. Any suggestion or contribution is always welcome.