This project applies machine learning to predict passenger survival on the Titanic. The dataset is processed, and a Random Forest classifier is trained to predict whether a passenger survived based on available features.
The dataset used is the Titanic dataset from Kaggle, which contains passenger details such as age, gender, fare, and class.
- Read the CSV file: The dataset is loaded using `pandas.read_csv()`.
- Preprocess the data:
  - The target variable (`Survived`) is extracted.
  - Unnecessary columns (`Name`, `Ticket`, `Cabin`) are dropped.
  - Categorical features are converted into numerical representations using one-hot encoding (`pd.get_dummies()`).
  - Missing values in the `Age` column are filled with the mean age.
- Split the dataset:
  - The dataset is split into training (70%) and testing (30%) sets using `train_test_split()`.
- Train the model:
  - A `RandomForestClassifier` with 100 trees and a max depth of 5 is trained on the training data.
- Make Predictions:
  - The trained model predicts survival on the test set.
- Evaluate Accuracy:
  - The model's accuracy is calculated using `np.mean(predictions == y_test)`, achieving ~85% accuracy.
- Save Results:
  - Predictions can be saved to a CSV file for submission.
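
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names (`preprocess`, `train_and_evaluate`, `save_predictions`), the `seed` parameter, and the `PassengerId`/`Survived` output layout are assumptions made for the example. It also assumes the input has the usual Kaggle Titanic columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def preprocess(df):
    """Return (X, y) using the preprocessing steps listed above."""
    # Extract the target variable.
    y = df["Survived"]
    # Drop the target and the columns described as unnecessary.
    X = df.drop(columns=["Survived", "Name", "Ticket", "Cabin"])
    # One-hot encode the remaining categorical columns (e.g. Sex, Embarked).
    X = pd.get_dummies(X)
    # Fill missing ages with the mean age.
    X["Age"] = X["Age"].fillna(X["Age"].mean())
    return X, y


def train_and_evaluate(df, seed=0):
    """Split 70/30, train the forest, and score it on the held-out 30%."""
    X, y = preprocess(df)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # 100 trees with a maximum depth of 5, as described above.
    model = RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=seed
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # Accuracy is the fraction of test rows predicted correctly.
    accuracy = np.mean(predictions == y_test)
    return model, X_test, predictions, accuracy


def save_predictions(passenger_ids, predictions, path):
    """Write predictions to CSV (assumed layout: PassengerId, Survived)."""
    out = pd.DataFrame(
        {"PassengerId": np.asarray(passenger_ids), "Survived": predictions}
    )
    out.to_csv(path, index=False)
```

Assuming the Kaggle training file is saved locally as `train.csv` (the file name is an assumption), the functions could be called as `model, X_test, predictions, accuracy = train_and_evaluate(pd.read_csv("train.csv"))`, followed by `save_predictions(X_test["PassengerId"], predictions, "predictions.csv")`. The accuracy will vary with the random split.
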
Ensure you have the following Python libraries installed:
pip install pandas numpy scikit-learn
- Tune hyperparameters for better performance.
- Use feature engineering to extract more insights from existing data.
- Try different models like Logistic Regression, XGBoost, or Neural Networks.
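
As one possible starting point for the hyperparameter tuning idea, the sketch below runs a small cross-validated grid search with scikit-learn's `GridSearchCV`. It uses synthetic data from `make_classification` as a stand-in for the preprocessed Titanic features, and the grid values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with the preprocessed Titanic X and y.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values to try (illustrative only).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_  # best combination found
best_score = search.best_score_    # its mean cross-validated accuracy
print(best_params, round(best_score, 3))
```
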