AI & Tech

How to Train a KAN Model on the Titanic Dataset for Kaggle

section-content

#AI #Tech

section-content

How to Train a KAN Model on the Titanic Dataset for Kaggle

CHECKOUT LIKE AND SUBSCRIBE TO MY FINAL SUBMISSION WITH [Score: 0.77751] here https://www.kaggle.com/code/dzehtsiarou/better-kan-titenic

YES WE KAN!

Introduction

Training machine learning models on real-world datasets is an exciting yet challenging task. The Titanic dataset on Kaggle is a great starting point for beginners to explore and practice predictive modeling. In this blog post, I will guide you through the steps to train a Knowledge Augmented Network (KAN) model on the Titanic dataset. This tutorial will cover data preprocessing, model training, and making predictions for Kaggle submission.

Step-by-Step Guide

Step 1: Import Necessary Libraries

First, we need to import essential libraries for data manipulation, model training, and preprocessing.

import pandas as pd
import torch
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

Step 2: Load the Data

We load the training and test datasets from CSV files provided by Kaggle.

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')

Step 3: Encode Categorical Variables

Next, we convert categorical columns to numerical format using LabelEncoder.

label_encoder = LabelEncoder()
for col in ['Name', 'Sex', 'Ticket']:
    train_data[col] = label_encoder.fit_transform(train_data[col])
    test_data[col] = label_encoder.fit_transform(test_data[col])

Step 4: Fill Missing Values

To handle missing data, we replace missing values in numeric columns with the column mean.

train_data.fillna(train_data.select_dtypes(include=[np.number]).mean(), inplace=True)
test_data.fillna(test_data.select_dtypes(include=[np.number]).mean(), inplace=True)

Step 5: Prepare Features and Target Variables

We prepare the features and target variables by dropping unnecessary columns and separating features from the target variable.

train_df = train_data.drop(['PassengerId', 'Survived', 'Age', 'Cabin', 'Embarked'], axis=1)
target_df = train_data['Survived']

Step 6: Split the Data

We split the data into training and test sets using train_test_split.

X_train, X_test, y_train, y_test = train_test_split(train_df, target_df, test_size=0.2, random_state=42)

Step 7: Normalize the Data

Standardizing the feature data helps in stabilizing and speeding up the training process.

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Step 8: Convert to PyTorch Tensors

We convert the data into PyTorch tensors for compatibility with the KAN model.

train_input = torch.tensor(X_train_scaled, dtype=torch.float32)
train_label = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
test_input = torch.tensor(X_test_scaled, dtype=torch.float32)
test_label = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)

Step 9: Ensure Data Integrity

It's crucial to check for and assert that there are no NaN or infinite values in the data.

assert not torch.isnan(train_input).any(), "NaNs found in train_input"
assert not torch.isnan(train_label).any(), "NaNs found in train_label"
assert not torch.isnan(test_input).any(), "NaNs found in test_input"
assert not torch.isnan(test_label).any(), "NaNs found in test_label"
assert not torch.isinf(train_input).any(), "Infinities found in train_input"
assert not torch.isinf(train_label).any(), "Infinities found in train_label"
assert not torch.isinf(test_input).any(), "Infinities found in test_input"
assert not torch.isinf(test_label).any(), "Infinities found in test_label"

Step 10: Create Dataset Dictionary

We organize the training and test data into a dictionary for the KAN model.

dataset = 

Step 11: Initialize KAN Model

Configure and initialize the KAN model with appropriate dimensions and parameters.

from kan import KAN, create_dataset

input_dim = train_input.shape[1]  ## Number of input features
output_dim = 1  ## Binary classification (Survived)
model = KAN(width=[input_dim, 3, output_dim], grid=3, k=2, seed=0)  ## Simplified model configuration

Step 12: Train Model with LBFGS Optimizer

We train the KAN model using the LBFGS optimizer.

model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10.)
#OUTPUT:
train loss: 3.77e-01 | test loss: 3.77e-01 | reg: 3.19e+01 : 100%|██| 20/20 [00:16<00:00,  1.20it/s]

Step 13: Plot Initial Model

Visualize the initial state of the KAN model.

model(dataset['train_input'])
model.plot(beta=100)

Step 14: Continue Training

Further train the KAN model to improve performance.

model.train(dataset, opt="Adam", steps=50, lr=0.001, lamb=0.01, lamb_entropy=10.)
model.plot()

Step 15: Prune the Model

Remove unnecessary neurons and connections to simplify the model.

model.prune()
model.plot(mask=True)
model = model.prune()
model(dataset['train_input'])
model.plot()

Step 16: Continue Training with Symbolic Functions

Set symbolic functions for activation functions and train the model to optimize it further.

model.train(dataset, opt="Adam", steps=50, lr=0.001)
model.plot()
mode = "auto"  ## Use automatic mode for setting symbolic functions
if mode == "manual":
    model.fix_symbolic(0,0,0,'sin')
    model.fix_symbolic(0,1,0,'x^2')
    model.fix_symbolic(1,0,0,'exp')
elif mode == "auto":
    lib = ['x','x^2','x^3','x^4','exp','log','sqrt','tanh','sin','abs']
    model.auto_symbolic(lib=lib)

#OUTPUT:
train loss: 3.52e-01 | test loss: 3.52e-01 | reg: 4.26e+00 : 100%|██| 50/50 [00:14<00:00,  3.57it/s]
fixing (0,0,0) with tanh, r2=1.0000007152557373
fixing (0,1,0) with tanh, r2=0.9095821380615234
fixing (0,2,0) with tanh, r2=1.0000007152557373
fixing (0,3,0) with tanh, r2=0.9738956093788147
fixing (0,4,0) with tanh, r2=0.7780349850654602
fixing (0,5,0) with sin, r2=0.5418534278869629
fixing (0,6,0) with sin, r2=0.4986622929573059
fixing (1,0,0) with tanh, r2=0.9674128890037537

Step 17: Train with Symbolic Functions

Train the model with symbolic functions to further optimize it.

model.train(dataset, opt="LBFGS", steps=50, lr=0.001)
symbolic_formula = model.symbolic_formula()[0][0]
print(symbolic_formula)
#OUTPUT:
train loss: 3.62e-01 | test loss: 3.62e-01 | reg: 4.81e+00 : 100%|██| 50/50 [00:06<00:00,  8.31it/s]
0.55 - 0.41*tanh(-0.43*sin(1.73*x_6 - 3.22) + 0.76*sin(5.94*x_7 + 0.13) + 7.05*tanh(0.85*x_1 + 0.44) + 10.55*tanh(5.33*x_2 - 9.2) + 6.09*tanh(4.72*x_3 + 5.35) + 2.84*tanh(0.1*x_4 - 0.54) + 2.5*tanh(1.38*x_5 - 6.28) + 14.07)

Step 18: Preprocess Test Data

Prepare the test data for making predictions.

test_df = test_data.drop(['PassengerId', 'Age', 'Cabin', 'Embarked'], axis=1)
test_df_scaled = scaler.transform(test_df)
test_input = torch.tensor(test_df_scaled, dtype=torch.float32)

Step 19: Ensure No NaNs in Test Data

Check that there are no NaN values in the preprocessed test data.

assert not torch.isnan(test_input).any(), "NaNs found in test_input"

Step 20: Predict Survival Probabilities

Use the trained KAN model to predict survival probabilities for the test data.

y_pred_sub = model(test_input).detach().numpy()

Step 21: Convert Probabilities to Binary Outcomes

Apply a threshold to convert probabilities to binary outcomes.

threshold = 0.5
y_pred_sub = (y_pred_sub >= threshold).astype(int).flatten()  ## Ensure the predictions are a flat array

Step 22: Create Submission DataFrame

Organize the predictions into a DataFrame for submission to Kaggle.

sub = pd.DataFrame()
sub['PassengerId'] = test_data['PassengerId']
sub['Survived'] = y_pred_sub

Step 23: Display Submission DataFrame

Show the first few rows of the submission DataFrame.

print(sub.head())
   PassengerId  Survived
0          892         0
1          893         1
2          894         0
3          895         0
4          896         1

Conclusion

That's all folks, I was simply following the guide to training a model with pykan and hope you can perfect it by preprocessing the data better and optimizing hyperparameters, here are suggestions:

Feature Engineering:

  • [Titles Extraction: Extract titles from names to add as a new feature.]
  • [Family Size and IsAlone: Create new features to capture family relationships.]

Improved Missing Value Handling:

  • [KNN Imputation: Use KNN imputer to fill missing values for Age and Fare.]

Model Configuration:

  • [Model Width and Grid: Enhance the model configuration for better learning capacity.]

Optimizer:

  • [LBFGS: I Used LBFGS optimizer, try to see if you can get better results with a different optimizer]

Regularization:

  • [Lambda and Entropy: Fine-tune regularization parameters to avoid overfitting.]

Related Posts

AI & Tech

The Last Anthropocentric Century

This essay is not an instruction manual, nor a prophecy, nor a TED Talk disguised as spiritual insight. It is merely an invitation to sit for a moment inside the fog before the shape emerges from it.