Kaggle Titanic - 필사 및 중요 포인트 정리

21 Oct 2021 in Study / Kaggle

Titanic Top 4% with ensemble modeling

by Yassine Ghouzam

데이터 분석 순서:

Feature Analysis and Engineering
Modeling and predicting the survival using an voting procedure

1. Feature Analysis and Engineering

Outlier detection
Since outliers can have a dramatic effect on the prediction (espacially for regression problems), i choosed to manage them.
Numerical values 중 outlier을 찾아서 제거함. Tukey method 를 따라서 (Tukey JW., 1977) 25% - 75% 바깥 범위를 잘라낸 것으로 보임.
최소 두개의 outlier을 가진 row를 outlier로 간주한다.

# Outlier detection 

def detect_outliers(df,n,features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    
    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index
        
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
        
    # select observations containing more than 2 outliers
    outlier_indices = Counter(outlier_indices)        
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    
    return multiple_outliers   

# detect outliers from Age, SibSp , Parch and Fare
Outliers_to_drop = detect_outliers(train,2,["Age","SibSp","Parch","Fare"])

Missing values Embarked 의 경우 결측치를 가장 빈번한 값인 (S) 로 채웠다.
Age
The strategy is to fill Age with the median age of similar rows according to Pclass, Parch and SibSp.
Pclass, Parch and SibSp가 같은 것들의 median age를 가지고 결측치를 채웠다.
Name Mr. Mrs. 등 타이틀은 유의미한 feature 일 수 있으므로 따로 뺀다. 특이한 타이틀 (Lady, the Countess, Capt 등) 은 ‘Rare’ 데이터로 따로 뺀다.

2. Modeling and predicting the survival using an voting procedure

Cross validate models

SVC
Decision Tree
AdaBoost
Random Forest
Extra Trees
Gradient Boosting
Multiple layer perceprton (neural network)
KNN
Logistic regression
Linear Discriminant Analysis

Hyperparameter tuning for best models
grid search 를 AdaBoost, ExtraTrees (무엇인지 알아보기), RandomForest, GradientBoosting and SVC classifiers 에 수행
n_jobs 파라미터를 4 cpu 이기 때문에 4로 설정했고 이것이 연산 시간 향상에 큰 도움을 주었다고 함. 하지만 15분 가량 걸렸다고 한다.

# Adaboost tuning 예
### META MODELING  WITH ADABOOST, RF, EXTRATREES and GRADIENTBOOSTING

# Adaboost
DTC = DecisionTreeClassifier()

adaDTC = AdaBoostClassifier(DTC, random_state=7)

ada_param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
              "base_estimator__splitter" :   ["best", "random"],
              "algorithm" : ["SAMME","SAMME.R"],
              "n_estimators" :[1,2],
              "learning_rate":  [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5]}

gsadaDTC = GridSearchCV(adaDTC,param_grid = ada_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsadaDTC.fit(X_train,Y_train)

ada_best = gsadaDTC.best_estimator_

gsadaDTC.best_score_
>>> 0.82406356413166859

Learning curves for 5 models
Feature importance 조사
Adaboost 이외에는 4개의 classifiers는 거의 비슷한 prediction을 보이고 있다.
Ensemble modeling
voting classifier를 이용하여 5개의 분류모델을 조합하고 제출하였다.

Kaggle Titanic - 필사 및 중요 포인트 정리

Titanic Top 4% with ensemble modeling

1. Feature Analysis and Engineering

2. Modeling and predicting the survival using an voting procedure

Wooseok's
Dev Blog

Error

Titanic Top 4% with ensemble modeling

1. Feature Analysis and Engineering

2. Modeling and predicting the survival using an voting procedure

Templates (for web app):

Error