Prediction : Rain in Australia

Project data: Rain in Australia

Using measurements collected across Australia, such as temperature, wind direction, and humidity, we can try to predict whether it will rain the following day.

Target : RainTomorrow (Yes or No)

Why this dataset

Weather forecasts are something we constantly complain about, since even the meteorological office often gets them wrong. I wanted to use machine learning to see how accurately we could predict the weather ourselves.

  • Problem type -> since the task is to decide between Yes and No, we approach it as a classification problem.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

from category_encoders import OrdinalEncoder
from lightgbm import LGBMClassifier
from imblearn.over_sampling import RandomOverSampler
from scipy.stats import randint, uniform

pd.read_csv('/Users/wooseokpark/Documents/codestates/rain-in-aus/weatherAUS.csv').head()
         Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  WindGustSpeed WindDir9am  ...  Humidity9am  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm RainToday RainTomorrow
0  2008-12-01   Albury     13.4     22.9       0.6          NaN       NaN           W           44.0          W  ...         71.0         22.0       1007.7       1007.1       8.0       NaN     16.9     21.8        No           No
1  2008-12-02   Albury      7.4     25.1       0.0          NaN       NaN         WNW           44.0        NNW  ...         44.0         25.0       1010.6       1007.8       NaN       NaN     17.2     24.3        No           No
2  2008-12-03   Albury     12.9     25.7       0.0          NaN       NaN         WSW           46.0          W  ...         38.0         30.0       1007.6       1008.7       NaN       2.0     21.0     23.2        No           No
3  2008-12-04   Albury      9.2     28.0       0.0          NaN       NaN          NE           24.0         SE  ...         45.0         16.0       1017.6       1012.8       NaN       NaN     18.1     26.5        No           No
4  2008-12-05   Albury     17.5     32.3       1.0          NaN       NaN           W           41.0        ENE  ...         82.0         33.0       1010.8       1006.0       7.0       8.0     17.8     29.7        No           No

[5 rows x 23 columns]

Baseline model: always predict that it does not rain

aus_base = pd.read_csv('/Users/wooseokpark/Documents/codestates/rain-in-aus/weatherAUS.csv')

# Map Yes -> 1, No -> 0
def yesno(x):
    if x == 'Yes':
        return 1
    elif x == 'No':
        return 0

aus_base['RainToday'] = aus_base['RainToday'].apply(yesno)
aus_base['RainTomorrow'] = aus_base['RainTomorrow'].apply(yesno)

# Drop rows where either rain column is missing
aus_base = aus_base.dropna(subset=['RainToday', 'RainTomorrow'])

# Baseline model: always predict that it does not rain
y_pred_base = pd.DataFrame({'RainTomorrow': np.zeros(len(aus_base['RainToday']))})
y_test = aus_base[['RainTomorrow']]

report = classification_report(y_test[:-1], y_pred_base[:-1])
print(report)
              precision    recall  f1-score   support

         0.0       0.78      1.00      0.88    109585
         1.0       0.00      0.00      0.00     31201

    accuracy                           0.78    140786
   macro avg       0.39      0.50      0.44    140786
weighted avg       0.61      0.78      0.68    140786



/Users/wooseokpark/miniforge3/envs/kaggle/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Baseline model result:
Accuracy already reaches 78%, simply because the target is heavily imbalanced.
Macro-averaged F1 score: 0.44
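
As a quick check on that imbalance, the class proportions can be read straight off the target column (a minimal sketch using the aus_base dataframe prepared above):

# Class balance of the target: roughly 78% "no rain" vs 22% "rain",
# i.e. about a 3.5:1 imbalance.
aus_base['RainTomorrow'].value_counts(normalize=True)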

EDA and data preprocessing

# Examine correlations between features.
# Features with high correlation will be merged or dropped.

plt.figure(figsize=(10,10))
sns.heatmap(aus_base.corr(),
           annot=True,
           cmap='rainbow')
[Figure: correlation heatmap of all numeric features]

aus_base_high = aus_base.corr().copy()
aus_base_high[~(aus_base_high > 0.7) ] = 0

plt.figure(figsize=(10,10))
sns.heatmap(aus_base_high,
           annot=True,
           )
[Figure: heatmap highlighting feature pairs with correlation above 0.7]

The temperature features are highly correlated with one another, and so are the pressure features.
We therefore combine them into mean temperature and mean pressure features.

Checking missing values

%matplotlib inline
msno.bar(aus_base)
[Figure: missingno bar chart of non-null counts per column]

Evaporation, Sunshine, and the Cloud columns have far too many missing values and are hard to impute, so all of these columns are removed.

Data preprocessing

aus = pd.read_csv('/Users/wooseokpark/Documents/codestates/rain-in-aus/weatherAUS.csv')
print(aus.shape)


# Drop rows where RainToday or RainTomorrow is missing
aus = aus.dropna(subset=['RainToday','RainTomorrow'])

# Evaporation, Sunshine, and Cloud have too many missing values and are hard to impute, so drop them
aus = aus.drop(['Evaporation','Sunshine','Cloud9am','Cloud3pm'], axis=1)

# Drop the remaining rows with missing values
aus = aus.dropna()
print('After dropna:',aus.shape)


# Map RainToday and RainTomorrow values to 0 and 1
def yesno(x):
    if x == 'Yes':
        return 1
    elif x == 'No':
        return 0
aus['RainToday'] = aus['RainToday'].apply(yesno)
aus['RainTomorrow'] = aus['RainTomorrow'].apply(yesno)

# Compute MeanTemp
aus['MeanTemp'] = aus.loc[:,['MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm']].mean(axis=1)

# Compute MeanPressure
aus['MeanPressure'] = aus.loc[:,['Pressure9am', 'Pressure3pm']].mean(axis=1)




(145460, 23)
After dropna: (112925, 19)

WindDir (wind direction) takes one of 16 compass values. These were manually encoded as integers 0 through 15, starting from east (E = 0) and proceeding clockwise (see the note on cyclical encoding after the encoded table below).

aus.loc[:,['WindGustDir', 'WindDir9am', 'WindDir3pm']]
       WindGustDir WindDir9am WindDir3pm
0                W          W        WNW
1              WNW        NNW        WSW
2              WSW          W        WSW
3               NE         SE          E
4                W        ENE         NW
...            ...        ...        ...
145454           E        ESE          E
145455           E         SE        ENE
145456         NNW         SE          N
145457           N         SE        WNW
145458          SE        SSE          N

[112925 rows x 3 columns]

# WindDir manual encoding

def dir_to_num(valdir):

    ref = {'E':0,
         'ESE':1,
         'SE':2,
         'SSE':3,
         'S':4,
         'SSW':5,
         'SW':6,
         'WSW':7,
         'W':8,
         'WNW':9,
         'NW':10,
         'NNW':11,
         'N':12,
         'NNE':13,
         'NE':14,
         'ENE':15
         }
    return ref[valdir]

aus['WindGustDir'] = aus['WindGustDir'].apply(dir_to_num)
aus['WindDir9am'] = aus['WindDir9am'].apply(dir_to_num)
aus['WindDir3pm'] = aus['WindDir3pm'].apply(dir_to_num)

aus.loc[:,['WindGustDir', 'WindDir9am', 'WindDir3pm']]


        WindGustDir  WindDir9am  WindDir3pm
0                 8           8           9
1                 9          11           7
2                 7           8           7
3                14           2           0
4                 8          15          10
...             ...         ...         ...
145454            0           1           0
145455            0           2          15
145456           11           2          12
145457           12           2           9
145458            2           3          12

[112925 rows x 3 columns]
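
One caveat with this ordinal scheme, not explored here: E (0) and ENE (15) are neighbors on the compass but maximally distant as numbers. A common alternative is a cyclical sin/cos encoding; a minimal sketch, assuming the 0-15 columns produced above:

# Hypothetical alternative: place each direction on the unit circle so that
# compass-adjacent directions stay numerically close.
for col in ['WindGustDir', 'WindDir9am', 'WindDir3pm']:
    angle = aus[col] * (2 * np.pi / 16)   # map codes 0-15 onto [0, 2*pi)
    aus[col + '_sin'] = np.sin(angle)
    aus[col + '_cos'] = np.cos(angle)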

# Train/test split: data from 2016 onward is held out as the test set
train = aus[aus['Date'] <= '2016']
test = aus[aus['Date'] > '2016']
train.shape, test.shape

# feature and target selection
features = train.columns.drop(['MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm', 
                               'Pressure9am', 'Pressure3pm',
                               'RainTomorrow'])
target = 'RainTomorrow'


X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

Oversampling

The target classes are imbalanced at roughly 4:1, so to reduce the imbalance during training
we oversample the minority class (days where it rains tomorrow) to build an enlarged training set.
Note that since the oversampling happens before cross-validation, duplicated minority rows can land
in both the training and validation folds, so the CV scores below are likely somewhat optimistic.

# Oversample RainTomorrow = 1 data
oversample = RandomOverSampler(sampling_strategy='minority', random_state=33)

X_over, y_over = oversample.fit_resample(X_train, y_train)

X_train.shape, X_over.shape
((92513, 14), (144352, 14))

Encoding and Scale

OrdinalEncoder : Date, Location
StandardScaler : Rainfall, WindGustSpeed, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, MeanTemp, MeanPressure

(In the pipelines below the scaler runs after the encoder, so every column, including the encoded ones, ends up standardized.)
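
These two steps form the shared head of every model pipeline in the next section; shown standalone here as a sketch:

# Shared preprocessing reused by all model pipelines below (sketch only).
preprocess = make_pipeline(
    OrdinalEncoder(cols=['Date', 'Location']),  # strings -> integer codes
    StandardScaler(),                           # standardizes every column after encoding
)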

Comparing machine learning model performance

We compared LogisticRegression, DecisionTree, LightGBM, and RandomForestClassifier.
LogisticRegression was cross-validated without any parameter tuning.
For the remaining models, RandomizedSearchCV was used to search for good hyperparameters.

Logistic Regression

pipe_logreg = make_pipeline(
        OrdinalEncoder(cols=['Date','Location']),
        StandardScaler(),
        LogisticRegression()
)
    

logreg_score = cross_val_score(
                    pipe_logreg,
                    X_over,
                    y_over,
                    cv=5,
                    n_jobs=-1,
                    verbose=0,
                    scoring='f1'
                )

print(logreg_score)
print('Logistic Regression average F1 score:', logreg_score.mean())
[0.69937944 0.74522206 0.76719357 0.78049295 0.74465542]
Logistic Regression average F1 score: 0.7473886885326441

Decision Tree

pipe = make_pipeline(
        OrdinalEncoder(cols=['Date','Location']),
        StandardScaler(),
        DecisionTreeClassifier()
)
    
params = {
    'decisiontreeclassifier__max_depth': [1,3,5,7],
    'decisiontreeclassifier__min_samples_split': [2,4,6,8],

    
}
    
clf = RandomizedSearchCV(
            pipe,
            param_distributions=params,
            n_iter=50,
            cv=5,
            scoring='f1',
            verbose=1,
            n_jobs=-1
)
    
clf.fit(X_over, y_over)
/Users/wooseokpark/miniforge3/envs/kaggle/lib/python3.8/site-packages/sklearn/model_selection/_search.py:292: UserWarning: The total space of parameters 16 is smaller than n_iter=50. Running 16 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(


Fitting 5 folds for each of 16 candidates, totalling 80 fits





RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=['Date',
                                                                   'Location'])),
                                             ('standardscaler',
                                              StandardScaler()),
                                             ('decisiontreeclassifier',
                                              DecisionTreeClassifier())]),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'decisiontreeclassifier__max_depth': [1,
                                                                              3,
                                                                              5,
                                                                              7],
                                        'decisiontreeclassifier__min_samples_split': [2,
                                                                                      4,
                                                                                      6,
                                                                                      8]},
                   scoring='f1', verbose=1)
pd.DataFrame(clf.cv_results_).sort_values(by='rank_test_score').T

Condensed cv_results_ (16 candidates; min_samples_split in {2, 4, 6, 8} had no effect on the scores, so results are grouped by max_depth):

rank_test_score  max_depth  mean_test_score (F1)  std_test_score
              1          1                0.7072          0.0171
              5          3                0.6706          0.0900
              9          5                0.6646          0.0947
             13          7                0.6405          0.0717
LightGBM

pipe = make_pipeline(
        OrdinalEncoder(cols=['Date','Location']),
        StandardScaler(),
        LGBMClassifier()
)
    
params = {
    'lgbmclassifier__num_leaves': [10,30,50,70,100],
    'lgbmclassifier__max_depth': [3,5,10,-1],
    'lgbmclassifier__n_estimators': randint(50,500),
    
}
    
clf = RandomizedSearchCV(
            pipe,
            param_distributions=params,
            n_iter=50,
            cv=5,
            scoring='f1',
            verbose=1,
            n_jobs=-1
)
    
clf.fit(X_over, y_over)
Fitting 5 folds for each of 50 candidates, totalling 250 fits





RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=['Date',
                                                                   'Location'])),
                                             ('standardscaler',
                                              StandardScaler()),
                                             ('lgbmclassifier',
                                              LGBMClassifier())]),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'lgbmclassifier__max_depth': [3, 5, 10,
                                                                      -1],
                                        'lgbmclassifier__n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x17cc6c7c0>,
                                        'lgbmclassifier__num_leaves': [10, 30,
                                                                       50, 70,
                                                                       100]},
                   scoring='f1', verbose=1)
pd.DataFrame(clf.cv_results_).sort_values(by='rank_test_score').T

Condensed cv_results_ (top 5 of 50 candidates; every candidate in the top 10 used max_depth=10, while the lowest-ranked candidates all used max_depth=3 with mean F1 around 0.58):

rank_test_score  max_depth  n_estimators  num_leaves  mean_test_score (F1)
              1         10           422          70                0.6449
              2         10           401          70                0.6438
              3         10           482          50                0.6390
              4         10           376          50                0.6326
              5         10           166         100                0.6297

Random Forest

pipe = make_pipeline(
        OrdinalEncoder(cols=['Date','Location']),
        StandardScaler(),
        RandomForestClassifier()
)
    
params = {
    'randomforestclassifier__max_depth': [5, 10, 15, 20, None],
    'randomforestclassifier__n_estimators': [20,50,100,200], #50, 500
    'randomforestclassifier__max_features': uniform(0,1)

    
}
    
clf = RandomizedSearchCV(
            pipe,
            param_distributions=params,
            n_iter=20,
            cv=3,
            scoring='f1',
            verbose=1,
            n_jobs=-1
)
    
clf.fit(X_over, y_over)
Fitting 3 folds for each of 20 candidates, totalling 60 fits





RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=['Date',
                                                                   'Location'])),
                                             ('standardscaler',
                                              StandardScaler()),
                                             ('randomforestclassifier',
                                              RandomForestClassifier())]),
                   n_iter=20, n_jobs=-1,
                   param_distributions={'randomforestclassifier__max_depth': [5,
                                                                              10,
                                                                              15,
                                                                              20,
                                                                              None],
                                        'randomforestclassifier__max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x17ddd3a90>,
                                        'randomforestclassifier__n_estimators': [20,
                                                                                 50,
                                                                                 100,
                                                                                 200]},
                   scoring='f1', verbose=1)
pd.DataFrame(clf.cv_results_).sort_values(by='rank_test_score').T

Condensed cv_results_ (top 5 of 20 candidates; deeper forests consistently ranked higher):

rank_test_score  max_depth  max_features  n_estimators  mean_test_score (F1)
              1       None         0.080            50                0.7115
              2       None         0.608            50                0.6750
              3         20         0.959           100                0.6720
              4         20         0.642            50                0.6712
              5         20         0.221           100                0.6705

Best model : RandomForestClassifier - evaluation on the test set

best_pipe = clf.best_estimator_
y_pred = best_pipe.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.85      0.96      0.90     15730
           1       0.74      0.43      0.55      4682

    accuracy                           0.84     20412
   macro avg       0.80      0.69      0.72     20412
weighted avg       0.82      0.84      0.82     20412
'''
[Baseline Model classification report]
              precision    recall  f1-score   support

         0.0       0.78      1.00      0.88    109585
         1.0       0.00      0.00      0.00     31201

    accuracy                           0.78    140786
   macro avg       0.39      0.50      0.44    140786
weighted avg       0.61      0.78      0.68    140786
'''

Final model test results

Compared with the baseline model, test-set accuracy rose from 0.78 to 0.84,
and the macro-averaged F1 score rose from 0.44 to 0.72.

Model interpretation: checking how features contribute via PDP plots

y_pred = best_pipe.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.86      0.95      0.90     15730
           1       0.74      0.47      0.58      4682

    accuracy                           0.84     20412
   macro avg       0.80      0.71      0.74     20412
weighted avg       0.83      0.84      0.83     20412
X_test.columns
Index(['Date', 'Location', 'Rainfall', 'WindGustDir', 'WindGustSpeed',
       'WindDir9am', 'WindDir3pm', 'WindSpeed9am', 'WindSpeed3pm',
       'Humidity9am', 'Humidity3pm', 'RainToday', 'MeanTemp', 'MeanPressure'],
      dtype='object')
X_test.Rainfall.max()
225.0
from pdpbox.pdp import pdp_isolate, pdp_plot

# pdp_isolate needs a fully numeric dataset, so Date/Location are encoded here.
# Note: best_pipe already contains its own OrdinalEncoder fit on the raw strings,
# so the pipeline re-encodes this pre-encoded copy; kept as in the original run.
X_test_encoded = OrdinalEncoder(cols=['Date','Location']).fit_transform(X_test)

def pdpplot(feature):

    isolated = pdp_isolate(
        model=best_pipe,
        dataset=X_test_encoded,
        model_features=X_test_encoded.columns,
        feature=feature,
        grid_type='percentile', # default='percentile', or 'equal'
        num_grid_points=30 # default=10
    )
    pdp_plot(isolated, feature_name=feature)

pdpplot('Rainfall')

[Figure: partial dependence of predictions on Rainfall]

pdpplot('Date')

[Figure: partial dependence of predictions on Date]

pdpplot('MeanTemp')

[Figure: partial dependence of predictions on MeanTemp]

Conclusion

  • Using Australian weather data, we predicted whether it would rain the following day.
  • Through EDA we examined missing values and feature correlations, and performed feature engineering.
  • We built the training set by oversampling the imbalanced target.
  • We compared cross-validated training results for LogisticRegression, DecisionTree, LightGBM, and RandomForestClassifier.
  • The best performer, RandomForest, was applied to the test data and showed a substantial improvement over the baseline model.
  • With PDP plots we examined how several features contributed to the target prediction.
  • We confirmed that the model improves weather prediction to some degree.

What could be improved

  • We did not get to examine feature importances (a sketch of how we might is shown below).
  • Recall on the rainy class was still too low (the threshold sketch below is one possible remedy).
  • We would have liked to try a wider range of models and more hyperparameter tuning.
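
A minimal sketch of the feature-importance check we skipped, assuming best_pipe is the fitted RandomForest pipeline from above (a hypothetical snippet, not part of the original run):

# Impurity-based importances from the forest inside the pipeline; column
# order is preserved through the encoder and scaler steps.
rf = best_pipe.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))

And one cheap lever for the low recall on rainy days: lowering the decision threshold below the default 0.5 trades precision for recall (again a sketch; 0.3 is an illustrative choice, not a tuned value):

# Classify as rain whenever the predicted rain probability exceeds 0.3.
proba = best_pipe.predict_proba(X_test)[:, 1]
y_pred_low = (proba >= 0.3).astype(int)
print(classification_report(y_test, y_pred_low))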
