WOE and IV Values | Python Code

Chi-Heng Hsiung
Dec 22, 2020
Selecting variables in your data is like choosing a partner: work out which qualities matter most to you.

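The code below computes WOE and IV. They are typically used to pick important variables for a logistic regression, and many financial institutions today consult a variable's IV when building scorecards. For the definitions of WOE (Weight of Evidence) and IV (Information Value), see the preamble of the code.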
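For reference, the quantities the code below computes can be written out as follows. The notation is my own and simply mirrors the column names in the code: for each group i of the variable, Good_i is the number of rows with target = 0 and Bad_i the number with target = 1.

```latex
\begin{aligned}
\text{DistGood}_i &= \frac{\text{Good}_i}{\sum_j \text{Good}_j}, \qquad
\text{DistBad}_i = \frac{\text{Bad}_i}{\sum_j \text{Bad}_j} \\
\text{WoE}_i &= \ln\!\left(\frac{\text{DistGood}_i}{\text{DistBad}_i}\right) \\
\text{IV}_i &= \left(\text{DistGood}_i - \text{DistBad}_i\right)\,\text{WoE}_i, \qquad
\text{IV} = \sum_i \text{IV}_i
\end{aligned}
```

When a group has no good or no bad rows, WoE_i is infinite; the code replaces it with 0, so that group contributes nothing to IV.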

Input: (df = full dataset, feature = variable of interest, target = target variable)

Output: IV per group, e.g. [age_0_50 = 0.003, age_50_70 = 0.0007, age_70_inf = 0.07]

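1. First, install the Python packages from a terminal
  • Windows: press Windows key+R, type cmd, and confirm to open a terminal
  • Mac: press cmd+space, type terminal, and confirm to open a terminal
pip install pandas numpy
(collections is part of the Python standard library and does not need to be installed.)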

2. Run the following Python code. The demo data is stored in Colab and can also be downloaded from the web page.

import pandas as pd
import numpy as np
from collections import Counter


def get_IV(df, feature, target, cutpoint=None):
    # Work on a copy so the caller's DataFrame is left unchanged
    df = df.copy()

    # Continuous variable: assign each row to a (lower, upper] interval;
    # the last cutpoint opens an unbounded "_inf" group
    if pd.api.types.is_numeric_dtype(df[feature]) and (cutpoint is not None):
        for i in range(len(cutpoint)):
            if len(cutpoint) > i + 1:
                df.loc[(cutpoint[i] < df[feature]) & (df[feature] <= cutpoint[i + 1]), feature + 'switch'] = \
                    feature + '_' + str(cutpoint[i]) + '_' + str(cutpoint[i + 1])
            elif len(cutpoint) == i + 1:
                df.loc[(cutpoint[i] < df[feature]), feature + 'switch'] = \
                    feature + '_' + str(cutpoint[i]) + '_inf'
        feature = feature + 'switch'

    # Missing values (and values that fall in no interval) form their own 'NULL' group.
    # This runs after binning so that NaNs do not turn a numeric column into strings.
    df[feature] = df[feature].fillna('NULL')

    # Count rows per group, separately for each target class
    Normal = pd.DataFrame.from_dict(Counter(df[df[target] == 0][feature]), orient='index').reset_index()  # normal (target == 0)
    Fraud = pd.DataFrame.from_dict(Counter(df[df[target] == 1][feature]), orient='index').reset_index()   # default (target == 1)
    data = pd.merge(Normal, Fraud, how='outer', on='index')
    data.fillna(0, inplace=True)
    data.rename(columns={'index': 'Value', '0_x': 'Good', '0_y': 'Bad'}, inplace=True)
    data['Variable'] = feature
    total_good = df[df[target] == 0].count()[feature]
    total_bad = df[df[target] == 1].count()[feature]

    # WOE values
    data['Distribution Good'] = data['Good'] / total_good
    data['Distribution Bad'] = data['Bad'] / total_bad
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])
    # A group with no good or no bad rows has an infinite WoE; set it to 0
    data = data.replace({'WoE': {np.inf: 0, -np.inf: 0}})

    # IV
    data['IV'] = data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])
    # data = data.sort_values(by=['Variable', 'Value'], ascending=[True, True])
    data = data.sort_values(by=['IV'], ascending=[False])
    data.index = range(len(data.index))

    # Total IV of the variable, repeated on every row
    data['Variable sum IV'] = data['IV'].sum()
    return data
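To see the WoE and IV arithmetic on numbers small enough to check by hand, here is a minimal sketch that recomputes the same quantities with pd.crosstab. It is not the function above, only a compact illustration under the same conventions; the toy DataFrame and its values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'month' is the variable of interest, 'target' is 0/1
toy = pd.DataFrame({
    'month':  ['oct', 'oct', 'oct', 'may', 'may', 'may', 'mar', 'mar'],
    'target': [0,     1,     1,     0,     0,     1,     0,     0],
})

# Rows = category, columns = target value (0 = good, 1 = bad)
counts = pd.crosstab(toy['month'], toy['target'])

# Share of all goods / all bads that falls in each category
dist_good = counts[0] / counts[0].sum()
dist_bad = counts[1] / counts[1].sum()

woe = np.log(dist_good / dist_bad)
# Same convention as get_IV: an infinite WoE (no good or no bad rows) becomes 0
woe = woe.replace([np.inf, -np.inf], 0)

iv = woe * (dist_good - dist_bad)
result = pd.DataFrame({'WoE': woe, 'IV': iv}).sort_values('IV', ascending=False)
total_iv = iv.sum()

print(result)
print('Variable sum IV:', total_iv)
```

Here 'mar' has no bad rows, so its WoE is set to 0 and it adds nothing to the total. 'oct' holds 1 of 5 goods and 2 of 3 bads, so its WoE is ln(0.2 / 0.667) = ln(0.3), which makes it the largest contributor.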

3. Example usage (variable of interest: 1. continuous, 2. categorical)

# Remember to convert the target variable to 0/1 before calling get_IV
df['target'] = df['target'].map({'N': 0, 'Y': 1})
# Continuous variable (age): cutpoint is required to define the intervals
get_IV(df=df, feature='age', target='target', cutpoint=[0, 50, 70])
# Categorical variable (month): no cutpoint needed
get_IV(df=df, feature='month', target='target')

4. Returned results

(age_0_50 = 0.003, age_50_70 = 0.0007, age_70_inf = 0.07)
(oct = 0.113, may = 0.089, mar = 0.058, ......)
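The function also stores the sum of the per-group values in its 'Variable sum IV' column. As a quick check of that arithmetic on the age figures above (the dictionary is just those three numbers copied by hand):

```python
# Per-group IV values copied from the age result above
age_iv = {'age_0_50': 0.003, 'age_50_70': 0.0007, 'age_70_inf': 0.07}

# The variable's IV is the sum of its groups' IVs
age_total_iv = sum(age_iv.values())
# The group that contributes most to the total
top_group = max(age_iv, key=age_iv.get)

print(round(age_total_iv, 4))  # 0.0737
print(top_group)               # age_70_inf
```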

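Note: when the variable of interest is continuous, cutpoint must be set to define the intervals. For example, cutpoint = [0, 50, 70] splits the data into three groups: 0 to 50, 50 to 70, and above 70 (each interval excludes its lower bound and includes its upper bound). When the variable of interest is categorical, the groups are simply its categories.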
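The same right-closed intervals can be reproduced with pd.cut, which is a handy way to check which group a given value lands in. This is only a sketch: the ages are made-up values, and the labels are built to imitate the group names that get_IV produces.

```python
import numpy as np
import pandas as pd

ages = pd.Series([10, 50, 51, 70, 85])  # hypothetical values
cutpoint = [0, 50, 70]

# Right-closed bins (0, 50], (50, 70], (70, inf], labelled like get_IV's groups
edges = cutpoint + [np.inf]
labels = [f'age_{lo}_{hi}' for lo, hi in zip(cutpoint, cutpoint[1:] + ['inf'])]
binned = pd.cut(ages, bins=edges, labels=labels, right=True)

print(binned.tolist())
# ['age_0_50', 'age_0_50', 'age_50_70', 'age_50_70', 'age_70_inf']
```

Note that 50 and 70 fall into the lower of their two neighbouring groups, and that a value at or below the first cutpoint (0 here) falls into no group at all.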

The code is also available as a Google Colab notebook; after opening the page, press Shift+Enter to run it.

If this was helpful, please give it a clap; it keeps me motivated to write more!

Reference: Kaggle
