본문 바로가기
Kaggle

<Kaggle > Learn- Intro to Machine Learning (Underfitting and Overfitting)

by 바건정 2024. 11. 7.

Overfitting: capturing spruious patterns that won't recur in the future, leading to less accurate predictions.

Underfitting: failing to capture relevant patterns, leading to less accurate predictions.

 

Examples with the decision tree model

  • The most important options determine the tree's depth.
  • As the tree gets deeper, the dataset gets sliced up leaves with fewer datas (2 -> 4 -> 8 -> 16 ...)
  • Leaves with very few datas will make predictions that are quite close to those data's actual values or results, but they may make very unreliable predictions for new data.

≫ Overfitting : a model matches the training data almost perfectly, but does poorly in validation and other new data.

 

  • If a tree divides datas into only 2 or 4, each group still has a wide variety of datas.
  • → Resulting predictions may be far off for most datas, even in the training data.

≫ Underfitting: a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data.

 

The more leaves we allow the model to make, the more we move from the underfitting area in the graph to the overfitting area.

 

 

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
	model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return(mae)
    
for max_leaf_nodes in [5, 50, 500, 5000]:
	my_mae = get_mae[max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean absolute error: %d" %(max_leaf_nodes, my_mae))
    
# Max leaf nodes: 5		Mean absolute error: 347380
# Max leaf nodes: 50		Mean absolute error: 258171
# Max leaf nodes: 500		Mean absolute error: 243495
# Max leaf nodes: 5000		Mean absolute error: 254983

 

Models can suffer from either

Overfitting

: capturing spurious patterns that won't recur in the future, leading to less accurate predictions.

 

Underfitting

: failing to capture relevant patterns, again leading to less accurate predictions.