# Machine Learning Diagnosis

http://www.holehouse.org/mlclass/10_Advice_for_applying_machine_learning.html

## Diagnosing Bias vs. Variance

*High bias* means **underfit**.
*High variance* means **overfit**.

Mnemonic: **B**ias => **U**nderfit; **V**ariance => **O**verfit.

## Regularization and Bias/Variance

- Low lambda => overfit
- High lambda => underfit
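A minimal sketch of this effect, assuming ridge regression solved with the normal equation on synthetic data (all names and values here are illustrative): a very high lambda shrinks the weights toward zero and underfits, which shows up as a higher training error than a near-zero lambda.

```python
import numpy as np

# Synthetic linear data (assumption: made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

def ridge_fit(X, y, lam):
    """Ridge regression via the normal equation: (X'X + lam*I) theta = X'y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def train_error(X, y, theta):
    """Unregularized squared-error cost on the training set."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

low = train_error(X, y, ridge_fit(X, y, lam=0.01))     # flexible fit
high = train_error(X, y, ridge_fit(X, y, lam=1000.0))  # heavily shrunk -> underfit

print(low < high)  # high lambda underfits, so its training error is larger
```

Note the training error alone cannot detect the overfit side; that is what the cross validation error (below) is for.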

## Learning curves

Condition 1: High bias
More training data is **not** likely to help.

Condition 2: High variance
More training data is likely to help.
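A learning curve can be sketched by training on growing subsets and tracking two errors, assuming plain least squares on synthetic data (the data and split sizes here are illustrative): error on the examples trained on, and error on a fixed cross validation set.

```python
import numpy as np

# Synthetic linear data (assumption: made up for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=120)
X_tr, y_tr, X_cv, y_cv = X[:80], y[:80], X[80:], y[80:]

def cost(X, y, theta):
    """Squared-error cost J = (1/2m) * sum of squared residuals."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

train_err, cv_err = [], []
for i in range(10, 81, 10):
    # Fit on the first i training examples only.
    theta, *_ = np.linalg.lstsq(X_tr[:i], y_tr[:i], rcond=None)
    train_err.append(cost(X_tr[:i], y_tr[:i], theta))  # rises with i
    cv_err.append(cost(X_cv, y_cv, theta))             # typically falls with i

print(train_err[-1], cv_err[-1])
```

In a high-variance setting the two curves start far apart and converge as data is added; in a high-bias setting they converge early at a high error, which is why more data does not help there.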

## Deciding what to do next revisited

Solution for bias/variance.

Action | Result | Reason
---|---|---
More training examples | Fixes high variance | More data makes overfitting harder
Fewer features | Fixes high variance | Fewer parameters, so less overfitting
More features | Fixes high bias | More parameters, so less underfitting
More polynomial features | Fixes high bias | More parameters, so less underfitting
Decreasing lambda | Fixes high bias | Less regularization, effectively more parameters
Increasing lambda | Fixes high variance | More regularization, effectively fewer parameters

## Improved model selection

Given a data set, split it into three pieces instead of two:

- Training set (60%) - m examples
- Cross validation (CV) set (20%) - m_cv examples
- Test set (20%) - m_test examples

As before, we can calculate:

- Training error
- Cross validation error
- Test error
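The split and the three errors can be sketched as follows, assuming least squares on synthetic data (the data, seed, and split sizes are illustrative): fit on the training portion only, then evaluate the same cost on all three pieces.

```python
import numpy as np

# Synthetic data (assumption: made up for illustration).
rng = np.random.default_rng(2)
m = 100
X = rng.normal(size=(m, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.2, size=m)

# Shuffle, then take 60% / 20% / 20%.
idx = rng.permutation(m)
tr, cv, te = idx[:60], idx[60:80], idx[80:]

# Fit on the training set only.
theta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)

def error(X, y, theta):
    """Squared-error cost J = (1/2m) * sum of squared residuals."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

print(error(X[tr], y[tr], theta),  # training error
      error(X[cv], y[cv], theta),  # cross validation error
      error(X[te], y[te], theta))  # test error
```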

Using the CV set to choose the degree of polynomial d or lambda:

- The degree of a model will increase as you move towards overfitting
- Define training and cross validation error as before
- Now plot:
  - x = degree of polynomial d
  - y = error for both training and cross validation (two lines)
- Pick the d with the lowest CV error; the CV error and test set error will then be very similar
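Degree selection via the CV set can be sketched like this, assuming a polynomial fit with least squares on synthetic cubic data (the target function, names, and degree range are illustrative): fit each candidate degree on the training set and keep the one with the lowest CV error.

```python
import numpy as np

# Synthetic cubic data (assumption: made up for illustration).
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=80)
y = 1.0 - 2.0 * x + 3.0 * x**3 + rng.normal(scale=0.1, size=80)
x_tr, y_tr, x_cv, y_cv = x[:50], y[:50], x[50:], y[50:]

def poly_features(x, d):
    """Columns [1, x, x^2, ..., x^d]."""
    return np.vander(x, d + 1, increasing=True)

def cv_error(d):
    """Fit degree-d polynomial on the training set, score on the CV set."""
    theta, *_ = np.linalg.lstsq(poly_features(x_tr, d), y_tr, rcond=None)
    r = poly_features(x_cv, d) @ theta - y_cv
    return (r @ r) / (2 * len(y_cv))

errors = {d: cv_error(d) for d in range(1, 9)}
best_d = min(errors, key=errors.get)  # degree with the lowest CV error
print(best_d)
```

Because d was chosen using the CV set, report final performance on the held-out test set, not on the CV set.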
