Introduction to Machine Learning in Python
- Machine learning borrows heavily from fields such as statistics and
computer science.
- In machine learning, models learn rules from data.
- In supervised learning, each example in our training data is labelled with a known target.
- In popular usage, A.I. has become a near-synonym for machine learning.
- A.G.I. is the loftier goal of achieving human-like
intelligence.
- Data pre-processing is arguably the most important task in machine
learning.
- SQL is a common tool for extracting data from database systems.
- Data is typically partitioned into training and test sets (see the first sketch after this list).
- Setting random states helps to promote reproducibility.
- Loss functions allow us to define a good model.
- $y$ is a known target. $\hat{y}$ (y hat) is a prediction.
- Mean squared error is an example of a loss function (sketched below).
- After defining a loss function, we search for the optimal solution
in a process known as ‘training’.
- Optimisation is at the heart of machine learning.
- Linear regression is a popular model for regression tasks (sketched below).
- Logistic regression is a popular model for classification tasks (sketched below).
- Decision trees are a useful and easily interpretable alternative for classification tasks (sketched below).
- Classification models output probabilities that can be mapped to a predicted class.
- Validation sets are used during model development, allowing models
to be tested prior to testing on a held-out set.
- Cross-validation is a resampling technique that creates multiple validation sets (sketched after this list).
- Cross-validation can help to avoid overfitting.
- Confusion matrices are the basis for many popular performance metrics (sketched below).
- AUROC is the area under the receiver operating characteristic curve. An AUROC of 0.5 is no better than random guessing.
- TP is True Positive, meaning that our prediction hit its
target.
- Bootstrapping is a resampling technique (sketched below), sometimes confused with cross-validation.
- Bootstrapping allows us to generate a distribution of estimates,
rather than a single point estimate.
- Bootstrapping allows us to estimate uncertainty, from which confidence intervals can be computed.
- Leakage occurs when training data is contaminated with information that is not available at prediction time (see the final sketch below).
- Leakage leads to over-optimistic expectations of performance.
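
The short Python sketches below illustrate several of the points above. They assume NumPy and scikit-learn are installed; every dataset is either synthetic or bundled with scikit-learn, and none of the variable names or values come from the lesson itself. First, a minimal sketch of partitioning data into training and test sets with a fixed random state:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 examples, 2 features (toy data)
y = np.arange(10)                 # one target per example

# random_state fixes the shuffle, so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```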
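
A loss function scores how far predictions fall from known targets. A minimal sketch of mean squared error, using made-up values for $y$ and $\hat{y}$:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])      # known targets (toy values)
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # predictions (toy values)

# MSE: the mean of the squared differences between target and prediction
mse = np.mean((y - y_hat) ** 2)
print(mse)  # 0.375
```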
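
Linear regression, sketched on synthetic data generated as y = 2x + 1 plus noise, so the fitted coefficient and intercept should land near 2 and 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 1))              # one feature
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, 50)    # targets with noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # roughly [2.] and 1.
```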
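
Logistic regression, sketched on a synthetic classification problem; the sketch also shows how predicted probabilities map to a predicted class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X[:3])  # per-class probabilities
pred = clf.predict(X[:3])         # probabilities mapped to a class
print(proba)                      # each row sums to 1
print(pred)                       # here, class 1 wherever P(1) >= 0.5
```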
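
A decision tree classifier, sketched on the iris dataset that ships with scikit-learn; printing the learned rules with `export_text` is what makes trees easy to interpret:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# The fitted tree can be printed as human-readable if/else rules
print(export_text(tree, feature_names=list(data.feature_names)))
```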
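
Cross-validation, sketched with `cross_val_score`, which scores a model on several validation folds rather than on a single split:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(clf, X, y, cv=5)
print(scores, scores.mean())
```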
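
A confusion matrix and AUROC, sketched with toy labels and predicted probabilities chosen purely for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])               # toy labels
y_prob = np.array([0.1, 0.4, 0.8, 0.35, 0.9, 0.2])  # toy P(class 1)
y_pred = (y_prob >= 0.5).astype(int)                 # map to classes

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))  # 0.5 would be random guessing
```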
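
Bootstrapping, sketched as a plain NumPy loop: resample with replacement, recompute the statistic, and read a confidence interval from the resulting distribution of estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=100)  # synthetic sample

# Each replicate: resample with replacement, recompute the mean
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

# A 95% confidence interval from the percentiles of the distribution
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```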
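
Finally, leakage. One common source is fitting a scaler on the full dataset before splitting, which lets test-set statistics reach the model. A minimal sketch of avoiding this with a `Pipeline`, so the preprocessing is re-fitted on each training fold only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is fitted inside each training fold, so no information
# from the validation fold leaks into training
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())
```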