Data is the Foundation of Machine Learning
Before you can train a machine learning model, you need data, and not just any data. The quality of your data determines how well your model will perform. If your data is messy, inconsistent, or biased, your model will be too. That’s why understanding data and preprocessing it properly is one of the most important steps in machine learning.
In this lesson, we’ll break down the key building blocks of machine learning data, explain why data preprocessing matters, and show you how to prepare your dataset for training.
What Are Features and Labels?
Machine learning models learn by mapping features (input data) to labels (desired outputs). Let’s define these terms clearly.
Features: The Information Your Model Learns From
A feature is an individual measurable property or characteristic of your data. Features are what the model looks at when making predictions.
Examples of features:
In a house price prediction model, features could be:
Square footage
Number of bedrooms
Location
Year built
In a spam classifier, features could be:
The number of capital letters in an email
The presence of certain words (e.g., “free,” “offer,” “win”)
Email length
Features are the raw ingredients of machine learning. The better they are, the better your model will be.
Labels: The Correct Answer
A label is what you want the model to predict. It’s the “answer” the model is trying to learn.
Examples of labels:
In a house price prediction model, the label is the price of the house.
In a spam classifier, the label is whether the email is spam or not.
In a digit recognition model, the label is the actual digit (0–9).
During training, the model sees pairs of features and labels and learns the relationship between them.
Why Data Preprocessing is Essential
Raw data is rarely perfect. It often contains:
Missing values (e.g., a dataset of customers where some entries have missing age values)
Outliers (e.g., a house that costs 10x the average price in a city)
Inconsistent formatting (e.g., date fields formatted differently across records)
Irrelevant information (e.g., unnecessary columns that don’t affect predictions)
Data preprocessing helps clean and transform data so that models can learn effectively. Without proper preprocessing, your model might learn the wrong patterns. Or worse, fail to learn anything at all.
Steps of Data Preprocessing
1. Handling Missing Data
Missing data is a common issue in datasets. There are several ways to handle it, each shown in the sketch after this list:
Remove missing values (if the dataset is large enough)
Fill missing values with averages (mean, median, or mode)
Use predictive models to estimate missing values
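For instance, the first two strategies are one-liners in Pandas. Here is a minimal sketch with a made-up two-column dataset:
import pandas as pd
import numpy as np

# Tiny illustrative dataset with gaps (hypothetical values)
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50000, 62000, np.nan, 58000]})

dropped = df.dropna()            # option 1: remove rows with missing values
filled = df.fillna(df.median())  # option 2: fill gaps with each column's median
print(filled)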
2. Encoding Categorical Variables
Machine learning models work best with numbers. If your dataset contains text-based categories (like "Male" and "Female"), you’ll need to convert them into numerical values. The two most common approaches, sketched after this list, are:
One-Hot Encoding: Converts categories into binary columns (e.g., "Red", "Blue", and "Green" become separate columns with 1s and 0s)
Label Encoding: Assigns each category a unique number (e.g., "Male" = 0, "Female" = 1)
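As a quick illustration, both encodings can be done directly in Pandas (hypothetical color data; a minimal sketch):
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to an integer code
df["color_code"] = df["color"].astype("category").cat.codes

print(one_hot)
print(df)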
3. Feature Scaling (Normalization and Standardization)
Different features can have very different scales, which can affect model training. For example, house prices might be in the range of $100,000–$1,000,000, while the number of bedrooms ranges from 1–5. Left unscaled, the larger-valued feature can dominate what the model learns.
To fix this, we use feature scaling; a short sketch follows the list:
Normalization (Min-Max Scaling): Rescales values to a range of 0 to 1.
Standardization (Z-Score Scaling): Centers data around 0 with a standard deviation of 1.
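Both transformations are simple column arithmetic. A minimal Pandas sketch with made-up numbers:
import pandas as pd

df = pd.DataFrame({"price": [100_000, 550_000, 1_000_000],
                   "bedrooms": [1, 3, 5]})

# Normalization (min-max): rescale each column to the range [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): subtract the mean, divide by the standard
# deviation (note: Pandas uses the sample standard deviation here)
standardized = (df - df.mean()) / df.std()

print(normalized)
print(standardized)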
4. Removing Outliers
Outliers can distort a model’s learning process. A house that costs $100 million in a dataset where the average price is $200,000 might mislead the model. Methods to handle outliers, sketched after this list, include:
Removing extreme values based on statistical thresholds
Using log transformation to reduce their impact
Applying robust algorithms that are less sensitive to outliers
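For example, the first two options might look like this in practice (a sketch using the common 1.5 × IQR rule; the exact threshold is a judgment call):
import numpy as np
import pandas as pd

prices = pd.Series([180_000, 200_000, 210_000, 220_000, 100_000_000])

# Statistical threshold: keep values within 1.5 * IQR of the quartiles
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
trimmed = prices[(prices >= q1 - 1.5 * iqr) & (prices <= q3 + 1.5 * iqr)]

# Log transform: compress the scale instead of dropping rows
log_prices = np.log1p(prices)

print(trimmed)
print(log_prices)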
5. Splitting Data into Training and Testing Sets
Once your data is clean and preprocessed, you need to split it into separate datasets for training and evaluation (a minimal sketch follows the list):
Training set: Used to train the model (usually 70–80% of the data)
Testing set: Used to evaluate model performance (usually 20–30% of the data)
This ensures the model learns patterns from one set of data and generalizes well to unseen data.
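A minimal way to do this split in Pandas, shuffling first so the split is random (housing.csv is the sample file shown in the next section):
import pandas as pd

df = pd.read_csv("housing.csv")

# Shuffle, then take the first 80% for training and the rest for testing
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
split = int(0.8 * len(shuffled))
train_df, test_df = shuffled.iloc[:split], shuffled.iloc[split:]

print(len(train_df), "training rows,", len(test_df), "testing rows")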
Example: Preprocessing Data in Python
Let’s see an example using Python, Pandas, and TensorFlow. We will process a dataset step by step and print the intermediate outputs to see what is happening at each stage.
We will be using a simple CSV file with some housing data that looks like this. Notice that some values are missing:
square_feet,bedrooms,bathrooms,city,price
1500,3,2,New York,550000
1800,4,3,Los Angeles,720000
2400,3,,San Francisco,980000
1700,2,1,New York,430000
2000,3,2,Los Angeles,680000
1600,3,2,San Francisco,890000
1900,,3,New York,620000
,5,3,Los Angeles,770000
2500,4,4,San Francisco,1010000
2100,3,2,New York,600000
Note: All source code and datasets can be found on my GitHub.
Understanding Categorical Encoding & Feature Scaling
Encoding Categorical Features
The function encode_categorical_features transforms categorical data into a one-hot encoded numerical format using TensorFlow’s StringLookup layer:
Create a lookup layer that will assign unique indices to each category.
Adapt the lookup layer so it learns the unique values.
Convert categories into one-hot encoded tensors.
Cast the tensor to float for compatibility with training models.
Example:
city = ['New York', 'Los Angeles', 'San Francisco']
Encoded as:
[[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]]
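You can reproduce this step in isolation. A minimal sketch (note: num_oov_indices=0 drops the layer’s default out-of-vocabulary slot, so the one-hot width equals the number of cities; the column order follows the vocabulary the layer learns during adapt):
import tensorflow as tf
from tensorflow.keras.layers import StringLookup

cities = tf.constant(["New York", "Los Angeles", "San Francisco"])

lookup = StringLookup(output_mode="one_hot", num_oov_indices=0)
lookup.adapt(cities)           # learn the unique city names
print(lookup(cities).numpy())  # one row per input, one column per city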
Scaling Numeric Features
The function scale_numeric_features standardizes numerical data using Z-score normalization, which centers the values around zero:
Create a normalization layer.
Adapt the layer to compute the mean and standard deviation.
Apply the transformation, ensuring all values are scaled properly.
Example:
Before scaling:
[[1500, 3, 2],
[1800, 4, 3],
[2400, 3, 2]]
After scaling (values rounded):
[[-1.07, -0.71, -0.71],
[-0.27, 1.41, 1.41],
[ 1.34, -0.71, -0.71]]
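You can verify these numbers with the Normalization layer on its own. A minimal sketch:
import tensorflow as tf
from tensorflow.keras.layers import Normalization

features = tf.constant([[1500., 3., 2.],
                        [1800., 4., 3.],
                        [2400., 3., 2.]])

norm = Normalization()
norm.adapt(features)           # learn per-column mean and variance
print(norm(features).numpy())  # z-scored features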
This ensures that features with different ranges (e.g., square footage vs. bedrooms) don’t disproportionately influence the model. The full script below puts all of these steps together.
import tensorflow as tf
import pandas as pd
from tensorflow.keras.layers import Normalization, StringLookup
from typing import List, Tuple


def load_data(file_path: str) -> pd.DataFrame:
    """Load dataset from a CSV file."""
    print("Loading dataset...")
    data = pd.read_csv(file_path)
    print("Dataset loaded successfully! Here's a preview:")
    print(data.head())
    return data


def handle_missing_values(data: pd.DataFrame, numeric_cols: List[str]) -> pd.DataFrame:
    """Fill missing values in numeric columns with their median."""
    print("Handling missing values...")
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].median())
    print("Missing values handled. Any remaining missing values?")
    print(data.isnull().sum())
    return data


def encode_categorical_features(data: pd.DataFrame, categorical_col: str) -> tf.Tensor:
    """Encode categorical variables using TensorFlow StringLookup."""
    print("Encoding categorical variables...")
    # num_oov_indices=0 drops the default out-of-vocabulary slot,
    # so the one-hot width equals the number of distinct categories
    lookup = StringLookup(output_mode='one_hot', num_oov_indices=0)
    lookup.adapt(tf.constant(data[categorical_col].astype(str)))
    categorical_encoded = lookup(tf.constant(data[categorical_col].astype(str).values))
    categorical_encoded = tf.cast(categorical_encoded, tf.float32)
    print("Categorical encoding complete. Example output:")
    print(categorical_encoded[:5].numpy())
    return categorical_encoded


def scale_numeric_features(data: pd.DataFrame, numeric_cols: List[str]) -> tf.Tensor:
    """Apply feature scaling to numeric features using TensorFlow Normalization."""
    print("Applying feature scaling...")
    norm_layer = Normalization()
    numeric_features = tf.constant(data[numeric_cols].values, dtype=tf.float32)
    norm_layer.adapt(numeric_features)  # learns per-column mean and variance
    numeric_features_scaled = norm_layer(numeric_features)
    print("Feature scaling complete. Example output:")
    print(numeric_features_scaled[:5].numpy())
    return numeric_features_scaled


def preprocess_data(file_path: str) -> Tuple[List, List]:
    """Load, preprocess, and split the dataset into training and testing sets."""
    data = load_data(file_path)
    numeric_cols = ['square_feet', 'bedrooms', 'bathrooms', 'price']
    categorical_col = 'city'

    data = handle_missing_values(data, numeric_cols)
    categorical_encoded = encode_categorical_features(data, categorical_col)
    numeric_features_scaled = scale_numeric_features(data, numeric_cols[:-1])  # exclude price from scaling

    print("Combining processed features...")
    processed_data = tf.concat([numeric_features_scaled, categorical_encoded], axis=1)
    print("Data processing complete. Here's what the final dataset looks like:")
    print(processed_data.numpy()[:5])

    print("Splitting data into training and testing sets...")
    # Note: rows are kept in file order here; shuffle first if you need a random split
    data_tensor = tf.data.Dataset.from_tensor_slices(processed_data)
    data_list = list(data_tensor.as_numpy_iterator())
    train_size = int(0.8 * len(data_list))
    train_data = data_list[:train_size]
    test_data = data_list[train_size:]
    print(f"Training set size: {len(train_data)} rows")
    print(f"Testing set size: {len(test_data)} rows")
    return train_data, test_data


# Run preprocessing
train_data, test_data = preprocess_data("housing.csv")
What Can We Do With Preprocessed Data?
Once the data has been cleaned, encoded, and scaled, we can now train machine learning models. Some common next steps include:
Train a Regression Model → Predict continuous values, such as house prices (a minimal sketch follows this list).
Train a Classification Model → Identify categories, such as spam vs. not spam.
Use in Deep Learning → Feed structured, normalized data into a neural network.
Feature Engineering → Create new meaningful features from existing data.
Anomaly Detection → Identify unusual patterns, such as fraud detection in banking.
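As a small taste of the next step, here is a sketch of a Keras regression model that could consume rows shaped like our processed tensor (3 scaled numeric columns + 3 one-hot city columns = 6 inputs; the layer sizes are illustrative, not a recommendation):
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(6,)),
    tf.keras.layers.Dense(1)  # one continuous output, e.g. price
])
model.compile(optimizer="adam", loss="mse")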
By preparing data correctly, we set the foundation for building high-performing, reliable models.
Data is Everything
Machine learning is only as good as the data you feed it. Understanding features, labels, and data preprocessing techniques will set you up for success when building models.
In the next lesson, we’ll go even deeper into Supervised Learning and How Models Learn from Labeled Data. See you there!