Lesson 10: Preprocessing and Feature Engineering
This lesson introduces how raw data is transformed into model-ready inputs within a structured machine learning path. It begins with intuition, moves into workflow thinking, and then shows a practical Python example with clear notes.
Concept and intuition
Preprocessing and feature engineering are core topics in machine learning because they shape how we frame the problem, choose tools, and judge results. Models learn from the features you provide. Good preprocessing can improve performance, while poor preprocessing can hide useful signals or introduce leakage.
When learning preprocessing, do not focus only on formulas. The more important habit is to ask what the model is trying to learn, what assumptions it makes, and what could go wrong when the data is noisy, incomplete, or biased.
How it fits into a workflow
In a real project, transforming raw data into model-ready inputs sits inside a larger workflow: define the problem, prepare data, choose features, train a model, evaluate it carefully, and improve the system over time. Strong machine learning practice is iterative rather than one-shot.
This means you should connect preprocessing decisions to practical questions such as: What data is available? How will predictions be used? Which errors are most costly? How will the system be monitored after deployment? Those questions matter as much as model accuracy.
Common mistakes and practical advice
A common beginner mistake is to treat preprocessing as a purely technical task. In practice, success depends on data quality, evaluation design, and the clarity of the business goal. Even a sophisticated model can fail if the data pipeline is weak or the target is poorly defined.
As you read the code example in this lesson, pay attention to how the inputs are shaped, how training and prediction are separated, and how the output is interpreted. Good coding habits make machine learning work more reliable, explainable, and easier to improve.
Three practical examples
- Scaling: some algorithms work better when features are on similar scales.
- Encoding: text categories like city or department must be converted into numeric representations.
- Derived features: a business creates `revenue_per_customer` from revenue and customer count.
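The third example can be sketched in a few lines. This is a minimal illustration with made-up numbers; the column names `revenue` and `customers` are assumptions, not a fixed convention.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sales data; values and column names are illustrative.
sales = pd.DataFrame({
    "revenue": [12000.0, 45000.0, 9000.0],
    "customers": [40, 90, 15],
})

# Derived feature: average revenue per customer.
sales["revenue_per_customer"] = sales["revenue"] / sales["customers"]

# Scale all numeric features to zero mean and unit variance.
scaled = StandardScaler().fit_transform(sales)
print(sales["revenue_per_customer"].tolist())  # → [300.0, 500.0, 600.0]
```

The derived column carries business meaning (spend per customer) that neither raw column expresses on its own, which is the point of feature engineering.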
Building a preprocessing pipeline
This code example focuses on clarity rather than production scale. Read the comments, then study the notes below to understand why each step matters.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
df = pd.DataFrame({
    "age": [22, 35, 41, 28],
    "income": [3000, 5200, 6100, 4200],
    "city": ["A", "B", "A", "C"],
    "bought": [0, 1, 1, 0]
})
X = df[["age", "income", "city"]]
y = df["bought"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression())
])
model.fit(X, y)
print(model.predict(X))
Code walkthrough
- `ColumnTransformer` applies different preprocessing steps to different columns.
- `StandardScaler` standardizes numeric features to zero mean and unit variance for algorithms that are scale-sensitive.
- `OneHotEncoder` converts categories into machine-readable indicator columns.
- A `Pipeline` keeps preprocessing and modeling together so training and prediction stay consistent.
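The `handle_unknown="ignore"` option matters at prediction time. The sketch below rebuilds the same toy pipeline and then predicts for a city ("D") that was never seen during training; the new row is an illustrative assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "age": [22, 35, 41, 28],
    "income": [3000, 5200, 6100, 4200],
    "city": ["A", "B", "A", "C"],
    "bought": [0, 1, 1, 0]
})
X, y = df[["age", "income", "city"]], df["bought"]

model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])
    ])),
    ("clf", LogisticRegression())
])
model.fit(X, y)

# City "D" was never seen during fit. With handle_unknown="ignore",
# its one-hot columns are all zero instead of raising an error.
new = pd.DataFrame({"age": [30], "income": [5000], "city": ["D"]})
pred = model.predict(new)
print(pred)
```

Without `handle_unknown="ignore"`, the encoder would raise an error on the unseen category, which is a common failure in deployed systems where new categories appear over time.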
Summary and key takeaways
- Raw data usually needs preparation before modeling.
- Feature engineering can add business meaning and improve performance.
- Pipelines reduce mistakes by applying the same transformations every time.
- Always be careful to avoid leakage from future or target information.
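The last takeaway can be demonstrated concretely. A minimal sketch with synthetic data (the dataset and random seed are assumptions for illustration): because scaling lives inside the pipeline, the scaler is fit only on training rows, and test rows are transformed with training statistics, avoiding leakage.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data: the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Leak-free: StandardScaler is fit inside the pipeline, so it only
# ever sees X_train; X_test is scaled with training statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression())
])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(round(score, 3))
```

Fitting the scaler on the full dataset before splitting would let test-set statistics influence training, a subtle form of leakage that inflates evaluation scores.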
Exercises
- Why do categorical columns need encoding?
- What kinds of algorithms benefit from scaling?
- Create one new feature idea for a sales or education dataset.
- What is the benefit of keeping preprocessing and modeling in one pipeline?