Lesson 4: Working with Data in NumPy and pandas

Level: Beginner | Course position: 4 of 30 | Track: Machine Learning Tutorials

This lesson introduces how machine learning data is represented, inspected, cleaned, and prepared in Python. It begins with intuition, moves into workflow thinking, and then shows a practical Python example with explanatory notes.

Concept and intuition

Working with data in NumPy and pandas is a core machine learning skill because it shapes how we frame the problem, choose tools, and judge results. Most machine learning time is spent working with data, not training models, so strong data skills make later lessons much easier.

When learning to represent, inspect, clean, and prepare data in Python, do not focus only on formulas and function calls. The more important habit is to ask what the model is trying to learn, what assumptions it makes, and what could go wrong when the data is noisy, incomplete, or biased.

How it fits into a workflow

In a real project, data preparation sits inside a larger workflow: define the problem, prepare data, choose features, train a model, evaluate it carefully, and improve the system over time. Strong machine learning practice is iterative rather than one-shot.

This means you should connect data preparation to practical questions such as: What data is available? How will predictions be used? Which errors are most costly? How will the system be monitored after deployment? Those questions matter as much as model accuracy.

Common mistakes and practical advice

A common beginner mistake is to treat data preparation as a purely technical task. In practice, success depends on data quality, evaluation design, and the clarity of the business goal. Even a sophisticated model can fail if the data pipeline is weak or the target is poorly defined.

As you read the code example in this lesson, pay attention to how the data is loaded, what each inspection step reveals, and how missing values are handled before any modeling begins. Good coding habits make machine learning work more reliable, explainable, and easier to improve.

Three practical examples

Tabular data

A CSV file with customer records is loaded into a pandas DataFrame.
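
As a rough illustration of this first example, the sketch below loads a few customer records into a DataFrame. The column names and values are invented for this sketch, and `io.StringIO` stands in for a real CSV file so the snippet runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical customer records; in practice this text would live in a CSV file.
csv_text = """customer_id,name,city,monthly_spend
1,Ana,Lisbon,42.5
2,Ben,Leeds,17.0
3,Chloe,Lyon,63.25
"""

# read_csv accepts a file path or a file-like object, so StringIO stands in for a file.
customers = pd.read_csv(io.StringIO(csv_text))

print(customers.shape)   # (number of rows, number of columns)
print(customers.dtypes)  # one data type per column
print(customers.head())  # first rows of the table
```

Each column has its own data type, which is one reason a DataFrame suits mixed tabular data such as names alongside numbers.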

Arrays and matrices

NumPy stores numeric values efficiently for computation.
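
A minimal sketch of what this looks like in practice, assuming a tiny made-up feature matrix (the numbers and the meaning of the two columns are hypothetical):

```python
import numpy as np

# Hypothetical feature matrix: 3 students x 2 features (hours studied, hours slept).
X = np.array([[2.0, 7.0],
              [5.0, 6.0],
              [8.0, 8.0]])

print(X.shape)  # (3, 2): three rows, two columns
print(X.dtype)  # float64: every element shares one numeric type

# Operations apply to whole arrays at once, with no explicit Python loop.
col_means = X.mean(axis=0)   # mean of each column
X_centered = X - col_means   # broadcasting subtracts the column means from every row

print(col_means)
print(X_centered)
```

Because every element has the same type, NumPy can store the values in one compact block of memory and run arithmetic over the whole array in a single call.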

Missing values

A data analyst inspects null values before training a model.
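
One way such an inspection might look, sketched on a small invented DataFrame (all column names and values here are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; np.nan and None both mark a missing value.
df = pd.DataFrame({
    "hours_studied": [2.0, np.nan, 5.0, 8.0],
    "exam_score": [55.0, 61.0, np.nan, 90.0],
    "city": ["Leeds", None, "Lyon", "Lisbon"],
})

missing_per_column = df.isna().sum()        # count of missing values in each column
rows_with_gaps = df[df.isna().any(axis=1)]  # rows with at least one missing value

print(missing_per_column)
print(rows_with_gaps)
```

Looking at the affected rows, not just the counts, helps decide whether to fill a gap, drop the row, or investigate how the data was collected.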

Reading and inspecting a small dataset

This code example focuses on clarity rather than production scale. Read the code line by line, then study the notes below to understand why each step matters.

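import pandas as pd

# Load the CSV file into a DataFrame (students.csv must be in the working directory).
df = pd.read_csv("students.csv")

print(df.head())        # first five rows: quick check of structure and column names
df.info()               # column types and non-null counts; info() prints directly and returns None
print(df.isna().sum())  # number of missing values in each column

# Replace missing hours_studied values with the column median.
df["hours_studied"] = df["hours_studied"].fillna(df["hours_studied"].median())
print(df.describe())    # summary statistics for the numeric columns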

Code walkthrough

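  • `read_csv()` loads tabular data into a DataFrame.
  • `head()` shows the first five rows, which is useful for a quick visual check of structure and column names.
  • `info()` lists each column's data type and non-null count.
  • `isna().sum()` counts the missing values in each column.
  • The `fillna()` step demonstrates one simple strategy for replacing missing numeric values: substituting the column median.
  • `describe()` reports summary statistics for the numeric columns.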
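
To try the median fill without a `students.csv` file, the sketch below builds a small stand-in DataFrame in memory and then hands the cleaned values to NumPy. Only the `hours_studied` column comes from the lesson; the `exam_score` column and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for students.csv; only "hours_studied" comes from the lesson.
df = pd.DataFrame({
    "hours_studied": [1.0, 3.0, np.nan, 10.0],
    "exam_score": [50.0, 62.0, 70.0, 95.0],
})

median_hours = df["hours_studied"].median()  # median() skips missing values by default
df["hours_studied"] = df["hours_studied"].fillna(median_hours)

assert df["hours_studied"].isna().sum() == 0  # no gaps remain in this column

# to_numpy() returns the values as a NumPy array, the form many ML libraries expect.
X = df[["hours_studied"]].to_numpy()  # 2-D array: one row per student, one feature
y = df["exam_score"].to_numpy()       # 1-D array with one value per student

print(median_hours, X.shape, y.shape)
```

The median is less sensitive to extreme values than the mean: here the single large value of 10 hours does not pull the fill value upward.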

Summary and key takeaways

  • Machine learning depends heavily on clean, well-understood data.
  • pandas is well suited to inspection, filtering, joining, and basic transformations.
  • NumPy underlies many machine learning calculations and array operations.
  • Before modeling, always check column types, missing values, and summary statistics.

Exercises

  • Create a tiny CSV file and load it with pandas.
  • Why is missing-data inspection important before modeling?
  • What is the difference between a NumPy array and a pandas DataFrame?
  • Try replacing a missing value with the mean instead of the median.