Data Ingestion

This page explains how the breast cancer (WDBC) dataset was loaded into memory, verified against the expected structure, and prepared for downstream modeling. All steps are reproducible and form the foundation of the classification pipeline.

Loading the Dataset

The WDBC dataset was imported from a local CSV using pandas. Column names were applied manually based on published schema definitions, ensuring transparency and alignment with the feature documentation.

import pandas as pd

# 30 feature names: mean, SE, and "worst" variants of 10 base measurements,
# taken from the published WDBC schema documentation
base = ["radius", "texture", "perimeter", "area", "smoothness", "compactness",
        "concavity", "concave_points", "symmetry", "fractal_dimension"]
feature_names = [f"{feat}_{stat}" for stat in ("mean", "se", "worst") for feat in base]

df = pd.read_csv("wdbc.csv")
df.columns = ["id", "diagnosis"] + feature_names

The 30 feature names were defined from the published schema documentation and assigned after loading, so the dataframe columns match the feature layout the model expects.

Diagnosis Label Encoding

The diagnosis column originally contained single-letter class labels:

Original Values

B = Benign
M = Malignant

Mapped Encoding

0 = Benign
1 = Malignant

This integer mapping was chosen for compatibility with classification metrics and loss functions.
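A minimal sketch of this mapping using pandas' Series.map (the sample values are illustrative; the real column comes from the ingested frame):

```python
import pandas as pd

# illustrative sample of the diagnosis column
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# B -> 0 (benign), M -> 1 (malignant)
df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1})
print(df["diagnosis"].tolist())  # [1, 0, 0, 1]
```

Because map produces integers directly, the resulting column plugs into scikit-learn metrics and loss functions without further casting.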

Verifying Input Integrity

After ingestion, the dataframe was verified against expected row and column counts. No missing values were found, and column dtypes were all numeric as expected.

Shape Check

Expected: 569 rows × 32 columns
Actual: 569 × 32

NaN Check

Missing values: 0
Constant columns: 0
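The shape, NaN, and constant-column checks above can be collected into one fail-fast helper. This is a sketch, not the project's actual validation code; the toy frame below merely stands in for the real WDBC dataframe so the helper can be demonstrated:

```python
import numpy as np
import pandas as pd

def verify(df: pd.DataFrame, expected_shape=(569, 32)) -> None:
    """Fail fast if the ingested frame deviates from the documented layout."""
    assert df.shape == expected_shape, f"unexpected shape: {df.shape}"
    assert df.isna().sum().sum() == 0, "missing values present"
    n_const = int((df.nunique(dropna=False) == 1).sum())
    assert n_const == 0, f"{n_const} constant column(s)"

# toy stand-in with the documented WDBC shape: id + diagnosis + 30 features
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(569, 30)),
                   columns=[f"f{i}" for i in range(30)])
toy.insert(0, "diagnosis", rng.integers(0, 2, size=569))
toy.insert(0, "id", np.arange(569))
verify(toy)
print("all integrity checks passed")
```

Running the same checks on every ingestion keeps silent schema drift (an added column, an unexpected NaN) from propagating into the modeling steps.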

The result was a clean and stable dataset suitable for downstream splitting, scaling, and modeling operations.

Structured Inputs and Predictable Pipelines

The ingestion step is intentionally simple and transparent. All transformations occur after this point, so that the base dataframe remains available for comparison or debugging.

Once loaded and labeled, the dataset is split into features (X) and target (y), setting the stage for exploratory plots, feature engineering, and model training in the pages that follow.
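The split itself is a simple column selection. A minimal sketch (the two-row frame here is illustrative; the real frame carries all 30 feature columns):

```python
import pandas as pd

# illustrative stand-in; the real frame has 30 feature columns
df = pd.DataFrame({"id": [1, 2], "diagnosis": [0, 1],
                   "radius_mean": [12.3, 20.1], "texture_mean": [14.0, 25.5]})

X = df.drop(columns=["id", "diagnosis"])  # feature matrix
y = df["diagnosis"]                       # binary target (0/1)
print(X.shape, y.tolist())  # (2, 2) [0, 1]
```

Dropping id alongside diagnosis matters: leaving the identifier in X would hand the model a meaningless but perfectly unique pseudo-feature.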

Key Takeaways

  • The dataset was ingested with a known schema and verified against documentation.
  • Label encoding transformed B/M values into binary integers without introducing ambiguity.
  • No rows or columns were dropped at this stage; all structure was preserved.
  • This ingestion pattern ensures repeatable pipeline runs and easy debugging for future updates.