Data Ingestion
This page explains how the breast cancer dataset was initially loaded into memory, verified for structure, and prepared for downstream modeling. All steps are reproducible and form the foundation of the classification pipeline.
Loading the Dataset
The WDBC dataset was imported from a local CSV using pandas. Column names were applied manually based on published schema definitions, ensuring transparency and alignment with the feature documentation.
import pandas as pd

# feature_names holds the 30 schema-defined feature column names,
# defined externally from the published WDBC documentation.
df = pd.read_csv("wdbc.csv")
df.columns = ["id", "diagnosis"] + feature_names
The feature names were defined externally and assigned after loading, so the dataframe exposes the 30 feature columns the model expects alongside the id and diagnosis columns.
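As a minimal sketch, such a name list can be generated programmatically rather than typed out. The base measurement names below follow the published WDBC attribute descriptions, but the exact suffix convention (`_mean`, `_se`, `_worst`) is an assumption for illustration, not the project's actual identifiers:

```python
# Sketch: build the 30 WDBC feature names from the 10 base measurements.
# Each measurement appears three times: mean, standard error, and "worst"
# (mean of the three largest values), giving 10 x 3 = 30 columns.
base = ["radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave_points", "symmetry",
        "fractal_dimension"]
# Suffix naming here is a hypothetical convention, not the project's schema.
feature_names = [f"{b}_{stat}" for stat in ("mean", "se", "worst") for b in base]
```

Generating the list keeps the ordering consistent with the schema and makes a miscount impossible, since the comprehension always yields exactly 30 entries.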
Diagnosis Label Encoding
The diagnosis column originally contained single-letter class labels, which were mapped to integers:

Original Value → Mapped Encoding
B (Benign) → 0
M (Malignant) → 1
This integer mapping was chosen for compatibility with classification metrics and loss functions.
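The mapping can be applied in one line with pandas. The toy series below is illustrative, not the project's data:

```python
import pandas as pd

# Hypothetical toy column standing in for the real diagnosis column.
diagnosis = pd.Series(["B", "M", "B"])

# Map the single-letter labels to binary integers: Benign -> 0, Malignant -> 1.
encoded = diagnosis.map({"B": 0, "M": 1})
```

A dictionary-based `map` has the useful property that any unexpected label (anything other than "B" or "M") becomes NaN, which the subsequent integrity checks would surface immediately.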
Verifying Input Integrity
After ingestion, the dataframe was verified against expected row and column counts. No missing values were found, and column dtypes were all numeric as expected.
Shape Check
Expected: 569 rows × 32 columns
Actual: 569 × 32
NaN Check
Missing values: 0
Constant-value columns: 0
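The checks above can be sketched as a small helper. This function and its name are illustrative, assuming the verification was done with plain assertions:

```python
import pandas as pd

def verify_ingestion(df: pd.DataFrame, n_rows: int = 569, n_cols: int = 32) -> None:
    """Raise AssertionError if df deviates from the expected WDBC structure."""
    # Shape check: 569 samples, 2 metadata columns + 30 features.
    assert df.shape == (n_rows, n_cols), f"unexpected shape {df.shape}"
    # NaN check: no missing values anywhere in the frame.
    assert df.isna().sum().sum() == 0, "missing values present"
    # A column with a single unique value carries no predictive information.
    assert (df.nunique() > 1).all(), "constant column detected"
```

Running this immediately after ingestion fails fast on a truncated file, a bad delimiter, or an unmapped label, before any downstream step can silently consume bad data.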
The result was a clean and stable dataset suitable for downstream splitting, scaling, and modeling operations.
Structured Inputs and Predictable Pipelines
The ingestion step is intentionally simple and transparent. All transformations occur after this point, so that the base dataframe remains available for comparison or debugging.
Once loaded and labeled, the dataset is split into features (X) and target (y), setting the stage for exploratory plots, feature engineering, and model training in the pages that follow.
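The split itself reduces to dropping the metadata columns. The two-row frame below is a stand-in for illustration; the real dataframe holds 569 rows and 30 feature columns:

```python
import pandas as pd

# Illustrative stand-in frame with hypothetical feature names.
df = pd.DataFrame({"id": [101, 102], "diagnosis": [0, 1],
                   "radius_mean": [14.1, 20.6], "texture_mean": [14.6, 17.8]})

X = df.drop(columns=["id", "diagnosis"])  # feature matrix: measurements only
y = df["diagnosis"]                       # binary target: 0 = benign, 1 = malignant
```

Dropping `id` alongside `diagnosis` matters: the identifier carries no signal, and leaving it in X would let a model memorize sample IDs rather than learn from the measurements.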
Key Takeaways
- The dataset was ingested with a known schema and verified against documentation.
- Label encoding transformed B/M values into binary integers without introducing ambiguity.
- No rows or columns were dropped at this phase; all structure was preserved.
- This ingestion pattern ensures repeatable pipeline runs and easy debugging for future updates.