Data Cleaning

Before any modeling can occur, raw inputs must be refined into structured, noise-reduced features. This page outlines the removal of irrelevant columns, treatment of potential outliers, and preparation of inputs for analysis.

Cleaning Strategy

The dataset needed little intervention because the source was already clean: it contained no null values and no malformed rows. The only structural change was the removal of one identifier column. Feature scaling was deferred until after model selection.
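
The claim that no nulls or malformed rows were present can be checked with a short audit. The following is a minimal sketch, not the project's actual validation code: the helper name integrity_report and the three-row sample are hypothetical, and "malformed" is approximated here as duplicate rows or infinite numeric values.

```python
import numpy as np
import pandas as pd


def integrity_report(df: pd.DataFrame) -> dict:
    """Count null cells, fully duplicated rows, and infinite numeric cells."""
    numeric = df.select_dtypes(include="number")
    return {
        "null_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "infinite_cells": int(np.isinf(numeric.to_numpy(dtype=float)).sum()),
    }


# Hypothetical three-row sample standing in for the real dataframe.
sample = pd.DataFrame(
    {
        "id": [101, 102, 103],
        "diagnosis": ["M", "B", "B"],
        "radius_mean": [17.99, 13.54, 12.45],
    }
)
report = integrity_report(sample)
print(report)
```

A report of all zeros supports skipping imputation and row filtering. Any non-zero count would call for a closer look before continuing.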

Cleaning here does not include feature selection or dimensionality reduction. Those transformations occur downstream and are tightly coupled to model-specific criteria.

Removing Identifier Columns

The id column was retained at ingestion time to support traceability but was excluded from model input. As a record identifier it carries no predictive signal, and a model could fit spurious patterns to it if it were left in.

# Drop the identifier column if it is present; errors="ignore" makes this a no-op otherwise.
df = df.drop(columns=["id"], errors="ignore")

This operation was applied consistently across experiments to ensure no identifier leakage.
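
One way to keep traceability while still excluding the identifier is to set it aside before building the model input. This is a sketch under assumptions: the helper name split_identifier and the sample rows are hypothetical and not part of the original pipeline.

```python
import pandas as pd


def split_identifier(df: pd.DataFrame, id_col: str = "id"):
    """Return (ids, features) without mutating the input frame.

    ids is None when the identifier column is absent.
    """
    ids = df[id_col].copy() if id_col in df.columns else None
    features = df.drop(columns=[id_col], errors="ignore")
    return ids, features


# Hypothetical sample rows for illustration only.
raw = pd.DataFrame(
    {
        "id": [101, 102],
        "diagnosis": ["M", "B"],
        "radius_mean": [17.99, 13.54],
    }
)
ids, features = split_identifier(raw)

# Guard against identifier leakage before any model sees the data.
assert "id" not in features.columns
print(features.columns.tolist())
```

Because ids and features share the same index, a prediction for any row can still be traced back to its original record.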

Outlier Detection Philosophy

Outliers were not explicitly removed in this pipeline. Since clinical data can vary widely across benign and malignant diagnoses, extreme values were preserved to let the models assess signal relevance.

Pros

  • No bias from artificial thresholds
  • Preserves variance essential to SVM kernels
  • Retains biological realism

Cons

  • Some metrics (e.g. area) span wide ranges
  • Scaling becomes more sensitive to extreme values, min-max normalization in particular
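
The scaling trade-off in the list above can be illustrated with a small numeric sketch. The values are hypothetical, not taken from the dataset. The arithmetic follows the default behavior of scikit-learn's MinMaxScaler and RobustScaler, written out in NumPy so the effect of one extreme value is visible.

```python
import numpy as np

# Hypothetical area-like values with one extreme measurement.
area = np.array([400.0, 500.0, 600.0, 700.0, 2500.0])

# Min-max scaling: the extreme value defines the upper bound of the range.
min_max = (area - area.min()) / (area.max() - area.min())

# Robust scaling: center on the median, divide by the interquartile range.
q1, median, q3 = np.percentile(area, [25, 50, 75])
robust = (area - median) / (q3 - q1)

print(np.round(min_max, 3))
print(np.round(robust, 3))
```

Under min-max scaling the four typical values are squeezed into roughly the bottom 15% of the unit interval. Under robust scaling they stay spread between -1 and 0.5 while the extreme value remains visible at 9.5. This is one reason retaining outliers makes the later choice of scaler matter more.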

Final Column Review

At the end of the cleaning phase, the dataframe included only the diagnosis column and 30 float features. No reindexing or category encoding was necessary beyond integer mapping of the target.

print(df.columns.tolist())
# ['diagnosis', 'radius_mean', 'texture_mean', ..., 'fractal_dimension_worst']

The dataframe was now stable enough to feed exploratory analysis, correlation analysis, and model pipelines.
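
The integer mapping of the target and the final column review could be sketched as follows. The label values "M" and "B", the 0/1 assignment, and the helper names encode_target and check_schema are all assumptions for illustration. The source states only that the target was integer-mapped and that the remaining features were floats.

```python
import pandas as pd

# Assumed label mapping; the source does not state which integers were used.
DIAGNOSIS_MAP = {"B": 0, "M": 1}


def encode_target(df: pd.DataFrame, target: str = "diagnosis") -> pd.DataFrame:
    """Return a copy with the target mapped to integers, failing on unknown labels."""
    out = df.copy()
    mapped = out[target].map(DIAGNOSIS_MAP)
    if mapped.isna().any():
        raise ValueError(f"unmapped labels in {target!r}")
    out[target] = mapped.astype("int64")
    return out


def check_schema(df: pd.DataFrame, target: str = "diagnosis") -> None:
    """Raise if any non-target column is not a float dtype."""
    features = df.drop(columns=[target])
    non_float = [
        col for col in features.columns
        if not pd.api.types.is_float_dtype(features[col])
    ]
    if non_float:
        raise TypeError(f"non-float feature columns: {non_float}")


# Hypothetical sample rows for illustration only.
sample = pd.DataFrame(
    {"diagnosis": ["M", "B", "B"], "radius_mean": [17.99, 13.54, 12.45]}
)
encoded = encode_target(sample)
check_schema(encoded)
print(encoded["diagnosis"].tolist())
```

Failing loudly on an unmapped label or a non-float feature keeps a silent schema drift from reaching the modeling stage.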

Key Takeaways

  • No missing or malformed data was present in the source.
  • Identifier fields were excluded from training inputs.
  • Outlier values were retained to preserve class-distinguishing variance.
  • At this point the dataframe was clean and wide (30 float features plus the integer-mapped target), ready for scikit-learn.