Feature Engineering

After ingestion and cleanup, the dataset underwent light structural reorganization to improve interpretability and modeling outcomes. This page explains how the original 30 features were grouped, filtered, and prepared for compatibility with the downstream algorithms.

Engineering Philosophy

The core objective was to retain all signal-rich attributes while avoiding dimensional redundancy. Rather than create new synthetic features, engineering here focused on rearrangement and selective modeling of the most useful feature subsets.

The WDBC dataset is unusual in that it encodes each morphology measure three ways: mean, standard error, and worst (the mean of the three largest values). These were preserved as separate subgroups to allow flexible model tuning.

Feature Partitioning

The dataset's 30 input features were grouped based on suffixes tied to measurement types:

  • Mean: e.g. radius_mean, texture_mean
  • Standard Error: e.g. area_se, smoothness_se
  • Worst: e.g. concavity_worst, symmetry_worst

These partitions enabled faster iteration when testing models on high-variance subsets like worst-case morphology.
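The suffix-based grouping described above can be sketched as follows; `partition_features` is an illustrative helper written for this page, not code from the project:

```python
# Minimal sketch: partition WDBC columns by their measurement-type suffix.
# Column names follow the pattern "<trait>_mean", "<trait>_se", "<trait>_worst".
def partition_features(columns):
    groups = {"mean": [], "se": [], "worst": []}
    for col in columns:
        suffix = col.rsplit("_", 1)[-1]
        if suffix in groups:
            groups[suffix].append(col)
    return groups

cols = ["radius_mean", "texture_mean", "area_se", "concavity_worst"]
print(partition_features(cols))
# {'mean': ['radius_mean', 'texture_mean'], 'se': ['area_se'], 'worst': ['concavity_worst']}
```

Each subgroup can then be selected from the DataFrame directly, e.g. `df[groups["worst"]]`, to train on a single measurement band.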

Correlation Filtering

Prior to full model training, a correlation heatmap was generated. Strong multicollinearity was observed between traits such as radius and perimeter.

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations across the 30 numeric features
corr = df.drop("diagnosis", axis=1).corr()
sns.heatmap(corr, cmap="coolwarm")
plt.show()

For algorithms sensitive to multicollinearity, such as logistic regression, columns with pairwise correlation above 0.9 were pruned experimentally. SVM and tree-based models retained all features without penalty.
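The experimental pruning step can be sketched as below. `prune_correlated` and the demo frame are hypothetical illustrations of the 0.9 threshold, not the project's actual code:

```python
import numpy as np
import pandas as pd

def prune_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Demo: perimeter is nearly a linear function of radius, so it gets pruned
rng = np.random.default_rng(0)
radius = rng.normal(size=100)
demo = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": radius * 6.28 + rng.normal(scale=0.01, size=100),
    "texture_mean": rng.normal(size=100),
})
pruned, dropped = prune_correlated(demo)
print(dropped)  # ['perimeter_mean']
```

Dropping the later column of each highly correlated pair is a common convention; which member to keep is a judgment call and can be tuned per model.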

Export Structure

The final modeling pipeline preserved the full 30-feature matrix for compatibility, with optional filters applied on model-specific branches.

# Full 30-feature matrix; labels kept separate for supervised training
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]

At this point, the feature set was clean and organized, and it was passed to the scaling transforms covered in the next phase.

Key Takeaways

  • No synthetic features were added to the original dataset.
  • Measurement types were preserved to support focused experimentation.
  • Highly correlated features were logged and selectively excluded depending on the algorithm.
  • The final feature matrix included 30 well-labeled, float-compatible columns.