Dataset Background

This project uses the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, a structured clinical dataset derived from digitized images of fine needle aspirate (FNA) samples of breast masses. Each record includes numerical features describing the cell nuclei in the image and a binary label indicating whether the sample is benign or malignant.

Dataset Overview

Origin

The samples were collected at the University of Wisconsin Hospitals using FNA procedures. The dataset was curated by Dr. William H. Wolberg and is available through the UCI Machine Learning Repository.

Contents

The dataset contains 569 samples, each labeled benign or malignant. Each sample has 30 numeric features: 10 cell-nucleus characteristics, each summarized by its mean, standard error, and "worst" value (the mean of the three largest measurements).

Feature Structure

The feature set is composed of 10 core morphology types, each measured in three ways. This results in a 30-dimensional input vector for each observation. These features were central to all model training and validation efforts in this capstone.

Morphological Traits

  • Radius
  • Texture
  • Perimeter
  • Area
  • Smoothness
  • Compactness
  • Concavity
  • Concave Points
  • Symmetry
  • Fractal Dimension

Metric Types

  • Mean
  • Standard Error
  • Worst Value
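
As a rough illustration of how the 10 traits and 3 metric types combine into 30 features, the sketch below builds one name per trait/metric pair. The naming scheme (for example "radius_mean") and the short label "se" are assumptions made for this example, not necessarily the column names used in the project.

```python
from itertools import product

# The ten morphological traits and three metric types listed above.
TRAITS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
METRICS = ["mean", "se", "worst"]  # "se" = standard error (hypothetical short label)

# Group by metric: all means first, then standard errors, then worst values.
# The "<trait>_<metric>" naming is an assumption for illustration.
feature_names = [f"{trait}_{metric}" for metric, trait in product(METRICS, TRAITS)]

print(len(feature_names))   # number of features per observation
print(feature_names[:3])    # first few names
```

Each observation is then a vector with one value per name in feature_names.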

Class Balance

Diagnosis labels fall into two classes, benign and malignant, split roughly 63% to 37%. The imbalance is modest, so the data still supports effective binary classification without synthetic rebalancing.

Total Records:   569
Benign:          357
Malignant:       212

Benign %:        ~62.7%
Malignant %:     ~37.3%
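
The percentages above follow directly from the class counts. A minimal sketch of the arithmetic, using only the counts stated here (the variable names are illustrative):

```python
# Class counts as reported for the WDBC dataset above.
counts = {"benign": 357, "malignant": 212}
total = sum(counts.values())

# Share of each class as a percentage of all records.
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
# benign: 62.7%
# malignant: 37.3%
```

One consequence worth keeping in mind: a classifier that always predicts "benign" would already be right about 62.7% of the time, so accuracy figures are best read against that baseline.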

The model training process used stratified sampling to ensure consistent class proportions across training and test sets.
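
To make the idea of stratified sampling concrete, here is a minimal standard-library sketch that splits each class separately so both subsets keep roughly the same benign/malignant ratio. The 20% test fraction, the seed, and the helper name stratified_split are assumptions for illustration; the project's actual split ratio and tooling are not stated here, and in practice a library routine such as scikit-learn's train_test_split with its stratify argument is a common choice.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions preserved.

    Hypothetical helper: test_fraction and seed are illustrative defaults.
    """
    rng = random.Random(seed)

    # Group record indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Labels matching the class counts above: 357 benign (B), 212 malignant (M).
labels = ["B"] * 357 + ["M"] * 212
train_idx, test_idx = stratified_split(labels)

for name, idx in (("train", train_idx), ("test", test_idx)):
    malignant = sum(labels[i] == "M" for i in idx)
    print(f"{name}: {len(idx)} records, {100 * malignant / len(idx):.1f}% malignant")
```

Both subsets end up close to the overall 37.3% malignant share, which is the property stratification is meant to guarantee.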

Why This Dataset Was Chosen

This dataset was selected because it offered:

Structured Data

Clean numerical columns with no missing values, making it ideal for supervised learning algorithms.

Medical Relevance

While simplified, it reflects a real diagnostic task and highlights how ML can aid clinical decision-making.

Key Takeaways

  • This dataset was a core component of the final capstone project for my undergraduate Computer Science degree.
  • Its dimensional structure and clean labels made it ideal for model benchmarking and iteration.
  • Future sections will explore how this data was preprocessed, modeled, and applied to build a usable diagnostic prototype.