Breast Cancer Identifier

As my undergraduate computer science capstone, I developed a machine learning classifier to support the early detection of breast cancer from cell morphology data. This project demonstrates my skills in data analysis, model development, and building practical, user-facing diagnostic tools.

Clinical Motivation

Fine Needle Aspiration (FNA) is a low-cost diagnostic procedure for evaluating breast masses. While the procedure itself is simple, interpreting the resulting cell morphologies often requires expert judgment. This project applies supervised learning to that interpretation step, training predictive models on a well-labeled dataset to identify malignancy and improve diagnostic reliability.

Dataset: Wisconsin Diagnostic Breast Cancer (WDBC), containing 569 FNA samples, each described by 30 numeric features computed from cell nuclei morphology.
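For reference, a copy of the WDBC data ships with scikit-learn, so it can be loaded without the original CSV. Note that scikit-learn encodes the labels the opposite way from this project, so the sketch below flips them to match the 0 = benign / 1 = malignant convention used here:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# scikit-learn uses 0 = malignant, 1 = benign; flip to 0 = benign, 1 = malignant
y = 1 - data.target

print(X.shape)       # (569, 30): 569 samples, 30 morphology features
print(int(y.sum()))  # count of malignant samples
```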

Machine Learning Objective

This classifier is designed to take a structured array of morphology inputs and return a binary classification with critical clinical implications:

  • 0 (Benign): A prediction that the tumor is non-cancerous.
  • 1 (Malignant): A prediction that the tumor is cancerous, flagging the need for urgent follow-up.
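As a sketch of this interface, a single 30-value morphology vector maps to a 0/1 prediction. The quickly fitted pipeline below is a stand-in, not the project's final tuned model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
# Re-encode labels to this project's convention: 0 = benign, 1 = malignant
X, y = data.data, 1 - data.target

# Stand-in model; the project's final classifier would be used here instead
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Classify one FNA sample: a 30-feature morphology vector
sample = X[0].reshape(1, -1)
label = int(model.predict(sample)[0])
print("malignant (urgent follow-up)" if label == 1 else "benign")
```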

I tested several models, including logistic regression and support vector machines. Each model was trained using stratified train-test splitting and evaluated for precision, recall, and overall accuracy.
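A condensed version of that comparison might look like the following. The scaling pipeline and fixed random seed are assumptions of this sketch, not necessarily the project's exact setup:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X, y = data.data, 1 - data.target  # 0 = benign, 1 = malignant

# Stratified split preserves the benign/malignant ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name,
          f"acc={accuracy_score(y_test, preds):.3f}",
          f"prec={precision_score(y_test, preds):.3f}",
          f"rec={recall_score(y_test, preds):.3f}")
```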

Core Technologies

Python Tooling

  • Pandas, NumPy for data handling
  • Matplotlib & Seaborn for visualization
  • scikit-learn for modeling and evaluation

Jupyter & Notebook Format

  • Inline graphing and commentary
  • Step-by-step model diagnostics
  • Cell-based interactivity and reproducibility

Classifier Preview

The following code outlines the basic structure of the logistic regression model used in the final evaluation:

preview.py
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# X, y are pre-loaded features and labels (0 = benign, 1 = malignant)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # fixed seed for reproducibility
)

model = LogisticRegression(max_iter=1000)  # extra iterations so the solver converges
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(classification_report(y_test, preds))

Final model performance is detailed in the evaluation page of this series.

What This Project Covers

This summary introduced the purpose, data, and core tooling used in the project. The following pages break this down into detailed sections on:

  • Dataset cleaning, parsing, and shaping
  • Feature exploration and visualization
  • Model selection, comparison, and scoring
  • Classifier input interfaces and result dashboards
  • Reflections on what was learned and how it could evolve