Model Evaluation

This page details how the candidate classifiers were evaluated using performance metrics, validation strategies, and diagnostic tools. It explains the rationale for selecting the final model and interprets its results within a clinical context.

Purpose of Evaluation

The evaluation phase determines how well a trained model performs on unseen data. Beyond simply checking accuracy, it ensures clinical reliability by analyzing false positives, recall rates, and class balance. Evaluation also guided model selection among several contenders tested during experimentation.

Metrics Used

Precision

Of all predicted malignant cases, how many were actually malignant?

Why it matters: High precision minimizes false positives, preventing unnecessary anxiety and invasive follow-up procedures for healthy patients.

Recall

Of all actual malignant cases, how many did the model detect?

Why it matters: High recall is the top priority. A false negative (missing a cancer case) has the most severe clinical consequences.

F1 Score

The harmonic mean of precision and recall, F1 = 2 × (precision × recall) / (precision + recall), used to balance the two metrics in a single number.

Accuracy

The overall proportion of correct classifications. While useful, it can be misleading for imbalanced datasets.
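
For reference, these metrics can be computed directly with scikit-learn. The sketch below is illustrative rather than part of the original pipeline: it assumes y_test and y_pred (the held-out labels and the model's predictions on them) are already defined and that malignant is encoded as the positive class (1); the filename is likewise a placeholder.

metric_report.py
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_test, y_pred: held-out labels and model predictions (assumed to exist)
# pos_label=1 treats the malignant class as the positive class
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, pos_label=1):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred, pos_label=1):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred, pos_label=1):.4f}")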

Confusion Matrix

The confusion matrix provides a tabular view of prediction outcomes, revealing the types of errors the model makes. With high precision and recall, the model minimizes both false negatives and false positives.

                     Predicted: Benign      Predicted: Malignant
Actual: Benign       68 (True Negative)     2 (False Positive)
Actual: Malignant    1 (False Negative)     43 (True Positive)
Accuracy: 97.37%
Precision: 95.55%
Recall (Sensitivity): 97.73%
F1 Score: 96.63%
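
The matrix above can be reproduced and sanity-checked directly from the predictions. A minimal sketch, again assuming y_test and y_pred exist and that malignant is labeled 1 (the filename is a placeholder):

confusion_check.py
from sklearn.metrics import confusion_matrix

# labels=[0, 1] orders the matrix as benign (0) first, malignant (1) second
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")

# Derive the summary metrics directly from the four cells
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"Accuracy={accuracy:.4f}  Precision={precision:.4f}  Recall={recall:.4f}  F1={f1:.4f}")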

Cross-Validation Strategy

To reduce bias from a single train-test split, I employed 5-fold stratified cross-validation. This technique ensures the class distribution (the percentage of benign vs. malignant samples) is preserved across all folds, which is critical for imbalanced data.

cv_scoring.py
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# X and y are assumed to be the feature matrix and labels prepared in the earlier steps
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Use 'f1' scoring to optimize for the balance of precision and recall
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print(f"F1 Scores per fold: {scores}")
print(f"Mean F1 Score: {scores.mean():.4f}")

The F1 scores remained stable across all folds, confirming that the model generalizes well and its performance isn't due to a lucky train-test split.

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve visualizes the trade-off between the true positive rate and false positive rate at various classification thresholds. The Area Under the Curve (AUC) summarizes this relationship in a single number, where 1.0 represents a perfect classifier.

roc_auc.py
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Get probabilities for the positive class (malignant); model, X_test, y_test come from the earlier training step
probs = model.predict_proba(X_test)[:, 1]
score = roc_auc_score(y_test, probs)
print(f"ROC AUC Score: {score:.4f}")

# Plot the ROC curve against the chance diagonal
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label=f"AUC = {score:.4f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()

With an AUC of 0.997, the classifier demonstrates an excellent ability to distinguish between benign and malignant classes.
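
Because recall is the clinical priority, the same probability outputs can also be used to explore decision thresholds below the default 0.5, making the trade-off the ROC curve describes concrete. The sweep below is purely illustrative and not part of the original pipeline; it reuses the probs and y_test variables from the previous snippet, and the filename is a placeholder.

threshold_sweep.py
from sklearn.metrics import precision_score, recall_score

# Lowering the threshold flags more cases as malignant, trading precision for recall
for threshold in (0.5, 0.4, 0.3, 0.2):
    preds = (probs >= threshold).astype(int)
    p = precision_score(y_test, preds)
    r = recall_score(y_test, preds)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")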

Final Model Justification

Logistic regression was selected over more complex models like SVMs for several key reasons:

  • Strong Performance: It achieved excellent results with an F1 score over 96%, meeting the project's clinical requirements.
  • Interpretability: Logistic regression is more transparent: its coefficients can be inspected to see how each feature influences the prediction (see the sketch after this list), which is a major advantage in a clinical setting.
  • Simplicity & Robustness: The model is less prone to overfitting on this type of structured, low-dimensional data compared to more complex alternatives. The marginal performance gains from other models did not justify their added complexity.
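
As a concrete illustration of the interpretability point, the fitted model's coefficients can be ranked by magnitude to see which features push a prediction toward the malignant class. This is only a sketch: it assumes the trained model and a feature_names list (e.g. the dataset's column names) are available, and both names, like the filename, are placeholders.

coefficient_inspection.py
import numpy as np

# model: fitted LogisticRegression; feature_names: column names for X (assumed to exist)
coefs = model.coef_[0]
order = np.argsort(np.abs(coefs))[::-1]

# Positive coefficients push a prediction toward the positive (malignant) class,
# negative coefficients toward benign
for idx in order[:10]:
    print(f"{feature_names[idx]:<25s} {coefs[idx]:+.3f}")

Note that coefficient magnitudes are only directly comparable across features when the inputs were scaled to a common range before training.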