Logistic Regression

Logistic regression was chosen as the first classifier in this project due to its speed, interpretability, and effectiveness for linearly separable problems. This section covers how the model was implemented, tuned, and evaluated.

Why Logistic Regression?

This model is often used as a baseline for binary classification tasks. It is simple to train, produces interpretable coefficients, and is highly compatible with scaled numeric data like the breast cancer dataset.

Strengths

Fast training on small datasets
Clear output probabilities
Low risk of overfitting

Limitations

Performs poorly on non-linear boundaries
Assumes independence of features

Training and Parameters

The model was trained using stratified splitting, with 80% of the dataset for training and 20% for evaluation. Standard scaling was applied before fitting the model.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

The model converged quickly and produced consistent results across multiple test splits.

Evaluation Results

The model achieved high performance in both precision and recall. Below is a representative classification report:

              precision    recall  f1-score   support

       Benign       0.96      0.98      0.97       72
    Malignant       0.97      0.93      0.95       42

    accuracy                           0.96      114
   macro avg       0.96      0.96      0.96      114
weighted avg       0.96      0.96      0.96      114

These results confirm that even a simple model like logistic regression can perform exceptionally on well-structured diagnostic data.

Interpretability

The model’s learned coefficients were inspected to understand feature importance. Features with strong positive coefficients increased the likelihood of a malignant classification.

Features like concavity_worst and radius_mean emerged as key indicators of malignancy.

This level of explainability makes logistic regression valuable in sensitive domains like healthcare where transparency matters.

Key Takeaways

Logistic regression offers a reliable and explainable baseline for binary classification tasks.
With well-prepared features, it achieved over 96% accuracy on the breast cancer dataset.
Feature coefficients aligned with medical expectations and aided interpretability.
Performance comparisons with other models are explored in upcoming sections.