Breast Cancer Prediction Using Machine Learning

Breast cancer, one of the most prevalent forms of cancer, affects millions of lives worldwide. Early detection significantly enhances the chances of successful treatment. In this blog post, we will explore how cutting-edge machine learning techniques can empower healthcare professionals by predicting breast cancer with high accuracy. We’ll dive into the world of data, algorithms, and empowerment through technology.

What Is Breast Cancer?

Cancer is a broad term for a class of diseases characterized by abnormal cells that grow and invade healthy cells in the body. Breast cancer starts in the cells of the breast as a group of cancer cells that can then invade surrounding tissues or spread (metastasize) to other areas of the body. Breast cancer is a disease in which malignant (cancer) cells form in the tissues of the breast.

What Causes Breast Cancer?

Cancer begins in the cells which are the basic building blocks that make up tissue. Tissue is found in the breast and other parts of the body. Sometimes, the process of cell growth goes wrong and new cells form when the body doesn’t need them and old or damaged cells do not die as they should. When this occurs, a build-up of cells often forms a mass of tissue called a lump, growth, or tumor.

Breast cancer occurs when malignant tumors develop in the breast. These cells can spread by breaking away from the original tumor and entering blood vessels or lymph vessels, branching into tissues throughout the body. When cancer cells travel to other parts of the body and begin damaging other tissues and organs, the process is called metastasis.

What Is a Tumor?
A tumor is a mass of abnormal tissue. There are two types of breast tumors: those that are non-cancerous, or ‘benign’, and those that are cancerous, which are ‘malignant’.

Benign Tumors
When a tumor is diagnosed as benign, doctors will usually leave it alone rather than remove it. Even though these tumors are not generally aggressive toward surrounding tissue, occasionally they may continue to grow, pressing on other tissue and causing pain or other problems. In these situations, the tumor is removed, allowing pain or complications to subside.

Malignant Tumors
Malignant tumors are cancerous and may be aggressive because they invade and damage surrounding tissue. When a tumor is suspected to be malignant, the doctor will perform a biopsy to determine the severity or aggressiveness of the tumor.

In this study, advanced machine learning methods will be utilized to build and test the performance of a selected algorithm for breast cancer diagnosis.

1. Understanding the Dataset

Our journey begins with a dataset – a collection of valuable information waiting to reveal patterns. The Breast Cancer Wisconsin (Diagnostic) dataset provides a rich set of features derived from cell nuclei characteristics. Each feature is a potential clue in our quest for early cancer detection.

The Breast Cancer Wisconsin (Diagnostic) dataset, often referred to as the “Breast Cancer dataset” or “WBCD dataset,” is a widely used dataset in machine learning for classification tasks. It was created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian from the University of Wisconsin Hospitals, Madison, Wisconsin, USA. The dataset is publicly available and can be accessed through the UCI Machine Learning Repository.

The dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe the characteristics of the cell nuclei present in the image. The task associated with this dataset is to classify the breast mass as benign (B) or malignant (M) based on these features. Here are the details of the dataset:

  • Number of Instances: 569
  • Number of Features: 30 numeric, real-valued features are computed from cell nuclei characteristics
  • Attribute Information:
      1. ID Number: Unique identification number
      2. Diagnosis (M or B): Malignant (cancerous) or Benign (non-cancerous)
      3. (3-30) Ten real-valued features are computed for each cell nucleus:
        • Radius (mean of distances from the center to points on the perimeter)
        • Texture (standard deviation of gray-scale values)
        • Perimeter
        • Area
        • Smoothness (local variation in radius lengths)
        • Compactness (perimeter^2 / area – 1.0)
        • Concavity (severity of concave portions of the contour)
        • Concave points (number of concave portions of the contour)
        • Symmetry
        • Fractal dimension (“coastline approximation” – 1)

    For each of these ten features, the mean, standard error, and “worst” or largest (mean of the three largest values) values are computed, resulting in 30 features.
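
The same WDBC data ships with scikit-learn, which is a convenient way to load it without downloading the CSV from the UCI repository by hand. A minimal sketch (note that in the scikit-learn copy the target is encoded the opposite way from this post: 0 = malignant, 1 = benign):

```python
# Load the Breast Cancer Wisconsin (Diagnostic) dataset bundled with scikit-learn.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)              # (569, 30) -- 569 instances, 30 numeric features
print(data.target_names)    # ['malignant' 'benign'] -- target 0 is malignant
```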

2. Data Exploration
To understand the dataset, we first need to explore it. Here are the top 5 records:

   id        diagnosis  radius_mean  ...  symmetry_worst  fractal_dimension_worst  Unnamed: 32
0  842302    M          17.99        ...  0.4601          0.11890                  NaN
1  842517    M          20.57        ...  0.2750          0.08902                  NaN
2  84300903  M          19.69        ...  0.3613          0.08758                  NaN
3  84348301  M          11.42        ...  0.6638          0.17300                  NaN
4  84358402  M          20.29        ...  0.2364          0.07678                  NaN

(pandas truncates the middle columns, shown here as “...”)

Here are some broad-level statistics of the dataset:

count mean std 50% 75% max
id 569.0 3.037183e+07 1.250206e+08 906024.000000 8.813129e+06 9.113205e+08
radius_mean 569.0 1.412729e+01 3.524049e+00 13.370000 1.578000e+01 2.811000e+01
texture_mean 569.0 1.928965e+01 4.301036e+00 18.840000 2.180000e+01 3.928000e+01
perimeter_mean 569.0 9.196903e+01 2.429898e+01 86.240000 1.041000e+02 1.885000e+02
area_mean 569.0 6.548891e+02 3.519141e+02 551.100000 7.827000e+02 2.501000e+03
smoothness_mean 569.0 9.636028e-02 1.406413e-02 0.095870 1.053000e-01 1.634000e-01
compactness_mean 569.0 1.043410e-01 5.281276e-02 0.092630 1.304000e-01 3.454000e-01
concavity_mean 569.0 8.879932e-02 7.971981e-02 0.061540 1.307000e-01 4.268000e-01
concave points_mean 569.0 4.891915e-02 3.880284e-02 0.033500 7.400000e-02 2.012000e-01
symmetry_mean 569.0 1.811619e-01 2.741428e-02 0.179200 1.957000e-01 3.040000e-01
fractal_dimension_mean 569.0 6.279761e-02 7.060363e-03 0.061540 6.612000e-02 9.744000e-02
radius_se 569.0 4.051721e-01 2.773127e-01 0.324200 4.789000e-01 2.873000e+00
texture_se 569.0 1.216853e+00 5.516484e-01 1.108000 1.474000e+00 4.885000e+00
perimeter_se 569.0 2.866059e+00 2.021855e+00 2.287000 3.357000e+00 2.198000e+01
area_se 569.0 4.033708e+01 4.549101e+01 24.530000 4.519000e+01 5.422000e+02
smoothness_se 569.0 7.040979e-03 3.002518e-03 0.006380 8.146000e-03 3.113000e-02
compactness_se 569.0 2.547814e-02 1.790818e-02 0.020450 3.245000e-02 1.354000e-01
concavity_se 569.0 3.189372e-02 3.018606e-02 0.025890 4.205000e-02 3.960000e-01
concave points_se 569.0 1.179614e-02 6.170285e-03 0.010930 1.471000e-02 5.279000e-02
symmetry_se 569.0 2.054230e-02 8.266372e-03 0.018730 2.348000e-02 7.895000e-02
fractal_dimension_se 569.0 3.794904e-03 2.646071e-03 0.003187 4.558000e-03 2.984000e-02
radius_worst 569.0 1.626919e+01 4.833242e+00 14.970000 1.879000e+01 3.604000e+01
texture_worst 569.0 2.567722e+01 6.146258e+00 25.410000 2.972000e+01 4.954000e+01
perimeter_worst 569.0 1.072612e+02 3.360254e+01 97.660000 1.254000e+02 2.512000e+02
area_worst 569.0 8.805831e+02 5.693570e+02 686.500000 1.084000e+03 4.254000e+03
smoothness_worst 569.0 1.323686e-01 2.283243e-02 0.131300 1.460000e-01 2.226000e-01
compactness_worst 569.0 2.542650e-01 1.573365e-01 0.211900 3.391000e-01 1.058000e+00
concavity_worst 569.0 2.721885e-01 2.086243e-01 0.226700 3.829000e-01 1.252000e+00
concave points_worst 569.0 1.146062e-01 6.573234e-02 0.099930 1.614000e-01 2.910000e-01
symmetry_worst 569.0 2.900756e-01 6.186747e-02 0.282200 3.179000e-01 6.638000e-01
fractal_dimension_worst 569.0 8.394582e-02 1.806127e-02 0.080040 9.208000e-02 2.075000e-01
Unnamed: 32 0.0 NaN NaN NaN NaN NaN

Here are the datatypes of the dataset:

id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
Unnamed: 32                float64
dtype: object

Based on the statistics above, we will do the following:

1. Delete the column “Unnamed: 32”, which contains only NaN values
2. Map “M” to 1 and “B” to 0 in the diagnosis column
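
These two cleanup steps can be sketched with pandas as below. The tiny DataFrame here is a stand-in for the real CSV (which has 569 rows and 33 columns), just to keep the example self-contained:

```python
import pandas as pd

# Tiny stand-in for the real WDBC CSV, with the same problem columns.
df = pd.DataFrame({
    "id": [842302, 842517],
    "diagnosis": ["M", "B"],
    "radius_mean": [17.99, 20.57],
    "Unnamed: 32": [float("nan"), float("nan")],
})

# Step 1: drop the empty trailing column that pandas reads from the CSV.
df = df.drop(columns=["Unnamed: 32"])

# Step 2: encode the label -- 1 for malignant, 0 for benign.
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})

print(df["diagnosis"].tolist())  # [1, 0]
```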

3. Correlation Among Fields and Taking Actions
Now, let’s see the correlation among fields. Here is the result:

[Figure: correlation matrix of all fields]

Looking at the matrix above, we can identify the fields to remove: ‘area_mean’, ‘area_se’, ‘area_worst’, ‘compactness_worst’, ‘concave points_mean’, ‘concave points_worst’, ‘concavity_mean’, ‘concavity_se’, ‘concavity_worst’, ‘fractal_dimension_se’, ‘fractal_dimension_worst’, ‘perimeter_mean’, ‘perimeter_se’, ‘perimeter_worst’, ‘radius_worst’, ‘smoothness_worst’, and ‘texture_worst’ are highly correlated with one another. After removing them, the matrix looks like this:

[Figure: correlation matrix after removing the highly correlated fields]
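
One common way to find such fields programmatically is to scan the upper triangle of the absolute correlation matrix for pairs above a threshold. A minimal sketch (the 0.9 threshold and the helper name are illustrative assumptions, not taken from the post):

```python
import numpy as np
import pandas as pd

def highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is counted once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy example: b is an exact multiple of a (correlation 1.0), c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({"a": a, "b": 2 * a, "c": rng.normal(size=100)})
print(highly_correlated(demo))  # ['b']
```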
4. Data Preprocessing

Ensuring data quality is paramount, so we preprocess the data by standardizing all numerical features. To do that, we use scikit-learn’s StandardScaler as below.

# Data Preprocessing
from sklearn.preprocessing import StandardScaler

# X holds the remaining feature columns after dropping the correlated fields;
# StandardScaler rescales each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

After the transformation, the first five rows of the dataset look like this:

[ 1.09706398e+00, -2.07333501e+00,  1.56846633e+00, 3.28351467e+00,  2.21751501e+00,  2.25574689e+00, 2.48973393e+00, -5.65265059e-01, -2.14001647e-01, 1.31686157e+00,  6.60819941e-01,  1.14875667e+00, 2.75062224e+00],
[ 1.82982061e+00, -3.53632408e-01, -8.26962447e-01,  -4.87071673e-01,  1.39236330e-03, -8.68652457e-01, 4.99254601e-01, -8.76243603e-01, -6.05350847e-01, -6.92926270e-01,  2.60162067e-01, -8.05450380e-01, -2.43889668e-01],
[ 1.57988811e+00,  4.56186952e-01,  9.42210440e-01, 1.05292554e+00,  9.39684817e-01, -3.98007910e-01, 1.22867595e+00, -7.80083377e-01, -2.97005012e-01, 8.14973504e-01,  1.42482747e+00,  2.37035535e-01, 1.15225500e+00],
[-7.68909287e-01,  2.53732112e-01,  3.28355348e+00, 3.40290899e+00,  2.86738293e+00,  4.91091929e+00, 3.26373441e-01, -1.10409044e-01,  6.89701660e-01, 2.74428041e+00,  1.11500701e+00,  4.73268037e+00, 6.04604135e+00],
[ 1.75029663e+00, -1.15181643e+00,  2.80371830e-01, 5.39340452e-01, -9.56046689e-03, -5.62449981e-01, 1.27054278e+00, -7.90243702e-01,  1.48306716e+00, -4.85198799e-02,  1.14420474e+00, -3.61092272e-01, -8.68352984e-01]

5. Training Models

Before training, we split the dataset into training (80%) and testing (20%) sets using stratified sampling. The heart of our solution lies in the machine learning model. Python offers many algorithms; we used nine of the most popular ones, producing nine candidate breast cancer prediction models. Overall, training takes some time and depends mainly on the processor and RAM of the machine.
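
The split-and-train loop can be sketched as below. Three of the nine classifiers are shown to keep the example short (the model selection here is illustrative); the scaler is fit on the training set only, so no information from the test set leaks into preprocessing:

```python
# 80/20 stratified split, standardization, and a loop over several classifiers.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)   # fit on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Neighbors Classifier": KNeighborsClassifier(),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.4f}")
```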

6. Model Evaluation and Empowerment

After completing the model building process, we need to compare the performance of the different models; no single algorithm is best for every kind of dataset. Our models are now ready to face the ultimate test: real-world data. By evaluating their performance, we gauge their accuracy and effectiveness. Here is the performance of the nine models:

[Figure: ROC curves for the nine models]

Model_Name Jaccard_Score Accuracy_Score F1_Score LogLoss
8 XGBoost Classifier 0.872340 0.947368 0.947591 1.897034
5 Gradient Boosting Classifier 0.847826 0.938596 0.938450 2.213207
6 Ada Boost Classifier 0.847826 0.938596 0.938450 2.213207
7 Support Vector Machine 0.847826 0.938596 0.938450 2.213207
2 K Neighbors Classifier 0.844444 0.938596 0.938122 2.213207
4 Random Forest Classifier 0.829787 0.929825 0.929825 2.529379
0 Logistic Regression 0.813953 0.929825 0.928097 2.529379
3 Decision Tree Classifier 0.804348 0.921053 0.920443 2.845552
1 Gaussian Naive Bayes 0.750000 0.894737 0.894215 3.794069
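
The four columns of the comparison table come straight from sklearn.metrics. A small sketch on toy predictions (the labels and probabilities here are made up for illustration; note that log_loss expects predicted probabilities, not hard labels):

```python
# The four comparison metrics, computed for one hypothetical model.
from sklearn.metrics import jaccard_score, accuracy_score, f1_score, log_loss

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # one false negative, one false positive

print("Jaccard :", jaccard_score(y_true, y_pred))   # 3 / (3 + 1 + 1) = 0.6
print("Accuracy:", accuracy_score(y_true, y_pred))  # 6 / 8 = 0.75
print("F1      :", f1_score(y_true, y_pred))        # 2*3 / (2*3 + 1 + 1) = 0.75

# log_loss scores the predicted probability of the positive class.
y_prob = [0.9, 0.1, 0.8, 0.4, 0.2, 0.1, 0.7, 0.6]
print("Log loss:", log_loss(y_true, y_prob))
```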

 

According to the performance graph and results above, the XGBoost Classifier shows the highest performance among all models, so we selected it as our final algorithm, with an accuracy score of 94.74%. Now let’s look at the relative importance of the features for identifying breast cancer:

[Figure: relative feature importance of the final model]

Feature_Name Relative_Importance
0 radius_mean 0.396653
12 symmetry_worst 0.114696
1 texture_mean 0.084287
9 compactness_se 0.060002
3 compactness_mean 0.058612
5 fractal_dimension_mean 0.056233
11 symmetry_se 0.053392
2 smoothness_mean 0.045545
6 radius_se 0.042849
7 texture_se 0.025224
8 smoothness_se 0.023318
4 symmetry_mean 0.020734
10 concave points_se 0.018455
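
A table like this is built from the fitted model’s feature_importances_ attribute. The sketch below uses a RandomForestClassifier purely so the example runs without xgboost installed; XGBoost’s sklearn wrapper exposes the same attribute, so the pattern is identical:

```python
# Rank features by importance from a fitted tree-ensemble model.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=42).fit(data.data, data.target)

importance = (
    pd.DataFrame({
        "Feature_Name": data.feature_names,
        "Relative_Importance": model.feature_importances_,
    })
    .sort_values("Relative_Importance", ascending=False)
)
print(importance.head())
```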

So, we see that the most important feature for identifying breast cancer is radius_mean, followed by symmetry_worst and texture_mean. Through the power of machine learning, we’ve created a tool capable of aiding medical professionals in their battle against breast cancer. With high accuracy and the ability to process vast amounts of data, machine learning stands as a beacon of hope, enabling earlier detection and, ultimately, saving lives.

Conclusion

Breast cancer detection using machine learning in Python is a powerful application that showcases the potential of AI in healthcare. By following these steps and understanding the underlying concepts, you can build a robust breast cancer detection system of your own. Remember, continuous learning and experimentation are key to mastering these techniques and making a real impact in healthcare. In this blog post, we’ve transformed raw data into a predictive model, giving healthcare professionals a potent tool against breast cancer. As technology advances, so does our ability to make a difference. If you want to stay updated, you can subscribe to our Facebook page http://www.facebook.com/LearningBigDataAnalytics.
