Heart Disease Classification Using Optimized Machine Learning Algorithms

ABSTRACT: Early detection of heart disease is exceptionally critical to saving human lives. Heart attacks are one of the primary causes of high death rates throughout the world; given the lack of human and logistical resources, in addition to the high costs of diagnosing heart diseases, which plays a key role in the healthcare sector, this model is suggested. In the field of cardiology, patient data plays an essential role in the healthcare system. This paper presents a proposed model that aims to identify the optimal machine learning algorithm that can predict heart attacks with high accuracy in the early stages. Machine learning concepts are used to train and test the model on patient data for effective decision-making.
The proposed model consists of three stages. The first stage is patient data collection and processing. The second stage is data training and testing using machine learning algorithms (Random Forest, Support Vector Machine, K-Nearest Neighbor, and Decision Tree), with the best classification accuracy (94.958 percent) obtained with the Random Forest algorithm. The third stage optimizes the classification results using the random search hyperparameter optimization technique; the best accuracy (95.4 percent) was again obtained with RF.


INTRODUCTION
Health-information-seeking behaviour is changing and can be observed around the globe. Many individuals face challenges when searching online for health information concerning illnesses and treatments, which consumes time and wastes money [1].
The "World Health Organization" regards cardiovascular diseases as the leading cause of death globally: up to 17.9 million individuals died in 2016, and 31 percent of all deaths worldwide were due to coronary heart disease. Cardiovascular diseases (CVD) are heart and blood vessel problems. Heart attacks and strokes account for four out of every five deaths related to cardiovascular disease. People who are at risk for cardiovascular disease may have high blood pressure or be overweight or obese [2]. The heart, being one of the largest and most essential organs in the human body, needs extra care. Because the majority of diseases are connected to the heart, it is vital to predict cardiac disorders, which requires comparative research in this sector. Because most patients die when their disease is discovered at an advanced stage owing to instrument inaccuracy, more efficient disease prediction algorithms are needed [3]. Current heart disease diagnosis approaches are inefficient at early detection for a variety of reasons, including accuracy and execution time; thus researchers are aiming to create an effective methodology for the early identification of cardiovascular illness. It is incredibly difficult to diagnose and treat heart disease where contemporary technology and medical specialists are unavailable [4]. In the research community, machine learning technologies have sparked a lot of interest. Machine learning approaches, as demonstrated in several recent papers, have the potential to provide high classification accuracy when compared to traditional data categorization procedures. Accurate prediction is critical since it can lead to adequate protection. The accuracy of predictions may fluctuate depending on the learning approach used. As a result, it is crucial to identify methods that can predict cardiac disease with high accuracy. The prediction accuracy achieved in this study is better than in earlier research.
Machine learning classification is one of the most practical approaches for generating assessments in both real-world and research contexts, and it is also used to evaluate the performance of various machine learning approaches for the categorization of patients with and without cardiac disease. In addition, the effectiveness of these techniques has been assessed using several classification performance metrics [5].
This paper presents a proposed model that aims to design and implement an automated model to predict heart disease with high accuracy in the early stages; a machine learning model with the Hyperparameter Optimization (HPO) randomized search technique is presented. To this end, the researchers create a pipeline of prediction algorithms for the clinical diagnosis of heart disease using machine learning technologies. To determine the characteristics of the machine learning approaches, an experiment was undertaken. The heart disease dataset was obtained from the IEEE DataPort data source. This dataset was chosen because it was curated by integrating five well-known cardiac disease datasets (Long Beach VA, Hungarian, Cleveland, Statlog, and Switzerland) and no previous research had worked on the same data. Four algorithms were used to generate prediction models for this experiment on the provided dataset (Support Vector Machine, K-Nearest Neighbor, Decision Tree, and Random Forest). Furthermore, the maximum accuracy obtained by the best approach discovered in this study is compared with the highest accuracy obtained in previous research.
The remainder of the paper is organized as follows. In Section 2, an overview of the related work is presented. Section 3 discusses the methodology. Research results and discussion are presented in Sections 4 and 5. Finally, the conclusion is given in Section 6.

RELATED WORKS
Several studies and experiments have been conducted on heart disease datasets. Below is a set of previous studies showing the datasets that researchers have worked on; these datasets have been combined by identifying common characteristics, and the combined dataset is used in this research.
In [6], heart disease is described as a serious disease that is a leading cause of death in all countries of the world. However, it is difficult for doctors to predict such diseases because they are complex, and diagnosis is also considered expensive. In this research, the researchers proposed a clinical support system as an aid to medical specialists to predict and diagnose heart diseases and make the best decisions. Several ML algorithms, such as Naïve Bayes, KNN, SVM, RF, and DT, were applied to predict heart failure using risk-factor data retrieved from medical files. Several prediction experiments were performed using the UCI heart disease dataset; the best results were obtained with NB, with an accuracy of 82.17 percent using cross-validation and 84.28 percent using a train-test split.
In [7], the goal of the study was to examine machine learning algorithms using several performance criteria to enhance accuracy. In the pre-processing stage, they used the mean value to replace missing values; the findings show that this works effectively. To identify patients with cardiovascular disease, the researchers used the UCI heart disease dataset. To demonstrate their results, they compared the effectiveness of various machine learning algorithms using the accuracy, precision, F1-score, and recall metrics. Using SVM with a linear kernel, an overall accuracy of 86.8 percent was attained.
In [8], the researchers worked on a wide range of machine learning algorithms, such as Random Forest, Support Vector Machine, Naïve Bayes, the Logistic Model Tree algorithm, and K-Nearest Neighbour, as well as data mining methods, and assessed them on the UCI heart disease dataset, which has 303 samples with fourteen attribute values. They discovered that the SVM accuracy score of 84.1584 percent was the best among them; the other algorithms include KNN, Naïve Bayes, and Decision Tree.
In [9], a strategy for identifying cardiac disease using feature selection and classification algorithms is presented. For feature engineering, feature selection techniques are applied using the Sequential Backward Selection algorithm (SBS-FS). The Cleveland heart disease dataset was used to assess the model: 70% of the dataset was utilized for training and the remaining 30% for validation. The suggested system's performance was evaluated using assessment metrics. The performance of the K-Nearest Neighbor (K-NN) classifier was evaluated on the full and selected feature sets. The suggested technique achieved a 90 percent prediction accuracy.
In [10], the paper's goal is to offer an optimization function based on a support vector machine (SVM). A genetic algorithm (GA) selects the most significant traits for predicting heart disease using this objective function, providing an efficient feature selection process. The dataset is taken from the Cleveland Heart Disease Database. The cardiovascular predictions are then created using a support vector classifier, which has an accuracy of 88.34 percent when diagnosing cardiac illness from the given attributes.
In [11], the purpose of the study is to find important characteristics and data mining methods that can increase the precision of heart disease prediction. Seven classification algorithms with various feature combinations were used to create prediction models: Naive Bayes, Logistic Regression (LR), k-NN, Decision Tree, Support Vector Machine (SVM), Neural Network, and the Vote approach. Datasets on cardiovascular disease were obtained from the UCI dataset repository. The researchers wanted to find a mix of variables and data mining approaches that might assist in identifying cardiovascular problems. To address the larger challenge of cardiac disease prediction, a comprehensive system framework that handles preprocessing, parameter tuning, and feature engineering is required. Results reveal that the heart disease diagnosis model created utilizing the recognized significant features and the top data mining method, Vote, achieves an accuracy of 87.4%.
In [12], machine learning classifiers are created and a comparative analysis is performed to make reliable predictions of heart disease. Five ML algorithms are developed, and the Cleveland Heart Disease dataset is used to thoroughly assess each one's performance. These classifiers are Logistic Regression, Naive Bayes, Random Forest, Support Vector Machine, and K-Nearest Neighbor. After pre-processing, the dataset is split into training and testing portions using an 80/20 ratio. There are many well-known classification algorithms for detecting cardiac disease; to illustrate each model's efficiency in identifying cardiac disease, a comparative study of machine learning techniques is performed. The receiver operating characteristics of the binary classifiers for pre-processed data can be tuned using hyperparameters. LR achieved the highest accuracy, 0.93.
In [13], a three-layer binary classifier based on neural networks (NNs) is suggested to predict the presence of cardiac disease. Univariate and bivariate exploratory data analysis (EDA) were used in the filtering procedures that created the feature space. In this study, the Cleveland UCI heart dataset was utilized. Artificially intelligent clinical decision systems are currently gaining ground so that medical effort can be saved and spent effectively. The study makes use of several data engineering strategies to increase accuracy to a maximum of 91.66% and an average of 88.33%.
In [14], a deep learning strategy is proposed for the diagnosis of heart illness based on Multiple Kernel Learning with an Adaptive Neuro-Fuzzy Inference System (MKL with ANFIS). The UCI heart disease characteristics are modeled using the Extreme Learning Machine algorithm (ELM). The proposed model achieves 80% precision.

METHODOLOGY
The goal of this study is to create a model that can anticipate heart disease and optimize the classification result using one of the optimization methods. This section includes data collection, dataset description, data pre-processing, feature engineering, and the applied machine learning algorithms, as well as a block diagram, evaluation metrics, and the study's procedure and methodology. Fig. (1) shows the architecture of the proposed model.

Dataset description
This cardiovascular disease dataset was compiled by integrating five famous cardiovascular disease datasets that had previously been available separately but had never been combined. It includes the five heart datasets over 11 common variables, making it the most complete heart disease dataset accessible for scientific study. The five datasets utilized in its construction are: Long Beach VA, Hungarian, Cleveland, Statlog, and Switzerland. There are 1190 samples in this dataset, each with 11 attributes. These datasets were gathered and pooled in one place to aid future research into cardiovascular disease and to enhance clinical diagnosis and early treatment; therefore, machine learning and data mining approaches connected to coronary artery disease (CAD) will be utilized. This data might be used to build a machine learning model for detecting the early onset of cardiovascular disease [15]. Table (1) shows the list of 11 attributes on which the framework works, while Table (2) shows the description of the heart disease dataset's minimal characteristics.

Data Pre-Processing
Data preparation is the most crucial initial stage in any analytical model; it helps organize data in a readable fashion, which enhances model effectiveness. Medical data is frequently incomplete, missing attribute values, and noisy owing to outliers or extraneous data [6]. The data preprocessing steps are:
1- Data cleaning: data cleaning is a task in which data is cleaned by removing missing data and duplicate data and resolving data inconsistencies; as a result, data quality improves, overcoming data problems [10]. We performed pre-processing on the dataset: of the 1190 samples, 272 duplicate records were removed. The remaining 918 patient records are used to identify whether or not a person has heart disease. The target value is set to one for patients who have heart disease and to zero otherwise, indicating that the patient does not have heart disease.
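As an illustration, the deduplication step can be sketched with pandas; the column names and values below are illustrative assumptions, not the actual attributes of the IEEE DataPort dataset.

```python
import pandas as pd

# Toy stand-in for the heart-disease table; the columns are hypothetical
# examples, not the exact fields of the dataset used in this study.
df = pd.DataFrame({
    "age":         [54, 54, 61, 40, 40],
    "cholesterol": [239, 239, 289, 180, 180],
    "target":      [1, 1, 1, 0, 0],  # 1 = heart disease, 0 = no heart disease
})

# Data cleaning: remove exact duplicate records, keeping the first occurrence.
df_clean = df.drop_duplicates().reset_index(drop=True)
print(len(df), "->", len(df_clean))  # 5 -> 3
```

In the study's dataset, the same operation reduces the 1190 raw samples to the 918 unique patient records used for training.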

2- Outlier removal: data that differs significantly from the dataset's norm is an outlier. The bulk of a dataset's outliers are thought to be noise, which reduces the model's performance and adds nothing to the relevance of the data [16]. Studies have shown that eliminating outliers from data helps generate better outcomes, and several outlier removal strategies have been developed. A boxplot is used in our suggested framework to eliminate outliers. Boxplots display five numerical values: the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. Any point that lies outside of the box formed by these five points when they are plotted is regarded as an outlier [17]. Some outlier values are also present in the cholesterol column and are removed in the same manner, as shown in Fig. (2-a); the cholesterol values after removing the outliers are shown in Fig. (2-b).
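The boxplot rule above corresponds to the interquartile-range (IQR) criterion. A minimal sketch, using made-up cholesterol readings rather than real patient values:

```python
import numpy as np

def remove_outliers_iqr(values):
    """Boxplot (IQR) rule: keep points within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]

# Illustrative cholesterol readings with one extreme value (603).
cholesterol = [180, 200, 210, 220, 230, 240, 250, 603]
print(remove_outliers_iqr(cholesterol))  # the 603 reading is dropped
```

Points beyond the whiskers (1.5 IQR past Q1 or Q3) are exactly the points drawn as isolated dots on a boxplot.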

3- Data transformation: the modification of data from one format to another is known as data transformation. It is ordinarily done when a source format must be converted into the format required for a particular purpose. It includes aggregation, standardization or normalization, and smoothing [18]. Normalization refers to rescaling a real numeric attribute to the range 0 to 1. Normalization is used in machine learning to make the trained model less sensitive to the scale of features; we use standard scaling methods [19].

Partitioning of data
Splitting is the process of randomly dividing the dataset into two groups: the first is used as training data and the second as test data. A model is created and trained using the training data, and then it is evaluated using the testing data [20]. In this study, the researchers decided to divide the data in an 80:20 ratio, with 80 percent going to training and 20 percent to validation.

Classification Analysis
The most common disease in the healthcare field is heart disease, and its prevalence is rising yearly. The following four commonly employed machine learning algorithms for predicting heart disease were examined in this research: Random Forest, Support Vector Machine, KNN, and Decision Tree. These methods are useful for predicting binary dependent variables. The ML algorithms used in this model are:
1- Support Vector Machine: supervised learning models used to analyze data and discover patterns in classification and regression analysis [21]. SVM is designed to discover hyperplanes in N-dimensional spaces (N features) that partition the data. Given a dataset, the data may or may not be linearly separable: the linear kernel works well if it is, but if the data is not linearly separable it becomes difficult to separate. To perform classification, a hyperplane is created, with samples from one class lying on one side and samples from the other class on the other. To guarantee the greatest possible separation between the two classes, the hyperplane is optimized. Support vectors are the data points from each class that are closest to the hyperplane [22].
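A hedged sketch of an SVM classifier in scikit-learn; the synthetic data stands in for the scaled 11-attribute heart-disease features, and the RBF kernel is shown as one option for data that is not linearly separable:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the 11-feature dataset; the real study trains
# on the combined IEEE DataPort heart-disease data.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)

clf = SVC(kernel="rbf", C=1.0)  # rbf kernel handles non-linearly-separable data
clf.fit(X, y)
print(round(clf.score(X, y), 3))  # training-set accuracy
```

Swapping `kernel="rbf"` for `kernel="linear"` recovers the linearly separable case discussed above.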

2-K-Nearest Neighbor:
It is a classification method based on distance measurements. It is instance-based classification, which implies that similar instances are classified similarly. It is also known as a slow or lazy algorithm. Each point has an X-value and a Y-value. Given a new instance in terms of its X-value, we find the Y-values of the most similar instances and estimate the majority class from those Yi values. We use distance functions such as Euclidean distance, Manhattan distance, and Minkowski distance to locate the most similar/adjacent examples [23].
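The three distance functions named above are related: Minkowski distance with p=1 is Manhattan distance and with p=2 is Euclidean distance. A minimal sketch:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

a, b = [0, 0], [3, 4]
print(minkowski(a, b, 1), minkowski(a, b, 2))  # 7.0 5.0
```

KNN ranks all stored training instances by such a distance and takes a majority vote among the k nearest ones.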

3-Decision Tree:
Decision trees (DTs) are one of the most powerful and widely used classification and prediction methods in machine learning today. Many academics have utilized them as classifiers in the healthcare area to assess data and make choices. A DT creates a model that predicts the value of a target variable by learning fundamental decision rules generated from data properties and splitting data into branch-like segments. Input values can be continuous or discrete. The leaf nodes return class labels or probability scores, and it is possible to turn the tree into a set of decision rules, which can be readily shown graphically [6].
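A small sketch of the rule-extraction property mentioned above, on made-up rows (the feature names `age` and `cp` are illustrative, not the study's exact attributes):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy rows of [age, chest-pain flag]; labels: 1 = disease, 0 = healthy.
X = [[45, 0], [63, 1], [39, 0], [70, 1]]
y = [0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned splits print as human-readable if/then decision rules.
print(export_text(tree, feature_names=["age", "cp"]))
```

This readability is what makes DTs attractive in healthcare settings: a clinician can inspect the exact thresholds behind each prediction.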

4- Random Forest (RF): a data classification strategy that employs a large number of decision trees. Bagging and feature randomization are used to create an uncorrelated forest of trees whose committee prediction is more accurate than that of any individual tree; it is commonly used in classification and regression problems. To achieve the optimum outcome, this classification technique constructs many decision trees and combines them, mostly using bootstrap aggregation, or bagging, for tree learning [24].
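The bagging-of-trees idea can be sketched with scikit-learn's ensemble implementation; the synthetic data again stands in for the real 11-feature dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 11-attribute heart-disease features.
X, y = make_classification(n_samples=300, n_features=11, random_state=1)

# bootstrap=True: each tree trains on a bootstrap sample of the rows,
# and each split considers only a random subset of the features.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=1)
rf.fit(X, y)
print(len(rf.estimators_))  # 100 individual decision trees in the committee
```

At prediction time the forest takes a majority vote over all 100 trees, which is the "committee prediction" described above.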

Hyperparameter optimization
Hyperparameters in ML are used to control the operation of the algorithm in the model. Hyperparameter optimization fits a group of hyperparameters of the classification algorithm to enhance the performance of the ML model [25]. There are different types of hyperparameters in ML algorithms that must be tuned to improve the result [26].

Random Search
As candidate hyperparameter values, RS chooses at random a predefined set of samples from the range between the upper and lower bounds. These candidates are then trained until the allotted budget is exhausted. According to the theory behind RS, the global optimum, or at least an approximation of it, can be found if the configuration space is sufficiently large. Since each evaluation is independent, RS's key benefit is that it is simple to parallelize and to distribute resources for. It increases system efficiency by sampling a fixed number of parameter combinations from the given distribution, decreasing the likelihood of spending a lot of time in a small, underperforming region. Furthermore, given a sufficient budget, RS can identify the global optimum or a close approximation of it [25]. The main steps in RS are:
Step 1: Best ← a few initial randomized potential solutions.
Step 2: Repeat:
Step 3: S ← a randomized potential solution.
Step 4: if Accuracy(S) > Accuracy(Best) then
Step 5: Best ← S
Step 6: until Best is the optimal solution or until time runs out.
Step 7: Return Best.
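The steps above can be sketched with scikit-learn's `RandomizedSearchCV`; the parameter grid below mirrors the hyperparameters tuned in this study, but the specific candidate values and budget are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed heart-disease training data.
X, y = make_classification(n_samples=300, n_features=11, random_state=2)

# Candidate ranges for the RF hyperparameters (illustrative values).
param_dist = {
    "n_estimators":      [5, 9, 50],
    "max_depth":         [10, 100, 1000, None],
    "max_features":      ["sqrt", "log2"],
    "min_samples_split": [2, 8, 16],
    "min_samples_leaf":  [1, 3, 5],
    "bootstrap":         [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2),
    param_distributions=param_dist,
    n_iter=10,          # fixed budget: 10 random candidates (Steps 2-6)
    cv=3,               # Accuracy(S) estimated by 3-fold cross-validation
    random_state=2,
)
search.fit(X, y)        # Step 7: search.best_params_ is "Best"
print(search.best_params_, round(search.best_score_, 3))
```

Because each of the 10 candidates is evaluated independently, the loop parallelizes trivially (e.g. via `n_jobs=-1`), which is the key practical advantage of RS noted above.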

Hyperparameters used in Random Search optimization method
There are different types of hyperparameters in each ML algorithm that must be tuned to enhance the result. Table (3) shows the hyperparameters chosen for use in the RS optimization method, with their definitions and default values.

Results and Discussion
Before data preprocessing, the dataset must be split into an (80-20)% train-test split. This splitting ratio gives the model the ability to train and then be tested on an unseen dataset; the train and test sets are then applied to the classifiers in the model. Table (5) shows the classification results using the (SVM, KNN, DT, and RF) classifiers, with the best results obtained using the RF classifier, while Fig. (3) shows the comparison of test accuracy results for all classifiers without and with standard scaling. The optimization stage begins with applying RS to the dataset with all classifiers; this method was used to select the optimal values of the parameters specified for each classifier that affect the classification result. Table (7) shows the best test accuracy with the optimal hyperparameter values for each classifier, with the best result obtained using the RF classifier, while Fig. (4) shows the best accuracy comparison before and after optimization, and Table (8) shows a comparison between this work and previous works.

Conclusions
The heart is one of the most vital organs in the human body. This paper presented a proposed model to design and implement an automated model to predict heart attacks with high accuracy in the early stages. Data preprocessing techniques and machine learning algorithms were used to achieve the highest desired efficiency of the model. Model validation is conducted with a train-test split (80-20) of the dataset. The experimental results revealed that RF achieved better accuracy than all comparative models, as shown in Table (5).
After the start of the optimization process, the parameters for each classifier and their ranges are determined using the RS optimization method. The highest results were obtained when using the RF classifier with max_depth (1000), max_features (log2), n_estimators (9), min_samples_split (16), min_samples_leaf (3), and bootstrap (False). All these optimal values enhance the result and raised the best test accuracy to 95.4 percent.

ACKNOWLEDGEMENT
We would like to thank the reviewers for their valuable contribution to the publication of this paper.

CONFLICTS OF INTEREST
The author declares no conflict of interest.