Feature Selection in Data Science with Examples
In this article, I am going to discuss Feature Selection in Data Science with Examples. Please read our previous article where we discussed Python Plotly for Data Science with Examples.
How to select the right data?
Sound data analysis and sound data selection depend on each other; making sure that appropriate data samples are collected is a critical step toward a successful outcome. Because your data is user-driven, getting it perfectly right is rarely possible; nevertheless, a data analyst has control over collecting the right data for your study!
Data is frequently biased due to the nature of the business, the locations in which it is collected, seasonal variations, and a variety of other factors, so you should take care not to let this bias into the sample you select.
Let’s take an example where you want to open an Italian cuisine restaurant in a place that is completely new to you. How will you decide whether it is a good choice for the business or not? How will you collect data on the people living nearby who could be target customers for your restaurant? How will you find out whether this cuisine is even in demand in that locality?
Bias in survey results, for example when we try to avoid negative responses in the data, may lead to poor decisions or even a larger mistake that hurts revenue. Because this step involves a lot of considerations, it’s critical for data analysts to be involved in the process.
We often assume that Machine Learning is the answer to everything. Contrary to this belief, Machine Learning algorithms depend on a correct data sample: what decides the outcome is not the model that fits the problem and produces a logical result, but the earlier phase in which the right sample is selected!
What are the best features to use?
In many cases, using all of the features in a data set does not result in the most accurate model. Model performance can be affected by the type of model used, the size of the data set, and a variety of other factors, including superfluous features.
Often, the features in a data set, in their raw form, do not provide enough information, or the best information, to train a performant model. Feature selection is the process of removing unneeded or conflicting features. It has three basic objectives.
- Improve the model’s ability to forecast new data with greater accuracy.
- Reduce the cost of computing.
- Create a model that is easier to understand.
There are several reasons why you would prefer to omit some features over others. These include associations between features, whether a statistically significant relationship to the target variable exists, and the value of the information contained within a feature.
Feature Selection Techniques in Data Science
Feature selection can be done manually, both before and after training, by analyzing the data set, or it can be done automatically using statistical approaches. Let’s have a look at both the methods –
Manual Feature Selection –
You may want to remove a feature from the training phase for a variety of reasons. These are some of them:
- A feature in the data set that is strongly correlated with another feature. If that’s the case, both features are essentially delivering the same information. Some algorithms are sensitive to correlated features.
- Features that don’t provide much information. A feature where the majority of samples have the same value is an example.
- Features with a weak to non-existent statistical link to the target variable.
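The second reason above, features where most samples share the same value, can be checked with a few lines of pandas. The following is only a minimal sketch: the helper names, the toy data, and the 0.9 cut-off are assumptions made for illustration, not part of the dataset used in this article.

```python
import pandas as pd


def dominant_value_share(df: pd.DataFrame) -> pd.Series:
    """Return, per column, the fraction of rows taken by the most common value."""
    # value_counts sorts by frequency (descending), so the first entry is the largest
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )


def low_information_columns(df: pd.DataFrame, max_share: float = 0.9) -> list:
    """List columns whose most common value covers at least `max_share` of rows."""
    shares = dominant_value_share(df)
    return shares[shares >= max_share].index.tolist()


# Toy data (made up for illustration): `has_wifi` is almost constant
toy = pd.DataFrame({
    "has_wifi": [1] * 19 + [0],   # 95% of rows share the value 1
    "ram": list(range(20)),       # every row has a different value
})

print(low_information_columns(toy))
```

Columns flagged this way are candidates for removal, not automatic deletions; a rare value can still be informative if it lines up with the target.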
Here are a couple of common methods for manually selecting features –
1. Checking Correlation –
Correlation measures the linear relationship between two variables. If two variables are correlated, we can predict one from the other. Good features are highly correlated with the target, so using correlation for feature selection makes sense. In addition, features should be correlated with the target yet uncorrelated with one another.
As a result, if two features are correlated with each other, the model only requires one of them, as the other does not provide any extra information.
To pick the variables, we need to set a threshold on the absolute value of the correlation coefficient, such as 0.5. If two predictor variables are found to be correlated, we can exclude the one with the lower correlation with the target variable. We can also compute multiple correlation coefficients to see if more than two variables are linked together. Multicollinearity is the term for this phenomenon.
Multicollinearity can hurt a model, in particular by making the estimated effect of each feature unstable and harder to interpret. So, it’s better to eliminate features with high correlation values (many people prefer dropping features with a correlation greater than 0.8).
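The rule described above (for each highly correlated pair, drop the feature that is less correlated with the target) can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name `drop_correlated_features` and the synthetic data are hypothetical, and the 0.8 threshold is the rule of thumb mentioned in the text.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(X: pd.DataFrame, y: pd.Series,
                             threshold: float = 0.8) -> list:
    """Return the names of features to drop.

    For every pair of features whose absolute correlation exceeds
    `threshold`, the feature less correlated with the target is dropped.
    """
    corr = X.corr().abs()                # feature-to-feature correlations
    target_corr = X.corrwith(y).abs()    # feature-to-target correlations
    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue  # this pair is already resolved
            if corr.loc[a, b] > threshold:
                weaker = a if target_corr[a] < target_corr[b] else b
                to_drop.add(weaker)
    return sorted(to_drop)


# Synthetic data (an assumption for illustration): `b` is a noisy copy of `a`
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)
X = pd.DataFrame({"a": a, "b": b, "c": c})
y = pd.Series(a + 0.5 * c)

dropped = drop_correlated_features(X, y)
print("Dropping:", dropped)
X_reduced = X.drop(columns=dropped)
```

Only one of `a` and `b` is removed, while the unrelated feature `c` is kept. Pairwise correlation does not catch cases where one feature is a combination of several others, so this check is a first pass rather than a complete test for multicollinearity.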
Example – Let’s check the correlation between features of a dataset.
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import dataset
df = pd.read_csv('train (3).csv')

# Set figure size
plt.figure(figsize=(15, 15))

# Create a heatmap to check the correlation between features in the dataset
sns.heatmap(df.corr(), annot=True)
plt.show()
2. Feature Importance –
After we’ve trained a model, we can use statistical analysis to learn how much each feature affects the model’s output and determine which features are the most useful.
A variety of methods are available for determining the relevance of a feature. Some are specific to a single algorithm, while others are model agnostic and can be applied to a wide range of models.
To demonstrate feature importance, we’ll use the feature_importances_ attribute of scikit-learn’s random forest classifier.
The code below fits the classifier and displays the feature importances in a plot –
# Import libraries (df is the DataFrame loaded in the previous example)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Splitting data into test and train sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('price_range', axis=1), df['price_range'],
    test_size=0.20, random_state=0)

# Fitting the model
classifier = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
classifier.fit(X_train, y_train)

# Plotting feature importances, sorted from least to most important
features = df.drop('price_range', axis=1).columns
importances = classifier.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 15))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
This provides a good indication of which features have an impact on the model and which do not. After analyzing this chart, we may decide to remove some of the less important features.
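Once the importances are available, removing the less important features can itself be done in code. The sketch below is one possible approach, not the only one: it uses synthetic data from scikit-learn’s make_classification as a stand-in for the article’s dataset, and the cut-off (keep features at or above the mean importance) is an arbitrary assumption chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the dataset used in this article)
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Fit a random forest and collect its feature importances by column name
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep features whose importance is at least the mean importance
# (an arbitrary cut-off for this sketch)
selected = importances[importances >= importances.mean()].index.tolist()
X_reduced = X[selected]

print(importances.sort_values(ascending=False))
print("Kept features:", selected)
```

It is worth retraining the model on the reduced feature set and comparing its score on held-out data with the original, since a feature with low importance is not always safe to drop.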
In the next article, I am going to discuss Feature Selection Case Study in Data Science with Examples. Here, in this article, I tried to explain Feature Selection in Data Science with Examples. I hope you enjoy this article.