{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Selection Demo using scikit-learn -- PIMA Indians Diabetes Dataset\n", "\n", "In this tutorial, I will use the \"PIMA Indians Diabetes\" dataset, where all patients are females at least 21 years old of Pima Indian heritage.\n", "\n", "The classification task is to predict whether an individual has diabetes (column 9) from the eight features in columns 1-8:\n", "1. Pregnancies: Number of times pregnant \n", "2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test \n", "3. BloodPressure: Diastolic blood pressure (mm Hg) \n", "4. SkinThickness: Triceps skin fold thickness (mm) \n", "5. Insulin: 2-hour serum insulin (mu U/ml) \n", "6. BMI: Body mass index (weight in kg/(height in m)^2) \n", "7. DiabetesPedigreeFunction: Diabetes pedigree function \n", "8. Age: Age (years) \n", "9. Outcome: Class variable (0 or 1); this is the target to predict, not an input feature " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Utility Functions & Imports" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The autoreload extension is already loaded. 
To reload it, use:\n", " %reload_ext autoreload\n" ] } ], "source": [ "%matplotlib inline\n", "%load_ext autoreload\n", "%autoreload 2\n", "#See bmes.ahmet/README.TXT for setting up BMESAHMETDIR environment variable.\n", "import sys,os; sys.path.append(os.environ['BMESAHMETDIR']); import bmes\n", "\n", "#print the first N lines of a file, truncating lines longer than 80 characters\n", "def printfileheadtruncated(file, N):\n", "    from itertools import islice\n", "    with open(file) as f:\n", "        for line in islice(f, N):\n", "            if len(line)>80: print(line[0:80]+' ...')\n", "            else: print(line.rstrip())\n", "\n", "\n", "#perform four-fold cross-validation of the method\n", "def crossvalidate(classifier,X,T):\n", "    from sklearn.model_selection import cross_val_score\n", "    scores = cross_val_score(classifier, X, T, cv=4)\n", "    print('--- 4-fold cross-validation accuracy: %%%.1f (+/-%.1f)' % (scores.mean()*100,scores.std()*100))\n", "\n", "\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\n", "#classifier = DecisionTreeClassifier();\n", "\n", "from sklearn.svm import SVC\n", "classifier = SVC(kernel='linear')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download data file(s)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- First 5 lines of data file:\n", "6,148,72,35,0,33.6,0.627,50,1\n", "1,85,66,29,0,26.6,0.351,31,0\n", "8,183,64,0,0,23.3,0.672,32,1\n", "1,89,66,23,94,28.1,0.167,21,0\n", "0,137,40,35,168,43.1,2.288,33,1\n" ] } ], "source": [ "datafile = bmes.datadir() + '/pima-indians-diabetes.data';\n", "bmes.downloadurl('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv',datafile);\n", "print('--- First 5 lines of data file:')\n", "printfileheadtruncated( datafile, 5 )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- First 5 rows of 
data:\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>numpregnancies</th><th>glucose</th><th>diastolic</th><th>tricepskin</th><th>insulin</th><th>bmi</th><th>pedigree</th><th>age</th><th>hasdiabetes</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td><td>1</td></tr>\n",
"<tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td><td>0</td></tr>\n",
"<tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td><td>1</td></tr>\n",
"<tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td><td>0</td></tr>\n",
"<tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " numpregnancies glucose diastolic tricepskin insulin bmi pedigree \\\n", "0 6 148 72 35 0 33.6 0.627 \n", "1 1 85 66 29 0 26.6 0.351 \n", "2 8 183 64 0 0 23.3 0.672 \n", "3 1 89 66 23 94 28.1 0.167 \n", "4 0 137 40 35 168 43.1 2.288 \n", "\n", " age hasdiabetes \n", "0 50 1 \n", "1 31 0 \n", "2 32 1 \n", "3 21 0 \n", "4 33 1 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas, numpy\n", "names = ['numpregnancies', 'glucose', 'diastolic', 'tricepskin', 'insulin', 'bmi', 'pedigree', 'age', 'hasdiabetes'];\n", "data = pandas.read_csv(datafile, names=names);\n", "print('--- First 5 rows of data:')\n", "data.head(5)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "X = data.values[:,:-1];\n", "T = data.values[:,-1];\n", "names=names[:-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Perform Classification using all features" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- 4-fold cross-validation accuracy: %76.3 (+/-1.7)\n" ] } ], "source": [ "crossvalidate(classifier, X, T)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Select Features using Filters\n", "\n", "For the list of available filters, see:\n", "http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- Scores of the features:\n", "insulin : 2175.6\n", "glucose : 1411.9\n", "age : 181.3\n", "bmi : 127.7\n", "numpregnancies: 111.5\n", "tricepskin : 53.1\n", "diastolic : 17.6\n", "pedigree : 5.4\n" ] } ], "source": [ "from sklearn.feature_selection import (SelectKBest, chi2)\n", "\n", "# Select top 3 features.\n", "selector = SelectKBest(chi2, k=3)\n", "selector.fit(X, 
T)\n", "\n", "print('--- Scores of the features:')\n", "sortedscores,sortednames = zip(*sorted(zip(selector.scores_,names), reverse=True))\n", "for name, score in zip(sortednames,sortedscores):\n", " print('{:<14}'.format(name) + ': ' + str(round(score,1)));\n" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- Data with only the top k features:\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>glucose</th><th>insulin</th><th>age</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>148.0</td><td>0.0</td><td>50.0</td></tr>\n",
"<tr><th>1</th><td>85.0</td><td>0.0</td><td>31.0</td></tr>\n",
"<tr><th>2</th><td>183.0</td><td>0.0</td><td>32.0</td></tr>\n",
"<tr><th>3</th><td>89.0</td><td>94.0</td><td>21.0</td></tr>\n",
"<tr><th>4</th><td>137.0</td><td>168.0</td><td>33.0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " glucose insulin age\n", "0 148.0 0.0 50.0\n", "1 85.0 0.0 31.0\n", "2 183.0 0.0 32.0\n", "3 89.0 94.0 21.0\n", "4 137.0 168.0 33.0" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('--- Data with only the top k features:')\n", "subX = selector.transform(X)\n", "# Look up the selected column names by index rather than passing strings through transform().\n", "subnames = [names[i] for i in selector.get_support(indices=True)]\n", "pandas.DataFrame(data=subX, columns=subnames).head(5)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- 4-fold cross-validation accuracy: %73.7 (+/-3.3)\n" ] } ], "source": [ "# Perform classification with the selected features:\n", "crossvalidate(classifier, subX, T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Select Features using Recursive Elimination" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- Selected Features (5):\n" ] }, { "data": { "text/plain": [ "['numpregnancies', 'glucose', 'diastolic', 'bmi', 'pedigree']" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_selection import (RFECV)\n", "\n", "# RFECV using the SVM classifier would take forever; let's use a faster method.\n", "#quickclassifier = DecisionTreeClassifier()\n", "from sklearn.linear_model import LogisticRegression\n", "quickclassifier = LogisticRegression(solver='liblinear')\n", "selector = RFECV(quickclassifier,cv=4,scoring='accuracy')\n", "\n", "selector.fit(X, T)\n", "\n", "print('--- Selected Features (%d):' % (selector.n_features_))\n", "[names[i] for i in range(len(names)) if selector.get_support()[i]]" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from pylab import *\n", "# grid_scores_[i] is the cross-validation score with i+1 features kept (default step=1).\n", "# Newer scikit-learn releases removed grid_scores_; use selector.cv_results_['mean_test_score'] there.\n", "xlabel(\"Number of features selected\")\n", "ylabel(\"Cross validation accuracy\")\n", "plot(range(1, len(selector.grid_scores_)+1), selector.grid_scores_)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- 4-fold cross-validation accuracy: %76.7 (+/-1.3)\n" ] } ], "source": [ "crossvalidate(classifier, selector.transform(X), T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sequential Forward Selection" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- Selected Features (8):\n", "(0, 1, 2, 3, 4, 5, 6, 7)\n" ] }, { "data": { "text/plain": [ "['numpregnancies',\n", " 'glucose',\n", " 'diastolic',\n", " 'tricepskin',\n", " 'insulin',\n", " 'bmi',\n", " 'pedigree',\n", " 'age']" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# mlxtend.feature_selection is an alternative library that can do forward/backward/floating selection.\n", "bmes.pipinstall('mlxtend')\n", "\n", "# Sequential Forward Selection (SFS)\n", "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "\n", "#quickclassifier = DecisionTreeClassifier()\n", "from sklearn.linear_model import LogisticRegression\n", "quickclassifier = LogisticRegression(solver='liblinear')\n", "# I am setting k_features to the total number of features so that selection runs all the way to the end; I can then visualize the scores and decide manually how many features work best.\n", "# You may want to avoid that if it takes too much time.\n", "selector = SFS(quickclassifier,cv=4,scoring='accuracy',k_features=X.shape[1],forward=True)\n", "\n", "selector.fit(X, T)\n", "\n", "#features shown here are NOT 
ordered by their performance.\n", "print('--- Selected Features (%d):' % (len(selector.k_feature_idx_)))\n", "print(selector.k_feature_idx_)\n", "[names[i] for i in selector.k_feature_idx_]" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "# Get each added feature and the resulting cross-validation score.\n", "selectnames = []\n", "selectscores = []\n", "for x in selector.get_metric_dict().values():\n", "    # The feature added at this step is the one not already recorded in selectnames.\n", "    newnames = set(names[i] for i in x['feature_idx']).difference(selectnames)\n", "    selectnames.append(','.join(newnames))\n", "    selectscores.append(x['avg_score'])" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "xlabel(\"Added Feature\")\n", "ylabel(\"Cross validation accuracy\")\n", "plot(range(1, len(selectscores)+1), selectscores)\n", "import matplotlib.pyplot as plt\n", "plt.xticks(range(1, len(selectscores)+1),selectnames,rotation=60);\n", "\n", "#also see: mlxtend.plotting.plot_sequential_feature_selection() for extra plotting functionalities (stdev, etc)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }