Hey there, so lately I’ve been trying to expand my knowledge of machine learning. To do this I’ve been reading some books and watching lots of videos on the subject. I decided to write this post because learning how these different algorithms actually work under the hood (without doing a ton of math) helped me understand the underlying principles of machine learning much more clearly. Most of the code examples in this post come from the book “Introduction to Machine Learning with Python” by Andreas C. Müller & Sarah Guido, though I’ve edited some of them. This book is amazing; it’s the first one I’ve read that is actually easy to read and follow along with. If you are interested in purchasing it, click here to view it on Amazon. Anyways, let’s get started!

  • First off, if you are going to follow along you will need an IDE or text editor and a terminal open. To get started, go ahead and create a new folder called “ML tutorial” or something, and inside it create a new Python file. Then open it up.
  • Now, you will need to install a few packages: scikit-learn (which you import as sklearn), pandas, mglearn, matplotlib, and numpy. So, assuming you are using a Windows machine, in the command prompt type the following and hit ‘enter’ after each:
    pip install scikit-learn
    pip install pandas
    pip install mglearn
    pip install matplotlib
    pip install numpy
  • k-Nearest Neighbors, or KNN, is one of the simplest and most popular models used in machine learning today. Technically it is a non-parametric, lazy learning algorithm: it does no real work at training time beyond storing the training dataset, and to predict a label for a new data point it looks for the closest existing data point(s) and gives the new point the same label. It might be easier to understand with a picture, so here is one depicting a 2-D data set, with the green dot representing the prediction point and the different circles showing different values for K.
  • In the case of K = 1, the new data point would be labeled “class 1.” You can see that by changing the value of K you can get different predictions, and with them different levels of accuracy. There is a sweet spot, however: you don’t want K to be too large, or your predictions will be too general to actually mean anything.

  • Here is another example, this time with a value of 3 for K. Here you can see that, by taking a majority vote among the three closest neighbors of each prediction point, we were able to classify all three predictions correctly.
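  • To make that voting idea concrete, here is a minimal from-scratch sketch of the KNN prediction step (my own illustration, not from the book; the helper name knn_predict is made up, and later on we will use scikit-learn’s implementation instead):
    
    import numpy as np
    from collections import Counter
    
    def knn_predict(X_train, y_train, x_new, k=3):
        # assumes X_train is an (n, d) NumPy array and y_train an (n,) array of labels
        distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance to every training point
        nearest = np.argsort(distances)[:k]                        # indices of the k closest training points
        return Counter(y_train[nearest]).most_common(1)[0][0]      # majority vote among their labels
    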
  • Now, let’s implement a version of KNN with Python. So go back to your file, and at the top we need to import some packages, so type:
    
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    import pandas as pd
    import mglearn
    import matplotlib.pyplot as plt
    import numpy as np
    

    These packages will allow us to use KNN to categorize our own data sets.

  • Good, now, below the imports we need to load the training data we will use, which is provided by the mglearn package. So go ahead and type:
    
    X, y = mglearn.datasets.make_forge()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    

    This sets X to the data points and y to their labels, then splits them into X_train, X_test, y_train, and y_test. Passing random_state=0 seeds the random number generator, so the same input will always give you the same split. This reproducibility is handy when you want to be sure the algorithm is working properly.
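
    If you want to convince yourself that the seed really does make the split reproducible, here is a quick optional check (my own snippet, not from the book):
    
    import numpy as np
    from sklearn.model_selection import train_test_split
    
    # splitting twice with the same random_state yields identical results
    X_a, _, y_a, _ = train_test_split(X, y, random_state=0)
    X_b, _, y_b, _ = train_test_split(X, y, random_state=0)
    print(np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b))  # prints True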

  • Alright, next up we need to specify the number of neighbors we will use, and then we need to fit the classifier using our training set. So type this:
    
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)
    print("Test set predictions: {}".format(clf.predict(X_test)))
    print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))
    

    This instantiates the KNN classifier as our variable clf and then fits it to our X_train and y_train data sets. After that we predict labels for X_test, print them, and then score those predictions against the true labels in y_test. This is the end of the machine learning part of the code; the rest is just used to plot the data and display it using matplotlib.
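
    As a side note, score() here is just plain accuracy, the fraction of test points labeled correctly, so you could compute the same number by hand. A quick sketch reusing the variables from above (my own snippet, not from the book):
    
    import numpy as np
    
    pred = clf.predict(X_test)
    manual_accuracy = np.mean(pred == y_test)  # fraction of correct predictions
    print("Manual accuracy: {:.2f}".format(manual_accuracy))  # matches clf.score(X_test, y_test)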

  • Neat-o, now let’s plot this thing. To do this we will use matplotlib and mglearn. Go ahead and type this:
    
    fig, axes = plt.subplots(1, 3, figsize=(10, 3))
    for n_neighbors, ax in zip([1, 3, 9], axes):
        # the fit method returns the object self, so we can instantiate
        # and fit in one line
        clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
        mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
        mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
        ax.set_title("{} neighbor(s)".format(n_neighbors))
        ax.set_xlabel("feature 0")
        ax.set_ylabel("feature 1")
    
    axes[0].legend(loc=3)
    plt.show()
    

    The output from the full code should look like this:

  • With the graph you can see how using a small value for K makes the decision boundary between the circles and triangles follow the data very closely, whereas using values of 3 and 9 makes the boundary much smoother, even causing a few training points to land on the wrong side.
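
  • If you want to find that sweet spot for K numerically instead of eyeballing plots, one rough way (my own sketch, reusing the train/test split from earlier) is to loop over a few values of K and compare training and test accuracy; a very small K tends to ace the training set but generalize worse:
    
    for k in [1, 3, 5, 9, 15]:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print("k={}: train accuracy {:.2f}, test accuracy {:.2f}".format(
            k, clf.score(X_train, y_train), clf.score(X_test, y_test)))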

  • Full Code, TLDR:
    
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    import pandas as pd
    import mglearn
    import matplotlib.pyplot as plt
    import numpy as np
    
    
    X, y = mglearn.datasets.make_forge()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)
    print("Test set predictions: {}".format(clf.predict(X_test)))
    print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))
    
    fig, axes = plt.subplots(1, 3, figsize=(10, 3))
    for n_neighbors, ax in zip([1, 3, 9], axes):
        # the fit method returns the object self, so we can instantiate
        # and fit in one line
        clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
        mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
        mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
        ax.set_title("{} neighbor(s)".format(n_neighbors))
        ax.set_xlabel("feature 0")
        ax.set_ylabel("feature 1")
    
    axes[0].legend(loc=3)
    plt.show()
    
  • Anyways, that’s it for this post. It’s getting late and I’m tired. But I plan on putting up more posts about various types of ML algorithms here in the future. See-Yaa, thanks for reading!