Clustering hockey players: an introduction to k-means and hierarchical agglomerative clustering
A little while back, I created a model to cluster hockey players based on their skill and playstyle. This was an unsupervised learning project, meaning that I did not have any output variable to train the data on. The purpose was therefore to use basic and advanced statistics to classify hockey players into categories of their own, independent of any previously defined positions or labels, in order to group similar players together more effectively. Below are descriptions of the models and metrics that I have used thus far.

K-MEANS CLUSTERING
For K-Means Clustering, the first step is to specify the number of clusters that you want to use. Personally, I wanted at least 15 clusters, and I determined the exact number I would use (20) by picking the model with the highest Silhouette Score (which I’ll cover later) among those with at least 15 clusters. Then, the model creates 20 cluster “centers”, each of which is meant to represent the center of a cluster. To begin with, the cluster centers are placed at random locations, generally very far apart from one another. Next, the distance between each data point and each cluster center is calculated, and every data point is assigned to the cluster with the nearest center. Each cluster center is then moved to the center of the points belonging to its cluster, and the process is repeated until the cluster centers stop moving.

Source: https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm
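As a minimal sketch of that selection procedure using scikit-learn: the feature matrix X below is a hypothetical stand-in for the standardized player statistics (random placeholder data here), and the range of candidate cluster counts (15 through 25) is also an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real player-statistics matrix:
# 500 players x 12 standardized features.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 12)))

# Fit k-means for each candidate k and keep the model with the
# highest Silhouette Score, as described above.
best_k, best_score, best_model = None, -1.0, None
for k in range(15, 26):  # candidate cluster counts (assumed range)
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, model.labels_)
    if score > best_score:
        best_k, best_score, best_model = k, score, model

print(f"best k = {best_k}, silhouette = {best_score:.3f}")
labels = best_model.labels_  # cluster assignment for each player
```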


HIERARCHICAL AGGLOMERATIVE CLUSTERING

There are both bottom-up and top-down hierarchical clustering methods (strictly speaking, the bottom-up variant is the agglomerative one; the top-down variant is called divisive). They are inverses of each other but work in similar ways. In bottom-up agglomerative clustering, each individual point starts out as its own cluster. Then, the two clusters “closest” to one another merge, and the process repeats until there is only one cluster left (or until the process is cut off). The distance between two clusters can be defined as the shortest distance between points in the clusters, the furthest distance between points in the clusters, or the average distance between points in the clusters. Top-down clustering works in the opposite fashion: every point starts in the same cluster, and the clusters are split until every point belongs to its own cluster (or until the process is stopped).

Source: https://www.saedsayad.com/clustering_hierarchical.htm
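As a sketch, the three cluster-distance definitions above correspond to scikit-learn’s 'single' (shortest), 'complete' (furthest), and 'average' linkage options. X is the same hypothetical feature matrix as in the k-means sketch, and cutting the hierarchy off at 20 clusters is an assumption chosen to match the k-means setup.

```python
from sklearn.cluster import AgglomerativeClustering

# Bottom-up agglomerative clustering under each of the three
# cluster-distance definitions, cut off once 20 clusters remain.
for linkage in ("single", "complete", "average"):
    hac = AgglomerativeClustering(n_clusters=20, linkage=linkage)
    hac_labels = hac.fit_predict(X)  # X: feature matrix from the k-means sketch
    print(linkage, "->", hac_labels.max() + 1, "clusters")
```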


SILHOUETTE SCORE

The Silhouette Score is essentially a measurement of how well-defined the clusters actually are. For each point, it is defined as (p-q)/max(p,q), where p is the average distance to points in the nearest neighboring cluster, and q is the average distance to points in the same cluster. A Silhouette Score approaching 1 implies that a point is in the correct cluster, a Silhouette Score approaching -1 means that a point is not in the correct cluster, and a Silhouette Score of 0 means that a point is very close to the boundary between two neighboring clusters.

Source: https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_analysis_of_silhouette_score.htm
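The per-point formula can be written out directly. The sketch below, reusing X and labels from the k-means example, computes (p-q)/max(p,q) for a single point and checks it against scikit-learn’s silhouette_samples (it assumes every cluster has at least two points).

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

def silhouette_point(X, labels, i):
    """Compute (p - q) / max(p, q) for point i, as defined above."""
    d = pairwise_distances(X[i : i + 1], X).ravel()
    same = labels == labels[i]
    q = d[same & (np.arange(len(X)) != i)].mean()  # mean distance within own cluster
    p = min(d[labels == c].mean() for c in set(labels) if c != labels[i])  # nearest other cluster
    return (p - q) / max(p, q)

print(silhouette_point(X, labels, 0), silhouette_samples(X, labels)[0])
```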


Recently, I have been working to improve on the original model I created, which is what prompted me to write this blog post as a review. That being said, it is quite difficult to “improve” a clustering model, since determining what constitutes improvement is not as simple as raising an accuracy or F1 score. Beyond the Silhouette Score, there is certainly a degree of personal judgement involved in assessing how well the model performs. Therefore, I also need to use my knowledge of the sport, and of which features are important, to create the best possible clustering model. Hopefully, I will be able to continue improving my model and obtain the most informative clustering of players I can.