Clustering hockey players: an introduction to k-means and hierarchical agglomerative clustering
A little while back, I created a model to cluster hockey players based on their skill and playstyle. This was an unsupervised learning project, meaning that I did not have any output variable to train the data on. The purpose was therefore to use basic and advanced statistics to classify hockey players into categories of their own, independent of any previously defined positions or labels, in order to group similar players together more effectively. Below are descriptions of the models and metrics that I have used thus far.

K-MEANS CLUSTERING
For K-Means Clustering, the first step is to specify the number of clusters that you want to use. Personally, I wanted at least 15 clusters, and I determined the exact number I would use (20) by picking the model with the highest Silhouette Score (which I’ll cover later) among those with at least 15 clusters. Then, the model creates 20 cluster “centers”, each of which is meant to represent the center of a cluster. To begin with, the cluster centers are placed at random locations, generally very far apart from one another. Next, the distance between each data point and each cluster center is calculated, and every data point is assigned to the cluster with the nearest center. Each cluster center is then moved to the center of the points belonging to its cluster, and the process is repeated until the cluster centers stop moving.

Source: https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm
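As a minimal sketch of that selection procedure using scikit-learn: the feature matrix X below is a hypothetical stand-in for the standardized player statistics (random placeholder data here), and the range of candidate cluster counts (15 through 25) is also an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real player-statistics matrix:
# 500 players x 12 standardized features.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 12)))

# Fit k-means for each candidate k and keep the model with the
# highest Silhouette Score, as described above.
best_k, best_score, best_model = None, -1.0, None
for k in range(15, 26):  # candidate cluster counts (assumed range)
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, model.labels_)
    if score > best_score:
        best_k, best_score, best_model = k, score, model

print(f"best k = {best_k}, silhouette = {best_score:.3f}")
labels = best_model.labels_  # cluster assignment for each player
```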


HIERARCHICAL AGGLOMERATIVE CLUSTERING

There are both bottom-up and top-down hierarchical clustering methods (strictly speaking, the bottom-up variant is the agglomerative one; the top-down variant is called divisive). They are inverses of each other but work in similar ways. In bottom-up agglomerative clustering, each individual point starts out as its own cluster. Then, the two clusters “closest” to one another merge, and the process repeats until there is only one cluster left (or until the process is cut off). The distance between two clusters can be defined as the shortest distance between points in the clusters, the furthest distance between points in the clusters, or the average distance between points in the clusters. Top-down clustering works in the opposite fashion: every point starts in the same cluster, and the clusters are split until every point belongs to its own cluster (or until the process is stopped).

Source: https://www.saedsayad.com/clustering_hierarchical.htm
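As a sketch, the three cluster-distance definitions above correspond to scikit-learn’s 'single' (shortest), 'complete' (furthest), and 'average' linkage options. X is the same hypothetical feature matrix as in the k-means sketch, and cutting the hierarchy off at 20 clusters is an assumption chosen to match the k-means setup.

```python
from sklearn.cluster import AgglomerativeClustering

# Bottom-up agglomerative clustering under each of the three
# cluster-distance definitions, cut off once 20 clusters remain.
for linkage in ("single", "complete", "average"):
    hac = AgglomerativeClustering(n_clusters=20, linkage=linkage)
    hac_labels = hac.fit_predict(X)  # X: feature matrix from the k-means sketch
    print(linkage, "->", hac_labels.max() + 1, "clusters")
```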


SILHOUETTE SCORE

The Silhouette Score is essentially a measurement of how well-defined the clusters actually are. For each point, it is defined as (p-q)/max(p,q), where p is the average distance to points in the nearest neighboring cluster, and q is the average distance to points in the same cluster. A Silhouette Score approaching 1 implies that a point is in the correct cluster, a Silhouette Score approaching -1 means that a point is not in the correct cluster, and a Silhouette Score of 0 means that a point is very close to the boundary between two neighboring clusters.

Source: https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_analysis_of_silhouette_score.htm
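The per-point formula can be written out directly. The sketch below, reusing X and labels from the k-means example, computes (p-q)/max(p,q) for a single point and checks it against scikit-learn’s silhouette_samples (it assumes every cluster has at least two points).

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

def silhouette_point(X, labels, i):
    """Compute (p - q) / max(p, q) for point i, as defined above."""
    d = pairwise_distances(X[i : i + 1], X).ravel()
    same = labels == labels[i]
    q = d[same & (np.arange(len(X)) != i)].mean()  # mean distance within own cluster
    p = min(d[labels == c].mean() for c in set(labels) if c != labels[i])  # nearest other cluster
    return (p - q) / max(p, q)

print(silhouette_point(X, labels, 0), silhouette_samples(X, labels)[0])
```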


Recently, I have been working to improve on the original model I created, which is what prompted me to write this blog post as a review. That being said, it is quite difficult to “improve” a clustering model, since determining what constitutes improvement is not as simple as raising an accuracy or F1 score. Beyond the Silhouette Score, there is certainly a degree of personal judgement involved in assessing how well the model performs. Therefore, I also need to use my knowledge of the sport, and of which features are important, to create the best possible clustering model. Hopefully, I will be able to continue improving my model and obtain the most informative clustering of players I can.