[liveProjects: Four AI Algorithm Projects with Python] This example shows how to use k-means clustering to classify points. My liveProject series Four AI Algorithm Projects with Python explains that algorithm and several others, and uses them to solve more interesting problems such as deciphering hand-written digits.

Use the k-means artificial intelligence algorithm to classify points in Python

[Points classified into clusters]

This example demonstrates a simple but remarkably useful artificial intelligence algorithm: k-means clustering.

Suppose you have a bunch of data points and you want to group them into categories or clusters. For example, in the picture on the right it's pretty clear that the points belong to three clusters. (Okay, the blue and yellow clusters are pretty close to being one big cluster, but suppose we know or suspect that there should be three.)

The k-means algorithm identifies natural clusters.

The Algorithm

The idea behind k-means is relatively straightforward. First, you pick random locations for the clusters. In this example, I assumed three clusters, so the program chose three random locations.

Next, the program loops through the data points and assigns each to the cluster that is closest.
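In Python, that assignment step might look something like the following sketch. The names here are illustrative rather than taken from the example program, and I'm assuming each point and cluster center is stored as an (x, y) tuple.

    import math

    def assign_points(points, centers):
        """Return a list of clusters, where clusters[i] holds the points
        closest to centers[i]."""
        clusters = [[] for _ in centers]
        for point in points:
            # Find the index of the center closest to this point.
            best = min(range(len(centers)),
                       key=lambda i: math.dist(point, centers[i]))
            clusters[best].append(point)
        return clusters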

The program then finds the centroid of the points assigned to each cluster. To find a cluster's centroid, it adds up the X coordinates of the points in that cluster and divides by the number of those points to get the X coordinate for the centroid. It calculates the centroid's Y coordinate similarly. It then moves each cluster to its centroid.
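In code, that centroid calculation could look roughly like this sketch (again an illustration, not the example program's exact code).

    def centroid(cluster_points):
        """Return the (x, y) centroid of a non-empty list of points."""
        n = len(cluster_points)
        x = sum(p[0] for p in cluster_points) / n
        y = sum(p[1] for p in cluster_points) / n
        return (x, y)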

The program repeats those steps until the clusters settle down and stop moving around. In this example, the program stops when the clusters' centroids move by less than one pixel in a round, but you could pick some other stopping criterion if you like.

Here's a summary of the algorithm.

  1. Pick random initial locations for the clusters.
  2. Assign each point to the nearest cluster.
  3. Use the assignments to calculate each cluster's centroid. Move the cluster to its centroid.
  4. Repeat steps 2 and 3 until the centroids stop moving around.
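
To make the steps concrete, here's a bare-bones sketch that ties the earlier snippets together. It reuses the illustrative assign_points and centroid helpers shown above, picks the initial centers from among the data points, and stops when no centroid moves by a pixel or more. If a cluster ends up with no points, this sketch simply keeps the old center to stay short; as discussed later, you might instead start over with new random locations.

    import math
    import random

    def k_means(points, k, tolerance=1.0):
        """Cluster points into k groups. Returns (centers, clusters)."""
        # 1. Pick random initial locations for the clusters.
        centers = random.sample(points, k)
        while True:
            # 2. Assign each point to the nearest cluster.
            clusters = assign_points(points, centers)
            # 3. Move each cluster to the centroid of its assigned points.
            #    (Keep the old center if a cluster has no points.)
            new_centers = [centroid(c) if c else centers[i]
                           for i, c in enumerate(clusters)]
            # 4. Repeat until no centroid moves by a pixel or more.
            if all(math.dist(old, new) < tolerance
                   for old, new in zip(centers, new_centers)):
                return new_centers, clusters
            centers = new_centers

You could call this as centers, clusters = k_means(points, 3) to look for three clusters.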

The algorithm usually stops surprisingly quickly. In the picture above, for example, the program took only nine iterations for the clusters to stabilize. In most of my tests, the algorithm needed between four and twelve rounds.

Using the Program

[Points classified into clusters] To make the data points, the program creates a number of "seeds" (the pink diamonds in the picture) and then generates random points centered around the seeds. There is some randomization so a point doesn't always land closest to its seed, but the result often gives some recognizable clusters. The picture on the right shows the initial setup in one test.
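If you want to generate similar test data yourself, something along these lines works. The ranges and spread here are made-up values for illustration, not the settings the example program uses.

    import random

    def make_points(num_seeds, num_points, width=400, height=400, spread=40):
        """Create clustered test points scattered around random seeds."""
        seeds = [(random.uniform(0, width), random.uniform(0, height))
                 for _ in range(num_seeds)]
        points = []
        for _ in range(num_points):
            # Pick a seed and offset the new point from it by a random amount.
            sx, sy = random.choice(seeds)
            points.append((random.gauss(sx, spread), random.gauss(sy, spread)))
        return seeds, points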

Enter the number of seeds you would like in the top text box. Then enter the number of points you would like to create and click Make Points. If you don't like the arrangement (for example, if two seeds are too close together), click Make Points again.

After you have a nice arrangement of points, enter the number of "suspected" clusters in the # Clusters text box and click Make Clusters. The program runs the algorithm, showing the results after each round. Use the slider to adjust the animation speed.

Mistakes

[Points classified into clusters] Note that the clusters sometimes reach an equilibrium that doesn't really reflect the clustering in the data. In the picture on the right, for example, the blue and yellow clusters divide up the group of points in the upper left and the green cluster grabs all of the rest of the points. This particular example worked most of the time, however, and it took me quite a few tries clicking Make Clusters before it messed up.

Occasionally you'll even see two of the clusters divide up all of the points and leave no points for the third cluster. If you suspect there should be three clusters, then this is obviously an incorrect result, so the program should start over with new random cluster locations.

To prevent these kinds of errors, a program could repeat the whole algorithm several times with different random starting locations. You'll find that the clusters are usually almost identical across different runs.
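
One way to do that, sketched below, is to run the algorithm several times and keep the run whose points lie closest to their assigned centroids. The scoring rule is my own choice for this sketch, and it reuses the hypothetical k_means function shown earlier.

    import math

    def best_of_runs(points, k, runs=10):
        """Run k_means several times and keep the tightest clustering."""
        best_score, best_result = float("inf"), None
        for _ in range(runs):
            centers, clusters = k_means(points, k)
            # Score: total distance from each point to its cluster's center.
            score = sum(math.dist(p, centers[i])
                        for i, cluster in enumerate(clusters)
                        for p in cluster)
            if score < best_score:
                best_score, best_result = score, (centers, clusters)
        return best_result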

Conclusion

[Points classified into clusters] Try implementing the algorithm on your own or download the example program to see how I did it.

Grouping points in clusters is somewhat abstract, but this algorithm does a remarkably good job with more complicated problems. For example, my liveProject series Four AI Algorithm Projects with Python uses k-means clustering to identify hand-drawn digits. As the picture on the right shows, the program was able to identify test data with a 96% success rate, a remarkable result for such a simple algorithm.

That series of liveProjects also explains how to implement several other clustering algorithms, simulate swarms, build expert systems, and make neural networks. Click this link for more information.

© 2024 Rocky Mountain Computer Consulting, Inc. All rights reserved.