Topic: In this assignment we will show K-means algorithm. This algorithm belongs in the clustering algorithms family.

Exercise: Create a Python script file and perform the following tasks:

  1. Import libraries:
    • numpy: For the calculation of arrays.
    • cdist: To calculate Euclidean distance.
    • load_digits: Dataset with pictures of numbers 0 to 9.
    • PCA: To reduce the dimension of picture from 64 to 2.
    • Matplotlib: To plot points in 2-dimension figure.
  1. Create a class of K-means with the name k_means, contains a constructor _init_ and a function fit. The constructor takes k which is an integer and the number of unique labels (also the number of clusters). Function fit takes 2 parameters data and k. Data is a numpy.ndarray of 2-dimensions.
  2. Load and use the dataset of sklearn digits. This dataset contains 1797 samples of 64 (8×8) pixels pictures of numbers 0 to 9.
  3. Load and use sklearn method PCA (Principal component analysis) which transforms the dimensions of pictures from 1797×64 to 1797×2. The reason of doing that is to plot the pictures as points in 2-dimensions board. Load and use the library matplotlib to plot the points.
  4. Select k random points from data as centroids. Plot the centroids with mark ‘x’.
  5. Find the distances between centroids and all the data points. The metric for calculating the distances will be the Euclidean metric. For example, for points A(x_A, y_A)   and  B(x_B, y_B), distance  d will be  d=\sqrt{(x_A-x_B)^2+(y_A-y_B)^2}. Blur the image.
  6. Assign all the points to the closest centroid, more specific the one with the minimum distance.
  7. Calculate the new centroids. The calculation comes from the mean value of points for each cluster.
  8. The algorithm ends when newly centroids are the same with the previous ones. Another way to stop the algorithm faster to solve the problem of waiting too many times, which depends in sample size is by defining a number of iterations. In our case dataset is short and we don’t need to consider that, but you should implement both ideas in your code.
  9. The K-means function returns the labels of data points, more specific a list of values from 0 to 9.
  10. Plot the diagram of points and centroids with unique color for each cluster. Make a .gif file of the changes of centroids and points that change during the process.

Help (if needed): Click Pdf  to download the exercise help.

Lecture for better understanding:http://icarus.csd.auth.gr/data-clustering-lecture/

This exercise was developed by G.R.Pegias.

————————————————————————————————————–

For the solutions to the exercises, please contact koroniioanna@csd.auth.gr