# Clustering

This section walks through a clustering example with the SALSA package on the Iris dataset [UCI2010].

The package provides the `salsa` function, and this section explains how to configure a `SALSAModel` for the clustering case. Clustering is supported through particular choices of loss function and distance metric within the Regularized K-Means approach [JS2015], together with the cross-validation criterion `SILHOUETTE` (Silhouette index).
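For reference, the textbook definition of the Silhouette index is given below. It is included only to make the validation criterion concrete. The symbols ($C_I$, $d$, $a$, $b$, $s$) are notation introduced here, and how `SILHOUETTE` aggregates the per-point values inside SALSA is not stated in this section. For a point $i$ assigned to cluster $C_I$ and a distance $d$:

$$
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j), \qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j), \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

By convention $s(i) = 0$ when $|C_I| = 1$. Values lie in $[-1, 1]$, and values close to $1$ indicate that a point is much closer to its own cluster than to the nearest other cluster. The mean of $s(i)$ over all points is commonly used as a single score.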

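```julia
# Written for the Julia 0.x series (readcsv, srand, and Nullable were removed in Julia 1.0)
using SALSA, Clustering, Distances, MLBase, Base.Test

# Load Iris: the last column holds the class labels, the rest are features
Xf = readcsv(joinpath(Pkg.dir("SALSA"), "data", "iris.data.csv"))
Y = convert(Array{Int}, Xf[:,end])
k_clusters = length(unique(Y))   # number of distinct classes, used as the cluster count
dY = Array{Int}(length(Y))       # uninitialized placeholder passed in place of labels
X = Xf[:,1:end-1]

srand(1234)                      # fix the random seed
algorithm = RK_MEANS(k_clusters)
model = SALSAModel(LINEAR, algorithm, LEAST_SQUARES,
                   validation_criterion=SILHOUETTE(),
                   global_opt=DS([-1]), process_labels=false,
                   cv_gen = Nullable{CrossValGenerator}(Kfold(length(Y),3)))  # 3-fold CV
model = salsa(X, dY, model, X)   # train on X, then use X itself as the test set
mappings = model.output.Ytest    # mappings of the rows of X to clusters
```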
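One way to inspect the result is to compare the cluster mappings with the known Iris classes in `Y`. The sketch below is plain Julia and makes no SALSA calls. The helpers `contingency` and `purity` are hypothetical names introduced here, and the toy vectors stand in for `mappings` and `Y`, on the assumption that `mappings` can be treated as a vector of integer cluster ids.

```julia
# Hypothetical helper (not part of SALSA): count how often each cluster id
# co-occurs with each reference label. Rows are clusters, columns are labels.
function contingency(clusters, labels)
    length(clusters) == length(labels) || throw(ArgumentError("length mismatch"))
    cs = sort(unique(clusters))
    ls = sort(unique(labels))
    cidx = Dict(c => i for (i, c) in enumerate(cs))
    lidx = Dict(l => i for (i, l) in enumerate(ls))
    table = zeros(Int, length(cs), length(ls))
    for (c, l) in zip(clusters, labels)
        table[cidx[c], lidx[l]] += 1
    end
    return table
end

# Purity: fraction of points that carry the majority label of their cluster.
purity(table) = sum(maximum(table[i, :]) for i in 1:size(table, 1)) / sum(table)

# Toy stand-ins for `mappings` and `Y` from the example above.
toy_mappings = [2, 2, 2, 1, 1, 3, 3, 3, 1]
toy_Y        = [1, 1, 1, 2, 2, 3, 3, 2, 2]

table = contingency(toy_mappings, toy_Y)   # 3x3 table of co-occurrence counts
score = purity(table)                      # 8 of the 9 points match their cluster's majority label
```

Cluster ids are arbitrary, so cluster 2 matching class 1 in the toy data is not an error; only the co-occurrence pattern matters.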

A closer look at the snippet above shows that it uses a special algorithm type, `RK_MEANS`, which implements the approach of [JS2015]. Instantiating `RK_MEANS(k_clusters)` sets the maximum number of clusters to extract. After each re-partitioning of the dataset `X`, the individual prototype vectors are learned again; this cycle is repeated `algorithm.max_iter` times (by default `algorithm.max_iter == 20`). The default loss function is `LEAST_SQUARES` and the default distance metric is `Euclidean()` [1], which corresponds to the original setting of the unregularized K-Means approach. Refer to the *Algorithms* section and the documentation of `RK_MEANS()` for details on which combinations of loss functions and metrics are supported.
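The alternation between re-partitioning and prototype learning can be illustrated with a plain, unregularized K-Means loop. This is a minimal sketch in base Julia and not SALSA's `RK_MEANS` implementation, which follows the regularized stochastic approach of [JS2015]. The names `sqdist`, `assign_clusters`, `update_prototypes`, and `kmeans_sketch` are hypothetical, and the early stop on a stable partition is an addition of this sketch. With the least-squares loss and the Euclidean metric, the best prototype for a cluster is simply the mean of its members.

```julia
# Squared Euclidean distance between row i of X and row c of W.
sqdist(X, i, W, c) = sum((X[i, t] - W[c, t])^2 for t in 1:size(X, 2))

# Re-partitioning step: assign every row of X to its nearest prototype (row of W).
function assign_clusters(X, W)
    labels = zeros(Int, size(X, 1))
    for i in 1:size(X, 1)
        best, best_d = 1, Inf
        for c in 1:size(W, 1)
            d = sqdist(X, i, W, c)
            if d < best_d
                best, best_d = c, d
            end
        end
        labels[i] = best
    end
    return labels
end

# Learning step: the mean of the members minimizes the summed squared
# Euclidean distance to the prototype.
function update_prototypes(X, labels, W)
    Wnew = float(copy(W))
    for c in 1:size(W, 1)
        members = [i for i in 1:size(X, 1) if labels[i] == c]
        isempty(members) && continue   # empty cluster: keep the old prototype
        for t in 1:size(X, 2)
            Wnew[c, t] = sum(X[i, t] for i in members) / length(members)
        end
    end
    return Wnew
end

# Alternate the two steps for at most max_iter rounds.
function kmeans_sketch(X, W0; max_iter = 20)
    W = W0
    labels = assign_clusters(X, W)
    for iter in 1:max_iter
        W = update_prototypes(X, labels, W)
        new_labels = assign_clusters(X, W)
        new_labels == labels && break  # partition is stable, stop early
        labels = new_labels
    end
    return W, labels
end

# Toy data: two well-separated groups, rows are observations.
X_toy = [0.0 0.0; 0.2 0.1; 0.1 0.2; 5.0 5.0; 5.2 5.1; 5.1 5.2]
W0 = X_toy[[1, 4], :]                  # one initial prototype from each group
W, labels = kmeans_sketch(X_toy, W0)
```

On this toy data the first three rows end up in cluster 1 and the last three in cluster 2, with each prototype at the mean of its group.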

[UCI2010] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

[JS2015] Jumutc V., Suykens J.A.K., “Regularized and Sparse Stochastic K-Means for Distributed Large-Scale Clustering”, Internal Report 15-126, ESAT-SISTA, KU Leuven (Leuven, Belgium), 2015.

Footnotes

[1] Metric types are defined in the Distances.jl package.