This section walks through a clustering example with the SALSA package on the Iris dataset [UCI2010].

The package provides the salsa function and describes how to configure SALSAModel for the clustering case. Clustering is supported through particular choices of loss function and distance metric within the Regularized K-Means approach [JS2015], with the Silhouette index (SILHOUETTE) as the cross-validation criterion.
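To make the cross-validation criterion concrete, the following is a minimal sketch of the standard Silhouette index, s(i) = (b - a) / max(a, b), averaged over all points. It is written for Julia 1.x with the standard library only and does not use SALSA; the function name mean_silhouette is hypothetical, and SALSA's own SILHOUETTE criterion may be implemented differently.

```julia
using LinearAlgebra  # norm

# Mean silhouette width. Rows of X are points, labels[i] is the cluster of row i.
# Hypothetical helper for illustration; assumes at least two distinct clusters.
function mean_silhouette(X::AbstractMatrix{<:Real}, labels::AbstractVector{<:Integer})
    n = size(X, 1)
    ids = unique(labels)
    length(ids) >= 2 || error("silhouette needs at least two clusters")
    # Pairwise Euclidean distances between rows of X.
    D = [norm(X[i, :] - X[j, :]) for i in 1:n, j in 1:n]
    s = zeros(n)
    for i in 1:n
        own = [j for j in 1:n if labels[j] == labels[i] && j != i]
        isempty(own) && continue              # singleton cluster: s(i) = 0 by convention
        a = sum(D[i, own]) / length(own)      # mean distance to the point's own cluster
        # Smallest mean distance to any other cluster.
        b = minimum(sum(D[i, labels .== c]) / count(labels .== c)
                    for c in ids if c != labels[i])
        s[i] = (b - a) / max(a, b)
    end
    return sum(s) / n
end

# Two tight, well-separated groups of points.
X_demo = [0.0 0.0; 0.0 1.0; 10.0 0.0; 10.0 1.0]
good = mean_silhouette(X_demo, [1, 1, 2, 2])   # labels follow the groups
bad  = mean_silhouette(X_demo, [1, 2, 1, 2])   # labels cut across the groups
println("good = ", good, ", bad = ", bad)
```

A partition that follows the natural groups scores close to 1, while one that cuts across them scores below 0, which is why the index can serve as a model-selection criterion when no labels are available.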

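using SALSA, Clustering, Distances, MLBase, Base.Test

# Load the data: features in all but the last column, class labels in the last one
Xf = readcsv(joinpath(Pkg.dir("SALSA"), "data", ""))
Y = convert(Array{Int}, Xf[:,end])
k_clusters = length(unique(Y))  # one cluster per distinct class label
dY = Array{Int}(length(Y))      # uninitialized placeholder labels passed to salsa
X = Xf[:,1:end-1]

# Regularized K-Means with at most k_clusters clusters
algorithm = RK_MEANS(k_clusters)
model = SALSAModel(LINEAR, algorithm, LEAST_SQUARES,
         global_opt=DS([-1]), process_labels=false,
         cv_gen = Nullable{CrossValGenerator}(Kfold(length(Y),3)))
# Train on X and evaluate on the same X passed as the test set
model = salsa(X, dY, model, X)
mappings = model.output.Ytest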

The snippet above uses a special algorithm type, RK_MEANS(), which implements the approach in [JS2015]. Instantiating RK_MEANS(k_clusters) sets the maximum number of clusters to extract. The individual prototype vectors are re-learned algorithm.max_iter times, once after each re-partitioning of the dataset X (by default algorithm.max_iter == 20). The default loss function is LEAST_SQUARES and the default distance metric is Euclidean() [1], which corresponds to the original setting of the unregularized K-Means approach. Refer to the Algorithms section and the RK_MEANS() function for details on which combinations of loss functions and metrics are supported.
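The alternation between re-partitioning the data and re-learning the prototypes can be sketched with plain, unregularized K-Means. The code below is only an illustration for Julia 1.x using the standard library: it is not SALSA's RK_MEANS, which learns prototypes with a regularized, stochastic procedure [JS2015], whereas this sketch simply takes cluster means. The names kmeans_sketch and seed are hypothetical; max_iter mirrors the default of 20 mentioned above.

```julia
using Statistics  # mean
using Random      # MersenneTwister, randperm

# Plain K-Means sketch. Rows of X are points, k is the number of clusters.
# Hypothetical helper for illustration; not part of SALSA.
function kmeans_sketch(X::AbstractMatrix{<:Real}, k::Integer;
                       max_iter::Integer=20, seed::Integer=0)
    n = size(X, 1)
    rng = MersenneTwister(seed)
    # Initialise the prototypes with k distinct, randomly chosen points (k x d matrix).
    prototypes = float.(X[randperm(rng, n)[1:k], :])
    assignments = zeros(Int, n)
    for _ in 1:max_iter
        # Re-partition: assign each point to its nearest prototype
        # (squared Euclidean distance).
        for i in 1:n
            assignments[i] = argmin([sum(abs2, X[i, :] .- prototypes[c, :]) for c in 1:k])
        end
        # Re-learn: each prototype becomes the mean of the points assigned to it.
        for c in 1:k
            members = findall(==(c), assignments)
            isempty(members) && continue   # leave the prototype of an empty cluster unchanged
            prototypes[c, :] = vec(mean(X[members, :], dims=1))
        end
    end
    return assignments, prototypes
end

# Two well-separated groups of three points each.
X_toy = [0.0 0.0; 0.0 1.0; 1.0 0.0; 10.0 10.0; 10.0 11.0; 11.0 10.0]
assignments, prototypes = kmeans_sketch(X_toy, 2)
println(assignments)
```

On this toy data the first three points end up sharing one label and the last three the other; which group receives label 1 depends on the random initialisation.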

[UCI2010] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
[JS2015] Jumutc, V., Suykens, J.A.K., “Regularized and Sparse Stochastic K-Means for Distributed Large-Scale Clustering”, Internal Report 15-126, ESAT-SISTA, KU Leuven (Leuven, Belgium), 2015.


[1] Metric types are defined in the Distances.jl package.