Clustering
This section walks through a clustering example with the SALSA package on the Iris dataset [UCI2010]. The package provides the function salsa and the SALSAModel type for the clustering case. Clustering is supported through particular choices of loss functions and distance metrics applied within the Regularized K-Means approach [JS2015], together with the cross-validation criterion SILHOUETTE (Silhouette index).
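For readers unfamiliar with the Silhouette index, the following is a minimal sketch of the textbook definition in plain Julia (0.5 or later, no packages), assuming Euclidean distances and at least two clusters. The names euclid and mean_silhouette are illustrative only and are not part of SALSA; the sketch does not claim to reproduce how the SILHOUETTE criterion is computed internally.

```julia
# Euclidean distance between two points given as vectors.
euclid(a, b) = sqrt(sum(abs2, a .- b))

# Mean silhouette value over the rows of X for integer cluster labels.
# Assumes at least two distinct labels.
function mean_silhouette(X, labels)
    n = size(X, 1)
    clusters = unique(labels)
    total = 0.0
    for i in 1:n
        own = 0.0          # mean distance to the other members of i's cluster
        nearest = Inf      # smallest mean distance to any other cluster
        singleton = false  # true when i is the only member of its cluster
        for c in clusters
            members = [j for j in 1:n if labels[j] == c && j != i]
            if isempty(members)
                singleton = true
                continue
            end
            d = sum([euclid(X[i, :], X[j, :]) for j in members]) / length(members)
            if c == labels[i]
                own = d
            else
                nearest = min(nearest, d)
            end
        end
        # by convention the silhouette of a singleton cluster member is 0
        total += singleton ? 0.0 : (nearest - own) / max(own, nearest)
    end
    return total / n
end
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that points sit closer to a neighbouring cluster than to their own. This is why the index can serve as a model selection criterion when no labels are available.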
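using SALSA, Clustering, Distances, MLBase, Base.Test

# load the Iris data; the last column holds the class labels
Xf = readcsv(joinpath(Pkg.dir("SALSA"), "data", "iris.data.csv"))
Y = convert(Array{Int}, Xf[:,end])
# the number of distinct labels sets the number of clusters
k_clusters = length(unique(Y))
# placeholder label vector (uninitialized); the true labels Y are not passed to salsa
dY = Array{Int}(length(Y))
X = Xf[:,1:end-1]

srand(1234) # fix the random seed for reproducibility
algorithm = RK_MEANS(k_clusters)
model = SALSAModel(LINEAR, algorithm, LEAST_SQUARES,
                   validation_criterion=SILHOUETTE(),
                   global_opt=DS([-1]), process_labels=false,
                   cv_gen = Nullable{CrossValGenerator}(Kfold(length(Y),3)))
# cluster X, then assign the same data X (passed as test data) to clusters
model = salsa(X, dY, model, X)
mappings = model.output.Ytest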
The snippet above uses a dedicated algorithm type, RK_MEANS(), which implements the approach in [JS2015]. Instantiating RK_MEANS(k_clusters) sets the maximum number of clusters to be extracted. The individual prototype vectors are re-learned after each re-partitioning of the dataset X, and this cycle is repeated algorithm.max_iter times (by default algorithm.max_iter == 20). The default loss function is LEAST_SQUARES and the default distance metric is Euclidean() [1], which corresponds to the original setting of the unregularized K-Means approach. Please refer to the Algorithms section and the RK_MEANS() function for details on which combinations of loss functions and metrics are supported.
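To make the alternation between re-partitioning and prototype learning concrete, here is a minimal sketch of classical, unregularized K-Means in plain Julia (0.5 or later, no packages). It is not the regularized stochastic procedure of [JS2015] and it does not use SALSA; the names sqdist and kmeans_sketch and the init argument are hypothetical. It relies on the standard fact that, with a least-squares loss and the Euclidean metric, the best prototype for a cluster is the mean of its members.

```julia
# Squared Euclidean distance between two vectors.
sqdist(a, b) = sum(abs2, a .- b)

# Classical K-Means: alternate re-partitioning and prototype updates.
# `init` holds the starting prototypes as rows (hypothetical argument);
# `max_iter` mirrors the default of 20 quoted in the text.
function kmeans_sketch(X, init; max_iter = 20)
    n, d = size(X)
    k = size(init, 1)
    W = copy(init)              # prototype vectors, one per row
    labels = zeros(Int, n)
    for iter in 1:max_iter
        # re-partition: assign every point to its closest prototype
        for i in 1:n
            best, best_d = 1, Inf
            for c in 1:k
                dc = sqdist(X[i, :], W[c, :])
                if dc < best_d
                    best, best_d = c, dc
                end
            end
            labels[i] = best
        end
        # prototype update: the least-squares minimizer is the cluster mean
        for c in 1:k
            members = [i for i in 1:n if labels[i] == c]
            if !isempty(members)   # empty clusters keep their old prototype
                for j in 1:d
                    W[c, j] = sum(X[members, j]) / length(members)
                end
            end
        end
    end
    return W, labels
end
```

Because empty clusters simply keep their previous prototype, fewer than k clusters may end up populated, which is consistent with k being read as a maximum number of clusters. A call such as kmeans_sketch(X, X[[1, 3], :]) uses two rows of a floating-point matrix X as starting prototypes.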
[UCI2010] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
[JS2015] Jumutc V., Suykens J.A.K., “Regularized and Sparse Stochastic K-Means for Distributed Large-Scale Clustering”, Internal Report 15-126, ESAT-SISTA, KU Leuven (Leuven, Belgium), 2015.
Footnotes
[1] Metric types are defined in the Distances.jl package.