Approximation Algorithms for Explainable Clustering

Shan, Liren

doi:https://doi.org/10.21985/n2-c778-3v18

Work

Approximation Algorithms for Explainable Clustering

Public

Download PDF

Download All Files (.zip)

Clustering is a fundamental task in unsupervised learning, which aims to partition the data set into several clusters. It is widely used for data mining, image segmentation, and natural language processing. One of the most popular clustering methods is centroid-based clustering, including k-medians and k-means clustering. k-medians and k-means clustering choose k centers and assign each data point to its closest center. Thus, this clustering forms a Voronoi partition of the space based on k centers. Each cluster corresponds to a Voronoi cell, which usually has a complicated boundary. Hence, it is not necessarily easy for humans to understand these clusters. In real-world applications, many important decisions are made for different clusters created by clustering algorithms. To make these decisions more interpretable, we want to find more explainable clustering. In this thesis, we study approximation algorithms for explainable k-medians and k-means clustering. The problem of explainable k-medians and k-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). For this problem, our goal is to find a threshold decision tree that partitions data into k clusters and minimizes the k-medians or k-means objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits the node into two groups with a threshold cut on a single feature. The price of explainability is defined as the ratio of its cost and the optimal unconstrained cost. We provide an efficient algorithm that achieves the optimal and near-optimal upper bounds on the price of explainability for k-medians in l1 and k-means, respectively. We also provide a competitive algorithm and lower bound for explainable k-medians in l2. Finally, we provide a bi-criteria competitive algorithm that creates a k clustering by using a threshold tree with slightly more than k leaves. We show an exponential improvement in the price of explainability for k-means by adding a constant fraction of extra leaves. This captures the tradeoff between accuracy and explainability.

Creator