K-Means clustering and its real use case in the Security Domain
Let’s start with a Basics…
What is Clustering?
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Or in other words, Clustering is an unsupervised machine learning task. It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling), clustering algorithms only interpret the input data and find natural groups or clusters in feature space.
The data points in the graph below clustered together can be classified into one single group. We can distinguish the clusters, and we can identify that there are 3 clusters in the below picture.
It is not necessary for clusters to be a spherical. Such as :
DBSCAN: Density-based Spatial Clustering of Applications with Noise
These data points are clustered by using the basic concept that the data point lies within the given constraint from the cluster center. Various distance methods and techniques are used for calculation of the outliers.
Why do we need Clustering ?
Clustering is very much important as it determines the intrinsic grouping among the unlabeled data present. There are no criteria for a good clustering. It depends on the user, what is the criteria they may use which satisfy their need. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describe their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes) or in finding unusual data objects (outlier detection). This algorithm must make some assumptions which constitute the similarity of points and each assumption make different and equally valid clusters.
Methods for Clustering
- Density-Based Methods : These methods consider the clusters as the dense region having some similarity and different from the lower dense region of the space. These methods have good accuracy and ability to merge two clusters. Example DBSCAN (Density-Based Spatial Clustering of Applications with Noise) , OPTICS (Ordering Points to Identify Clustering Structure) etc.
- Hierarchical Based Methods : The clusters formed in this method forms a tree-type structure based on the hierarchy. New clusters are formed using the previously formed one. It is divided into two category
- Agglomerative (bottom up approach)
- Divisive (top down approach)
- examples CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing Clustering and using Hierarchies) etc.
- Partitioning Methods : These methods partition the objects into k clusters and each partition forms one cluster. This method is used to optimize an objective criterion similarity function such as when the distance is a major parameter example K-means, CLARANS (Clustering Large Applications based upon Randomized Search) etc.
- Grid-based Methods : In this method the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operation done on these grids are fast and independent of the number of data objects example STING (Statistical Information Grid), wave cluster, CLIQUE (CLustering In Quest) etc.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups :
- Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not.
- Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point to be in those clusters is assigned. For example, from the above scenario, each customer is assigned a probability to be in either of 10 clusters of the retail store.
What is K-means Clustering?
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
K-Means Clustering Algorithm
Let’s say we have x1, x2, x3……… x(n) as our inputs, and we want to split this into K clusters.
The steps to form clusters are:
Step 1: Choose K random points as cluster centers called centroids.
Step 2: Assign each x(i) to the closest cluster by implementing euclidean distance (i.e., calculating its distance to each centroid)
Step 3: Identify new centroids by taking the average of the assigned points.
Step 4: Keep repeating step 2 and step 3 until convergence is achieved
Let’s take a detailed look at it at each of these steps.
We randomly pick K (centroids). We name them c1,c2,….. ck, and we can say that
Where C is the set of all centroids.
We assign each data point to its nearest center, which is accomplished by calculating the euclidean distance.
Where dist() is the Euclidean distance.
Here, we calculate each x value’s distance from each c value, i.e. the distance between x1-c1, x1-c2, x1-c3, and so on. Then we find which is the lowest value and assign x1 to that particular centroid.
Similarly, we find the minimum distance for x2, x3, etc.
We identify the actual centroid by taking the average of all the points assigned to that cluster.
Where Si is the set of all points assigned to the i’th cluster.
It means the original point, which we thought was the centroid, will shift to the new position, which is the actual centroid for each of these groups.
Keep repeating step 2 and step 3 until convergence is achieved.
Diagrammatic Implementation of K Means Clustering
STEP 1:Let’s choose number k of clusters, i.e., K=2, to segregate the dataset and to put them into different respective clusters. We will choose some random 2 points which will act as centroid to form the cluster.
STEP 2: Now we will assign each data point to a scatter plot based on its distance from the closest K-point or centroid. It will be done by drawing a median between both the centroids. Consider the below image:
STEP 3: points left side of the line is near to blue centroid, and points to the right of the line are close to the yellow centroid. The left one Form cluster with blue centroid and the right one with the yellow centroid.
STEP 4:repeat the process by choosing a new centroid. To choose the new centroids, we will find the new center of gravity of these centroids, which is depicted below :
STEP 5: Next, we will reassign each datapoint to the new centroid. We will repeat the same process as above (using a median line). The yellow data point on the blue side of the median line will be included in the blue cluster
STEP 6: As reassignment has taken place, so we will repeat the above step of finding new centroids.
STEP 7: We will repeat the above process of finding the center of gravity of centroids, as being depicted below
STEP 8: After Finding the new centroids we will again draw the median line and reassign the data points, like the above steps.
STEP 9: We will finally segregate points based on the median line, such that two groups are being formed and no dissimilar point to be included in a single group
The final Cluster being formed are as Follows
Use-Cases in the Security Domain
Here is a list of some of the interesting use cases of K-means in Security Domain:
1. Identifying crime localities
With data related to crimes available in specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.
2. Insurance fraud detection
Machine Learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Utilizing past historical data on fraudulent claims, it is possible to isolate new claims based on its proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect frauds is crucial.
3. Cyber-profiling criminals
Cyber-profiling is the process of collecting data from individuals and groups to identify significant co-relations. The idea of cyber profiling is derived from criminal profiles, which provide information on the investigation division to classify the types of criminals who were at the crime scene.
4. Call record detail analysis
A call detail record (cdr) is the information captured by telecom companies during the call, sms, and internet activity of a customer. This information provides greater insights about the customer’s needs when used with customer demographics. We can cluster customer activities for 24 hours by using the unsupervised k-means clustering algorithm. It is used to understand segments of customers with respect to their usage by hours.
5. Automatic clustering of it alerts
Large enterprise use infrastructure technology components such as network, storage, or database generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened for prioritization for downstream processes. Clustering of data can provide insight into categories of alerts and mean time to repair, and help in failure predictions.
6. Crime document classification
Cluster documents in multiple categories based on tags, topics, and the content of the document. This is a very standard classification problem and k-means is a highly suitable algorithm for this purpose. The initial processing of the documents is needed to represent each document as a vector and uses term frequency to identify commonly used terms that help classify the document. The document vectors are then clustered to help identify similarity in document groups.
K Means clustering helps us to analyze the historical crime rates and enhance the crime resolution rate of the present. Take actions to prevent future incidents by using preventive mechanisms based on observed patterns. Reduce the training time of the officers that are assigned to a new location and have no prior knowledge of site-specific crimes. Increase operational efficiency by optimally redeploying limited resources to the right places at the right times.