How is a Clustering Algorithm like K-Means Used for Customer Segmentation?

Data and information relating to customers can come from a variety of sources, but all of it can be fed into a company’s data store and then analyzed to draw the necessary conclusions. Data stores used to collect such data include Big Data platforms, relational data stores, and data warehouses. The conclusions drawn from the analyses can then shape marketing efforts, advertisements, collaboration with partners, production processes, and product characteristics. Companies receive large volumes of data, in many varieties and at high velocity, and need advanced systems such as Big Data platforms to organize and analyze it and draw the necessary conclusions.

Why the K-Means Algorithm and Market Segmentation?

K-means partitions n observations (a data set) into k clusters, assigning each observation to the cluster with the nearest mean. Market segmentation, on the other hand, divides a pool of customers into classes, each with a common need. Segmentation helps companies identify the different categories of customers by their needs and demands, design marketing strategies relevant to each group, and implement those strategies through the right media channels and touch points to make the most of customer needs and demands.
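As a minimal sketch of the idea, here is K-means run on a toy customer table with scikit-learn; the features and values are invented for illustration and are not from any real data source:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: one row per customer, with illustrative features
# (annual spend in dollars, visits per month).
X = np.array([
    [200, 2], [220, 3], [250, 2],        # low spend, low activity
    [900, 10], [950, 12], [870, 11],     # mid spend, regular activity
    [5000, 25], [5200, 30], [4800, 28],  # high spend, high activity
])

# Partition the nine observations into k = 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index for each customer
print(kmeans.cluster_centers_)  # the mean of each cluster
```

Each cluster center is the mean of its members, which is exactly the "nearest means" criterion described above.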

Customers are divided into classes based on their social class, their knowledge of products and services, personality, attitudes and responses to products and services, lifestyle, values, occasions, perceived benefits of using products and services, and geography. Such division can be achieved with a clustering algorithm, an example being K-means. Companies begin by collecting information from different sources, such as Microsoft CRM, ERP systems such as Microsoft Dynamics GP, QuickBooks and other accounting systems, and loading it into a data warehouse. The first step is usually feature selection, where as much information as possible about each client is incorporated from a variety of data sources; the second is selection of the algorithm (in this case K-means); the third is validation of the results using various techniques; and the final step is interpretation, drawing conclusions from the results.

Extracting data from various systems into a data store

Pulling data from a variety of sources can be done using data mining technology. Putting the data into a single portal for analysis means focusing on three kinds of sources. Ideally, data is collected from a variety of systems: web traffic; external data sources such as Facebook, LinkedIn, or purchased databases; CRMs such as Salesforce.com and Microsoft CRM; Microsoft Dynamics GP and other ERP systems; accounting systems; and others, often through Demand Side Platforms (DSPs). These sources are discussed below:

  • First Party Data: This type of data is collected from your own systems. Take the example of analyzing the company’s web traffic: the data collected includes pages visited, number of viewers, number of searches, time spent on the site, and searches for particular keywords. These first-party data systems are integrated into the Big Data system.
  • Third Party Data: This is data purchased from third-party data-gathering systems. In some cases, companies have already been doing data integration. These third-party systems are connected or integrated with the Big Data solution for management and analysis. For instance, you can integrate Microsoft CRM and other CRM systems, Microsoft Dynamics GP and other ERPs, Google Analytics, and accounting systems with your data warehouse. In order to integrate data, you need a unique identifier shared by all data sets, called a primary key. For instance, a visitor-scoped custom variable (a flexible way to add custom variables for segmenting and categorizing customers) can be stored in Google Analytics or another system by using a customer ID as the primary key; this customer ID comes from Microsoft CRM or whichever system is the data source, and each custom variable can hold different values. Some coding is also needed to integrate certain systems with others. Tools such as DBSync can integrate some data before it is linked to other systems; for instance, DBSync links information between Salesforce CRM and QuickBooks. For B2B, metrics such as IP-to-network or company name are useful and can be purchased from GeoIP data providers. (A join on such a primary key is sketched just after this list.)
  • Data sourcing and exchange by various parties: Companies can sign up with other companies to share and exchange data.
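Everything above hinges on that shared primary key. A minimal sketch of the join in pandas; the table names, columns, and customer IDs are hypothetical:

```python
import pandas as pd

# Hypothetical extracts: one from a CRM, one from web analytics.
crm = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "company":     ["Acme", "Globex", "Initech"],
    "region":      ["East", "West", "East"],
})
web = pd.DataFrame({
    "customer_id": [101, 102, 104],
    "visits":      [34, 7, 19],
    "searches":    [12, 2, 5],
})

# customer_id acts as the primary key linking the two sources;
# only customers present in both systems survive an inner join.
merged = crm.merge(web, on="customer_id", how="inner")
print(merged)
```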

We can also list the top sources of Big Data as follows:

  • Social network profiles: all the social sites, accessed by means of API integration. The integration gathers information and imports fields and values from the social sites, for example a system that gathers information about each B2B marketer on a social site.
  • Social influencers: editors, analysts, and expert bloggers, captured by following their blogs, comments, and reviews and processing the words to derive their meaning through Natural Language Processing.
  • Activity-generated data: for instance, log files from mobile phones and computers, handled through parsing technologies.
  • Software as a Service (SaaS and cloud) data, accessed through a variety of technologies such as APIs and distributed data integration. Examples include DBSync.
  • Public data: available from the World Bank, Wikipedia, and other places, accessed via parsing and categorization technologies, among others.
  • Hadoop MapReduce application results: Hadoop is one of the most talked-about platforms for computing over data from logs and web posts and storing the output in columnar/NoSQL data stores, alongside data warehouse technology.
  • Legacy documents
  • Network and in-stream monitoring technologies
  • CRM systems such as Salesforce
  • ERP systems containing customer purchases and payment history
  • Email, chat, and other communications

There is a lot of data to analyze once you start looking at the potential variables. It is best to extract it into a Big Data store such as HDFS (the Hadoop Distributed File System) or databases such as Cassandra or Hive.
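A minimal PySpark sketch of that extraction step; the file paths and HDFS location are placeholders, and this assumes a running Spark-on-Hadoop setup:

```python
from pyspark.sql import SparkSession

# Assumes a Spark cluster with access to HDFS; paths are placeholders.
spark = SparkSession.builder.appName("customer-extract").getOrCreate()

# Read a CSV export from a source system (e.g. a CRM dump).
crm = spark.read.csv("/exports/crm_customers.csv",
                     header=True, inferSchema=True)

# Land it in the Big Data store as Parquet on HDFS for later analysis.
crm.write.mode("overwrite").parquet("hdfs:///warehouse/customers")
```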

Using K-Means to analyze data in a Big Data system

It is important to note that the actual analysis and clustering of data happens in the Big Data system. The first step towards clustering data using K-means in a Big Data setting is to prepare the data. In this case, a table is prepared containing information about usage habits or other attributes: for instance, the number of visits made to the site, the number of products purchased, the pages visited, the number of calls made from the phone, and so on. The table has one row per customer, identified by a Customer ID.
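A small sketch of such a table in pandas; the attribute names follow the examples above and the values are invented:

```python
import pandas as pd

# One row per customer, keyed by customer_id; the feature columns
# mirror the usage attributes described above.
features = pd.DataFrame({
    "customer_id":     [101, 102, 103, 104],
    "site_visits":     [34, 7, 19, 52],
    "products_bought": [5, 1, 2, 9],
    "pages_visited":   [120, 15, 60, 200],
    "calls_made":      [3, 0, 1, 7],
}).set_index("customer_id")

print(features)
```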

In creating segments, for instance, a mobile phone company could target three segments from its whole pool of clients: those making short, infrequent calls; those making calls of average duration and frequency during business hours; and those making long calls during business hours on weekdays. The segmentation results in a table that assigns each customer to one of the three segments. That is the simplest example: in a typical case, a company must draw on its CRM technology, its accounting and ERP systems across the world (take QuickBooks as an example), its website data, platforms built on mobile phones, and other sources, so the data collected can be extremely diverse.
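A sketch of that three-segment example with scikit-learn; the call durations and frequencies are invented, and the business meaning of each numeric label is read off the cluster centers:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy call records: (average call duration in minutes, calls per day).
X = np.array([
    [1.0, 2], [1.5, 3], [0.8, 1],        # short, infrequent calls
    [5.0, 8], [6.0, 7], [5.5, 9],        # average duration and frequency
    [20.0, 15], [25.0, 18], [22.0, 16],  # long business-hours calls
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Inspect the cluster centers to decide which business segment
# each numeric label corresponds to.
for i, center in enumerate(kmeans.cluster_centers_):
    print(f"segment {i}: avg duration {center[0]:.1f} min, "
          f"{center[1]:.1f} calls/day")
```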

Before running the algorithm, one must define the type of data to be input to and output from the algorithm.
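In practice, the input is a numeric matrix, usually standardized so that no single feature dominates the distance calculation, and the output is one integer cluster label per row. A sketch of that preparation (the standardization step is common practice, not something prescribed by this article):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Input: numeric features on very different scales
# (spend in dollars, visits per month).
X = np.array([[200.0, 2], [950.0, 12], [5200.0, 30], [220.0, 3]])

# Standardize so that dollar amounts do not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# Output: one integer cluster label per input row.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)
```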

Using K-means to partition data still poses one problem: identifying the number of clusters required, which is an input that must be supplied before running the procedure. The number of clusters chosen will affect the results, although all the parameters make a contribution. Recommended techniques for choosing the number of clusters include two-step clustering and internal index comparison procedures. Many practitioners also use heuristics, that is, indices, graphs, and dendrograms.

Others combine subjective opinion with these heuristics, and others rely on subjective assessment alone. Studies show that a majority (about two thirds) use three, four, or five clusters regardless of the problem, the variables, the nature of the segmentation, and the number of respondents involved. One can also try several numbers of clusters, generate several solutions, and keep the most stable one, or discuss the resulting segments with management.

Another way of determining the number of clusters is the Elbow Criterion. The rule here is that adding an extra cluster should not add significant information. Plot the number of clusters against the total intra-cluster distance and note where the “elbow” is, i.e. the point after which the decrease in intra-cluster distance from one number of clusters to the next drops off sharply (like the bend of an elbow). Choose the number of clusters at the elbow.
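A sketch of the elbow criterion using K-means inertia (scikit-learn’s name for the total intra-cluster squared distance), on synthetic data with three underlying groups:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic customer features with 3 underlying groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

# Inertia always falls as k grows, but the drop flattens
# after the "elbow"; pick the k where it levels off.
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
    print(f"k={k}: inertia={inertia:.0f}")
```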

Before the results are used or implemented, it is necessary to validate the output with external and/or internal measures. External measures include unary measures, which compare a single clustering result against some known “gold standard” or ground truth, and binary measures, which compare two outputs, such as two partitions, for similarity and agreement. Internal measures include compactness (which assesses intra-cluster variance and the homogeneity of the data sets), connectedness (which assesses how connected data items are to the other items and dense regions of their own group), and separation (which assesses how well one cluster is separated from another), among other methods.
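A sketch of both kinds of validation with scikit-learn: the silhouette score as an internal measure (it combines compactness and separation) and the adjusted Rand index as a binary external measure comparing two partitions, here a clustering against a known ground truth:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Synthetic data with known ground-truth groups.
X, truth = make_blobs(n_samples=300, centers=3, random_state=2)

labels = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X)

# Internal measure: silhouette rewards compact, well-separated clusters.
print("silhouette:", silhouette_score(X, labels))

# External (binary) measure: agreement between two partitions.
print("adjusted Rand:", adjusted_rand_score(truth, labels))
```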
