In today’s world, data is power. With companies having terabytes of data stored in servers, everyone is in the quest to discover insights that add value to the organization. With various examples to quote in which analytics is being used to drive actions, one that stands out is text classification.
Text Classification Process
The process of text classification starts with reading the document into the code. This is followed by series of data pre- processing steps chosen based on business problem:
- Tokenization: It breaks down longer string of text into smaller pieces for e.g. “This sentence needs to be tokenized” will be broken to subsequent words:
{‘This’, ‘sentence’, ‘needs’, ‘to’, ‘be’, ‘tokenized’}
- Text normalization: It aims to bring all the text content to same level. Some of the Text normalization options are as follows:
- Basic Normalization steps: Lowercasing text, Removal of Punctuations/Tags/whitespaces.
- Stemming removes affixes (prefix, suffix, infixes, circumfixes) from the word, for example studies will become studi and studying will become study.
- Lemmatization: Obtains the canonical/dictionary form of the word. For example, studies and studying will be converted to study. It is useful in context where words need to retain meaning after pre-processing.
- Stop Word Removal: Removal of common words such as the, and, is, a that provide no value to the overall sentence.
- Vectorization: The text sequences are transformed into numerical features that can be used in the model. TF-IDF, Count vectorizer are some of the commonly used approaches for same.
Feature Selection techniques selects a subset of the features based on their importance. Document Frequency is one of the common Feature Selection methods used, where words/features present below certain frequency in the document are filtered out. Feature Extraction is an optional step in some business scenarios, where additional features are created from pre-existing features. Clustering methods is one technique used to add new features.
The final step in this process is the tagging of the data to predefined categories using one of the following methods:
- Manual tagging
- Rule Based Filtering or string-matching algorithms such as fuzzy matching.
- Learning Algorithms such as Neural Networks that can utilize several hundred features to tag the Text content. The Learning Algorithms can be classified into two approaches:
- Unsupervised Learning –Applied where there is lack of previously tagged data. Techniques like clustering and associating rule-based algorithms can be applied to group together similar text. An example is segmentation of customers into groups based on their details, purchase history and behavior. These groups can then be further analyzed to identify patterns that allows for customizing customer approach.
- Supervised Learning –Applied where there is enough volume of accurately categorized data available. The ML algorithms learns the mapping function between the text and the tags based on already categorized data. Algorithms such as SVM, Neural Networks, Random Forest are commonly used for text classification.
Text Classification, both through supervised and unsupervised approach, finds application in various fields such as social media, marketing, customer experience management, digital media etc. Some of these will be elucidated below with use cases.
Unsupervised Learning:
The unsupervised approach to text classification looks for similar patterns and structures between text to group them together. It finds application in real world in scenarios where the data volume is too large to be completely classified, is in real time or the labels are not predefined.
Use Case 1: CRM Automation – Speeding up response time to customer complaints.
Any product-based company needs to act quickly on customer complaints to provide efficient customer service. However, the task of reading every customer complaint/ feedback can be both time consuming and error prone. By application of unsupervised learning methods, the complaints can be divided into main topics such as Technical Issues, Subscription related queries etc that can be assigned to respective teams to work upon. Given below is the Process to achieve this:
- Data Extraction and Pre-Processing: Data is extracted from the mails of customers or the CRM database. This data is then tokenized, normalized and TF-IDF algorithm is used to extract relevant features from the data. Once the data has been cleaned and features it extracted it is ready for the next step.
- Clustering: The clustering approach aims at grouping together similar unlabeled texts in same group (clusters). The DBSCAN approach of clustering is applied to group together complaints with similar text content. This method doesn’t require us to define the number of clusters needed, as it gains that information based on initial parameters set- eps (the min distance between two points to be part of same cluster) , minPoints(minimum number of points that can form a dense region). The clusters give us an idea on the number of complaint types that exist. However, to get any meaningful insights from these clusters the complaint types need to be identified and named. This takes us to another technique in unsupervised text classification.
- Topic Modelling: The topic model discovers abstract topics from a collection of documents. Examples of topic models are Latent Dirichlet Allocation (LDA), TextRank, Latent Semantic Analysis(LSA).
The LDA topic model takes input of the customer complaints and assigns topic to them. The basis it functions on is that every document- the complaint here, is a mixture of small topics. It does so by calculating the probability of the words in the complaint occurring in given topics. For example, it may identify the following main sub-topic clusters from a software company’s customer complaints:
Based on same and given that Technical, Subscription, Customer Service, No reply has the highest weights in their clusters we can identify the topics as: Technical Issues, Subscription Issues, Poor Customer Service, No response etc. For a customer mail with subject: “Server is throwing up error on running app” it finds the words app, server, error having highest probability of occurring within Technical Issues and thus assigns the complaint under the same.
Once the complaints are tagged the customer service team and the product team can segregate these mails and follow up based on the tags provided. They can also create automatic mails and responses to commonly occurring issues, that can help decrease load on the customer call executives.
Use Case 2: Search Engine Optimization using Topic Modeling
Most of the top read websites today have used the power of Topic Clusters to their benefit. Topic Clusters are a collection of content that is grouped under multiple relevant topics and sub-topics. For example, a website themed around Analytics field, can have this Topic cluster:
- Pillar Page: “Analytic” and overview of cluster content.
- Cluster Pages: Related clusters – Tools used for Data Analysis, Applications of Analytics, Magazines for Analytics, and content for these topics, going into depth of each sub-topic within the cluster. Linking of pages with each other and the pillar page where relevant. Usage of key words and cluster headings to improve the Google search ranking.
However, creating relevant content is not easy as it sounds. Here is where AI driven software are coming to the rescue of content marketers. The basis of these AI software are unsupervised techniques such as LDA, clustering etc. They take keywords from the marketer’s content pages and scan millions of related content pages on web. Based on this, they provide insights on what are the sub-topics that are relevant and popular, what the target audience is and what are the questions that are frequently asked around a topic. With this information, it becomes much easier for content marketers to create, modify, and upgrade their content and thus improve their page’s Google ranking.
Supervised Learning:
The supervised approach to text classification works with training models with already classified/tagged text. This approach has better accuracy and scaling up possibilities than unsupervised learning techniques.
Use Case 1: Deriving insights on new product from Social Media Posts
“Statistics show that 49% of world’s population, 3.8 Billion people today use social media.”
Source: Digital 2020, We are Social and Hootsuite
Thus, it goes without saying that organizations today have a strong social media presence. It is a medium for organizations to promote brand/product, identify/target potential customers, get public opinion on the brand, its’ products, its’ service, competitors etc. At launch of a new product, companies often try to understand the public perception on it by analyzing the social media posts shared about their product. Since reading each and every post can be tiresome, they utilize Supervised Learning Techniques to derive insights.
- Topic Labelling: After data extraction from social media posts, pre-processing, the first step is to find out what product features are been discussed the most by using Contextual Semantic Search(CSS). It takes concepts such as Price, feature, customer service, user accessibility as inputs and filters the posts to relevant concepts, respectively. Sentiment Analysis can then provide insights on the social sentiment across these topics.
- Sentiment Analysis can lead way in understanding if the social perception is positive, negative, or neutral towards the new product and its features. The algorithms applied for this can either be rule based (look for presence of certain words that are grouped as positive/negative/neutral) to decipher overall sentiment of the post or ML based that uses previously tagged posts to train upon and predict the final sentiment on posts. Sentiment Analysis on competitor’s social post can be used to obtain a benchmark.
Use Case 2: Spam Out.
Every mail account has the Spam folder. One prominent example of spam classification can be found in Gmail. Google applied AI based algorithm using Tensorflow to detect spams in Feb 2019. This enabled it to identify and isolate an additional 100M spams/day as compared to the initial rule-based approach. The ruled based approach functioned on finding presence of certain words to classify a mail as spam. The AI on other hand utilized previously categorized spam mails to identify patterns that helped it detect spam that the rule-based approach could not detect.
There are many more use cases of Text Analytics such as Ecommerce Product classification, Chat Bots, Chat Routing through Language detection etc. In all the use cases what is common is the ability of Text classification to increase efficiency of processes, decrease manuals efforts and derive insights from data. With growth in Technology, ML and AI occurring at exponential rates, the use cases of Text classification will also grow, thus increasing scope for Analytics to solve real world problems.