The Google search engine handles, on average, over 3.8 million searches per minute. People use the web more every day to gain knowledge, connect with people, shop, transact, follow the news, and be entertained. Among these uses, many people turn to search engines to gather information, since they offer an abundance of data and an easy way to find it.
Search engines generate revenue by placing ads on their search result pages, and placing ads on the results of the most searched queries gives the best chance of generating more revenue. However, because different people use different languages and different text patterns to search for the same things, skimming through each individual query is impractical. Standardizing and categorizing queries makes it far easier to identify the most commonly searched ones.
In this blog, we will walk you through an approach to identifying the most common searches using text mining techniques.
Approach
Step 1: Getting Distinct Search Queries
People make millions of identical search queries every day, and analyzing the same query multiple times would be time-consuming and repetitive. So, for the next steps we consider only distinct queries with a higher search share.
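As a rough illustration, here is a minimal pandas sketch of this step, assuming the raw logs are available as one query string per search (the example data and the 10% share threshold are placeholders):

```python
import pandas as pd

# Hypothetical raw query log: one entry per individual search.
raw_queries = pd.Series([
    "used toyota camry", "used toyota camry", "minutes timer",
    "mercedes benz", "used toyota camry", "minutes timer",
])

# Collapse to distinct queries with their counts and share of total search volume.
counts = raw_queries.value_counts().rename_axis("query").reset_index(name="search_count")
counts["search_share"] = counts["search_count"] / counts["search_count"].sum()

# Keep only queries above an (arbitrary) share threshold for the next steps.
top_queries = counts[counts["search_share"] >= 0.10]
print(top_queries)
```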
Step 2: Pre-Processing
Search queries carry a lot of noise that needs to be removed before feeding them into the model. These are the broad pre-processing steps (a short code sketch follows the list):
- Tokenization
- The process of breaking sentences into words or tokens for the machine to understand.
- Lower-casing/Upper-casing query terms
- The machine treats lower-case and upper-case forms of a word as different tokens. So, it is preferable to have all text in the same case, and we have chosen to lower-case the queries.
- Text Normalization
- HTML tags, special characters, digits, and punctuation are noise that machines struggle to interpret. Regex techniques can be applied to remove all of this noise from the text we are analyzing.
- Stop words like ‘the’, ‘this’, ‘there’, and ‘many’ do not add any useful information to the text and should be removed to make it easy for the machines to interpret the text.
- Stemming/Lemmatization
- The process of reducing words in different forms to their root word. For example, 'eats' and 'eating' are both converted to 'eat' by stemming.
- Lemmatization can be used in place of stemming; it maps each word to its dictionary form (lemma) rather than simply stripping suffixes.
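Here is a minimal sketch of these steps using NLTK and regular expressions; the regex patterns and the choice of the Porter stemmer are just one reasonable set of defaults:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-time download of the stop word list

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(query: str) -> list[str]:
    """Lower-case, strip noise, tokenize, drop stop words, and stem a query."""
    query = query.lower()                                 # lower-casing
    query = re.sub(r"<[^>]+>", " ", query)                # remove HTML tags
    query = re.sub(r"[^a-z\s]", " ", query)               # drop digits, punctuation, special characters
    tokens = query.split()                                # simple whitespace tokenization
    tokens = [t for t in tokens if t not in stop_words]   # stop word removal
    return [stemmer.stem(t) for t in tokens]              # stemming (a lemmatizer could be used instead)

print(preprocess("How many weeks from <b>April 2023</b> to today?"))
# ['mani', 'week', 'april', 'today']
```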
Step 3: Text Vectorization
Once pre-processing is done, the text must be converted into numeric vectors, the format that ML models work with. This process is called Text Vectorization. Common techniques include TF (Term Frequency), TF-IDF (Term Frequency-Inverse Document Frequency), and BERT (Bidirectional Encoder Representations from Transformers); a short sketch follows the list below.
- TF (Term Frequency): As the name suggests, this counts the number of times each term occurs in our data. Insignificant words such as 'of' and 'or' receive high counts under this technique. To avoid this, we tried TF-IDF.
- TF-IDF (Term Frequency-Inverse Document Frequency): Unlike plain TF, it down-weights terms that appear across most of the data and gives more weight to the rarer, more informative words. The major disadvantage of this technique is that two words with similar meanings are still counted separately, e.g., 'joy' and 'happy' are treated as two different words.
- BERT (Bidirectional Encoder Representations from Transformers): BERT assigns similar vectors to words and sentences with similar meanings, which lets us compute the similarity between queries. As a result, similar queries end up in the same cluster: grouping is based on meaning rather than exact word matches. For example, 'mercedes benz' and 'used toyota camry' fall into the same cluster because both are cars, while 'april', 'how many weeks from today', and 'minutes timer' fall into another cluster because they relate to dates and time.
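As a rough sketch, the snippet below builds TF, TF-IDF, and BERT-style representations for a few example queries; scikit-learn and the sentence-transformers library are assumed to be installed, and the model name is just one common choice:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sentence_transformers import SentenceTransformer

queries = [
    "used toyota camry",
    "mercedes benz",
    "how many weeks from today",
    "minutes timer",
]

# TF: raw term counts per query.
tf_matrix = CountVectorizer().fit_transform(queries)

# TF-IDF: counts re-weighted so that common terms matter less and rare terms more.
tfidf_matrix = TfidfVectorizer().fit_transform(queries)
print(tfidf_matrix.shape)  # (number of queries, vocabulary size)

# BERT-style sentence embeddings: semantically similar queries get similar vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(queries)
print(embeddings.shape)    # (number of queries, embedding dimension)
```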
Step 4: Clustering
Clustering is the process of grouping similar data points together. There are several clustering algorithms available to create clusters, and we have applied the K-Means algorithm to group our data points.
K-Means clustering groups the data points into K clusters, where each data point belongs to the cluster with the nearest mean. We can determine 'K' using the Elbow method: plot the distortion for a range of K values and select the K at the elbow, the point where distortion starts decreasing roughly linearly. For our data, the elbow plot suggested an optimal 'K' of 3, meaning three clusters work well. We can also use the Silhouette method in place of the Elbow method.
We can also use other clustering algorithms in place of K-Means.
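Below is a minimal sketch of K-Means with the Elbow method, assuming `embeddings` (or `tfidf_matrix`) has been computed in Step 3 over the full set of distinct queries, and that the chosen K turns out to be 3:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Try a range of K values and record the distortion (inertia) for each.
# (The query set must have more rows than the largest K tried.)
k_values = range(2, 11)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(embeddings)
    inertias.append(km.inertia_)

# Plot inertia against K; the "elbow" suggests a reasonable number of clusters.
plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Distortion (inertia)")
plt.show()

# Fit the final model with the chosen K and attach a cluster label to every query.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(embeddings)
labels = kmeans.labels_
```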
Step 5: Topic Modelling
After creating clusters, we need to identify the topics of each cluster. For this, we can apply n-grams or noun extraction techniques and determine the topics based on the results.
- N-grams – Getting the frequency for a set of co-occurring words
- Noun Extraction – Extract the nouns from the data
Using these two, we can identify the most frequently occurring words or phrases in each cluster and assign the cluster's title based on them, as sketched below.
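The sketch below illustrates both ideas on a couple of example queries from one cluster; NLTK's part-of-speech tagger stands in for the noun extraction step:

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("averaged_perceptron_tagger")  # POS tagger model (resource name may vary by NLTK version)

def top_ngrams(texts, n=2, top_k=5):
    """Return the most frequent n-grams across the given texts."""
    vec = CountVectorizer(ngram_range=(n, n)).fit(texts)
    counts = vec.transform(texts).sum(axis=0).A1
    ranked = sorted(zip(vec.get_feature_names_out(), counts), key=lambda pair: -pair[1])
    return ranked[:top_k]

def nouns(texts):
    """Extract nouns (NN* part-of-speech tags) from the given texts."""
    tokens = [t for text in texts for t in text.split()]
    return [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

# Example: queries from one cluster; the dominant n-grams and nouns suggest a title.
cluster_queries = ["used toyota camry price", "used toyota camry review"]
print(top_ngrams(cluster_queries))  # e.g. [('toyota camry', 2), ('used toyota', 2), ...]
print(nouns(cluster_queries))       # e.g. ['toyota', 'camry', 'price', ...]
```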
Analyzing the Results
We can analyze views and revenue by cluster and dig into the cluster with the highest revenue to see what type of queries drive the most profit for the business. Building on this, we can perform a Market Basket analysis to identify low-profit pages that are frequently viewed alongside high-profit pages, and place ads on those low-profit search pages to drive more profit.
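As an example, a simple aggregation of views and revenue per cluster (the per-query metrics below are hypothetical) could look like this:

```python
import pandas as pd

# Hypothetical per-query metrics joined with the cluster labels from Step 4.
results = pd.DataFrame({
    "query":   ["used toyota camry", "mercedes benz", "minutes timer", "how many weeks from today"],
    "cluster": [0, 0, 1, 1],
    "views":   [120_000, 80_000, 300_000, 150_000],
    "revenue": [9_500.0, 7_200.0, 1_100.0, 800.0],
})

# Aggregate by cluster to see which topics drive views versus revenue.
summary = results.groupby("cluster")[["views", "revenue"]].sum()
summary["revenue_per_1k_views"] = summary["revenue"] / summary["views"] * 1000
print(summary.sort_values("revenue", ascending=False))
```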
Conclusion
The clustering algorithm helped us find the target groups for improving the search engine's revenue. The same approach can be used to identify different target groups (customers, behaviors, features, etc.) for different business campaigns, helping businesses improve their profits.