Analysing Search Terms: N-grams

Our search marketing campaigns generate a lot of search terms. Search engines can assign all kinds of search terms (whichever they think are relevant) to the keywords we generate. While this is a good additional source of traffic, it also carries a significantly higher risk of generating bad traffic, as these search terms haven't gone through our keyword generation process. As always, we are on the lookout for two things:

  • Problems: Identify systematically bad search terms and block them, either by improving our keyword generation rules or by adding negatives.
  • Opportunities: Are we getting good search terms that we don’t have as keywords? If so, we should improve our keyword generation process to include them.


The challenge is that we get a lot of search terms (at least 500k for each of the top languages), most of them with very few clicks each, so it is difficult to spot problems or trends. Here is a subset of what a search term report (minus the performance statistics) looks like after aggregation:


As you can see, there are two obviously problematic search terms from people who are not searching for accommodation. These can easily be identified automatically using conversion statistics, but they are only the two cases that happened to pop up. Because the data is very long-tailed, performance statistics cannot identify all the bad search terms, most of which have only a few clicks each. While insignificant individually, together they add up to a large sum. As in many cases of quantitative text analysis, n-grams are very useful here: by splitting search terms into words and then grouping them into n-grams, we are able to notice significant trends. It is easiest to explain with an example:

The search term “cheap hotels in Amsterdam” will get split into:

  • 1-grams: “cheap”, “hotels”, “in”, “amsterdam”
  • 2-grams: “cheap hotels”, “hotels in”, “in amsterdam”
  • 3-grams: “cheap hotels in”, “hotels in amsterdam”
  • 4-grams: “cheap hotels in amsterdam”
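
The split above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the function name `split_into_ngrams` and the `max_n` parameter are made up for illustration, and it assumes simple lowercasing and whitespace tokenisation rather than a proper tokenizer.

```python
def split_into_ngrams(search_term, max_n=5):
    """Return {n: [n-grams]} for n = 1..max_n, using whitespace tokenisation."""
    words = search_term.lower().split()
    result = {}
    for n in range(1, min(max_n, len(words)) + 1):
        # Slide a window of n words across the search term
        result[n] = [" ".join(words[i:i + n])
                     for i in range(len(words) - n + 1)]
    return result


for n, grams in split_into_ngrams("cheap hotels in Amsterdam").items():
    print("{}-grams: {}".format(n, grams))
```

A term with four words produces no 5-grams, so the result simply stops at the 4-gram level.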

Then, for each n-gram level, the output from all search terms is aggregated, which makes it possible to identify the problems and opportunities. In our case n-grams up to length 5 are interesting, as longer n-grams appear too infrequently in our data to reveal trends.

Here is an example of a Python script that does the job:

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from collections import defaultdict
from nltk import word_tokenize
from nltk.util import ngrams
import string
import operator

sentences = [
    'Find hotel at Amsterdam',
    'Book hotel in Amsterdam?',
    'Book cheap room in ugly busy hotel middle of nowhere',
    'Book busy hotel in the middle of nowhere']

stemmer = SnowballStemmer("english")
ngram_count = defaultdict(lambda: 0)
N = 5   # Get me those 5-grams!
# NLTK corpus file names are lowercase; 'English' fails on case-sensitive file systems
english_stopwords = set(stopwords.words('english'))
for sentence in sentences:
    dirty_words = word_tokenize(sentence)
    # Remove stopwords and punctuation
    # (the stopword list is lowercase, so compare lowercased words)
    clean_words = [w for w in dirty_words
                   if w.lower() not in english_stopwords
                   and w not in string.punctuation]
    # Use stems (maybe not what you want?)
    stems = [stemmer.stem(x) for x in clean_words]
    for gram in ngrams(stems, N):
        ngram_count[gram] += 1

# Print the result
for ngram, count in sorted(ngram_count.items(),
                           key=operator.itemgetter(1), reverse=True):
    print("{}: {}".format(ngram, count))

And an example output of the top n-grams from our data (note that stopwords and word forms are kept here):

1-grams         2-grams            3-grams             4-grams
in              hotels in          bed and breakfast   bed and breakfast in
hotels          hotel in           places to stay      places to stay in
hotel           bed and            b and b             list of hotels in
and             and breakfast      cheap hotels in     b and b in
near            hotels near        to stay in
accommodation   cheap hotels
bed             accommodation in

As you can see, the search terms are now really aggregated and we can start to make sense of the share of certain n-grams in the overall traffic. For a language with 500k+ search terms, this process would generate fewer than 100k n-grams, of which usually the top few hundred are potentially interesting. This removes the long-tail words (mainly related to specific locations) and highlights the most common words people use to search. A potential learning in this example is to use the short spelling “b and b” as a keyword, in case we had missed it (our marketeers would never forget such a thing). A funny example is “tea”: it turned out people were searching for places to have tea at hotels in England, of all places.
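
To make "share of the overall traffic" concrete, here is a rough sketch that weights each n-gram by clicks. The search terms and click numbers are invented for illustration, `ngram_click_share` is a hypothetical helper, and counting each n-gram once per search term is an assumption, not necessarily how our real report works.

```python
from collections import Counter, defaultdict

# Hypothetical (search term, clicks) pairs; real input would come from the
# search term report.
search_term_clicks = [
    ("bed and breakfast in london", 3),
    ("b and b in york", 2),
    ("cheap hotels in london", 4),
    ("hotels near heathrow", 1),
]


def ngram_click_share(term_clicks, max_n=4):
    """Return {n: [(ngram, share_of_clicks), ...]}, highest share first."""
    clicks_per_ngram = defaultdict(Counter)
    total_clicks = 0
    for term, clicks in term_clicks:
        total_clicks += clicks
        words = term.lower().split()
        for n in range(1, max_n + 1):
            # A set, so a term counts once per distinct n-gram
            grams = {" ".join(words[i:i + n])
                     for i in range(len(words) - n + 1)}
            for gram in grams:
                clicks_per_ngram[n][gram] += clicks
    if total_clicks == 0:
        return {}
    return {
        n: [(gram, c / total_clicks) for gram, c in counts.most_common()]
        for n, counts in clicks_per_ngram.items()
    }


shares = ngram_click_share(search_term_clicks)
for gram, share in shares[1][:3]:
    print("{}: {:.0%}".format(gram, share))
```

Because a search term contributes to several n-grams at once, the shares within one level do not sum to 100%; each share reads as "the fraction of clicks coming from search terms that contain this n-gram".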

This approach can be simplified by using “bag-of-words”, so that “london hotel” and “hotel london” are grouped together. There is a reason other than laziness that I haven’t done this, though: there are cases where the order of the words makes a difference. The two orderings can come from users searching for different things, or they can face different levels of competition in the ad auctions.
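
For completeness, a bag-of-words grouping could look roughly like the sketch below. The counts are made up and `bag_of_words_key` is a hypothetical helper; the idea is simply to use the sorted words of an n-gram as an order-insensitive key.

```python
from collections import Counter


def bag_of_words_key(ngram):
    """Order-insensitive key: the n-gram's words, sorted."""
    return tuple(sorted(ngram.lower().split()))


# Hypothetical n-gram counts, for illustration only
ngram_counts = {"london hotel": 5, "hotel london": 3, "cheap hotel": 2}

grouped = Counter()
for ngram, count in ngram_counts.items():
    grouped[bag_of_words_key(ngram)] += count

print(grouped.most_common())
```

Keeping the ordered counts alongside the grouped ones makes it possible to check afterwards whether word order actually matters for a given pair.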

So this is one of the first steps in analysing search terms at scale. After this comes using dictionaries and more advanced natural language processing methods. Do you have any other ideas on how to make sense of, and draw conclusions from, search terms at scale?