Natural language human-friendly search

Rafael Haro
Search Architect
August 31, 2021
8 Min read or listen

Rafa has been a Computer Engineer since 2006. Ever since graduating from university, he has pretty much dedicated himself to Software Development. Currently, he works as a Search Architect developing solutions to help organisations integrate, access and share content subject to licensing.

He has worked on developing Semantic Search Engines, Intelligent Dialogue Systems, Recommender Systems, Entity Recognition Engines in the medical field, Automatic Document Processing Systems, Chatbots and Systems Based on Knowledge Graphs, among others.

But his biggest challenge was working for an international company for the first time. Or at least it was a time when he felt he was on shakier ground because of having to deal with colleagues, projects, clients and presentations in English. However, it turned out to be the best thing that could have happened to him to take a step forward in his professional development.

Rafa believes motivation is the key to growing as a developer. He is driven by the desire to develop cool projects and work with exciting tools, and by the sense of helping to build a product or project that makes a difference every day.

How is your day-to-day routine today different to that of ten years ago?

Ten years ago, I was a mid-level developer in a company working on Intelligent Dialogue Systems. At the same time, I was studying for a Master's in Language Technologies and Data Mining, while at night, I tried to consume as much English content as I could: series, books, and so on.

Today I work as a Search Architect in a major Boston-based company. My work focuses on defining common solutions to all the needs related to the search for content within the company's different software products. I am also part of a team working on emerging technologies that we can apply to our products or to new product prototypes.

How do you think natural language search has evolved? Could we say that search is becoming increasingly human?

As a concept, search is extensive and encompasses a variety of different use cases. So applying Natural Language techniques to improve search depends a great deal on the context of application.

In my opinion, an initial distinction can be made between Enterprise Search and Web Search. By Enterprise Search, we mean the practice of identifying and making specific content accessible within an organisation so that it can be indexed and made searchable for a particular type of user, both within the organisation and outside it. In this case, the search implementation depends entirely on the characteristics of the content, the business model, and the use cases. Generally speaking, it is easier to apply Natural Language techniques in an Enterprise setting because the content is more specialised, making it easier to find and generate resources such as dictionaries, domain vocabularies and language models.

Web Search is another world entirely. By Web Search, we mean search using everyday general search engines like Google and Bing. The recognised search patterns used in Web Search are different. For example, many queries made in Google are navigational. In other words, they include a few keywords meant to find a particular web page, which the user then navigates to in order to do something specific on that website. In the early years, Google's PageRank algorithm performed this kind of search brilliantly. However, rather than using phrases closer to natural language, people got used to "googling". So companies tried to position themselves at the top of the search rankings by getting particular keywords included in the text of links pointing to them from other top-rated websites (backlinks). That was the time of Google Bombing.
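To make the backlink effect concrete, here is a minimal sketch of the textbook PageRank power iteration on a toy link graph. It is a simplification, not Google's production ranking: it ignores link text entirely, and the page names, the `pagerank` function and the "link farm" are hypothetical and invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "target" has no incoming links at all.
web = {
    "news": ["shop", "blog"],
    "blog": ["news"],
    "shop": ["news"],
    "target": [],
}
before = pagerank(web)

# The same web after five hypothetical "farm" pages all link to "target".
bombed = dict(web)
for i in range(5):
    bombed[f"farm{i}"] = ["target"]
after = pagerank(bombed)

print(f"target before: {before['target']:.3f}, after: {after['target']:.3f}")
```

In this toy graph the score of `target` rises once the farm pages link to it, even though its content has not changed, which is the weakness that link-based manipulation exploited.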

In recent years, Google has made changes to PageRank, its ranking algorithm, with keywords losing importance and positioning increasingly dependent on the quality and originality of sites' content. Google's ranking system has undoubtedly evolved alongside the improvements in its technical capacity to better "understand" both user queries and page content. Previously, everything was based on token or word frequency, but today Google is able to represent content at a semantic level. Naturally, that makes search more human. As in many other things, Google was among the first to use large language models trained using deep neural networks. Today, these language models and the concept of "transfer learning" are the foundation for any Natural Language Processing task. In the Web Search field, language models have allowed the search engine to understand the user's intention better and use that to decide which heuristics to apply.

How does natural language search work?

Natural language search is based on "queries" (search requests) whose complex linguistic constructions go beyond a set of keywords. For example, to find pages explaining how to change a laptop battery, we used to search something like "change laptop battery", but today we can ask the natural question: "how do I change a laptop battery?"

As mentioned before, current technology allows Google to detect the search intention and narrow the search results to pages explaining how to change the battery. Years ago, multiple links would have snuck in that may have been about laptops or batteries, but which didn't answer the question.
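The older, frequency-based behaviour can be sketched with a tiny bag-of-words scorer. This is an illustrative assumption about how a purely token-based ranker behaves, not a reconstruction of any real search engine; the three page texts and the helper names `tokens` and `cosine` are made up for the example.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(query, document):
    """Cosine similarity between the raw token-count vectors of two texts."""
    q, d = Counter(tokens(query)), Counter(tokens(document))
    dot = sum(q[t] * d[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


# Hypothetical pages: only "howto" actually answers the question.
pages = {
    "howto": "Step by step guide: how to change the battery of your laptop",
    "review": "Laptop review: this laptop has a great battery and a great screen",
    "store": "Buy laptop battery packs, laptop chargers and laptop bags",
}

query = "change laptop battery"
ranked = sorted(pages, key=lambda name: cosine(query, pages[name]), reverse=True)
for name in ranked:
    print(f"{name}: {cosine(query, pages[name]):.3f}")
```

With these texts the store page scores highest simply because it repeats "laptop" and "battery" most often, while the page that explains the procedure comes second. A representation that captures intent rather than word counts is what lets a search engine prefer the how-to page.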

Natural Language Processing techniques are applied to the search to make it more natural. These techniques seek to represent both the indexed content and the searches users make in a way that an algorithm can interpret. Nowadays, text representation models are mainly based on language models. In 2018, Google released BERT, a language model pre-trained on a giant corpus of text from the web using Deep Learning techniques. BERT can encode linguistic characteristics about a specific language and better understand the different contexts in which words are used by their relationship to other words in the same sentence. This is especially important for understanding a user's search intention and for narrowing down the type of responses.

What advice would you give about setting up a natural language search-friendly site?

As I mentioned before, Google's current ranking system gives particular value to high-quality, original, specific content. There is a clear technical explanation for this. As Google is now better able to "understand" the content of our site, SEO must now change from attracting Google to our site to attracting our target users. The more specific we can be in our content about what our site offers, the more likely Google will be able to list us as a relevant result when a potential client searches for the information or products we can offer them.

We have to know what kinds of questions a user may ask so that we can turn them into content.

What metrics do you use to test a natural language search model?

The metrics used to evaluate a search system remain the same. Using NLP is a way to improve the results we obtain on those metrics, but how we measure them doesn't change. Others can be applied, but the two metrics traditionally used to evaluate an information retrieval system are Precision and Recall. Precision measures how many of the returned results are genuinely relevant to the user's search. In contrast, Recall measures how many of the results we already knew to be relevant we have managed to return. These metrics do not usually take into account the order of the results. If we also want to include order as a variable, there are other metrics, such as Discounted Cumulative Gain.
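As a rough sketch of these measurements, the functions below compute Precision and Recall for one query, plus a normalised Discounted Cumulative Gain (NDCG) with binary relevance as one example of an order-aware metric. The document identifiers and the two result lists are hypothetical, and the code assumes each result list contains no duplicates.

```python
from math import log2


def precision(returned, relevant):
    """Fraction of the returned results that are relevant."""
    if not returned:
        return 0.0
    return sum(1 for doc in returned if doc in relevant) / len(returned)


def recall(returned, relevant):
    """Fraction of the known relevant results that were returned."""
    if not relevant:
        return 0.0
    return sum(1 for doc in returned if doc in relevant) / len(relevant)


def dcg(returned, relevant):
    """Binary-relevance DCG: a hit at position p contributes 1 / log2(p + 1)."""
    return sum(
        1.0 / log2(position + 1)
        for position, doc in enumerate(returned, start=1)
        if doc in relevant
    )


def ndcg(returned, relevant):
    """DCG divided by the DCG of an ideal ordering of the same length."""
    ideal_hits = min(len(relevant), len(returned))
    ideal = sum(1.0 / log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg(returned, relevant) / ideal if ideal else 0.0


# Hypothetical judged query: four documents are known to be relevant.
relevant = {"d1", "d2", "d3", "d4"}
run_a = ["d1", "d2", "x1", "x2", "d3"]  # relevant results near the top
run_b = ["x1", "x2", "d1", "d2", "d3"]  # same results, worse order

for name, run in (("run_a", run_a), ("run_b", run_b)):
    print(name, precision(run, relevant), recall(run, relevant), round(ndcg(run, relevant), 3))
```

Both runs have the same Precision (3 of 5 returned) and the same Recall (3 of 4 known relevant), so only the order-aware score separates them.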

The problem with these metrics is that we often need to define a sufficiently large set of test queries and the results that are relevant to them. This is usually expensive and may also involve manual labour. For that reason, other methods related to Relevance Feedback are often used to measure the relevance of results by monitoring user activity: for example, how many different results does the user visit? In what order? How long do they spend on them? Do they redefine their search or not?
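A minimal sketch of how such activity signals could be summarised is shown below. The session log format, the field names, the 30-second dwell threshold and the `feedback_summary` function are all assumptions made for illustration; a real system would define them according to its own tracking data.

```python
from statistics import mean

# Hypothetical search sessions: which result positions were clicked, how long
# the user stayed on each clicked result, and whether they rewrote the query.
sessions = [
    {"clicked_positions": [1], "dwell_seconds": [95], "reformulated": False},
    {"clicked_positions": [3, 5], "dwell_seconds": [8, 40], "reformulated": True},
    {"clicked_positions": [], "dwell_seconds": [], "reformulated": True},
    {"clicked_positions": [2], "dwell_seconds": [120], "reformulated": False},
]


def feedback_summary(sessions, min_dwell=30):
    """Aggregate implicit relevance signals over a list of search sessions."""
    if not sessions:
        raise ValueError("at least one session is required")
    total = len(sessions)
    clicked = sum(1 for s in sessions if s["clicked_positions"])
    # Reciprocal rank of the highest-ranked click; 0 when nothing was clicked.
    reciprocal_ranks = [
        1.0 / min(s["clicked_positions"]) if s["clicked_positions"] else 0.0
        for s in sessions
    ]
    # A session counts as "satisfied" if any click lasted at least min_dwell seconds.
    satisfied = sum(
        1 for s in sessions if any(d >= min_dwell for d in s["dwell_seconds"])
    )
    reformulated = sum(1 for s in sessions if s["reformulated"])
    return {
        "click_through_rate": clicked / total,
        "mean_reciprocal_rank": mean(reciprocal_ranks),
        "satisfied_click_rate": satisfied / total,
        "reformulation_rate": reformulated / total,
    }


summary = feedback_summary(sessions)
for metric, value in summary.items():
    print(f"{metric}: {value:.3f}")
```

None of these numbers requires manually judged queries, which is the appeal of the approach, although they are only indirect evidence of relevance.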

Google claims that 27% of the world's online population uses voice search on their mobile. So how do you think voice search will change SEO?

I have a particular opinion about the use of voice for search. Generally speaking, Speech Recognition is seen as a technology that was fully realised years ago – speech recognisers fail very infrequently. However, in my opinion, the adoption of voice as an interface has not had the impact expected. There are probably many different reasons for this, relating above all to usability.

I don't have a prediction of how voice search might change SEO or lead it to evolve. I would have a more definite opinion if I knew what kind of searches (more or less natural) users usually make when using their voice.

What is your tech stack dream team?

When I work on straight-up Enterprise Search, my "tool" is Elasticsearch, an incredible framework that has managed to turn search into an exercise in API integration.
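To illustrate the "API integration" point, the sketch below builds the JSON body of an Elasticsearch search request using the standard `bool`, `match` and `term` query clauses. It only constructs and prints the body; the index, the field names `body` and `language`, and the helper `build_search_body` are hypothetical, and sending the request to a cluster's `_search` endpoint is left out.

```python
import json


def build_search_body(user_query, language, size=10):
    """Build a Query DSL body: full-text match plus an exact-value filter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text clause: every analysed query term must match.
                "must": [
                    {"match": {"body": {"query": user_query, "operator": "and"}}}
                ],
                # Exact-value clause: restricts results without affecting scores.
                "filter": [{"term": {"language": language}}],
            }
        },
    }


body = build_search_body("how do I change a laptop battery?", language="en")
print(json.dumps(body, indent=2))
```

The whole search request is plain JSON, so any service that can issue an HTTP call can integrate it.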

When I work on tasks related to NLP, my stack is now entirely within the Python ecosystem: spaCy, Gensim, scikit-learn, Transformers and TensorFlow.

And finally, do you think Google's new MUM algorithm will transform the way we search? Is it the future of natural language search?

With MUM, Google is taking the use of language models another step forward. It claims that MUM is 1,000 times more powerful than BERT. The model has been developed to enable the search engine to solve more complex tasks that would generally take a user a whole session of several different searches. MUM is based on the new T5 model, a text-to-text model. For example, with BERT, Google could guess the user's intention (classify) or recognise the meaning of a sentence by the context. With T5, Google can carry out response generation tasks based on an input text, for example, a translation from one language to another, a summary of a text, or a direct answer to a question such as "What year was Michael Jordan born?" All of this uses the same base model.
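The text-to-text idea can be sketched without running any model: every task becomes a plain input string, and the model is expected to answer with another string. The translation and summarisation prefixes below follow the convention of the publicly released T5 checkpoints; the `question: ` prefix, the task keys and the function `to_text_to_text` are assumptions for illustration, and nothing here describes how MUM itself is built.

```python
# Task prefixes in the style of the public T5 release. The "question" entry
# is an illustrative assumption, not a documented MUM or T5 setting.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "question": "question: ",
}


def to_text_to_text(task, text):
    """Turn a (task, text) pair into the single string fed to the model."""
    if task not in TASK_PREFIXES:
        raise KeyError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + text


examples = [
    ("translate_en_de", "How do I change a laptop battery?"),
    ("summarize", "Google's ranking system has evolved alongside language models."),
    ("question", "What year was Michael Jordan born?"),
]

for task, text in examples:
    # One model, one input format: only the leading prefix changes per task.
    print(to_text_to_text(task, text))
```

Because the output is also text, one model and one decoding procedure can serve translation, summarisation and question answering alike.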

MUM will allow Google to understand the relationships between concepts and content better, and it may be able to give specific answers to very complex questions by combining different pieces of content. So yes, it sounds like a step forward.

