
Understanding the Anatomy of Search: The Data Pipeline

Sharing our thought process behind creating a data pipeline when building out and deploying a successful site search experience


In a previous post, we discussed some of the reasons your organization should invest in search and discovery. In this post, we're going to walk through one of the first steps in a successful site search deployment: pipeline creation. For the sake of this post, we'll assume that you've already developed some form of personas, user journeys, or use cases specifically for search.

This post leans heavily towards site search with Algolia [1]; however, the general concepts discussed here apply to other use cases and search systems such as Elasticsearch, Solr, and Swiftype.

Before we dive in, let’s cover a few definitions that will be important as we move through the topics in this series:

  • Search: Search is what we call the action of knowing what you want and querying until you ultimately find it. [2]

  • Discovery: Discovery is what happens when the universe (or an organization, or a friend) helps you encounter something you didn't even know you were looking for. [2]

  • Indexes / Indices: You’ll often find these terms used interchangeably. The former is the American spelling, and the latter is the British spelling. Indexes are collections of records and the source for all of your searches.

  • Object / Record: In search, you’ll often find these words used interchangeably. A record represents a single item within the search index. This is the underlying data counterpart to the results you see in the user interface.

  • Attribute: Attributes describe the record; they are key/value pairs. Sometimes you’ll see the term field, property, or element as well. For the context of search, these terms are mostly interchangeable. Examples include title, name, description, author, etc.

  • Federated: Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request, which is distributed to the search engines, databases, or other query engines participating in the federation. [3]

  • Content Model: A content model is a representation of the different types of content, the content’s attributes, and the content’s inter-relationships.

Selecting Attributes for Searching: Relevance and Ranking

Now that we have a common language for the terms of search, let's dive into the process. When we talk about the data pipeline, we're talking about the process we use to collect, process, and push data into your search indexes.

If you already have a Content Model documenting the content that is available on your website, then you're a step ahead. Your CMS may have dozens, if not hundreds, of fields split across multiple content types such as blogs, landing pages, news, and products. Not all of that content is of equal value when determining what is most relevant to the user's search.

Our goal here is to document all of the available content, working to distill and normalize the data down to only the indexes and attributes we need for search.

How do we decide how to organize the data? Let’s pause here and talk a bit about two concepts, relevance and ranking. Algolia defines relevance as “an intelligent matching that takes into account typo tolerance, partial word matching, spatial distance between matching words, the number of attributes that match, synonyms and query rules, natural language characteristics like stop words and plurals, geolocation, and many other intuitive aspects of what people would expect from search”. This is where the search engine looks at your universe of content and returns everything that appears to be relevant to the user’s query.

Ranking, on the other hand, is how the search engine sets the order of the data that is presented to the user. We know from years of working with major search engines like Google that the priority of results is critical to meaningful search experiences and conversions.

With this in mind, we look at the universe of data available from our CMS and discard any attributes that do not contribute to determining relevance or ranking. Attributes such as your page's title or body content are usually obvious inclusions. Others, such as entry timestamps, images, author information, and related content data, might not be useful in your search use cases and can therefore be excluded. As you evaluate each piece of data, ask yourself a few basic questions:

  • Should this attribute be searchable?

  • Should this attribute be returned with the result set?

  • What is the importance of this attribute compared to the other searchable attributes?

  • Which indexes should this attribute be associated with?

Your resulting list of attributes should not, in most cases, include everything that was available to you from your CMS. Be selective about what goes in the record, gathering only information that's useful for building the search experience and solving the specific use cases you are targeting.
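To make this concrete, here is a minimal sketch of what a distilled record might look like for a blog use case. The attribute names are illustrative assumptions, not a prescribed schema:

    // A distilled search record for a hypothetical blog use case.
    // Only attributes that answered "yes" to the questions above survive.
    interface SearchRecord {
      objectID: string;    // unique identifier (covered below)
      title: string;       // searchable, returned with results
      body: string;        // searchable, returned with results
      category: string;    // searchable, useful for filtering
      publishedAt: number; // not searched, but useful for ranking
    }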

In addition to the fields available from the CMS, there may be value in adding metadata such as the total number of page views, comments, or purchases to your index. These additional fields can serve as factors for ranking your results.
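With Algolia, for example, this kind of metadata can be wired into the engine's tie-breaking via the customRanking setting. A minimal sketch, assuming hypothetical pageViews and commentCount attributes and the v4 JavaScript client:

    import algoliasearch from 'algoliasearch';

    // Placeholder credentials and index name for illustration only.
    const client = algoliasearch('YOUR_APP_ID', 'YOUR_ADMIN_API_KEY');
    const index = client.initIndex('pages');

    // Break ranking ties using engagement metadata; the attribute
    // names here (pageViews, commentCount) are assumptions.
    async function configureRanking(): Promise<void> {
      await index.setSettings({
        customRanking: ['desc(pageViews)', 'desc(commentCount)'],
      });
    }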

Before moving to the next step, we'll also want to determine a method for creating unique Object IDs for the records in our indexes. These IDs serve as our reference points for deletions and updates. In many cases you can use the unique identifier from your CMS as the Object ID; however, that only works when you are not combining content from multiple systems into a single index, something that is often beneficial.
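One simple approach, sketched below, is to namespace each ID with its source system so records from multiple systems can safely coexist in one index. The function and parameter names are ours, for illustration:

    // Prefix the CMS identifier with its source system so IDs stay
    // unique when multiple systems feed a single index.
    function buildObjectID(source: string, entryId: string): string {
      return `${source}_${entryId}`; // e.g. "cms_1234" vs "shop_1234"
    }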

Selecting Attributes for Searching: Normalization

Once we've determined the data we want to send to the search engine, we should work to normalize the various attributes down to a concise set for the context of our search use cases. Often, the nature of the CMSs we're working with, or of our projects, results in multiple fields with nearly identical meanings. Regardless of whether this should happen, it is common to find fields named blog_body, news_body, body, and so forth when reviewing our field list. When possible, we should bring consistency to our attribute names, taking all of these occurrences of {something}_body and simply remapping them to body for indexing. This ultimately results in fewer attributes in the search index, improving our ability to accurately determine ranking and relevance while also reducing the complexity of our frontend display logic.
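Here is a sketch of what that remapping might look like, using the {something}_body example above; the field list is illustrative:

    // Map CMS-specific field names down to one canonical attribute.
    const BODY_FIELDS = ['blog_body', 'news_body', 'body'];

    function normalizeBody(entry: Record<string, unknown>): string | undefined {
      for (const field of BODY_FIELDS) {
        const value = entry[field];
        if (typeof value === 'string' && value.length > 0) {
          return value; // first non-empty body field wins
        }
      }
      return undefined; // no body-like field present on this entry
    }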

Once we've normalized the available field names, we should look for opportunities to combine fields that do not provide value as unique attributes. As an example, many modern CMSs provide page builders where a set of fields can be repeated and reordered in unique combinations to create a single page body. In these instances the individual fields rarely matter on their own, so we document that they should be combined into a single attribute in our index.
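A minimal sketch of that flattening step, assuming a hypothetical page-builder block shape with a text field:

    // A repeatable page-builder block; the shape is an assumption.
    interface PageBlock {
      type: string; // e.g. "text", "quote", "callout"
      text: string;
    }

    // Concatenate the blocks into the single body attribute we index.
    function combineBlocks(blocks: PageBlock[]): string {
      return blocks.map((block) => block.text).join('\n\n');
    }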

When working with Algolia, it is important to consider the maximum size of objects in the index, as there are hard limits. Additionally, long objects can increase latency and reduce relevance. For more on long documents in Algolia, see their FAQs. At the end of this step we should have a clear mapping of how each field available to us in the CMS maps to a specific attribute in the search indexes.
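One way to guard against oversized records is a simple size check before indexing. The limit below is a placeholder; check the actual record size limit for your Algolia plan:

    // Placeholder limit; consult your plan's actual record size cap.
    const MAX_RECORD_BYTES = 10_000;

    function fitsSizeLimit(record: object): boolean {
      const bytes = new TextEncoder().encode(JSON.stringify(record)).length;
      return bytes <= MAX_RECORD_BYTES;
    }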

Importing Your Data

Once we've mapped and normalized our field data, we need to think about how we're going to get that data into our search engine. In general, we need the ability to run a bulk import that brings all of the data into the search engine, plus some method for bringing data in incrementally.

When determining the frequency and architecture of your data import, try to find a balance between making fresh information available to your users as quickly as possible and minimizing the number of operations, since operation counts can affect both your pricing and index performance. The specific architecture and frequency of your import process will be determined by your particular use cases and technology stack. For indexing operations, consider batching whenever possible: with Algolia, we can send multiple records in a single API call, a process referred to as batching or batch sending. Batching has many benefits, including reducing network calls and optimizing indexing performance.
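A minimal sketch of a batched bulk import with Algolia's v4 JavaScript client; the credentials and index name are placeholders, and records is the normalized array built in the earlier steps:

    import algoliasearch from 'algoliasearch';

    const client = algoliasearch('YOUR_APP_ID', 'YOUR_ADMIN_API_KEY');
    const index = client.initIndex('pages');

    // saveObjects sends the records in chunked batches under the hood,
    // reducing network calls versus one API call per record.
    async function bulkImport(records: Array<{ objectID: string }>): Promise<void> {
      await index.saveObjects(records);
    }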

Next Steps

I hope this post provides you with a general sense of the thought process behind creating a data pipeline, and perhaps, insight into what to expect from this step of the process when working with the team here at Foster Made. Stay tuned for more posts about search and discovery, from implementation to impact.

References:
1. Algolia: https://www.algolia.com/
2. Search vs Discovery: https://seths.blog/2014/04/search-vs-discovery/
3. Federated Search: https://en.wikipedia.org/wiki/Federated_search

