While Higher Logic Vanilla's (Vanilla) search functionality (called Elastic Search) allows many types of specific filtering, Full Text Search is the main way users search. In order to help you leverage Elastic Search to achieve the best search results, this article describes exactly how text matching works in Vanilla.
Before we begin…
Let's take a moment to understand common search jargon that will be referenced throughout this article:
- Term: This is a word or phrase supplied by the end user when performing a search.
- Keyword: This is the result of "normalizing" a term in a specific way (this is explained in more detail throughout this article).
- Stemming: This is a process used by search engines to treat similar words as equivalent based on the language being searched. For example, applying stemming to all of the following terms results in the same keyword:
visit
visiting
visited
visits
Locale-specific text matching
The primary type of text search occurs when you add a few terms into the search bar. This text-matching mechanism is specific to the current language that you're searching in.
📝 NOTE: The current language is determined by either the site default locale or the locale of the current subcommunity.
Text in both the search box and the search results is "normalized" in a locale-specific way. Here are a few examples of normalization in an English locale.
- Text is broken up into terms by splitting on whitespace and special characters. For example, the search term "How does full-text searching work in the Vanilla Community?" would be broken into the following keywords:
How
does
full
text
searching
work
in
the
Vanilla
Community
- All keywords are lowercased:
how
does
full
text
searching
work
in
the
vanilla
community
- Filler words (these are different based on the locale) are removed:
how
does
full
text
searching
work
vanilla
community
- Locale-aware stemming is applied:
how
doe
full
text
search
work
vanilla
commun
- Synonyms are applied.
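The normalization steps above can be sketched in a few lines of Python. This is only an illustration, not Vanilla's implementation: the `STOP_WORDS` set and the `STEMS` lookup table are hypothetical stand-ins for the locale-specific filler-word list and stemmer, and the synonym step is left out.

```python
import re

# Hypothetical stand-ins for the English filler-word list and stemmer.
STOP_WORDS = {"in", "the"}
STEMS = {"does": "doe", "searching": "search", "community": "commun"}


def normalize(text: str) -> list[str]:
    """Turn raw text into keywords: split, lowercase, drop filler words, stem."""
    # Split on whitespace and special characters, dropping empty pieces.
    terms = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    lowered = [t.lower() for t in terms]
    kept = [t for t in lowered if t not in STOP_WORDS]
    # Words missing from the toy stem table are kept unchanged.
    return [STEMS.get(t, t) for t in kept]


keywords = normalize("How does full-text searching work in the Vanilla Community?")
print(keywords)
```

With these toy tables, the printed list is the same as the final keyword list shown in the stemming step above.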
This same process occurs on both the search term and the search results in question to generate our search "keywords," and then a comparison is done. In Vanilla Search, only 75% of search keywords must match. For example, the term above would match a search result with only the sentence "searching in vanilla communities."
Keep the following in mind regarding this matching:
- Even though only 75% of keywords must match, a search result that matches more keywords, or repeats some of the keywords multiple times, will rank higher than a search result that does not. This primarily impacts the search results when sorting by the default sort "Relevance."
- Order of these search keywords has no impact on the search.
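As a rough sketch of the threshold rule, the check below treats both sides as already-normalized keyword lists and compares the fraction of query keywords found in the search result. The function name `is_match` and the keyword lists are hypothetical, and this article does not say how the engine rounds the 75% threshold, so the sketch simply compares the raw fraction.

```python
def is_match(query_keywords: list[str], result_keywords: list[str],
             min_ratio: float = 0.75) -> bool:
    """Return True when enough distinct query keywords appear in the result."""
    query = set(query_keywords)
    if not query:
        return False
    # Sets are used because keyword order has no impact on matching.
    hits = len(query & set(result_keywords))
    return hits / len(query) >= min_ratio


# 3 of the 4 query keywords are present, so this passes the threshold.
print(is_match(["vanilla", "search", "filter", "sort"],
               ["filter", "search", "vanilla", "help"]))
```

This only answers whether a result matches at all. Ranking among the matching results is a separate step.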
Exact text matching
If " characters are used around certain terms, this will change the keyword generation for these terms. It also applies different rules than the full text matching.
Usage
Other terms "Exact Terms"
Other terms + "a plus can be used as well"
"Match this" "and this" + "also this"
If multiple exact terms are passed, a search result must match all of them in order to be considered a match.
Terms only receive part of the first two steps of the full text normalization:
- Terms are split up on whitespace.
- Terms are then lowercased.
No locale-specific stemming or removal of "filler" words is applied.
When matching search results in an exact search, all keywords must match in the specific order but the last keyword only needs to be a partial match. In addition, a small amount of "slop" is allowed, meaning one keyword may be skipped in the matching search results.
Let's look at an example search term: "release 2022"
- This will match search results containing any of the following (matching keywords are bolded):
  - Release 2022.023 will be a great release
    - Since "2022" is the last term, it is used as a prefix and will match a keyword like "2022.024" even though the keywords are different.
  - Deploy the release in 2022 to the following sites.
    - Note the "slop" allows for an additional keyword in search results only.
  - Release 2022.015
    - Note that whitespace and new lines are ignored.
- This will not match the following search results (the incorrect/not matching parts are bolded):
  - Releasing 2022
    - Keywords must match exactly. No stemming is applied.
  - can we send out the Release? 2022 seems like the time for it.
    - Special characters are not removed for exact text matching.
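The exact-match rules above (whitespace split, lowercase, in-order matching, prefix match on the last keyword, and one skipped keyword of "slop") can be approximated with the sketch below. The function `exact_match` is hypothetical and is a simplification written for this example, not the engine's actual phrase matching.

```python
def exact_match(phrase: str, text: str, slop: int = 1) -> bool:
    """Approximate exact matching: ordered keywords, last one as a prefix."""
    query = phrase.lower().split()
    doc = text.lower().split()
    if not query:
        return False

    def token_ok(qi: int, token: str) -> bool:
        # Only the last query keyword is allowed to be a partial (prefix) match.
        if qi == len(query) - 1:
            return token.startswith(query[qi])
        return token == query[qi]

    def search(qi: int, di: int, skips_left: int) -> bool:
        if qi == len(query):
            return True
        # Try the next document keyword, or skip up to `skips_left` keywords.
        for skip in range(skips_left + 1):
            pos = di + skip
            if (pos < len(doc) and token_ok(qi, doc[pos])
                    and search(qi + 1, pos + 1, skips_left - skip)):
                return True
        return False

    return any(token_ok(0, doc[i]) and search(1, i + 1, slop)
               for i in range(len(doc)))


print(exact_match("release 2022", "Deploy the release in 2022 to the following sites."))
print(exact_match("release 2022", "Releasing 2022"))
```

Because the text is split on whitespace only, "Release?" keeps its question mark and does not equal "release", which is why the last non-matching example above fails.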
Exact text exclusion
If a - character is placed directly before terms wrapped in " characters, this will change the keyword generation for these terms. It also applies different rules than the full text matching.
Usage
Other terms -"don't match this"
Other terms -"not this" -"also not this"
Exact text exclusion works the same way as exact text matching, but instead of filtering for search results with those keywords, it excludes search results with those keywords.
📝 NOTE: If multiple exclusions are applied, it will exclude any search result that matches any of the exclusions.
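To make the query syntax concrete, here is a small sketch that separates a raw search string into plain terms, exact terms, and exact exclusions. The `parse_query` helper and its return shape are hypothetical and only illustrate the usage patterns shown in this article. They are not Vanilla's actual parser.

```python
import re

# A quoted phrase, optionally preceded by "-" to mark it as an exclusion.
QUOTED = re.compile(r'(-?)"([^"]+)"')


def parse_query(raw: str) -> dict[str, list[str]]:
    """Split a raw search string into plain terms, exact terms, and exclusions."""
    required: list[str] = []
    excluded: list[str] = []
    for minus, phrase in QUOTED.findall(raw):
        (excluded if minus else required).append(phrase)
    # Whatever is left outside quotes is ordinary full-text input.
    rest = QUOTED.sub(" ", raw)
    terms = [t for t in rest.split() if t != "+"]
    return {"terms": terms, "required": required, "excluded": excluded}


print(parse_query('Other terms -"not this" -"also not this"'))
```

A search result would then need to match every phrase in `required` and none of the phrases in `excluded`, while `terms` go through the regular full-text matching.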
Synonyms
⚠️ IMPORTANT: Synonyms require Vanilla Release 2024.006 or later.
Many communities use special terms that can be difficult to search for in the default setup. These tend to be terms that are not known words in the language being searched. Common examples include:
| Type of Term | Examples |
|---|---|
| Abbreviation / Acronym | API, FAQ, docs, UI |
| Product Names / Branding Terms | SuccessTeam, FreeSync |
| Jargon / Technical Names | accessToken, userID |
These can be configured in your search settings page under Settings > Technical > Search Settings > Manage Synonyms.
When you configure a set of synonyms, you have to provide both a target term and one or more synonyms.
Each synonym is replaced behind the scenes in both the search term and the search results. Because of this, it is preferred to use full words and sentences as the target term and abbreviations as the synonyms rather than vice versa. This can allow us to match additional search results when querying.
Synonyms example
Let's assume I have three documents:
- Document 1 contains the term "Search Engine Optimization"
- Document 2 contains the term "Search Optimization"
- Document 3 contains the term "SEO"
"Search Engine Optimization" As the Target Term (preferred)
- If I search "SEO" I will match documents 1 and 3.
- If I search "Search Engine Optimization" I will match documents 1 and 3.
- If I search "Optimization" I will match all 3 documents.
"SEO" As the Target Term (not preferred)
- If I search "SEO" I will match documents 1 and 3.
- If I search "Search Engine Optimization" I will match documents 1 and 3.
- If I search "Optimization" I will match documents 1 and 2. Notably document 3 which contains the term "SEO" will not be matched because it was never replaced. Additionally, "Optimization" was not a full match for my synonym "Search Engine Optimization."
You can read more about Synonym Sets, including how to set them up, in the article below:
How Best Match Sorting Works
Vanilla's "Best Match" sorting is based on the BM25 algorithm, with some additional filtering and boosting of scores:
- First, searchable documents are filtered to the current user's permissions. This ensures proper access control and does not affect the ranking of documents.
- The user's query then goes through the processing detailed in earlier sections of this document.
- Scoring is based on BM25 from this point. There are 3 factors that affect this:
  - Frequency of the terms used.
  - How many terms match.
  - The percentage of the document that was matched. Starting in Vanilla 2024.016, this has significantly less effect on ranking. In communities, very short comments are typically less relevant, so we've tuned this parameter down.
- Additional boosting is applied:
  - Text matching in titles is 2x as relevant as text matching in the contents of a document.
  - Matching Knowledge Base content is marked as 10x more relevant than community content.
  - Recent community content gets a boost to prioritize more recent content. Content created within the last 90 days is boosted between 1.5x and 3x. Content between 90 and 365 days old is boosted between 1x and 1.5x.
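To give a feel for how these boosts could combine, here is a minimal sketch. Several things in it are assumptions made for this example and are not confirmed by this article: that the recency boost decays linearly within each age range, that the boosts are multiplied together, and that the recency boost is skipped for Knowledge Base content. The function names and the input scores are hypothetical.

```python
def recency_boost(age_days: float) -> float:
    """Boost for community content by age (linear decay is an assumption)."""
    if age_days <= 90:
        # 3x for brand-new content, falling to 1.5x at 90 days.
        return 3.0 - 1.5 * (max(age_days, 0.0) / 90.0)
    if age_days <= 365:
        # 1.5x at 90 days, falling to 1x at 365 days.
        return 1.5 - 0.5 * ((age_days - 90.0) / 275.0)
    return 1.0


def boosted_score(title_score: float, body_score: float,
                  is_knowledge_base: bool, age_days: float) -> float:
    """Combine hypothetical per-field text scores with the documented boosts."""
    # Title matches count twice as much as matches in the contents.
    text_score = 2.0 * title_score + body_score
    if is_knowledge_base:
        return 10.0 * text_score
    return recency_boost(age_days) * text_score


print(recency_boost(45))                      # halfway through the first range
print(boosted_score(1.0, 1.0, False, 400.0))  # older community content
```

The multipliers and timeframes are the defaults described above and can differ per site.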
The boost multipliers and timeframes can be configured upon request for your site. If you feel these values don't align with your type of community, please contact Vanilla support with details of your use case and we can adjust the boost parameters as necessary.