Unsupported languages: Arabic, Tigrinya or Chinese characters in SiteSeeker search results using EPiServer

The latest version of the SiteSeeker search engine supports indexing and searching in a variety of languages; however, languages that do not use the Latin alphabet, for instance Arabic, Hebrew, Japanese and Chinese, are not supported. If you index such pages anyway, each hit shows a string of nonsense characters instead of the expected text.

One way of solving this (an idea from the team behind Vardguiden.se) is to detect whether a hit is in an unsupported language, and then fetch the title and description texts from the relevant EPiServer page rather than using the ones supplied by the search engine. The most obvious way of detecting this, however, does not work all the way due to a design choice in the search engine. Euroling support told me that they would discuss the issue in an upcoming meeting; in the meantime, there is a small workaround which won’t make you feel too dirty.

How SiteSeeker determines languages

To my understanding, the SiteSeeker search engine employs a rather powerful language detection method, which lets it categorize your indexed pages by language. It is, however, possible to help SiteSeeker out by supplying a meta tag containing the proper ISO code for the language in question.

<meta name="language" content="en" />

After adding the metadata to your pages you will also have to update a server setting in the SiteSeeker Admin interface, telling the search engine to use your language tag while indexing. This setting may be found under Admin Start –> Servers –> {your server} –> Advanced settings: Meta information, but changing it is not necessary if you are using the mentioned workaround. The setting Language:

Would need to be changed to:

After activating the change and reindexing the site, you may find that SiteSeeker now determines the pages’ languages mostly from the new meta attribute rather than by language detection.

As you can see in the image above, SiteSeeker will only recognize and consider languages using the Latin alphabet; while Turkish (ISO language code: tr) and Albanian (ISO language code: sq) are marked as unknown (Swedish: okänd), they are still added. As displayed in the beginning of this article, the Arabic and Tigrinya pages are indexed. They will not, however, appear in this list.

This issue becomes even clearer when you loop through the search hits in your code. You might expect the Language property of the SiteSeeker.Model.Hit object to contain the proper language for each hit. In reality, you only get the correct Language information for the supported languages; Arabic and Tigrinya fall back to your default language, Swedish in my case.

private Hit GetHit(SiteSeeker.Model.Hit hit, SearchResponse response, SearchRequest request)
{
  // For unsupported languages these hold the default language instead.
  var hitLanguageCode = hit.Language.Key;  // e.g. "en"
  var hitLanguageName = hit.Language.Name; // e.g. "english"
  // ...
}

The bottom line is that, this way, you will never know whether a hit is on a page in your default language or in an unsupported language.

How to support languages like Arabic, Hebrew or Japanese in your EPiServer SiteSeeker search results

The way to solve this issue is quite simple. All you really have to do is add an Additional meta attribute to the list of custom attributes for SiteSeeker to include in its indexing. As you can see in the image below, add the highlighted row string:language, activate the change and reindex your site. This does of course require that you have already added the language meta attribute to your pages.

This time, while looping through your SiteSeeker search result hits, you will find that you can extract the value of your custom meta attribute simply by asking for it from the meta attributes collection.

private Hit GetHit(SiteSeeker.Model.Hit hit, SearchResponse response, SearchRequest request)
{
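  // Read the language code from the custom "language" meta attribute instead of hit.Language.
  var hitLanguageCode = hit.MetaAttributes.GetStringValue("language");
  // ...
}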

So, now we have the proper language code for each search result hit, but how do we know if it is a supported language or not? The SiteSeeker SearchResponse object contains a collection of languages (response.Languages) which will come in quite handy here. This is a collection of all the supported languages that SiteSeeker has found pages for on your site; i.e. the same list of languages as in the image in the previous section.

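private static bool IsSupported(SiteSeeker.Model.Hit hit, IEnumerable<Language> supportedLanguages)
{
  // Hits without any meta attributes are treated as supported.
  if (!hit.MetaAttributes.Any())
    return true;

  var lang = hit.MetaAttributes.GetStringValue("language");

  // So are hits with a missing or empty language attribute.
  if (string.IsNullOrEmpty(lang))
    return true;

  // Prefix match, so that a regional code such as "en-GB" matches the key "en".
  // Ordinal comparison keeps the match independent of the server culture.
  return supportedLanguages
          .Any(sl => lang.StartsWith(sl.Key, StringComparison.OrdinalIgnoreCase));
}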

What to do with pages that lack a language attribute, or have an empty one, is of course up to you. For my current project we saw no fault in treating those edge cases as supported hits.
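To make the matching rule easier to try out in isolation, here is a minimal sketch that works on plain strings instead of SiteSeeker objects. LanguageStub and LanguageSupport are hypothetical stand-ins, not SiteSeeker types, and the sketch deliberately uses a stricter variant of the check above: it compares only the primary subtag of the code and ignores empty keys, on the assumption (not verified against SiteSeeker) that an entry for an unknown language could carry an empty key, which a plain prefix match would accept for every hit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a SiteSeeker language entry; only the key is modelled.
public sealed class LanguageStub
{
    public string Key { get; }
    public LanguageStub(string key) { Key = key; }
}

public static class LanguageSupport
{
    // metaLanguage is the value of the custom "language" meta attribute;
    // it may be null or empty when a page does not declare a language.
    public static bool IsSupported(string metaLanguage, IEnumerable<LanguageStub> supportedLanguages)
    {
        // Same policy as in the article: undeclared languages count as supported.
        if (string.IsNullOrWhiteSpace(metaLanguage))
            return true;

        // Reduce "en-GB" or "en_GB" to the primary subtag "en".
        var primary = metaLanguage.Trim().Split('-', '_')[0];

        // Exact, case-insensitive match on the primary subtag; empty keys never match.
        return supportedLanguages.Any(sl =>
            !string.IsNullOrEmpty(sl.Key) &&
            string.Equals(sl.Key, primary, StringComparison.OrdinalIgnoreCase));
    }
}
```

With a supported list containing "sv" and "en", LanguageSupport.IsSupported("en-GB", list) is true and LanguageSupport.IsSupported("ar", list) is false. Whether exact matching or the prefix match fits better depends on which codes your pages emit.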

How to get the proper search result hit title and description

How you get the title and description texts for your unsupported search hits may vary from website to website. The title is likely the easiest one, as this will probably be some general heading property on your page or the page name itself. The Heading extension method in the snippet below does just this in my current project.

private Hit GetHit(SiteSeeker.Model.Hit hit, SearchResponse response, SearchRequest request)
{
  // Start with the texts supplied by SiteSeeker.
  var title = hit.Title;
  var description = hit.Description;
  if (!IsSupported(hit, response.Languages))
  {
    // Unsupported language: take the texts from the EPiServer page, if it can be resolved.
    var page = TargetPageFor(hit.SourceLink);
    if (page != null)
    {
      title = page.Heading();
      description = DescriptionFrom(page);
    }
  }
  // ...
}

The DescriptionFrom method, on the other hand, will probably come out a bit more complex. As you will get a general PageData object from the TargetPageFor method (snippet below), you will have to decide what will do for a description text in your case. Things worth considering when designing this method include:

  • Which page properties to get description text from depending on page type.
  • Length of description text; what happens to a sentence in a specific language if you cut it randomly after n characters?
  • What to do with custom HTML tags added by editors to the fields you are using; images, bullet lists, etc.
  • Possible scripting in description fields.
  • Dynamic content.
  • Right-to-left (RTL) languages in search hits.
  • Right-to-left languages with interspersed Left-to-right (LTR) text.
  • Adapted text sizes for certain languages; Arabic for instance is difficult to read if the characters are too small.

private PageData TargetPageFor(string externalUrl)
{
  var result = _friendlyUrlProvider.ConvertToInternal(externalUrl);
  return result.Success ? result.TargetPage : null;
}

The _friendlyUrlProvider.ConvertToInternal method uses EPiServer’s Global.UrlRewriteProvider to get the page’s PageReference and in turn the DataFactory.Instance.GetPage method for retrieving the proper target PageData object.
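Returning to DescriptionFrom: a few of the considerations listed earlier (editor HTML, scripting, and where to cut the text) can be sketched on a plain HTML string. This is only a rough illustration, not the method used in the project. DescriptionText.FromHtml is a hypothetical helper, the input is assumed to be the raw value of whichever page property you pick, and regular expressions are a crude substitute for a real HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class DescriptionText
{
    // Hypothetical helper: turns editor HTML into a plain-text teaser
    // whose text part is at most maxLength characters (plus "..." when cut).
    public static string FromHtml(string html, int maxLength)
    {
        if (string.IsNullOrWhiteSpace(html) || maxLength <= 0)
            return string.Empty;

        // Drop script and style elements together with their content.
        var text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace the remaining tags with a space so adjacent words do not merge.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode entities such as &amp; and collapse runs of whitespace.
        text = WebUtility.HtmlDecode(text);
        text = Regex.Replace(text, @"\s+", " ").Trim();

        if (text.Length <= maxLength)
            return text;

        // Cut at the last space at or before the limit rather than mid-word.
        var cut = text.LastIndexOf(' ', maxLength);
        if (cut <= 0)
            cut = maxLength; // No space found: fall back to a hard cut.

        return text.Substring(0, cut).TrimEnd() + "...";
    }
}
```

For example, DescriptionText.FromHtml("one two three four", 9) gives "one two...". Note that the result is decoded plain text, so it needs to be HTML-encoded again when rendered, and that cutting on spaces is not a safe rule for every language: Chinese and Japanese do not separate words with spaces, and the hard-cut fallback can split a character that is made up of several code units.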

What if SiteSeeker starts supporting our unsupported languages?

If SiteSeeker suddenly started supporting one of our unsupported languages, its language code would start appearing in the supported languages list. Our IsSupported method would then automatically make such hits skip our custom way of retrieving the title and description values and use the ones supplied by SiteSeeker instead.
