
Sitecore Site Search Crawler Approach

Sometimes a site search needs to return results based on the HTML content of a page. In some cases this can be handled by querying each relevant field of a Sitecore page item. In others, a custom search field can be implemented that combines the needed item fields with the datasources that belong to the page's renderings. Reading those datasources and list page properties can be difficult, though, and can complicate the development of this custom field.

The custom index field is the right approach to follow, although we can populate it a little differently. The approach I am proposing in this post is the crawler approach: populate the custom field with the content that a visitor actually sees when requesting the page. A visitor who runs a site search query expects results that contain what they searched for, and this approach meets that requirement directly.

The technical recipe for handling this approach is:

Declare a new custom computed index field

Create the code that will compute the content of this field

Inside the computed index field, determine if the item is a Sitecore page item

Generate the URL for this page

Get the HTML response

Crawl the HTML

Save the important HTML content into the custom field

Note: We don’t need content that appears on every single page, e.g. the main menu, footer, copyright notice, etc.

Now let's go deeper into each step of the previous list.

Declare a new custom index field

To create a new custom index field you need to do two things: first, declare the field and its type; second, specify which class will compute its value.

<?xml version="1.0" encoding="UTF-8"?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <indexConfigurations>
        <defaultLuceneIndexConfiguration type="Sitecore.ContentSearch.LuceneProvider.LuceneIndexConfiguration, Sitecore.ContentSearch.LuceneProvider">
          <fieldMap type="Sitecore.ContentSearch.FieldMap, Sitecore.ContentSearch">
            <fieldNames hint="raw:AddFieldByFieldName">
              <field fieldName="_sitesearchfield" storageType="yes" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider" />
            </fieldNames>
          </fieldMap>
          <documentOptions type="Sitecore.ContentSearch.LuceneProvider.LuceneDocumentBuilderOptions, Sitecore.ContentSearch.LuceneProvider">
            <fields hint="raw:AddComputedIndexField">
              <field fieldName="_sitesearchfield" returnType="string">DEMO._Classes.IndexComputedFields.SiteSearchField, DEMO</field>
            </fields>
          </documentOptions>
        </defaultLuceneIndexConfiguration>
      </indexConfigurations>
    </contentSearch>
  </sitecore>
</configuration>

The configuration above declares a new custom field named _sitesearchfield, which is a TOKENIZED string. The computed field is also associated with the class DEMO._Classes.IndexComputedFields.SiteSearchField, which computes the content of this field.

Create the code that will compute the content of this field

The class below generates the content of the custom field. First it checks whether the item being indexed is of type Page. Then it requests the page using the generated URL. Finally it strips out all the HTML markup and keeps only the content inside the element with a specific HTML id. This prevents saving irrelevant content that is present on every page, such as the header, footer, copyright notice and random banners.

using System;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;
using Sitecore.Configuration;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data;
using Sitecore.Diagnostics;
using Sitecore.Links;
using Sitecore.Sites;

// AppSettingsHelper, IWeb_Base_WebpageConstants and the TemplateInheritsFrom
// extension method are project-specific and are not shown in this post.
public class SiteSearchField : IComputedIndexField
{
    public object ComputeFieldValue(IIndexable indexable)
    {
        var item = indexable as SitecoreIndexableItem;
        if (item == null || item.Item == null) return string.Empty;

        string url = null;
        string content = string.Empty;
        try
        {
            // Only crawl page items that live inside the content tree
            if (item.Item.Paths.FullPath.StartsWith("/sitecore/content/") && item.Item.TemplateInheritsFrom(new TemplateID(IWeb_Base_WebpageConstants.TemplateId)))
            {
                #region PageUrl
                using (new SiteContextSwitcher(Factory.GetSite(AppSettingsHelper.GetPublixSitecoreSiteName())))
                {
                    url = LinkManager.GetItemUrl(item, new UrlOptions() { AlwaysIncludeServerUrl = true });
                }
                #endregion

                #region WebRequestToPage
                // Request the web page
                using (var client = new WebClient())
                {
                    string pageContent = client.DownloadString(url);

                    // Parse the page's html using HtmlAgilityPack
                    HtmlDocument htmlDocument = new HtmlDocument();
                    htmlDocument.LoadHtml(pageContent);

                    // remove all html tags and keep just the relevant content
                    HtmlNode mainContainer = htmlDocument.GetElementbyId(AppSettingsHelper.GetSectionMainContentId());
                    content = mainContainer != null ? GetAllInnerTexts(mainContainer) : null;
                }
                #endregion

                return content;
            }
        }
        catch (WebException webException)
        {
            Log.Warn($"Failed to populate field {indexable.Id} ({url}): {webException.Message}", webException, this);
            throw;
        }
        catch (Exception exc)
        {
            Log.Error($"An error occurred when indexing {indexable.Id}: {exc.Message}", exc, this);
        }

        return content;
    }

    protected virtual string GetAllInnerTexts(HtmlNode node)
    {
        // Drop script and style elements so their source is not indexed as text
        node.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList()
            .ForEach(n => n.Remove());

        return RemoveWhitespace(node.InnerText.Replace(Environment.NewLine, " "));
    }

    // Collapses every run of consecutive spaces into a single space.
    // A single pass is enough; the original repeated the same pass five times
    // over the unchanged input, which produced the same result.
    private static string RemoveWhitespace(string inputStr)
    {
        StringBuilder tmpbuilder = new StringBuilder(inputStr.Length);
        bool inspaces = false;
        for (int k = 0; k < inputStr.Length; ++k)
        {
            char c = inputStr[k];
            if (inspaces)
            {
                if (c != ' ')
                {
                    inspaces = false;
                    tmpbuilder.Append(c);
                }
            }
            else if (c == ' ')
            {
                inspaces = true;
                tmpbuilder.Append(' ');
            }
            else
            {
                tmpbuilder.Append(c);
            }
        }
        return tmpbuilder.ToString();
    }

    public string FieldName { get; set; }
    public string ReturnType { get; set; }
}

Inside the computed index field, determine if the item is a Sitecore page item

if (item.Item.Paths.FullPath.StartsWith("/sitecore/content/") && item.Item.TemplateInheritsFrom(new TemplateID(IWeb_Base_WebpageConstants.TemplateId)))

This line checks that the current item is a page type and that it is inside the Sitecore content tree.

Generate the URL for this page

using (new SiteContextSwitcher(Factory.GetSite(AppSettingsHelper.GetPublixSitecoreSiteName())))
{
    url = LinkManager.GetItemUrl(item, new UrlOptions() { AlwaysIncludeServerUrl = true });
}

First we need to switch to the right site context; otherwise we would be generating the URL based on the index job context. Then, using the link manager, we can create the site URL for this specific item.

Get the HTML response

using (var client = new WebClient())
{
    string pageContent = client.DownloadString(url);

    // Parse the page's html using HtmlAgilityPack
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(pageContent);

    // remove all html tags and keep just the relevant content
    HtmlNode mainContainer = htmlDocument.GetElementbyId(AppSettingsHelper.GetSectionMainContentId());
    content = mainContainer != null ? GetAllInnerTexts(mainContainer) : null;
}

Here we get the webpage content using System.Net.WebClient.
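Because this request runs inside the indexing job, a slow page can hold up the crawl. WebClient does not expose a timeout property directly, so one common workaround is to subclass it and set the timeout on the underlying request. The sketch below is only an illustration: the class name TimeoutWebClient and the 10 second default are assumptions and are not part of the original code.

```csharp
using System;
using System.Net;

// Hypothetical helper: a WebClient whose requests give up after a configurable time.
public class TimeoutWebClient : WebClient
{
    // Timeout in milliseconds; the default of 10 seconds is an arbitrary assumption.
    public int TimeoutMilliseconds { get; set; } = 10000;

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request != null)
        {
            // Applies to the request created for DownloadString and similar calls
            request.Timeout = TimeoutMilliseconds;
        }
        return request;
    }
}
```

It could then replace new WebClient() in the using statement, for example new TimeoutWebClient { TimeoutMilliseconds = 5000 }.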

Crawl the HTML

Once we have the webpage content, we load it into an HtmlDocument (HtmlAgilityPack) and select only the element whose HTML id marks the main content of the page. Then, using some utilities, we remove the HTML tags and any JavaScript or CSS declarations.

protected virtual string GetAllInnerTexts(HtmlNode node)
{
    node.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style")
        .ToList()
        .ForEach(n => n.Remove());

    return RemoveWhitespace(node.InnerText.Replace(Environment.NewLine, " "));
}
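The cleanup above only replaces Environment.NewLine and collapses plain spaces, so tabs, bare line feeds and HTML entities such as &amp;amp; can still end up in the indexed text. A possible alternative is sketched below using only the .NET base class library. The TextNormalizer class and its Normalize method are hypothetical names introduced for this example, not part of the original class.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, shown only as an alternative to RemoveWhitespace.
public static class TextNormalizer
{
    public static string Normalize(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        // Turn entities such as &amp; back into the characters they represent
        string decoded = WebUtility.HtmlDecode(text);

        // Collapse every run of whitespace (spaces, tabs, \r, \n) into one space
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

With this helper, GetAllInnerTexts could return TextNormalizer.Normalize(node.InnerText) instead of calling RemoveWhitespace.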

Save the important HTML content into the custom field

Now that we have the content we want to save in this custom field, which will later be queried by the site search functionality, we just return the final string.

This post was created by Carlos Araujo. You can contact me on Twitter @caraujo
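To show how the field might be consumed later, here is a minimal sketch of a query against it using the Sitecore ContentSearch LINQ API. It makes several assumptions that are not in the original post: the index is named sitecore_web_index, and SiteSearchResultItem and SiteSearch are hypothetical names. It needs a running Sitecore instance with the index rebuilt after the configuration change.

```csharp
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical result type that maps the computed field to a property.
public class SiteSearchResultItem : SearchResultItem
{
    [IndexField("_sitesearchfield")]
    public string SiteSearchContent { get; set; }
}

// Hypothetical search helper, not part of the original post.
public static class SiteSearch
{
    public static List<SiteSearchResultItem> Search(string term, int maxResults = 10)
    {
        if (string.IsNullOrWhiteSpace(term)) return new List<SiteSearchResultItem>();

        // Assumption: the pages are indexed in the web index under this name
        ISearchIndex index = ContentSearchManager.GetIndex("sitecore_web_index");
        using (IProviderSearchContext context = index.CreateSearchContext())
        {
            return context.GetQueryable<SiteSearchResultItem>()
                .Where(r => r.SiteSearchContent.Contains(term))
                .Take(maxResults)
                .ToList();
        }
    }
}
```

Each result exposes the usual SearchResultItem members, so the matching Sitecore item can be loaded with GetItem() to render a title and link.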