ShopSnap Dev Blog
AngularJS and SEO
Search engines are designed to parse and index HTML content. Simple enough for static HTML pages or traditional MVC applications where HTML templates are rendered server-side. But what about modern web apps where HTML templates are rendered on the client?
Search bots are not going to run Javascript and will end up indexing your HTML templates instead. At best, they will be mostly empty. At worst, some otherwise hidden snippet of content will get inadvertently indexed.
<span ng-show="isEmpty()">No results</span>
The snippet above is meant to be shown only when search results are empty, but it gets indexed indiscriminately, and the Google crawler reports the entire page as a soft 404 crawl error. Ouch.
The problem in a nutshell - search engines index content, not apps.
Here are the two basic types of solutions to the Javascript SEO problem:
The most obvious is to have the server generate a custom static version of the page specifically targeted at crawlers. There are many ways of doing this, but they all end up forcing you to more or less rewrite your entire app for server-side generation. This is really a regression, no matter how much clever code reuse you come up with. The static content can be a hidden part of the actual page or be served exclusively to search bots (the latter can degenerate into the much-maligned practice of cloaking). What makes this approach most deficient, though, is that it cannot be applied universally to all web apps: it depends on each particular technology stack and requires constant maintenance.
The other approach is to simply run the Javascript code in a headless browser and give the resulting HTML snapshot of the page to the search bots. Let's look into this in more detail.
Generating HTML Snapshots in a Headless Browser
This approach is actually what the major search players are pushing for and you can read Google's specification here. Here's the basic idea.
The page to be indexed tells the Googlebot, via a meta tag, that it should be requested in a special crawler-friendly way:
<meta name="fragment" content="!">
The bot will rewrite the URL by adding an extra query parameter, _escaped_fragment_. The value of the parameter will either be empty or contain the escaped hashbang query if one is present. (I will not get into a discussion of the hashbang approach vs HTML5 pushState as it's orthogonal to the problem at hand - just know that this works in both cases.)
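For example (with illustrative URLs), a hashbang URL like http://example.com/#!/products/42 is requested by the crawler as http://example.com/?_escaped_fragment_=/products/42, while a pushState page carrying the meta tag, say http://example.com/products, is requested as http://example.com/products?_escaped_fragment_=.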
The server detects the presence of the special _escaped_fragment_ parameter and intercepts the request. Depending on your setup this can be done either in your Apache or Nginx configuration or directly in your ExpressJS middleware stack, for example.
The URL of the intercepted request is then rewritten back to its original form, fetched from the server by a headless browser like PhantomJS, and the resulting HTML snapshot is returned to the crawler.
Let's do it! - A Simple MEAN Stack Implementation
Various paid solutions abound - just do a simple search for "AngularJS SEO" or "Javascript SEO problem". Some of the more popular ones that I've looked at include prerender.io, brombone and SEO.js. Let's see how hard it is to write this from scratch for a MEAN (MongoDB, ExpressJS, AngularJS and Node) stack web app.
Let's start with detecting and forwarding _escaped_fragment_ requests. You can do this in your Apache or Nginx configuration, or directly in your ExpressJS middleware stack. Here's my Gist for the latter using http-proxy for request forwarding.
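Since the Gist lives off-site, here's a minimal sketch of what that middleware might look like (the snapshot server's address and port are assumptions):

var httpProxy = require('http-proxy');

// Proxy instance pointed at the PhantomJS snapshot server
var proxy = httpProxy.createProxyServer();

module.exports = function(req, res, next) {
  // Crawlers following Google's AJAX crawling scheme add this query parameter
  if (req.query._escaped_fragment_ !== undefined) {
    proxy.web(req, res, { target: 'http://localhost:8888' });
  } else {
    next(); // a regular request - let the rest of the stack handle it
  }
};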
This middleware is very simple - it checks for the presence of the special query parameter and uses http-proxy to forward the request to our PhantomJS server. Here's a link to a Gist containing a simple PhantomJS setup, as it's too lengthy to include here in full:
 PhantomJS SEO HTML Snapshot Server
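In condensed form, the server looks roughly like this (a sketch only - the ports, the app server's address, and all error handling are assumptions; see the Gist for the full version):

// seo-server.js (condensed sketch)
var webserver = require('webserver').create();

webserver.listen(8888, function(request, response) {
  // Strip _escaped_fragment_ so we don't proxy back into ourselves,
  // then fetch the original page from the app server
  var target = 'http://localhost:3000' + request.url
    .replace(/([?&])_escaped_fragment_=[^&]*&?/, '$1')
    .replace(/[?&]$/, '');

  var page = require('webpage').create();

  // Fired when the Angular app calls window.callPhantom()
  page.onCallback = function() {
    response.statusCode = 200;
    response.write(page.content); // the fully rendered HTML snapshot
    response.close();
    page.close();
  };

  page.open(target);
});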
The PhantomJS server rewrites the URL so we don't get into an infinite loop, and fetches the request from our ExpressJS server. To run the server, install PhantomJS and invoke the script:
npm install -g phantomjs
phantomjs seo-server.js
One last bit here is how we signal PhantomJS that we are done loading the page, so it knows it's ready to take the snapshot. This is done via a special callback in our Angular app. Here's a simple module that registers a callback on the root scope, so any controller can call $scope.htmlReady(). This is best called once all dynamic content has loaded successfully.
angular.module('phantomjs-callback', []).run(['$rootScope', '$window',
  function($rootScope, $window) {
    // phantomjs callback when ready for SEO server to take HTML snapshot
    $rootScope.htmlReady = function() {
      if (typeof $window.callPhantom === 'function') {
        $rootScope.$evalAsync(function() { // fire after $digest
          setTimeout(function() { // fire after DOM rendering
            $window.callPhantom();
          }, 0);
        });
      }
    };
  }
]);
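A controller would then call it once its dynamic content arrives - for example (the app module, controller, and Results service here are hypothetical):

angular.module('app', ['phantomjs-callback'])
  .controller('SearchCtrl', ['$scope', 'Results', function($scope, Results) {
    // Signal the snapshot server only after the dynamic content has loaded
    Results.fetch().then(function(data) {
      $scope.results = data;
      $scope.htmlReady(); // inherited from $rootScope
    });
  }]);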
That's it. We can add snapshot caching if performance ever becomes a concern, although that's unlikely given how infrequently bots make requests. Happy snapshotting!