Detecting Browsers, Crawlers, and Web Bots in C# ASP .NET

2009-02-15 modified 2009-05-31 | by Kory Becker

Introduction

There are a lot of different entities crawling around your web applications in the wild, including web browsers, web crawlers, web spiders, web bots, and automated scripts. Telling a regular user visiting your site apart from an automated web bot helps you record statistics more accurately, customize content, and optimize web application performance.

The .NET framework, used to create C# ASP .NET web applications, actually comes with a built-in web browser detector, called the BrowserCaps feature. .NET 2.0 adds an additional detector, called the .Browser feature. Regardless of the .NET version, determining the difference between a user’s web browser and an automated web crawler can make a big difference in a web application, and it’s easy to do.

In this article, we’ll discuss three methods for determining the web browser type. We’ll also describe how to tell the difference between a user’s web browser and an automated crawler.

What’s Inside the User-Agent String

It really all starts with the web browser user-agent string. The user-agent is a string of text, sent in the HTTP header by the web browser, for each request made when accessing a page in the C# ASP .NET web application. The user-agent typically describes the web browser client type, name, version, and other information.

Some example User-Agent strings:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Yahoo! Slurp; +http://help.yahoo.com/help/us/ysearch/slurp)

As you can tell from the above examples, quite a bit of information can be parsed out of the user-agent string. We can tell that the first user-agent is a Microsoft Internet Explorer web browser, and thus most likely a regular user. The other two user-agents are web bots. Looking at the details of the user-agent string, the most direct method of detecting the user’s web browser is simply to look for sub-strings.

Looking for Keywords in a User-Agent

The most direct and simple method for detecting web browsers accessing your C# ASP .NET web application is to simply search for a sub-string within the user-agent and classify the web browser accordingly.

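// Requires: using System;
// Request.UserAgent is null when the client sends no User-Agent header.
string userAgent = Request.UserAgent;
if (userAgent != null && userAgent.IndexOf("Googlebot", StringComparison.OrdinalIgnoreCase) > -1)
{
   // We have a GoogleBot web crawler.
}
else
{
   // We do not have a GoogleBot web crawler.
}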

By searching for a simple sub-string in the UserAgent property of the HttpRequest, we can determine the type of web client accessing the site. While this method is simple and direct, it cannot classify the many different types of user-agent strings out there. You could certainly obtain a list of user-agent strings and add keywords to search for each, but this could take a long time. It would also be difficult to maintain the list and keep it updated as new web bots and browsers emerge. There must be an easier way and this is exactly where Microsoft is one step ahead.
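To make the maintenance problem concrete, here is a minimal sketch of the keyword-list approach. The class name BotKeywordDetector and the keyword list are hypothetical, chosen only for illustration, and the list is far from complete.

```csharp
using System;

public static class BotKeywordDetector
{
    // Hypothetical, deliberately short keyword list; a real list would be
    // much longer and would need regular updates.
    private static readonly string[] BotKeywords =
    {
        "Googlebot", "Slurp", "msnbot", "crawler", "spider"
    };

    // Returns true when the user-agent contains any keyword in the list.
    public static bool IsKnownBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
        {
            // No User-Agent header to inspect; treated here as "not a known bot".
            return false;
        }

        foreach (string keyword in BotKeywords)
        {
            if (userAgent.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }

        return false;
    }
}
```

Inside a page, this could be called as BotKeywordDetector.IsKnownBot(Request.UserAgent). Every new bot means another keyword to add by hand, which is the drawback described above.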

Digging Deeper Into Request.Browser

In the above code sample, we pulled the user-agent string from the HttpRequest object. Rather than search for a sub-string in the Request.UserAgent property, we can use an additional object that the Request object provides for accessing information about the web browser client: Request.Browser. One of the properties of interest for telling the difference between a user and a web bot is Request.Browser.Crawler. This property is a boolean and is true if the web browser is actually a web bot.

if (Request.Browser.Crawler)
{
   // We have a web crawler.
}
else
{
   // We do not have a web crawler.
}

Request.Browser.Crawler Always Returns False

If you try the above code sample and test it with various user-agent strings to simulate web bots (e.g., with the Firefox User-Agent Switcher plug-in), you’ll notice that Request.Browser.Crawler always returns false. This is due to missing information in one of .NET’s configuration sections, called BrowserCaps. We’ll need to populate the list of BrowserCaps (the list of available user-agents that we have information about) in order to use this feature.

Using the BrowserCaps To Detect Web Browsers From Web Bots

BrowserCaps (http://msdn.microsoft.com/en-us/library/sk9az15a(vs.71).aspx) is a section in the web.config file, within the system.web section. BrowserCaps allows you to specify a list of web browser user-agent strings, via regular expressions, to match against. Each item in the list indicates the capabilities of the web browser, version, whether it’s a crawler, and much more.

Inside the web.config (or machine.config) file:

<configuration>
   <system.web>
      <browserCaps>
         <result type="class"/>
         <use var="HTTP_USER_AGENT"/>
         browser=Unknown
         version=0.0
         majorver=0
         minorver=0
         frames=false
         tables=false
         <filter>
            <case match="Windows 98|Win98">
               platform=Win98
            </case>
            <case match="Windows NT|WinNT">
               platform=WinNT
            </case>
         </filter>
         <filter match="Unknown" with="%(browser)">
            <filter match="Win95" with="%(platform)">
            </filter>
         </filter>
      </browserCaps>
   </system.web>
</configuration>

The above is a sample entry for detecting Windows 98 and Windows NT operating systems in the user-agent string from the web browser. While you can proceed to add entries by hand to match each web browser and crawler of interest, you can actually download a complete and updated list of user-agent BrowserCaps to add to your C# ASP .NET web application.
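As a rough illustration of what a hand-written crawler entry might look like, the fragment below flags two of the bots from the earlier user-agent examples. The match pattern and the browser value are assumptions made for this sketch, so treat it as a starting point rather than a tested configuration.

```xml
<browserCaps>
   <use var="HTTP_USER_AGENT"/>
   <filter>
      <!-- Assumed pattern: matches the Googlebot and Yahoo! Slurp user-agents -->
      <case match="Googlebot|Yahoo! Slurp">
         browser=WebBot
         crawler=true
      </case>
   </filter>
</browserCaps>
```

With an entry like this in place, Request.Browser.Crawler would be expected to return true for the matching user-agents. Each additional bot needs its own pattern, which is why a maintained list is more practical.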

To add the list of BrowserCaps to your development machine or server, follow these steps:

1. Open the following file for editing:
   C:\windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config
2. Download the BrowserCaps list from http://owenbrady.net/browsercaps (direct download list).
3. Paste the entire contents of the XML file into the machine.config, just before the line </system.web>.

If you only want the BrowserCaps list available to a single web application, paste the BrowserCaps section into your local web.config. If you want all web applications to have access to the information, use the machine.config as noted above.

After saving the changes and refreshing the C# ASP .NET web application, you will now have proper values displaying for Request.Browser.Crawler. The regularly updated list helps you detect the majority of web crawlers, bots, scripts, and web browsers.

Using the Newer .BROWSER

BrowserCaps was introduced in the .NET 1.0 Framework. While it is still active and supported by Microsoft, it has been deprecated as of .NET 2.0. The current standard is to use the .BROWSER feature to indicate the list of user-agent strings. It’s important to note that entries specified in the .BROWSER feature are merged with the contents of the BrowserCaps, so that both methods may be used.

.BROWSER (http://msdn.microsoft.com/en-us/library/ms229858(vs.80).aspx) provides a way of specifying the web browser user-agents via XML in separate files in C:\windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\Browsers. After creating a .browser file, you can execute aspnet_regbrowsers.exe to compile the browser files into an assembly installed in the global assembly cache, giving all web applications access to the list. This allows you to add new entries to the list without restarting the web application process. The actual command line to use is: C:\WINDOWS\Microsoft.NET\Framework\\aspnet_regbrowsers.exe -i
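For reference, a .browser file is an XML document along the following lines. This is a minimal sketch: the id ExampleBot, the file name you save it under, and the match pattern are all hypothetical, and the entry simply inherits from the built-in Default definition.

```xml
<browsers>
   <!-- Hypothetical entry: flags any user-agent containing "ExampleBot" as a crawler -->
   <browser id="ExampleBot" parentID="Default">
      <identification>
         <userAgent match="ExampleBot" />
      </identification>
      <capabilities>
         <capability name="browser" value="ExampleBot" />
         <capability name="crawler" value="true" />
      </capabilities>
   </browser>
</browsers>
```

To limit a definition to a single web application, the file can be placed in that application's App_Browsers folder instead of the machine-wide Browsers directory.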

The .browser feature provides a more seamless way of incorporating web browser detection into an ASP .NET application. However, at this time, a greater number of entries are available for the BrowserCaps method, which provides a more accurate detection method of web bots in the wild. Since both methods can be used together, there is no harm in combining them.

Perfecting Traffic Statistics with Web Bot Detection

One of the primary reasons to distinguish a web bot from a regular user’s web browser is to allow for accurate recording of statistics. For example, when counting the hits to a particular page in an ASP .NET web application, the numbers would become skewed if you included hits from GoogleBot, Yahoo Slurp, and the many other web bots. By using the Request.Browser.Crawler value, we can easily tell a web bot from a user and provide a more accurate figure.
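A minimal sketch of this idea follows. It assumes an in-memory counter and a hypothetical PageHitCounter class; a real application would more likely write to a database or log. The crawler flag is passed in as a parameter so the class does not depend on System.Web.

```csharp
using System;
using System.Collections.Generic;

public class PageHitCounter
{
    private readonly object _sync = new object();
    private readonly Dictionary<string, int> _userHits = new Dictionary<string, int>();
    private readonly Dictionary<string, int> _botHits = new Dictionary<string, int>();

    // Records one hit for the page, keeping bot traffic in a separate tally.
    // isCrawler would typically come from Request.Browser.Crawler.
    public void RecordHit(string page, bool isCrawler)
    {
        lock (_sync) // requests can arrive concurrently
        {
            Dictionary<string, int> target = isCrawler ? _botHits : _userHits;
            int count;
            target.TryGetValue(page, out count); // count stays 0 for a new page
            target[page] = count + 1;
        }
    }

    // Number of hits from regular users for the page.
    public int GetUserHits(string page)
    {
        lock (_sync)
        {
            int count;
            _userHits.TryGetValue(page, out count);
            return count;
        }
    }

    // Number of hits from detected crawlers for the page.
    public int GetBotHits(string page)
    {
        lock (_sync)
        {
            int count;
            _botHits.TryGetValue(page, out count);
            return count;
        }
    }
}
```

In a page, a shared instance could be updated with counter.RecordHit(Request.Path, Request.Browser.Crawler). Keeping the bot tally rather than discarding it also lets you see how often crawlers visit each page.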

Cloaking Isn’t Just in Star Trek

The discussion about web bot detection in C# ASP .NET web applications wouldn’t be complete without briefly cautioning against displaying different content to web bots and regular user web browsers, also called cloaking. More specifically, cloaking is when your web application detects a web bot and shows a different page or content, with the goal of affecting search engine ranking. As a rule of thumb, display the same content to web bots as you would to normal users, and use the web bot detection methods shown above only for traffic statistics or other behind-the-scenes activities.

Conclusion

The .NET Framework provides two powerful features for detecting the web browser client and distinguishing web spiders from users’ web browsers. .NET 1.0 provides the BrowserCaps feature, which can be updated regularly with new user-agent strings as they become available. .NET 2.0 provides the .BROWSER feature, in addition to the BrowserCaps feature, for incorporating new user-agent matches more seamlessly in web applications. By using web browser and web bot detection responsibly, you can help enhance web application traffic statistics and features, creating a more powerful and resilient C# ASP .NET web application.

About the Author

This article was written by Kory Becker, software developer and architect, skilled in a range of technologies, including web application development, machine learning, artificial intelligence, and data science.
