C# ASP .NET was one of the first programming platforms to provide a vast array of easy-to-use libraries for accessing the Internet, downloading files and web pages, and manipulating Internet streams. With this comes the ability to program your own C# ASP .NET web spiders, crawlers, and robots. However, since the web is an unforgiving place filled with broken links, delayed servers, and misconfigured IP addresses, using the C# streaming libraries in a fault-tolerant manner is extremely important.
Reading and writing Internet streams in C# ASP .NET starts with the System.IO and System.Net libraries. The main steps involve issuing a request with System.Net.WebRequest and reading the Internet stream back from the resulting System.Net.WebResponse. This effectively implements a web client, with which you can “browse” the web and read the bytes.
The beauty of C# ASP .NET is how easy it is to manipulate the stream in different ways once you have it. For example, you can read the Internet stream (e.g., a web page) as a plain old byte array, an XML stream, binary data, text, and probably a lot more.
As is expected with the web, any crawler you design in C# ASP .NET is likely to come across a misconfigured server or broken link which freezes the connection. You could wait until the default remote server timeout or local client timeout occurs, but if this happens several times during your spidering, it could take a long time to complete your crawl. Instead, you can be proactive and configure your own timeout value when downloading Internet streams.
The code below shows a GetURLStream function and includes a default timeout of 5 seconds. If a connection fails to open or the server fails to respond, after 5 seconds your program will break the Internet call and continue moving on to the next link in the queue.
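The original code listing is not included here; the following is a minimal sketch of what such a GetURLStream function might look like, assuming the HttpWebRequest-based approach the article describes. The user agent string and URL are placeholders.

```csharp
using System;
using System.IO;
using System.Net;

public static class WebSpider
{
    // Download a URL and return its response stream, or null on failure.
    // A 5-second timeout is applied so a frozen server or broken link
    // does not stall the crawl waiting for the default timeout.
    public static Stream GetURLStream(string strURL)
    {
        try
        {
            HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create(strURL);
            objRequest.Timeout = 5000;          // 5 seconds to connect/respond
            objRequest.ReadWriteTimeout = 5000; // also bound the read itself

            // Identify the spider in web server logs (placeholder string).
            objRequest.UserAgent = "MyWebSpider/1.0 (+http://www.example.com/bot.html)";

            WebResponse objResponse = objRequest.GetResponse();
            return objResponse.GetResponseStream();
        }
        catch (Exception)
        {
            // Broken link, DNS failure, timeout, etc. -- give up on this
            // URL so the crawler can move on to the next link in the queue.
            return null;
        }
    }
}
```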
To use the above code to download a web page you would do the following:
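The usage listing is also missing from this copy; a sketch along the lines the article describes (try/catch/finally, reading the stream as bytes and converting to a string for display) might look like this, assuming a GetURLStream helper as named above:

```csharp
using System;
using System.IO;
using System.Text;

Stream strm = null;
try
{
    strm = WebSpider.GetURLStream("http://www.example.com/");
    if (strm != null)
    {
        // Read the stream as raw bytes and build a string for display.
        byte[] buffer = new byte[4096];
        StringBuilder sb = new StringBuilder();
        int bytesRead;
        while ((bytesRead = strm.Read(buffer, 0, buffer.Length)) > 0)
        {
            sb.Append(Encoding.ASCII.GetString(buffer, 0, bytesRead));
        }
        Console.WriteLine(sb.ToString());
    }
}
catch (Exception ex)
{
    // Any error mid-download is handled here instead of crashing the crawl.
    Console.WriteLine("Error reading stream: " + ex.Message);
}
finally
{
    // Always close the stream, even after an error.
    if (strm != null)
    {
        strm.Close();
    }
}
```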
There are several important items to note above. The first is the use of the try/catch/finally statement. This ensures that any error while downloading the stream (and you will have an error at some point in your streaming) is handled properly. It also ensures that after you have finished all your spidering, the stream is closed properly.
The example above reads the stream as bytes and converts the byte array to a string for display. You could instead wrap the stream in one of several reader classes, such as a StreamReader to read the stream as text or an XmlTextReader for XML parsing (e.g., XmlTextReader MyXmlReader = new XmlTextReader(strm);).
If you’ve ever checked your web server logs, you’ll see a long list of different web spiders that have accessed your site. Each one usually leaves a uniquely identifying user agent string. For example, here are some popular user agent strings:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)
Now how do you get your custom web spider to leave its own user agent string? Simple. You set the UserAgent property of the HttpWebRequest object. If you take another look at the GetURLStream function above, you’ll notice the user agent string embedded in there.
You can make this string anything you like: simple text, an impersonation of Internet Explorer, or even a full message in HTML (which might even show up in the web server logs). The important part is that you have control over what web sites see when your program accesses their web servers.
For stealth web spiders, simply set your UserAgent property to Internet Explorer’s default. Web logs will be unable to tell your request from a user’s web browser (unless you make many subsequent requests to the same server, giving away your program’s identity).
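Setting the property is a one-liner on the request object. The exact Internet Explorer string varies by browser and Windows version; the one below is an example, not a canonical value:

```csharp
using System.Net;

HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create("http://www.example.com/");

// Impersonate Internet Explorer so the request blends into normal
// browser traffic in the server logs (example IE6-on-XP string).
objRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)";
```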
At some point, you may need to download an image or binary file stream. What if you wanted to check whether a URL points to an image and not an HTML page? Normally, checking for this would require downloading the entire stream and inspecting its contents. This takes too much time, especially when we only need to look at the first few bytes to know whether it's an image or not.
C# ASP .NET provides a much easier way to do this with the GetURLStream technique above. Instead of downloading the full stream, we will only download the first 25 bytes, check if it contains an image signature, and move on. Here is an example:
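The example listing is missing from this copy; a sketch of the technique, assuming the GetURLStream helper the article names, could check the stream's first bytes against well-known image file signatures:

```csharp
using System;
using System.IO;

// Check whether a URL points to an image by reading only the first
// 25 bytes and looking for a known file signature, instead of
// downloading the entire stream.
public static bool IsImageURL(string strURL)
{
    Stream strm = WebSpider.GetURLStream(strURL);
    if (strm == null)
        return false;

    try
    {
        byte[] buffer = new byte[25];
        int bytesRead = strm.Read(buffer, 0, buffer.Length);
        if (bytesRead < 4)
            return false;

        // GIF files start with "GIF8", JPEG with FF D8 FF,
        // and PNG with 89 50 4E 47.
        bool isGif  = buffer[0] == (byte)'G' && buffer[1] == (byte)'I' &&
                      buffer[2] == (byte)'F' && buffer[3] == (byte)'8';
        bool isJpeg = buffer[0] == 0xFF && buffer[1] == 0xD8 && buffer[2] == 0xFF;
        bool isPng  = buffer[0] == 0x89 && buffer[1] == 0x50 &&
                      buffer[2] == 0x4E && buffer[3] == 0x47;

        return isGif || isJpeg || isPng;
    }
    finally
    {
        strm.Close(); // stop downloading after the first few bytes
    }
}
```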
Imagine all the time and bandwidth you’ll save by only downloading the first 25 bytes of a stream to do a validity check like this on an image. You could reduce the read amount even further, possibly down to 4 bytes.
Using your imagination, you can come up with all kinds of uses for the GetURLStream() function for validity checks on Internet files and parsing streams.
Now that you have the core function behind a web robot spider in C# ASP .NET, you can greatly enhance your ASP .NET applications. Users will be grateful for your instant validation of submitted links, spidered content, and speedy results. Just be careful with your loops and double-check your try/catch/finally blocks. Make sure those streams are always cleaned up and closed, and always optimize your stream reading for only the number of bytes you really need.
Bandwidth costs money, but time costs even more.
Primary Objects creates advanced C# ASP .NET web applications, many of which include advanced streaming technology, web spidering, and web content parsing. If you are in need of a custom web application with these features, please contact us.
This article was written by Kory Becker, software developer and architect, skilled in a range of technologies, including web application development, machine learning, artificial intelligence, and data science.