WebRequest and WebResponse

Reading and writing Internet streams in C# ASP .NET starts with the System.IO and System.Net libraries. The main steps are issuing a request with System.Net.WebRequest and reading the reply back as a System.Net.WebResponse. Together these implement a simple web client with which you can "browse" the web and read the bytes. The beauty of C# ASP .NET is how easy it is to manipulate the stream in different ways once you have it. For example, you can read the Internet stream (e.g. a web page) as a plain old byte array, an XML stream, binary data, text, and probably a lot more.
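As a minimal sketch of that flexibility, the snippet below reads the same content three ways: as raw bytes, as text, and as XML. A MemoryStream stands in for the response stream so the example runs without a network connection. The class name StreamViews, its method names, and the sample XML are assumptions made for illustration and do not come from the original code.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class StreamViews
{
    // Copy the whole stream into a byte array.
    public static byte[] ReadAsBytes(Stream stream)
    {
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Decode the stream as UTF-8 text.
    public static string ReadAsText(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Parse the stream as XML and return the name of the root element.
    public static string ReadRootElementName(Stream stream)
    {
        using (var reader = XmlReader.Create(stream))
        {
            reader.MoveToContent();
            return reader.Name;
        }
    }

    // Hypothetical stand-in for a downloaded response body.
    public static Stream Sample()
    {
        return new MemoryStream(Encoding.UTF8.GetBytes("<page><title>Hi</title></page>"));
    }
}
```

Each helper consumes the stream it is given, so a real response stream can only be read one of these ways per download.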
Timing Out on a Frozen URL

As is expected with the web, any crawler you design in C# ASP .NET is likely to come across a misconfigured server or broken link that freezes the connection. You could wait until the default remote server timeout or local client timeout occurs, but if this happens several times during your spidering, the crawl could take a long time to complete. Instead, you can be proactive and configure your own timeout value when downloading Internet streams. The code below shows a GetURLStream function that sets a timeout of 5 seconds (5000 milliseconds). If a connection fails to open or the server fails to respond within 5 seconds, your program abandons the call and moves on to the next link in the queue.

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Net;

    private Stream GetURLStream(string strURL)
    {
        System.Net.WebRequest objRequest;
        System.Net.WebResponse objResponse = null;
        Stream objStreamReceive;

        try
        {
            objRequest = System.Net.WebRequest.Create(strURL);

            // Give up after 5 seconds instead of waiting for the default timeout.
            objRequest.Timeout = 5000;
            ((HttpWebRequest)objRequest).UserAgent = "MyWebRobot/1.0 (compatible; http://www.primaryobjects.com)";

            objResponse = objRequest.GetResponse();
            objStreamReceive = objResponse.GetResponseStream();

            return objStreamReceive;
        }
        catch (Exception excep)
        {
            Debug.WriteLine(excep.Message);

            // objResponse is still null if GetResponse() itself failed.
            if (objResponse != null)
            {
                objResponse.Close();
            }

            return null;
        }
    }
To use the above code to download a web page, you would do the following:

    Stream strm = null;
    StreamReader MyReader = null;

    try
    {
        // Download the web page.
        strm = GetURLStream("http://www.yahoo.com");

        if (strm != null)
        {
            // We have a stream, let's attach a character reader.
            char[] strBuffer = new char[3000];
            MyReader = new StreamReader(strm);

            // Read up to 3,000 characters at a time until we get the whole file.
            int intCount;
            while ((intCount = MyReader.Read(strBuffer, 0, strBuffer.Length)) > 0)
            {
                // Only convert the characters actually read in this pass.
                string strLine = new string(strBuffer, 0, intCount);
                Debug.WriteLine(strLine);
            }
        }
    }
    catch (Exception excep)
    {
        Debug.WriteLine("Error: " + excep.Message);
    }
    finally
    {
        // Clean up and close the stream.
        if (MyReader != null)
        {
            MyReader.Close();
        }

        if (strm != null)
        {
            strm.Close();
        }
    }

There are several important items to note above. The first is the use of the try catch finally statement. This ensures that any error while downloading the stream (and you will have an error at some point in your streaming) is handled properly. It also ensures that the stream is closed properly once you have finished reading it. The example above reads the stream in blocks of up to 3,000 characters and converts each block to a string for display. StreamReader derives from TextReader, so it already reads the stream as text; other readers can wrap the same stream instead, such as XmlTextReader for XML parsing (e.g. XmlTextReader MyXmlReader = new XmlTextReader(strm)).

Marking Your Territory in the Web Server Logs

If you've ever checked your web server logs, you'll have seen a long list of different web spiders that have accessed your site. Each one usually leaves a uniquely identifying user agent string. For example, here are some popular user agent strings:

    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    msnbot/1.0 (+http://search.msn.com/msnbot.htm)
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
    Exabot/3.0
    Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)
Now how do you get your custom web spider to leave its own user agent string? Simple. You set the UserAgent property of the HttpWebRequest object. If you take another look at the GetURLStream function above, you'll notice the user agent string embedded in there.

    ((HttpWebRequest)objRequest).UserAgent = "MyWebRobot/1.0 (compatible; http://www.primaryobjects.com)";
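The same idea can be sketched as a small helper that builds a request with a chosen user agent and timeout without sending it. The class name SpiderRequests and its parameter names are assumptions for illustration. Note also that newer .NET versions mark WebRequest.Create as obsolete in favor of HttpClient, so this is a sketch in the style of the article rather than a recommendation.

```csharp
using System;
using System.Net;

public static class SpiderRequests
{
    // Build (but do not send) a request that identifies the spider.
    // No network traffic happens until GetResponse() is called.
    public static HttpWebRequest Create(string url, string userAgent, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;   // what shows up in the server's logs
        request.Timeout = timeoutMs;     // milliseconds before GetResponse() gives up
        return request;
    }
}
```

Calling SpiderRequests.Create("http://example.com/", "MyWebRobot/1.0", 5000) would return a request carrying that user agent string and a 5 second timeout.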
You can make this string anything you like: simple text, an impersonation of Internet Explorer, or even a full message in HTML (which might even show up in the web server logs). The important part is that you have control over what the web sites see when your program accesses their web servers. For stealth web spiders, simply set your UserAgent property to Internet Explorer's default. Web logs will be unable to tell your request from a user's web browser (unless you make many successive requests to the same server, giving away your program's identity).

Checking if the Stream is an Image or HTML

At some point, you may need to download an image or binary file stream. What if you wanted to check whether a URL points to an image and not an HTML page? Normally, checking for this would require downloading the entire stream and checking its contents. This takes too much time, especially when we only need to look at the first few bytes to know if it's an image or not. C# ASP .NET provides a much easier way to do this with the GetURLStream technique above. Instead of reading the full stream, we will only read the first 25 characters, check whether they contain an image signature, and move on. Here is an example:

    // Download the stream using GetURLStream() as above.
    ...
    StreamReader MyTextReader = new StreamReader(strm);

    // Read the first 25 characters, we will be checking for a GIF or JPG signature.
    char[] strBuffer = new char[25];
    int intCount = MyTextReader.ReadBlock(strBuffer, 0, 25);

    // Only convert the characters actually read; short streams return fewer than 25.
    string stringBuffer = new string(strBuffer, 0, intCount);

    // Is this an image?
    if (stringBuffer.IndexOf("GIF8") > -1 || stringBuffer.IndexOf("JFIF") > -1)
    {
        Debug.WriteLine("It's an image!");
    }
    else
    {
        Debug.WriteLine("It's HTML or other junk.");
    }

Imagine all the time and bandwidth you'll save by only reading the first 25 characters of a stream to do a validity check like this on an image. You could reduce the read amount even further, possibly down to 4 bytes for GIF files, whose "GIF8" signature sits at the very start of the file. The "JFIF" marker sits at bytes 6 to 9 of a JPEG file, so that check needs at least 10 bytes. Using your imagination, you can come up with all kinds of uses for the GetURLStream() function for validity checks on Internet files and parsing streams.
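Because image files are binary, the same check can also be done on raw bytes rather than decoded characters. The sketch below is one way to do that, not the article's method: it tests for the "GIF8" signature and, as a swapped-in technique, for the two-byte JPEG start marker (FF D8) instead of searching for "JFIF". The class name ImageSniffer and its method names are assumptions for illustration.

```csharp
using System;
using System.IO;

public static class ImageSniffer
{
    // Read up to 'count' bytes from the start of the stream.
    // Stream.Read may return fewer bytes than requested, so loop until done.
    public static byte[] ReadHeader(Stream stream, int count)
    {
        var header = new byte[count];
        int total = 0;
        while (total < count)
        {
            int read = stream.Read(header, total, count - total);
            if (read == 0) break; // end of stream
            total += read;
        }
        Array.Resize(ref header, total);
        return header;
    }

    // Return "gif", "jpeg", or null if neither signature matches.
    public static string DetectImageType(byte[] header)
    {
        // GIF files start with the ASCII characters "GIF8" (GIF87a or GIF89a).
        if (header.Length >= 4 &&
            header[0] == (byte)'G' && header[1] == (byte)'I' &&
            header[2] == (byte)'F' && header[3] == (byte)'8')
        {
            return "gif";
        }

        // JPEG files start with the two-byte marker FF D8.
        if (header.Length >= 2 && header[0] == 0xFF && header[1] == 0xD8)
        {
            return "jpeg";
        }

        return null;
    }
}
```

With a stream returned by GetURLStream(), you might call ImageSniffer.DetectImageType(ImageSniffer.ReadHeader(strm, 4)) and treat a null result as "HTML or other junk".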