Internet Streams and Downloading Files in C# ASP .NET

Introduction

C# ASP .NET is one of the first programming platforms to provide a vast array of easy-to-use libraries for accessing the Internet, downloading files and web pages, and manipulating Internet streams. With this comes the ability to program your own C# ASP .NET web spiders, crawlers, and robots. However, since the web is an unforgiving place filled with broken links, slow servers, and misconfigured IP addresses, it is extremely important to use the C# streaming libraries in a fault-tolerant way.

Internet Streams and Web Spiders in C# ASP .NET

WebRequest and WebResponse

Reading and writing Internet streams in C# ASP .NET starts with the System.IO and System.Net libraries. The main functions used involve issuing a request with the System.Net.WebRequest and reading back the Internet stream as a System.Net.WebResponse. This effectively implements a web client, from which you can “browse” the web and read the bytes.

The beauty of C# ASP .NET is how easy it is to manipulate the stream in different ways once you have it. For example, you can read the Internet stream (eg. web page) as a plain old byte array, an XML stream, binary data, text, and probably a lot more.

Timing Out on a Frozen URL

As is expected with the web, any crawler you design in C# ASP .NET is likely to come across a misconfigured server or broken link which freezes the connection. You could wait until the default remote server timeout or local client timeout occurs, but if this happens several times during your spidering, it could take a long time to complete your crawl. Instead, you can be proactive and configure your own timeout value when downloading Internet streams.

The code below shows a GetURLStream function and includes a default timeout of 5 seconds. If a connection fails to open or the server fails to respond, after 5 seconds your program will break the Internet call and continue moving on to the next link in the queue.

private Stream GetURLStream(string strURL)
{
    System.Net.WebRequest objRequest;
    System.Net.WebResponse objResponse = null;
    Stream objStreamReceive;

    try
    {
        objRequest = System.Net.WebRequest.Create(strURL);
        objRequest.Timeout = 5000;
        ((HttpWebRequest)objRequest).UserAgent = "MyWebRobot/1.0 (compatible; http://www.primaryobjects.com)";
        objResponse = objRequest.GetResponse();
        objStreamReceive = objResponse.GetResponseStream();

        return objStreamReceive;
    }
    catch (Exception excep)
    {
        Debug.WriteLine(excep.Message);

        // Only close the response if one was actually received.
        if (objResponse != null)
        {
            objResponse.Close();
        }

        return null;
    }
}

To use the above code to download a web page you would do the following:

Stream strm = null;
StreamReader MyReader = null;

try
{
    // Download the web page.
    strm = GetURLStream("http://www.yahoo.com");
    if (strm != null)
    {
        // We have a stream, let's attach a text reader.
        char[] strBuffer = new char[3000];
        MyReader = new StreamReader(strm);

        // Read 3,000 characters at a time until we get the whole file.
        int count;
        while ((count = MyReader.Read(strBuffer, 0, 3000)) > 0)
        {
            // Only convert the characters actually read on this pass.
            string strLine = new string(strBuffer, 0, count);
            Debug.WriteLine(strLine);
        }
    }
}
catch (Exception excep)
{
    Debug.WriteLine("Error: " + excep.Message);
}
finally
{
    // Clean up and close the stream.
    if (MyReader != null)
    {
        MyReader.Close();
    }

    if (strm != null)
    {
        strm.Close();
    }
}

There are several important items to note above. The first is the use of the try/catch/finally statement. This ensures that any error while downloading the stream (and you will hit an error at some point in your streaming) is handled properly. It also ensures that once you are finished with all your spidering, the stream is closed properly.

The example above reads the stream as characters and converts the character buffer to a string for display. You could attach a number of different reader objects to the stream instead, such as a StreamReader (a TextReader-derived class) to read the stream as text, or an XmlTextReader for XML parsing (eg. XmlTextReader MyXmlReader = new XmlTextReader(strm)).
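As a quick sketch of the XML case, the snippet below walks an XML document with XmlTextReader. It is fed from an in-memory stream here so it runs without a network connection; against a live URL you would simply pass in the stream returned by GetURLStream instead.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

class XmlStreamDemo
{
    public static void Main()
    {
        // Simulate a downloaded XML stream with an in-memory buffer.
        string xml = "<links><link>http://www.primaryobjects.com</link></links>";

        using (Stream strm = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
        using (XmlTextReader MyXmlReader = new XmlTextReader(strm))
        {
            // Walk the nodes and print each text value found.
            while (MyXmlReader.Read())
            {
                if (MyXmlReader.NodeType == XmlNodeType.Text)
                {
                    Console.WriteLine(MyXmlReader.Value);
                }
            }
        }
    }
}
```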

Marking Your Territory in the Web Server Logs

If you’ve ever checked your web server logs, you’ll see a long list of different web spiders that have accessed your site. Each one usually leaves a uniquely identifying user agent string. For example, here are some popular user agent strings:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Exabot/3.0
Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)

Now how do you get your custom web spider to leave its own user agent string? Simple. You set the UserAgent property of the HttpWebRequest object. If you take another look at the GetURLStream function above, you’ll notice the user agent string embedded in there.

((HttpWebRequest)objRequest).UserAgent = "MyWebRobot/1.0 (compatible; http://www.primaryobjects.com)";


You can make this string anything you like: simple text, an impersonation of Internet Explorer, or even a full message in HTML (which might even show up in the web server logs). The important part is that you have control over what web sites see when your program accesses their web servers.

For stealth web spiders, simply set your UserAgent property to Internet Explorer’s default. Web logs will be unable to tell your request from a user’s web browser (unless you make many subsequent requests to the same server, giving away your program’s identity).
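As a minimal sketch of this, the snippet below sets the UserAgent property to an Internet Explorer 6 user agent string (a real IE 6 value; any browser's string would work the same way). Note that no connection is made until GetResponse() is called, so setting the property is free.

```csharp
using System;
using System.Net;

class StealthAgentDemo
{
    public static void Main()
    {
        // Create the request just as in GetURLStream (no connection is made yet).
        HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create("http://www.yahoo.com");

        // Impersonate Internet Explorer 6 on Windows XP.
        objRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";

        Console.WriteLine(objRequest.UserAgent);
    }
}
```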

Checking if the Stream is an Image or HTML

At some point, you may need to download an image or binary file stream. What if you wanted to check whether a URL points to an image rather than an HTML page? Normally, checking for this would require downloading the entire stream and examining its contents. That takes far too much time, especially when we only need to look at the first few bytes to know whether it's an image or not.

C# ASP .NET provides a much easier way to do this with the GetURLStream technique above. Instead of downloading the full stream, we will only download the first 25 bytes, check if it contains an image signature, and move on. Here is an example:

// Download the stream using GetURLStream() as above.
...
StreamReader MyTextReader = new StreamReader(strm);

// Read the first 25 characters; we will be checking for a GIF or JPG signature.
char[] strBuffer = new char[25];
MyTextReader.ReadBlock(strBuffer, 0, 25);
string stringBuffer = new string(strBuffer);

// Is this an image?
if (stringBuffer.IndexOf("GIF8") > -1 || stringBuffer.IndexOf("JFIF") > -1)
{
    Debug.WriteLine("It's an image!");
}
else
{
    Debug.WriteLine("It's HTML or other junk.");
}

Imagine all the time and bandwidth you'll save by downloading only the first 25 bytes of a stream to run a validity check like this on an image. You could reduce the read amount even further, possibly down to 4 bytes.
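As a sketch of the 4-byte version, the helper below (a hypothetical IsImageSignature function, not part of the code above) checks the raw leading bytes directly rather than going through a text reader: GIF files start with the ASCII bytes "GIF8" (from "GIF87a" or "GIF89a"), and JPEG files start with the marker bytes 0xFF 0xD8.

```csharp
using System;

class SignatureDemo
{
    // Hypothetical helper: inspect only the first bytes of a downloaded buffer.
    public static bool IsImageSignature(byte[] header)
    {
        if (header == null || header.Length < 4)
        {
            return false;
        }

        // GIF files begin with the ASCII text "GIF8".
        bool isGif = header[0] == (byte)'G' && header[1] == (byte)'I'
                  && header[2] == (byte)'F' && header[3] == (byte)'8';

        // JPEG files begin with the marker bytes 0xFF 0xD8.
        bool isJpeg = header[0] == 0xFF && header[1] == 0xD8;

        return isGif || isJpeg;
    }

    public static void Main()
    {
        Console.WriteLine(IsImageSignature(new byte[] { (byte)'G', (byte)'I', (byte)'F', (byte)'8' }));  // True
        Console.WriteLine(IsImageSignature(new byte[] { 0xFF, 0xD8, 0xFF, 0xE0 }));                      // True
        Console.WriteLine(IsImageSignature(new byte[] { (byte)'<', (byte)'h', (byte)'t', (byte)'m' }));  // False
    }
}
```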

Using your imagination, you can come up with all kinds of uses for the GetURLStream() function for validity checks on Internet files and parsing streams.

Releasing Him to the Wild

Now that you have the core function behind a web robot spider in C# ASP .NET, you can greatly enhance those ASP applications. Users will be grateful for your instant validation of submitted links, spidered content, and speedy results. Just be careful with your loops and double-check your try catch finally blocks. Make sure those streams are always cleaned up and closed. Always optimize your stream reading for only the amount of bytes you really need.

Bandwidth costs money, but time costs even more.

Primary Objects creates advanced C# ASP .NET web applications, many of which include advanced streaming technology, web spidering, and web content parsing. If you are in need of a custom web application with these features, please contact us.

About the Author

This article was written by Kory Becker, software developer and architect, skilled in a range of technologies, including web application development, machine learning, artificial intelligence, and data science.
