Parsing Hostname and Domain from a Url with Javascript

2012-11-19 | by Kory Becker

Introduction

Parsing urls is a common task in a variety of web applications, including both server-side code (as in the case of C# ASP.NET or node.js web applications) and client-side code, including Javascript. Parsing urls to extract the hostname and domain can present its own set of challenges, as the format of a url may vary greatly. Extracting the hostname is typically easier than determining the actual domain. This is due to the format of a hostname, which may include multiple sub-domains preceding the actual domain, while the domain itself may span more than two parts (for example, sub.site.co.uk).

In this tutorial, we’ll describe two basic Javascript methods for parsing a url on the client and extracting the hostname and domain. We’ll also include a bonus Javascript method for determining whether a url points to a domain external to the current web page.

Parsing the Hostname From a Url

Extracting the hostname from a url is generally easier than parsing the domain. The hostname of a url consists of the entire domain plus any sub-domains. We can parse this with a regular expression, which captures everything to the right of the double-slash, up to the next slash or colon. We remove the leading “www” (and an optional single digit after it, e.g. www2), as this is typically not needed when parsing the hostname from a url.

The Javascript for parsing the hostname from a url appears as follows:

function getHostName(url) {
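    // Capture the host after "://", skipping an optional "www." or "www<digit>." prefix.
    // The host ends at the first "/", ":", "?" or "#".
    var match = url.match(/:\/\/(www[0-9]?\.)?([^\/:?#]+)/i);
    if (match != null && match.length > 2 && typeof match[2] === 'string' && match[2].length > 0) {
        return match[2];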
    }
    else {
        return null;
    }
}

The above code will successfully parse the hostnames for the following example urls:

http://WWW.first.com/folder/page.html
first.com

http://mail.google.com/folder/page.html
mail.google.com

https://mail.google.com/folder/page.html
mail.google.com

http://www2.somewhere.com/folder/page.html?q=1
somewhere.com

https://www.another.eu/folder/page.html?q=1
another.eu
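
As a point of comparison, the same result can often be obtained without a regular expression by using the built-in URL parser, available in current browsers and in node.js. The following is only a sketch under that assumption. The name getHostNameFromUrl is hypothetical and not part of the code above. Unlike the regular expression, the URL parser lower-cases the hostname and rejects strings that are not absolute urls.

```javascript
// Sketch: read the hostname with the built-in URL parser instead of a regex.
// getHostNameFromUrl is a hypothetical name used for illustration only.
function getHostNameFromUrl(url) {
    var parsed;
    try {
        parsed = new URL(url); // throws a TypeError for invalid or relative urls
    } catch (e) {
        return null;
    }
    // hostname excludes the port; strip a leading "www." or "www<digit>."
    // and return null when the url has no host at all (e.g. mailto: links)
    return parsed.hostname.replace(/^www[0-9]?\./, '') || null;
}
```

Calling getHostNameFromUrl('https://mail.google.com/folder/page.html') returns mail.google.com, matching the regular expression version for the example urls listed above.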

Parsing the Domain From a Url

We can extract the domain from a url by leveraging our method for parsing the hostname. Since the above getHostName() method gets us very close to a solution, we just need to remove the sub-domain and clean up special cases (such as .co.uk). Our Javascript code for parsing the domain from a url appears as follows:

function getDomain(url) {
    var hostName = getHostName(url);
    var domain = hostName;
    
    if (hostName != null) {
        var parts = hostName.split('.').reverse();
        
        if (parts != null && parts.length > 1) {
            domain = parts[1] + '.' + parts[0];

            // Special case: keep a third part when the hostname ends in .co.uk
            if (parts[1].toLowerCase() === 'co' && parts[0].toLowerCase() === 'uk' && parts.length > 2) {
                domain = parts[2] + '.' + domain;
            }
        }
    }
    
    return domain;
}

In the above code, we take the hostname from the url, split it into parts on the period, and then reverse the list of parts. We then concatenate the first two parts of the reversed list (that is, the last two parts of the hostname, restored to their original order). Finally, we prepend one additional part where the TLD rules for the domain require it, as in the case of .co.uk.

The above code will successfully parse the domains for the following example urls:

http://sub.first-STUFF.com/folder/page.html?q=1
first-STUFF.com

http://www.amazon.com/gp/registry/wishlist/3B513E3J694ZL/?tag=123
amazon.com

http://sub.this-domain.co.uk/folder
this-domain.co.uk

http://mail.google.com/folder/page.html
google.com

https://mail.google.com/folder/page.html
google.com

http://www2.somewhere.com/folder/page.html?q=1
somewhere.com

https://www.another.eu/folder/page.html?q=1
another.eu

https://my.sub.domain.get.com:567/folder/page.html?q=1
get.com

Note that the above code works for a large percentage of urls. However, for perfect accuracy you would require a TLD lookup table. This would allow you to determine exactly how many parts make up the domain for country-specific TLDs. In the above method, we’re hard-coding the rule for .co.uk in Javascript. Each country has its own rules and definitions for its TLD, so the above code parses the majority, but not all, of country-specific urls.
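
To illustrate the lookup-table idea, here is a minimal sketch. The table MULTI_PART_SUFFIXES and the function getDomainFromHostName are hypothetical names introduced for this example, and the four suffixes listed are only a small sample. A complete table, such as the publicly maintained Public Suffix List, is far larger.

```javascript
// Hypothetical, deliberately incomplete table of two-part public suffixes.
var MULTI_PART_SUFFIXES = ['co.uk', 'org.uk', 'com.au', 'co.jp'];

// Sketch: reduce a hostname to its domain using the lookup table above.
function getDomainFromHostName(hostName) {
    if (typeof hostName !== 'string' || hostName.length === 0) {
        return null;
    }

    var parts = hostName.toLowerCase().split('.');
    if (parts.length < 2) {
        return parts[0]; // single-label names such as "localhost"
    }

    // Keep three parts when the last two form a known suffix, otherwise two.
    var lastTwo = parts.slice(-2).join('.');
    var isKnownSuffix = MULTI_PART_SUFFIXES.indexOf(lastTwo) !== -1;
    var keep = (isKnownSuffix && parts.length > 2) ? 3 : 2;

    return parts.slice(-keep).join('.');
}
```

Because the lookup compares only the last two parts of the hostname, a name such as co.uk.example.com resolves to example.com. Supporting another country is then a matter of adding its suffixes to the table rather than adding another hard-coded condition.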

Checking If a Url is an External Link

Our final Javascript method determines if a link is an external link on a page. An external link is a url that points to a third-party domain, different from the domain the web browser is currently displaying. For this purpose, a different domain means a different root domain, different sub-domain, different protocol (http vs https), or different port number. We can use a Javascript regular expression to split the url into its parts and compare the protocol and host against those of the current page to determine if the url is an external link. The Javascript code appears as follows:

function isExternal(url) {
    // Groups: 1 = protocol, 2 = host (with port), 3 = path, 4 = query, 5 = fragment
    var match = url.match(/^([^:\/?#]+:)?(?:\/\/([^\/?#]*))?([^?#]+)?(\?[^#]*)?(#.*)?/);

    // A different protocol (http vs https) counts as external
    if (match != null && typeof match[1] === 'string' &&
        match[1].length > 0 && match[1].toLowerCase() !== location.protocol) {
        return true;
    }

    // Strip the default port for the current protocol, then compare hosts
    if (match != null && typeof match[2] === 'string' &&
        match[2].length > 0 &&
        match[2].replace(new RegExp(':('+{'http:':80,'https:':443}[location.protocol]+')?$'),'')
           !== location.host) {
        return true;
    }
    else {
        return false;
    }
}
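
The same comparison of protocol, host, and port can also be sketched with the built-in URL parser, which exposes all three together as the origin of a url. The function isExternalTo and its pageUrl parameter are hypothetical names for this example. Passing the page url explicitly, rather than reading location, is an assumption made so the function can also run outside a browser.

```javascript
// Sketch: compare origins (protocol + host + port) using the built-in URL parser.
// isExternalTo and pageUrl are hypothetical names used for illustration only.
function isExternalTo(url, pageUrl) {
    var target;
    try {
        target = new URL(url, pageUrl); // relative urls resolve against the page
    } catch (e) {
        return false; // assumption: treat unparseable input as not external
    }
    // Default ports (80 for http, 443 for https) are dropped from the origin,
    // so http://site.com:80 and http://site.com compare as equal.
    return target.origin !== new URL(pageUrl).origin;
}
```

In a browser, this could be called as isExternalTo(link.href, location.href), where link is an anchor element on the page.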

Try the Javascript Yourself @ JSFiddle

You can try the above Javascript to parse the hostname and domain from a url online at JSFiddle.

About the Author

This article was written by Kory Becker, software developer and architect, skilled in a range of technologies, including web application development, machine learning, artificial intelligence, and data science.