Data Analysis with MongoDb and R

Introduction

I’ve recently completed an online course in R programming and have been weighing its strengths and weaknesses to determine where it would be most advantageous. Traditional workplace programming languages, such as C# and JavaScript, always seem to have a place.

As it happened, a problem arose that required querying two different MongoDb databases. One database was a production instance, while the other was QA. A report was needed that queried both databases and compared their contents for missing and invalid documents.

If the goal had been querying a single Mongo instance, an easy solution would be an off-the-shelf Mongo client tool. Another possibility is to query a sister database from within the first. However, since the query spans two databases and contains slightly complicated logic, a custom program might better suit the purpose. A custom program also has the advantage that it can be scheduled to run daily, and it is more easily extended than an inline query.

Where the R Programming Language Excels

Thinking a little more about the problem at hand, there were several features a solution would need. First, it would need to query two MongoDb databases. It would need to cross-reference data from one database against records in the other by checking certain columns. It would need to combine some columns to fill in missing data from one database or the other. It might also need to perform other table manipulations on rows and columns. Finally, it would need to export the result to a comma-separated values (CSV) or Excel file.

These are all features that R is exceedingly good at.

R makes it easy to manipulate tables, matrices, and data frames, seamlessly joining and combining columns and rows, and even reshaping data. R is particularly well suited to reading data sources such as fixed-width files, comma-separated data, XML, and databases. The data comes in as a data frame, ready for manipulation. R also makes it easy to export data to CSV. Most of this functionality is native to the R environment.
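As a quick illustration of the workflow described above (with hypothetical column names and values, invented for the example), joining two data frames on a shared column and exporting the result takes only a few lines of R:

```r
# Two sample data frames sharing an "id" column (hypothetical data).
prod <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
qa   <- data.frame(id = c(2, 3, 4), price = c(10, 20, 30))

# Join on the shared column, keeping only the rows present in both.
combined <- merge(prod, qa, by = "id")

# Export the result to a CSV file, compatible with Excel.
write.csv(combined, "combined.csv", row.names = FALSE)
```

The same merge, subset, and export operations scale from this toy example up to the full query results pulled from Mongo.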

Why Not Use .NET? Or Node.js? Or ..?

Of course, a console application could be written in C# .NET to query the two databases and perform the cross-validation. However, this requires creating classes to model the documents, as well as the associated plumbing for executing the database queries. A typical example of querying MongoDb with C# .NET appears as follows:

using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Builders;
using MongoDB.Bson.Serialization.Attributes;

[BsonIgnoreExtraElements]
public class Stock
{
    public ObjectId Id { get; set; }
    public string Name { get; set; }

    [BsonElement("isProfitable")]
    public bool IsProfitable { get; set; }
}

var connectionString = "mongodb://user:pass@localhost:27017/test";
var client = new MongoClient(connectionString);
var server = client.GetServer();
var database = server.GetDatabase("test");
var collection = database.GetCollection<Stock>("stock");

var query = Query<Stock>.EQ(e => e.IsProfitable, false);
var entities = collection.Find(query);

foreach (var entity in entities)
{
    Console.WriteLine(entity.Name);
}

The above code example demonstrates connecting to a MongoDb instance and searching for documents by IsProfitable. Note that C# requires the creation of an entity class to hold the results of the query.

Querying MongoDb with Node.js

Just as easily, a program could be written in Node.js to query the MongoDb instance.

var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://user:pass@localhost:27017/test', function(err, db) {
    var collection = db.collection('Stock');

    collection.find({ isProfitable: false }).toArray(function(err, entities) {
        for (var i in entities) {
            console.log(entities[i].Name);
        }

        db.close();
    });
});

The above code shows a port of the original C# example, written in Node.js.

Building a Solution in R

Similar to the above examples, a Mongo library is also available for R. The first step towards querying MongoDb with R is to check for the existence of the library package and install it if necessary. While this step can be skipped during development, it’s handy to include, especially if you will be distributing the R program to others.

We’ll be using the RMongo package.

# Increase JVM memory for large queries.
options(java.parameters = "-Xmx3000m")

packages <- c("RMongo")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
    install.packages(setdiff(packages, rownames(installed.packages())))
}

require("RMongo")

The above code checks whether the package “RMongo” is present in the list of installed packages within the R environment. If it is not, the package is installed. Note the additional line that increases the JVM’s memory; RMongo runs on Java, and this allows R to handle larger query result sets in memory. Finally, we require the RMongo library.

Querying MongoDb in R

To connect, authenticate, and query a mongo database in R, we can use the following code:

require("RMongo")
# Connect to Prod database.
mongo <- mongoDbConnect("test", "localhost", 27017)
# Login.
auth <- dbAuthenticate(mongo, "user", "pass")
# Find all matching documents.
dataProd <- dbGetQuery(mongo, "Stock", "{isProfitable: false}", 0, 999999)
# Disconnect from the database.
dbDisconnect(mongo)

The above code example demonstrates querying a Mongo database with R.
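Since dbGetQuery returns an ordinary data frame, the usual R inspection tools apply to the result (a hypothetical quick check, assuming the query above has run):

```r
# dataProd is a standard data.frame; inspect it with the usual tools.
str(dataProd)    # column names and types
head(dataProd)   # first few rows
nrow(dataProd)   # number of documents returned
```

This is convenient: nothing needs to be mapped into entity classes before the data can be explored or manipulated.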

(Screenshot: RStudio showing a connection to MongoDB, querying the database, and cleaning the results.)

Using the Features of R

With the above query we can read data from one Mongo database, and we can repeat the same query for the second database. Once we have the data in memory, we can begin taking advantage of some of R’s powerful features to tweak the resulting data set to our liking. First, we’ll want to format a date column as an actual date, rather than a string; this will allow sorting the column in Excel. We’ll also want to convert a boolean column to the “logical” data type, rather than a string.

# Set date format for parsing mongo datetime fields.
dateFormat <- "%a %b %d %H:%M:%S EDT %Y"
# Convert strings to booleans.
dataProd$isProfitable <- as.logical(dataProd$isProfitable)
# Format dates.
dataProd$startDate <- as.POSIXlt(dataProd$startDate, format = dateFormat)

The above code transforms two columns in the resulting data.frame to be of type “logical” (boolean) and POSIXlt (datetime).

Finding Missing Documents in Two Tables with R

After querying the second database, we can now check for missing documents in the two databases.

# Find stocks in the production database that are missing from the QA database.
missingQA <- dataProd[!(dataProd$Id %in% dataQA$Id),]

# Find stocks that exist in both Prod and QA, but are not profitable in QA or have a null value.
notProfitableQA <- dataQA[dataQA$Id %in% dataProd$Id &
    (is.na(dataQA$isProfitable) | dataQA$isProfitable == FALSE),]
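The %in% operator drives both of these checks, returning a logical vector that can then subset a data frame. A toy sketch (with invented id values) shows the mechanics:

```r
# Hypothetical id vectors from the two databases.
prodIds <- c("a1", "b2", "c3")
qaIds   <- c("a1", "c3")

# Which production ids also exist in QA?
prodIds %in% qaIds      # TRUE FALSE TRUE

# Negating the vector flags the ids missing from QA.
!(prodIds %in% qaIds)   # FALSE TRUE FALSE
```

Passing such a logical vector as the row index of a data frame, as in the code above, keeps only the rows where the condition is TRUE.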

Saving the Results to a CSV File

After manipulation of the data is completed, we can save the result to a CSV file, compatible with Excel, by using the following code:

# Sort the data sets by startDate.
missingQASorted <- missingQA[order(missingQA$startDate),]
notProfitableQASorted <- notProfitableQA[order(notProfitableQA$startDate),]
# Save to csv.
write.csv(missingQASorted, "missingQA.csv", quote=FALSE, row.names=FALSE)
write.csv(notProfitableQASorted, "notProfitableQA.csv", quote=FALSE, row.names=FALSE)

Conclusion

There are often multiple ways to implement a solution to a problem, in any number of programming languages. A good solution is, of course, one that works. However, the best solution is one that is also easy to understand and maintain. When the task involves manipulating and transforming large data sets, a solution built upon R can offer a quick and powerful result.

About the Author

This article was written by Kory Becker, software developer and architect, skilled in a range of technologies, including web application development, machine learning, artificial intelligence, and data science.
