Maybe you’ve heard people talking about ditching their SQL Servers and other RDBMSs
entirely. There is a movement out in the software development world called
the "NoSQL" movement and it’s taking the web application world by storm.
“Insanity!” you may cry, “for where will people put their data if not in a database?
Flat files? Tell me we aren’t going back to flat files.”
No, but in the relational model, something does have to give. The NoSQL movement
is about re-evaluating the constraints and scalability of data storage systems in
the light of the way modern web applications generate and consume data.
The outcry about flat files above is meant to highlight an assumption developers
often have about building data-driven applications: Data goes in the database (SQL
Server, Oracle, or MySQL). Just maybe, if we are really cutting-edge, we might consider
storing our data in the cloud, but the choices generally stop there.
The NoSQL movement asks the question:
“Is the relational database (RDBMS) always the right tool for data storage and data
access?”
Starting from an RDBMS is virtually an axiom of software development. However,
those of us who are excited about
NoSQL believe that relational databases are not always the answer. I think this
highlights one of the reasons this NoSQL thing is called a movement. People are
realizing they have a choice where they thought they had none.
The converse is, of course, also true: NoSQL databases are not always the
right choice either. If you look carefully, however, you will find that they are
a good choice much of the time. Don’t take my word for it. Ask Facebook, Twitter,
Digg, SourceForge, WebEx, Reddit, and a bunch of other companies
that are using NoSQL databases.
This move towards NoSQL is driven by pressure from two angles in the web application
world:
- Ease-of-use and deployment
- Performance - especially when there are many writers as compared to the number of
readers (think Twitter or Facebook).
Choosing NoSQL for Ease-of-Use and Deployment
I cover the programming model in detail and introduce the actual database
server below. As a bit of motivation, let me just give you a quick look at how
you define the data model and maintain it.
1. Define your classes in C# (largely) without regard to putting them in a database.
   Related classes? Easy - one has a collection of the others.
2. Create a simple DataContext-like class which exposes each top-level type that is
   to be stored in the database. This is only a few lines of code per collection (think
   of this as a table).
3. Interact with the database using LINQ. This creates the collections (think tables),
   sets the schema, etc.
4. Maintain the database and evolve it by maintaining your classes from step 1. *
Why, in the name of all that is right, do we have to model our system twice? Once
in the database and once, in parallel, in code? With NoSQL, you have one place to
do that - in your C# classes.
* You may have to run a transformation tool if you’re making radical data changes,
but that’s true in SQL systems as well.
Choosing NoSQL for Performance
When the number of concurrent clients using your application, and thus your database,
is reasonably small (let’s say 500 users as a baseline), an RDBMS can work great.
But what if that number grows? And if you are writing a web app, you definitely
want that number to grow. At 50,000 users, can you still run on a single instance
of SQL Server or MySQL? How powerful does your hardware have to be to handle that?
What about at 500,000 or 5,000,000 users, still good?
I’m sure there are some of you out there thinking, “Wait a minute now! There are
plenty of systems with tons of users built upon relational databases.”
It’s true, there are. But how much expensive hardware and software do these require?
How easy is it to leverage *commodity* hardware and free software? A basic SQL Server
cluster might run you $100,000 just to get it up and running on decent hardware.
Rather than leveraging crazy scaling-up options, the NoSQL databases let you scale-out.
They make this possible (dare I say easy?) by dropping the relational aspects of
a database. Some NoSQL systems such as MongoDB get even better scalability by loosening
some of the durability guarantees – which they backfill somewhat with redundancy
(more on MongoDB shortly).
“Ok, ok. So it’s cheaper and simpler,” you say. “How much faster than the finely
tuned system that is SQL Server 2008 can these open source NoSQL systems be?”
The answer is: MUCH MUCH FASTER. Here’s a simple comparison of running a bunch of
concurrent inserts into SQL Server 2008 and MongoDB on the same computer.
Under heavy load, I’d say it’s about 100 times faster. I’m sure there is
going to be tons of second-guessing of this graph and so on. Hold your comments please!
I’ll be posting a full performance comparison with source code soon. Let me just
say that I think the comparison was fair - I’ll back that up in a later post.
NoSQL and a New Programming Model
If we do not have joins and primary / foreign key relationships, how do we associate
related data? In NoSQL, there is a way to mimic foreign keys for certain relationships.
However the main answer is that you do not disassociate your data in the first place.
I’m sure that you’ve all heard of the object-relational impedance mismatch. A large part
of that mismatch comes from the fact that we normalize the data in our database
to the extreme and then use joins to reassemble that data. Not only does that cause
this so-called impedance mismatch, but those joins can be really slow and they can
be the death of any scale-out solution. The key to many of the NoSQL databases’
scalability is that they do not use joins. You simply save large swaths of your
data as a single blob (which in MongoDB’s case, is still deeply queryable).
Shortly we’ll look at an example where we build out a disconnected, offline RSS
reader that uses MongoDB and LINQ to store its data. But just think about how you
might structure your data storage if you could save entire object graphs and still
query them? Your "row" might be a Blog object which has an array of RssEntry objects
which contain the entry text, link, date, etc. Then your *entire* query to pull
all the details of a single blog would hit a single “table” in the database. That
might look like this query which has one result:
    var blog =
        (from b in ctx.Blogs
         where b.Id == requestedBlogId
         select b).FirstOrDefault();
There are no joins or anything like that because you’re saving objects not columns
and those objects contain their collections already (e.g. RssEntries). There is
an important distinction to make here. These NoSQL databases generally are *not*
the same as object databases. They are what are known as document databases. There’s
actually a big difference between the two.
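To make the "no joins" idea concrete, here is a minimal sketch that runs entirely in memory with LINQ to Objects. It is not MongoDB or NoRM code: the trimmed-down Blog and RssEntry classes, the integer Id, and the EmbeddedGraphSketch helper are all assumptions made up for illustration. The point is only that when entries live inside their blog, both "get the whole blog" and "filter on nested data" touch one collection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down stand-ins for the article's classes (hypothetical shapes).
public class RssEntry
{
    public string Title { get; set; }
    public DateTime PostedDate { get; set; }
}

public class Blog
{
    public int Id { get; set; }   // simplified key, assumed for this sketch
    public string Name { get; set; }
    public List<RssEntry> Entries { get; set; } = new List<RssEntry>();
}

public static class EmbeddedGraphSketch
{
    // In-memory stand-in for a single "Blogs" collection.
    public static List<Blog> SampleBlogs()
    {
        return new List<Blog>
        {
            new Blog
            {
                Id = 1,
                Name = "MongoDB Notes",
                Entries =
                {
                    new RssEntry { Title = "Hello", PostedDate = new DateTime(2010, 4, 1) },
                    new RssEntry { Title = "Scaling out", PostedDate = new DateTime(2010, 4, 8) }
                }
            },
            new Blog { Id = 2, Name = "Other Blog" }
        };
    }

    // One query against one collection; the entries arrive with the blog.
    public static Blog FindBlog(IEnumerable<Blog> blogs, int requestedBlogId)
    {
        return (from b in blogs
                where b.Id == requestedBlogId
                select b).FirstOrDefault();
    }

    // Filtering on nested data still needs no join: Any() walks the embedded list.
    public static List<string> BlogsWithEntryTitled(IEnumerable<Blog> blogs, string title)
    {
        return (from b in blogs
                where b.Entries.Any(e => e.Title == title)
                select b.Name).ToList();
    }
}
```

Calling EmbeddedGraphSketch.FindBlog(EmbeddedGraphSketch.SampleBlogs(), 1) hands back the blog with both of its entries already attached. Whether a given MongoDB driver translates the nested Any() filter into a server-side query is a separate question that this in-memory sketch does not answer.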
Introducing MongoDB
The NoSQL database we are using in this example is MongoDB.
This is a free, open-source database which runs on Windows, Linux, and Mac OS X
systems. You can access it from many platforms including .NET, Ruby, Java, PHP,
and so on.
We’ll be using .NET and C# of course. You have several options when choosing how
to access MongoDB from .NET but generally that means using LINQ and a light-weight
object-mapper on top of MongoDB itself. Note that common terminology might categorize
the object mapper that moves objects into and out of the database as an ORM. While
that’s OK, there is technically no "R" in this ORM because MongoDB is not relational.
Hence I’m simply calling it an Object-Mapper (OM).
In MongoDB nomenclature, these libraries are called drivers. My favorite .NET driver
is called NoRM. It’s being actively developed and was created by Karl Seguin,
Andrew Theken, Rob Conery, James Avery, and Jason Alexander.
You can find NoRM on GitHub and discuss it in its related Google Group.
If you want to learn more about MongoDB you should listen to these Podcast interviews:
Michael Dirolf also has a great book in the works. You can catch a preview of it
on Safari Books Online.
Here’s the Amazon page: MongoDB: The Definitive Guide.
NoSQL in Action
Let’s write some code. The first step typically in a data-driven application is
to spec out the database. Then we’d use LINQ to SQL or Entity Framework to generate
the ORM classes. MongoDB is different. MongoDB has no schema, or rather, its schema
is flexible and defined via usage rather than being predefined in the database.
So our first step is to define the classes we’d be storing in the DB via NoRM.
We’re going to define 3 classes: Blog, RssEntry, and RssDetails. The Blog object
will contain a collection of RssEntry objects. In practice you might just go with
the Blog and RssEntry classes. But I wanted to model both the embedded case (Blog
+ RssEntry) and the loosely defined foreign key style relationship that mimics joins
(RssEntry + RssDetails). That way we can demonstrate both use-cases.
Here’s a taste of the Blog class:
    public class Blog
    {
        public ObjectId _id { get; set; }
        public string Name { get; set; }
        public string Url { get; set; }
        public string RssUrl { get; set; }
        public List<RssEntry> Entries { get; set; }
        // ...
    }
Notice that it contains a collection (List<T> really) of RssEntry objects.
That’s the relationship supported by nesting. The Blog class just has this collection
as part of its data model.
The RssEntry class has the summary info for a blog entry:
    public class RssEntry
    {
        public ObjectId _id { get; set; }
        public Guid UniqueId { get; set; }
        public DateTime PostedDate { get; set; }
        public string Title { get; set; }
        public string RssGuid { get; set; }
    }
And the larger data is stored in the RssDetails class (for example the text of the
post):
    public class RssDetails
    {
        public ObjectId _id { get; set; }
        // this is kinda like the foreign key.
        public Guid RssEntryId { get; set; }
        public List<string> Categories { get; set; }
        public string Link { get; set; }
        public string Text { get; set; }
        // ...
    }
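Since there are no joins, following that "kinda like a foreign key" Guid becomes a second, separate lookup in application code. Here is a hedged sketch of that two-step pattern using plain in-memory lists. The trimmed classes and the DetailLookupSketch helper are hypothetical and stand in for whatever the real data context exposes; no NoRM API is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed stand-ins for the article's classes (only the fields this sketch needs).
public class RssEntry
{
    public Guid UniqueId { get; set; }
    public string Title { get; set; }
}

public class RssDetails
{
    public Guid RssEntryId { get; set; }   // points at RssEntry.UniqueId by convention only
    public string Text { get; set; }
}

public static class DetailLookupSketch
{
    // Step two of a manual "join": a single-collection query keyed on the stored Guid.
    // Returns null when no details document references the entry.
    public static RssDetails FindDetails(IEnumerable<RssDetails> allDetails, RssEntry entry)
    {
        return allDetails.FirstOrDefault(d => d.RssEntryId == entry.UniqueId);
    }

    public static string Demo()
    {
        // Step one would normally be loading the blog; here the entry is built in memory.
        var entry = new RssEntry { UniqueId = Guid.NewGuid(), Title = "Scaling out" };

        var details = new List<RssDetails>
        {
            new RssDetails { RssEntryId = Guid.NewGuid(), Text = "unrelated post" },
            new RssDetails { RssEntryId = entry.UniqueId, Text = "Full post text" }
        };

        return FindDetails(details, entry).Text;
    }
}
```

Note the design trade-off: nothing in the database enforces this link, so the calling code has to cope with a null result when the details are missing.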
Let’s see how we insert an entire set of Blog data into the database. We begin by
generating the objects (Blog, RssEntry, etc) in memory and then serializing them
via NoRM to MongoDB much as you would in LINQ to SQL. The difference is this will
actually generate the collections (analogous to tables) if they don’t already exist
and it will define the implicit schema to match our objects:
    void SaveBlogToMongoDb(
        string rssUrl, XElement root, RssDataContext ctx)
    {
        Blog blog = new Blog();
        blog.RssUrl = rssUrl;
        blog.Name = GetBlogName(root);
        blog.Url = GetBlogUrl(root);
        blog.Entries = ParseEntries(root);

        IEnumerable<RssDetails> details
            = GetDetails(blog.Entries, root);

        foreach (RssDetails detail in details)
        {
            ctx.Add(detail);
        }

        ctx.Add(blog);
    }
Here we are using a class called RssDataContext which we wrote manually. It is very
similar to what LINQ to SQL and Entity Framework use to do the object-relational
mapping. Want to do a query? Do you know LINQ? Well then you’re all set:
    var results =
        from b in ctx.Blog
        where b.Name.Contains( "MongoDB" )
        select b;
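If you are wondering what the shape of such a hand-written context might be, here is a rough in-memory stand-in. To be clear about the assumptions: this is not the RssDataContext from the sample and it does not call NoRM at all. The InMemoryRssContext name, the Blogs and Details property names, and the list-backed storage are invented for illustration. It only shows the general pattern of one queryable property plus an Add overload per top-level type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed stand-ins for the article's classes.
public class Blog
{
    public string Name { get; set; }
}

public class RssDetails
{
    public string Text { get; set; }
}

// Hypothetical context backed by plain lists instead of a database.
public class InMemoryRssContext
{
    private readonly List<Blog> _blogs = new List<Blog>();
    private readonly List<RssDetails> _details = new List<RssDetails>();

    // One queryable property per top-level type (think "table").
    public IQueryable<Blog> Blogs
    {
        get { return _blogs.AsQueryable(); }
    }

    public IQueryable<RssDetails> Details
    {
        get { return _details.AsQueryable(); }
    }

    // A real context would hand these objects to the driver for storage.
    public void Add(Blog blog)
    {
        _blogs.Add(blog);
    }

    public void Add(RssDetails details)
    {
        _details.Add(details);
    }
}
```

With that in place, a query such as from b in ctx.Blogs where b.Name.Contains("MongoDB") select b reads the same way as the one above; only the backing store differs.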
How do you add a new entry to an existing blog and update it in the database?
    // ctx is assumed to be an RssDataContext available to this method (e.g. a field).
    void AddEntry(Blog blog, RssEntry entry)
    {
        blog.Entries.Add(entry);
        ctx.Save(blog);
    }
We leverage the fact that the blog.Entries collection is a List and just add to
it. Then Save will update the record in the DB.
All this works great and is highly performant. But do be careful as not all the
LINQ operations are fully implemented yet in NoRM and some (like join) may never
be added because MongoDB doesn’t support them.
To get started, download the MongoDB tools and server here:
http://www.mongodb.org
You unzip the zip file and run the mongod.exe program. Be sure that you have created
the C:\data\db folder. It appears at first that you have to run MongoDB in a console
window. But you can register it as a Windows Service.
Here’s some helpful advice on installing MongoDB as a Windows Service (there is
a small bug you have to work around):
http://www.deltasdevelopers.com/post/Running-MongoDB-as-a-Windows-Service.aspx
There’s also a management console (and I mean "console"):
It’s a little different. You’ll get used to it. The means of interaction with the
server is through JavaScript rather than T-SQL and the storage format is a binary
form of JSON as you can see.
For a project I’m working on I’ve built a Windows Forms UI that lets me manage the
database easily by just adding an object data source and doing some drag-drop magic
in Visual Studio. Generally I look down upon that sort of development, but for an
admin tool it’s just fine.
Now It’s Your Turn!
Try it out for yourself. Download MongoDB and the NoRM driver and build some apps.
You may also want to check out the source code for my demo app:
Download Sample: RssMongoSample-Kennedy.zip
Got feedback? Write a comment or contact me on Twitter, where I'm @mkennedy, or find me in
any of these other ways.
Recommended Reading:
Here are some other blogs on this subject.