Prajwal Tuladhar’s Blog
 
programming, life and some random thoughts

Apr 05 2010

An example where caching simply doesn’t work

Published by Prajwal Tuladhar under Programming

The “notes” count and the actual number of note items are inconsistent in the above screenshot.

This is a typical example where caching really does not work. Facebook likes, Tumblr notes counts and article listings on Hacker News are a few examples where an in-memory database is more useful than DB caching.
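
To make the mismatch concrete, here is a minimal sketch in Python. It is not how Tumblr works internally; `CachedCounter`, `InMemoryCounter` and the 60 second TTL are assumptions made up for illustration. The first class serves the count from a cache that is refreshed only after it expires, while the second keeps the count next to the data and updates both on every write.

```python
import time


class CachedCounter:
    """Reads a count through a cache that is refreshed only after `ttl` seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._notes = []            # stands in for the database table
        self._cached_count = 0
        self._cached_at = clock()

    def add_note(self, note):
        self._notes.append(note)    # the write goes to the "database" only

    def notes(self):
        return list(self._notes)    # the item list is always read fresh

    def count(self):
        # Serve the cached value until it expires, even if it is stale.
        if self._clock() - self._cached_at >= self._ttl:
            self._cached_count = len(self._notes)
            self._cached_at = self._clock()
        return self._cached_count


class InMemoryCounter:
    """Keeps the count next to the data and updates both on every write."""

    def __init__(self):
        self._notes = []
        self._count = 0

    def add_note(self, note):
        self._notes.append(note)
        self._count += 1

    def notes(self):
        return list(self._notes)

    def count(self):
        return self._count


cached = CachedCounter(ttl_seconds=60)
live = InMemoryCounter()
for store in (cached, live):
    store.add_note("reblogged")
    store.add_note("liked")

stale_count = cached.count()   # still the value cached before the writes
live_count = live.count()      # updated together with the notes
```

Until the cache expires, the cached count and the freshly read list of notes disagree, which is the kind of inconsistency described above.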



Mar 22 2010

To embed or not to embed

Published by Prajwal Tuladhar under MongoDB

I have been using MongoDB for a few months. It is one of an array of open source NoSQL solutions available. MongoDB, like CouchDB, stores data as JSON-style documents (in MongoDB the format is called BSON, Binary JSON). A document is a loosely typed data format, as compared to the strictly typed rows of SQL-based systems. You can find more detailed information about the document data format here. In MongoDB, documents are contained in a collection, which acts as a bucket for them. So, the documents in a collection “contact_info” may look like this:

{
    "_id": "Prajwal Tuladhar",
    "street": "3478 78 ST",
    "zip": 11378,
    "email": "praj@prajwal-tuladhar.net.np",
    "phone": {
        "home": "64622251452"
    }
}

{
    "_id": "Max Payne",
    "street": "17th Floor 225 Park Avenue",
    "zip": 10001,
    "phone": {
        "mobile": "7182524423",
        "office": "2127783264"
    },
    "fax": "1.913.384.6577"
}

So these two documents are contained in a collection called “contact_info”. _id is a mandatory field in MongoDB, while the others are user-defined. As you can see, documents are self-contained and their structures differ slightly from each other. This is one of the main advantages of document-based systems: one can define rich data structures without sacrificing SQL-like features, and can still query them in a number of ways.

If you look at the above two documents, the field “phone” is an example of an embedded document. Another example may be:

db.students
=======

{
    "_id": "Jane",
    "address": {
        "address": "123 Main St",
        "city": "New York",
        "zip": 10011
    },
    "scores": [
        {
            "subject": "Biology",
            "grade": 4.00
        },
        {
            "subject": "Physics",
            "grade": 3.50
        }
    ]
}

While designing schemas for MongoDB, one of the issues I have faced for some time is when to use an embedded document.

Here are some excerpts about embedded documents from the MongoDB docs:

Line item detail objects typically are embedded.

Objects which follow an object modelling “contains” relationship should generally be embedded.

I think one key point is missing here. Whether a document should be embedded or referenced depends on the context of the domain. If the document can be expressed as a domain model, it should be declared as a first-class document, with its relationships either expressed as references or handled at the application level rather than the DB level (just imagine table relationships in MySQL with MyISAM, which does not enforce foreign keys). So, in the above example, if the scores can be expressed as domain objects (which depends on the context of the application), they should not be embedded.
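
The two options can be sketched with plain Python dictionaries standing in for MongoDB documents. This is only an illustration under assumptions: the separate "scores" collection, the "student_id" field and the helper functions `scores_for` and `update_grade` are hypothetical names, not part of MongoDB or of the original schema.

```python
# Option 1: scores embedded inside the student document.
student_embedded = {
    "_id": "Jane",
    "scores": [
        {"subject": "Biology", "grade": 4.00},
        {"subject": "Physics", "grade": 3.50},
    ],
}

# Option 2: scores as first-class documents in their own (hypothetical)
# "scores" collection, pointing back to the student through "student_id".
students = [{"_id": "Jane"}]
scores = [
    {"_id": 1, "student_id": "Jane", "subject": "Biology", "grade": 4.00},
    {"_id": 2, "student_id": "Jane", "subject": "Physics", "grade": 3.50},
]


def scores_for(student_id, score_docs):
    """Resolve the reference at application level, like a manual join."""
    return [doc for doc in score_docs if doc["student_id"] == student_id]


def update_grade(score_docs, score_id, new_grade):
    """Update one score document without touching the student document."""
    for doc in score_docs:
        if doc["_id"] == score_id:
            doc["grade"] = new_grade
            return True
    return False


update_grade(scores, 2, 3.75)
jane_scores = scores_for("Jane", scores)
```

With the second layout a score can be loaded, queried and updated on its own, at the cost of resolving the relationship in application code.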

From the database point of view, too, embedded documents are harder to update and manipulate, and they are always loaded along with their parent document, which can hurt performance when they grow very large. There is no concept of lazy loading through a cursor object for embedded documents, so one should acknowledge that large embedded documents may lead to bad design. Also, the maximum size of a MongoDB document is 4 MB.



Mar 16 2010

Nice write up on HBase

Published by Prajwal Tuladhar under Hadoop

The two-part article on HBase is a must-read if you are interested in NoSQL technology.

HBase is more complex than other systems (you need Hadoop, Zookeeper, cluster machines have multiple roles). We believe that for HBase, this is not accidental complexity and that the argument that “HBase is not a good choice because it is complex” is irrelevant. The advantages far outweigh the problems. Relying on decoupled components plays nice with the Unix philosophy: do one thing and do it well.

There has been quite a bit of war going on between Cassandra and HBase, and of course they have different design philosophies (under the CAP theorem, HBase puts the emphasis on consistency, while Cassandra puts it on availability).

Happy Reading!



Feb 25 2010

Yet another database variation talk

Published by Prajwal Tuladhar under MongoDB, Scalability

Recently there has been a lot of talk about using hybrid databases in a single system: for example, using a traditional SQL-based database for storing static data, and key-value stores (Cassandra, HBase) or document-based databases (MongoDB, CouchDB) for storing data that changes very frequently.

This approach seems more pragmatic than using a single database implementation. And once again, one should not forget that one solution does not fit every context.

Presentation from PyCon:

The presenter says that Redis is the only one of its kind in the NoSQL ecosystem, which is not quite true: MongoDB also serves its working set from memory (through memory-mapped files), although unlike Redis it is document-based, while Redis is key-value based.

Apart from that, the talk is worth watching!!!



Jan 18 2010

SQL and NoSQL - the rant continues

Published by Prajwal Tuladhar under Scalability

I have been subscribed to Planet CouchDB for quite some time, and it’s a great resource for new information about NoSQL technologies, especially CouchDB.

From the same source, I got the chance to read two interesting blog posts. One criticized Amazon SimpleDB and NoSQL technologies overall, and the other was an answer to that criticism.

One can find a large number of articles and blog posts arguing for either the SQL or the NoSQL group. It seems like the whole database world has been divided into two camps, just like during the Cold War: capitalism and communism (I won’t say which one is which, decide for yourself :) ).

In my opinion, all these arguments and counter-arguments are kind of unnecessary, because both of these tools are quite powerful in their respective contexts.

I often find people giving the example of some foo company that uses some foo tool or technology and is doing a great job scaling its overall architecture.

People often cite Google, Yahoo! and Facebook when they want to make a point about SQL and NoSQL, but it should also be considered that these companies are able to scale so efficiently precisely because they do not use only SQL or only NoSQL technology.

Google, for example, uses its BigTable, a column-based database technology (one of the instances on the NoSQL horizon), for indexing the web, while it also uses MySQL to a significant extent. In fact, Google has also provided patches for MySQL. The same is true for Facebook and Yahoo!.

Databases are hammers; MapReduce is a screwdriver

The article is quite an interesting read, differentiating normal (SQL) databases from MapReduce, a technology developed by Google for aggregating large sets of data in a distributed environment, which is also used by a number of NoSQL technologies such as MongoDB, CouchDB and many others.
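
For readers who have not seen MapReduce, the idea can be sketched in a few lines of single-process Python. This is only a toy model of the pattern, not Google's implementation nor the MongoDB or CouchDB API; the function names `map_phase`, `shuffle` and `reduce_phase` and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every document and collect emitted (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Collapse the values of each key into a single result."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def emit_subject(doc):
    # Mapper: emit one (subject, 1) pair per score document.
    yield doc["subject"], 1


def total(_key, values):
    # Reducer: sum the emitted values for one subject.
    return sum(values)


score_docs = [
    {"student": "Jane", "subject": "Biology", "grade": 4.0},
    {"student": "Max", "subject": "Biology", "grade": 3.0},
    {"student": "Jane", "subject": "Physics", "grade": 3.5},
]

counts = reduce_phase(shuffle(map_phase(score_docs, emit_subject)), total)
```

In a real distributed setup the map and reduce steps run on many machines in parallel; here everything runs in one process just to show the data flow.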

I think the same concept can be used for SQL and NoSQL.

SQL is a hammer while NoSQL is a screwdriver

So, instead of ranting about which one is superior, it would be better to combine them and use both to create a scalable and robust architecture. Technology-agnostic design and technology-agnostic architecture (which treat the database in abstract terms, i.e. using SQL and/or NoSQL as the context demands) are the most important things to consider when talking about scalability.

Update: When people use the term NoSQL, it would be better if they meant Not Only SQL rather than No SQL.
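
One way to picture such a database-agnostic design is an application that talks to a small storage interface, with SQL and document-style implementations behind it. The sketch below is only an assumption about what that could look like: `ContactStore`, `SqlContactStore`, `DocumentContactStore` and `register` are hypothetical names, sqlite3 stands in for any SQL database, and a dictionary of JSON strings stands in for a NoSQL collection.

```python
import json
import sqlite3
from abc import ABC, abstractmethod


class ContactStore(ABC):
    """Hypothetical storage interface the application codes against."""

    @abstractmethod
    def save(self, contact_id, contact):
        ...

    @abstractmethod
    def load(self, contact_id):
        ...


class SqlContactStore(ContactStore):
    """SQL-backed implementation using the standard-library sqlite3 module."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE contacts (id TEXT PRIMARY KEY, street TEXT, zip INTEGER)"
        )

    def save(self, contact_id, contact):
        self._db.execute(
            "INSERT OR REPLACE INTO contacts VALUES (?, ?, ?)",
            (contact_id, contact["street"], contact["zip"]),
        )

    def load(self, contact_id):
        row = self._db.execute(
            "SELECT street, zip FROM contacts WHERE id = ?", (contact_id,)
        ).fetchone()
        return None if row is None else {"street": row[0], "zip": row[1]}


class DocumentContactStore(ContactStore):
    """Document-style implementation; a dict stands in for a NoSQL collection."""

    def __init__(self):
        self._collection = {}

    def save(self, contact_id, contact):
        self._collection[contact_id] = json.dumps(contact)

    def load(self, contact_id):
        raw = self._collection.get(contact_id)
        return None if raw is None else json.loads(raw)


def register(store, contact_id, street, zip_code):
    """Application code: it only knows the ContactStore interface."""
    store.save(contact_id, {"street": street, "zip": zip_code})
    return store.load(contact_id)


results = [
    register(store, "Max Payne", "17th Floor 225 Park Avenue", 10001)
    for store in (SqlContactStore(), DocumentContactStore())
]
```

The `register` function behaves the same with either backend, so the choice between SQL and NoSQL can follow the context instead of being baked into the application.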


