Prajwal Tuladhar’s Blog
 
programming, life and some random thoughts

May 10 2010

Ensure Index in MongoDB

Published by Prajwal Tuladhar under MongoDB

I just learned something the hard way today.

I have a collection; let’s call it foo_collection. The data format of that collection is something like this:


{
	file: "filename1",
	started: "Sat Feb 20 2010 04:15:02 GMT-0500 (EST)",
	ended: "Sat Feb 20 2010 04:15:04 GMT-0500 (EST)"
}

Initially, the key “file” was used as a normal index:


db.foo_collection.ensureIndex({file: 1});

Later on, I had to change the index from normal to unique. So, I did:


db.foo_collection.ensureIndex({file: 1}, {unique: true});

This didn’t work, because the collection already contained duplicate values for the file key. Worse, the MongoDB console doesn’t show any error or message when the above command is run.

It should have been:


db.foo_collection.ensureIndex({file: 1}, {unique: true, dropDups: true});

Again, note that the above command keeps only the first document for each value of the indexed key and deletes every other document that shares that value. So, use it with caution.
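Before reaching for dropDups, it helps to know which values are actually duplicated. Here is a minimal sketch in plain JavaScript that runs without MongoDB. The names findDuplicateKeys and sampleDocs are made up for illustration; in practice the documents would come from a query on the collection.

```javascript
// Count how often each value of `key` occurs and return only the
// values that occur more than once, together with their counts.
function findDuplicateKeys(documents, key) {
  const counts = new Map();
  for (const doc of documents) {
    counts.set(doc[key], (counts.get(doc[key]) || 0) + 1);
  }
  const duplicates = {};
  for (const [value, count] of counts) {
    if (count > 1) duplicates[value] = count;
  }
  return duplicates;
}

// Hypothetical sample data shaped like the documents in foo_collection.
const sampleDocs = [
  { file: "filename1" },
  { file: "filename2" },
  { file: "filename1" },
];

console.log(findDuplicateKeys(sampleDocs, "file"));
```

If the result is not empty, a plain unique index will fail, and dropDups would throw away the extra documents.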

I know this was kind of a childish mistake, but I wanted to share it so that someone reading this post won’t repeat it.

Update: You can also check whether ensureIndex() worked by calling db.getLastError() afterwards, and you can do the same at the driver level by setting the “safe” option. (Credit: Kristina)



Mar 22 2010

To embed or not to embed

Published by Prajwal Tuladhar under MongoDB

I have been using MongoDB for a few months. It is one of an array of open source NoSQL solutions available. MongoDB, like CouchDB, stores data as documents in a JSON-like format (in MongoDB it is called BSON, for Binary JSON). A document is a loosely typed data format, as compared to the strictly typed rows of SQL-based systems. You can find more detailed information about the document data format here. In MongoDB, documents are contained in a collection, which acts as a bucket for them. So, a collection “contact_info” may contain documents like these:

{
    "_id": "Prajwal Tuladhar",
    "street": "3478 78 ST",
    "zip": 11378,
    "email": "praj@prajwal-tuladhar.net.np",
    "phone": {
        "home": "64622251452"
    }
}

{
    "_id": "Max Payne",
    "street": "17th Floor 225 Park Avenue",
    "zip": 10001,
    "phone": {
        "mobile": "7182524423",
        "office": "2127783264"
    },
    "fax": "1.913.384.6577"
}

So these two documents are contained in a collection called “contact_info”. _id is a mandatory field in MongoDB, while the others are user defined. As you can see, documents are self-contained and their structures differ slightly from each other. This is one of the main advantages of using document-based systems: one can define rich data structures without sacrificing SQL-like features, with the option to query them in a number of ways.

If you look at the above two documents, the field “phone” is an example of an embedded document. Other examples may be:

db.students
=======

{
    "_id": "Jane",
    "address": {
        "address": "123 Main St",
        "city": "New York",
        "zip": 10011
    },
    "scores": [
        {
            "subject": "Biology",
            "grade": 4.00
        },
        {
            "subject": "Physics",
            "grade": 3.50
        }
    ]
}

While designing schemas for MongoDB, one of the issues I have faced for some time is when to use an embedded document.

Here are some excerpts about embedded documents from the MongoDB docs:

Line item detail objects typically are embedded.

Objects which follow an object modelling “contains” relationship should generally be embedded.

I think one key point is missing here. Whether a document should be embedded or referenced depends on the context of the domain. If the document can be expressed as a domain model, it should be declared as a first-class document, with the relationship either expressed as a reference in the database or handled at the application level instead (just think of table relationships in MySQL’s MyISAM, which has no foreign keys). So, in the above example, if the scores can be expressed as domain objects (which depends on the context of the application), they should not be embedded.
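To make the alternative concrete, here is a small sketch of the students example with scores pulled out into their own collection and the relationship resolved in application code. Collections are modelled as plain arrays, and the student_id field and the scoresForStudent helper are my own assumptions for illustration, not anything MongoDB prescribes.

```javascript
// Hypothetical "students" collection without embedded scores.
const students = [
  {
    _id: "Jane",
    address: { address: "123 Main St", city: "New York", zip: 10011 },
  },
];

// Scores as first-class documents that point back to a student by _id.
const scores = [
  { student_id: "Jane", subject: "Biology", grade: 4.0 },
  { student_id: "Jane", subject: "Physics", grade: 3.5 },
];

// Resolve the relationship in application code rather than in the database.
function scoresForStudent(studentId) {
  return scores.filter((score) => score.student_id === studentId);
}

console.log(scoresForStudent(students[0]._id).map((s) => s.subject));
```

The address stays embedded because it has no life of its own outside the student, while each score can now be queried and updated independently.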

From the database point of view, too, embedded documents are hard to update and manipulate, and they are always loaded together with their parent document, which can hurt performance when they are very large. There is no lazy loading in the form of a cursor object for embedded documents, so keep in mind that large embedded documents may be a sign of bad design, and that the maximum size of a MongoDB document is 4 MB.
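A rough way to notice a document growing towards that limit is to measure its serialized size before saving it. The sketch below assumes Node.js (for Buffer) and uses the JSON byte length as a stand-in for the real BSON size, which will differ somewhat; approximateSize, isNearLimit and the 80% threshold are arbitrary choices of mine.

```javascript
// The 4 MB document limit mentioned above, in bytes.
const MAX_DOCUMENT_BYTES = 4 * 1024 * 1024;

// Approximate a document's size by the byte length of its JSON form.
// BSON sizes differ, so treat this only as an early warning.
function approximateSize(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// True when the approximate size reaches the given fraction of the limit.
function isNearLimit(doc, fraction = 0.8) {
  return approximateSize(doc) >= MAX_DOCUMENT_BYTES * fraction;
}

const smallDoc = { _id: "Jane", scores: [{ subject: "Biology", grade: 4.0 }] };
console.log(approximateSize(smallDoc), isNearLimit(smallDoc));
```

A document that trips this check because of one ever-growing embedded array is a good candidate for splitting into a separate collection.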



Nov 15 2009

MongoDB’s performance as compared to others

Published by Prajwal Tuladhar under MongoDB

[Chart: database performance comparison]

I haven’t used PostgreSQL or TokyoTyrant, so I can’t say much about them. And technically, I really don’t think one should compare MySQL, which is a relational database, with document-based non-relational databases like CouchDB and MongoDB.

In my opinion, MongoDB outperforms CouchDB in terms of querying, insertion and ease of use, but CouchDB’s support for MVCC and transactions is quite interesting. One of the cons of MongoDB is that its data size grows at a very high rate.

Still, it is great to see that NoSQL (Not Only SQL) is in full swing.

Download OpenSQL comparison PDF (Don’t forget to read the conclusion though) via HackerNews.



Nov 15 2009

MapReduce API for MongoDB

Published by Prajwal Tuladhar under MongoDB

Lately, I’ve been doing some stuff with MongoDB. If you don’t know it or haven’t used it, it’s a document-based key-value database system, which means it’s fundamentally different from traditional DBMSs like MySQL and Oracle.

Systems like MongoDB, along with similar technologies like CouchDB, make significant use of MapReduce. MapReduce is basically a two-step process: Map walks over the dataset and emits key-value pairs, which are grouped by key, and Reduce then combines the values collected for each key into a single result. You can find more information about it all over the web.

Since the PHP driver for MongoDB does not provide any specific MapReduce API, I’ve created my own using MongoDB::command. You can find it on GitHub.
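To show the two steps without a database, here is a toy version in plain JavaScript that counts tags across a few documents. The runMap and runReduce helpers and the animals data are invented for this sketch; in MongoDB, emit is provided by the server rather than passed in as an argument.

```javascript
// Hypothetical documents, each with an array of tags.
const animals = [
  { name: "cat", tags: ["mammal", "pet"] },
  { name: "dog", tags: ["mammal", "pet"] },
  { name: "eagle", tags: ["bird"] },
];

// Map step: call mapFn with each document as `this`; every emitted
// (key, value) pair is collected into a group for its key.
function runMap(documents, mapFn) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of documents) mapFn.call(doc, emit);
  return groups;
}

// Reduce step: collapse the values of each group into a single result.
function runReduce(groups, reduceFn) {
  const results = {};
  for (const [key, values] of groups) results[key] = reduceFn(key, values);
  return results;
}

const tagCounts = runReduce(
  runMap(animals, function (emit) {
    this.tags.forEach((tag) => emit(tag, 1));
  }),
  (key, values) => values.reduce((sum, n) => sum + n, 0)
);

console.log(tagCounts);
```

Summing the emitted values, rather than taking the length of the list, keeps the result correct even if reduce is applied again to partial results.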

Simple Usage:


<?php
$db_name = "test_dbs";
$mongodb = new MongoDB(new Mongo(), $db_name);

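$map = <<<MAP
	function()	{
		// Emit the same shape that reduce returns, so re-reducing stays correct.
		this.tags.forEach(
			function(x)	{
				emit(x, {count: 1});
			}
		);
	}
MAP;

$reduce = <<<REDUCE
	function(key, values)	{
		// Sum partial counts; values may include results of earlier reduce calls.
		var total = 0;
		values.forEach(function(v) { total += v.count; });
		return {count: total};
	}
REDUCE;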

$map_reduce = new MongoMapReduce($map, $reduce);
$collection_name = "animal_tagsaa";
$response = $map_reduce->invoke($mongodb, $collection_name);
print_r($response->getRawResponse());
if ($response->valid())	{
	echo "Total Execution Time: {$response->getTotalExecutionTime()} Milli Seconds\n";
	$count_data = $response->getCountsData();

	echo "Count Data\n";
	foreach ($count_data as $key=>$value)	{
		echo "{$key}: {$value}\n";
	}
	echo "********************\n";
	foreach ($response->getResultSet() as $tag)	{
		echo "{$tag["_id"]}\n";
		echo "Count: {$tag["value"]["count"]}\n";
		echo "****************\n";
	}
}

Usage with Mongo Collections


<?php

// Register the autoloader; __autoload() is deprecated in newer PHP versions.
spl_autoload_register(function ($class_name) {
    require_once "../lib/" . $class_name . '.php';
});

$db_name = "test_dbs";
$mongodb = new MongoDB(new Mongo(), $db_name);

class AnimalTag extends XMongoCollection	{

	const COLLECTION_NAME = "animal_tags";

	public function __construct(MongoDB $mongoDB)	{
		$this->collectionName = self::COLLECTION_NAME;
		parent::__construct($mongoDB, $this->collectionName);
	}
}

$animal_tags = new AnimalTag($mongodb);

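$map = <<<MAP
	function()	{
		// Emit the same shape that reduce returns, so re-reducing stays correct.
		this.tags.forEach(
			function(x)	{
				emit(x, {count: 1});
			}
		);
	}
MAP;

$reduce = <<<REDUCE
	function(key, values)	{
		// Sum partial counts; values may include results of earlier reduce calls.
		var total = 0;
		values.forEach(function(v) { total += v.count; });
		return {count: total};
	}
REDUCE;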

$response = $animal_tags->mapReduce(new MongoMapReduce($map, $reduce));
if ($response->valid())	{
	foreach ($response->getResultSet() as $tag)	{
		echo "{$tag["_id"]}\n";
		echo "Count: {$tag["value"]["count"]}\n";
		echo "****************\n";
	}
}

Enjoy!!!


