Prajwal Tuladhar’s Blog
 
programming, life and some random thoughts

Jan 14 2010

One crucial difference between MapReduce and SQL query

Published by Prajwal Tuladhar under Hadoop

MapReduce is a linearly scalable programming model. The programmer writes two functions—a map function and a reduce function—each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More importantly, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries. - Excerpt from Hadoop - The Definite Guide


Comments

Nov 15 2009

MongoDB’s performance as compared to others

Published by Prajwal Tuladhar under MongoDB

Click to view the full size

I haven’t used PostgreSQL and TokyoTyrant so, can’t say much about them. And technically, I really don’t think that one should compare MySQL which is relational database with document based non-relational databases like: CouchDB and MongoDB.

In my opinion, MongoDB out-performs CouchDB in terms of querying, insertion and ease of usage but CouchDB’s support for MVCC and transaction are quite interesting. One of the crons of MongoDB is it’s data size grow at freaking high rate.

Thoough great to see that, NOSQL (NOt Only SQL) is on full swing.

Download OpenSQL comparison PDF (Don’t forget to read the conclusion though) via HackerNews.


Comments

Nov 15 2009

MapReduce API for MongoDB

Published by Prajwal Tuladhar under MongoDB

Currently, I’ve been doing some stuffs using MongoDB. If you don’t know or haven’t use it, it’s a document based key-value database systems, that means it’s fundamentally different from traditional DBMS like MySQL, Oracle.

Systems like MongoDB along with similar technologies like CouchDB make significant use of MapReduce. MapReduce is basically a two step process consisting of Map and Reduce where Map is used for reducing a dataset to smaller sub-sets while Reduce is used for for some specific operations into that mapped or grouped data. You can find more information about it all over the web.

Since, PHP driver MongoDB does not provide any specific MapReduce API, I’ve created mine own using MongoDB::command. You can find it @ Github.

Simple Usage:


<?
$db_name = "test_dbs";
$mongodb = new MongoDB(new Mongo(), $db_name);

$map = <<<MAP
	function()	{
		this.tags.forEach(
			function(x)	{
				emit(x, 1);
			}
		);
	}
MAP;

$reduce = <<<REDUCE
	function(key, values)	{
		return {count: values.length };
	}
REDUCE;

$map_reduce = new MongoMapReduce($map, $reduce);
$collection_name = "animal_tagsaa";
$response = $map_reduce->invoke($mongodb, $collection_name);
print_r($response->getRawResponse());
if ($response->valid())	{
	echo "Total Execution Time: {$response->getTotalExecutionTime()} Milli Seconds\n";
	$count_data = $response->getCountsData();

	echo "Count Data\n";
	foreach ($count_data as $key=>$value)	{
		echo "{$key}: {$value}\n";
	}
	echo "********************\n";
	foreach ($response->getResultSet() as $tag)	{
		echo "{$tag["_id"]}\n";
		echo "Count: {$tag["value"]["count"]}\n";
		echo "****************\n";
	}
}

Usage with Mongo Collections


<?php

function __autoload($class_name) {
    require_once "../lib/".$class_name . '.php';
}

$db_name = "test_dbs";
$mongodb = new MongoDB(new Mongo(), $db_name);

class AnimalTag extends XMongoCollection	{

	const COLLECTION_NAME = "animal_tags";

	public function __construct(MongoDB $mongoDB)	{
		$this->collectionName = self::COLLECTION_NAME;
		parent::__construct($mongoDB, $this->collectionName);
	}
}

$animal_tags = new AnimalTag($mongodb);

$map = <<<MAP
	function()	{
		this.tags.forEach(
			function(x)	{
				emit(x, 1);
			}
		);
	}
MAP;

$reduce = <<<REDUCE
	function(key, values)	{
		return {count: values.length };
	}
REDUCE;

$response = $animal_tags->mapReduce(new MongoMapReduce($map, $reduce));
if ($response->valid())	{
	foreach ($response->getResultSet() as $tag)	{
		echo "{$tag["_id"]}\n";
		echo "Count: {$tag["value"]["count"]}\n";
		echo "****************\n";
	}
}

Enjoy!!!


Comments

Sep 26 2009

Know more about MySQL Explain

Published by Prajwal Tuladhar under MySQL

This has to be the one of the most in depth presentation I have ever seen about MySQL explain.

If you are working with MySQL, EXPLAIN is crucial whether you are a developer or a DBA.

And recently, I have also been using EXPLAIN quite frequently both in my works and my posts.


Comments

Sep 23 2009

Things you should know about MySQL index

Published by Prajwal Tuladhar under MySQL

Indexes are used for reading the data from the table with comparatively faster execution time. When an index is defined, MySQL maintain separate list of index values and update them when the values are updated / changed. Defining index will require more storage that means we are trading off space with the speed.

MULTI-COLUMN AND SINGLE COLUMN INDEX

Lets consider this table and its index:


create table employees (
id int auto_increment,
first_name varchar(32) not null,
last_name varchar(32) not null,
phone_number varchar(32) not null,
primary key(id)
);

ALTER TABLE employees ADD INDEX first_last_name (first_name, last_name);

mysql> select count(*) from employees;
+----------+
| count(*) |
+----------+
|   100001 |
+----------+
1 row in set (0.01 sec)

Index Cards DNA By ayalan

Index Cards' "DNA" By ayalan

Now lets make selection:


SELECT * FROM employees where last_name='tuladhar' and first_name = 'prajwal';

mysql> explain select * from employees where last_name='tuladhar' and first_name='prajwal' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: employees
type: ref
possible_keys: first_last_name
key: first_last_name
key_len: 196
ref: const,const
rows: 1
Extra: Using where
1 row in set (0.00 sec)

By using EXPLAIN keyword we can know which keys are being used in the query. (Focus on possible_keys and key column)

This is simple example of using multi-column index.

Now, instead of using, double criteria, lets use only single.


SELECT * FROM employees where last_name='tuladhar';

mysql> explain select * from employees where last_name='tuladhar' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: employees
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 100001
Extra: Using where
1 row in set (0.00 sec)

Did you see that the key column is null. So, the query is not taking advantage of any index at all. This is because we have only defined two indexes: a primary key and multi-column index containing first_name and last_name.

Now lets use ORDER BY clause to sort data using columns: first_name and last_name


SELECT * FROM employees order by last_name, first_name LIMIT 10;

mysql> EXPLAIN SELECT * FROM employees order by last_name, first_name LIMIT 10 \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: employees

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 100001

Extra: Using filesort

1 row in set (0.00 sec)

If the sort order has last_name as the first column while first_name as the last, index won’t be used. In order to make the query use the index we have to :


SELECT * FROM employees order by first_name, last_name LIMIT 10;

mysql> EXPLAIN SELECT * FROM employees order by first_name, last_name LIMIT 10 \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: employees

type: index

possible_keys: NULL

key: first_last_name

key_len: 196

ref: NULL

rows: 10

Extra:

1 row in set (0.00 sec)

Now lets create two additional single column index:


ALTER TABLE employees ADD INDEX first_name (first_name);

ALTER TABLE employees ADD INDEX last_name (last_name);

SELECT * FROM employees where last_name='tuladhar' and first_name = 'prajwal' ;

mysql> EXPLAIN SELECT * FROM employees where last_name='tuladhar' and first_name= 'prajwal' \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: employees

type: ref

possible_keys: first_last_name,first_name,last_name

key: first_last_name

key_len: 196

ref: const,const

rows: 1

Extra: Using where

1 row in set (0.00 sec)

In the above query, there are 3 possible queries and MySQL is using the multi-column index. We can force MySQL to use other two indexes by giving it hint.


SELECT * FROM employees IGNORE INDEX(first_last_name) where last_name='tuladhar' and first_name = 'prajwal' ;

mysql> EXPLAIN SELECT * FROM employees IGNORE INDEX(first_last_name) where last_name='tuladhar' and first_name = 'prajwal' \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: employees

type: ref

possible_keys: first_name,last_name

key: first_name

key_len: 98

ref: const

rows: 1

Extra: Using where

1 row in set (0.01 sec)

The above query is using two single indexes instead of a multi-column index.

Consider the query:


mysql> EXPLAIN SELECT * FROM employees where last_name='tuladhar' or first_name = 'prajwal' \G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: employees

type: index_merge

possible_keys: first_last_name,first_name,last_name

key: last_name,first_last_name

key_len: 98,98

ref: NULL

rows: 2

Extra: Using sort_union(last_name,first_last_name); Using where

1 row in set (0.00 sec)

It is using single column indexes instead of multi-column. That means, Multi column indexes are helpful when you need AND criterial and single column indexes are helpful when you need OR criteria.

PRIMARY KEY AND UNIQUE KEY

Primary Key can’t be NULL while Unique key can be NULL but that column can have only one value of NULL. (a bit weird right? This is because NULL is also a valid data format in MySQL )

FULL TEXT INDEX

I also discussed about this topic in my last post.

When executing a query with varchar and text type and if the column has large number of values, it is generally recommend to use Full Text Index rather than simple pattern matching using LIKE and/or Regular Expression.

AVOID USING SQL EXPRESSIONS WHEN POSSIBLE

This is because SQL expressions can’t take the advantage of indexes. I have taken example from this post:


SELECT * FROM customer WHERE year(customer_join_date) = 2009;

SELECT * FROM customer WHERE customer_join_date BETWEEN '2009-01-01' AND '2009-12-31';

First query can’t take advantage of index while second query can only if `customer_join_date` is defined as index.

References:

And I strongly recommend you to read High Performance MySQL if you are engaged in some sort of MySQL related tasks.

Hope this helps you. I would love to hear your feedbacks :)


Comments

« Prev - Next »

RSS Feed
Subscribe by email
Follow me @ Twitter