I have been reading many articles and blog posts lately about the future of data warehousing and how big data is going to slowly replace the traditional data warehouse. I just want to lay out some facts which show that these systems have to be integrated together in an enterprise architecture. If you look at the growth of PC computing over the years, consider factors like:
- CPU speed was in MHz in the 90s and has now reached GHz.
- RAM was in KBs (640 KB), grew to MBs, and is now in GBs. The same goes for disk capacity.
What this means is that, along with the growth of computing speed and memory, it was very obvious that data also had to grow. Though data keeps growing, the requirement remains basically the same: process the data and get intelligence out of it in less time.
Now there are two ways to achieve this: use a single, very expensive, powerful machine, or use many cheap commodity machines in parallel. It is an easy choice.
How does it work?
A standard computer (single node) will need about 3.5 hours to read 1 TB of data at, say, 80 MB/second. If the same task is split among 1,000 nodes it will take about 12.5 seconds. This simple trick is called grid computing, or parallel processing. Relational databases like Teradata and DB2 were using it long before the term Big Data was even coined. The only difference is that with a relational database you specify what you want accomplished, not how to go about accomplishing the task.
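The arithmetic above is easy to check as a back-of-the-envelope calculation (the 80 MB/s throughput and 1,000-node cluster are just the illustrative numbers from the paragraph, not real benchmarks):

```python
# Time to read 1 TB at 80 MB/s: a single node vs. an even
# split across 1,000 nodes working in parallel.

total_mb = 1_000_000      # 1 TB expressed in MB
throughput = 80           # MB per second per node
nodes = 1000

single_node = total_mb / throughput   # seconds on one node
cluster = single_node / nodes         # seconds with 1,000 nodes

print(f"single node: {single_node / 3600:.1f} hours")   # ~3.5 hours
print(f"1000 nodes:  {cluster:.1f} seconds")            # ~12.5 seconds
```

This is the whole appeal of scaling out: the read is embarrassingly parallel, so the wall-clock time divides almost linearly by the number of nodes.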
Now let us look at some factors associated with data processing and how a Big Data platform and a relational database fare.
First and foremost, both systems have to deal with data: huge volumes of data, a variety of data, and data arriving fast.
The Hadoop platform handles data better.
- It works well with huge Volume, Variety and Velocity (the 3 Vs).
- It works well with both structured and unstructured data.
Where it lags:
- Not good at transactional processing; it mostly uses batch processing.
- Throughput is given priority over response time.
- Only inserts and deletes, since we are dealing with a file system.
- Data loss is possible compared to traditional databases (will touch on this point under Hardware and Maintenance).
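The batch model mentioned above is easiest to see in MapReduce, Hadoop's classic processing style. Here is a minimal sketch of a MapReduce-style word count in plain Python (the input records are made up for illustration; a real Hadoop job would read file splits from HDFS and run the map and reduce phases on many nodes):

```python
# Map -> shuffle -> reduce: the three phases of a batch MapReduce job.
from collections import defaultdict

records = ["big data and data warehouse",
           "data warehouse and big data"]

# Map phase: emit a (word, 1) pair for every word in every record.
pairs = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)
```

Note that every phase runs over the whole dataset before the next begins, which is why throughput is high but response time is poor: there is no way to ask for one record's answer without running the batch.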
Relational databases are better for
- Transactional processing, but not good at scanning 100 TB of data at once.
- Structured data; some databases handle XML well too.
- Fast random access to data, thanks to indexing.
- Reads, updates, and deletes are all possible.
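The indexing point is easy to demonstrate with SQLite (Python's built-in `sqlite3` module). The table and column names below are made up for the example; the interesting part is the query plan before and after adding an index:

```python
# Same point lookup, with and without an index: full scan vs. direct seek.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, f"cust{i % 100}") for i in range(10_000)])

# Without an index, the lookup has to scan the whole table.
plan_scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE id = 4242").fetchone()
print(plan_scan[-1])

conn.execute("CREATE INDEX idx_orders_id ON orders (id)")

# With the index, the same lookup becomes a direct seek.
plan_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE id = 4242").fetchone()
print(plan_index[-1])
```

On a file system like HDFS there is no equivalent of that second plan: finding one record still means reading through the files that might contain it.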
Hardware and Maintenance
The best thing that can happen is to have all the data you need in a single place. This makes maintenance and data I/O faster, but we know it is not feasible. When data is scattered across multiple places it needs synchronization, which causes a lot of overhead: deadlocks and timeouts, data consistency issues, data redundancy issues, the need for backups, logging, and so on.
As we discussed earlier, Hadoop runs on commodity hardware (data nodes only; the name node's memory requirement is larger), which is cheap compared to the enterprise hardware an RDBMS typically runs on. Even though scalability is easier and cheaper, the chance of nodes going down is much higher, which may in turn cause data loss unless data is replicated to multiple nodes.
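Replication is exactly how Hadoop compensates for flaky commodity nodes: HDFS keeps three copies of each block by default. A rough illustration, under the simplifying (and optimistic) assumption that node failures are independent and the per-node failure probability is a made-up 1%:

```python
# Probability that ALL replicas of a block are unavailable at once,
# assuming independent node failures (a simplification).
p = 0.01                    # assumed chance a given node is down

for r in (1, 2, 3):         # replication factor; HDFS defaults to 3
    print(f"replicas={r}: loss probability ~ {p ** r:.0e}")
```

Each extra replica multiplies the loss probability by the per-node failure rate, which is why a small replication factor buys a lot of durability, at the cost of storing the data three times.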
Not many points in favor of the Hadoop system here:
- Security and auditing are limited unless enforced by a third-party tool on the enterprise platform.
- No significant data encryption.
- Only whole-file compression, whereas relational databases use sophisticated compression techniques.
So looking at the above facts you might have already come to a conclusion. Both systems have strengths and weaknesses in different areas, and both have to be integrated well into the enterprise architecture to get the maximum out of the data.