The Elephant in the Room vs. Hadoop – the Elephant on Your Side
For years, ETL technology and tools have remained almost the same, especially in the data warehouse context. The tools have improved, but the methodologies have remained largely unchanged. You extract data from various sources, run a set of scripts or ETL workflows to transform that data, then load it into a star schema or semi-normalized data warehouse or master data management system.
Disadvantages of this method:
- Flexibility: Targeting only the data relevant to known outputs means that any future requirement for data not included in the original design must be added to the ETL routines. Because the routines are tightly interdependent, this often forces a fundamental re-design and re-development, increasing both the time and the cost involved.
- Hardware: Most third-party tools use their own engine to implement the ETL process. Regardless of the size of the solution, this can necessitate investment in additional hardware to run the tool’s ETL engine.
- Skills Investment: Using third-party tools to implement ETL processes requires staff to learn new scripting languages and processes.
- Learning Curve: Implementing a third-party tool with unfamiliar processes and languages brings the learning curve implicit in any technology new to an organization, and inexperience often leads teams down blind alleys.
In the process, the tools have made ETL a long and tedious exercise, often introducing lags of weeks or even months between when data is first collected and when it is ready to be analyzed. They have also caused severe headaches for businesses and stakeholders, who pray that their nightly ETL jobs complete without error so as to avoid a showdown with the application users.
Traditional data integration and ETL tools are becoming an inhibitor to the timely availability of high-value data to the business, and they will not scale effectively with ever-growing volumes of data. Beyond the costs and challenges of scaling these environments as data volumes grow, the data latency introduced by the intermediate systems these platforms require becomes an ever-greater threat to the enterprise.
The Hadoop-Based Solution:
Apache Hadoop is an open source distributed software platform for storing and processing data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster.
The Apache Hadoop platform includes the Hadoop Distributed File System (HDFS), which is designed for scalability and fault tolerance. HDFS stores large files by dividing them into blocks and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read and write data in parallel. Capacity and performance can be scaled by adding DataNodes, while a single NameNode mechanism manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.
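The block-and-replicate idea can be sketched in a few lines of Python. This is a toy simulation only: the block size, node names, and round-robin placement below are assumptions for illustration, not HDFS's actual rack-aware placement policy.

```python
# Toy sketch of HDFS-style storage: split a file's bytes into fixed-size
# blocks and replicate each block on 3 of N data nodes. Real HDFS uses a
# rack-aware placement policy; round-robin here is purely illustrative.

BLOCK_SIZE = 4    # bytes, for demonstration only (the HDFS default is far larger)
REPLICATION = 3

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide raw bytes into fixed-size blocks, as HDFS divides large files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin)."""
    placement = {}
    for idx, _ in enumerate(blocks):
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop!")
layout = place_blocks(blocks, ["node1", "node2", "node3", "node4"])
# Any single node can fail and every block still has 2 surviving replicas.
```

With three replicas per block, the loss of any one node (or even two) leaves at least one live copy of every block, which is what lets HDFS ride out routine server failures.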
Central to the scalability of Apache Hadoop is the distributed processing framework known as MapReduce. MapReduce helps programmers solve data-parallel problems for which the data set can be sub-divided into small parts and processed independently. MapReduce is an important advance because it allows ordinary developers, not just those skilled in high-performance computing, to use parallel programming constructs without worrying about the complex details of intra-cluster communication, task monitoring, and failure handling. MapReduce simplifies all that.
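The data-parallel model described above can be sketched as a word count, the canonical MapReduce example, in plain Python. This is a single-process simulation; in real MapReduce the framework distributes the map and reduce tasks across the cluster and handles the shuffle, task monitoring, and failure recovery for you.

```python
from collections import defaultdict

# Single-process simulation of the MapReduce pattern: map emits
# (key, value) pairs, the framework shuffles pairs by key, and
# reduce aggregates each key's values independently.

def map_phase(record: str):
    """Map: emit (word, 1) for every word in one input record."""
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate one key's values, independently of other keys."""
    return (key, sum(values))

records = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for rec in records for pair in map_phase(rec)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# result == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be spread across thousands of servers without the developer writing any intra-cluster communication code.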
Hadoop, at its core, is two components:
- HDFS – Massive, redundant storage
- MapReduce – Batch oriented data processing to scale
The Hadoop ecosystem brings additional functionality:
- Higher level languages and abstractions on MapReduce (Hive, Pig)
- File, relational and streaming data integration (Flume, Sqoop)
- Process Orchestration and Scheduling (Oozie)
Using these technologies, Gold Coast offers new approaches to Data Integration in place of the dated and inefficient ETL process.
The Cost Advantage:
| Solution | Cost / Terabyte | Base License Costs / Year / License |
|----------|-----------------|-------------------------------------|
The Hadoop Advantage:
RDBMSs were created to do queries (Q for short), not batch transformations (T for short). So, as data sizes grew, this approach not only began missing ETL SLA (Service Level Agreement) windows, it started missing the Q performance SLAs too. It eventually led to a double whammy: both the transformations (Ts) and the Qs slowed down. Hundreds of organizations are moving the T function from their databases to Hadoop because of a number of key benefits:
- Hadoop can perform T much more effectively than RDBMSs. Besides the performance benefits, it is also very fault tolerant and elastic. If you have a nine-hour transformation job running on 20 servers, and at the eighth hour four of these servers go down, the job will still finish — you will not need to rerun it from scratch. If you discover a human error in your ETL logic and need to rerun T for the last three months, you can temporarily add a few nodes to the cluster for extra processing speed, then decommission those nodes once the ETL catch-up is done.
- Ingest massive amounts of data without specifying a schema on write: A key characteristic of Hadoop is called “no schema-on-write,” which means you do not need to pre-define the data schema before loading data into Hadoop. This is true not only for structured data (such as OLTP application data, web service responses, other RDBMS records, call detail records, general ledger transactions, and call center transactions), but also for unstructured data (such as user comments, doctor’s notes, insurance claims descriptions, web logs, and social media data). Regardless of whether your incoming data has explicit or implicit structure, you can rapidly load it as-is into Hadoop, where it is available for downstream analytic processes.
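The schema-on-read idea behind this can be sketched as follows. The landing zone, field names, and parsing rules here are made up for illustration; the point is that nothing is rejected at write time, and structure is imposed only when a downstream job reads the data.

```python
import json

# Schema-on-read sketch: raw records are stored exactly as they arrive,
# with no up-front table definition; structure is applied only at read
# time. Field names ("customer", "amount") are illustrative only.

raw_landing_zone = []   # stands in for files landed as-is on HDFS

def ingest(raw_line: str):
    """Write path: load data as-is — no schema declared, nothing rejected."""
    raw_landing_zone.append(raw_line)

def read_with_schema(line: str):
    """Read path: apply structure now; tolerate records that don't fit."""
    try:
        rec = json.loads(line)
        return {"customer": rec.get("customer"),
                "amount": float(rec.get("amount", 0))}
    except (json.JSONDecodeError, TypeError, ValueError):
        return None   # unstructured free text: kept on load, parsed differently later

ingest('{"customer": "acme", "amount": "42.50"}')   # structured record
ingest("doctor's note: patient doing well")         # unstructured, still accepted
parsed = [read_with_schema(line) for line in raw_landing_zone]
# parsed[0] is a structured row; parsed[1] is None but the raw text is retained
```

Contrast this with a relational load, where the second record would have to be rejected or forced into a predefined column before it could be stored at all.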
- Offload the transformation of raw data by parallel processing at scale: Once the data is in Hadoop (on a Hadoop-compatible file system), you can perform the traditional ETL tasks of cleansing, normalizing, aligning, and aggregating data for your EDW by employing the massive scalability of MapReduce. Hadoop allows you to avoid the transformation bottleneck in your traditional ETL by off-loading the ingestion, transformation, and integration of unstructured data into your data warehouse. Because Hadoop enables you to embrace more data types than ever before, it enriches your data warehouse in ways that would otherwise be infeasible or prohibitive.
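A cleanse-normalize-aggregate transform of the kind described above can be sketched in the same map/reduce shape. The field names and cleansing rules are illustrative assumptions; a real job would run these two functions as parallel MapReduce tasks over HDFS blocks rather than in a single process.

```python
from collections import defaultdict

# Sketch of an ETL "T" step in MapReduce shape: the map side cleanses
# and normalizes one raw record at a time, and the reduce side
# aggregates per key. Field names and rules are illustrative only.

def cleanse_and_normalize(raw: dict):
    """Map side: drop bad records, standardize casing and numeric types."""
    if not raw.get("region") or raw.get("sales") is None:
        return None                          # cleansing: discard incomplete rows
    return (raw["region"].strip().upper(),   # normalize the key's casing
            round(float(raw["sales"]), 2))   # normalize the numeric type

def aggregate(mapped_pairs):
    """Reduce side: total sales per region."""
    totals = defaultdict(float)
    for region, sales in mapped_pairs:
        totals[region] += sales
    return dict(totals)

raw_rows = [
    {"region": " east ", "sales": "100.5"},
    {"region": "EAST", "sales": 49.5},
    {"region": None, "sales": 10},           # dropped by cleansing
]
pairs = [p for p in map(cleanse_and_normalize, raw_rows) if p is not None]
totals = aggregate(pairs)
# totals == {"EAST": 150.0}
```

Because each record is cleansed independently and each region is aggregated independently, both steps parallelize naturally: the same two functions run unchanged whether the input is three rows or three billion.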
Regardless of whether your enterprise takes the Data Warehousing approach or the Data Lake approach, you can reduce the operational cost of your overall BI/DW solution by offloading common transformation pipelines to Hadoop and using MapReduce on HDFS to provide a scalable, fault-tolerant platform for processing large amounts of heterogeneous data.
Contact us today to learn more about how you can realize your Data Dreams!