Sweet Spots of DB2 V11 with Hadoop

Big data and business analytics represent the new IT battleground. Here are some statistics:

  • IDC estimates the big data market will reach $16.9 billion by 2015, and that enterprises will invest more than $120 billion to capture the business impact of analytics, across hardware, software and services that same year.
  • The “digital universe” will grow to 2.7ZB in 2012, up 48% from 2011 and rocketing toward nearly 8ZB by 2015 (IDC).

  • 53% of business leaders don’t have access to the information from across their organizations they need to do their jobs (IBM CMO Study).
  • Organizations applying analytics to data for competitive advantage are 2.2x more likely to substantially outperform their industry peers (MIT/IBV Report).

The amount and variety of data being captured for business analysis are growing. A classic example of this large superset of data is Web logs, which contain unstructured raw data.

Increasingly, unstructured data is being stored on new frameworks. These infrastructures encompass hardware and software support such as new file systems, query languages, and appliances. Hadoop is a prime example.

So what is Hadoop?
  • Hadoop is a Java-based framework that supports data-intensive distributed applications and allows them to work with thousands of nodes and petabytes of data.
  • The framework is designed for distributed processing of large data sets.
  • It uses a distributed file system that is designed to be highly fault tolerant and provides high-throughput access to data, which suits applications that have large data sets.
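
To make "distributed processing of large data sets" concrete, here is a minimal, single-process sketch of the split, map, and reduce pattern that Hadoop applies across many nodes. It is only a conceptual analogy: real Hadoop jobs use Hadoop's own Java API, and the log format, sample data, and function names below are assumptions made up for this illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical web-log lines in the form "<client-ip> <path> <http-status>".
LOG_LINES = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "10.0.0.3 /index.html 200",
    "10.0.0.2 /broken 500",
]


def split_into_chunks(lines, chunk_size):
    """Stand-in for a distributed file system splitting a file into blocks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: each worker counts the status codes in its own chunk."""
    return Counter(line.split()[2] for line in chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_statuses(lines, chunk_size=2):
    """Run the map step per chunk, then fold the partial results together."""
    partials = [map_chunk(chunk) for chunk in split_into_chunks(lines, chunk_size)]
    return dict(reduce(reduce_counts, partials, Counter()))


if __name__ == "__main__":
    print(count_statuses(LOG_LINES))
```

Because each chunk is counted independently, the map step could run on separate machines; only the small partial counts need to be brought together in the reduce step.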

The goal of DB2 11 is to connect DB2 with BigInsights, IBM's Hadoop-based big data platform, and to give customers a way to integrate their traditional applications on DB2 for z/OS with big data analytics. Analytics jobs can be specified in JSON Query Language (Jaql) and submitted to BigInsights, and the results are stored in the Hadoop Distributed File System (HDFS).

DB2 11 plans to integrate DB2 for z/OS with BigInsights from the database side and enable applications on DB2 for z/OS to access big data analytics. It will include the ability to submit jobs specified in Jaql to BigInsights and to access the Hadoop file system via user-defined functions (UDFs).
(Remember that a traditional table UDF requires its output schema to be specified statically at function creation time, so a different external user-defined table function would have to be written for each Hadoop file that produces a different output schema.)

DB2 11 will provide a table UDF (HDFS_READ) to read the big data analytics result from HDFS so that it can be used in an SQL query. Since the shape of HDFS_READ's output table varies from file to file, DB2 11 will also support generic table UDFs, which improve the usability of HDFS_READ.

To avoid writing a separate external user-defined table function for every Hadoop file with a different output schema, DB2 11 will implement a new kind of user-defined table function, called a generic table UDF. Its output schema is determined at query compile time rather than at function creation time. Generic table UDFs are therefore polymorphic, which increases reusability: the same table function can be used to read different Hadoop files and produce different output tables.
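
The difference between a statically typed table function and a generic one can be sketched outside of DB2. The Python below is a conceptual analogy only, not DB2's implementation or the HDFS_READ interface: the function names, the comma-separated file layout, and the sample data are all assumptions chosen for illustration. The static reader has its output columns fixed when the function is written, while the generic reader receives the output schema from each caller, much as a generic table UDF takes its output shape from the query that invokes it.

```python
import csv
import io


def read_static_orders(text):
    """Static-schema reader: column names and types are fixed in the function."""
    return [
        {"order_id": int(record[0]), "amount": float(record[1])}
        for record in csv.reader(io.StringIO(text))
    ]


def read_generic(text, schema):
    """Generic reader: the caller supplies the output schema on every call.

    schema is a list of (column_name, converter) pairs, one per field.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(schema):
            raise ValueError(f"expected {len(schema)} fields, got {len(record)}")
        rows.append(
            {name: convert(value) for (name, convert), value in zip(schema, record)}
        )
    return rows


# Two hypothetical analytics result files with different shapes.
ORDERS = "1,19.99\n2,5.00\n"
CLICKS = "alice,/home,3\nbob,/cart,1\n"

# The same generic function produces two differently shaped output tables.
orders = read_generic(ORDERS, [("order_id", int), ("amount", float)])
clicks = read_generic(CLICKS, [("user", str), ("page", str), ("hits", int)])
```

With the static approach, a second function would be needed for the click data. With the generic approach, one function covers both files because the schema travels with the call.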

Ref: IBM on DB2 v11

Author: Srini

Experienced data engineer with skills in PySpark, Databricks, Python, SQL, AWS, Linux, and Mainframe