SQL Server + Hadoop + Azure Data = PolyBase

In this blog I will introduce you to PolyBase. Instead of writing long paragraphs, I have tried to make it more interesting by presenting the topic as a sequence of questions. Please read the questions in alphabetical order.
Fig 1.0 PolyBase
A). What is PolyBase?
The PolyBase query engine integrates SQL Server with external data in Hadoop, Azure Blob Storage, or Azure Data Lake. You can import and export data as well as execute queries against it.
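As a rough sketch of what this looks like, the following T-SQL (assuming SQL Server 2016 with the PolyBase feature installed, and using hypothetical names for the storage account, container, and table) defines an external data source over Azure Blob Storage, a file format, and an external table, and then queries it like any other table:

```sql
-- A database master key is required before creating a scoped credential.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- Credential for the storage account (hypothetical identity and key).
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'blobuser', SECRET = '<storage account access key>';

-- External data source pointing at a Blob Storage container.
CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

-- File format of the external files (comma-delimited text here).
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- External table over the files; no data is moved into SQL Server.
CREATE EXTERNAL TABLE dbo.ClickStreamExternal (
    UserId     INT,
    Url        NVARCHAR(400),
    EventTime  DATETIME2
)
WITH (
    LOCATION    = '/clickstream/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = CsvFileFormat
);

-- Query the external data with plain T-SQL.
SELECT TOP (100) * FROM dbo.ClickStreamExternal;
```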
B). What is Hadoop?
Hadoop is a solution, a tool for managing Big Data.
C). What is Big Data?
Big Data describes a massive volume of structured, semi-structured, and unstructured data collected within an organization. This data is so large that it is difficult to process using common database management tools or traditional data processing applications.
D). Small Data vs Big Data?
| Small Data | Big Data |
| --- | --- |
| Data from traditional enterprise systems such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), financial data such as general ledger data, payment transaction data from websites, etc. | Data generated by systems such as point-of-sale purchase data, click-stream data from public websites, GPS mobility stream data, social media (Facebook, Twitter), etc. |
| Data size ranges in tens or hundreds of GB. | Data size ranges in TBs (1 TB = 1000 GB), PBs, or ZBs. |
| Manually entered data | Machine- or action-generated data |
| Used in Business Intelligence, analysis, and reporting. | Used for complex data mining: prediction, recommendation, pattern finding, etc. |
| Structured data | Semi-structured and unstructured data |
| Quality data | Quality of data not guaranteed |
E). What are the Types of Big Data Formats?
- Structured data within a database management system (DBMS)
- Documents (Word, Excel, PowerPoint, PDF, etc.)
- Email Messages
- Text Messages
- Text Files
- HTML Files
- Video
- Audio / Voice
- Graphics
- Log Files
- Data Files
- XML Files
- JSON Files
- Binary Files
F). What is structured data, semi-structured data and non-structured data?
| Type | Detail | Growth | Examples |
| --- | --- | --- | --- |
| Structured | Defined in a fixed data model; data is formatted and relational. | Low | Databases, data warehouses, data marts, manually entered data |
| Semi-structured | Data with a probable pattern, defined in a variable data model. | Medium | Excel, XML |
| Quasi-structured | Textual data with an arbitrary format; needs tools and effort to parse. | High | Public web, click data, GPS navigation, logs |
| Unstructured | Data with no inherent structure, usually stored as files. | Very High | Sensor data, PDFs, text files, video, audio, images |
Fig 2.0 Spectrum of Data
G). What is the need of Big Data?
The eventual objective of data analysis is to get timely insights to support decision-making.
Small Data helps you find answers to questions you already know, i.e. known dimensions. Big Data helps you find the questions beyond your known dimensions that you want to ask.
The way we analyze data is evolving day by day.
Static Report → Interactive Report → Advanced Machine Learning and Forecasting Reports
H). Where is Big Data stored?
Generally, 80-90% of Big Data is unstructured data. It is stored where a schema definition is not required at write time (and can be defined at the code level at read time) – generally in NoSQL databases.
Some of the NoSQL databases are MongoDB, Cassandra, HBase, Oracle NoSQL, Amazon DynamoDB, Couchbase, and CouchDB.
I). How to process and read Big Data?
Hadoop is a set of open-source programs and procedures. It is a Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
J). How big can Big Data be?
An organization that has Big Data may have billions to trillions of records stored in its databases, file systems, and storage units, which could amount to terabytes (1000 gigabytes/GB), petabytes (1000 terabytes), exabytes (1000 petabytes), or even zettabytes (1000 exabytes) of data.
K). What is Azure Data Lake and Blob Storage?
Azure Data Lake and Blob Storage are services for storing an unlimited amount of structured, semi-structured, or unstructured data from a variety of sources. These services do not impose limits on account sizes, file sizes, or the amount of data that can be stored.
Blob Storage is good for non-text-based files – database backups, photos, videos, and audio files – whereas Data Lake is a bit better for large volumes of text data.
L). SQL Vs Hadoop Tools?
Big data is difficult to work with using most relational database management systems, business intelligence and analytics applications, statistics tools, and visualization packages. When dealing with extremely large datasets, organizations face complications in creating, manipulating, managing, transferring, and querying the data.
SQL Server cannot read Hadoop data directly, as the data structure, data integrity, schema, query language, and storage method are all different.
M). How SQL can read Hadoop Data, Azure Data Lake Data and Blob Storage Data?
In SQL Server Parallel Data Warehouse (APS), you can query in T-SQL across structured and unstructured data, with the data stored on Hadoop clusters, Azure Data Lake, and Blob Storage.
Just imagine being able to query Big Data in a single statement, all based on the T-SQL syntax you are familiar with.
The same query can also access relational tables in your SQL Server. PolyBase enables the same query to join data from Hadoop and SQL Server.
Fig 3.0 PolyBase Flow.
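Continuing the hypothetical external table from the earlier sketch, a single T-SQL statement could join the Blob/Hadoop-resident click-stream data with a local relational table (dbo.Customers is an assumed local table):

```sql
-- Join external (Hadoop / Blob Storage) data with a local SQL Server table.
SELECT c.CustomerName,
       COUNT(*) AS PageViews
FROM dbo.Customers AS c                 -- local relational table
JOIN dbo.ClickStreamExternal AS cs      -- external table served by PolyBase
    ON cs.UserId = c.CustomerId
GROUP BY c.CustomerName
ORDER BY PageViews DESC;
```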
With PolyBase in SQL Server 2016, you also have access to Hadoop and Azure Blob Storage as data sources, regardless of your environment.
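Before such queries work, PolyBase connectivity has to be enabled at the instance level. A minimal sketch, assuming the target is Azure Blob Storage or a Hortonworks cluster (the correct option value depends on your Hadoop distribution and SQL Server version, so check the documentation for your environment):

```sql
-- Tell PolyBase which external data source type to expect.
-- Option 7 covers Hortonworks 2.x on Linux and Azure Blob Storage in SQL Server 2016
-- (the right value for your environment may differ).
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE;
GO
-- The SQL Server instance must be restarted for the change to take effect.
```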
Thanks for reading 🙂
Keep reading, share your thoughts, experiences. Feel free to contact us to discuss more. If you have any suggestion / feedback / doubt, you are most welcome.
Stay tuned on Knowledge-Junction, will come up with more such articles