In the previous chapter, we looked at how to architect database solutions that are scalable and secure. This chapter will look at several options available for architects when designing solutions that must work with large datasets for analysis and reporting.
Big data is an industry term for working with terabytes (TB) or even petabytes (PB) of data to create analytical dashboards and gain insights. Specialist tools are often required to perform this kind of processing, and it would be expensive to build and run them in your own data center.
Azure provides some of the world’s most popular data tools for loading, transforming, and analyzing data. We will examine what a data pipeline looks like and then delve deeper into some of those tools.
Specifically, this chapter will cover the following topics:
- Understanding data flows
- Comparing integration tools
- Exploring data analytics
Technical requirements
This chapter will use the Azure portal (https://portal.azure.com) for examples.
Understanding data flows
Many organizations gather massive amounts of data and continue to amass data in many different forms from various systems. This data can be used to bring great value to a company.
One example may be an e-commerce company that collects sales and marketing data from its day-to-day operations. By analyzing the data, customer patterns could be ascertained, as well as the relative success of different advertising campaigns. This information could then be used to develop the company website to create a better customer journey or to identify the strongest performing marketing activities so that these can be honed while less effective ones are dropped.
Scientific organizations also make use of data to create better treatments, drugs, and methodologies.
Manufacturers can use data from internet of things (IoT) devices and sensors to optimize supply chains, increase operational efficiencies, or identify risks in products or processes.
Data sources include sales and marketing databases, product inventories, human resources (HR), machinery, heating systems, security devices, and even personal devices that monitor health and well-being. The internet has granted many organizations the ability to make vast amounts of data available to the public, which in turn can be consumed by other companies and combined with their sources to provide even greater insights.
Big data has become big business for many organizations. The act of analyzing data can be broken down into several different areas, including the following:
- Descriptive: What is currently happening within the business or organization, a trial, a marketing campaign, and so on?
- Diagnostic: Why is this happening? What has led to the current state?
- Predictive: What is going to happen next, and how can we influence the outcome? What is the probability?
- Prescriptive: Automate and recommend responses to given events to achieve the desired outcome.
The sheer volume of data has led to a few problems, as outlined here:
- The amount of computing power required to process so much data can be costly.
- As information comes from many different sources, it can often be in various formats.
- Data can be incomplete or not follow a prescribed format.
Therefore, organizations wishing to use data for analysis have several issues to overcome. This is achieved by creating data flows, which are end-to-end processes that define the following activities:
- Ingest: Pull data in from a central location, such as a set of files, or continually receive data from a streaming source such as an IoT device.
- Clean: Ensure your data is normalized and transformed and that gaps are filled or removed from the dataset.
- Store: Once your data has been cleaned and transformed, you may want to store it for further analysis.
- Train: Investigate and explore data to derive deeper insights.
- Model: Analyze and learn from your data by applying algorithms and data maps.
- Serve: Query and visualize data.
Conceptually, a data flow pipeline may look like this:

Figure 13.1 – Data flow pipeline
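To make the stages concrete before we look at the Azure tools, here is a minimal, illustrative sketch of a data flow in plain Python. No Azure services are involved, and every name in it (the record fields, the in-memory "warehouse") is hypothetical; it simply shows ingest, clean, store, and serve as small chained functions.

```python
def ingest():
    # Ingest: raw records as they might arrive from files or a stream.
    # Note the inconsistent casing and the missing value (a "gap").
    return [
        {"product": "widget", "units": "3"},
        {"product": "Widget", "units": "2"},
        {"product": "gadget", "units": None},  # incomplete record
    ]

def clean(records):
    # Clean: normalize casing and types; drop records with gaps.
    return [
        {"product": r["product"].lower(), "units": int(r["units"])}
        for r in records
        if r["units"] is not None
    ]

def store(records, warehouse):
    # Store: persist cleaned records (an in-memory list stands in
    # for a database or data lake here).
    warehouse.extend(records)

def serve(warehouse):
    # Serve: answer an analytical query - total units per product.
    totals = {}
    for r in warehouse:
        totals[r["product"]] = totals.get(r["product"], 0) + r["units"]
    return totals

warehouse = []
store(clean(ingest()), warehouse)
print(serve(warehouse))  # {'widget': 5}
```

In a real pipeline, each of these functions would be replaced by a managed service (an ingestion tool, a transformation engine, a data store, and a query or visualization layer), which is exactly the mapping we explore in the rest of this chapter.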
Now that we know at a high level what we are trying to achieve, we can investigate the various tools available to us in Azure.