With the explosion of data in recent years and the significant decrease in processing and storage costs, processing large amounts of data has become a necessity. Likewise, the usual approach of connecting to a data source, loading the data into memory, and working with it locally has become challenging because of the sheer size of the data. Therefore, computing capacity is distributed, and each compute node is connected to a chunk of the data. Computation then runs in parallel across all the chunks that make up the full dataset, and finally the results are aggregated.
Out of those ideas, the Hadoop platform came into existence, and it was left to the community and commercial software companies to take it further, building one of the richest ecosystems any platform has ever had. One of those "better versions" of Hadoop is Spark, which builds on the underlying Hadoop architecture and adds the power of keeping data in memory. A big data cluster with many nodes can process data in parallel and take advantage of the distributed architecture to solve problems that used to take far too long on single machines, or sometimes weren't possible at all.
In this blog, we'll take a deep dive into a commercial distribution of Spark called Databricks. At the time of writing, Databricks is only available as a cloud service, and this blog focuses on Databricks on Microsoft's cloud platform, Azure. We'll examine where the tool sits within the overall architecture and look at the surrounding Azure ecosystem that, together with Azure Databricks, forms an end-to-end architecture for advanced analytics, machine learning, and real-time analytics solutions.
In this post, we'll cover the following topics:
Understanding Databricks on Azure
Provisioning your first workspace
Databricks Notebooks
Understanding Databricks on Azure
Databricks' workspace and clusters can be provisioned from the analytics services in the Azure portal; the entire platform is embedded within the Azure ecosystem, which makes it easier to start clusters, run jobs, and use the output in any shape or form. Unlike other big data implementations, the entry point to the service is through the workspace, and this is helpful and more productive for the following reasons:
Provisioning a workspace is free; cost is only incurred when clusters are running to process data. The cost of Azure Databricks covers the compute power of the chosen nodes plus the Databricks runtime cost.
The workspace acts as an entry point to the service, allowing the creation of multiple clusters (dev, test, and production) and multiple environments for different workloads such as data processing, machine learning, or streaming. All clusters can be set to terminate automatically after a period of inactivity and can auto-scale based on the workload.
Databricks' integration with Azure Blob Storage or Azure Data Lake Storage Gen2 makes the architecture sounder by decoupling the processing power from the storage. This is very important if the data needs to be used by other tools and services, or if the compute cluster is destroyed to reduce cost because it's not being used; in all cases, the data is preserved. In the next post, there is a deep dive into the architecture and how Azure Databricks sits at the center of an end-to-end analytics platform.
Through the rest of this post, we'll look at more reasons why Azure Databricks is a more productive and easier-to-use service for data analytics, data transformation, and machine learning.
Why is Azure Databricks a better Spark flavor?
In order to understand the advantages of Azure Databricks over the vanilla Spark installation or other Spark distributions, we need to have an overview of the architecture of Azure Databricks. The following diagram highlights this architecture, with a focus on a few advantages that Azure Databricks provides out of the box:
Azure Databricks features overview
Collaborative workspace
Looking at the architecture, the first thing we notice is the target audience for Azure Databricks. Who are the typical users of this service? We have listed the three main audiences:
Data Engineers: Data engineers are responsible for extracting, loading, and transforming the data. They also ensure that the correct pipelines of data are in place for the various data wrangling and changes required for the different use cases.
Data Scientists: Data scientists are responsible for building machine learning models, taking care of the whole cycle: cleaning the data, engineering features, selecting candidate models, fitting them to the data, and scoring them to find the model that best represents the problem.
Business Analysts: Business analysts are usually the domain experts for the problem and have a business background related to the solution. They can use Databricks to test the model's accuracy and relevance to the domain. They can also connect visualization tools like Microsoft PowerBI to the data models and test the outcome to ensure it makes sense to the business.
Aside from research-focused establishments, if the platform is meant to serve a business entity in any industry (financial, oil and gas, telecommunications, public sector, and so on), the three functions mentioned above (data engineers, data scientists, and business analysts) are required to deliver a successful advanced analytics solution. More insights on those functions will come in the next post.
The Databricks workspace facilitates the collaborative work of the three functions, putting security and access governance in place in the same area where data engineers wrangle data, data scientists build models, and business analysts report on the transformation and data modeling. This comes out of the box from Azure Databricks and doesn't need any installation of tools.
Production jobs and workflow
Another important and helpful feature of any advanced analytics solution is the ability to easily schedule and move code to production. Databricks provides the following support:
· Multi-stage pipelines: As the code in Databricks is organized into cells in notebooks (Databricks uses its own notebook format, similar to Jupyter Notebooks but with some special features; we'll look at it later in this post, in the Databricks notebooks section), a pipeline can be built from multiple notebooks to organize the stages of data manipulation, curation, and machine learning.
· Job scheduler: Azure Databricks lets engineers schedule notebooks to run at a certain time in an easy and friendly way, without the need to install separate scheduling tools; a sketch of defining such a job through the Jobs API follows this list.
· Notifications and logs: No production system is complete without notifications and logs; Azure Databricks provides them out of the box so users can monitor clusters and notebook runs.
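To make this concrete, here is a minimal sketch of creating a scheduled notebook job through the Jobs REST API (2.0) using Python; the workspace URL, token, notebook path, and VM size are placeholders you would replace with your own values:

# A hedged sketch: schedules a notebook to run nightly at 02:00 UTC and
# sends an email on failure. All placeholder values must be replaced.
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                               # placeholder

job_spec = {
    "name": "nightly-data-prep",
    "new_cluster": {
        "spark_version": "6.1.x-scala2.11",   # pick a runtime key available in your workspace
        "node_type_id": "Standard_DS3_v2",    # placeholder VM size
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Users/<user>/Introduction/data_prep"},
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
    "email_notifications": {"on_failure": ["<user>@example.com"]},
}

response = requests.post(
    f"{workspace_url}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
response.raise_for_status()
print(response.json())  # returns the new job_id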
Optimized Databricks runtime
The optimization in the Databricks runtime compared to other Spark distributions provides a variety of additions, such as the following:
Databricks I/O: Provides up to 10x faster access to big data stored in the data lake.
Apache Spark: 75% of the code committed to the Spark project comes from Databricks. Thanks to the wide Spark community, anything you find that runs on Spark will also run on Databricks.
Serverless: Databricks provides multi-user, auto-configured clusters that are shareable yet fault-isolated. The Azure platform enables swift cluster creation and resizing to provide a better experience.
REST APIs: Databricks clusters, notebooks, jobs, and other resources are accessible through REST APIs. This helps a lot with automation and with building data pipelines that contain not only Databricks stages but also other tools chained into the pipeline; a short example follows below.
These enterprise-grade features are key to ensuring that testing a tool, building a proof of value, and moving from the proof-of-concept phase to the production phase are easy steps for a real solution that serves a large number of users and works on petabytes of data.
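As a small example of the automation angle, the sketch below lists the clusters in a workspace through the REST API; the workspace URL and personal access token are placeholders:

# A hedged sketch: lists all clusters in the workspace and prints their state.
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                               # placeholder

response = requests.get(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()

for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])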
Integration with Azure
The seamless integration with the different Azure components and platforms is key for the speedy development and adoption of Azure Databricks. We will dive deeper into this integration throughout the blog, in the posts related to data engineering and data science, but a few of the advantages are worth mentioning here:
Security: Out-of-the-box integration with Azure Active Directory (AAD) helps a lot with security and planning authorization for Databricks administration and usage without the need to carry different access schemas of usernames and passwords or to install tools to allow Active Directory integration. The first landing screen of the workspace will ask for AAD authentication without any special configuration.
Data access: As mentioned before, the seamless integration between Azure Databricks and the data lake makes it very easy to put data in the data lake and access it through Databricks for manipulation; a short example follows this list.
Reporting and database tools: From Azure Databricks, you can move data to reporting tools like PowerBI, data warehousing tools like SQL Data Warehouse, or NoSQL tools like CosmosDB. These integrations are easy to implement and help a lot with enriching advanced analytics solutions.
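To illustrate the data access point, here is a minimal sketch of reading a CSV file from Azure Data Lake Storage Gen2 in a notebook; the storage account, container, file path, and access key are placeholders, and in a real project the key would come from a secret scope rather than being hard-coded (the spark object is available automatically in a Databricks notebook):

# A hedged sketch: configure access with an account key and read a CSV
# from ADLS Gen2 into a Spark DataFrame. All names are placeholders.
storage_account = "<storage-account>"
container = "<container>"
access_key = "<access-key>"

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    access_key,
)

df = (
    spark.read
    .option("header", "true")
    .csv(f"abfss://{container}@{storage_account}.dfs.core.windows.net/raw/sales.csv")
)
df.show(5)  # preview the first rows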
In this section, we discussed some of the features that make Azure Databricks a better Spark solution; throughout the blog, a complete view will be built of how Azure Databricks makes advanced analytics solutions easier to develop and faster to productize.
Provisioning your first workspace
To get our hands on Azure Databricks, nothing is better than creating a workspace and playing around with it. To be able to create an Azure Databricks workspace, you need to have a valid Azure subscription. Navigate to https://portal.azure.com/ to log in to the Azure portal and, after authentication, the portal will direct you to the home page, where you can create or manage any service.
You can get a free trial account for Azure that, at the time of writing this blog, is valid for 12 months, with $200 of Azure credit valid for 30 days and some services that are always free while the account is valid. You can sign up at the following URL: https://azure.microsoft.com/en-us/free
After successful authentication on the Azure portal, navigate to Create a resource; it can be found on the home page or in the left menu.
After clicking on it, search for Azure Databricks and select it from the marketplace. Clicking on it will open the service overview page with some details.
Clicking on the Create button will open the initial workspace configuration blade. You'll need to input some information, which we'll discuss now. The blade will look like the following screenshot:
Azure Databricks workspace blade
As seen in the screenshot, you need to input some details to create the Azure Databricks workspace. The fields are as follows:
Workspace name: Type the name of the workspace – it must be more than three characters. We will call our workspace mainwrkspc.
Subscription: This is a dropdown list of the valid subscriptions that this user has access to.
Resource Group: A resource group (RG) in Azure is a virtual collection of services for organization purposes. Services can be organized according to projects, departments, or types of environments, such as development, test, and production. We will call our RG analytics.
Location: This is the Azure datacenter location where the service will be provisioned; we will keep it as West US 2. For a complete list of Azure regions and service availability, refer to https://azure.microsoft.com/en-us/global-infrastructure/services/
Pricing Tier: Here, you need to choose between Standard or Premium services. Premium provides more fine-grained security features with role-based access to clusters, notebooks, and jobs along with virtual network (VNet) injection. We will choose Premium.
Deploy Azure Databricks workspace in your own Virtual Network (VNet): This setting is for securing the connection through a virtual network; we will keep it as No for now.
After entering these details, we click Create to provision the services. Azure takes a few minutes to get the workspace ready for access.
After provisioning the service, click on Go to Resource; this will direct you to the main Azure Databricks blade with some documentation and the main information panel. Finally, click the Launch Workspace button; this will take you to the workspace.
Navigating through the workspace
After clicking on the Launch Workspace button, the browser directs you to the main workspace landing page. But first, it requires authentication against the user directory; after a successful login, you will land on the workspace home page, where the main navigation starts. The home page will look like this:
Azure Databricks workspace home page
Conveniently, you will find the main three panels on the home page for common tasks, recent notebooks and files, and tool documentation links. On the left, you'll find a navigation pane with tabs for the following components:
Home: opens a blade with all users. Every user has their own folders, libraries, and notebooks accessible from the navigation tree.
Recents: shows the most recently accessed files; a very good place for users to pick up from where they last stopped.
Data: opens databases and tables; it's only active when a cluster is running.
Clusters: where users manage clusters by creating, updating, and deleting them.
Jobs: opens a view of the currently scheduled jobs and their status, with the ability to create new jobs.
Search: an easy way to locate a file within the workspace.
The icon of the user in the top-right corner of the screen can be used to manage users, access, and permissions.
Thus, most of the common tasks can be easily accessed from the home page of the workspace.
Directories
Every user with access to this workspace will have a major branch under their username. From this top branch, the user can start creating folders to organize work and carry out some of the tasks, such as the following:
· Creating notebooks, libraries, subfolders, or MLflow experiments, which will be covered later in this blog
· Cloning, importing, and exporting the content of this folder
· Controlling permissions on the content, granting or revoking access for other users
Similar to a local directory on a machine, it's important for notebooks and libraries to be organized in an easy and understandable way. A small advanced analytics project will contain some data extraction and transformation pipelines and machine learning models. In any project, many data scientists and data engineers will be creating multiple notebooks and importing different libraries, hence the project will quickly grow in terms of the number of artifacts. Organizing these artifacts is extremely important to help with testing and, later, with packaging the deployment using the correct models, notebooks, and libraries.
Creating the first cluster
For users to start executing code, importing libraries, and training models, compute power must be available. That compute power comes from provisioning clusters. Databricks clusters consist of two types of virtual machines (VMs):
The driver node (head node) is the main node. It manages the cluster: it coordinates access, manages tasks, distributes execution, and manages resources. The driver node maintains the status of all worker nodes and holds the Spark context.
There can be anywhere from one worker node (data node) to as many as the user requires. Workers are the nodes that execute the distributed workload through Spark executors and load data into their memory for fast execution.
Cost is incurred on the cluster nodes while the cluster is running. The cost is divided into the compute cost of the VMs and the cost of Databricks units (DBUs). If a cluster is terminated, no cost is incurred for keeping the workspace or maintaining the notebooks.
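As a back-of-the-envelope sketch, the running cost can be thought of as the VM cost plus the DBU cost, summed over all nodes and hours; the rates below are purely illustrative placeholders, not actual Azure or Databricks prices:

# A hedged sketch of the cost model:
# cost = nodes * hours * (VM rate + DBUs per node-hour * DBU rate).
# All numbers are illustrative placeholders, not real prices.
def estimate_cluster_cost(nodes, hours, vm_rate_per_hour, dbus_per_node_hour, dbu_rate):
    vm_cost = nodes * hours * vm_rate_per_hour
    dbu_cost = nodes * hours * dbus_per_node_hour * dbu_rate
    return vm_cost + dbu_cost

# Example: 3 nodes running for 2 hours with placeholder rates.
print(estimate_cluster_cost(nodes=3, hours=2,
                            vm_rate_per_hour=0.50,     # placeholder VM rate
                            dbus_per_node_hour=0.75,   # placeholder DBU consumption
                            dbu_rate=0.40))            # placeholder DBU price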
Now, let's create the first cluster to understand the different parameters and how to configure them. If you click on the Clusters shortcut in the workspace on the left, a list of all available clusters will open. As we don't have any clusters yet, it's no surprise that this list is empty.
Before creating a new cluster, let's discuss the two types of clusters. Interactive clusters are created through the UI, the CLI, or the REST APIs; they are used for development and support multi-user access. Automated clusters, in contrast, are created automatically to execute jobs and are terminated when the job finishes.
Clicking on the Create Cluster button will open a screen to configure an interactive cluster. The screen will be similar to the below screenshot:
Cluster configuration screen
The required fields to be filled are as follows:
Cluster Name: We can use any name; in this case, we chose dev_test.
Cluster Mode: There are two cluster modes, Standard and High Concurrency. For development and testing, Standard is recommended, and that's what is chosen in the screenshot. The High Concurrency mode provides more fine-grained resource sharing for better utilization and minimum query latency, making it more suitable for production workloads. High Concurrency clusters also provide a separate environment for each user to create fault isolation.
Pool: A cluster pool is a predefined set of idle VM instances that expedites cluster provisioning.
Databricks Runtime Version: The Databricks runtime is the core component of Databricks, built on the Spark engine, and there are multiple runtime versions to choose from. Databricks Runtime 6.0 and higher depends on Python 3, so you can't change the Python version; for runtime versions prior to 6.0, users can choose between Python 2 and 3. The Databricks runtime also defines the cluster's capabilities through the libraries pre-installed on it; for the ML runtimes (such as 6.0 ML), machine learning libraries like TensorFlow and Keras are preinstalled on the cluster's worker nodes. We will start with a data engineering cluster, so we will choose the 6.1 runtime without ML.
The Autopilot options are very helpful in dev and test scenarios; two main controls can be applied to the cluster automatically:
Autoscaling, where the cluster automatically scales between the minimum and the maximum number of nodes based on resource utilization and the intensity of the queries and workload.
Termination after a specific period of inactivity, which helps with cost reduction once development has finished. We set 120 minutes, so after 2 hours of inactivity the cluster will be terminated. The configuration remains available, but the cluster won't be running, so the next time we need to execute code on it, we'll have to start it again.
Worker Type and Driver Type are the sizes of the machines used, in terms of memory and cores, and whether or not GPU machines are used. The machine size determines the compute cost: more powerful machines cost more. The dropdown provides a list of the available machine sizes to be used as driver and worker nodes.
In most cases, we can simply hit the Create Cluster button if there's no need for advanced configurations, which the next section looks at in more detail. For reference, the same configuration can also be expressed through the REST API, as sketched below.
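Here is a minimal sketch of creating the same dev_test cluster through the Clusters REST API (2.0); the workspace URL, token, and VM size are placeholders, and the field names mirror the UI options described above:

# A hedged sketch: create an autoscaling cluster with 120-minute auto-termination.
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                               # placeholder

cluster_spec = {
    "cluster_name": "dev_test",
    "spark_version": "6.1.x-scala2.11",     # Databricks Runtime 6.1 (no ML)
    "node_type_id": "Standard_DS3_v2",      # placeholder VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 120,         # terminate after 2 hours of inactivity
}

response = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json()["cluster_id"])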
Advanced cluster configurations
There are some advanced configurations that can be carried out on the provisioned clusters. The advanced configuration input is found at the bottom part of the create cluster screen and it looks like this:
Advanced options in cluster creation
Let's discuss the components of this screen:
Spark Config and Environment Variables: This is where you set custom Spark configuration properties for the cluster and define environment variables that will be available on all of its nodes.
Tags can be associated with any Azure service and help with reporting on Azure usage. Creating and associating specific tags with each project makes it possible to report on the usage, or the cost, of everything carrying those tags, delivering the ability to quantify how much a project costs from Azure's perspective. Tags can be defined per type of workload, project, or department, so the Azure cost can be attributed to the solution or the team using this cluster.
Logging: Users can define a location to which all the driver and worker node logs are pushed. Logs are pushed every five minutes; if the cluster shuts down, all the logs are copied to the chosen destination.
Init Scripts are the shell scripts that run on the machines during startup, before the worker JVM or the Spark drivers are loaded. They're used to install packages outside of Spark or Databricks that are required for some libraries or specific functionalities.
For this example, we will keep the advanced options without any customization; throughout the blog, there'll be some cases where we define these settings, especially the Init Scripts part, to install libraries on the worker nodes. A small sketch of how an init script can be placed on DBFS follows.
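As a preview, here is a minimal sketch of writing such an init script to DBFS from a notebook cell; the script contents, path, and package name are illustrative placeholders, and the script would then be referenced in the cluster's Init Scripts settings (dbutils and display are available automatically in a notebook):

# A hedged sketch: write a small init script to DBFS that installs an extra
# Python package on every node at cluster startup. Paths and package are placeholders.
dbutils.fs.put(
    "/databricks/init-scripts/install-extras.sh",
    """#!/bin/bash
/databricks/python/bin/pip install beautifulsoup4
""",
    True,  # overwrite if the file already exists
)

# Verify the script landed where we expect.
display(dbutils.fs.ls("/databricks/init-scripts/"))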
Hitting Create Cluster will instruct the Databricks service infrastructure to provision the driver and the minimum number of workers and bring the cluster to an online state, ready to execute commands and jobs.
Databricks notebooks
Notebooks are a web-based interactive environment for writing code, visualizing data, and writing documentation text. While Jupyter notebooks are among the most commonly used in the Spark community, Databricks notebooks take this further with more sophisticated functionality to boost productivity.
To create a notebook, open the workspace blade by clicking on the Workspace link in the left-hand pane. A list will appear showing options for Documentation, Release Notes, Training & Tutorials, Shared, and Users.
Clicking on Shared opens the notebooks, libraries, and folders shared between users, and clicking on Users opens a list of all users. Depending on the security administration, users can have access either to their own username only or to other users as well. Even from this very first listing, users can click on the down arrow, choose Create, and then click on Notebook, as shown in the below screenshot:
A popup will appear asking you to define a name for the notebook and its default coding language: Python, R, SQL, or Scala. You'll also need to define the cluster to attach the notebook to after creation.
It's important to keep the organization of the notebooks similar to the organization of code files for any language on a hard drive. Don't create notebooks in the root; create descriptive folders and then create the notebooks inside them. Projects easily grow large in terms of the number of notebooks, and keeping the code tree tidy helps to maintain a large project.
Similar to creating notebooks, users can create folders. As shown in the screenshot, it's possible to import or export either a single notebook or a large set of notebooks organized together, as well as to define permissions and sort all child artifacts for visibility.
The language selected when creating a notebook is associated with the notebook as its default coding language. Still, it's easy to mix multiple languages in the same notebook using the language magic commands, as shown here:
%python allows you to execute Python code in a notebook (even if that notebook is not in Python).
%sql allows you to execute SQL code in a notebook (even if that notebook is not in SQL).
%r allows you to execute R code in a notebook (even if that notebook is not in R).
%scala allows you to execute Scala code in a notebook (even if that notebook is not in Scala).
%sh allows you to execute shell code in your notebook.
%fs allows you to use Databricks Utilities (dbutils) filesystem commands.
%md allows you to include rendered markdown.
In the same notebook, each cell can be written in the desired language by starting the cell with the corresponding % magic command, as highlighted above. If no magic command is given, the cell is executed in the notebook's default language.
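For example, the %fs magic maps to the dbutils.fs utilities in Python; the small sketch below lists the root of the Databricks filesystem from a Python cell and should work on any running cluster:

# Equivalent to the %fs ls / magic command: list the DBFS root.
for entry in dbutils.fs.ls("/"):
    print(entry.path, entry.size)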
Writing the first notebooks
Let's create a folder under the logged-in username by navigating to Workspace > Users > your logged-in username.
Then click on Create and choose Folder in the popup; let's name the folder Introduction.
Then navigate to this folder and click Create again to create a notebook named hello_world, keeping the default language as Python. The attached cluster is our dev_test cluster created earlier.
The resulting notebook will look like the following image, with multiple functionalities that we will cover briefly because some of these functionalities will be discussed in detail in later posts.
Below the name of the notebook and the default language (mentioned between brackets), there's a drop-down of the available clusters. This drop-down must be set to an active cluster to be able to execute any command. If the cluster isn't online, meaning it's not running, as in the above image, no code will be executed until we bring this cluster, or any other cluster from the drop-down, into the running state by doing either of the following:
· Clicking on the drop-down under the desired cluster and clicking on the start cluster link
· Clicking on Clusters from the left pane to open the clusters screen and selecting the cluster and clicking on the play arrow button to start the cluster
Either of the two options will fire a request to start the cluster. It will take a few minutes. Once it's ready, the gray dot will turn green. Now we can execute the code.
Let's get familiar with the notebook environment and its capabilities. As most of our development will occur in the notebook, it's important to understand all options available. Let's start with the File menu as shown in the below screenshot:
The File menu contains some basic functionalities for the notebook, such as cloning, renaming, exporting, and other options similar to the File menu in your regular Microsoft Word application.
As an Azure Databricks notebook contains the results below each cell, the View menu contains options to allow us to choose whether to see the code or only the results organized in some way.
Permissions set the access rules for this notebook.
As each cell can run by itself, the Run All option runs all cells in the regular top to bottom order, meaning it executes the full notebook.
Clear can be used to clear the cell's results or the notebook's state.
On the right-hand side, functionalities start with the keyboard icon, which shows all important keyboard shortcuts in command mode or edit mode.
Though the most common shortcut to be used is CTRL+ENTER, which executes the highlighted cell, it's useful to have a look at the keyboard shortcuts to get used to them and speed up the coding routine.
Schedules set a schedule to run the full notebook, which is very useful in cases of data manipulation, processing new data, or retraining machine learning models.
Comments are used to write comments on code in cells, which is useful for collaboration and is similar to commenting in Word documents.
Runs shows the last notebook runs, and Revision history shows the history of the notebook's versions. Clicking on Revision history reveals a link just under it reading Git: Not linked. Clicking on that link opens one of the most useful additions to Azure Databricks notebooks: linking your notebook to a Git repository to establish source control for your notebooks. A Git repository can be defined so you can push the code and easily establish DevOps practices during the development phase.
This notebook can be found on GitHub: https://github.com/WagdyIshac/Advanced-Analytics-with-Azure-Databricks/blob/master/Chapter1.ipynb
In the first cell, let's write the following code:
print('I love Azure Databricks')
Then, hit CTRL+ENTER while the blinking cursor is active in this cell. This will execute it in the notebook's default language, which is Python, and print the expected result under the cell:
I love Azure Databricks
Given that the execution succeeded, under the results there's some small print showing the total time it took to execute this command on the cluster, the user who executed it, and when the results were produced.
Under the executed cell there's a plus icon that we can click on to open another cell (or we can type ALT+ENTER in the first cell) and, using the magic commands, we can achieve the same print result, but with Scala this time, using the following code:
%scala
println("I Love Azure Databricks")
Executing the cell will show, as expected, the same results but this time using Scala. We can do the same with SQL by using the below code; it will produce the same message but as a table:
%sql
SELECT "I Love Azure Databricks" as message
We also notice that a Spark job was fired to run the SQL command. The output is a table, as shown in the screenshot below, with one column named message and one row showing I Love Azure Databricks.
"A Spark job is fired" means that the worker nodes executed the command, but when the driver node executes the command without using the worker nodes, then no Spark engine job is fired and the driver can handle the request as a single machine.
Returning to the SQL output: as the results are shown in a table, the cell gains additional functionality for this table, such as downloading the data or visualizing it as a bar chart or any other chart type available in the dropdown. These options are of little use in this trivial case, but when working with datasets in tabular format, this feature is very helpful for quickly visualizing or downloading data:
The screenshot shows the buttons for tables, output, and the number of Spark jobs executed, with details to drill down on what happened in this job.
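The same table and chart controls appear for any DataFrame rendered with the display() function, so a quick way to experiment with them is a small sketch like this:

# display() renders a DataFrame as an interactive table in the notebook,
# with the download and chart options described above.
sample = spark.range(10).withColumnRenamed("id", "value")
display(sample)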
Let's finalize this very basic notebook by documenting what we have done in a markdown cell, using the following code:
%md
This is my first notebook using the magic command
This statement will be shown as a rendered markdown cell. Markdown cells are often used to describe the cells that follow or to comment on code; think of them as a rich text editor in the middle of your code file. In a markdown cell you can write the following (a short example follows the list):
Different font sizes
Numbered and bulleted lists
Images
Tables
Regular HTML
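As a small illustration (the content is just placeholder text), a markdown cell combining a header, emphasis, and a bulleted list could look like this:

%md
# Data preparation notes
This notebook **loads** the raw data and cleans it before modeling:
- Load the raw CSV files
- Remove duplicate rows
- Save the curated table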
Libraries
Similar to any coding language, the use of libraries is very important, and since you can write in Python and R, you will want to use pip or CRAN libraries, or a JAR file. Similar to creating a new notebook or folder, clicking on Create and then Library will open the below screen to define the library parameters:
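Alongside the Library screen, libraries can also be installed from within a notebook. On Databricks Runtime 6.x, which is what we use in this post, one option is dbutils.library; the package name below is just an example (on newer runtimes, the %pip install magic is the recommended approach):

# A hedged sketch: install a PyPI package on the cluster from a notebook
# (Databricks Runtime 6.x), then restart Python so the package is importable.
dbutils.library.installPyPI("beautifulsoup4")
dbutils.library.restartPython()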
Summary
In this post, we learned some of the basic concepts of Azure Databricks. We saw why it's a better Spark flavor and looked at the main features Databricks comes with that boost the development of advanced analytics solutions.
We also learned how to create a Databricks workspace, what you can do in a workspace, how to create a cluster for code execution, how to write code in notebooks in different languages, and how to upload libraries to help with the code.
Before we deep dive into coding data engineering-related scenarios in the later posts, in the next post, we'll explore the concepts of building a large, advanced analytics solution with Azure Databricks. We will look at some of the concepts and ideas that are helpful for creating architecture governance across the organization that allows the development of multiple analytics use cases on the same platform and building frameworks for collaboration between data engineering and data science capabilities.