SAP Data Hub
With SAP Data Hub, companies bring together data from different sources and formats to extract valuable knowledge. On this page, we explain in detail what is behind the platform.
What is SAP Data Hub?
SAP Data Hub is a platform on which data streams from different sources are merged. In this context, one often speaks of building data pipelines through which data flows unhindered. Possible data sources include ERP systems, data warehouses and big data lakes (large repositories of unstructured data). As the central management layer for data landscapes, SAP Data Hub treats all data the same, regardless of its origin. The software can integrate and manage data and then transfer it to other applications such as analysis tools. SAP Data Hub also enables the management of metadata.
The merging and processing of big data is becoming increasingly important in view of Industry 4.0. In this article, you will learn how companies can benefit from SAP Data Hub.
What is SAP Data Hub used for?
SAP Data Hub is primarily aimed at companies that want to extract deeper insights from their data despite complex data landscapes. According to an SAP study from 2018, this applies to 86 percent of companies. The German-speaking SAP User Group e. V. (DSAG) also confirms in its 2019 investment report that "Big Data" is one of the current top 3 digitization topics among its members. With SAP Data Hub, the Walldorf-based software maker takes this into account. The platform's overarching goal is to realize an intelligent (data-driven) organization of data from ERP and other systems, providing users with reliable data in the correct context at all times.
The most important use cases of SAP Data Hub are summarized as follows:
- Building data pipelines
- Orchestration of complex data processes across system boundaries
- Data acquisition and processing, e.g. from ERP systems
- Setup, operation, management and control of complex data landscapes
- Metadata management
- Data discovery
- Data governance
Let us take a closer look at these use cases below.
Building data pipelines
A central element of SAP Data Hub is the data pipeline, which can extend across data lakes (e.g. based on Hadoop), object storage (e.g. Amazon S3, relevant for IoT sensor data, among other things), cloud databases, local databases and data warehouses. The solution thus covers an organization's entire data landscape and data flows. Developers have the option of building various pipeline models that retrieve, harmonize, transform and process information from a wide variety of sources. In addition, various functions and processes can be built directly into the data pipelines, including machine learning technologies such as TensorFlow and libraries for calculations.
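The idea of a pipeline as a chain of retrieve, harmonize and transform steps can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the SAP Data Hub pipeline API; all function and field names here are made up for the example.

```python
# Illustrative sketch (not the SAP Data Hub API): a data pipeline as a chain
# of small, composable stages, each taking and returning a list of records.

def retrieve(source):
    """Pull raw records from a source (here: an in-memory stand-in)."""
    return list(source)

def harmonize(records):
    """Normalize field names so records from different systems look alike."""
    return [{"id": r.get("id") or r.get("ID"), "amount": r.get("amount", 0.0)}
            for r in records]

def transform(records):
    """Apply a business transformation, e.g. a 10 percent uplift."""
    return [{**r, "amount": round(r["amount"] * 1.1, 2)} for r in records]

def run_pipeline(source, *stages):
    """Feed the source through each stage in order."""
    data = retrieve(source)
    for stage in stages:
        data = stage(data)
    return data

# Records from two systems with inconsistent field names:
erp_rows = [{"ID": 1, "amount": 100.0}, {"id": 2, "amount": 50.0}]
result = run_pipeline(erp_rows, harmonize, transform)
```

In the real product, such stages are modeled graphically rather than coded by hand, but the data flow principle is the same.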
Orchestration of complex data processes across system boundaries
As part of the orchestration, SAP Data Hub can be used to create workflows including monitoring and analysis functions for the data landscape. The aim here is to map and execute so-called end-to-end data processes. These begin with the collection of data from the source (e.g. data lake or ERP system), include data processing and data flow and finally end with the provision or integration of the resulting data into applications and business processes.
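An end-to-end data process of this kind (collect, process, deliver, with monitoring of each step) can be pictured as follows. This is a minimal sketch under assumed names, not SAP Data Hub's actual workflow engine.

```python
# Illustrative sketch (not SAP Data Hub's workflow engine): an end-to-end
# data process as an ordered list of named steps, with a status log per step.

def collect():
    return [3, 1, 2]          # 1. collect data from the source

def process(data):
    return sorted(data)       # 2. process / transform the data

def deliver(data):
    return {"rows": data}     # 3. hand the result over to an application

def orchestrate(steps):
    """Run the steps in order, recording an 'ok'/'failed' entry for each."""
    log, payload = [], None
    for name, step in steps:
        try:
            payload = step(payload) if payload is not None else step()
            log.append((name, "ok"))
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            break
    return payload, log

result, log = orchestrate([("collect", collect),
                           ("process", process),
                           ("deliver", deliver)])
```

The log corresponds to the monitoring view: every step of the end-to-end process is visible, and a failure stops the chain with a recorded reason.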
Data collection and processing
Another important task of the data hub is to ingest large amounts of structured and unstructured data, as well as data streams, for example from data lakes. Users are supported by ready-made functions for data integration, cleansing, enrichment, masking and anonymization. In addition, function modules are available for monitoring data quality and governance. Furthermore, integration with the SAP solutions SAP HANA Smart Data Integration, SAP Data Services and SAP BW is possible.
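To make masking and anonymization concrete, here is a small sketch of both techniques. The field names and rules are invented for the example; they are not the delivered SAP function modules.

```python
# Illustrative sketch of two data protection techniques mentioned above;
# field names and rules are made up, not SAP-delivered modules.
import hashlib

def mask_iban(iban):
    """Masking: keep only the last four characters visible."""
    return "*" * (len(iban) - 4) + iban[-4:]

def anonymize(value, salt="demo-salt"):
    """Anonymization: replace an identifier with a stable pseudonym
    (salted hash), so joins still work but the name is gone."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

record = {"customer": "Jane Doe", "iban": "DE89370400440532013000"}
cleaned = {"customer": anonymize(record["customer"]),
           "iban": mask_iban(record["iban"])}
```

Masking hides part of a value for display, while anonymization replaces it entirely; which of the two is appropriate depends on the downstream use of the data.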
Setup, operation, management and control of complex data landscapes
Data landscapes of companies today are extremely complex and fragmented. SAP Data Hub brings together the distributed components of corresponding landscapes in a central view. This gives data managers complete transparency of data processes across all connected components. Adapters are supplied to establish connections to the relevant data sources.
If required, the data landscape can be divided into specific areas with their own guidelines and service levels (for example, production and test environment). Functions for access control and data security are also available.
Metadata management
SAP Data Hub has its own tool for managing and controlling metadata: the SAP Data Hub Metadata Explorer. This tool collects information such as attributes, storage location, quality, and confidentiality of data. This transparency enables informed decisions on questions such as:
- Which datasets should be published?
- Who should have access to the data?
- Is the data source authentic (genuine)?
- Are data protection regulations being complied with?
- How are access rights, as well as access, changes, origin and use of data, logged?
Thus, the Metadata Explorer is an important component of Data Governance. However, it can also be used to generate a data preview, create indexes of the content and add keywords to make searching for records easier.
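Conceptually, such a metadata tool is a catalog of attributes per dataset plus a keyword search over it. The following is an illustrative sketch of that idea, not the Metadata Explorer itself; all names are invented.

```python
# Illustrative sketch (not the Metadata Explorer): a tiny metadata catalog
# that stores attributes per dataset and supports keyword search.

catalog = {}

def register(name, location, quality, confidential, keywords):
    """Record metadata for a dataset: location, quality, confidentiality, tags."""
    catalog[name] = {"location": location, "quality": quality,
                     "confidential": confidential, "keywords": set(keywords)}

def search(keyword):
    """Find all datasets tagged with a keyword, sorted by name."""
    return sorted(n for n, meta in catalog.items() if keyword in meta["keywords"])

register("sales_2023", "hdfs://lake/sales", quality=0.97,
         confidential=False, keywords={"sales", "revenue"})
register("patients", "s3://bucket/med", quality=0.88,
         confidential=True, keywords={"health", "revenue"})

hits = search("revenue")
# Governance decision based on metadata: only non-confidential hits may be published.
public = [n for n in hits if not catalog[n]["confidential"]]
```

The point is that publication and access decisions are driven by the metadata, not by inspecting the underlying data itself.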
Data Discovery with SAP Data Hub
Another use case of SAP Data Hub is data discovery, that is, the recognition of patterns in large amounts of data. To do this, the data is searched automatically using the tools provided. Identified data elements can also be marked. The "discovered", that is, relevant data can then be made available for further use (for analyses, for example). All in all, this approach helps to filter out valuable information from Big Data.
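A very simple form of such automated discovery is scanning records for known patterns and marking the hits. The sketch below uses a regular expression for e-mail addresses as a stand-in; the real tooling is far richer, and the function names here are invented.

```python
# Illustrative sketch of data discovery: scan free-text records for a known
# pattern (here: e-mail addresses) and mark where the matches occur.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def discover(records):
    """Return (record_index, match) pairs for every pattern hit."""
    hits = []
    for i, text in enumerate(records):
        for match in EMAIL.findall(text):
            hits.append((i, match))
    return hits

docs = ["contact: alice@example.com", "no personal data here",
        "bob@test.org wrote this"]
found = discover(docs)
```

The marked hits (the "discovered" data) can then be routed into analyses or into governance processes, as described above.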
Data Governance with SAP Data Hub
Data Governance describes holistic data management intended to ensure the availability, usability, integrity and security of data. SAP Data Hub provides suitable tools for this purpose as well, notably the Metadata Explorer and the access control and security functions described above.
In which scenarios is SAP Data Hub particularly recommended?
In principle, SAP Data Hub is suitable for all companies that want to optimize how they handle data in and around their ERP landscape. However, there are a number of scenarios in which its use is particularly recommended:
- The data is stored in silos (e.g. data warehouses, Hadoop, files) and is not available company-wide. A manual merging would be too costly.
- The data landscape is too complex to ensure that security and data protection guidelines are still adhered to "end-to-end".
- Existing Data-Lake solutions reach their limits in terms of governance, controllability and automation.
- The tools currently in use require the deployment of highly qualified employees with correspondingly high personnel costs.
- There is a lack of specialists to implement the planned strategy in the big data area.
- The current tools require too much manual intervention, which means that the desired data results are not available quickly enough.
What is the architecture of SAP Data Hub?
From a technical point of view, SAP Data Hub is based on the powerful in-memory database SAP HANA on the one hand and on SAP Vora on the other. The latter is a platform for the integration and management of data from Apache Hadoop - a widely used technology in the big-data environment (for more details, see the section "SAP Data Hub vs. SAP Vora"). Although SAP Data Hub integrates and manages data from multiple sources, the data itself is never taken from its native source and stored elsewhere. This procedure is also called the push-down model and enables distributed data processing directly on the source system. Compared to the classic ETL process (Extract, Transform, Load), this achieves higher performance in processing and delivering results.
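The difference between classic ETL and the push-down model can be illustrated with an aggregation over an order table. In the sketch below, an in-memory SQLite database merely stands in for the remote source system; it is not how SAP Data Hub connects to sources.

```python
# Illustrative contrast between classic ETL and the push-down model, with an
# in-memory sqlite database standing in for the remote source system.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("EMEA", 100.0), ("EMEA", 50.0), ("APAC", 70.0)])

# ETL style: extract every row, then aggregate in the consuming application.
rows = con.execute("SELECT region, amount FROM orders").fetchall()
etl_total = {}
for region, amount in rows:
    etl_total[region] = etl_total.get(region, 0.0) + amount

# Push-down style: the aggregation runs inside the source system and only
# the small result set is moved.
pushdown_total = dict(con.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))
```

Both approaches yield the same totals, but in the push-down case only two result rows leave the source instead of the full table, which is where the performance advantage over classic ETL comes from as data volumes grow.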
A simple desktop design variant or a cockpit can be used as the front end. The cockpit enables users to create data pipelines on their own (in self-service). It also displays all connected data systems including the current connection status. Furthermore, the underlying data sources are visualized. This ensures a structured overview of the data landscape at all times. In addition, drag-and-drop functions are available that employees can use to create graphical data flow models.
As far as provision is concerned, SAP Data Hub supports all conceivable variants. The platform can be operated locally as well as in cloud and hybrid environments.
How is SAP Data Hub deployed in the cloud?
Due to its architecture (from version 2.3), which is completely based on containers, SAP Data Hub can be provided on any Kubernetes platform. In addition to private clouds, this includes the following managed cloud services:
- Amazon Web Services (AWS): Amazon Elastic Kubernetes Service (Amazon EKS)
- Microsoft Azure: Azure Kubernetes Service (AKS)
- Google Cloud Platform (GCP): Google Kubernetes Engine (GKE)
What is the difference between SAP Data Hub and SAP Data Intelligence?
One of the latest cloud services from Walldorf is called SAP Data Intelligence. It is based on the SAP Cloud Platform and includes all the functionality of SAP Data Hub. Accordingly, the service could also be called a cloud variant of SAP Data Hub or "SAP Data Hub as a Service". However, the range of functions is even more extensive. For example, SAP Data Intelligence also includes the functions of the SAP Leonardo Machine Learning Foundation. A central module here is the Machine Learning Scenario Manager. It enables you to centrally manage, provide, and execute various artifacts of machine learning (such as models and pipelines).
Companies that already use SAP Data Hub and are interested in the extensive Leonardo functions do not need to switch to SAP Data Intelligence. Instead, thanks to a maintenance contract, the features are now also available in SAP Data Hub at no additional cost.
All in all, SAP Data Intelligence and SAP Data Hub can therefore be described as functionally identical solutions. The only difference is the delivery: While SAP Data Intelligence is offered via SAP Cloud Platform (subscription model), SAP Data Hub is licensed and can be operated on any Kubernetes environment (cloud, on premise, hybrid).
SAP Data Hub vs. SAP Cloud Platform Integration (SAP CPI)
On the face of it, SAP Data Hub and SAP Cloud Platform Integration (SAP CPI) have a lot in common. Both solutions are a type of "middleware" that allows objects from local systems and the cloud to be connected to each other. However, the focus is different. While SAP Data Hub focuses exclusively on integrating and orchestrating data, SAP CPI has the core task of connecting complete systems. SAP Cloud Platform Integration is about smooth business processes and easy data exchange across SAP and non-SAP applications.
SAP Data Hub vs. SAP Vora
The differences between SAP Data Hub and SAP Vora are best understood by looking at the basic objectives:
- SAP Data Hub simplifies the orchestration of complex data processes. It also provides governance for modern and fragmented data landscapes.
- SAP Vora is an easy-to-use in-memory engine for distributed data systems. The primary goal is to identify and then process usable elements within large amounts of data. This data is usually stored in Hadoop clusters and NoSQL solutions.
It is important to know that SAP Vora is a component of SAP Data Hub. The two solutions are therefore not in competition with each other. Rather, the interaction of the two components makes it possible to combine data from external sources and from SAP HANA.
Is SAP Data Hub already available?
SAP Data Hub has existed since 2017; the current version (2.3) was released at the end of 2018. In this version, all components are container-based for the first time. This means that components such as agents, engines and metadata stores are executed in isolated environments (containers). The Walldorf-based software group SAP is thus in line with the general trend towards "containerization", which is associated with greater portability, flexibility and speed.
Access to SAP Data Hub was also revised. The SAP Data Hub Launchpad with its modern, tile-based interface now serves as the central entry point. It displays all applications, such as system administration, SAP Vora Tools, SAP Data Hub Connection Management, the Pipeline Modeler, and the Metadata Explorer.
What advantages does SAP Data Hub offer?
To summarize, the following facts in particular speak for SAP Data Hub:
- It is universally applicable.
- It significantly simplifies the handling of data through intelligent functions.
- It is scalable independent of the data source.
- It works according to guidelines and laws.
Let us take a closer look at these arguments in conclusion.
Universal use of SAP Data Hub
With the Data Hub, users receive a tool that gives them access to more than just their own company data, for example from SAP HANA. With the help of the software, existing information can be supplemented with data from external sources, applications and processes. This makes it much easier to tap into Big Data and use the valuable information it contains. At the same time, internal and external information can be put into context, allowing for much more profound insights.
Simplification of data management
With SAP Data Hub, users have the opportunity to optimize data quality in self-service. To do this, the software provides a visual representation of the data correlations in the enterprise. In addition, preparation, cleansing and connection control are largely automated. System and metadata recognition is also advantageous: it enables users to search every connected data system and, in a second step, feed relevant data into further use.
Scalability of data volumes with SAP Data Hub
SAP Data Hub focuses on the orchestration of data; the data itself is processed directly in the source system. This push-down approach not only optimizes performance, but also avoids unnecessary, costly data movements. Moreover, if the volume of data or the number of data sources increases, this does not pose a challenge thanks to the push-down model.
Compliance with SAP Data Hub
The fulfillment of internal company and legal requirements in handling data has the highest priority today. SAP Data Hub also takes this into account. The platform enables security policies to be maintained at a central location. The metadata can also be used to identify and correct quality errors.