The ETL process belongs to the computer science basics. It enables data from different sources to be converted into valuable knowledge. On this page you can learn more about the details!
Frequently asked questionsSee all questions
What is the ETL process?
The ETL process consists of the individual steps Extract (E), Transform (T) and Load (L). This method from computer science makes it possible to read data from various sources, prepare it and make it available in a central system.
Today, data relevant to the company comes from a wide variety of internal and external sources. To make all this information usable, the various data sources must first be tapped. Since these are different formats and not every single data record is relevant, the second step involves cleansing and preparing the raw data. The aim here is to convert the data into information relevant for decision-making. Finally, the prepared data is then made available in a central database or data warehouse (database system for analyses) to make it accessible to users.
The three steps of the ETL process are summarized as follows:
- Extract / Extraction ("extraction") of raw data from different data sources
- Transform / Transformation into the format and structure of the central target database
- Load / Load (make available) the data in the target system
When is the ETL process useful?
The implementation of an ETL process is recommended whenever a company needs access to data from different sources in order to make well-founded management decisions (using business intelligence). Especially when data queries are difficult to implement with existing resources, are error-prone or completely impossible, the implementation of ETL makes sense. The same applies if a company wants to implement a central instance for all data analyses.
Another aspect that is closely related to the ETL schema is big data analysis. This refers to the indexing and evaluation of extremely extensive data in a wide variety of formats. Incidentally, ETL is not only relevant for large companies. The approach is also important for SMEs with growing data volumes and market requirements.
Last but not least, the ETL process is used to migrate data between different applications and for data replication for backup purposes.
Why is ETL so important?
Today, ETL is a significant component of Business Intelligence (BI). With the introduction of ETL-based processes and tools, companies gain a competitive advantage by transforming existing raw data into valuable knowledge, enabling them to make data-driven decisions. In short: ETL significantly increases the availability and value of data.
However, the ETL process is not only significant in terms of data discovery, but also in terms of data quality. Of course, it cannot be the goal of companies to use data from different internal and external sources for analysis without verification. ETL therefore also ensures that only consistent and cleansed data enters data warehouses and BI tools.
What are the advantages of an ETL process?
In addition to the main benefit already mentioned - the exploitation of valuable information - the ETL process brings further improvements for companies. Through integration, for example, it ensures that all corporate divisions operate on the basis of uniform data. It not only provides management information, but also enables the specialist departments to carry out analyses on various issues at any time. All in all, data access is also much faster than with traditional approaches.
Thanks to transformation and aggregation, the source data can also be converted into key business indicators (KPIs). Subsequent steps such as graphical representations can thus also be implemented without any problems.
How does an ETL process work from a technical perspective?
In this section we would like to go into more detail and explain how the sub-processes extraction, transformation and loading work from a technical point of view. There are different approaches to this, which we will also show.
Step 1: ETL extraction process
In this step, conventions for the connection types to the various source systems are first defined so that the extraction can begin. The transfer types are also defined.
The update frequency is also defined. You can differentiate between synchronous and asynchronous extraction. With synchronous extraction, datasets are updated continuously. They are therefore always up to date. However, this procedure causes an increased network load. For this reason, some companies use the resource-saving asynchronous extraction. This can be scheduled for time windows in which sufficient resources are available - such as at night.
Furthermore, the extraction can be differentiated according to its scope. The following specifications are possible:
- Static extraction: A complete image of the database is created (relevant for initial filling and recovery)
- Incremental extraction: Only the changes between the current and the last extraction are read
Step 2: ETL transformation process
The second step of the ETL process is the data transformation. It is responsible for transforming data from different sources into a uniform format that can be used within the company. Among other things, the following actions are carried out:
- Adaptation to uniform data types
- Conversion or new coding (e.g. for country coding)
- Unification of character strings and time specifications
- Recalculation of measurement units
In addition to the resolution of structural (technical) differences, differences in content (technical) are also resolved. This is done using a correction scheme that takes the following points into account:
- Incorrect (inconsistent) data
- Redundant (double) data
- Obsolete data
- Missing values
In addition, business harmonization and aggregation of the data can take place in this ETL phase. It is also possible to enrich the data with additional data and key figures.
Step 3: ETL loading process
The load process ensures that the transformed data is loaded from the staging area directly into the data warehouse. During this process, the data warehouse is usually locked to prevent incorrect evaluations. If it is an update, the existing data records can be overwritten or newly created. Changes can also be logged, which also allows you to revert to previous versions.
Finally, after the "refuelling" of the data warehouse database, an update of the analysis systems (e.g. BI software) is also necessary.
ETL in the Data Warehouse: Examples and possible applications
The ETL process is an integral part of data warehouses and offers numerous application possibilities in companies and organizations. Not only reports, statistics and key figures are provided flexibly. It is also possible to uncover previously hidden connections. Some application examples are:
- Consumer goods industry: sentiment analysis with data from social networks to analyse market trends, combination of market data with existing data from the company's own CRM system
- Medicine: Linking patient files, laboratory results and radiology images to determine the risk of disease
- Energy sector: collection of consumption data, broken down by region, age, sex or type of household
- Aviation: Linking data such as payload, route, aircraft type and fuel consumption to identify profitable and unprofitable flight routes
What is the difference between ETL and ELT?
As the abbreviation already indicates, the loading of the data in the ELT process takes place before the transformation. In contrast to ETL, the transformation in ELT takes place in the target database first. Both approaches have advantages and disadvantages. When to use which method depends on the individual scenario.
Due to the fact that in the ELT process without an upstream processing server, data is completely transferred to the final destination, the time lag between extraction and provision is significantly shorter. However, the data is not immediately usable, since it must first be transformed for analyses.
Therefore, if a high acquisition speed is required, ELT may be the better choice. In addition, the possibility of accessing raw data can also be considered a plus point if, for example, data scientists are involved in the analysis. Especially in the big-data environment, ELT is now often preferred to ETL.
Which ETL tools (ETL software) are available?
With classic ETL tools it is increasingly difficult to make Big Data available simply, quickly and with quality assurance. Suppliers have therefore started to expand their products with data tools for processing real-time data. We are also talking about data integration solutions here. Both open source products and commercial systems are available.
Leading providers of commercial data integration tools are technology companies such as IBM, Microsoft, Oracle, Informatica and SAS Institute. But also ERP providers such as SAP have established their own ETL tools within the framework of Business Intelligence. Important SAP solutions in this area are SAP BusinessObjects Data Integrator, SAP Data Services and SAP HANA Smart Data Integration (SDI). Let us take a closer look at these products in the following.
SAP BusinessObjects Data Integrator
SAP BusinessObjects Data Integrator is a feature of the SAP BusinessObjects business intelligence solution. It enables users to integrate data from external systems such as Microsoft Office and Salesforce to perform analysis and reporting.
SAP Data Services
SAP Data Services is a software for the integration and transformation of data. For this purpose, the solution obtains its data from numerous sources (databases, applications, Web services, cloud systems, etc.), which are also called data stores. In addition to traditional ETL functionality, SAP Data Services offers features such as text analysis and data profiling (automated analysis of data quality). In addition to traditional batch processing, the tool also supports real-time services - that is, data queries with immediate answers. Within an SAP landscape, SAP Data Services works with products such as SAP Master Data Management (MDM), SAP Business Warehouse, and SAP Data Quality Management.
SAP HANA Smart Data Integration (SDI)
With regard to the new SAP database generation HANA, the tool SAP HANA Smart Data Integration (SDI) is considered the successor of previous integration technologies such as SAP Data Services. SDI has numerous connectors to typical data sources and works with so-called flow graphs. This is a flow-based ETL technology. Both batch and real-time processing is supported.
In particular, however, HANA Smart Data Integration is currently the only SAP technology that combines the ETL process, real-time replication and virtualization in one product. Therefore, if the target database is a HANA database, no additional component for data integration is required.
What infrastructural requirements does the ETL process have?
If the ETL process is to be carried out in-house, enormous storage and computing capacities are required. Since valuable data is stored in a central location, extensive measures must also be taken to protect against failures, data loss, errors, theft and natural disasters. Appropriately secured data centers cause high costs.
The following development can be expected in the coming years:
- The amount of data flowing through companies will continue to grow strongly.
- The value of the data will also continue to increase.
- The demand for computing power and suitable ETL tools will increase analogously.
All these factors require a high-performance, scalable IT infrastructure, which can only be mapped with a high level of effort using a dedicated data center. The trend is therefore clearly moving towards cloud-based ETL processes ("ETL-as-a-Service").
What is the significance of ETL-as-a-Service?
As an alternative to classic in-house operation, ETL can now also be obtained as a service from the cloud. With the exception of end devices such as PCs, no separate IT infrastructure is required for use. Cloudnative ETL processes follow the traditional scheme, but the steps are run through in a different sequence.
The establishment of cloud-based ETL processes is due in particular to the Apache Hadoop technology. Hadoop distributes computing processes, which makes the following actions possible:
- Extract data remotely from different source systems
- Transforming data over a computer network
- Load data for local analyses in a target system
Thanks to Hadoop, computing resources can be distributed all over the world. But with technology, they are united to share massive computing tasks. This achieves a much higher speed than traditional ETL processes.
Moreover, it is not just the ETL process that is increasingly moving to the cloud. Extensive, fully managed platforms are now available for application and data integration. We are also talking about iPaaS (Integration Platform as a Service).
Meinolf Schäfer, Senior Director Sales & MarketingDo you have any questions? I am happy to help.
+49 2241 8845-623