Implementation of change data capture in ETL process for data warehouse using HDFS and apache spark

Denny, I. Putu Medagia Atmaja, Ari Saptawijaya, Siti Aminah

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Citation (Scopus)

Abstract

This study aims to increase ETL process efficiency and reduce processing time by applying the Change Data Capture (CDC) method in a distributed system using the Hadoop Distributed File System (HDFS) and Apache Spark in the data warehouse of the Learning Analytics system of Universitas Indonesia. Usually, increases in the number of records in the data source result in an increase in ETL processing time for the data warehouse system. This condition occurs as a result of an inefficient ETL process using the full load method. Using the full load method, ETL has to process the same number of records as the number of records in the data sources. The proposed ETL model design with the application of the CDC method using HDFS and Apache Spark can reduce the amount of data in the ETL process. Consequently, the process becomes more efficient and the ETL processing time is reduced by approximately 53% on average.
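The contrast between full load and CDC can be illustrated with a minimal sketch. The abstract does not say which CDC mechanism the authors use, so this example assumes a simple snapshot comparison, written in plain Python rather than Spark. The function name `capture_changes`, the `id` and `score` fields, and all row values are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def capture_changes(
    previous: List[Row], current: List[Row], key: str = "id"
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Compare two snapshots and return (inserted, updated, deleted) rows.

    Assumption: each row has a unique value under `key`.
    """
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}

    # Rows whose key appears only in the current snapshot.
    inserted = [r for k, r in curr_by_key.items() if k not in prev_by_key]
    # Rows present in both snapshots whose contents differ.
    updated = [
        r for k, r in curr_by_key.items()
        if k in prev_by_key and prev_by_key[k] != r
    ]
    # Rows whose key appears only in the previous snapshot.
    deleted = [r for k, r in prev_by_key.items() if k not in curr_by_key]
    return inserted, updated, deleted


# Hypothetical source table at the time of the previous ETL run: ids 1..10.
previous_snapshot = [{"id": i, "score": 50 + i} for i in range(1, 11)]

# Hypothetical source table now: id 4 updated, id 5 deleted, id 11 inserted.
current_snapshot = []
for row in previous_snapshot:
    if row["id"] == 5:
        continue
    if row["id"] == 4:
        current_snapshot.append({**row, "score": 99})
    else:
        current_snapshot.append(dict(row))
current_snapshot.append({"id": 11, "score": 75})

inserted, updated, deleted = capture_changes(previous_snapshot, current_snapshot)

full_load_rows = len(current_snapshot)  # full load reprocesses every row
cdc_rows = len(inserted) + len(updated) + len(deleted)  # CDC: changes only
print(f"full load rows: {full_load_rows}, CDC rows: {cdc_rows}")
```

In this toy data a full load would reprocess all 10 current rows, while the CDC step passes only 3 changed rows downstream. The numbers are invented for the example and are unrelated to the 53% figure reported in the abstract.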

Original language: English
Title of host publication: Proceedings - WBIS 2017
Subtitle of host publication: 2017 International Workshop on Big Data and Information Security
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 49-55
Number of pages: 7
ISBN (Electronic): 9781538620380
Publication status: Published - 29 Jan 2018
Event: 2017 International Workshop on Big Data and Information Security, WBIS 2017 - Jakarta, Indonesia
Duration: 23 Sep 2017 to 24 Sep 2017

Publication series

Name: Proceedings - WBIS 2017: 2017 International Workshop on Big Data and Information Security
Volume: 2018-January

Conference

Conference: 2017 International Workshop on Big Data and Information Security, WBIS 2017
Country/Territory: Indonesia
City: Jakarta
Period: 23/09/17 to 24/09/17

Keywords

  • big data
  • change data capture
  • data warehouse
  • distributed system
  • extract transform load
