Spark-gram: Mining frequent N-grams using parallel processing in Spark

Prasetya Ajie Utama, Bayu Distiawan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

3 Citations (Scopus)

Abstract

Mining sequence patterns in the form of n-grams (sequences of words that appear consecutively) from large text data is a fundamental part of several information retrieval and natural language processing applications. In this work, we present Spark-gram, a method for large-scale frequent sequence mining based on Spark, adapted from its equivalent MapReduce method, Suffix-σ. The Spark-gram design allows the discovery of all n-grams with maximum length σ and minimum occurrence frequency τ, using an iterative algorithm with only a single shuffle phase. We show that Spark-gram can outperform Suffix-σ, mainly when τ is high, but may perform worse as the value of σ grows.
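To make the roles of σ and τ concrete, the following is a minimal single-machine sketch of the problem the abstract states. It is not the Spark-gram or Suffix-σ algorithm and uses no Spark. The function name `frequent_ngrams`, the whitespace tokenization, and the reading of "occurrence frequency" as total occurrences across the collection are all assumptions made for illustration.

```python
from collections import Counter


def frequent_ngrams(documents, sigma, tau):
    """Return {n-gram tuple: count} for n-grams of length <= sigma
    that occur at least tau times across all documents.

    Hypothetical helper for illustration only; not the paper's method.
    """
    counts = Counter()
    for doc in documents:
        # Assumption: lowercase, whitespace-separated tokens.
        tokens = doc.lower().split()
        for start in range(len(tokens)):
            # Every n-gram starting at this position is a prefix of the
            # suffix truncated to at most sigma tokens.
            suffix = tokens[start:start + sigma]
            for n in range(1, len(suffix) + 1):
                counts[tuple(suffix[:n])] += 1
    # Keep only n-grams meeting the minimum frequency tau.
    return {gram: c for gram, c in counts.items() if c >= tau}


docs = ["a b a b", "a b c"]
result = frequent_ngrams(docs, sigma=2, tau=2)
```

With these two toy documents, σ = 2 limits the output to unigrams and bigrams, and τ = 2 drops any n-gram seen only once. Raising τ shrinks the result set, while raising σ increases the number of candidate n-grams that must be counted.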

Original language: English
Title of host publication: ICACSIS 2015 - 2015 International Conference on Advanced Computer Science and Information Systems, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 129-136
Number of pages: 8
ISBN (Electronic): 9781509003624
DOIs
Publication status: Published - 19 Feb 2016
Event: International Conference on Advanced Computer Science and Information Systems, ICACSIS 2015 - Depok, Indonesia
Duration: 10 Oct 2015 to 11 Oct 2015

Publication series

Name: ICACSIS 2015 - 2015 International Conference on Advanced Computer Science and Information Systems, Proceedings

Conference

Conference: International Conference on Advanced Computer Science and Information Systems, ICACSIS 2015
Country/Territory: Indonesia
City: Depok
Period: 10/10/15 to 11/10/15

Keywords

  • distributed computing
  • hadoop
  • mapreduce
  • spark
  • Text mining
