Indonesian Dataset Expansion of Microsoft Research Video Description Corpus and Its Similarity Analysis

  • Rahutomo F
  • Hafidh Ayatullah A

Abstract

This paper describes the academic basis of an open Indonesian dataset published in Mendeley Data with DOI: 10.17632/d7vx5cc92y.1 [1]. The dataset is an Indonesian-language expansion of the Microsoft Research Video Description Corpus, an open dataset containing about 120 thousand sentences. The corpus is a useful resource because its sentences form a set of roughly parallel descriptions of more than 2,000 video snippets in 35 languages. Both paraphrase and bilingual relations are available, but Indonesian descriptions are not included in the corpus. Therefore, this paper describes the research effort to expand the dataset to the Indonesian language. The research collected 43,753 description texts for 1,959 short videos, parallel with Microsoft’s dataset. To add more value to the dataset, similarity metrics were calculated for the texts. The metrics were Cosine, Jaccard, Euclidean, and Manhattan, with average results of 0.22, 0.33, 2.38, and 6.08, respectively.
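
The four metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the texts were tokenized or vectorized, so the lowercase whitespace tokenization and raw term-count vectors used here are assumptions, and the function names `tokenize` and `similarity_metrics` are hypothetical. Cosine and Jaccard are similarities (higher means more alike), while Euclidean and Manhattan are distances (lower means more alike).

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper may differ.
    return text.lower().split()


def similarity_metrics(text_a, text_b):
    """Compare two descriptions using bag-of-words term counts."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)

    # Cosine similarity between the two term-count vectors.
    dot = sum(counts_a[w] * counts_b[w] for w in vocab)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Jaccard similarity on the sets of distinct tokens.
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 0.0

    # Euclidean (L2) and Manhattan (L1) distances between count vectors.
    # Counter returns 0 for words missing from one of the texts.
    euclidean = math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocab))
    manhattan = sum(abs(counts_a[w] - counts_b[w]) for w in vocab)

    return {
        "cosine": cosine,
        "jaccard": jaccard,
        "euclidean": euclidean,
        "manhattan": manhattan,
    }
```

For example, `similarity_metrics("seorang pria bermain gitar", "seorang pria memainkan gitar")` compares two made-up descriptions that differ in one word. Averaging such values over pairs of descriptions of the same video would give corpus-level figures comparable in kind to those reported above, though the exact pairing scheme used in the paper is not stated in the abstract.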

Cite

Rahutomo, F., & Hafidh Ayatullah, A. (2018). Indonesian Dataset Expansion of Microsoft Research Video Description Corpus and Its Similarity Analysis. Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, 319–326. https://doi.org/10.22219/kinetik.v3i4.680
