PERFORMANCE COMPARISON OF WORD2VEC, GLOVE, AND FASTTEXT WORD EMBEDDINGS FOR TEXT CLASSIFICATION

Arliyanti Nurdin, Bernadus Anggo Seno Aji, Anugrayani Bustamin, Zaenal Abidin

Abstract


The unstructured nature of text poses a challenge for feature extraction in text processing. This study compares the performance of the word embedding methods Word2Vec, GloVe, and FastText, with classification performed by a Convolutional Neural Network. These three methods were chosen because, unlike traditional feature engineering such as Bag of Words, they capture semantic and syntactic meaning, word order, and even the context surrounding a word. The embeddings produced by each method are compared on news classification using the 20 Newsgroups and Reuters Newswire datasets, with performance evaluated by the F-measure. FastText achieved the best results, outperforming the other two methods with F-measure values of 0.979 on 20 Newsgroups and 0.715 on Reuters. However, the differences among the three word embeddings are not substantial, indicating that their performance is competitive; the choice among them depends largely on the dataset and the problem to be solved.

Keywords: word embedding, word2vec, glove, fasttext, text classification, convolutional neural network, cnn.
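
To make the pipeline summarized in the abstract concrete, the following is a minimal sketch of how pretrained Word2Vec, GloVe, or FastText vectors could feed a one-dimensional CNN text classifier in the spirit of Kim (2014), evaluated with scikit-learn's F-measure. The file name, dimensions, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

import numpy as np
from gensim.models import KeyedVectors              # reads Word2Vec/FastText .vec files; raw GloVe files need conversion first
from tensorflow import keras
from sklearn.metrics import f1_score

EMB_DIM, NUM_CLASSES = 300, 20                      # illustrative values

# Load pretrained vectors (the path is a placeholder).
wv = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)

# Build an embedding matrix aligned with the tokenizer vocabulary;
# out-of-vocabulary words are left as zero vectors.
def embedding_matrix(word_index):
    mat = np.zeros((len(word_index) + 1, EMB_DIM))
    for word, idx in word_index.items():
        if word in wv:
            mat[idx] = wv[word]
    return mat

# CNN classifier: frozen embedding -> Conv1D -> global max pooling -> dropout -> softmax.
def build_cnn(emb_weights):
    model = keras.Sequential([
        keras.layers.Embedding(emb_weights.shape[0], EMB_DIM,
                               embeddings_initializer=keras.initializers.Constant(emb_weights),
                               trainable=False),
        keras.layers.Conv1D(128, 5, activation="relu"),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dropout(0.5),                   # dropout as in Srivastava et al. (2014)
        keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# After training, the weighted F-measure over the test set:
# f1 = f1_score(y_test, y_pred, average="weighted")

Swapping Word2Vec for GloVe or FastText only changes the vector file loaded in the first step; the classifier and evaluation stay the same, which is what makes a like-for-like comparison of the three embeddings possible.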


References


Bengio, Y. et al. (2003) ‘A Neural Probabilistic Language Model’, Journal of Machine Learning Research, 3, pp. 1137–1155.

Bojanowski, P. et al. (2017) ‘Enriching Word Vectors with Subword Information’, Transactions of the Association for Computational Linguistics, 5, pp. 135–146. doi: 10.1162/tacl_a_00051.

Chen, Y. et al. (2015) ‘Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks’, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). ACL-IJCNLP 2015, Beijing, China: Association for Computational Linguistics, pp. 167–176. doi: 10.3115/v1/P15-1017.

English word vectors · fastText (2020). Available at: https://fasttext.cc/index.html (Accessed: 21 June 2020).

Finding Syntax with Structural Probes · John Hewitt (2019). Available at: https://nlp.stanford.edu//~johnhew//structural-probe.html?utm_source=quora&utm_medium=referral#the-structural-probe (Accessed: 21 June 2020).

GloVe: Global Vectors for Word Representation (2014). Available at: https://nlp.stanford.edu/projects/glove/ (Accessed: 21 June 2020).

Google Code Archive - Long-term storage for Google Code Project Hosting. (2013). Available at: https://code.google.com/archive/p/word2vec/ (Accessed: 21 June 2020).

Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014) ‘A Convolutional Neural Network for Modelling Sentences’, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ACL 2014, Baltimore, Maryland: Association for Computational Linguistics, pp. 655–665. doi: 10.3115/v1/P14-1062.

Kang, H. J. et al. (2016) ‘A Comparison of Word Embeddings for English and Cross-Lingual Chinese Word Sense Disambiguation’, in Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016). Osaka, Japan: The COLING 2016 Organizing Committee, pp. 30–39. Available at: https://www.aclweb.org/anthology/W16-4905 (Accessed: 21 June 2020).

Kim, Y. (2014) ‘Convolutional Neural Networks for Sentence Classification’, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). EMNLP 2014, Doha, Qatar: Association for Computational Linguistics, pp. 1746–1751. doi: 10.3115/v1/D14-1181.

Li, H. et al. (2018) ‘Comparison of Word Embeddings and Sentence Encodings as Generalized Representations for Crisis Tweet Classification Tasks’, New Zealand, p. 13.

Mikolov, T. et al. (2013) ‘Efficient Estimation of Word Representations in Vector Space’, arXiv:1301.3781 [cs]. Available at: http://arxiv.org/abs/1301.3781 (Accessed: 21 June 2020).

Naili, M., Chaibi, A. H. and Ben Ghezala, H. H. (2017) ‘Comparative study of word embedding methods in topic segmentation’, Procedia Computer Science, 112, pp. 340–349. doi: 10.1016/j.procs.2017.08.009.

Nguyen, T. H. and Grishman, R. (2015) ‘Event Detection and Domain Adaptation with Convolutional Neural Networks’, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). ACL-IJCNLP 2015, Beijing, China: Association for Computational Linguistics, pp. 365–371. doi: 10.3115/v1/P15-2060.

Nurdin, A. and Maulidevi, N. U. (2018) ‘5W1H Information Extraction with CNN-Bidirectional LSTM’, Journal of Physics: Conference Series, 978, p. 012078. doi: 10.1088/1742-6596/978/1/012078.

Pennington, J., Socher, R. and Manning, C. (2014) ‘Glove: Global Vectors for Word Representation’, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, pp. 1532–1543. doi: 10.3115/v1/D14-1162.

Rong, X. (2016) ‘word2vec Parameter Learning Explained’, arXiv:1411.2738 [cs]. Available at: http://arxiv.org/abs/1411.2738 (Accessed: 21 June 2020).

Selivanov, D. (2020) GloVe Word Embeddings. Available at: https://cran.r-project.org/web/packages/text2vec/vignettes/glove.html (Accessed: 21 June 2020).

Srivastava, N. et al. (2014) ‘Dropout: a simple way to prevent neural networks from overfitting’, The Journal of Machine Learning Research, 15(1), pp. 1929–1958.

The UCI KDD Archive, Information and Computer Science and University of California, Irvine (1999a) 20 Newsgroups. Available at: https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html (Accessed: 21 June 2020).

The UCI KDD Archive, Information and Computer Science and University of California, Irvine (1999b) Reuters-21578 Text Categorization Collection. Available at: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html (Accessed: 21 June 2020).




DOI: https://doi.org/10.33365/jtk.v14i2.732



Copyright (c) 2021 Arliyanti Nurdin, Bernadus Anggo Seno Aji, Anugrayani Bustamin, Zaenal Abidin

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


