PERFORMANCE COMPARISON OF WORD2VEC, GLOVE, AND FASTTEXT WORD EMBEDDINGS FOR TEXT CLASSIFICATION

Arliyanti Nurdin, Bernadus Anggo Seno Aji, Anugrayani Bustamin, Zaenal Abidin

Abstract


The unstructured nature of text poses a challenge for feature extraction in text processing. This study compares the performance of the word embedding methods Word2Vec, GloVe, and FastText, with classification performed by a Convolutional Neural Network. These three methods were chosen because, unlike traditional feature engineering such as Bag of Words, they capture semantic and syntactic meaning, word order, and even the context surrounding a word. The embeddings produced by each method are compared on news classification using the 20 Newsgroups and Reuters Newswire datasets, with performance evaluated by the F-measure. FastText achieved the best results, outperforming the other two methods with F-measure values of 0.979 on 20 Newsgroups and 0.715 on Reuters. However, the differences among the three word embeddings are not substantial, indicating that their performance is competitive; the choice among them depends largely on the dataset and the problem to be solved.

Keywords: word embedding, word2vec, glove, fasttext, text classification, convolutional neural network, cnn.
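
To make the pipeline summarized in the abstract concrete, the following is a minimal sketch of how pretrained Word2Vec, GloVe, or FastText vectors could feed a one-dimensional CNN text classifier in the spirit of Kim (2014), evaluated with scikit-learn's F-measure. The file name, dimensions, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

import numpy as np
from gensim.models import KeyedVectors              # reads Word2Vec/FastText .vec files; raw GloVe files need conversion first
from tensorflow import keras
from sklearn.metrics import f1_score

EMB_DIM, NUM_CLASSES = 300, 20                      # illustrative values

# Load pretrained vectors (the path is a placeholder).
wv = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)

# Build an embedding matrix aligned with the tokenizer vocabulary;
# out-of-vocabulary words are left as zero vectors.
def embedding_matrix(word_index):
    mat = np.zeros((len(word_index) + 1, EMB_DIM))
    for word, idx in word_index.items():
        if word in wv:
            mat[idx] = wv[word]
    return mat

# CNN classifier: frozen embedding -> Conv1D -> global max pooling -> dropout -> softmax.
def build_cnn(emb_weights):
    model = keras.Sequential([
        keras.layers.Embedding(emb_weights.shape[0], EMB_DIM,
                               embeddings_initializer=keras.initializers.Constant(emb_weights),
                               trainable=False),
        keras.layers.Conv1D(128, 5, activation="relu"),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dropout(0.5),                   # dropout as in Srivastava et al. (2014)
        keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# After training, the weighted F-measure over the test set:
# f1 = f1_score(y_test, y_pred, average="weighted")

Swapping Word2Vec for GloVe or FastText only changes the vector file loaded in the first step; the classifier and evaluation stay the same, which is what makes a like-for-like comparison of the three embeddings possible.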


References


Bengio, Y. et al. (2003) ‘A Neural Probabilistic Language Model’, Journal of Machine Learning Research, 3, pp. 1137–1155.

Bojanowski, P. et al. (2017) ‘Enriching Word Vectors with Subword Information’, Transactions of the Association for Computational Linguistics, 5, pp. 135–146. doi: 10.1162/tacl_a_00051.

Chen, Y. et al. (2015) ‘Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks’, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). ACL-IJCNLP 2015, Beijing, China: Association for Computational Linguistics, pp. 167–176. doi: 10.3115/v1/P15-1017.

English word vectors · fastText (2020). Available at: https://fasttext.cc/index.html (Accessed: 21 June 2020).

Finding Syntax with Structural Probes · John Hewitt (2019). Available at: https://nlp.stanford.edu//~johnhew//structural-probe.html?utm_source=quora&utm_medium=referral#the-structural-probe (Accessed: 21 June 2020).

GloVe: Global Vectors for Word Representation (2014). Available at: https://nlp.stanford.edu/projects/glove/ (Accessed: 21 June 2020).

Google Code Archive - Long-term storage for Google Code Project Hosting. (2013). Available at: https://code.google.com/archive/p/word2vec/ (Accessed: 21 June 2020).

Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014) ‘A Convolutional Neural Network for Modelling Sentences’, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ACL 2014, Baltimore, Maryland: Association for Computational Linguistics, pp. 655–665. doi: 10.3115/v1/P14-1062.

Kang, H. J. et al. (2016) ‘A Comparison of Word Embeddings for English and Cross-Lingual Chinese Word Sense Disambiguation’, in Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016). Osaka, Japan: The COLING 2016 Organizing Committee, pp. 30–39. Available at: https://www.aclweb.org/anthology/W16-4905 (Accessed: 21 June 2020).

Kim, Y. (2014) ‘Convolutional Neural Networks for Sentence Classification’, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). EMNLP 2014, Doha, Qatar: Association for Computational Linguistics, pp. 1746–1751. doi: 10.3115/v1/D14-1181.

Li, H. et al. (2018) ‘Comparison of Word Embeddings and Sentence Encodings as Generalized Representations for Crisis Tweet Classification Tasks’, New Zealand, p. 13.

Mikolov, T. et al. (2013) ‘Efficient Estimation of Word Representations in Vector Space’, arXiv:1301.3781 [cs]. Available at: http://arxiv.org/abs/1301.3781 (Accessed: 21 June 2020).

Naili, M., Chaibi, A. H. and Ben Ghezala, H. H. (2017) ‘Comparative study of word embedding methods in topic segmentation’, Procedia Computer Science, 112, pp. 340–349. doi: 10.1016/j.procs.2017.08.009.

Nguyen, T. H. and Grishman, R. (2015) ‘Event Detection and Domain Adaptation with Convolutional Neural Networks’, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). ACL-IJCNLP 2015, Beijing, China: Association for Computational Linguistics, pp. 365–371. doi: 10.3115/v1/P15-2060.

Nurdin, A. and Maulidevi, N. U. (2018) ‘5W1H Information Extraction with CNN-Bidirectional LSTM’, Journal of Physics: Conference Series, 978, p. 012078. doi: 10.1088/1742-6596/978/1/012078.

Pennington, J., Socher, R. and Manning, C. (2014) ‘Glove: Global Vectors for Word Representation’, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, pp. 1532–1543. doi: 10.3115/v1/D14-1162.

Rong, X. (2016) ‘word2vec Parameter Learning Explained’, arXiv:1411.2738 [cs]. Available at: http://arxiv.org/abs/1411.2738 (Accessed: 21 June 2020).

Selivanov, D. (2020) GloVe Word Embeddings. Available at: https://cran.r-project.org/web/packages/text2vec/vignettes/glove.html (Accessed: 21 June 2020).

Srivastava, N. et al. (2014) ‘Dropout: a simple way to prevent neural networks from overfitting’, The Journal of Machine Learning Research, 15(1), pp. 1929–1958.

The UCI KDD Archive, Information and Computer Science and University of California, Irvine (1999a) 20 Newsgroups. Available at: https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html (Accessed: 21 June 2020).

The UCI KDD Archive, Information and Computer Science and University of California, Irvine (1999b) Reuters-21578 Text Categorization Collection. Available at: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html (Accessed: 21 June 2020).




DOI: https://doi.org/10.33365/jtk.v14i2.732



Copyright (c) 2021 Arliyanti Nurdin, Bernadus Anggo Seno Aji, Anugrayani Bustamin, Zaenal Abidin

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


