Scientific Object Identifier: http://s-o-i.org/1.1/TAS-04-84-55
DOI: https://dx.doi.org/10.15863/TAS.2020.04.84.55
Language: English
Citation: Kozhevnikov, V. A., & Pankratova, E. S. (2020). Research of text pre-processing methods for preparing data in Russian for machine learning. ISJ Theoretical & Applied Science, 04 (84), 313-320. SOI: http://s-o-i.org/1.1/TAS-04-84-55 DOI: https://dx.doi.org/10.15863/TAS.2020.04.84.55
Pages: 313-320
Published: 30.04.2020
Abstract: The article describes pre-processing methods for preparing Russian-language text data for machine learning. It covers tokenization, normalization, named entity recognition, stemming, lemmatization, and stop-word removal. It also shows some approaches that use morphological analyzers and libraries for NLP tasks.
Key words: pymorphy2, gensim, MyStem, spaCy, stemming, lemmatization, NER, DeepPavlov