Belarusian N-corpus is provided for CLARIN VLO! – Clarin Knowledge Centre for Belarusian text and speech processing

The Belarusian N-corpus has been provided to the CLARIN Virtual Language Observatory (VLO).

The Belarusian N-corpus is a corpus of modern Belarusian texts with structural and grammatical annotation and text metadata (certification). It consists of the following subcorpora:

Basic corpus (14.8 thousand texts, 43.4 million word usages);
Concordance of 19th-century Belarusian (515 texts, 278 thousand word usages);
Belarusian Wikipedia corpus (287 thousand texts, 126 million word usages);
Translations corpus (1.22 thousand texts, 6.91 million word usages);
Unprocessed texts corpus (68.7 thousand texts, 892 million word usages);
Biblical corpus (16 Bible translations into Belarusian and, for comparison, into other languages: Latin, Jewish, Ukrainian, Polish).

In total, the corpus comprises 372 thousand texts and 1.07 billion word usages.

The basic corpus contains texts in five styles: artistic, scientific, publicistic, official, and religious. Within each style, texts are classified by genre; for example, the artistic style is divided into the following genres: narrative, short novel, ballad, fable, verse, fairy tale, ode, poem, play, novel, and plot.
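As a quick check, the totals quoted above can be reproduced from the per-subcorpus figures in the list. The sketch below uses only the rounded numbers given in this text; the dictionary name and keys are illustrative, and the Biblical corpus is left out because no counts are given for it.

```python
# Subcorpus sizes as quoted in the list above: (texts, word usages).
# The figures are the rounded values from the text, not exact counts.
SUBCORPORA = {
    "Basic": (14_800, 43_400_000),
    "19th-century concordance": (515, 278_000),
    "Belarusian Wikipedia": (287_000, 126_000_000),
    "Translations": (1_220, 6_910_000),
    "Unprocessed texts": (68_700, 892_000_000),
}

total_texts = sum(texts for texts, _ in SUBCORPORA.values())
total_usages = sum(usages for _, usages in SUBCORPORA.values())

print(f"{total_texts / 1e3:.0f} thousand texts")         # 372 thousand texts
print(f"{total_usages / 1e9:.2f} billion word usages")   # 1.07 billion word usages
```

The unprocessed texts subcorpus alone accounts for roughly 83% of all word usages, so the annotated part of the corpus is considerably smaller than the headline total suggests.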

As in most Slavic corpora, the Belarusian N-corpus encodes morphological (grammatical) information: initial word forms and grammatical characteristics. Grammatical annotation of the corpus relies on the Lexical and Grammatical Base, a collection of words with morphological and other tags. Each paradigm header gives the identification number of the paradigm, the initial form, and the grammatical features of the word. Where necessary, additional information is recorded: government (for verbs), meaning, and remarks. Each inflected form has its own characteristics. The source of the word or word form, stress, spelling, and non-canonical forms are also indicated. To date, the Lexical and Grammatical Base contains approximately 304 thousand paradigms and more than 4.4 million word forms.
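To make the structure described above more concrete, here is a minimal sketch of how a paradigm and its forms could be modelled. It is not the actual format of the Lexical and Grammatical Base: all class names, field names, the tag strings, and the identification number are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WordForm:
    """One inflected form within a paradigm (hypothetical field names)."""
    form: str                        # the word form itself
    tags: str                        # grammatical characteristics of this form
    source: Optional[str] = None     # where the form comes from, if recorded
    canonical: bool = True           # False for non-canonical forms


@dataclass
class Paradigm:
    """Paradigm header plus its forms (hypothetical field names)."""
    paradigm_id: int                 # identification number of the paradigm
    initial_form: str                # the initial (dictionary) form
    tags: str                        # grammatical features given in the header
    government: Optional[str] = None # recorded for verbs when needed
    meaning: Optional[str] = None
    remarks: Optional[str] = None
    forms: list[WordForm] = field(default_factory=list)


def initial_forms_for(paradigms: list[Paradigm], token: str) -> set[str]:
    """Return the initial form of every paradigm that contains the token."""
    return {p.initial_form for p in paradigms for f in p.forms if f.form == token}


# Illustrative entry; the ID and tag strings are made up for this example.
book = Paradigm(
    paradigm_id=1,
    initial_form="кніга",
    tags="noun",
    forms=[
        WordForm("кніга", "nominative singular"),
        WordForm("кнігі", "genitive singular"),
        WordForm("кнізе", "dative singular"),
    ],
)

print(initial_forms_for([book], "кнізе"))
```

Grammatical annotation of a text then amounts to looking up each token among the word forms and attaching the initial form and tags of every paradigm in which it occurs.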

The corpus user interface is available at https://bnkorpus.info/korpus.html. It lets the user restrict the set of texts to be searched by selecting one or more subcorpora, authors, styles, and genres, and by limiting the texts by date of writing and date of publication. The user can also enter one or several words to search for and choose the grammatical characteristics of each word entered in the search box.
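The text restrictions described above behave like a conjunction of optional filters: a criterion left empty imposes no restriction. The following sketch illustrates that logic on a small in-memory list. It does not call the corpus website or reflect its implementation; the class, the function, the field names, and the sample records are all hypothetical, and only the date of writing is modelled.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TextMeta:
    """Metadata of one corpus text (hypothetical field names)."""
    subcorpus: str
    author: str
    style: str
    genre: str
    year_written: int


def select_texts(
    texts: Iterable[TextMeta],
    subcorpora: Iterable[str] = (),
    authors: Iterable[str] = (),
    styles: Iterable[str] = (),
    genres: Iterable[str] = (),
    written_from: Optional[int] = None,
    written_to: Optional[int] = None,
) -> list[TextMeta]:
    """Keep the texts that satisfy every criterion that was actually set."""
    def allowed(value: str, choices: Iterable[str]) -> bool:
        choices = set(choices)
        return not choices or value in choices  # empty means "no restriction"

    return [
        t for t in texts
        if allowed(t.subcorpus, subcorpora)
        and allowed(t.author, authors)
        and allowed(t.style, styles)
        and allowed(t.genre, genres)
        and (written_from is None or t.year_written >= written_from)
        and (written_to is None or t.year_written <= written_to)
    ]


# Invented sample records, used only to exercise the filter.
sample = [
    TextMeta("basic", "Author A", "artistic", "novel", 1925),
    TextMeta("basic", "Author B", "scientific", "article", 1980),
    TextMeta("translations", "Author A", "artistic", "poem", 2005),
]

selected = select_texts(sample, styles=["artistic"], written_to=1950)
print([t.genre for t in selected])  # ['novel']
```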

The details are presented here.

The direct link is here.