Albanian National Corpus


Welcome to the start page of the Albanian National Corpus, which includes the Main corpus of literary Albanian (from the 1970s to the present) and the Corpus of early Albanian texts.

This is the main page of the website hosting the Albanian National Corpus (ANC), which contains 31.12 million words.

The Corpus is designed for people interested in various issues related to the Albanian language. It can serve as reference data both for professional linguists and for those who take an interest in Albanian and its history because of their occupation or out of mere curiosity. The material collected in the Corpus can be used to study the Albanian lexicon and grammar, as well as to investigate the changes that Albanian has undergone over the previous centuries.

Since 2012, the Albanian National Corpus has included primarily original prose texts (fiction, non-fiction, press) representing the Albanian literary language since the 1970s, as well as translated texts and Albanian poetry. The project aims at the widest possible coverage: not only modern written texts of various genres, but also texts created in different periods of the history of Albanian, as well as colloquial and dialectal texts representing non-literary forms of the modern language. The Corpus of early Albanian texts was opened in September 2019 and continues to develop.

The texts of ANC have user-friendly metadata descriptions and morphological annotation consisting of tags assigned to words.

Annotation

  • automatic morphological annotation (lemmatization, part of speech, all inflectional features); 93.9% of words analyzed (only tokens that do not contain digits or characters from other alphabets are taken into account)
  • no disambiguation at the current stage
  • glossing
  • English translation of lemmata
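
The coverage figure above counts only tokens without digits or characters from other alphabets. As a rough illustration of how such a share could be computed, here is a minimal Python sketch. The eligibility pattern (Latin letters plus ç and ë, optionally joined by hyphens or apostrophes), the function name `coverage`, and the toy word list are all assumptions made for this example; they are not taken from the ANC analyzer.

```python
import re

# Assumption: an "eligible" token consists only of Latin letters (including
# ç and ë), optionally joined by hyphens or apostrophes. Tokens with digits
# or letters from other alphabets are left out of the count entirely.
ELIGIBLE = re.compile(r"[A-Za-zÇçËë]+(?:[-'][A-Za-zÇçËë]+)*")


def coverage(tokens, is_analyzed):
    """Return the share of eligible tokens for which is_analyzed() is true."""
    eligible = [t for t in tokens if ELIGIBLE.fullmatch(t)]
    if not eligible:
        return 0.0
    analyzed = sum(1 for t in eligible if is_analyzed(t))
    return analyzed / len(eligible)


# Toy data: a hypothetical set of word forms that an analyzer recognizes.
known_forms = {"është", "e", "madhe"}
tokens = ["Shtëpia", "është", "e", "madhe", "2019", "дом", "COVID19"]

share = coverage(tokens, lambda t: t.lower() in known_forms)
print(f"{share:.1%} of eligible tokens analyzed")
```

In this toy input, three tokens are excluded before counting, so the share is computed over the remaining four.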

Metadata

  • title of the text
  • author or title of the newspaper
  • creation year (exact date in the case of newspapers)
  • genre / type of text

About corpora

Currently, two corpora are available: the Corpus of modern literary Albanian («Main corpus») and the Corpus of early Albanian texts. They differ in the kind of texts they contain and in how these texts are presented, but have mostly identical annotation and search capabilities. Here are the main characteristics of the corpora:

Main corpus

The Main corpus contains 31.12 million words and consists of the following subcorpora:

Press

Size: 23.4 million words (75.2% of the Main corpus)

Texts: modern newspapers in Albanian:

  • Agjencia Telegrafike Shqiptare (Albanian Telegraphic Agency online, Albania, 2016–2017): 2.5 mln words
  • Gazeta Shqiptare (Albania, 2006–2007): 1.9 mln words
  • Panorama (Albania, 2014–2017): 8.7 mln words
  • Zëri (Kosovo, 2013): 6.2 mln words
  • Koha.mk (North Macedonia, 2011–2013): 4.1 mln words

Language variety: in most cases, standard written Albanian or close to it. Interviews and quotations may contain examples of dialectal speech.

Fiction

Size: 3.2 million words (10.3% of the Main corpus)

Texts: original and translated fiction texts: novels, novellas, short stories, plays, books for children (Albania, Kosovo, North Macedonia, 1960–2010s)

Language variety: in most cases, standard written Albanian or close to it. The speech of the characters may contain examples of dialectal speech or dialect stylization.

Nonfiction

Size: 4.3 million words (13.8% of the Main corpus)

Texts: original and translated nonfiction texts:

  • essays, interviews, epistolary texts, memoirs, diaries, notes (1.05 mln words): essays and correspondence of L. Poradeci (1910–1940s); modern essays, diaries and memoirs (Albania, Kosovo, North Macedonia, 1950–2010s)
  • religious texts (0.75 mln words): the Bible (Old Testament, translated by the Albanian Bible Society; New Testament, translated by V. Dervishi), the Quran
  • scientific texts and textbooks (1.9 mln words): humanities and social sciences; Albania, Kosovo, 2000–2010s
  • official texts (0.6 mln words): legislative and diplomatic; Albania, Kosovo, 1990–2000s

Language variety: in most cases, standard written Albanian or close to it. Texts created before the 1950s may use earlier variants of the literary language.

Poetry

Size: 0.2 million words (0.7% of the Main corpus)

Texts: Albanian poetry of the first half of the 20th century (works of Asdreni, A.Z. Çajupi, Migjeni, F. Noli, L. Poradeci); poetry of Albania, Kosovo and North Macedonia, 1950–2010s

Language variety: in most cases, standard written Albanian or close to it. Poetic texts created before the 1950s may use earlier variants of the literary language.
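
The sizes listed for the Main corpus can be cross-checked with a few lines of arithmetic. The Python sketch below takes the figures from this page (in millions of words) and recomputes the subtotals and percentage shares. The variable and function names are chosen for illustration only. Because the published sizes are rounded, a recomputed share may differ from the published one by a rounding step.

```python
# All figures are in millions of words, as given on this page.
TOTAL_MLN = 31.12

subcorpora = {"Press": 23.4, "Fiction": 3.2, "Nonfiction": 4.3, "Poetry": 0.2}

press_sources = {
    "Agjencia Telegrafike Shqiptare": 2.5,
    "Gazeta Shqiptare": 1.9,
    "Panorama": 8.7,
    "Zëri": 6.2,
    "Koha.mk": 4.1,
}

nonfiction_parts = {
    "essays, memoirs, diaries": 1.05,
    "religious texts": 0.75,
    "scientific texts and textbooks": 1.9,
    "official texts": 0.6,
}


def shares(sizes, total):
    """Percentage share of each subcorpus, rounded to one decimal place."""
    return {name: round(100 * size / total, 1) for name, size in sizes.items()}


print("Press subtotal:", round(sum(press_sources.values()), 2))
print("Nonfiction subtotal:", round(sum(nonfiction_parts.values()), 2))
print("Sum of subcorpora:", round(sum(subcorpora.values()), 2))
for name, pct in shares(subcorpora, TOTAL_MLN).items():
    print(f"{name}: {pct}%")
```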

Corpus of early Albanian texts

This Corpus is intended to include the main monuments of early Albanian writing: the works by Gj. Buzuk, L. Matranga, P. Budi, F. Bardhi, P. Bogdani, P. Mazreku, the Kanun of Arbëria, etc.

To date, the text of the «Christian Doctrine» by Luka Matranga (1592) has been included in the Corpus of early Albanian texts. Matranga’s book is the second largest text written in the Albanian language, and the first in the Tosk dialect.

The text used in the Corpus is based on the so-called manuscript A. We used the transcription by M. de Vaan, which is based on M. Sciambra, La «Dottrina cristiana» albanese di Luca Matranga (1964), and is available at TITUS. We are grateful to M. de Vaan for providing us with the text version of the transcription. During text processing, it was verified against other editions of the book, including the phototypic ones.

Publications

In our publications you may find more information about the Albanian National Corpus, its contents, and development:

Morozova, Maria. Shënime për standardin morfologjik të Korpusit nacional të shqipes [Notes on the morphological standard of the Albanian National Corpus] // Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXI Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 13–27 gusht 2012 / kryered. B. Rugova. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, 2012. 31/1. F. 153–156.

Arkhangelskij, Timofej, Mikhail Daniel, Maria Morozova, Aleksandër Rusakov. Korpusi i gjuhës shqipe: drejtimet kryesore të punës [Albanian Corpus: Main directions of work] // Shqipja dhe gjuhët e Ballkanit. Albanian and Balkan Languages. Konferencë e mbajtur më 10–11 dhjetor 2011 në Prishtinë / red. R. Ismajli. Prishtinë: ASHAK, 2012. F. 635–642.

Rusakov, Aleksandër, Maria Morozova. Korpusi i gjuhës shqipe: problemet dhe rezultatet [Albanian Corpus: Problems and results] // Studime për nder të Rexhep Ismajlit me rastin e 65-vjetorit të lindjes / red. B. Rugova. Prishtinë: Koha, 2012. F. 639–649.

Morozova, Maria, Alexander Rusakov, Marina Domosiletskaya. Albanskaya imennaya morfologiya v korpusnom predstavlenii: Natsionalnyj korpus albanskogo yazyka [The Albanian noun morphology in the morphological model developed for the Albanian National Corpus] // Albanskaya filologiya, balkanistika, problemy yazykoznaniya. K 100-letiyu so dnya rozhdeniya chlena-korrespondenta RAN Agnii Vasiliyevny Desnitskoj / red. A. Kh. Girfanova, M. V. Domosiletskaya, A. V. Zhugra, N. N. Kazansky, A. Yu. Rusakov, N. L. Sukhachev. SPb: Nauka, 2013. S. 120–130.

Morozova, Maria, Alexander Rusakov. Korpusi elektronik i shqipes: përpunimi, përmbajtja dhe përdorimi [Electronic corpus of Albanian: Development, contents and use] // Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXII Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 19–30.08.2013 / kryered. B. Rugova. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, 2014. 33/1. F. 85–96.

Morozova, Maria, Alexander Rusakov. Albanian National Corpus: Composition, Text Processing and Corpus-Oriented Grammar Development // Sprache und Kultur der Albaner. Zeitliche und räumliche Dimensionen. Akten der 5. Deutsch-albanischen kulturwissenschaftlichen Tagung (5.–8. Juni 2014, Buçimas bei Pogradec, Albanien) / Hrsg. von B. Demiraj. Wiesbaden: Harrassowitz Verlag, 2015. (Albanische Forschungen, 37). S. 270–308.

Morozova, Maria, Timofey Arkhangelsky, Mikhail Daniel, Alexander Rusakov. Albanskij natsionalnyj korpus: osnovnye napravleniya raboty [Albanian National Corpus: Main directions of work] // Acta Linguistica Petropolitana. Trudy Instituta lingvisticheskikh issledovanij RAN / otv. red. N. N. Kazansky. SPb: Nauka, 2016. T. XII, Ch. 3. S. 169–189.

How to cite the Corpus

If you use information and examples from the Albanian National Corpus in your research, please cite it as follows:

Maria Morozova, Alexander Rusakov, Timofey Arkhangelskiy. Albanian National Corpus. (Available online at: albanian.web-corpora.net, accessed on .)

Tagset

The grammatical features of the words in the corpora are marked with short tags. Here is the full list of tags used in the Albanian corpora. Both corpora use the same set of tags.

  • A — adjective
  • ADV — adverb
  • ART — article
  • CLIT_PRO — pronominal clitic
  • CONJ — conjunction
  • FORM — formative suffix
  • INTJ — interjection
  • NOUN — noun
  • NUM — numeral
  • PART — particle
  • PR — preposition
  • PRO — pronoun
  • V — verb
  • VB_PART — verbal particle
  • 1 — 1st person
  • 2 — 2nd person
  • 3 — 3rd person
  • abl — ablative
  • abl2 — ablative 2
  • acc — accusative
  • acc_1pl — accusative clitic 1st person plural
  • acc_1sg — accusative clitic 1st person singular
  • acc_2pl — accusative clitic 2nd person plural
  • acc_2sg — accusative clitic 2nd person singular
  • acc_3pl — accusative clitic 3rd person plural
  • acc_3sg — accusative clitic 3rd person singular
  • act — active voice
  • adm — admirative mood
  • anim — animate noun
  • aor — aorist
  • cont — contact position of article
  • dat_1pl — dative clitic 1st person plural
  • dat_1sg — dative clitic 1st person singular
  • dat_2pl — dative clitic 2nd person plural
  • dat_2sg — dative clitic 2nd person singular
  • dat_3pl — dative clitic 3rd person plural
  • dat_3sg — dative clitic 3rd person singular
  • def — definite form
  • dist — distant position of article
  • f — feminine gender
  • gen_dat — genitive-dative
  • Gheg — Gheg lemma or word
  • imp — imperative mood
  • inanim — inanimate noun
  • ind — indicative mood
  • indef — indefinite form
  • ipf — imperfect
  • loc — locative
  • m — masculine gender
  • med — medial verb
  • mf — ambigenous noun
  • n — neuter gender
  • nom — nominative
  • nonst — non-standard lemma or word
  • opt — optative mood
  • pass — non-active voice
  • pl — plural
  • pres — present tense
  • ptcp — participle
  • sbjv — subjunctive mood
  • sg — singular
  • unmkd — unmarked case
  • vi — intransitive verb
  • vt — transitive verb
  • with_abl — preposition governs ablative
  • with_acc — preposition governs accusative
  • with_loc — preposition governs locative
  • with_nom — preposition governs nominative
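
To show how these tags might be read programmatically, here is a small Python sketch that expands a tag string into the descriptions listed above. The comma-separated input format, the function name `describe`, and the sample analysis are assumptions made for this example; the page does not specify how tags are serialized in search results or exports. Only a subset of the tagset is included for brevity.

```python
# A subset of the tagset above, mapped to the descriptions given on this page.
TAGS = {
    "NOUN": "noun",
    "V": "verb",
    "A": "adjective",
    "f": "feminine gender",
    "m": "masculine gender",
    "sg": "singular",
    "pl": "plural",
    "nom": "nominative",
    "acc": "accusative",
    "def": "definite form",
    "indef": "indefinite form",
    "ind": "indicative mood",
    "pres": "present tense",
    "act": "active voice",
}


def describe(analysis, sep=","):
    """Expand a tag string into human-readable descriptions.

    Assumption: tags are joined by `sep`; tags missing from TAGS are
    reported as unknown rather than raising an error.
    """
    tags = [tag.strip() for tag in analysis.split(sep) if tag.strip()]
    return [TAGS.get(tag, f"unknown tag: {tag}") for tag in tags]


# Hypothetical analysis of a definite feminine noun in the nominative singular.
print(describe("NOUN,f,sg,nom,def"))
```

Reporting unknown tags instead of failing keeps the helper usable with a partial tag table like the one shown here.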

Authors

The Corpus was developed as a result of efforts of the creative community of linguists from Saint Petersburg (Institute for Linguistic Studies of the Russian Academy of Sciences) and Moscow (the School of Linguistics at HSE).

The previous version used the search engine of the Eastern Armenian National Corpus (EANC). The current version of ANC uses the morphological analyzer and the tsakorpus platform developed by Timofey Arkhangelsky.

The main ideas for the development of the corpus representation of Albanian grammar, the metatext and grammatical annotation belong to Maria Morozova and Alexander Rusakov. Timofey Arkhangelsky and Mikhail Daniel took part in the discussion of these issues.

The following people were involved in the development of the Corpus:

  • Maria Morozova, Saint Petersburg (corpus representation of Albanian grammar, grammatical annotation, metatext markup, corpus wordlist, text collection and processing)
  • Alexander Rusakov, Saint Petersburg (corpus representation of Albanian grammar, collection and processing of texts)
  • Timofey Arkhangelsky, Moscow — Hamburg (development of the morphological analyzer, the corpus platform, and the user interface, technical support)
  • Marina Domosiletskaya, Saint Petersburg (compilation of the corpus wordlist — nouns)
  • Anna Konovalenko, Saint Petersburg (compilation of the corpus wordlist — adverbs)
  • Anastasia Sidko, Saint Petersburg (compilation of the corpus wordlist — nouns, adjectives, verbs)
  • Daria Alekseeva, Saint Petersburg (collection and processing of texts)
  • Elizabeth Atakova, Saint Petersburg (collection and processing of texts)
  • Varvara Diveeva, Saint Petersburg (collection and processing of texts)
  • Maxim Makartsev, Moscow — Oldenburg (providing texts for the corpus)
  • Besim Kabashi, Munich (providing texts for the corpus)
  • Qerim Ondozi, Prishtina (collection and processing of texts)

The Corpus team is grateful to the publishing house «Onufri» (Tirana) for the help in the selection of texts. The first version of the Albanian National Corpus was created with the financial support of the «Corpus linguistics» Program by the Presidium of the Russian Academy of Sciences. The current version of the Corpus is hosted by the School of Linguistics at HSE.

Now the Corpus is maintained and developed by:

  • Maria Morozova (ILS RAS, SPbSU), morozovamaria86@gmail.com
  • Alexander Rusakov (ILS RAS, SPbSU), ayurusakov@gmail.com
  • Timofey Arkhangelsky (Universität Hamburg), timarkh@gmail.com

Contacts

If you have questions, would like to propose collaboration, or noticed an error in the corpus, please contact us. You can also use the Albanian morphological analyzer and the tsakorpus corpus platform, which are open source and freely available.
