Albanian National Corpus

This is the main page of the Albanian National Corpus (ANC), which contains 31.12 million words.

The Corpus is designed for anyone interested in the Albanian language. It can provide reference data both for professional linguists and for those who take an interest in Albanian and its history because of their occupation or out of mere curiosity. The material collected in the Corpus can be used to study the Albanian lexicon and grammar, as well as to investigate the changes that took place in Albanian over previous centuries.

Since 2012, the Albanian National Corpus has included primarily original prose texts (fiction, non-fiction, press) representing the Albanian literary language since the 1970s, as well as translated texts and Albanian poetry. The project aims at the widest possible coverage: not only modern written texts of various genres, but also texts created in different periods of the history of Albanian, as well as colloquial and dialectal texts representing non-literary forms of the modern language. The Corpus of early Albanian texts was opened in September 2019 and continues to develop.

The texts of ANC have user-friendly metadata descriptions and morphological annotation consisting of tags assigned to words.

Annotation

automatic morphological annotation (lemmatization, part of speech, all inflectional features); 93.9% of words are analyzed (only tokens that do not contain digits or characters from other alphabets are taken into account)
no disambiguation at the current stage
glossing
English translation of lemmata
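
Because the annotation is automatic and not disambiguated, a single word form can carry several analyses at once. The sketch below illustrates this with a hypothetical in-memory representation. The class names and field layout are assumptions made for illustration and are not the Corpus's actual data format. The example word form vajza (lemma vajzë, 'girl') is chosen because it can be read either as a definite nominative singular or as an indefinite plural; the tags come from the tagset listed on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """One possible morphological reading of a word form (hypothetical layout)."""
    lemma: str
    pos: str            # part-of-speech tag, e.g. "NOUN"
    features: list      # inflectional tags, e.g. ["f", "sg", "nom", "def"]
    translation: str    # English translation of the lemma


@dataclass
class Token:
    """A word form together with every analysis the analyzer proposes."""
    wordform: str
    analyses: list = field(default_factory=list)

    def is_ambiguous(self) -> bool:
        # Without disambiguation, all competing readings are kept side by side.
        return len(self.analyses) > 1


# Illustrative example, not actual Corpus output.
vajza = Token("vajza", [
    Analysis("vajzë", "NOUN", ["f", "sg", "nom", "def"], "girl"),
    Analysis("vajzë", "NOUN", ["f", "pl", "nom", "indef"], "girl"),
])
```

A search for a given tag combination would then match a token if any one of its analyses fits, which is why undisambiguated corpora tend to return some false positives.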

Metadata

title of the text
author or title of the newspaper
creation year (exact date in the case of newspapers)
genre / type of text

About corpora

Currently, two corpora are available: the Corpus of modern literary Albanian («Main corpus») and the Corpus of early Albanian texts. They differ in the kinds of texts they contain and in how these texts are presented, but their annotation and search capabilities are mostly identical. The main characteristics of the corpora are as follows:

Main corpus

The Main corpus contains 31.12 million words and consists of the following subcorpora:

Press
Size: 23.4 million words (75.2% of the Main corpus)
Texts: modern newspapers in Albanian:
Agjencia Telegrafike Shqiptare (Albanian Telegraphic Agency online, Albania, 2016–2017), 2.5 mln words
Gazeta Shqiptare (Albania, 2006–2007), 1.9 mln words
Panorama (Albania, 2014–2017), 8.7 mln words
Zëri (Kosovo, 2013), 6.2 mln words
Koha.mk (North Macedonia, 2011–2013), 4.1 mln words
Language variety: In most cases, standard written Albanian or close to it. Interviews and quotations may contain examples of dialectal speech.

Fiction
Size: 3.2 million words (10.3% of the Main corpus)
Texts: authentic and translated fiction texts: novels, novellas, short stories, plays, books for children (Albania, Kosovo, North Macedonia, 1960–2010s)
Language variety: In most cases, standard written Albanian or close to it. The speech of the characters may contain examples of dialectal speech or dialect stylization.

Nonfiction
Size: 4.3 million words (13.8% of the Main corpus)
Texts: authentic and translated nonfiction texts:
essays, interviews, epistolary texts, memoirs, diaries, notes: essays and correspondence of L. Poradeci (1910–1940s); modern essays, diaries and memoirs (Albania, Kosovo, North Macedonia, 1950–2010s), 1.05 mln words
religious texts: the Bible (Old Testament, translated by the Albanian Bible Society; New Testament, translated by V. Dervishi), the Quran, 0.75 mln words
scientific texts and textbooks (humanities and social sciences; Albania, Kosovo, 2000–2010s), 1.9 mln words
official texts (legislative and diplomatic; Albania, Kosovo, 1990–2000s), 0.6 mln words
Language variety: In most cases, standard written Albanian or close to it. Texts created before the 1950s may use earlier variants of the literary language.

Poetry
Size: 0.2 million words (0.7% of the Main corpus)
Texts: Albanian poetry of the first half of the 20th century (works of Asdreni, A.Z. Çajupi, Migjeni, F. Noli, L. Poradeci); poetry of Albania, Kosovo and North Macedonia, 1950–2010s
Language variety: In most cases, standard written Albanian or close to it. Poetic texts created before the 1950s may use earlier variants of the literary language.
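
The sizes and shares given above can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded figures from this page; the variable and function names are chosen for illustration.

```python
# Rounded sizes in millions of words, as given on this page.
MAIN_CORPUS_MLN = 31.12

subcorpora_mln = {
    "Press": 23.4,
    "Fiction": 3.2,
    "Nonfiction": 4.3,
    "Poetry": 0.2,
}

press_sources_mln = {
    "Agjencia Telegrafike Shqiptare": 2.5,
    "Gazeta Shqiptare": 1.9,
    "Panorama": 8.7,
    "Zëri": 6.2,
    "Koha.mk": 4.1,
}


def share(size_mln, total_mln=MAIN_CORPUS_MLN):
    """Percentage of the Main corpus, rounded to one decimal place."""
    return round(100 * size_mln / total_mln, 1)


shares = {name: share(size) for name, size in subcorpora_mln.items()}

# The newspaper sizes should add up to the Press subcorpus size.
press_total = round(sum(press_sources_mln.values()), 1)

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"Press sources total: {press_total} mln words")
```

With these rounded inputs, the Press, Fiction and Nonfiction shares match the stated percentages, and the newspaper sizes sum to 23.4 million words. The Poetry share comes out at 0.6% rather than the stated 0.7%, presumably because the published figure was computed from exact word counts rather than from the rounded 0.2 million.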


Corpus of early Albanian texts

This Corpus is intended to include the main monuments of Albanian writing: the works by Gj. Buzuk, L. Matranga, P. Budi, F. Bardhi, P. Bogdani, P. Mazreku, the Kanun of Arbëria, etc.

To date, the text of the «Christian Doctrine» by Luka Matranga (1592) has been included in the Corpus of early Albanian texts. Matranga’s book is the second largest text written in the Albanian language, and the first in the Tosk dialect.

The text used in the Corpus is based on the so-called manuscript A. We used the transcription by M. de Vaan, which is based on M. Sciambra, La «Dottrina cristiana» albanese di Luca Matranga (1964), and is available at TITUS. We are grateful to M. de Vaan for providing us with the text version of the transcription. During processing, the text was checked against other editions of the book, including facsimile editions.

Publications

In our publications you may find more information about the Albanian National Corpus, its contents, and development:

Morozova, Maria. Shënime për standardin morfologjik të Korpusit nacional të shqipes [Notes on the morphological standard of the Albanian National Corpus] // Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXI Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 13–27 gusht 2012 / kryered. B. Rugova. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, 2012. 31/1. F. 153–156.

Arkhangelskij, Timofej, Mikhail Daniel, Maria Morozova, Aleksandër Rusakov. Korpusi i gjuhës shqipe: drejtimet kryesore të punës [Albanian Corpus: Main directions of work] // Shqipja dhe gjuhët e Ballkanit. Albanian and Balkan Languages. Konferencë e mbajtur më 10–11 dhjetor 2011 në Prishtinë / red. R. Ismajli. Prishtinë: ASHAK, 2012. F. 635–642.

Rusakov, Aleksandër, Maria Morozova. Korpusi i gjuhës shqipe: problemet dhe rezultatet [Albanian Corpus: Problems and results] // Studime për nder të Rexhep Ismajlit me rastin e 65-vjetorit të lindjes / red. B. Rugova. Prishtinë: Koha, 2012. F. 639–649.

Morozova, Maria, Alexander Rusakov, Marina Domosiletskaya. Albanskaya imennaya morfologiya v korpusnom predstavlenii: Natsionalnyj korpus albanskogo yazyka [The Albanian noun morphology in the morphological model developed for the Albanian National Corpus] // Albanskaya filologiya, balkanistika, problemy yazykoznaniya. K 100-letiyu so dnya rozhdeniya chlena-korrespondenta RAN Agnii Vasiliyevny Desnitskoj / red. A. Kh. Girfanova, M. V. Domosiletskaya, A. V. Zhugra, N. N. Kazansky, A. Yu. Rusakov, N. L. Sukhachev. SPb: Nauka, 2013. S. 120–130.

Morozova, Maria, Alexander Rusakov. Korpusi elektronik i shqipes: përpunimi, përmbajtja dhe përdorimi [Electronic corpus of Albanian: Development, contents and use] // Seminari Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Materialet e punimeve të Seminarit XXXII Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare: Prishtinë, 19–30.08.2013 / kryered. B. Rugova. Prishtinë: Universiteti i Prishtinës: Fakulteti i filologjisë, 2014. 33/1. F. 85–96.

Morozova, Maria, Alexander Rusakov. Albanian National Corpus: Composition, Text Processing and Corpus-Oriented Grammar Development // Sprache und Kultur der Albaner. Zeitliche und räumliche Dimensionen. Akten der 5. Deutsch-albanischen kulturwissenschaftlichen Tagung (5.–8. Juni 2014, Buçimas bei Pogradec, Albanien) / Hrsg. von B. Demiraj. Wiesbaden: Harrassowitz Verlag, 2015. (Albanische Forschungen, 37). S. 270–308.

Morozova, Maria, Timofey Arkhangelsky, Mikhail Daniel, Alexander Rusakov. Albanskij natsionalnyj korpus: osnovnye napravleniya raboty [Albanian National Corpus: Main directions of work] // Acta Linguistica Petropolitana. Trudy Instituta lingvisticheskikh issledovanij RAN / otv. red. N. N. Kazansky. SPb: Nauka, 2016. T. XII, Ch. 3. S. 169–189.

How to cite the Corpus

If you use information or examples from the Albanian National Corpus in your research, please cite it as follows:

Maria Morozova, Alexander Rusakov, Timofey Arkhangelskiy. Albanian National Corpus. (Available online at: albanian.web-corpora.net, accessed on [date of access].)

Tagset

The grammatical features of the words in the corpora are marked with short tags. Here is the full list of tags used in the Albanian corpora. Both corpora use the same set of tags.

A — adjective
ADV — adverb
ART — article
CLIT_PRO — pronominal clitic
CONJ — conjunction
FORM — formative suffix
INTJ — interjection
NOUN — noun
NUM — numeral
PART — particle
PR — preposition
PRO — pronoun
V — verb
VB_PART — verbal particle
1 — 1st person
2 — 2nd person
3 — 3rd person
abl — ablative
abl2 — ablative 2
acc — accusative
acc_1pl — accusative clitic 1st person plural
acc_1sg — accusative clitic 1st person singular
acc_2pl — accusative clitic 2nd person plural
acc_2sg — accusative clitic 2nd person singular
acc_3pl — accusative clitic 3rd person plural
acc_3sg — accusative clitic 3rd person singular
act — active voice
adm — admirative mood
anim — animate noun
aor — aorist
cont — contact position of article
dat_1pl — dative clitic 1st person plural
dat_1sg — dative clitic 1st person singular
dat_2pl — dative clitic 2nd person plural
dat_2sg — dative clitic 2nd person singular
dat_3pl — dative clitic 3rd person plural
dat_3sg — dative clitic 3rd person singular
def — definite form
dist — distant position of article
f — feminine gender
gen_dat — genitive-dative
Gheg — Gheg lemma or word
imp — imperative mood
inanim — inanimate noun
ind — indicative mood
indef — indefinite form
ipf — imperfect
loc — locative
m — masculine gender
med — medial verb
mf — ambigenous noun
n — neuter gender
nom — nominative
nonst — non-standard lemma or word
opt — optative mood
pass — non-active voice
pl — plural
pres — present tense
ptcp — participle
sbjv — subjunctive mood
sg — singular
unmkd — unmarked case
vi — intransitive verb
vt — transitive verb
with_abl — preposition governs ablative
with_acc — preposition governs accusative
with_loc — preposition governs locative
with_nom — preposition governs nominative
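
When working with search results, it can be convenient to expand the short tags into their descriptions. The following is a minimal sketch that assumes the tags of one analysis arrive as a comma-separated string such as "NOUN,f,sg,nom,def". That string format, the function name, and the choice of which tags to include are assumptions made for illustration; only the tags and their descriptions are taken from the list above.

```python
# Part-of-speech tags from the tagset above.
POS_TAGS = {
    "A", "ADV", "ART", "CLIT_PRO", "CONJ", "FORM", "INTJ",
    "NOUN", "NUM", "PART", "PR", "PRO", "V", "VB_PART",
}

# A subset of the grammatical tags; the remaining ones can be added the same way.
FEATURE_DESCRIPTIONS = {
    "f": "feminine gender",
    "m": "masculine gender",
    "n": "neuter gender",
    "sg": "singular",
    "pl": "plural",
    "nom": "nominative",
    "acc": "accusative",
    "gen_dat": "genitive-dative",
    "abl": "ablative",
    "def": "definite form",
    "indef": "indefinite form",
    "anim": "animate noun",
    "inanim": "inanimate noun",
}


def describe(tag_string, sep=","):
    """Split a tag string into part-of-speech tags and feature descriptions.

    Returns a pair (pos_tags, descriptions). Tags missing from
    FEATURE_DESCRIPTIONS are reported as unknown instead of being dropped.
    """
    tags = [t.strip() for t in tag_string.split(sep) if t.strip()]
    pos_tags = [t for t in tags if t in POS_TAGS]
    features = [t for t in tags if t not in POS_TAGS]
    descriptions = [
        FEATURE_DESCRIPTIONS.get(t, f"unknown tag: {t}") for t in features
    ]
    return pos_tags, descriptions


pos, feats = describe("NOUN,f,sg,nom,def")
print(pos, feats)
```

Keeping the part-of-speech tags in a separate set makes it easy to tell the word class apart from the inflectional features, which is usually the first distinction needed when filtering results.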

Authors

The Corpus was developed through the joint efforts of a team of linguists from Saint Petersburg (Institute for Linguistic Studies of the Russian Academy of Sciences) and Moscow (the School of Linguistics at HSE).

The previous version used the search engine of the Eastern Armenian National Corpus (EANC). The current version of ANC uses the morphological analyzer and the tsakorpus platform developed by Timofey Arkhangelsky.

The main ideas behind the corpus representation of Albanian grammar, the metatext markup and the grammatical annotation belong to Maria Morozova and Alexander Rusakov. Timofey Arkhangelsky and Mikhail Daniel took part in the discussion of these issues.

The following people were involved in the development of the Corpus:

Maria Morozova, Saint Petersburg (corpus representation of Albanian grammar, grammatical annotation, metatext markup, corpus wordlist, text collection and processing)
Alexander Rusakov, Saint Petersburg (corpus representation of Albanian grammar, collection and processing of texts)
Timofey Arkhangelsky, Moscow — Hamburg (development of the morphological analyzer, the corpus platform, and the user interface, technical support)
Marina Domosiletskaya, Saint Petersburg (compilation of the corpus wordlist — nouns)
Anna Konovalenko, Saint Petersburg (compilation of the corpus wordlist — adverbs)
Anastasia Sidko, Saint Petersburg (compilation of the corpus wordlist — nouns, adjectives, verbs)
Daria Alekseeva, Saint Petersburg (collection and processing of texts)
Elizabeth Atakova, Saint Petersburg (collection and processing of texts)
Varvara Diveeva, Saint Petersburg (collection and processing of texts)
Maxim Makartsev, Moscow — Oldenburg (providing texts for the corpus)
Besim Kabashi, Munich (providing texts for the corpus)
Qerim Ondozi, Prishtina (collection and processing of texts)

The Corpus team is grateful to the publishing house «Onufri» (Tirana) for its help in the selection of texts. The first version of the Albanian National Corpus was created with the financial support of the «Corpus linguistics» Program of the Presidium of the Russian Academy of Sciences. The current version of the Corpus is hosted by the School of Linguistics at HSE.

The Corpus is currently maintained and developed by:

Maria Morozova (ILS RAS, SPbSU), morozovamaria86@gmail.com

Alexander Rusakov (ILS RAS, SPbSU), ayurusakov@gmail.com

Timofey Arkhangelsky (Universität Hamburg), timarkh@gmail.com
