Alexandre Arkhipov: INEL Corpora General Transcription and Annotation Principles. Vol 5 (2020)
INEL (“Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages”) is a long-term research project (2016–2033) whose primary goal is to create digital annotated corpora of several languages of Northern Eurasia, making typologically aware, corpus-based grammatical research possible. As of December 2020, full versions have been released for two corpora, Kamas and Dolgan; two intermediate versions have also been published for Selkup, and the complete Selkup corpus is scheduled to appear by the end of 2021. Evenki is another currently running subproject, and its corpus is scheduled to appear at the same time. In this paper, we outline the basic principles of transcription and some aspects of other annotations (such as translations and annotations of code switching) that are common to all the INEL corpora.
Chris Lasse Däbritz: User’s Guide to INEL Dolgan Corpus. Vol 4 (2020)
The present corpus of Dolgan has been created as part of the long-term research project INEL (“Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages”) in the context of the Academies’ Programme, coordinated by the Union of the German Academies of Sciences and Humanities. Its primary goal is to create digital and machine-searchable corpora of several indigenous Northern Eurasian languages. The INEL Dolgan corpus at hand fills a gap in the documentation of the indigenous languages of Northern Eurasia and makes further descriptions of the language possible. Dolgan is not completely unknown or undescribed; however, well-founded grammatical descriptions are missing, so the corpus can be a valuable tool for both language-specific and typologically oriented research.
Alexandre Arkhipov – Chris Lasse Däbritz – Valentin Gusev: User’s Guide to INEL Kamas Corpus. Vol 3 (2020)
Kamas is an extinct Samoyedic (Uralic) language which used to be spoken north of the Sayan Mountains. The INEL Kamas corpus aims to bring together the whole body of available recorded spoken Kamas data. It is the first publicly available annotated and searchable digital resource for Kamas texts.
The present paper documents the structure and scope of the corpus, the details of the metadata provided, the layout of the annotation tiers as well as the annotation schemes used in the corpus and can thus serve as a user guide.
Rahaf Farag: Conversation-analytic transcription of Arabic-German talk-in-interaction. Vol 2 (2019)
The paper deals with the process of computer-aided transcription regarding Arabic-German data material for interaction-based studies. First of all, it sheds light upon some major methodological challenges posed by the conversation-analytic approaches: due to current corpus technology, the reciprocity, linearity, and simultaneity of linguistic activities cannot be reconstructed in an analytically proper way when using the Arabic characters in multilingual and bidirectional transcripts. The difficulty of transcribing Arabic encounters is also compounded by the fact that Spoken Arabic as well as its varieties and phenomena have not been standardised enough (for conversation-analytic purposes). Therefore, the second part of this paper is dedicated to preliminary, self-developed solutions, namely a systematic method for transcribing Spoken Arabic.
Keywords: Conversation-analytic transcription, corpora of talk-in-interaction, multilingual data, multilingual transcripts, Spoken Arabic, varieties, systemisation, temporality, directionality.
Beáta Wagner-Nagy – Sándor Szeverényi – Valentin Gusev: User’s Guide to Nganasan Spoken Language Corpus. Vol 1 (2018)
The Nganasan Spoken Language Corpus (NSLC) was created as part of the project Corpus based grammatical studies on Nganasan at the Institute of Finno-Ugric/Uralic Studies of Universität Hamburg. The primary goal of the project was to generate a digital, searchable corpus of spoken Nganasan. The language material to be integrated, glossed and annotated was collected by several researchers and is available in audio format; most of it is available in video format as well. This paper serves as a guide to the corpus.