Archives

  • Anna Wamprechtshammer – Elena Arestau – Jocelyn Aznar – Hanna Hedeland – Amy Isard – Ilya Khait – Herbert Lange – Nicole Majka – Felix Rau: QUEST: Guidelines and Specifications for the Assessment of Audiovisual, Annotated Language Data
    Vol. 8 (2022)

    This guide documents the main results of the joint project “QUEST: Quality – Established: Qualitätsstandards und Kurationskriterien für audiovisuelle annotierte Sprachdaten”, which was carried out between 2019 and 2022 and funded by the German Federal Ministry of Education and Research (BMBF). The project consortium consisted of the University of Hamburg, the Leibniz-Centre General Linguistics (ZAS) in Berlin, the Archive for Spoken German (AGD)/Institute for the German Language (IDS) in Mannheim and the University of Cologne. The BBAW in Berlin was also involved through the ‘Endangered Languages Documentation Programme’.

    The main aim of the project was to maximise the potential for reuse and secondary use of audiovisual, annotated language data. For this purpose, QUEST developed quality standards and curation criteria for several reuse scenarios such as ‘Language Documentation’, ‘Learner Corpora’, ‘Interpreted Corpora’, ‘Sign Language’, ‘Language Community’, ‘Ethnography’ and ‘Oral History’. Based on this, quality assurance procedures (an online questionnaire and automated quality checks) were implemented and tested on authentic data.

    In summary, the guidelines document provides definitions and examples for the quality criteria elaborated in QUEST, which are intended to provide information on the reuse potential of audiovisual, annotated data. It also aims to give an overview of the objects and workflows of the evaluation system. Quality standards and curation criteria are linked to data maturity levels, and suggestions are made on how to evaluate each criterion.

  • Nicole Palliwoda (ed.): Data Processing and Visualization in Variational Linguistics/Dialectology
    Vol. 7 (2022)

    From 3 to 4 September 2020, the workshop Data Processing and Visualization in Variational Linguistics/Dialectology took place as a digital event organized by Ludwig Maximilian Breuer (University of Vienna) and Nicole Palliwoda (University of Kiel) via the University of Siegen. This workshop is a continuation and extension of the very successful and constructive workshops that took place in 2016 and 2017. The focus of this workshop was on the post-processing or preparation of linguistic data and on digital tools. Accordingly, in the workshop, various technical means for implementing such projects were presented and discussed directly from practice, i.e. on the basis of experience from international (variation) linguistic research projects. A special focus of the event was and still is the exchange between the individual projects and participants. The publication now brings together a total of five contributions that provide insights, results, problems and reflections on different methods and projects from the workshop.

    Editor:

    Nicole Palliwoda
    Christian-Albrechts-University of Kiel
    https://orcid.org/0000-0002-1501-8612

  • Anja Behnke – Josefina Budzisch: Selkup Language Corpus
    Vol. 6 (2021)

    This paper documents the project “Syntactic Description of the Southern and Central Selkup Dialects: A Corpus-Based Investigation”, which was carried out between 2015 and 2018 at the University of Hamburg. The project was funded by the German Research Foundation (DFG). The main goal of the project was the creation of a digital language corpus of Selkup. In addition to the originally planned texts from Central and Southern Selkup dialects, a number of Northern Selkup texts were added in the course of the project. The corpus, therefore, reflects the great dialectal diversity of Selkup.

    The paper is structured as follows: Section 2 describes the project objectives and the tasks that were carried out during the course of the project. In section 3, a short overview of Selkup is presented, with some remarks on the areal distribution as well as the linguistic status of Selkup. In section 4, metadata about the corpus are introduced; here, information about archiving and the conventions used throughout the corpus is described. Section 5 deals with the structure of the corpus and gives a detailed analysis of the transcription and annotation of the data. In section 6, a list of research based on the corpus is presented; section 7 lists the text sources for the corpus; and in section 8 references are given. The appendix contains the characters used as well as the labels for glosses and categories.

  • Alexandre Arkhipov: INEL Corpora General Transcription and Annotation Principles
    Vol. 5 (2020)

    INEL (“Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages”) is a long-term research project (2016–2033) whose primary goal is to create digital annotated corpora of several languages of Northern Eurasia, making possible typologically aware corpus-based grammatical research. As of December 2020, full versions have been released for two corpora, Kamas and Dolgan; two intermediate versions have also been published for Selkup, while the complete Selkup corpus is scheduled to appear by the end of 2021. Evenki is another currently running subproject, whose corpus is scheduled to appear at the same time. In this paper, we outline the basic principles of transcription and some aspects of other annotations (such as translations and annotations of code switching) common to all the INEL corpora.

  • Chris Lasse Däbritz: User’s Guide to INEL Dolgan Corpus
    Vol. 4 (2020)

    The present corpus of Dolgan has been created as part of the long-term research project INEL (“Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages”) in the context of the Academies’ Programme, coordinated by the Union of the German Academies of Sciences and Humanities. Its primary goal is to create digital and machine-searchable corpora of several indigenous Northern Eurasian languages. The INEL Dolgan corpus at hand fills a gap in the documentation of the indigenous languages of Northern Eurasia and makes possible further descriptions of the language. Dolgan is not completely unknown and undescribed; however, well-founded grammatical descriptions are missing, so the corpus can be a valuable tool for both language-specific and typologically oriented research.

  • Alexandre Arkhipov – Chris Lasse Däbritz – Valentin Gusev: User’s Guide to INEL Kamas Corpus
    Vol. 3 (2020)

    Kamas is an extinct Samoyedic (Uralic) language which used to be spoken in the north of the Sayan Mountains. The INEL Kamas corpus aims to bring together the whole body of available recorded spoken Kamas data. It is the first publicly available annotated and searchable digital resource for Kamas texts ever.
    The present paper documents the structure and scope of the corpus, the details of the metadata provided, the layout of the annotation tiers as well as the annotation schemes used in the corpus and can thus serve as a user guide.

  • Rahaf Farag: Conversation-analytic transcription of Arabic-German talk-in-interaction
    Vol. 2 (2019)

    The paper deals with the process of computer-aided transcription regarding Arabic-German data material for interaction-based studies. First of all, it sheds light upon some major methodological challenges posed by the conversation-analytic approaches: due to current corpus technology, the reciprocity, linearity, and simultaneity of linguistic activities cannot be reconstructed in an analytically proper way when using the Arabic characters in multilingual and bidirectional transcripts. The difficulty of transcribing Arabic encounters is also compounded by the fact that Spoken Arabic as well as its varieties and phenomena have not been standardised enough (for conversation-analytic purposes). Therefore, the second part of this paper is dedicated to preliminary, self-developed solutions, namely a systematic method for transcribing Spoken Arabic.

    Keywords: Conversation-analytic transcription, corpora of talk-in-interaction, multilingual data, multilingual transcripts, Spoken Arabic, varieties, systemisation, temporality, directionality.

  • Beáta Wagner-Nagy – Sándor Szeverényi – Valentin Gusev: User’s Guide to Nganasan Spoken Language Corpus
    Vol. 1 (2018)

    The Nganasan Spoken Language Corpus (NSLC) was created as part of the project Corpus based grammatical studies on Nganasan at the Institute of Finno-Ugric/Uralic Studies of Universität Hamburg. The primary goal of the project was to generate a digital, searchable corpus of spoken Nganasan. The language material to be integrated, glossed and annotated was collected by several researchers and is available in audio format, most of it in video format as well. The paper serves as a user guide to the corpus.