From ht@cogsci.ed.ac.uk Fri Mar 19 10:47:50 1993
Path: pavo.csi.cam.ac.uk!doc.ic.ac.uk!uknet!edcastle!aisb!cogsci!ht
From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
Newsgroups: comp.speech
Subject: HCRC Map Task Corpus on CD: Audio and transcripts of natural speech
Message-ID:
Date: 18 Mar 93 23:12:40 GMT
Sender: news@aisb.ed.ac.uk (Network News Administrator)
Distribution: comp
Organization: HCRC, University of Edinburgh
Lines: 134

The HCRC Map Task Corpus

The Human Communication Research Centre (HCRC) is happy to announce the release of the Map Task Corpus. The Map Task Corpus is a set of 8 CD-ROMs containing linked audio and transcriptions of a total of about 18 hours of spontaneous speech, recorded from 128 two-person conversations according to a detailed experimental design.

Altogether, the corpus as distributed provides a thorough and invaluable set of resources and tools for use in analyzing all levels of linguistic structure, via both text-based and speech-based investigation. The range of research questions that are addressable using this corpus spans a wide spectrum of linguistic and cognitive issues. We have kept the price as low as possible to encourage researchers from many disciplines to use this corpus as a common reference point for many different kinds of research.

The HCRC is an interdisciplinary research centre at the Universities of Edinburgh and Glasgow, supported by the UK Economic and Social Research Council and the Universities Funding Council. The publication of the Map Task Corpus was made possible by assistance from the Linguistic Data Consortium.

Corpus Details

Sixty-four speakers (32 female, 32 male, all adults) each took part in four conversations in a quiet recording studio. They were all students at the University of Glasgow, 61 of them being native Scots. The conversations were carried out in an experimental setting in which each participant has a schematic map in front of them, not visible to the other.
Each map consists of an outline and roughly a dozen labelled features (e.g. "a white cottage", "an oak forest", "Green Bay"). Most features are common to the two maps, but not all. One map has a route drawn in, the other does not. The task is for the participant without the route to draw one on the basis of discussion with the participant with the route.

In addition to the conversations, each speaker provides a wordlist reading, consisting of the major vocabulary items contained in the conversations. All recordings were made direct to Digital Audio Tape (DAT) at 48 kHz, providing very good acoustic quality.

The experimental design allows a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way. In particular, maps and feature names were designed to allow for controlled exploration of phonological reductions of various kinds in a number of different referential contexts, and to provide, via varying patterns of matches and mismatches between the two maps, a range of different stimuli for referent negotiation. The conditions of the conversations were also carefully balanced: in half of them the speakers were strangers, in half friends; in half of them the speakers could see each other's faces, in half they could not.

Subjects accommodated easily to the task and experimental setting, and produced evidently unselfconscious and fluent speech. The syntax is largely clausal rather than sentential, and the conversations show good turn-taking, with modest amounts of overlap and interruption. The total corpus runs to about 18 hours of speech, with the transcripts consisting of around 150,000 word tokens drawn from just over 2,000 word form types.

Transcription is at the orthographic level and quite detailed, including filled pauses, false starts and repetitions, broken words, etc. Considerable care has been taken to ensure consistency of notation, which is thoroughly documented.
Although the full complexity of overlapped regions has not been reflected in the transcriptions, such regions are clearly set off from the rest of the transcripts. Transcripts are connected to the sampled acoustic data by sample numbers marked every few turns.

CD-ROM Contents

The waveform data are provided in "raw" (headerless) files (16-bit samples, 20 kHz sample rate, 2 channels per conversation), and alternative header files are provided for use with software based on either the NIST "SPHERE" header structure or the European "SAM" header structure. Transcriptions are provided for each conversation, marked up with TEI-compliant SGML, in a minimally intrusive and easily separated way.

PostScript files of the map images used in the experiments are provided, along with full documentation of the experimental design and data collection protocol, and resources for using SGML tools on the transcriptions and other text materials. Also included is an extensive set of source code for performing basic signal processing functions on the waveform data, such as down-sampling, de-multiplexing, channel summation, and D/A conversion for Sun workstations (including playback of segments selected via inspection of transcripts in Emacs).

The CD-ROMs are in High Sierra (ISO 9660) format with the Rock Ridge extensions, and are compatible with (inter alia) Unix, MS-DOS and Macintosh operating systems.

Copies of the Map Task Corpus are available from the LDC for $200 or from HCRC for 164.50 UK pounds (including VAT) at the addresses given below, plus postage and packing as necessary. Please contact us (by e-mail if possible) for details of payment methods and shipping costs.

In Europe please contact

    Henry Thompson
    University of Edinburgh
    Human Communication Research Centre
    2 Buccleuch Place
    Edinburgh EH8 9LW
    Scotland
    Tel: +44 31 650-4440
    Fax: +44 31 650-4587
    email: maptask@cogsci.ed.ac.uk

or

    Dawn Griesbach
    ELSNET
    2 Buccleuch Place
    Edinburgh EH8 9LW
    Scotland
    Tel: +44 31 650-4594
    Fax: +44 31 650-4587
    email: elsnet@cogsci.ed.ac.uk

Outside Europe please contact

    Elizabeth Hodas
    Linguistic Data Consortium
    441 Williams Hall
    University of Pennsylvania
    Philadelphia, PA 19104-6305
    Tel: (215) 898-0464
    Fax: (215) 573-2175
    email: ehodas@unagi.cis.upenn.edu

--
  Henry Thompson, Human Communication Research Centre, University of Edinburgh
  2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 31 650-4440
  Fax: (44) 31 650-4587
  ARPA: ht@cogsci.ed.ac.uk JANET: ht@uk.ac.ed.cogsci
  UUCP: ...!uunet!mcsun!uknet!cogsci!ht
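
To illustrate the raw waveform layout described under CD-ROM Contents (headerless files, 16-bit samples, 20 kHz, 2 channels per conversation), here is a minimal Python sketch of de-multiplexing, channel summation, and converting a transcript sample number to a time offset. It is not the source code shipped on the discs. It assumes the two channels are stored as interleaved signed 16-bit samples, that the byte order is big-endian (a guess; the corpus documentation is the authority), and that transcript sample numbers count per-channel sample positions. All function names are hypothetical.

```python
import struct

SAMPLE_RATE_HZ = 20_000   # sample rate stated in the announcement
CHANNELS = 2              # two channels per conversation
BYTES_PER_SAMPLE = 2      # 16-bit samples


def demultiplex(raw: bytes, byte_order: str = ">"):
    """Split raw stereo bytes into two per-channel lists of ints.

    ASSUMPTIONS (not stated in the announcement): samples are signed,
    interleaved channel-0/channel-1, and big-endian by default.
    Pass byte_order="<" for little-endian data.
    """
    frame_bytes = CHANNELS * BYTES_PER_SAMPLE
    n_frames = len(raw) // frame_bytes          # ignore any trailing partial frame
    fmt = f"{byte_order}{n_frames * CHANNELS}h"  # 'h' = signed 16-bit integer
    samples = struct.unpack(fmt, raw[: n_frames * frame_bytes])
    return list(samples[0::CHANNELS]), list(samples[1::CHANNELS])


def sum_channels(first, second):
    """Mix two channels into one.

    The sum is halved so the result stays within the 16-bit range;
    this scaling is a choice made for this sketch.
    """
    return [(a + b) // 2 for a, b in zip(first, second)]


def sample_to_seconds(sample_number: int) -> float:
    """Convert a sample number to seconds from the start of the file.

    ASSUMPTION: sample numbers index per-channel sample positions.
    """
    return sample_number / SAMPLE_RATE_HZ
```

For example, under these assumptions a transcript marker at sample 1,200,000 would fall 60 seconds into the conversation, and `demultiplex(open(path, "rb").read())` (with `path` a placeholder for a waveform file on the disc) would return one list of samples per speaker channel.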