
Cognitively and Linguistically Motivated Part of Speech Tagging: Quantitative Assessment of a Near Human-Scale Computational Cognitive Model

Authors
Dr. Jerry Ball
Mr. Stu Rodgers
Institute for Defense Analyses, Operational Evaluation Division
Abstract

We provide a quantitative assessment of the part of speech tagging accuracy of the written word recognition subcomponent of the computational implementation of Double R Grammar. Double R Grammar is a cognitively and linguistically motivated, near human-scale computational cognitive model of the grammatical analysis of written English, focused on the grammatical encoding of two key dimensions of meaning: referential and relational meaning. Cognitively, the model is implemented in the ACT-R cognitive architecture. It contains a mental lexicon that encodes explicit declarative knowledge about lexical items and grammatical constructions, and a procedural memory that encodes implicit knowledge about how to grammatically analyze input expressions. At ~100,000 words and multi-word units, the mental lexicon aligns in size with numerous estimates of the human mental lexicon. The words are drawn primarily from the COCA corpus, and each is assigned a base-level activation, specific to its part of speech, based on its frequency of use in that corpus. The retrieval of lexical items corresponding to input tokens depends on this base-level activation together with activation spread from the lexical, morphological, and grammatical context. Grammatical productions determine how retrieved lexical items and projected grammatical constructions are integrated into grammatical representations; ~2,500 manually created productions cover the common grammatical patterns of English. The basic processing mechanism is pseudo-deterministic: it pursues the single best analysis but can non-monotonically adjust to the evolving context. It also adheres to two well-established cognitive constraints on human language processing: incremental and interactive processing. Linguistically, Double R Grammar aligns with cognitive and construction grammar and is strongly usage-based. On a previously unseen sample corpus of book abstracts of spy novels plus a few paragraphs of a Clive Cussler book, the computational implementation achieved a part of speech tagging accuracy of 98.48% over 1,838 tokens. On a second sample corpus of eight abstracts of books on self-help and two political biographies, it achieved an accuracy of 98.56% over 766 tokens. Although these accuracy rates are not directly comparable to those of competing machine learning approaches trained on an annotated corpus, or deep learning approaches trained on big data, the current state of the art for part of speech tagging is in the neighborhood of 98% for systems trained on the annotated Penn Treebank corpus, using the Penn Treebank tagset of 36 atomic parts of speech organized into a flat listing with no internal structure. Double R Grammar, by contrast, distinguishes 56 non-atomic parts of speech with internal structure, organized into a multiple inheritance hierarchy.
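For readers unfamiliar with ACT-R, retrieval of a lexical item i is governed by the architecture's standard activation equations. The abstract does not restate them, and the specific parameter settings used in Double R Grammar are not given here, so the following is the textbook ACT-R form rather than the model's exact configuration:

    A_i = B_i + \sum_j W_j S_{ji}
    B_i = \ln \left( \sum_{k=1}^{n} t_k^{-d} \right)

Here B_i is the frequency-driven base-level activation of item i (t_k is the time since the k-th use of the item and d is the decay rate, 0.5 by default in ACT-R), W_j is the attentional weight on context source j, and S_{ji} is the strength of association from source j to item i. A lexical item is retrieved for an input token when its total activation A_i is highest among the matching candidates and exceeds the retrieval threshold, which is how corpus frequency (via B_i) and contextual spreading activation (via the W_j S_{ji} terms) jointly determine the part of speech assigned to a token.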

Discussion
use in data coding?

Jerry -- thanks for the excellent discussion on this today. One of the challenges I've been running into on some projects -- and of course it's a long running challenge in behavioral research -- is the time it takes to code speech data. Do you think that Double R models would be able to play some role in accelerating that process? The big target wo...

-- Dr. Leslie Blaha
Cite this as:

Ball, J. T., & Rodgers, S. (2023, June). Cognitively and Linguistically Motivated Part of Speech Tagging: Quantitative Assessment of a Near Human-Scale Computational Cognitive Model. Paper presented at Virtual MathPsych/ICCM 2023. Via mathpsych.org/presentation/1267.