Linguary

by Maxwell Joslyn. August 21st, 2020 (updated November 6th, 2020.)

chinese computing linguistics translation zh

Introduction

This page describes Linguary, an application to support two major activities in language analysis: (1) storing source material which has caught the user's interest, and (2) grouping and commenting on source material to attack or defend a hypothesis. This page, like Linguary, is a work in progress.

Linguary was born to serve my own needs in translation, linguistics, and recreational language learning, but I intend to make it useful for others (though there are no plans to make it applicable to non-Chinese languages.) The application was first mentioned on this site under the earlier name "Linguist's Working Environment," in my June letter to h0p3.

Goals

Linguary is being built to handle all the vagaries of storing chunks of Chinese source material, and grouping and annotating those chunk (collectively, interacting.)

Storage means providing one place to put all the written material that is found or created while studying a language. For this Chinese scholar, such material includes:

words/phrases of interest: leaving the definition of "word" or "phrase" underspecified for now, let's just say that linguists spend a lot of time considering the internal structure of words, and the ways in which words combine into phrases.
grammatical examples
errors and corrections
personal dictionary entries (for when a 3rd-party entry lacks an apparent meaning)
source documents: stories, poems, and other target-language texts which are associated with specific creators. Note that an individual quotation is also a "document": see below.
freeform notes: in any computational system, it is endlessly helpful to have an "escape hatch" for entering data that doesn't fit any of the predefined template.

Interaction means viewing lists of stored materials, and statistics about those materials; reading and marking up source documents; entering metadata for incomplete records; and so on. Eventually, I'd like to add a "mental stimulation" interface, which, given a word of input, would show (on one screen) the results of searching for that word in external dictionaries, saved documents, other words' definitions and example sentences, and so on.

Of the two activities Linguary intends to provide, storage is more fundamental. It's easier to iterate on an application's user interface than its data model, so it'd be best to spend time getting the latter right before the app is fully developed. Interaction is, for now, a smaller topic, so let's get it out of the way before returning to the subject of storage.

Interaction

I'm currently focused on building Linguary's data storage and API. Since there exist libraries for manipulating and viewing databases, I'll use one to the greatest extent possible, and save custom frontend and UI design work for when data modeling and transmission are stable for daily work.

As of 2020-08-21, the best tools I've found in the "view a database and build basic apps on it" niche are Datasette -- an open source multi-tool for exploring and publishing data -- and Retool, a free/paid SaaS for building apps around internal business processes.

The actions available to a user will include:

create, update, and delete records of Chinese character variants
create, update, and delete individual language items
while viewing an item, pull definitions from multiple 3rd-party dictionaries into one screen for comparison
show occurrences of a target item in stored documents, and in other items' text fields
create, update, and edit collections and their accompanying writeups

Linguary itself will not support flashcards, but Linguary users will be able to export data as Anki flashcards, and possibly in other formats too. Linguary will be able to send certain commands to a running instance of Anki through the 3rd-party library AnkiConnect, and it may even be feasible to trigger Anki's built-in cloud synchronization. I've already written the Python code to programmatically add language items to an Anki flashcard deck; this code exists outside the Linguary codebase, but can be readily expanded into a small library to manage communication between Linguary and the AnkiConnect API.

Storage

The types of information which linguists encounter while working with Mandarin Chinese can be classified to form the beginnings of Linguary's data model. Data modeling has two goals: (1) distinguishing the data types which the application will store and manipulate, and (2) designing a database schema which properly encodes the relations between those data types.

I foresee Linguary's data falling into these major categories:

Syllables
Characters
Language Items
- Morphemes
- Words
- Phrases
Collections
- Errors
Documents
- Sources
- Quotations, Highlights

Syllables

Syllables ("readings", "pronunciations") are individual syllables of Mandarin Chinese, using the pinyin romanization system. There's nothing to invent here, so I'll punt on an explanation for now; see Wikipedia to learn more about the syllables of Modern Standard Mandarin, and how they're transcribed.

Regional variation causes regular (predictable) changes in how some readings are spoken. Note that this type of accent variation can produce syllables that "don't exist" as citation syllables. For instance, "zhuang" will be pronounced "zuang" by many Southern or Taiwanese Mandarin speakers. I'll provisionally refer to these as synthetic syllables.

Characters may have one or many readings, and some even have zero readings (such as when the user is entering a character they haven't fully researched, or when a character is so archaic that it has no modern reading.) Readings also vary unpredictably by region, usually under the influence of a local language. For these reasons, Linguary will store one-to-many relations between characters and syllables.

The syllables table would only store individual syllables, note that certain combinations of syllables are pronounced differently than you might expect. Such changes are a form of sandhi, which means the changes are regular. Thus, the character-pronunciation or word-pronunciation relations need not capture them.

Elsewhere on this site: The Rarest Mandarin Syllables.

(TBD: specify tone sandhi; possibly discuss topolect examples of syllable sandhi; consider erhua)

Characters

Individual Chinese characters (stored as traditional characters¹) are here considered to be elements of writing, not elements of the language itself. In technical terms, they are not morphemes, but graphemes. The exact shape of a given character varies across regions, and across its use in different languages, but these variants are all considered versions of one Platonic character. I believe this is the Unicode approach.

Elsewhere on this site: uncommon characters.

(TBD: bite bullet and dig into "simplified" vs. "traditional".)

Language Items

"Language Items" is a supercategory for any chunk of Chinese language. So far, I've specified these major types:

morphemes: meaningful subword units, such as prefixes and suffixes, as well as items more exotic to the non-linguist, like clitics and transitivizers. (For an English example of a "subword unit," consider the suffix -er, as in "worker" or "farmer." The suffix -er turns a verb, V, into a noun meaning "someone who [performs that verb].")
words, including one-character words
phrases: complete and incomplete sentences, as well as unique Chinese phrase types like chengyu and xiehouyu.

For now, items are just strings, with no structure. I'd like to change that, and add the ability to store (partial) syntax trees as well, but I'm nowhere near ready.

(TBD: transpositional variants and detection thereof; compound nouns; numeric epithets; onomatopoeia; subtypes of words/phrases. A "verified" field for items which have been checked with, or were produced by, a native speaker. A data column to explicitly mark whether a source item is correct, incorrect, questionably correct, marked, etc. as in normal linguistic studies.)

Collections

A collection enriches a set of Items with annotations, grouping, and explanatory text. These are the building blocks of linguistic arguments: a linguist will contrast multiple phrases which differ in precisely some way that that proves or disproves a conjecture about the characteristics of the target language.

TBD: example, including clear portrayal of "grouping" (e.g. two correct and two incorrect items.)

Collections will frequently be used for error explanation, in which a fragment of correct Chinese is set against its correct counterpart(s), and augmented with precise descriptions of both the error and the tactic taken to repair it.

Taking the above into account, a collection will relate to many items, and either one chunk of explanatory text, or many. As for grouping items together, as of 2020-08-30 I intend to implement such "groups" into collections themselves (which Linguary could create on the fly), so a collection can relate to many other collections.

Aside from database tables for collections and explanatory freeform-text notes, Linguary will also need tables to maintain each of the one-to-many relations collection-item, collection-note, and collection-collection.

Finally, I already have about 70 example collections which could be entered into Linguary. Countless more could be constructed in service of a pedagogical or linguistic goal.

Documents

(TBD: should documents just be really long Items? Quotations, especially if just a single sentence or less, also seem like Items. It's no good to try distinguishing Items as being less than one sentence, and Quotations/Docs as being one or more sentences, because many extremely short Items would be able to stand as full sentences in the right context.)

The documents category would primarily include what I call sources: short stories, essays, transcripts, and other textual works. A user will also be able to enter quotations without uploading an entire document; these quotations are themselves a document subtype.

If a user highlights a section of a document in the viewer interface, the highlight could be stored as a quotation. Therefore, highlights are also a type of document.

A document in the database could, in addition to its text, contain metadata like the author, publisher, publication date, access URL, and so on. This isn't a priority at first.

Data Sources

I certainly don't intend to collect all my information on the Chinese language from scratch! I've relied on dictionaries and corpora aplenty in the course of my journey. Linguary will eventually sport an bulk import system for ingesting pre-existing data sources.

Such a feature can be valuable long before it's 100% automated. An HTML scraper for a frequently-used dictionary could dump hundreds and thousands of entries into a "for review" table, letting the user categorize & annotate those items at their leisure.

There are also academic data sets made by scholars in NLP or the digital Chinese humanities which offer language material richly annotated with grammatical, historical, philological, and cultural metadata. These, too, would be worth absorbing, and I have several lists' worth of such material to post here (TBD / under construction):

CTEXT is fantastic for Literary Chinese, with its vast document collection, its API, and its text tools.
Ichacha has priceless example-sentence functionality which shows when multiple English phrases correspond to a single Chinese equivalent.
Dict.cnki.net has loads of technical vocabulary.
MDBG is an old reliable dictionary based on CC-CEDICT, though its example sentences, sourced from Juku, are terrible: nearly all are drawn from what seems to be a single software-design manual, and I've seen countless examples where the English was incorrectly interpreted when translating into Chinese.
Pleco, the king of mobile Chinese dictionaries, has digital dictionaries not available anywhere else. Though licensing agreements limit how I can use those dictionaries' data, correspondence with Mike Love, Pleco's creator, confirmed that I could make derivative works for non-commercial personal projects (e.g. visualizations of links between chengyu.)

Difficulties

WIP 2020-09-15T18:40:33-0700. TBD: lots, including simplified versus traditional.

See The Pitfalls and Complexities of Chinese to Chinese Conversion for initial discussion.

Evidence for needing to distinguish between characters (as graphemes) and words (as morphemes):

For example, an SC string such as 头发 ‘hair’ is not treated as a single unit, but is converted character by character. Since SC 头 maps only to TC 頭, the conversion succeeds. On the other hand, since SC 发 ‘hair’ maps to both TC 髮 ‘hair’ and TC 發 ‘emit’, the conversion may fail. That is, if the table maps 发 to 發, which is often the case, the result will be the nonsensical 頭發. ‘head’ + ‘emit’. On the other hand, if the table maps 发 to 髮, 头发 will correctly be converted to 頭髮, but other common words, such as SC 出发 ‘depart’, will be converted to the nonsensical 出髮 ‘go out’ + ‘hair’.

See Wikipedia: Ambiguities in Chinese character simplification.

As we have seen, orthographic converters have a major advantage over code converters in that they process word-units, rather than single codepoints. Thus SC 特征 (te4zheng1) ‘characteristic’, for example, is correctly converted to TC 特徵 (not to the incorrect 特征). Similarly, lexemic converters process lexemes. For example, SC 光盘 (guang1pan2) ‘CD-ROM’ is converted to the lexemically equivalent TC 光碟 (guang1die2), not to its orthographically equivalent but incorrect 光盤.
This works well most of the time, but there are special cases in which a polysemous SC lexeme maps to multiple TC lexemes, any of which may be correct, depending on the semantic context. We will refer to these as ambiguous polygraphic compounds.

One-to-many mappings of polysemous SC compounds occur both on the orthographic level and the lexemic level. SC 文件 (wen2-jian4) is a case in point. In the sense of ‘document’, it maps to itself, that is, to TC 文件; but in the sense of ‘data file’, it maps to TC 檔案 (dang4'an4). This could occur in the TC-to-SC direction too. For example, TC 資料 (zi1liao4) maps to SC 资料 in the sense of ‘material(s); means’, but to SC 数据 (shu4ju4) in the sense of ‘data’.

Later, deal with polysyllabic characters, words without standardized characters in Dongbei and Cantonese, "second-round" simplifications like 歺 can1 (for 餐), the invention of completely new characters, the six types of character composition, ...

User Stories

TBD. I'm not rushing to hype up Linguary before release, or sell it as a commercial product, but detailed user stories will help me clarify which features I do and don't want to deliver. In addition, if I find I can't understand how someone might use Linguary for a given task, that would drive me to either discuss it with people doing that task, that I might adapt it to their needs -- or else decide I should not attempt to serve that group of potential users.

TBD: linguist user story. Translator user story.

This is the standard for Chinese database and dictionary software, because traditional characters can be converted to their simplified ones by a mechanical, context-free procedure. Going the opposite direction is much harder to do correctly. Because multiple traditional characters were commonly merged into one simplified character, the conversion process is probabilistic, and the odds of getting the right traditional character can only be improved by analyzing the character's occurrence in context (such as by examining the other characters in a given word.)↩