Analysis
- General task: convert linguistic input into useful output or an internal representation that facilitates further processing
- What sort of input
- Speech or text
- Possibly constrained by
- language
- dialect
- speaker
- genre
- content
- What sort of output
- Segmented input (by morphemes, words, phrases)
- Collection of lexical representations
- Unstructured (or segmented) data labeled (with words, parts of speech)
- Morphosyntactic+lexical representation (tree, DAG/feature structure, dependency graph)
- Semantic/pragmatic/conceptual representation (predicate calculus, Discourse Representation Theory)
- Classification (language, dialect, author, genre, value)
- How is it done?
- Chunkers
- Parsers, with or without explicit grammars
- Grammars and automata: regular grammars and finite-state automata, context-free grammars, dependency grammars
- Co-occurrence statistics
- Machine learning: hidden Markov models, classifiers, case-based reasoning, neural networks (especially in cognitive models)
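As a concrete illustration of the hidden Markov model approach to analysis, the sketch below tags a two-word sentence with parts of speech using the Viterbi algorithm. The tag set, vocabulary, and all probability values are invented for demonstration, not estimated from a corpus.

```python
# Minimal sketch of HMM part-of-speech tagging via the Viterbi algorithm.
# All probabilities below are hand-set illustrative values (assumptions),
# not estimates from real data.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words`."""
    # best[i][t] = (probability, backpointer) of the best path ending in tag t
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        row = {}
        for t in tags:
            prob, prev = max(
                (best[i - 1][s][0] * trans_p[s][t] * emit_p[t].get(words[i], 0.0), s)
                for s in tags)
            row[t] = (prob, prev)
        best.append(row)
    # Trace back from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.6}}

print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))  # ['N', 'V']
```

Even with "dogs" and "bark" each ambiguous between noun and verb, the transition probabilities select the N V reading.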
- What's hard
- Ambiguity
- Invariance: what features are relevant for categories?
- Multiple knowledge sources, including statistical, linguistic, and conceptual
- Possibility of errorful input
Generation
- General task: convert a linguistic or non-linguistic representation into a linguistic form
- What sort of input
- Raw linguistic form
- Parsed linguistic form
- Semantic/pragmatic/conceptual representation
- What sort of output
- Text (sentence, discourse)
- Speech (in speech synthesis, text-to-speech)
- How is it done?
- Grammars, automata, Markov models: finite-state, context-free, dependency
- Planners
- Incremental or pre-specified
- Grammar-driven (top-down) or lexically-driven (bottom-up)
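A minimal sketch of Markov-model generation: a bigram table is walked from a start symbol, greedily taking the most frequent successor until an end symbol is reached. The counts are invented for illustration; real generators sample probabilistically and use richer context.

```python
# Toy bigram generator. The successor counts are invented (assumptions);
# generation here is greedy (always the most frequent successor).

bigrams = {
    "<s>":    {"the": 3, "a": 1},
    "the":    {"dog": 2, "cat": 1},
    "a":      {"cat": 1},
    "dog":    {"barked": 2},
    "cat":    {"slept": 1},
    "barked": {"</s>": 2},
    "slept":  {"</s>": 1},
}

def generate(bigrams, max_len=10):
    word, out = "<s>", []
    while word != "</s>" and len(out) < max_len:
        word = max(bigrams[word], key=bigrams[word].get)  # greedy choice
        if word != "</s>":
            out.append(word)
    return " ".join(out)

print(generate(bigrams))  # the dog barked
```

This is pre-specified rather than planned generation: the model fixes word order locally, with no pragmatic goal in view.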
- What's hard
- Planning on the basis of a pragmatic goal and what the hearer/reader knows; the mutual knowledge problem
- Sometimes no obvious correspondence (or one-to-many correspondence) between input concepts and linguistic constructs
- Words and phrases must be ordered; ordering constraints may be very complex
Translation
- General task: convert speech or text in a source language into corresponding speech or text in a target language
- What sort of input
- Arbitrary source language
- Source language constrained by
- content domain
- linguistic complexity
- What sort of output
- Something, no matter how bad, for any input (browser quality translation)
- Grammatical target language text/speech, for only some input (publication quality translation)
- How is it done?
- Knowledge-based (rule-based) MT with varying depths of analysis
- Statistical MT: from a sequence of words in a source sentence, generate the sequence of words in the target language that is most likely, based on estimates of the probability of the co-occurrence of source and target words or phrases, the probability of the ordering of corresponding source and target words or phrases, and the probability of the target sentence
- Word-based and phrase-based approaches
- Machine-assisted translation
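The statistical MT idea above can be sketched as noisy-channel scoring: choose the target sentence that maximizes the product of a translation probability and a target language-model probability. Everything below is invented for illustration, including the probability values, the assumed monotone one-to-one alignment, and the tiny candidate list; real systems estimate these tables from parallel and monolingual corpora and search over an enormous candidate space.

```python
# Toy word-based SMT scoring in the noisy-channel style.
# All probabilities are invented illustrative values (assumptions).

# Translation probabilities P(source word | target word)
trans_p = {
    ("chien", "dog"): 0.9, ("chien", "hound"): 0.4,
    ("le", "the"): 0.8, ("le", "a"): 0.1,
}
# Bigram language model probabilities for the target language
lm_p = {
    ("<s>", "the"): 0.5, ("the", "dog"): 0.3, ("the", "hound"): 0.05,
    ("<s>", "a"): 0.2, ("a", "dog"): 0.2, ("a", "hound"): 0.01,
}

def score(source, target):
    """Product of translation and language-model probabilities,
    assuming a monotone one-to-one word alignment."""
    p, prev = 1.0, "<s>"
    for f, e in zip(source, target):
        p *= trans_p.get((f, e), 0.0) * lm_p.get((prev, e), 0.0)
        prev = e
    return p

source = ["le", "chien"]
candidates = [["the", "dog"], ["the", "hound"], ["a", "dog"]]
best = max(candidates, key=lambda e: score(source, e))
print(best)  # ['the', 'dog']
```

The language model is what rules out "the hound" here: its translation probability is respectable, but the bigram ("the", "hound") is rare in the (invented) target model.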
- What's hard
- Cross-linguistic ambiguity: lexical, grammatical
- Structural/word-order differences
- Lexical/grammatical gaps: there may be no target language lexical item or structure corresponding to a particular source language lexical item or structure