Term paper specifications

You are responsible for a term paper that builds and evaluates some language technology discussed in class. The focus may be on the engineering side, or you may instead apply quantitative data and methods to questions of theoretical linguistic or psycholinguistic interest. The latter is also acceptable and encouraged so long as it involves some legitimate computational or language technology component.

If you are not sure whether your project satisfies the above specifications, email or send a Slack message with a brief description to Spencer before proceeding.

A brief list of ideas:

  1. Write a finite-state grapheme-to-phoneme conversion grammar using Pynini, then evaluate it against pronunciation dictionaries from WikiPron.
  2. Using a language model and a finite-state covering grammar built with NGram and Pynini, decode ambiguous text (e.g., text written in "chatspeak" or containing ambiguous abbreviations).
  3. Train and evaluate a tagger (for part-of-speech tags, NP chunks, or named entities) using an existing tagging toolkit.
  4. Train and evaluate a text classifier using scikit-learn.
  5. Measure and evaluate the distributional properties of some linguistic alternation (e.g., vowel harmony).
  6. Test predictions of some syntactic or semantic theory via usage statistics (e.g., what drives the Scandinavian embedded V2 phenomenon).
  7. Quantify properties of child-directed or child-produced speech (e.g., the distribution of determiner usage).
  8. Apply the Tolerance Principle to some novel domain of interest (e.g., my judgment is that the plural form of "diagnosis" might be a gap. Is this predicted? Ask me more about this if you're interested.)
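
For idea 8, the core computation is small. Below is a minimal sketch, assuming the usual statement of the Tolerance Principle: a rule that applies to N items is productive if its exceptions number at most N / ln N. The function names are hypothetical, and the counts in the usage note are placeholders, not facts about any real lexicon.

```python
import math


def tolerance_threshold(n: int) -> float:
    """Return N / ln(N), the maximum number of tolerable exceptions."""
    if n < 2:
        # ln(1) = 0, so the threshold is undefined below two items.
        raise ValueError("n must be at least 2")
    return n / math.log(n)


def is_productive(n: int, exceptions: int) -> bool:
    """Return True if a rule over n items tolerates this many exceptions."""
    if not 0 <= exceptions <= n:
        raise ValueError("exceptions must be between 0 and n")
    return exceptions <= tolerance_threshold(n)
```

With 100 hypothetical items the threshold is about 21.7, so is_productive(100, 21) holds and is_productive(100, 22) does not. The hard part of such a project is deciding what counts as an item and as an exception, not the arithmetic.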

What to submit

Your submission should include:

  1. Any interesting samples of code (though I won't be reviewing code quality in my grading)

  2. Data used (or instructions or code to obtain it, if it's more than 15 MB or so)

  3. A write-up of several pages describing:

    1. what you did
    2. why it might be useful to automate, or why it is of linguistic interest
    3. the data you used
    4. the software you used and/or developed
    5. the results of your evaluation

Rubric

The term paper will be graded on the degree to which the submission satisfies the above specifications.

I will grade the submission up to the point where I am required to submit grades to the registrar's office; this is usually a week or so after the end of the semester. If I have not received a term paper by then, you will receive an "I" (incomplete) grade until you submit the term paper.

Hints

  1. While it's possible to work with audio data for this project, it's a lot harder than working with discrete data (e.g., text) unless you've also studied acoustic phonetics and/or signal processing.
  2. It's okay (good, even) if this harmonizes with some other projects you're doing for credit (e.g., qualifying papers), so long as you make it clear in your write-up what part of the project is unique to the term paper.

Proposal

To propose a topic, send a brief description of the project to Spencer.

Submission

Submit the term paper via email, sending it to Spencer.