LING83800 – S25

Methods in Computational Linguistics II
CUNY Graduate Center – Spring 2025

Instructor: Prof. Spencer Caplan
Practicum leader: Natalia Tyulina
Lecture: Monday 4:15-6:15, GC 3306
Practicum: Wednesday 4:15-6:15, Ling Thesis Room
Office hours: Thursdays 3:15-4:15, GC 7400.02 and by appt
Questions and Discussion: Reach out in the Slack! (#methods-ii-s25)

Synopsis

This course is the second of a two-semester series introducing computational linguistics and modern software development. The intended audience is students interested in speech and language processing technologies, though the materials will be beneficial to all language researchers.

Objectives

Using the Python programming language, students will learn the formalisms and tools used to build speech and language technologies.

Materials

Readings will be assigned throughout the term and posted to the course schedule, although there will be no official "textbook." Students are strongly encouraged to bring a laptop computer to the lecture and practicum. Students are also welcome to use the Computational Linguistics Laboratory (7400.13) for practice and assignments.

Assignments

Assignments will take the form of small software development projects accompanied by a write-up describing the general approach taken and any challenges encountered. Students will usually be able to verify the technical correctness of their code by running a provided unit test. Students will also be graded on the readability of their code and the quality of the write-up. We will use GitHub for assignment turn-in. You will receive a link, either in class or via email, which will generate your GitHub repo for each assignment.

Since common questions about assignments often arise throughout the term, I have set up a Slack channel for discussion. Please contact Spencer if you don't have access to it.

The final assignment will be an open-ended project which will either extend earlier projects or build and evaluate a speech and language technology system. Students are encouraged to conceive of projects relevant to their research interests. Students should discuss project plans with the instructor during office hours to confirm that the project is both feasible and of appropriate scope. Because of the open-ended nature of the final assignment, unit tests will not be provided.

Grading

80% of students' grades will be derived from the assignments; the remaining 20% will be reserved for participation and attendance. Assignments must be submitted on time or they will receive a grade of 0 (barring a documented emergency). No separate grade will be assigned for the practicum.

Accommodations

The instructor will attempt to provide all reasonable accommodations to students upon request. If you believe you are covered under the Americans With Disabilities Act, please direct accommodations requests to Vice President for Student Affairs Matthew G. Schoengood.

Attendance

Students are expected to attend all lectures and practica (in person). There will in general be no accommodation to attend class online. However, students who have reason to believe they may be contagious with an infectious disease should stay at home and contact the instructor. Other absences will not be excused, and the instructor reserves the right to tie grades to attendance records. The instructor and practicum leader are not responsible for reviewing materials missed due to absence.

Integrity

In line with the Student Handbook policies on plagiarism, students are expected to complete their own work. However, a student is permitted to collaborate with another student during the coding phase of an assignment so long as they: do not share lines of code with each other, mutually disclose their collaboration in their write-ups, and do not collaborate at all on their write-ups.

The general ethos of the integrity policy is that actions which shortcut the learning process are forbidden while actions which promote learning are encouraged. Studying lecture notes together, for example, provides an additional avenue for learning and is encouraged. Using a classmate’s solution to a homework, however, is prohibited because it avoids the learning process entirely. If you have any questions about what is or is not permissible, please contact your instructor.

The instructor reserves the right to refer violations to the Academic Integrity Officer.

Respect

For the sake of privacy, students are asked not to record lectures. Students are expected to be considerate of their peers and to treat them with respect during class discussions.

Schedule

(Please note that this schedule will be updated dynamically throughout the semester and is subject to change.)

Day	Date	Due	Class	Topics	Slides	Reading
W0
M	1/27		Lect.	Syllabus; tooling; Review BSTs	Slides-L0	None
W	1/29		No class (Lunar New Year)
W1
M	2/3		Lect.	Git; GitHub; Command-line things; AVL-trees	Handout-L1 Slides-L1	Core: Chacon & Straub ch. 1.1-3.2, 6.1-6.3; Canonical Tutorial Command-Line; AVL Wikipedia
W	2/5		Prac.	First practice	Handout-P1
W2
M	2/10	HW1 due	Lect.	Formal languages I	Handout-L2	Core: Partee et al. ch. 1; Additional: (Hopcroft et al. ch. 1.5)
W	2/12		No class (GC Closed for Lincoln's Birthday)
W3
T	2/18	HW2 due	Lect.	(GC on Mon. schedule) Formal languages II	Handout-L3 Slides-L3	Core: Jäger & Rogers; Additional: (Graf)
W	2/19		Prac.	Working with subregular languages	Notebook
W4
M	2/24		Lect.	FSAs	Slides-L4	Core: Gorman & Sproat ch. 1-1.4; Jurafsky & Martin ch. 2-2.1; Additional: (Freeman et al. ch. 10) (Hopcroft et al. ch. 3-3.1, 3.3)
W	2/26		Prac.	Pynini	Notebook
W5
M	3/3		Lect.	FSTs	Slides-L5	Core: Gorman & Sproat ch. 5; Additional: (Hopcroft et al. ch. 2) (Hopcroft et al. ch. 3.2)
W	3/5	HW3 due	Prac.	Rewrite Rules	Notebook
R	3/6		Lect.	Probability (GC on Wed. schedule)	Handout-L6	Core: Manning & Schütze ch. 2
W6
M	3/10		Lect.	Language models I	Slides-L7	Core: Jurafsky & Martin ch. 3
W	3/12		Prac.	Practice with: Probability and Language Models	Handout-P5 More practice
W7
M	3/17		Lect.	Language models II	Slides-L8 Handout-L8	Core: Charniak & Johnson ch. 1; Gorman & Sproat ch. 1.5-1.6
W	3/19		Prac.	OpenFST Language Models	Notebook
R	3/20	HW4 due
W8
M	3/24		Lect.	Dynamic Programming; Edit Distance	Slides-L9	None
W	3/26		Prac.
W9
M	3/31		No class (GC-wide)
W	4/2		No practicum today
R	4/3	HW5 due
W10
M	4/7		Lect.	POS Tagging; HMMs	Slides-L10	Core: Bird et al. ch. 5; Jurafsky & Martin app. A; Additional: (Manning & Schütze ch. 9)
W	4/9		Prac.		Notebook Slides
W11
M	4/14		No class (Spring Break☀️😎)
W	4/16		No class (Spring Break☀️😎)
W12
M	4/21		Lect.	Generative classifiers	Slides-L11	Core: Bird et al. ch. 6.1-3, 6.5-6.9; Jurafsky & Martin ch. 4
W	4/23		Prac.	Classification in scikit-learn	Notebook
W13
M	4/28		Lect.	Discriminative classifiers	Slides-L12	Core: Pedregosa et al.; Breiman (Two Cultures); Additional: (Ng & Jordan)
W	4/30		Prac.	Regression in scikit-learn	Notebook
W14
M	5/5		Lect.	Perceptrons; Regularization & Tuning	Slides-L13	Scikit-learn tutorials: 1, 2, 3
W	5/7	HW6 due	Prac.	Advanced text classification; Evaluation	Notebook Slides
W15
M	5/12		Lect.	DL🐎 Perils of evaluation(🐒💻)	Slides-L14	Core: Caplan et al. (2020)
W	5/14		Prac.	Rules and Exceptions Learning by people and populations	Slides-L15	Core: TP User's Guide GCY Good Enough
W16
R	5/22		Term paper due / End of semester			details here

Links and references

Bird, S., Klein, E. and Loper, E. n.d. Natural Language Processing with Python. URL: https://www.nltk.org/book/.
Breiman, L. 2001. Statistical modeling: the two cultures. Statistical Science 16(3): 199-231.
Chacon, S., and Straub, B. 2014. Pro Git. 2nd edition. Apress. URL: https://git-scm.com/book/en/v2.
Church, K. W. n.d. Unix™ for poets. Ms., AT&T Research.
Collins, M. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1-8.
Eisenstein, J. 2019. Introduction to Natural Language Processing. MIT Press.
Freeman, Eric, Freeman, Elisabeth, Sierra, K. and Bates, B. 2004. Head First Design Patterns: A Brain-friendly Guide. O'Reilly & Associates.
Freund, Y., and Schapire, R. E. 1999. Large margin classification using the perceptron algorithm. Machine Learning 37(3): 277-296.
Gorman, K. and Bedrick, S. 2019. We need to talk about standard splits. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2786-2791.
Gorman, K. and Sproat, R. 2021. Finite-State Text Processing. Morgan & Claypool.
Graf, T. 2022. Subregular linguistics: bridging theoretical linguistics and formal grammar. Theoretical Linguistics 48(3-4): 245-278.
Hopcroft, J. E., Motwani, R. and Ullman, J. D. 2008. Introduction to Automata Theory, Languages, and Computation. Pearson.
Hovy, D. and Spruit, S. L. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591-598.
Jäger, G. and Rogers, J. 2012. Formal language theory: refining the Chomsky hierarchy. Philosophical Transactions of the Royal Society B 367: 1956-1970.
Jelinek, F. 1997. Statistical Methods for Speech Recognition. MIT Press.
Jurafsky, D., and Martin, J. H. To appear. Speech and Language Processing. 3rd edition. Pearson. URL: https://web.stanford.edu/~jurafsky/slp3/.
Manning, C. D., and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
Ng, A. Y. and Jordan, M. I. 2002. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In NeurIPS, pages 841-848.
Partee, B. H., ter Meulen, A., and Wall, R. E. 1993. Mathematical Methods in Linguistics. 2nd edition. Kluwer Academic Publishers.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12: 2825-2830.
Resnik, P. and Lin, J. 2010. Evaluation of NLP systems. In Clark, A., Fox, C., and Lappin, S. (eds.), The Handbook of Computational Linguistics and Natural Language Processing, pages 271-295. Wiley-Blackwell.
Roark, B. and Sproat, R. 2007. Computational Approaches to Morphology and Syntax. Oxford University Press.
Strubell, E., Ganesh, A., and McCallum, A. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645-3650.
