Figures: Must-Link and Cannot-Link Constraints; Must-Link Transforming Data Space; Clustering Concentric Circles; Spectral Learning as more Data is Labeled; Complete Link vs. Single Link Clustering; Line-Link Clustering Finds the Pattern.

Machine Learning

2001 - 2003

When I started graduate school, I was interested in clustering search results, particularly as a way to quickly summarize the results of medical searches. This led me to explore probabilistic models for clustering, constrained clustering, and classification with very little labeled data, with applications to text mining.

This is joint work with Dan Klein, Chris Manning and others.

Selected Publications

  • From Instance-Level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering. Proceedings of the Nineteenth International Conference on Machine Learning, July 2002. (with Dan Klein and Chris Manning).
  • Spectral Learning. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, August 2003. (with Dan Klein and Chris Manning).
  • Interpreting and Extending Classical Agglomerative Clustering Algorithms using a Model-Based Approach. Proceedings of the Nineteenth International Conference on Machine Learning, July 2002. (with Dan Klein and Chris Manning).
  • Combining Heterogeneous Classifiers for Word-Sense Disambiguation. ACL-2002 Workshop on Word Sense Disambiguation. (with Dan Klein, Kristina Toutanova, Tolga Ilhan, and Chris Manning).
  • Inducing Novel Gene-Drug Interactions from the Biomedical Literature. Stanford University Technical Report, 2002. (with Diane Oliver, Chris Manning, and Russ Altman).
  • An Oncology Patient Interface to Medline. Proceedings of the 37th Annual Meeting of the American Society of Clinical Oncology, 2001. (with Elmer Bernstam, Funda Meric, John Dugan, Steven Chizek, Chris Stave, Olga Troyanskaya, Jeffery Chang, and Lawrence Fagan).
  • Medline IRaCS: An Information Retrieval and Clustering System for Genomic Knowledge Acquisition. Symposium on Biomedical Computation at Stanford, 2000. (with Eldar Giladi, Jeanne Loring, and Mike Walker).


  • Constrained Clustering for Improved Pattern Discovery. ICML, 2002.
  • A Probabilistic Interpretation of Agglomerative Clustering. ICML, 2002.
  • Matrix Factorizations for Document Clustering and Topic Extraction. Stanford University, 2001.
  • Parametric Mixture Models for Document Clustering and Topic Extraction. IBM Almaden Research Center, 2001.
  • Text Mining for Medical Knowledge Acquisition. Incyte Genomics, 2000.