AI/ML Weekly Seminar
Sponsored by Yahoo! Research and Experian

AIML Meetings 2009

The AIML seminars are open to the public and will be held on most Mondays from 1 to 2 pm in Room 4011 on the 4th floor of Bren Hall. Light snacks will be served at 1 pm.

Sponsorship of this series by Yahoo! and Experian is gratefully acknowledged.

Directions can be found here

Fall Quarter 2009
October 5
Bren Hall 4011
1 pm
Babak Shahbaba
Assistant Professor
Department of Statistics
University of California, Irvine

Multiple hypothesis testing when the hypotheses are grouped

October 12
Bren Hall 4011
1 pm
Justin Ma
PhD candidate
Department of Computer Science and Engineering
University of California, San Diego

Identifying Suspicious URLs: An Application of Large-Scale Online Learning

October 19
Bren Hall 4011
1 pm
Doug Oard
Associate Professor
College of Information Studies
University of Maryland, College Park

Who 'Dat? Identity resolution in large email collections

October 26
Bren Hall 4011
1 pm
Padhraic Smyth
Professor
Department of Computer Science
University of California, Irvine

The Netflix Prize and Competition

November 2
Bren Hall 4011
1 pm
Drew Frank
Graduate student
Department of Computer Science
University of California, Irvine

Belief Propagation in a Continuous World

November 9
Bren Hall 4011
1 pm
Eric Mjolsness
Associate Professor
Department of Computer Science
University of California, Irvine

Morphodynamic Modeling Languages

November 16
Bren Hall 4011
1 pm
Mark Steyvers
Associate Professor
Department of Cognitive Sciences
University of California, Irvine

The Wisdom of Crowds and Rank Aggregation

November 23
Bren Hall 4011
1 pm
Hamed Pirsiavash
PhD candidate
Department of Computer Science
University of California, Irvine

Bilinear classifiers for visual recognition

Babak Shahbaba
Assistant Professor
Department of Statistics
University of California, Irvine

Multiple hypothesis testing when the hypotheses are grouped

Simultaneously evaluating a large number of hypotheses has become a common theme in many areas of applied statistics and machine learning. Such problems are abundant in signal processing, genomics, proteomics, and brain imaging. Traditional methods within the frequentist framework may involve computing a simple statistic (e.g., a two-sample t-statistic) for each hypothesis and declaring the ones above a cutoff (in absolute value) as significant, while adjusting the cutoff to control the family-wise error rate (i.e., the probability of making at least one type I error) or the false discovery rate (i.e., the expected proportion of false positives among the rejected null hypotheses). In many situations, prior information is available on how the set of hypotheses could be grouped into predefined subsets. For example, while evaluating the significance of a large number of genes, we could use prior biological information, such as biochemical pathways or similarity of gene sequences, to create subsets of genes. Recent work has demonstrated that focusing on predefined hypothesis sets (e.g., hypotheses regarding the overall significance of gene sets) could increase statistical power and provide more interpretable results. We introduce a new approach for multiple simultaneous hypothesis testing when the hypotheses are grouped according to some prior information. Our approach uses a hierarchical Bayesian framework in which a high-level hyperparameter measures the overall significance of each hypothesis set. Our main focus is on the application of this method to the analysis of genomic data. Using computer simulations, we compare our proposed method to alternative approaches, such as Gene Set Enrichment Analysis (GSEA) and Gene Set Analysis (GSA). In these simulations, our approach provides the best overall performance. We also discuss the application of our method to experimental data based on p53 mutation status and MYC expression level in cancer cell lines.
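
As a rough illustration of the two error criteria mentioned in the abstract, the sketch below applies a Bonferroni cutoff (family-wise error rate) and the Benjamini-Hochberg step-up procedure (false discovery rate) to a list of p-values. These are the standard textbook procedures, not the hierarchical Bayesian method presented in the talk; the function names and the example p-values are assumptions made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure that controls the false discovery rate at level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values for six hypotheses (invented for this example).
p = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]
print(sum(bonferroni(p)), "rejections with Bonferroni")
print(sum(benjamini_hochberg(p)), "rejections with Benjamini-Hochberg")
```

On this toy input the false discovery rate criterion rejects one more hypothesis than the family-wise criterion, which reflects the usual trade-off between the two.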

Justin Ma
PhD candidate
Department of Computer Science and Engineering
University of California, San Diego

Identifying Suspicious URLs: An Application of Large-Scale Online Learning

We explore online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms because the training data is larger than can be efficiently processed in batch and because the distribution of features that typify malicious URLs is changing continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recently developed online algorithms can be as accurate as batch techniques, achieving daily classification accuracies up to 99% over a balanced data set.
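
To make the online setting concrete, here is a minimal sketch of a mistake-driven perceptron that sees one labeled URL at a time and uses only lexical tokens as features. The perceptron is a stand-in chosen for brevity; the abstract does not say which online algorithms or features the speaker used, host-based features are omitted, and the URLs, labels, and names below are hypothetical.

```python
import re


def lexical_features(url):
    """Split a URL into lowercase alphanumeric tokens (binary features)."""
    return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}


class OnlinePerceptron:
    """Linear classifier updated one example at a time, only on mistakes."""

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def predict(self, features):
        score = self.bias + sum(self.weights.get(f, 0.0) for f in features)
        return 1 if score > 0 else -1

    def update(self, features, label):
        """Predict first, then adjust the weights if the prediction was wrong."""
        prediction = self.predict(features)
        if prediction != label:
            for f in features:
                self.weights[f] = self.weights.get(f, 0.0) + label
            self.bias += label
        return prediction


# Hypothetical labeled stream: +1 = malicious, -1 = benign.
stream = [
    ("http://example.com/login", -1),
    ("http://bank.example-secure-update.biz/login/verify", +1),
    ("http://example.org/docs/index.html", -1),
    ("http://secure-update.example.biz/account/verify", +1),
]

model = OnlinePerceptron()
mistakes = 0
for url, label in stream:
    if model.update(lexical_features(url), label) != label:
        mistakes += 1
print("online mistakes:", mistakes)
```

Because each example is processed once and then discarded, memory use depends on the number of distinct features rather than on the length of the stream, and the weights can follow a changing feature distribution.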

Doug Oard
Associate Professor
College of Information Studies
University of Maryland, College Park

Who 'Dat? Identity resolution in large email collections

Automated techniques that can support the human activities of search and sense-making in large email collections are of increasing importance for a broad range of uses, including historical scholarship, law enforcement and intelligence applications, and "e-discovery" by lawyers incident to civil litigation. In this talk, I'll briefly describe some of the work to date on searching large email collections, and then for most of the talk I will focus on the more challenging task of supporting sense-making. Specifically, I'll describe joint work with Tamer Elsayed to automatically resolve the identity of people who are mentioned ambiguously (e.g., just by first name) in a collection of email from a failed corporation (Enron). Our results indicate that for people who are well represented in the collection we can use a generative model to guess the right identity about 80% of the time, and for others we are right about half the time. I'll conclude the talk with a few remarks on our next directions for techniques, evaluation, and additional types of collections to which similar ideas might be applied.

Drew Frank
Graduate student
Department of Computer Science
University of California, Irvine

Belief Propagation in a Continuous World

Belief propagation is a popular algorithm for performing inference in probabilistic graphical models. It has been successfully applied in a diverse range of areas, including computer vision, computational biology, and error-correcting codes. Despite its many successes, however, the algorithm has weaknesses: it does not directly handle continuous random variables, and it may produce poor results on loopy graphical models. In this talk I will discuss several extensions to belief propagation that address these shortcomings. I will then show how these extensions can be combined to enable reasonable performance on loopy graphical models with continuous variables, with applications in localization and protein structure estimation.
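
For readers unfamiliar with the basic algorithm, the following sketch runs sum-product belief propagation on a small chain of discrete variables and checks the result against brute-force enumeration. It covers only the standard discrete, loop-free case, where belief propagation is exact; it does not implement the continuous or loopy extensions discussed in the talk. The potentials and function names are illustrative assumptions.

```python
from itertools import product


def chain_marginals(unary, pairwise):
    """Sum-product belief propagation on a chain of discrete variables.

    unary[i][a] is the potential of variable i taking state a;
    pairwise[a][b] is the potential shared by every neighbouring pair.
    """
    n, k = len(unary), len(unary[0])
    # fwd[i][b]: message arriving at variable i from its left neighbour.
    fwd = [[1.0] * k for _ in range(n)]
    for i in range(1, n):
        for b in range(k):
            fwd[i][b] = sum(fwd[i - 1][a] * unary[i - 1][a] * pairwise[a][b]
                            for a in range(k))
    # bwd[i][a]: message arriving at variable i from its right neighbour.
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for a in range(k):
            bwd[i][a] = sum(bwd[i + 1][b] * unary[i + 1][b] * pairwise[a][b]
                            for b in range(k))
    # Belief = local potential times both incoming messages, normalized.
    marginals = []
    for i in range(n):
        belief = [fwd[i][s] * unary[i][s] * bwd[i][s] for s in range(k)]
        total = sum(belief)
        marginals.append([v / total for v in belief])
    return marginals


def brute_force_marginals(unary, pairwise):
    """Enumerate every joint state; only feasible for tiny models."""
    n, k = len(unary), len(unary[0])
    sums = [[0.0] * k for _ in range(n)]
    for states in product(range(k), repeat=n):
        weight = 1.0
        for i, s in enumerate(states):
            weight *= unary[i][s]
        for i in range(n - 1):
            weight *= pairwise[states[i]][states[i + 1]]
        for i, s in enumerate(states):
            sums[i][s] += weight
    return [[v / sum(row) for v in row] for row in sums]


# Hypothetical potentials for three binary variables.
unary = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]
pairwise = [[2.0, 1.0], [1.0, 2.0]]
bp = chain_marginals(unary, pairwise)
exact = brute_force_marginals(unary, pairwise)
print(bp)
```

The message sums over discrete states in this sketch are what become integrals when the variables are continuous, which is one reason the basic algorithm does not apply directly in that setting.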

Mark Steyvers
Associate Professor
Department of Cognitive Sciences
University of California, Irvine

The Wisdom of Crowds and Rank Aggregation

When individuals independently recollect events or retrieve facts from memory, how can we average these retrieved memories to best reconstruct the actual set of events or facts? We report the performance of individuals in a series of general knowledge tasks, where the goal is to reconstruct from memory the order of historical events (e.g., the order of US presidents) or magnitudes along some physical dimension (e.g., the order of the largest US cities). We introduce two Bayesian models for aggregating order information, based on a Thurstonian approach and a modified version of the perturbation model. Both models assume that each individual's reconstruction is based on a random permutation of the unobserved ground truth and that there is variability across individuals in knowledge of the domain. The models demonstrate a wisdom of crowds effect, where the aggregated orderings are closer to the true ordering than the orderings of the best individual. The models also demonstrate that we can recover the degree of expertise of each individual, in the absence of any explicit feedback or access to ground truth.
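
The wisdom of crowds effect for orderings can be illustrated with a much simpler aggregator than the Bayesian models in the talk. The sketch below uses a Borda count (summing each item's position across individuals) and measures error with the Kendall tau distance; both are swapped-in textbook techniques, and the items and rankings are invented for illustration.

```python
from itertools import combinations


def borda_aggregate(rankings):
    """Order items by their total position across all individual rankings."""
    totals = {}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            totals[item] = totals.get(item, 0) + position
    # A lower total means the item tends to be placed earlier; ties break by name.
    return sorted(totals, key=lambda item: (totals[item], item))


def kendall_tau_distance(ranking, truth):
    """Count the item pairs that the two orderings place in opposite order."""
    pos = {item: i for i, item in enumerate(ranking)}
    return sum(1 for a, b in combinations(truth, 2) if pos[a] > pos[b])


# Hypothetical ground truth and three imperfect individual reconstructions.
truth = ["A", "B", "C", "D"]
individuals = [
    ["A", "B", "D", "C"],
    ["B", "A", "C", "D"],
    ["A", "C", "B", "D"],
]

crowd = borda_aggregate(individuals)
individual_errors = [kendall_tau_distance(r, truth) for r in individuals]
crowd_error = kendall_tau_distance(crowd, truth)
print("individual errors:", individual_errors)
print("crowd ordering:", crowd, "error:", crowd_error)
```

In this toy case every individual makes one pairwise error, yet the aggregate recovers the true order because the individual errors fall on different pairs. Unlike the models described in the abstract, a Borda count weights all individuals equally and does not estimate expertise.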

Hamed Pirsiavash
PhD candidate
Department of Computer Science
University of California, Irvine

Bilinear classifiers for visual recognition

We describe an algorithm for learning bilinear SVMs. Bilinear classifiers are a discriminative variant of bilinear models, which capture the dependence of data on multiple factors. Such models are particularly appropriate for visual data that is better represented as a matrix or tensor rather than a vector. Matrix encodings allow for more natural regularization through rank restriction. For example, a rank-one scanning-window classifier yields a separable filter. Low-rank models have fewer parameters and so are easier to regularize and faster to score at run-time. We learn low-rank models with bilinear classifiers. We also use bilinear classifiers for transfer learning by sharing linear factors between different classification tasks. Bilinear classifiers are trained with biconvex programs. Such programs are optimized with coordinate descent, where each coordinate step requires solving a convex program; in our case, we use a standard off-the-shelf SVM solver. We demonstrate bilinear SVMs on difficult problems of people detection in video sequences and action classification of video sequences, achieving state-of-the-art results in both.
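
The rank-one case mentioned in the abstract can be sketched in a few lines: a weight matrix W = a b^T scores a matrix-shaped input X as a^T X b, which needs fewer parameters than a full W. The code below only checks this scoring identity on made-up numbers; it does not train anything, and the helper names and values are assumptions rather than the speaker's implementation.

```python
def outer(a, b):
    """Rank-one weight matrix W = a b^T."""
    return [[ai * bj for bj in b] for ai in a]


def score_full(W, X):
    """Score with a full weight matrix: the sum of W[i][j] * X[i][j]."""
    return sum(W[i][j] * X[i][j]
               for i in range(len(W)) for j in range(len(W[0])))


def matvec(X, b):
    """Matrix-vector product X b."""
    return [sum(x * bj for x, bj in zip(row, b)) for row in X]


def score_rank_one(a, b, X):
    """Score with the factors only: a^T (X b), never forming W."""
    return sum(ai * v for ai, v in zip(a, matvec(X, b)))


# Hypothetical 3 x 4 input and factors.
X = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 3.0, 1.0, 0.0],
     [2.0, 1.0, 0.0, 4.0]]
a = [1.0, -2.0, 0.5]
b = [2.0, 0.0, 1.0, -1.0]

full = score_full(outer(a, b), X)
factored = score_rank_one(a, b, X)
print(full, factored)
print("parameters: full =", len(a) * len(b), "rank-one =", len(a) + len(b))
```

With b held fixed, the score a^T (X b) is linear in a with X b acting as the feature vector (and symmetrically for b), which is consistent with the abstract's remark that each coordinate step can be handed to a standard SVM solver.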

© Copyright 2006. Center for Machine Learning and Intelligent Systems
949.824.9296 tel | 949.824.9813 fax | cmlis@ics.uci.edu