|
|
AI/ML Weekly Seminar Sponsored by Yahoo! Research and Experian
AIML Meetings 2009
The AIML seminars are open to the public and are held most
Mondays from 1-2 pm in Bren Hall, room 4011 (4th floor).
Light snacks will be served at 1 pm.
Sponsorship of this series by Yahoo! and Experian is gratefully acknowledged.
Directions can be found here
|
|
Babak Shahbaba
Assistant Professor
Department of Statistics
University of California, Irvine
Multiple hypothesis testing when the hypotheses are grouped
Simultaneously evaluating a large number of hypotheses has become a
common theme in many areas of applied statistics and machine
learning. Such problems are abundant in signal processing, genomics,
proteomics, and brain imaging. Traditional methods within the
frequentist framework may involve computing a simple statistic (e.g.,
two-sample t-statistic) for each hypothesis and declaring the ones
above a cutoff (in absolute value) as significant while adjusting the
cutoff to control the family-wise error rate (i.e., the chance of
making at least one type I error) or false discovery rate (i.e., the
expected proportion of false positives among the rejected null
hypotheses). In many situations, prior information is available on
how the set of hypotheses could be grouped into predefined subsets.
For example, while evaluating the significance of a large number of
genes, we could use prior biological information, such as biochemical
pathways or similarity of gene sequences, to create subsets of genes.
Recent work has demonstrated that focusing on predefined hypothesis
sets (e.g., hypotheses regarding the overall significance of gene
sets) could increase statistical power and provide more interpretable
results. We introduce a new approach for multiple simultaneous
hypothesis testing when the hypotheses are grouped according to some
prior information. Our approach uses a hierarchical Bayesian
framework where a high-level hyperparameter measures the overall
significance of each hypothesis set. Our main focus is on the
application of this method for analyzing genomic data. Using computer
simulations, we compare our proposed method to alternative
approaches, such as Gene Set Enrichment Analysis (GSEA) and Gene Set
Analysis (GSA). In these simulations, our approach provides the best
overall performance. We also discuss the application of our method to
experimental data based on p53 mutation status and MYC expression
level in cancer cell lines.
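For readers unfamiliar with the cutoff adjustment mentioned in the abstract, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure, a standard frequentist way to control the false discovery rate. It is a baseline for orientation only, not the hierarchical Bayesian method presented in the talk; the function name `benjamini_hochberg` and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Illustrative helper (name and defaults are assumptions, not from the talk).
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis whose p-value is among the max_k smallest.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Made-up p-values, one per hypothesis (e.g., one per gene).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(example_p, alpha=0.05))
```

Note that this procedure treats every hypothesis separately; it makes no use of the grouping information that motivates the talk.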
|
Justin Ma
PhD candidate
Department of Computer Science and Engineering
University of California, San Diego
Identifying Suspicious URLs: An Application of Large-Scale
Online Learning
We explore online learning approaches for detecting malicious Web
sites (those involved in criminal scams) using lexical and host-based
features of the associated URLs. We show that this application is
particularly appropriate for online algorithms as the size of the
training data is larger than can be efficiently processed in batch and
because the distribution of features that typify malicious URLs is
changing continuously. Using a real-time system we developed for
gathering URL features, combined with a real-time source of labeled
URLs from a large Web mail provider, we demonstrate that
recently developed online algorithms can be as accurate as batch
techniques, achieving daily classification accuracies up to 99% over a
balanced data set.
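To make the online setting concrete, here is a minimal sketch of a mistake-driven perceptron that processes labeled URLs one at a time. It is not the system or the algorithms evaluated in the talk: it uses only simple lexical tokens (no host-based features), and the URLs, labels, and names `lexical_features` and `OnlinePerceptron` are hypothetical.

```python
import re


def lexical_features(url):
    """Bag-of-tokens features: split the URL on non-alphanumeric characters."""
    return {tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok}


class OnlinePerceptron:
    """Mistake-driven linear classifier over sparse binary features."""

    def __init__(self):
        self.weights = {}  # token -> weight; grows as new tokens appear
        self.bias = 0.0

    def score(self, features):
        return self.bias + sum(self.weights.get(f, 0.0) for f in features)

    def predict(self, features):
        # +1 = malicious, -1 = benign
        return 1 if self.score(features) > 0 else -1

    def update(self, features, label):
        """Process one labeled example; return True if a mistake was made."""
        if self.predict(features) == label:
            return False
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) + label
        self.bias += label
        return True


# Hypothetical labeled stream (+1 = malicious, -1 = benign).
stream = [
    ("http://example.com/login/verify-account", 1),
    ("http://example.org/news/today", -1),
    ("http://example.net/verify-account/update", 1),
    ("http://example.org/news/sports", -1),
]

model = OnlinePerceptron()
mistakes = sum(model.update(lexical_features(url), label) for url, label in stream)
print("mistakes in one pass:", mistakes)
```

Because each example is seen once and the weight table grows only as new tokens appear, the model never needs the full training set in memory and can keep adapting as the feature distribution shifts.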
|
Doug Oard
Associate Professor
College of Information Studies
University of Maryland, College Park
Who 'Dat? Identity resolution in large email collections
Automated techniques that can support the human activities of
search and sense-making in large email collections are of increasing
importance for a broad range of uses, including historical
scholarship, law enforcement and intelligence applications, and
"e-discovery" conducted by lawyers incident to civil litigation.
In this talk, I'll briefly describe some of the work to date on
searching large email collections, and then for most of the talk I
will focus on the more challenging task of support for sense-making.
Specifically, I'll describe joint work with Tamer Elsayed to
automatically resolve the identity of people who are mentioned
ambiguously (e.g., just by first name) in a collection of email from
a failed corporation (Enron). Our results indicate that for people
who are well represented in the collection we can use a generative
model to guess the right identity about 80% of the time, and for
others we are right about half the time. I'll conclude the talk
with a few remarks on our next directions for techniques,
evaluation, and additional types of collections to which similar
ideas might be applied.
|
Drew Frank
Graduate student
Department of Computer Science
University of California, Irvine
Belief Propagation in a Continuous World
Belief propagation is a popular algorithm for performing inference in
probabilistic graphical models. It has been successfully applied in a
diverse range of areas including computer vision, computational
biology, and error-correcting codes. Despite its many successes,
however, the algorithm has weaknesses: it does not directly handle
continuous random variables, and it may produce poor results on loopy
graphical models. In this talk I will discuss several extensions to
belief propagation that address these shortcomings. I will then show
how these extensions can be combined to enable reasonable performance
on loopy graphical models with continuous variables, with applications
in localization and protein structure estimation.
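As background, the sketch below shows basic sum-product belief propagation in the easiest setting: discrete variables on a chain, where the graph has no loops and the computed marginals are exact. It does not implement the continuous or loopy extensions discussed in the talk; the function name `chain_marginals` and the example potentials are illustrative assumptions.

```python
def chain_marginals(unary, pairwise):
    """Marginals of a chain-structured model via sum-product message passing.

    unary[i][a]       : potential of variable i taking state a
    pairwise[i][a][b] : potential linking variable i (state a) to i+1 (state b)
    """
    n = len(unary)
    k = len(unary[0])
    # fwd[i][s]: message arriving at variable i from its left neighbor.
    fwd = [[1.0] * k for _ in range(n)]
    for i in range(1, n):
        for b in range(k):
            fwd[i][b] = sum(
                fwd[i - 1][a] * unary[i - 1][a] * pairwise[i - 1][a][b]
                for a in range(k)
            )
    # bwd[i][s]: message arriving at variable i from its right neighbor.
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for a in range(k):
            bwd[i][a] = sum(
                bwd[i + 1][b] * unary[i + 1][b] * pairwise[i][a][b]
                for b in range(k)
            )
    # Belief = local potential times both incoming messages, then normalize.
    marginals = []
    for i in range(n):
        belief = [fwd[i][s] * unary[i][s] * bwd[i][s] for s in range(k)]
        z = sum(belief)
        marginals.append([v / z for v in belief])
    return marginals


# Three binary variables; the pairwise potentials favor equal neighbors.
example_unary = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]
example_pairwise = [[[2.0, 1.0], [1.0, 2.0]], [[2.0, 1.0], [1.0, 2.0]]]
for i, marginal in enumerate(chain_marginals(example_unary, example_pairwise)):
    print(i, marginal)
```

Each message here is a finite sum over states. With continuous variables those sums become integrals that generally have no closed form, which is one way to see the first weakness the abstract mentions.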
|
Mark Steyvers
Associate Professor
Department of Cognitive Sciences
University of California, Irvine
The Wisdom of Crowds and Rank Aggregation
When individuals independently recollect events or retrieve facts
from memory, how can we average these retrieved memories to best
reconstruct the actual set of events or facts? We report the
performance of individuals in a series of general knowledge tasks,
where the goal is to reconstruct from memory the order of historic
events (e.g., the order of US presidents), or magnitudes along some
physical dimension (e.g., the order of largest US cities). We
introduce two Bayesian models for aggregating order information
based on a Thurstonian approach and a modified version of the
perturbation model. Both models assume that each individual's
reconstruction is based on a random permutation of the unobserved
ground truth and that there is variability across individuals in
knowledge of the domain. The models demonstrate a wisdom of crowds
effect, where the aggregated orderings are closer to the true
ordering than the orderings of the best individual. The models also
demonstrate that we can recover the degree of expertise of each
individual, in the absence of any explicit feedback or access to
ground truth.
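A simple way to see the aggregation effect is the Borda count, which orders items by their summed position across individuals. The sketch below is a non-Bayesian baseline for illustration, not the Thurstonian or perturbation models from the talk; the three responses are hypothetical and the function name `borda_aggregate` is an assumption.

```python
def borda_aggregate(rankings):
    """Aggregate rankings (lists of the same items, first to last) by summed position."""
    items = rankings[0]
    total_position = {item: 0 for item in items}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            total_position[item] += position
    # Lower summed position means earlier in the aggregate order;
    # ties are broken alphabetically to keep the result deterministic.
    return sorted(items, key=lambda item: (total_position[item], item))


truth = ["Washington", "Adams", "Jefferson", "Madison"]

# Hypothetical responses: each individual swaps one adjacent pair.
responses = [
    ["Adams", "Washington", "Jefferson", "Madison"],
    ["Washington", "Jefferson", "Adams", "Madison"],
    ["Washington", "Adams", "Madison", "Jefferson"],
]

print(borda_aggregate(responses))
```

In this toy example no individual recalls the correct order, yet the aggregate does, because the individual errors fall on different pairs and cancel out. Unlike the models described in the abstract, the Borda count weights every individual equally and does not estimate expertise.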
|
Hamed Pirsiavash
PhD candidate
Department of Computer Science
University of California, Irvine
Bilinear classifiers for visual recognition
We describe an algorithm for learning bilinear SVMs. Bilinear
classifiers are a discriminative variant of bilinear models, which
capture the dependence of data on multiple factors. Such models are
particularly appropriate for visual data that is better represented
as a matrix or tensor, rather than a vector. Matrix encodings allow
for more natural regularization through rank restriction. For
example, a rank-one scanning-window classifier yields a separable
filter. Low-rank models have fewer parameters and so are easier to
regularize and faster to score at run-time. We learn low-rank models
with bilinear classifiers. We also use bilinear classifiers for
transfer learning by sharing linear factors between different
classification tasks. Bilinear classifiers are trained with biconvex
programs. Such programs are optimized with coordinate descent, where
each coordinate step requires solving a convex program; in our
case, we use a standard off-the-shelf SVM solver. We demonstrate
bilinear SVMs on difficult problems of people detection in video
sequences and action classification of video sequences, achieving
state-of-the-art results in both.
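The claim that a rank-one model yields a separable filter can be checked numerically. In the sketch below, a weight matrix W = u v^T is scored against a feature matrix X in two ways: as a full matrix inner product, and as two successive one-dimensional filters (u^T X v). This only illustrates the scoring identity, not the biconvex training procedure from the talk; all names and numbers are illustrative assumptions.

```python
def outer(u, v):
    """Rank-one matrix W = u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def full_score(W, X):
    """Matrix inner product <W, X>: one multiply per matrix entry."""
    return sum(
        W[i][j] * X[i][j] for i in range(len(W)) for j in range(len(W[0]))
    )


def rank_one_score(u, v, X):
    """Score u^T X v: filter each row of X with v, then combine the rows with u."""
    row_responses = [
        sum(X[i][j] * v[j] for j in range(len(v))) for i in range(len(u))
    ]
    return sum(u[i] * row_responses[i] for i in range(len(u)))


# Made-up 3x2 feature matrix and rank-one factors.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
u = [1.0, 2.0, -1.0]
v = [0.5, -1.0]

print(full_score(outer(u, v), X), rank_one_score(u, v, X))
```

For an n-by-m weight matrix, the rank-one form stores n + m parameters instead of n * m, which is the sense in which low-rank models have fewer parameters to regularize.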
|
|