Research

My primary research interests are in computer vision and machine learning. They include:

  1. Modelling dynamic objects in video sequences to address the problems of visual object tracking, action/activity recognition, video registration, and motion segmentation

  2. Developing efficient optimization and randomization techniques for large-scale computer vision and machine learning problems (e.g. solving normalized graph cut problems with priors and optimal camera placement in 3D)

  3. Exploring novel means of involving human judgment in computer vision and image processing methodology. This includes studying perceptual elements of image quality and saliency and linking them to low-level image features, which can be used to develop more effective and perceptually-relevant recognition and compression techniques

  4. Developing frameworks for joint representation and classification by exploiting data sparsity and low-rankness

Image/Video Registration

Registration is a classical problem in computer vision. It refers to the problem of spatially aligning images (possibly consecutive frames in the same video or individual images captured by the same or different cameras) into the same absolute coordinate system determined by a reference image. The spatial transformation between each pair of images governs the relative camera motion between these two images. Effectively estimating the mapping between coordinate systems is imperative for any computer vision application that makes use of object positions in a dynamic scene including object tracking and action recognition. These applications assume that pixel motion in video frames is only due to moving objects and not to apparent motion resulting from changes in camera parameters (e.g. translation, pan, tilt, and/or zoom). This is why registration in the presence of camera motion is a common problem in many computer vision related domains (e.g. augmented reality and sports video analysis) and a fundamental pre-processing step that is necessary before meaningful higher level inference can be performed on dynamic scene content. Below is a sample publication on this topic.

Robust Video Registration Applied to Field-Sports Video Analysis [ICASSP2012]

Video (image-to-image) registration is a fundamental problem in computer vision. Registering video frames to the same coordinate system is necessary before meaningful inference can be made from a dynamic scene in the presence of camera motion. Standard registration techniques detect specific structures (e.g. points and lines), find potential correspondences, and use a random sampling method to choose inlier correspondences. Unlike these standards, we propose a parameter-free, robust registration method that avoids explicit structure matching by matching entire images or image patches. We frame the registration problem in a sparse representation setting, where outlier pixels are assumed to be sparse in an image. Here, robust video registration (RVR) becomes equivalent to solving a sequence of L1 minimization problems, each of which can be solved using the Inexact Augmented Lagrangian Method (IALM). Our RVR method is made efficient (sublinear complexity in the number of pixels) by exploiting a hybrid coarse-to-fine and random sampling strategy along with the temporal smoothness of camera motion. We showcase RVR in the domain of sports videos, specifically American football. Our experiments on real-world data show that RVR outperforms standard methods and is useful in several applications (e.g. automatic panoramic stitching and non-static background subtraction).
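The L1 minimization at the heart of RVR is handled by IALM, whose workhorse is the elementwise shrinkage (soft-thresholding) operator that updates the sparse outlier term in every iteration. Below is a minimal sketch of that primitive only, not the full registration pipeline; the function name and example values are illustrative:

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise shrinkage: the proximal operator of the L1 norm,
    prox_{tau*||.||_1}(x) = sign(x) * max(|x| - tau, 0).
    IALM-style solvers apply this in every iteration to update the
    sparse (outlier) component of the registration residual."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

# Large entries (outlier pixels) survive shrinkage; small residuals vanish.
residual = np.array([-3.0, -0.2, 0.1, 2.5])
assert np.allclose(soft_threshold(residual, 0.5), [-2.5, 0.0, 0.0, 2.0])
```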

Visual Object Tracking

Object tracking in video is a classical problem in computer vision. It refers to finding the "state" (e.g. position, speed, etc.) of an object moving in a video. Usually, the state of the object is given at a particular video frame and the tracking method is required to follow the object in the rest of the frames. Many problems arise while tracking objects in video, including occlusion, appearance variations due to lighting, scale, and deformations, rapid motion, etc. Good trackers need to address these issues. Below are some sample publications on this topic.

Robust Visual Tracking via Multi-Task Sparse Learning [CVPR2012][IJCV2012]

In this paper, we formulate object tracking in a particle filter framework as a multi-task sparse learning problem, which we denote as Multi-Task Tracking (MTT). Since we model particles as linear combinations of dictionary templates that are updated dynamically, learning the representation of each particle is considered a single task in MTT. By employing popular sparsity-inducing Lp,q mixed norms, we regularize the representation problem to enforce joint sparsity and learn the particle representations together. As compared to previous methods that handle particles independently, our results demonstrate that mining the interdependencies between particles improves both tracking performance and overall computational efficiency. Interestingly, we show that the popular L1 tracker is a special case of our MTT formulation (denoted as the L11 tracker) when p=q=1. The learning problem can be efficiently solved using an Accelerated Proximal Gradient (APG) method that yields a sequence of closed form updates. As such, MTT is computationally attractive. We test our proposed approach on challenging sequences involving heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that MTT methods consistently outperform state-of-the-art trackers.
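For the L2,1 instance of the mixed norms above, the proximal step inside APG reduces to a closed-form row-wise shrinkage that keeps or zeroes whole rows jointly. The sketch below illustrates that operator under the L2,1 assumption (other p, q choices have analogous but different updates); it is an illustration, not the full MTT solver:

```python
import numpy as np

def group_soft_threshold(Z, tau):
    """Row-wise shrinkage: the proximal operator of the L2,1 mixed norm.
    Each row z is mapped to max(1 - tau/||z||_2, 0) * z, so whole rows
    (dictionary templates shared across all particle tasks) are either
    kept jointly or zeroed jointly -- the joint-sparsity effect."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * Z

Z = np.array([[3.0, 4.0],    # row norm 5   -> shrunk to norm 4
              [0.3, 0.4]])   # row norm 0.5 -> zeroed entirely
W = group_soft_threshold(Z, 1.0)
assert np.allclose(W[0], [2.4, 3.2]) and np.allclose(W[1], 0.0)
```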

Low-Rank Sparse Learning for Robust Visual Tracking [ECCV2012][IJCV2014]

In this work, we propose a new particle-filter based tracking algorithm that exploits the relationship between particles (candidate targets). By representing particles as sparse linear combinations of dictionary templates, this algorithm capitalizes on the inherent low-rank structure of particle representations that are learned jointly. As such, it casts the tracking problem as a low-rank matrix learning problem. This low-rank sparse tracker (LRST) has a number of attractive properties. (1) Since LRST adaptively updates dictionary templates, it can handle significant changes in appearance due to variations in illumination, pose, scale, etc. (2) The linear representation in LRST explicitly incorporates background templates in the dictionary and a sparse error term, which enables LRST to address the tracking drift problem and to be robust against occlusion, respectively. (3) LRST is computationally attractive, since the low-rank learning problem can be efficiently solved as a sequence of closed form update operations, which yield a time complexity that is linear in the number of particles and the template size. We evaluate the performance of LRST by applying it to a set of challenging video sequences and comparing it to 6 popular tracking methods. Our experiments show that by representing particles jointly, LRST not only outperforms the state-of-the-art in tracking accuracy but also significantly improves the time complexity of methods that use a similar sparse linear representation model for particles.
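Closed-form updates for low-rank learning problems of this kind typically rest on the proximal operator of the nuclear norm, i.e. singular value thresholding. A hedged sketch of that primitive follows (illustrative only, not the exact LRST update sequence):

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the proximal operator of the nuclear
    norm. Soft-thresholds the singular values of X, which is the kind of
    closed-form step used by low-rank matrix learning solvers."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(0)
# A rank-1 matrix plus small noise: SVT recovers a low-rank approximation,
# since the noise singular values fall below the threshold.
X = np.outer(rng.normal(size=8), rng.normal(size=5)) + 0.01 * rng.normal(size=(8, 5))
L = svt(X, 0.5)
assert np.linalg.matrix_rank(L) <= 1
```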

Action and Activity Analysis

Below are some sample publications on this topic.

ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding [CVPR2015][URL]

In spite of many dataset efforts for human action recognition, current computer vision algorithms are still severely limited in terms of the variability and complexity of the actions that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on simple actions and movements occurring in manually trimmed videos. In this paper we introduce ActivityNet, a new large-scale video benchmark for human activity understanding. Our benchmark aims at covering a wide range of complex human activities that are of interest to people in their daily living.

In its current version, ActivityNet provides samples from 203 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours, of which 68.8 hours are annotated. We illustrate three scenarios in which ActivityNet can be used to compare algorithms for human activity understanding: untrimmed video classification, trimmed activity classification, and activity detection.

Trajectory-Based Fisher Kernel Representation of Human Actions [ICPR2012]

Action recognition is an important computer vision problem that has many applications including video indexing and retrieval, event detection, and video summarization. In this paper, we propose to apply the Fisher kernel paradigm to action recognition. The Fisher kernel framework combines the strengths of generative and discriminative models. In this approach, given the trajectories extracted from a video and a generative Gaussian Mixture Model (GMM), we use the Fisher Kernel method to describe how much the GMM parameters are modified to best fit the video trajectories. We experiment in using the Fisher Kernel vector to create the video representation and to train an SVM classifier. We further extend our framework to select the most discriminative trajectories using a novel MIL-KNN framework. We compare the performance of our approach to the current state-of-the-art bag-of-features (BOF) approach on two benchmark datasets. Experimental results show that our proposed approach outperforms the state-of-the-art method and that the selected discriminative trajectories are descriptive of the action class.
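A toy numpy sketch of the Fisher-vector idea follows, restricted to the gradient with respect to the GMM means; the full method also uses other GMM parameters and the MIL-KNN trajectory selection, and all names and values here are illustrative:

```python
import numpy as np

def fisher_vector_means(trajs, weights, means, sigmas):
    """Toy Fisher-vector encoding w.r.t. the GMM means only (a hypothetical
    simplification). For each component k, accumulate the posterior-weighted,
    sigma-normalized residuals of the trajectory descriptors, normalized by
    sqrt(w_k): G_k = (1 / (N*sqrt(w_k))) * sum_n gamma_n(k) (x_n - mu_k)/sigma_k."""
    diff = trajs[:, None, :] - means[None, :, :]          # (N, K, D)
    # Diagonal-covariance Gaussian log-densities per (trajectory, component).
    logp = -0.5 * np.sum((diff / sigmas) ** 2 + np.log(2 * np.pi * sigmas ** 2), axis=2)
    logp += np.log(weights)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)               # responsibilities
    N = trajs.shape[0]
    fv = (post[:, :, None] * diff / sigmas).sum(axis=0) / (N * np.sqrt(weights)[:, None])
    return fv.ravel()                                     # length K * D

# Two-component toy GMM over 2-D "trajectory descriptors".
means = np.array([[0.0, 0.0], [5.0, 5.0]])
sigmas = np.ones_like(means)
weights = np.array([0.5, 0.5])
trajs = np.array([[0.1, -0.1], [5.2, 4.9]])
fv = fisher_vector_means(trajs, weights, means, sigmas)
assert fv.shape == (4,)
```

The resulting vector can then be fed to a linear SVM, as in the bag-of-features pipelines it is compared against.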

Automatic Recognition of Offensive Team Formation in American Football Plays [CVPRW2013][Best Paper Award]

Compared to security surveillance and military applications, where automated action analysis is prevalent, the sports domain is extremely under-served. Most existing software packages for sports video analysis require manual annotation of important events in the video. American football is the most popular sport in the United States, yet most game analysis is still done manually. Line of scrimmage and offensive team formation recognition are two statistics that must be tagged by American football coaches when watching and evaluating past play video clips, a process that takes many man-hours per week. These two statistics are also the building blocks for more high-level analysis such as play strategy inference and automatic statistic generation. In this paper, we propose a novel framework where, given an American football play clip, we automatically identify the video frame in which the offensive team lines up in formation (formation frame), the line of scrimmage for that play, and the type of player formation the offensive team takes on. The proposed framework achieves 95% accuracy in detecting the formation frame, 98% accuracy in detecting the line of scrimmage, and up to 67% accuracy in classifying the offensive team's formation. To validate our framework, we compiled a large dataset comprising more than 800 play clips of standard and high definition resolution from real-world football games. This dataset will be made publicly available for future comparison.

Large-Scale Optimization in Vision and Machine Learning

Below are some sample publications on this topic.

L0TV: A New Method for Image Restoration in the Presence of Impulse Noise [CVPR2015]

Total Variation (TV) is an effective and popular prior model in the field of regularization-based image processing. This paper focuses on TV for image restoration in the presence of impulse noise. This type of noise frequently arises in data acquisition and transmission due to many reasons, e.g. a faulty sensor or analog-to-digital converter errors. Removing this noise is an important task in image restoration. State-of-the-art methods such as Adaptive Outlier Pursuit (AOP), which is based on TV with L02-norm data fidelity, only give sub-optimal performance. In this paper, we propose a new method, called L0TV-PADMM, which solves the TV-based restoration problem with L0-norm data fidelity. To effectively deal with the resulting non-convex non-smooth optimization problem, we first reformulate it as an equivalent MPEC (Mathematical Program with Equilibrium Constraints), and then solve it using a proximal Alternating Direction Method of Multipliers (PADMM). Our L0TV-PADMM method finds a desirable solution to the original L0-norm optimization problem and is proven to be convergent under mild conditions. We apply L0TV-PADMM to the problems of image denoising and deblurring in the presence of impulse noise. Our extensive experiments demonstrate that L0TV-PADMM outperforms state-of-the-art image restoration methods.
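For intuition on why an L0-norm data fidelity suits impulse noise: under salt-and-pepper corruption the residual between the noisy and clean images is nonzero exactly at the corrupted pixels, so the L0 term counts them directly. A small sketch of this noise model (the function is illustrative; it is not the L0TV-PADMM solver itself):

```python
import numpy as np

def add_impulse_noise(img, rate, rng):
    """Salt-and-pepper corruption: each pixel is independently replaced
    by 0 or 1 with probability `rate` -- the impulse-noise model that an
    L0-norm data-fidelity term is suited to."""
    noisy = img.copy()
    mask = rng.random(img.shape) < rate
    noisy[mask] = rng.integers(0, 2, size=mask.sum()).astype(img.dtype)
    return noisy, mask

rng = np.random.default_rng(1)
img = np.full((64, 64), 0.5)              # constant gray test image
noisy, mask = add_impulse_noise(img, 0.3, rng)
# The L0 "norm" of the residual counts corrupted pixels exactly
# (salt/pepper values 0 and 1 never coincide with the 0.5 background).
assert np.count_nonzero(noisy - img) == mask.sum()
```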

Dinkelbach Normalized Graph Cuts [IJCV2010]

We propose a novel framework, called Dinkelbach NCUT (DNCUT), to extend normalized graph cuts to include general, convex constraints under given priors on the graph. The formulation is presented for the problem of spectral graph based, low-level image segmentation. We present a solution to this problem in the form of a sequence of quadratic programs (QPs) subject to convex constraints. We use the iterative Dinkelbach method for fractional programming, where the complexity of finding a global solution in each iteration depends on the complexity of the constraints, the convexity of the cost function, and the chosen initialization. In fact, we derive an initialization which guarantees that each Dinkelbach iteration involves a convex QP problem. Download the code here.
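The Dinkelbach scheme itself is simple: to minimize a ratio f(x)/g(x) with g > 0, repeatedly solve the parametric problem min_x f(x) - lam*g(x) and reset lam to the achieved ratio. A toy scalar sketch follows; the inner solver here is a hypothetical closed form standing in for the convex QP solved in each DNCUT iteration:

```python
def dinkelbach_minimize(f, g, solve_inner, lam0, tol=1e-9, max_iter=50):
    """Dinkelbach's method for the fractional program min_x f(x)/g(x), g > 0.
    Each iteration solves min_x f(x) - lam*g(x) (delegated to `solve_inner`)
    and updates lam to the achieved ratio; lam converges to the optimum."""
    lam = lam0
    for _ in range(max_iter):
        x = solve_inner(lam)
        new_lam = f(x) / g(x)
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return x, lam

# Toy example: minimize (x^2 + 1) / x over x > 0 (optimum x = 1, ratio 2).
f = lambda x: x * x + 1.0
g = lambda x: x
# Inner problem min_x x^2 + 1 - lam*x has closed form x = lam/2 (kept > 0).
solve_inner = lambda lam: max(lam / 2.0, 1e-6)
x, lam = dinkelbach_minimize(f, g, solve_inner, lam0=f(3.0) / g(3.0))
assert abs(x - 1.0) < 1e-4 and abs(lam - 2.0) < 1e-4
```

The initialization matters, as the abstract notes: here lam0 is taken as the ratio at a feasible point, mirroring the derived initialization that keeps each inner problem convex.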

Low-Rank Quadratic Semidefinite Programming [Neurocomputing2012]

Low-rank matrix approximation is an attractive model in large-scale machine learning problems, because it can not only reduce the memory and runtime complexity, but also provide a natural way to regularize parameters while preserving learning accuracy. In this paper, we address a special class of nonconvex quadratic matrix optimization problems, which require a low-rank positive semidefinite solution. Despite their non-convexity, we exploit the structure of these problems to derive an efficient solver that converges to their local optima. Furthermore, we show that the proposed solution is capable of dramatically enhancing the efficiency and scalability of a variety of concrete problems, which are of significant interest to the machine learning community. These problems include the Top-k Eigenvalue Problem, Distance Learning, and Kernel Learning. Extensive experiments on UCI benchmarks have shown the effectiveness and efficiency of our proposed method.
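As one concrete instance, the Top-k Eigenvalue Problem can be approached with a simple low-rank subspace (orthogonal) iteration. The sketch below illustrates the low-rank structure being exploited, using only dense products and a QR step; it is not the paper's QSDP solver:

```python
import numpy as np

def top_k_eigs(A, k, iters=200, seed=0):
    """Orthogonal (subspace) iteration for the top-k eigenvalues of a
    symmetric matrix A: repeatedly multiply a thin n-by-k block by A and
    re-orthonormalize, then read off eigenvalues via Rayleigh-Ritz."""
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.normal(size=(A.shape[0], k)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(A @ Q)[0]
    # Rayleigh-Ritz: eigenvalues of the small k x k projected matrix.
    evals = np.linalg.eigvalsh(Q.T @ A @ Q)
    return np.sort(evals)[::-1]

A = np.diag([5.0, 3.0, 1.0, 0.5])
assert np.allclose(top_k_eigs(A, 2), [5.0, 3.0], atol=1e-6)
```

The memory footprint is n*k rather than n*n, which is the advantage low-rank formulations trade on.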

Human Perception in Computer Vision and Image Processing

Image Quality Assessment using Image Segmentation [ICIP2008]

Computational representation of perceived image quality is a fundamental problem in computer vision and image processing, which has assumed increased importance with the growing role of images and video in human-computer interaction. It is well-known that the commonly used Peak Signal-to-Noise Ratio (PSNR), although analysis-friendly, falls far short of this need. We propose a perceptual image quality measure (IQM) in terms of an image's region structure. Given a reference image and its "distorted" version, we propose a "full-reference" IQM, called Segmentation-based Perceptual Image Quality Assessment (SPIQA), which quantifies this quality reduction, while minimizing the disparity between human judgment and automated prediction of image quality. One novel feature of SPIQA is that it enables the use of inter- and intra-region attributes in a way that closely resembles how the human visual system perceives distortion. Download the code here.

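For reference, the PSNR baseline that perceptual measures like SPIQA are compared against is a one-line formula over the mean squared error:

```python
import numpy as np

def psnr(ref, dist, peak=255.0):
    """Peak Signal-to-Noise Ratio: 10 * log10(peak^2 / MSE).
    Analysis-friendly, but known to correlate poorly with perceived quality."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4))
dist = ref + 16.0                 # constant error of 16 gray levels, MSE = 256
assert abs(psnr(ref, dist) - 24.0484) < 1e-3
```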

Do Humans Fixate on Interest Points? [ICPR2012]

Interest point detectors (e.g. SIFT, SURF, and MSER) have been successfully applied to numerous high-level computer vision tasks such as object detection and image classification. Despite their popularity, the perceptual relevance of these detectors has not been thoroughly studied. Here, perceptual relevance refers to the correlation between these point detectors and free-viewing human fixations on images. In this work, we provide empirical evidence to shed light on the fundamental question: "Do humans fixate on interest points in images?". We believe that insights into this question may play a role in improving the performance of vision systems that utilize these interest point detectors. We conduct an extensive quantitative comparison between the spatial distributions of human fixations and automatically detected interest points on a recently released dataset of 1003 images. This comparison is done at both the global (image) level and the local (region) level. Our experimental results show that there exists a weak correlation between the spatial distributions of human fixations and interest points.
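A hedged sketch of a global-level comparison of this kind: bin the two point sets into coarse spatial histograms over the image and correlate them. This is a simplification for illustration; the function name, bin count, and synthetic data are all hypothetical, not the study's exact methodology:

```python
import numpy as np

def spatial_correlation(points_a, points_b, shape, bins=8):
    """Bin two point sets (e.g. human fixations vs. detected interest
    points) into coarse 2-D spatial histograms over an image of the given
    shape, and report the Pearson correlation of the flattened histograms."""
    h_a, _, _ = np.histogram2d(points_a[:, 0], points_a[:, 1],
                               bins=bins, range=[[0, shape[0]], [0, shape[1]]])
    h_b, _, _ = np.histogram2d(points_b[:, 0], points_b[:, 1],
                               bins=bins, range=[[0, shape[0]], [0, shape[1]]])
    return np.corrcoef(h_a.ravel(), h_b.ravel())[0, 1]

# Synthetic example: two point sets drawn around the same central cluster
# produce a high correlation; unrelated sets would score near zero.
rng = np.random.default_rng(0)
center = rng.normal([50, 50], 10, size=(300, 2))
fixations = np.clip(center + rng.normal(0, 5, center.shape), 0, 99)
interest = np.clip(center + rng.normal(0, 5, center.shape), 0, 99)
r = spatial_correlation(fixations, interest, (100, 100))
assert r > 0.5
```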

Dynamic Textures

A dynamic texture (DT) is the temporal extension of 2D texture. Even though the overall global motion of a DT may be perceived by humans as simple and coherent, the underlying local motion is complex and stochastic. DT models are developed and applied to DT synthesis, recognition, and compression. Some experimental results are provided here. Below are some sample publications.

Phase-Based Modeling of DTs [ICCV2007]

The Principal Difference Phase PCA (PDPP) model represents frames in a DT sequence using their Fourier phase spectra. Extracting Fourier features and embedding them in a PCA space, we efficiently model the spatiotemporal properties of a DT. This model is successful in recognizing DTs, synthesizing new DT sequences, and compressing DT videos.
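The pipeline can be caricatured in a few lines of numpy: take each frame's Fourier phase spectrum as its feature vector and embed the stack in a PCA subspace. This is a hypothetical simplification of PDPP that omits its difference and synthesis machinery:

```python
import numpy as np

def phase_pca(frames, n_components):
    """Sketch of the phase-based idea: describe each frame by its 2-D
    Fourier phase spectrum, then embed the stacked phase features in a
    PCA subspace computed via SVD of the centered feature matrix."""
    feats = np.stack([np.angle(np.fft.fft2(f)).ravel() for f in frames])
    mean = feats.mean(axis=0)
    U, s, Vt = np.linalg.svd(feats - mean, full_matrices=False)
    basis = Vt[:n_components]                # principal phase directions
    coords = (feats - mean) @ basis.T        # low-dimensional embedding
    return coords, basis, mean

rng = np.random.default_rng(0)
frames = rng.random((10, 16, 16))            # toy stand-in for a DT clip
coords, basis, mean = phase_pca(frames, 3)
assert coords.shape == (10, 3) and basis.shape == (3, 256)
```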

Phase-Based Compression of DTs [ICIP2007]

We apply the PDPP model to DT compression, which can be used to improve the performance of MPEG-2 encoding. We present experimental evidence that validates this method for a variety of complex sequences, while also comparing it to the LDS model.

Extracting Fluid DT and the Background from Video [CVPR2008]

Given the video of a still background occluded by a fluid dynamic texture (FDT), we address the problem of separating the video sequence into its two constituent layers. One layer corresponds to the video of the unoccluded background, and the other to that of the FDT. We learn the frame-to-frame FDT densities, the FDT appearance, and the background simultaneously.

Dynamic Swarms [CVIU2012]

A dynamic swarm (DS) is a large layout of stochastically repetitive spatial configurations of dynamic objects (swarm elements) whose motions exhibit local spatiotemporal interdependency and stationarity. Examples of DS abound in nature, e.g., herds of animals and flocks of birds. To capture the local spatiotemporal properties of the DS, we present a probabilistic model that learns both the spatial layout of swarm elements and their joint dynamics that are modeled as linear transformations.

DT Recognition Using Efficient Maximum Margin Distance Learning [ECCV2010]

The space of DTs varies along three dimensions: spatial texture, spatial texture layout, and dynamics. By describing each dimension with appropriate spatial or temporal features and by equipping it with a suitable distance, elementary distances between DT sequences can be computed. We address the problem of DT recognition by learning linear combinations of these elementary distances. An efficient maximum margin distance learning (MMDL) method based on the Pegasos algorithm is proposed. In contrast to popular MMDL methods, we show that our method, called DL-PEGASOS, can handle more general distance constraints with a linear computational complexity. A new dataset called DynTex++ is compiled.
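DL-PEGASOS builds on the Pegasos stochastic subgradient step for the L2-regularized hinge loss, with learning rate 1/(lambda*t). Below is a minimal sketch of that underlying step on toy separable data (a plain linear SVM, not the distance-learning variant; data and parameters are illustrative):

```python
import numpy as np

def pegasos_train(X, y, lam=0.01, iters=2000, seed=0):
    """Pegasos: at step t, sample one example, use learning rate
    eta = 1/(lam*t), shrink w by (1 - eta*lam), and add eta*y*x
    whenever the margin constraint y*(w.x) >= 1 is violated."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, iters + 1):
        i = rng.integers(len(y))
        eta = 1.0 / (lam * t)
        if y[i] * (w @ X[i]) < 1.0:          # margin violated
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    return w

# Linearly separable toy data: two Gaussian blobs on either side of x = 0.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 0], 0.5, (100, 2)),
               rng.normal([-2, 0], 0.5, (100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])
w = pegasos_train(X, y)
acc = np.mean(np.sign(X @ w) == y)
assert acc > 0.95
```

The appeal, as in DL-PEGASOS, is the linear per-iteration cost: each update touches a single example, so the method scales to the pairwise distance constraints used for DT recognition.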

DT Recognition using Sparse Representation [ICPR2010]

Given a sequence of features of a linear dynamical system (LDS), we address the problem of finding a representation of the LDS which is sparse in terms of a given dictionary of LDSs. Since LDSs do not belong to a Euclidean space, traditional sparse coding techniques do not apply. We propose a probabilistic framework and an efficient MAP algorithm to learn this sparse code. Since DTs can be modeled as LDSs, we validate our algorithm by applying it to DT recognition, especially under occlusion.

RFID-Based Applications

Our ongoing project aims at non-invasively integrating Radio Frequency Identification (RFID) technology into building infrastructure. We present RFID technology as a cornerstone for ubiquitous sensing within buildings via three applications we have developed. These applications are user-friendly implementations that facilitate the manipulation of the data stored in RFID tags, especially with regard to location information and temperature logs. They provide a low-cost means for ubiquitous sensing of environmental parameters that is not restricted to measurement alone but also enables intelligent monitoring of these parameters and their temporal fluctuations. For more details, we refer the reader to this report.

  • Indoor tag deployment for RFID tag maintenance

  • Indoor temperature monitoring

  • Indoor RFID reader tracking, which is a low-cost alternative to GPS tracking within buildings