NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540
[Homepage] [Publication] [Teaching] [Miscellaneous]
I have a broad interest in the areas of data mining, database and ubiquitous systems. My research focuses on developing efficient and effective data mining techniques for novel informational systems/applications. Currently I am working on the projects of analyzing cyber security data. I have worked on the projects about trajectory and mobility data mining, sensor data analysis in cyber-physical systems, kNN query processing, data stream mining and web information system construction. My previous research was mainly supported by the NSF project of "Foundations of Cyber-Physical Networks" and the NS-CTA project of "Military Mobility Research".
Following are the brief description about my recent studies.
AUTOMATED SECURITY INTELLIGENCE
Many enterprise systems have huge complexity and keep evolving. It is very difficult if not impossible to keep track of security vulnerabilities in such systems, and even many unknown zero-day vulnerabilities exist. Sophisticated attacks are launched from economical-driven well-organized attackers. On the other side, every employee uses computer and network, but it is impossible to train every employee with enough security knowledge and skills. To this end, we propose the automated security intelligence project, inspired by Sun Tzu's Military Principles: "If you know your enemies and yourself, you can win a hundred battles without a single loss."
TRAJECTORY AND MOBILITY DATA MINING
1. Traveling Companion Discovery
The advance of object
tracking technologies leads to huge volumes of spatio-temporal data collected in
the form of trajectory stream. In this study, we investigate the problem of
discovering object groups that travel together (i.e., traveling companions) from
streaming trajectories. Such technique has broad applications in the areas of
scientific study, transportation management and military surveillance. The key
issue of companion discovery on trajectory stream is about the algorithm's
efficiency. We propose a data structure termed traveling buddy, which is
micro-group of objects that are tightly bound together. By only storing the
object relationships rather than their spatial coordinates, the buddies can be
dynamically maintained along trajectory stream with low cost. The proposed
methods are evaluated with extensive experiments on both real and synthetic
datasets. The buddy-based method is an order of magnitude faster than baselines.
It also achieves higher precision and recall in companion discovery.
Figure 1. To discover the traveling companions, the system has to cluster the objects of each snapshot (the clusters are tagged with black-solid lines, e.g., C1,1, C1,2 in snapshot S1), and then intersect the clusters of different snapshots to retrieve the companion results. Both clustering and intersection steps involve high computational overhead. The traveling buddies (tagged with red-dashed lines, e.g., b1, b2 and b3) are micro-groups of objects that are tightly bound together. Their maintaining cost is much lower and they can help achieve a magnitude-faster improvement in clustering efficiency.
More details of this study can be found in our ICDE'12 paper (presentation slides) and ACM TIST paper.
2. Retrieving k-Nearest Neighboring Trajectories
The big trajectory data bring us new opportunities and challenges in efficient trajectory retrieval. In this study, we propose a new type of query that finds the k Nearest Neighboring Trajectories (k-NNT) with the minimum aggregated distance to a set of query points. Such queries, though have a broad range of applications like trip planning and moving object study, cannot be handled by traditional k-NN processing techniques that only find the neighboring points of an object. To facilitate scalable, flexible and effective query execution, we propose a k-NN trajectory retrieval algorithm using a candidate-generation-and-verification strategy. The algorithm starts by retrieving candidate trajectories near each individual query point, utilizing a data structure called global heap. Then, at the verification step, it refines these trajectory candidates by a lower-bound computed based on the global heap. The global heap guarantees the candidate¡¯s completeness, (i.e., all the k-NNTs are included), and reduces the computational overhead of candidate verification. In addition, we propose a qualifier expectation measure that ranks partial-matching candidate trajectories to accelerate query processing in the cases of non-uniform trajectory distributions or outlier query locations. This technique has been applied in the T-Drive project of MSRA.
Figure 2. A case study of k-NNT query and other state-of-the-art methods: we retrieve 20 vehicle trajectories of Beijing city from T-Drive dataset. Three locations in downtown Beijing are selected as the query input. The k-NNT query, k-BCT query [Chen et.al., SIGMOD 2010] and Moving Object query [Frentzos, SSTD 2005] are carried out separately. Their results are plotted in T-drive system's interface, as shown in the figure. Since the moving object query could only deal single query point, it cannot report the correct trajectory close to all three locations. In Figure 2 (a), k-NNT and k-BCT retrieve the same trajectory. In Figure 2 (b), when the query point q2 slightly moves to east, the result of k-NNT query stays the same since the aggregated distance only has minor changes, however, k-BCT query's result changes substantially and it retrieves a trajectory that exactly passes q2 but is far from q1. In this case, k-NNT query is robust and accurate.
More details of this study can be found in our SSTD'11 paper (presentation slides).
In the cyber-physical systems, some sensors occasionally report unusual or abnormal readings (i.e., atypical data), such data may imply fundamental changes of the monitored areas and possess high domain significance. To benefit the system's performance and help user's decision making, it is important to analyze the atypical data with spatial, temporal and other multi-dimensional information in an integrated manner. In this study, we introduce the techniques to discover the atypical data and summarize them as atypical events. The atypical cluster is proposed as a model to describe the multi-dimensional features of an atypical event. The atypical clusters can be efficiently integrated in a hierarchical framework to form macro-clusters for analytical queries. To the best of our knowledge, this study is the first one to generate, integrate and retrieve clusters from cyber-physical system for multidimensional analysis.
Figure 3.(Traffic Monitoring Application) The atypical clusters are retrieved from the traffic congestion event. The daily congestion clusters are used to construct a hierarchical tree. Such a tree can be used to answer the queries such as "Where and when do the traffic congestions usually happen in the past month?" and "On which road segment (or time period) is the congestion most serious?"
More details of this study can be found in our ICDE'12 paper (presentation slides) and IJDSN paper.
The Cyber-Physical System (CPS) integrates physical (i.e., sensor) devices with cyber (i.e., informational) components to form a context sensitive integrated system that responds intelligently to dynamic changes in real-world situations. Such system has wide applications in the scenarios of communication, surveillance and remote control on battlefields. This study investigates the problem of intruder mining on battlefield: With a large number of wireless sensors deployed in a designated area, the task is real time detection of intruders who enter the area, based on noisy data. The IntruMine is hence proposed to iteratively model the relationship between sensors and intruders via monitoring graphs, and estimates the attribute values of the intruders based on the link information of such graphs. The confidence of intruder detection is computed based on the difference between the real sensor readings and the estimated ones.
Figure 4. (a) The framework
of battlefield network: The sentry nodes are the deployed sensors. They
constantly collect sound,
magnetic or temperature data from the environment. Land-based gateways and aircraft bridges transmit sensor data to a command center. The system in the command center analyzes the collected data to produce information about intruders. Such technology can enable military forces to see through the "fog of war" and improve decision making. (b) The monitoring graph helps select out a small set of relevant sensors that receive significant signals from the intruders and constructs the relationships between sensors and intruders.
More details of this study can be found in our SDM'12 paper (presentation slides), ICDM'10 paper (presentation slides) and JCSS paper.
In many data stream systems, huge amounts of data arrive rapidly and their values change over time. The variations on streams typically imply some fundamental changes of the underlying objects and possess significant domain meanings. In some data streams, successive events seem to recur in a certain time interval, but the data indeed evolves with tiny differences as time elapses. This feature is called pseudo periodicity, which poses a nontrivial challenge to variation management in data streams. This study presents our research effort in online variation management over such streams, and the idea can be applied to the problem domain of medical applications, such as patient vital signal monitoring. We propose a new method named Pattern Growth Graph (PGG) to detect and manage variations over pseudo periodical streams. PGG adopts the wave-pattern to capture the major information of data evolution and represent them compactly. With the help of wave-pattern matching algorithm, PGG detects the stream variations in a single pass over the stream data. PGG only stores the different segments of the pattern for incoming stream, and hence it can substantially compress the data without losing important information. The statistical information of PGG helps to distinguish meaningful data changes from noise and to reconstruct the stream with acceptable accuracy. The techniques are applied as a core module in an ICU healthcare system.
Figure 5.(a) records the
data evolution over a respiration stream. At the beginning, the data in time A
and B are almost the same,
including two inspiration sections and one expiration section. After 20 minutes, the expiration data in time C transforms to two sections. In time D (after 3 hours) and E (after 5 hours), the inspiration data merges into one section, while the expiration data evolves into three sections. The variations reflect the evolution of the patient¡¯s illness during the five hours. Figure 5 (b) demonstrates the PGG model, pattern 1 is a base pattern with 8 segments. It partially matches the new stream wave on segment 1, 2, 3 and 7. Therefore, a growth pattern (pattern 2) with segments 1¡¯--4¡¯ is generated based on the un-matched data.
More details of this study can be found in our SIGMOD'07 paper (presentation slides) and JCST paper.
The class of k Nearest Neighbor (kNN) queries is frequently used in geospatial applications. It is an interesting problem to process kNN query on land surface, i.e., the surface kNN query (skNN). In this study, for the first time, we propose an index structure on land surface that enables exact and fast responses to skNN queries. Two complementary indexing schemes, namely Tight Surface Index (TSI) and Loose Surface Index (LSI), are constructed and stored collectively on a single novel data structure called Surface Index R-tree (SIR-tree). With those indexes, we can process skNN query efficiently by localizing the search and minimizing the invocation of the costly surface distance computation and hence incurring low I/O and computation costs. Our algorithm does not need to know the value of k a priori and can incrementally expand the search region using SIR-tree and report the query result progressively. It also reports the exact shortest surface paths to the query results.
Figure 6. (a) shows a sample application for skNN query. In a virtual environment based on the elevation data of Yosemite National Park, a trekker wants to know how far the nearest camp site is. Note that, p1 is the nearest site in 3-D Euclidean distance; however, since its surface path need to go across the valley, it is actually farther than site p2 on the other side. Figure 6 (b) shows the tight and loose cell index constructed in the land surface. With such index, the efficiency of skNN query processing is improved with a magnitude faster speed than traditional methods.
More details of this study can be found in our VLDB'08 paper.
[Homepage] [Publication] [Teaching] [Miscellaneous]