Aug 2018     Issue 8
Prof. Yufei TAO Awarded PODS 2018 Best Paper Award

Prof. Yufei TAO received the PODS 2018 Best Paper Award, for his paper titled "Entity Matching with Active Monotone Classification."  This is a single-authored paper.

The annual ACM SIGMOD/PODS Conference is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences. The conference includes a fascinating technical program with research and industrial talks, tutorials, demos, and focused workshops. It also hosts a poster session to learn about innovative technology, an industrial exhibition to meet companies and publishers, and a careers-in-industry panel with representatives from leading companies.

Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ∈ X × Y. As the last resort, human experts can be called upon to inspect every (x, y) pair, but this is expensive because the correct verdict could not be determined without investigation effort dedicated specifically to the entities involved. It is therefore important to design an algorithm that asks humans to look at only some pairs then automatically renders verdicts on the other pairs with good accuracy. At the core of most (if not all) existing approaches is the following classification problem. The input is a set P of points in Rd, each of which carries a binary label: 0 or 1. A classifier F is a function from Rd to {0,1}. The objective is to find a classifier that captures the labels of a large number of points in P.

In this paper, we cast the problem as an instance of active learning where the goal is to learn a monotone classifier F, namely, F(p) ≥ F(q) holds whenever the coordinate of p is at least that of q on all dimensions. In our formulation, the labels of all points in P are hidden at the beginning. An algorithm A can invoke an oracle, which discloses the label of a point p ∈ P chosen by A. The algorithm may do so repetitively, until it has garnered enough information to produce F. The cost of A is the number of times that the oracle is called. The challenge is to strike a good balance between the cost and the accuracy of the classifier produced. We describe algorithms with non-trivial guarantees on the cost and accuracy simultaneously. We also prove lower bounds that establish the asymptotic optimality of our solutions for a wide range of parameters.



Past Issue      
Contact Us
Subscribe    Email to friend    Unsubscribe
Copyright © 2020.
All Rights Reserved. The Chinese University of Hong Kong.