CUEngineering

Aug 2018 Issue 8

Research

New Studies Reveal How Much a Human Must Do to Train AI

Prof. Yufei TAO, Department of Computer Science and Engineering

We are no strangers to artificial intelligence (AI). Today, computers are no longer just machines for number crunching. Instead, they seem to grow smarter every day. Indeed, computers can already accomplish tasks that many people thought were too hard for them just a few years ago. When you got your first iPhone, for example, did you imagine that you could speak to your phone and be amazed by how accurately it captured your words? Likewise, did you think you could unlock your phone just by showing your face to the camera? If these are not enough, how about beating the world champions in chess and Go?

Has it occurred to you that we humans must get our hands dirty to train a computer into a real master? If machines could really teach themselves to do everything we see, we would already be in the dreadful era in which AI has transcended and dominated humankind. We are, for sure, not there yet; and yes, humans do have to sweat for a machine to grow its intelligence.

If all this sounds too esoteric, let us look at a concrete scenario. Heard of Amazon and eBay? Well, here is a task for you: identify every pair of product listings (one from Amazon and one from eBay) that are selling the same product. This may look deceptively simple at first glance. All you need to do is read the details of a listing on Amazon, do the same for a listing on eBay, and then judge whether they are the same. The problem, however, is that there are millions of listings on each site. Comparing all of them manually is simply mission impossible.

In despair, you turn to a computer and hope that this piece of metal can pull off the task for you. However, computers rely on algorithms (a.k.a. programs) to do this kind of job. Before doing the real work, an algorithm must first learn the relevant knowledge from some ground-truth "training data" (otherwise, how can it tell that "MS word" means the same as "Microsoft Word processor"?). In the Amazon-eBay scenario, to provide the training data, a human must manually identify some pairs of matching listings, as well as some pairs of non-matching listings. By feeding these resolved pairs (the ground truth) to an algorithm, we hope that the algorithm can learn some patterns and then use them to find the rest of the matching listings automatically.
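To make "learning from resolved pairs" a little more tangible, here is a minimal Python sketch. It is only an illustration and not the method of [1]: the word-overlap score, the function names, and the four training pairs are all invented for this example. The "pattern" it learns is a single cut-off on how many words two listing titles share.

```python
def tokens(title):
    """Lowercase a listing title and split it into a set of words."""
    return set(title.lower().split())


def similarity(a, b):
    """Word overlap (Jaccard): shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def learn_threshold(labeled_pairs):
    """Pick the cut-off that misclassifies the fewest human-labeled pairs.

    A pair is predicted to match when its similarity is >= the cut-off.
    Candidate cut-offs are the similarity scores seen in the training data.
    """
    scored = [(similarity(a, b), is_match) for a, b, is_match in labeled_pairs]
    candidates = sorted({score for score, _ in scored})

    def errors(cutoff):
        return sum((score >= cutoff) != is_match for score, is_match in scored)

    return min(candidates, key=errors)


# Hypothetical training data: (Amazon title, eBay title, human verdict).
TRAINING = [
    ("Apple iPhone 8 64GB Gold", "iPhone 8 Gold 64GB Apple", True),
    ("Microsoft Word 2016", "Microsoft Word 2016 for PC", True),
    ("Canon EOS 80D camera body", "Nikon D3400 camera body", False),
    ("Sony WH-1000XM2 headphones", "Logitech wireless mouse", False),
]

cutoff = learn_threshold(TRAINING)


def is_same_product(a, b):
    """Apply the learned cut-off to a pair no human has looked at."""
    return similarity(a, b) >= cutoff
```

Note that this toy score would still fail on "MS word" versus "Microsoft Word processor", which share only one word out of four. Telling such pairs apart takes richer features and more training data, which is exactly why the amount of human labeling matters.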

Of course, we expect that the algorithm (with its newly acquired AI) will make some mistakes, but hopefully not too many. The quality of the algorithm depends heavily on how much training data we provide in the first place. Just think about the two extremes. On the one hand, we can choose to provide nothing at all, and ask the algorithm to go ahead with zero knowledge anyway; in this case, as can be imagined, the algorithm's accuracy can be arbitrarily bad. On the other hand, we can choose to do all the work (i.e., resolve all pairs of listings ourselves); in this case, the algorithm has nothing to do and vacuously "achieves" full accuracy, but the amount of human work is prohibitive. So, how do we strike a good balance in between? More specifically, what is the minimum amount of human effort needed to reach a desired accuracy?

This question was recently given a surprising answer in a scientific paper (reference [1]). As it turns out, there exists an "inherent" accuracy rate x such that, if we want to ensure better accuracy than x, we must manually do all the work (that is, AI does not help at all). On the other hand, if we are satisfied with the accuracy rate x, then there exists a way to ask a human to do a provably small amount of work. For example, back to the Amazon-eBay task, let us say that its inherent accuracy rate x equals 87%. Then, if you want 88% accuracy, AI will not be useful, and you just have to read all the listings yourself. Interestingly, if you are okay with one percentage point less, the number of listings you have to read suddenly drops from everything to just a tiny fraction.

The inherent accuracy rate x depends on the concrete task. While it can be 87% for Amazon-eBay, it could be a different percentage for the same task on, say, Alibaba and Amazon. Without knowing x, how do we determine how much manual effort to invest? Amazingly, as the paper [1] shows, it is always possible to do the minimum amount of work to achieve accuracy x, without having to know x in advance. Next time you attempt a task such as Amazon-eBay matching, just run the method of [1], which will tell you precisely which listings to read, and will then accomplish the task for you once it has gathered enough information, ensuring the inherent accuracy x. When it does so, you can rest assured that the number of listings you have read could not have been any smaller while achieving the same accuracy x.
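The idea that an algorithm can choose which listings a human should read can be sketched in a deliberately simplified setting. The Python code below is not the method of [1]. It assumes, purely for illustration, that every pair has a single similarity score and that the human's answers are perfectly consistent with it: if a pair matches, every pair with an equal or higher score matches too. The function name and the demo data are hypothetical. Under that assumption, a binary search over the sorted scores recovers every answer while asking only a handful of questions.

```python
def label_all_pairs(scores, ask_human):
    """Label every pair while asking the human about only a few of them.

    scores    -- one similarity score per candidate pair
    ask_human -- callable(index) -> bool, the costly manual judgment
    Toy assumption: a higher score never turns a match into a non-match.
    Returns (labels, number_of_questions_asked).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    questions = 0
    lo, hi = 0, len(order)  # the first matching position lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if ask_human(order[mid]):
            hi = mid       # this pair matches, so all higher-scored ones do
        else:
            lo = mid + 1   # this pair does not match, nor do lower-scored ones
    labels = [False] * len(scores)
    for position in range(lo, len(order)):
        labels[order[position]] = True
    return labels, questions


# Hypothetical demo: 1,000 candidate pairs whose true answer is "match"
# exactly when the score is at least 0.7 (an invented rule).
scores = [i / 1000 for i in range(1000)]
truth = [s >= 0.7 for s in scores]
labels, questions = label_all_pairs(scores, lambda i: truth[i])
print(labels == truth, questions)  # every label recovered, few questions asked
```

In this demo the human answers at most 10 questions instead of 1,000. Real listings are noisier and cannot be summarized by one score, which is where the inherent accuracy rate x and the guarantees of [1] come into play.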

The Amazon-eBay task is, in fact, a real task of profound importance to e-commerce (it is not surprising that both Amazon and eBay would love to know whether the same products are being sold at a lower price on the other site). In reality, companies hire people to do the manual matching of listings (this is called crowdsourcing). The new techniques developed in [1] have the potential to save these companies millions of dollars in a wide range of applications similar to the Amazon-eBay one (known as entity matching in the scientific community).

The paper [1], published by Prof. Yufei Tao (CSE Dept, CUHK) as the sole author, won the best-paper award at PODS 2018, a premier conference on database theory.

References
[1] Yufei Tao. Entity Matching with Active Monotone Classification. Proceedings of the 37th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 49-62, 2018.

