Welcome to CrowdGame
For the full paper, see Cost-Effective Data Annotation using Game-Based Crowdsourcing.
Introduction
Labeling rules allow us to annotate multiple tuples at once. For example, in entity matching, i.e., identifying whether a pair of product records refers to the same entity, we can find textual rules that help us match records such as the following.
Record a: Sony VAIO Silver Desktop Laptop
Record b: Apple Pro Black Desktop Notebook
Record c: Apple MacBook Pro Silver Laptop
As the example shows, we can say that a product containing "Sony" is different from one containing "Apple". However, the rules (Black, Silver) and (Laptop, Notebook) lead to errors. In other words, labeling rules help us label multiple tuples at once, but they may also degrade labeling quality.
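To make the idea concrete, here is a minimal Python sketch of applying such keyword-pair rules to the records above. The rule format and the helper `apply_rule` are illustrative assumptions, not the paper's implementation.

```python
# Toy example: keyword-pair rules that label a pair of records as "different".
records = {
    "a": "Sony VAIO Silver Desktop Laptop",
    "b": "Apple Pro Black Desktop Notebook",
    "c": "Apple MacBook Pro Silver Laptop",
}

# Each rule: if one record contains the first keyword and the other contains
# the second, label the pair as a non-match ("different entity").
rules = [
    ("Sony", "Apple"),       # reliable: different brands
    ("Black", "Silver"),     # risky: color may differ within the same entity
    ("Laptop", "Notebook"),  # risky: synonyms for the same product type
]

def apply_rule(rule, left, right):
    """Return 'different' if the rule fires on the pair, else None."""
    w1, w2 = rule
    if (w1 in left and w2 in right) or (w2 in left and w1 in right):
        return "different"
    return None

for rule in rules:
    for x, y in [("a", "b"), ("a", "c"), ("b", "c")]:
        label = apply_rule(rule, records[x], records[y])
        if label:
            print(f"rule {rule} labels pair ({x}, {y}) as {label}")
```

Note how the risky rules also fire on the pair (b, c), which illustrates the quality loss discussed above.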
This paper introduces a cost-effective annotation approach and focuses on the labeling rule generation problem, which aims to generate high-quality rules that significantly reduce labeling cost while preserving quality.
To address the problem, we first generate candidate rules and then devise a game-based crowdsourcing approach, CrowdGame, to select high-quality rules by considering both coverage and precision. CrowdGame employs two groups of crowd workers: one group answers rule validation tasks (is a rule valid?) and plays the role of rule generator, while the other group answers tuple checking tasks (is the annotated label of a data tuple correct?) and plays the role of rule refuter. The two-pronged crowdsourcing task scheme is shown below.
We let the two groups play a two-player game: the rule generator identifies high-quality rules with large coverage and high precision, while the rule refuter tries to refute its opponent by checking tuples that provide enough evidence to reject the rules covering them. We iteratively invoke the rule generator and the rule refuter until the crowdsourcing budget is used up; a simplified sketch of this loop follows.
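The sketch below illustrates the control flow only. It assumes simple callable interfaces (`ask_rule_valid`, `ask_tuple_correct`) for the two crowd task types and a coverage-greedy choice by the generator; the actual task selection in CrowdGame uses the minimax strategy described later, so this is not the authors' implementation.

```python
def crowd_game(candidate_rules, covered_tuples, ask_rule_valid,
               ask_tuple_correct, budget):
    """Illustrative game loop between rule generator and rule refuter.

    candidate_rules: list of rule ids.
    covered_tuples:  dict rule_id -> set of tuple ids covered by the rule.
    ask_rule_valid(rule_id)    -> bool (crowd rule-validation task).
    ask_tuple_correct(tuple_id)-> bool (crowd tuple-checking task).
    """
    alive = set(candidate_rules)  # rules not yet refuted
    accepted = []
    while budget > 0 and alive:
        # Rule generator: pick the live rule with the largest coverage.
        rule = max(alive, key=lambda r: len(covered_tuples[r]))
        budget -= 1
        if not ask_rule_valid(rule):
            alive.discard(rule)
            continue
        if budget == 0:
            accepted.append(rule)
            break
        # Rule refuter: check one tuple covered by the validated rule.
        t = next(iter(covered_tuples[rule]))
        budget -= 1
        if ask_tuple_correct(t):
            accepted.append(rule)          # no counter-example found
            alive.discard(rule)
        else:
            # Refute every live rule that covers the wrongly labeled tuple.
            alive = {r for r in alive if t not in covered_tuples[r]}
    return accepted
```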
Framework
We introduce a cost-effective data annotation framework, shown in the figure below. The framework reduces cost by using labeling rules (or simply rules), each of which can annotate multiple tuples.
Our framework takes a set of unlabeled tuples as input and annotates them utilizing labeling rules in two phases.
- Phase I: Crowdsourced Rule Generation. This phase aims at generating "high-quality" rules, where rule quality is measured not only by coverage but also by precision. To this end, we first construct candidate rules, which may vary in coverage and precision, and then leverage the crowd to identify good rules among the noisy candidates.
- Phase II: Data Annotation using Rules. This phase annotates the tuples using the labeling rules generated in the previous phase, as sketched after this list.
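A minimal sketch of what Phase II might look like is given below. The `covers`/`label` rule representation and the first-match conflict policy are assumptions for illustration, not the paper's actual annotation procedure.

```python
def annotate(tuples, rules):
    """Apply generated rules to unlabeled tuples.

    tuples: iterable of tuple ids.
    rules:  list of (covers, label) pairs, where covers(t) -> bool and
            label is the annotation the rule assigns to covered tuples.
    """
    labels = {}
    for t in tuples:
        for covers, label in rules:
            if covers(t):
                labels[t] = label  # first matching rule wins (assumption)
                break
    return labels
```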
Let us use the example above to show how CrowdGame works; it resembles a round-based board game between two players.
- In the first round, RuleGen selects r3 for rule validation, as it covers 4 tuples. However, its opponent RuleRef finds a “counter-example” e5 using a tuple checking task. Based on this, RuleRef refutes both r3 and r4 and rejects their covered tuples.
- In the second round, RuleGen selects another crowd-validated rule r1, while RuleRef crowdsources e1, finds that e1 is correctly labeled, and thus has no "evidence" to refute r1. As the budget is now used up, we obtain the rule set R = {r1}.
We study three challenges in CrowdGame. The first is balancing the trade-off between coverage and precision; we define the loss of a rule by considering both factors. The second is accurately estimating rule precision; we utilize Bayesian estimation to combine both rule validation and tuple checking tasks. The third is selecting crowdsourcing tasks within the game-based framework so as to minimize the loss; we introduce a minimax strategy and develop efficient task selection algorithms. A toy illustration of these ideas is sketched below.
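As a rough illustration of the first two challenges, the sketch below estimates a rule's precision with a Beta posterior that pools rule-validation votes and tuple-checking outcomes, and combines coverage and precision in a toy loss. The specific weighting, prior, and loss form are placeholders, not the estimator and loss defined in the paper.

```python
def estimate_precision(prior_a, prior_b, n_valid_votes, n_invalid_votes,
                       n_correct_tuples, n_wrong_tuples):
    """Posterior-mean precision of a rule under a Beta prior (illustrative).

    prior_a, prior_b: Beta prior pseudo-counts.
    n_valid_votes / n_invalid_votes: crowd answers to rule validation tasks.
    n_correct_tuples / n_wrong_tuples: tuple checking outcomes on tuples
    covered by the rule.
    """
    a = prior_a + n_valid_votes + n_correct_tuples
    b = prior_b + n_invalid_votes + n_wrong_tuples
    return a / (a + b)

def rule_loss(coverage, precision, alpha=0.5):
    """Toy loss balancing coverage and precision; lower is better.
    alpha trades off the two factors (the value here is arbitrary)."""
    return alpha * (1.0 - coverage) + (1.0 - alpha) * (1.0 - precision)
```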
For more details, please view our full paper.
Publication
- Jingru Yang, Ju Fan, Zhewei Wei, Guoliang Li, Tongyu Liu, Xiaoyong Du. Cost-Effective Data Annotation using Game-Based Crowdsourcing. PVLDB, 12(1): 57-70, 2018.
- Ju Fan, Guoliang Li. Human-in-the-loop Rule Learning for Data Integration. IEEE Data Eng. Bull., 41(2): 104-115, 2018.
Datasets
We provide the 4 datasets used in our paper, including the tuple labels and the rule labels we obtained from Amazon Mechanical Turk (AMT).
Dataset | Description | # Tuples to be annotated | # Candidate labeling rules | Download
---|---|---|---|---
Abt-Buy | Electronics product records from two websites, Abt and BestBuy. | 1.2 million | 16 k | Abt-Buy
Ama-Goo | Software products from two websites, Amazon and Google. | 4.4 million | 15 k | Ama-Goo
Ebay | Beauty products collected from the website eBay. | 0.5 million | 13 k | Ebay
Spouse | Identify whether two people mentioned in a sentence have a spouse relation. | 5.5 k | 360 | Spouse
Source Code
You can download the source code from Here. With Python 3.5 installed, you can run the code on an example dataset:
cd src
python run.py