Wikidata is a knowledge base where anybody can add new information. Unfortunately, it is targeted by vandals, who put inaccurate or offensive information there. To fight them, Wikidata employs moderators, who manually inspect each suggested edit. In this talk we will look into how Machine Learning can automatically detect vandalism and help the moderators.
Knowledge bases are an important source of information for many AI systems: they rely on such bases to enrich the information they process and improve the user experience. Building a knowledge base is difficult, which is why the process is often crowd-sourced. One such base is Wikidata: it allows anybody on the Internet to edit its content and add new information.
Unfortunately, Wikidata is often targeted by vandals, who misuse the system and put false or offensive information there. This may lead to incorrect behaviour of the AI systems that depend on it. To keep the base clean, Wikidata employs moderators who manually inspect each revision and revert the vandalized ones.
To help moderators fight vandals, the organizers of WSDM Cup 2017 challenged the participants to build a Machine Learning model which automatically detects whether an edit should be rolled back. In this talk we will discuss the second-place solution to the Cup: how to process half a terabyte of revisions, extract meaningful features, and create a production-ready model that scales to a large number of test examples.
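To give a flavour of the feature-extraction step, here is a minimal sketch of turning a single revision into numeric features. The field names (`comment`, `user_is_anonymous`) and the features themselves are illustrative assumptions, not the actual feature set from the Cup solution, which is far richer.

```python
def extract_features(revision):
    """Hypothetical per-revision features for a vandalism classifier.

    `revision` is assumed to be a dict with a `comment` string and a
    `user_is_anonymous` flag; the real solution uses many more signals.
    """
    comment = revision.get("comment", "")
    return {
        # Vandal edits often have very short or empty comments.
        "comment_length": len(comment),
        # Anonymous edits are statistically more likely to be vandalism.
        "is_anonymous": int(revision.get("user_is_anonymous", False)),
        # A high share of uppercase characters can indicate shouting/spam.
        "upper_ratio": sum(c.isupper() for c in comment) / max(len(comment), 1),
    }
```

Feature vectors like this would then be fed to a standard binary classifier (e.g. gradient boosting) trained on moderator rollback labels.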