Multi-Source Data Integration with Humans In The Loop
Teacher
Donatella Firmani
Abstract
Many domains, such as government Open Data and the World Wide Web, can provide thousands of data sources about real-world entities, including profiles of people and institutions, or specifications of products and services. In these scenarios, multi-source data integration technologies are key for solving complex tasks, such as building a knowledge graph and extracting semantic information. Multi-source data integration is an intricate problem because information can be represented, and misrepresented, in a variety of ways that humans can easily match and distinguish based on domain knowledge, but that are challenging for automated strategies.
To this end, a wide range of human-machine algorithms have recently been proposed. The course aims to illustrate these methods and how the domain knowledge of humans can be leveraged effectively for data integration. The lectures focus on (i) the problem of integrating large-scale datasets extracted from multiple sources, (ii) the algorithms and software that have been proposed in recent years to solve this problem, and (iii) possible directions of research in this context. All the subjects addressed during the course are investigated from both practical and methodological perspectives. The course includes practical exercises with real-world data and the assignment of individual projects.
Program
1. Introduction to Multi-Source Data Integration
2. Entity Resolution with a Crowd Oracle
3. Scaling Up with Blocking
4. Semi-Structured Data (Alignment and Fusion)
5. Beyond Integration: Hierarchies and Knowledge Graphs