Lectures
- Crowd Data Sourcing : Sihem Amer-Yahia, Univ. Grenoble Alpes, CNRS, France; Tova Milo, Tel Aviv University, Israel
- Data Quality : Floris Geerts, Antwerp University, Belgium; Melanie Herschel, Universitaet Stuttgart, Germany
- Data Integration : Alvaro A A Fernandes, University of Manchester, UK; Domenico Lembo, Università di Roma La Sapienza, Italy
- Query Answering : Pablo Barcelo, Universidad de Chile; Paolo Guagliardo, University of Edinburgh, UK
- Reasoning about Data* : Andreas Pieris, University of Oxford, UK; Emanuel Sallinger, University of Oxford, UK
- Data, Responsibly - Fairness, Neutrality and Transparency in Data Analysis* : Serge Abiteboul, Inria, ENS Paris, France; Julia Stoyanovich, Drexel University, Philadelphia, USA
- Intelligent Web Data Extraction* : Tim Furche, University of Oxford, UK; Paolo Merialdo, Università di Roma 3, Italy
* These courses are part of an innovative special AI&DB track sponsored by the AI Journal
Crowd Data Sourcing
Sihem Amer-Yahia, Univ. Grenoble Alpes, CNRS, France
Tova Milo, Tel Aviv University, Israel
Crowdsourcing is a powerful project management and procurement strategy that enables value realization via an "open call" to an unlimited pool of people through Web-based technology. Our focus here is crowd-based data sourcing where the crowd is invoked to obtain data, to aggregate and/or fuse data, to process data, or, more directly, to develop dedicated applications or solutions over the sourced data. This course will focus on the role of human workers in the assignment of tasks to the crowd, and on the process of crowd mining, i.e., how to best leverage the crowd to complete tasks.
Part 1 introduces some crowdsourcing applications and platforms, and classifies them into domain-specific/generic, paying/non-paying, etc. It describes the main steps of task deployment, namely worker recruitment, task assignment, task completion, learning workers’ and requesters’ characteristics, and worker compensation. It also summarizes research challenges that are being tackled by several communities.
Part 2 is dedicated to database query processing with the crowd.
Part 3 focuses on crowd mining and the different software/architecture components needed to achieve it. We will describe declarative crowd member selection and present algorithms to achieve operations such as filtering with the crowd and more generally, evaluating queries with the crowd.
Part 4 is dedicated to human factors in crowdsourcing and a summary of how they affect different crowdsourcing processes. It describes in detail algorithmic task assignment whereby workers and tasks are matched.
We conclude with a summary of open questions.
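To give a flavour of the kind of crowd-powered filtering discussed above, here is a minimal, purely illustrative sketch (not part of the course material) of accuracy-weighted majority voting over yes/no answers; the worker identifiers and accuracy estimates are hypothetical.

```python
from collections import defaultdict

def weighted_majority(votes, accuracy):
    """Aggregate crowd votes on a yes/no filtering question.

    votes    -- dict mapping worker id -> True/False answer
    accuracy -- dict mapping worker id -> estimated accuracy in (0, 1]
    """
    score = defaultdict(float)
    for worker, answer in votes.items():
        score[answer] += accuracy.get(worker, 0.5)  # unknown workers count as coin flips
    return max(score, key=score.get)

# Should the item pass the filter predicate? Three (hypothetical) workers vote.
votes = {"w1": True, "w2": True, "w3": False}
accuracy = {"w1": 0.9, "w2": 0.6, "w3": 0.7}
print(weighted_majority(votes, accuracy))  # -> True
```

In practice the accuracy estimates are themselves learned from the workers’ answer history, which relates to the learning of workers’ characteristics mentioned in Part 1.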
Data Quality
Floris Geerts, Antwerp University, Belgium
Melanie Herschel, Universitaet Stuttgart, Germany
Data quality is one of the most important problems in data management. Dirty data may lead to misleading or biased analytical decisions and to loss of revenue, credibility, and customers. Since data is often too big to be manually cleaned or curated, computational methods are required for detecting inconsistent, inaccurate, incomplete, duplicate, or stale data, and for repairing the data either in an automated way or by leveraging users’ input.
In this course, we first describe different causes of dirty data and survey its different types, and then provide an overview of detection and repair approaches from both a theoretical and a practical point of view.
In particular, we cover declarative approaches to data quality in which various logical formalisms are used to detect different kinds of dirty data. These formalisms, in combination with different repair models, also lead to practical algorithms for repairing dirty data.
Not all data quality aspects, however, have been covered or explored from a theoretical perspective. In those settings, specialized heuristic methods are developed. In this course, we discuss such heuristic methods that resolve specific kinds of dirty data. Attendees will have the opportunity to gain hands-on experience in data quality during a practical exercise that focuses on the detection of one specific type of data error, namely duplicates, in a relational dataset.
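As a small taste of the practical exercise on duplicate detection, the following sketch flags candidate duplicates in a toy relational table using pairwise string similarity; the records, the similarity measure and the threshold are illustrative assumptions, not the exercise material.

```python
from difflib import SequenceMatcher
from itertools import combinations

records = [
    (1, "John Smith", "New York"),
    (2, "Jon Smith", "New York"),
    (3, "Mary Jones", "Boston"),
]

def similarity(r1, r2):
    """Average string similarity over the non-key attributes."""
    pairs = zip(r1[1:], r2[1:])
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / (len(r1) - 1)

# Flag record pairs whose similarity exceeds a (hand-picked) threshold as candidate duplicates.
duplicates = [(r1[0], r2[0])
              for r1, r2 in combinations(records, 2)
              if similarity(r1, r2) > 0.85]
print(duplicates)  # -> [(1, 2)]
```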
Data Integration
Alvaro A A Fernandes, University of Manchester, UK
Domenico Lembo, Università di Roma La Sapienza, Italy
Data integration is the problem of providing users with a unified, reconciled and value-added view of data stored in diverse and possibly heterogeneous data sources. The theoretical view of this lecture will present an overview of the research work carried out on data integration over the years. Particular attention will be paid to conceptual architectures for information integration and to the main query processing and data exchange techniques proposed in the literature. We will discuss the impact that integrity constraints in the global unified view have on query processing, and survey ontology-based data access and integration, a recent form of mediator-based data integration in which the global view is given in terms of an ontology.
The systems view of data integration will be covered in three parts:
- (i) a systems-oriented survey of the literature on data integration focusing on matching and mapping generation (with glances at dependencies and impacts on extraction and preparation, as well as on evaluation, de-duplication and fusion), highlighting approaches, techniques and algorithms (see the sketch below);
- (ii) a more detailed exploration of a value-adding strategy in which user and data context inform the decisions made within a data integration process, when this is seen as one stage in flexible and dynamic end-to-end data wrangling orchestrations;
- (iii) a practical, problem-solving session with the data integration components from the Oxford-Edinburgh-Manchester VADA prototype system.
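The sketch referred to in item (i) above: a minimal name-based attribute matcher between two hypothetical source schemas. Real matchers combine many more signals (instances, types, structure); the attribute names here are purely illustrative.

```python
from difflib import SequenceMatcher

source = ["cust_name", "cust_addr", "phone_no"]
target = ["customer_name", "customer_address", "telephone"]

def best_match(attr, candidates):
    """Return the candidate attribute with the highest name similarity."""
    return max(candidates, key=lambda c: SequenceMatcher(None, attr, c).ratio())

correspondences = {a: best_match(a, target) for a in source}
print(correspondences)
# e.g. {'cust_name': 'customer_name', 'cust_addr': 'customer_address', 'phone_no': 'telephone'}
```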
Query Answering
Pablo Barcelo, Universidad de Chile
Paolo Guagliardo, University of Edinburgh, UK
The course will be divided into two parts. The first one, which is more theoretical in spirit, will introduce some of the most common classes of database queries, including first-order logic - or, equivalently, relational algebra - and unions of conjunctive queries, which correspond to the positive fragment of the latter. For each of these languages, we will analyze the computational cost of its associated evaluation problem.
The main goal of the course is to develop a principled understanding of when query evaluation is computationally hard, and which kinds of real-world restrictions alleviate the cost of this task.
We will study different notions of complexity - in particular, combined, data, and parameterized complexity - which measure in different ways the influence of the size of the data and of the query on the evaluation problem. We will also present a theory of structural decompositions of joins that serves as the basis for the efficient evaluation of the basic class of unions of conjunctive queries.
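To make the complexity measures tangible, here is a purely illustrative brute-force evaluator for Boolean conjunctive queries: it tries every assignment of query variables to domain values, so its cost is roughly |dom|^|vars|, i.e., polynomial in the data for a fixed query (data complexity) but exponential when the query is part of the input (combined complexity). The relation names and the example query are assumptions made for the sake of the example.

```python
from itertools import product

def eval_boolean_cq(atoms, database):
    """Brute-force evaluation of a Boolean conjunctive query.

    atoms    -- list of (relation_name, (var_or_const, ...)) query atoms
    database -- dict mapping relation_name -> set of tuples
    Variables are strings starting with '?'; everything else is a constant.
    """
    domain = {c for tuples in database.values() for t in tuples for c in t}
    variables = sorted({x for _, args in atoms for x in args if str(x).startswith("?")})
    for assignment in product(domain, repeat=len(variables)):  # |dom|^|vars| candidates
        env = dict(zip(variables, assignment))
        if all(tuple(env.get(a, a) for a in args) in database.get(rel, set())
               for rel, args in atoms):
            return True
    return False

# Q: exists x, y, z such that R(x, y) and R(y, z)  -- a length-2 path
db = {"R": {("a", "b"), ("b", "c")}}
print(eval_boolean_cq([("R", ("?x", "?y")), ("R", ("?y", "?z"))], db))  # -> True
```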
Answering queries on databases with nulls is an important task in many data management applications. In the second part of the course we will first introduce some basic concepts in the theory of incomplete information: models of incompleteness (open-world and closed-world semantics) and representation systems (Codd tables and conditional tables). We will then focus on the fundamental notion of "certain answers", that is, answers that are true in every possible world represented by an incomplete database. For positive queries these can be found efficiently by naive evaluation, but for queries involving negation the problem becomes intractable (in data complexity). For such queries, we will consider novel approximation schemes with good complexity bounds, and we will discuss how and to what extent these can be applied in practice for SQL queries in real DBMSs.
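A minimal sketch of naive evaluation over an incomplete database: labelled nulls are treated as ordinary constants during evaluation and any answer containing a null is discarded, which for positive queries yields exactly the certain answers. The relations and data below are illustrative assumptions.

```python
# Labelled nulls are represented as strings starting with '_'.
def is_null(v):
    return isinstance(v, str) and v.startswith("_")

# Person(name, city) and Office(city) -- toy instance with one labelled null.
person = {("Alice", "Paris"), ("Bob", "_n1")}
office = {("Paris",), ("London",)}

# Positive query: names of people whose city has an office.
naive = {(name,) for (name, city) in person
                 for (ocity,) in office
                 if city == ocity}           # nulls treated as ordinary constants
certain = {t for t in naive if not any(is_null(v) for v in t)}
print(certain)                                # -> {('Alice',)}
```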
Reasoning about Data
Andreas Pieris, University of Oxford, UK
Emanuel Sallinger, University of Oxford, UK
The need for ontological reasoning has been widely acknowledged, both in the knowledge representation and reasoning community and in the database community. Indeed, ontologies may be used to infer knowledge that is not explicitly stored, which helps overcome the incompleteness of the source data. Moreover, they allow abstracting from the specific way the data is stored, enabling users to pose queries at a more conceptual level. Ontologies also serve as a common foundation for many tasks required in data wrangling systems.
In this tutorial, we give an overview of ontological reasoning where ontologies are modeled as logical rules. Such formalisms are known under many different names, among them tuple-generating dependencies, existential rules and Datalog+/- rules. We first introduce the formal setting and the challenges that processing big data poses to reasoning. We then focus on the logical foundations, giving an overview of the different ways such rules are represented in various fields, and introduce the main reasoning tasks that need to be solved for them. After introducing the primary tools needed to understand and handle such rules, we focus on concrete rule-based ontology languages that strike a balance between expressivity on the one hand and computational complexity on the other. We conclude by demonstrating how theoretical results on computational complexity transfer to practical reasoning systems. We look at simple but concrete examples from data wrangling to give a feeling for how theory and practice meet in this area.
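As a purely illustrative aside (the course itself introduces the formal machinery), here is a sketch of a single application step for one tuple-generating dependency of the form R(x, y) -> exists z. S(y, z): whenever the body is satisfied but no witness for the head exists, a fresh labelled null is invented. The relations and facts are hypothetical.

```python
from itertools import count

fresh = (f"_z{i}" for i in count())  # generator of fresh labelled nulls

def apply_tgd_once(db):
    """Apply the rule  R(x, y) -> exists z. S(y, z)  to every violating R-fact."""
    new_facts = set()
    for (x, y) in db.get("R", set()):
        if not any(first == y for (first, _) in db.get("S", set())):
            new_facts.add((y, next(fresh)))  # invent a witness using a fresh labelled null
    db.setdefault("S", set()).update(new_facts)
    return bool(new_facts)                   # True if the instance changed

db = {"R": {("a", "b"), ("b", "c")}, "S": {("c", "d")}}
apply_tgd_once(db)
print(db["S"])  # e.g. {('c', 'd'), ('b', '_z0')}
```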
Data, Responsibly - Fairness, Neutrality and Transparency in Data Analysis
Serge Abiteboul, Inria, ENS Paris, France
Julia Stoyanovich, Drexel University, Philadelphia, USA
Big Data technology holds incredible promise of improving people’s lives, accelerating scientific discovery and innovation, and bringing about positive societal change. Yet, if not used responsibly, this technology can propel economic inequality, destabilize global markets and affirm systemic bias. In this course we will focus on the importance of using Big Data technology responsibly - in a manner that adheres to the legal requirements and ethical norms of our society.
The primary goal of this course is to draw the attention of students in data management to the important emerging subject of responsible data management and analysis. We will define key notions, such as fairness, diversity, stability, accountability, transparency, and neutrality. We will give examples of concrete situations, many of which were covered in the recent popular press, where reasoning about and enforcing these properties is important. We will then discuss potential algorithmic approaches for quantifying and enforcing responsible practices, using real datasets and application scenarios from criminal sentencing, credit scoring, and homelessness services.
The course will be structured in three parts. The first part discusses the importance of stating assumptions and interpreting data analysis results in context. The second part gives a high-level overview of responsible data sharing, acquisition, management, and analysis practices. The third part focuses on fairness.
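As one concrete way of quantifying fairness, the sketch below computes the demographic-parity (disparate-impact) ratio of positive outcomes between two groups on a toy set of decisions; the data and the 0.8 cut-off (the well-known "four-fifths" rule of thumb) are illustrative, not taken from the course datasets.

```python
# Toy decisions: (group, positive_outcome)
decisions = [("A", True), ("A", True), ("A", False),
             ("B", True), ("B", False), ("B", False)]

def positive_rate(group):
    outcomes = [pos for g, pos in decisions if g == group]
    return sum(outcomes) / len(outcomes)

ratio = positive_rate("B") / positive_rate("A")   # disparate-impact ratio
print(f"selection-rate ratio: {ratio:.2f}")        # -> 0.50
print("flag for review" if ratio < 0.8 else "within four-fifths rule")
```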
Intelligent Web Data Extraction
Tim Furche, University of Oxford, UK
Paolo Merialdo, Università di Roma 3, Italy
We are told that this is the age of “big data” and that accurate and comprehensive data sets are the new gold of this age. In this course, we show you that not all that glitters is gold, but we also provide you with the skills and pointers needed to refine gold out of the vast lakes of documents and data out there.
We focus specifically on web data extraction, though that is hardly a limitation any more these days. This course will take you on a tour de force through the major aspects of data and information extraction over the last decade. We start at the micro level of individual sites that you may want to scrape with exceedingly high accuracy, and show how to approach such a task. We then turn our gaze to more complex settings where data is spread over multiple sources but is still very much of the same kind and shape. Finally, we end up at the macro level of the entire “web”, where the focus shifts to coarser, less specific extraction systems such as Google’s Knowledge Vault. Where we have to choose, we focus on knowledge-driven approaches from the larger AI and databases community, but provide pointers to other work.
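A minimal wrapper-style sketch of the "micro level" scraping of a single page, using only the Python standard library; the page structure and field names are hypothetical.

```python
from html.parser import HTMLParser

HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">19.99</span></li>
</ul>
"""

class ProductExtractor(HTMLParser):
    """Collect (name, price) pairs by tracking which span we are inside."""
    def __init__(self):
        super().__init__()
        self.field = None
        self.current = {}
        self.records = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field:
            self.current[self.field] = data.strip()
            self.field = None
            if len(self.current) == 2:
                self.records.append((self.current["name"], self.current["price"]))
                self.current = {}

parser = ProductExtractor()
parser.feed(HTML)
print(parser.records)  # -> [('Widget', '9.99'), ('Gadget', '19.99')]
```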
A particular theme of this course is the interchange between academia and industry in this field, along with our own personal journeys between the two.
At the end of the course and tutorial, attendees will not only have a good grasp of the underlying concepts but will be ready to apply them in practice for data collection activities in whatever domain they are excited about.