Lectures

* These courses are part of an innovative special AI&DB track sponsored by the AI Journal 

 

Crowd Data Sourcing

 

Sihem Amer-Yahia, Univ. Grenoble Alpes, CNRS, France 

Tova Milo, Tel Aviv University, Israel

Crowdsourcing is a powerful project management and procurement strategy that enables value realization via an "open call" to an unlimited pool of people through Web-based technology. Our focus here is crowd-based data sourcing where the crowd is invoked to obtain data, to aggregate and/or fuse data, to process data, or, more directly, to develop dedicated applications or solutions over the sourced data. This course will focus on the role of human workers in the assignment of tasks to the crowd, and on the process of crowd mining, i.e., how to best leverage the crowd to complete tasks.

Part 1 introduces a number of crowdsourcing applications and platforms and classifies them into domain-specific/generic, paying/non-paying, etc. It describes the main steps of task deployment, namely worker recruitment, task assignment, task completion, learning workers’ and requesters’ characteristics, and worker compensation. It also summarizes research challenges that are being tackled by several communities.

Part 2 is dedicated to database query processing with the crowd.

Part 3 focuses on crowd mining and the different software/architecture components needed to achieve it. We will describe declarative crowd member selection and present algorithms for operations such as filtering with the crowd and, more generally, evaluating queries with the crowd.
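
To make the idea of filtering with the crowd concrete, here is a deliberately simple sketch of our own (not an algorithm from the course): every item is shown to several workers, and it passes the filter when a strict majority of the collected answers say it satisfies the predicate.

    from collections import Counter

    # Invented worker answers per item: True means "the item satisfies the predicate".
    answers = {
        "photo_1": [True, True, False],
        "photo_2": [False, False, True, False],
        "photo_3": [True, True, True],
    }

    def crowd_filter(answers_per_item):
        """Keep the items for which a strict majority of workers voted True."""
        kept = []
        for item, votes in answers_per_item.items():
            counts = Counter(votes)
            if counts[True] > counts[False]:
                kept.append(item)
        return kept

    print(crowd_filter(answers))  # ['photo_1', 'photo_3']

Real crowd-filtering algorithms additionally decide how many answers to collect per item and weight workers by their estimated reliability.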

Part 4 is dedicated to human factors in crowdsourcing and summarizes how they affect different crowdsourcing processes. It describes in detail algorithmic task assignment, whereby workers are matched to tasks.
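
As a rough illustration of such matching (again a sketch of our own, with invented worker and task profiles rather than the algorithms presented in the lecture), one can greedily assign each task to the available worker whose declared skills overlap most with the task's requirements:

    # Invented workers, tasks and skills; the capacity parameter is also assumed.
    workers = {
        "w1": {"french", "translation"},
        "w2": {"image", "labeling"},
        "w3": {"french", "labeling"},
    }
    tasks = {
        "t1": {"french", "translation"},
        "t2": {"image", "labeling"},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def assign(tasks, workers, capacity=1):
        """Greedy assignment: each task goes to the most similar worker
        that still has remaining capacity."""
        load = {w: 0 for w in workers}
        assignment = {}
        for task, required in tasks.items():
            candidates = [w for w in workers if load[w] < capacity]
            best = max(candidates, key=lambda w: jaccard(workers[w], required))
            assignment[task] = best
            load[best] += 1
        return assignment

    print(assign(tasks, workers))  # {'t1': 'w1', 't2': 'w2'}

Human factors such as motivation, fatigue and compensation, discussed in this part, are exactly what this naive sketch ignores.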

We conclude with a summary of open questions.

 

Data Quality

Floris Geerts, Antwerp University, Belgium

Melanie Herschel, Universitaet Stuttgart, Germany 

Data quality is one of the most important problems in data management. Dirty data may lead to misleading or biased analytical decisions and to loss of revenue, credibility and customers. Since data is often too big to be manually cleaned or curated, computational methods are required for detecting inconsistent, inaccurate, incomplete, duplicate, or stale data, and for repairing the data either in an automated way or by leveraging user input.

In this course, after describing different causes of dirty data and surveying the different types of dirty data, we provide an overview of data quality approaches from both a theoretical and a practical point of view.

In particular, we cover declarative approaches to data quality, in which various logical formalisms are used to detect different kinds of dirty data. These formalisms, in combination with different repair models, also lead to practical algorithms for repairing dirty data.
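
For a flavour of the declarative approach, here is a minimal sketch of our own (not course material): the functional dependency zip → city states that records agreeing on zip must agree on city, and its violations can be detected by grouping on the left-hand side.

    from collections import defaultdict

    # Invented example data with one inconsistency.
    customers = [
        {"name": "Ann",  "zip": "75001", "city": "Paris"},
        {"name": "Bob",  "zip": "75001", "city": "Lyon"},    # violates zip -> city
        {"name": "Carl", "zip": "10115", "city": "Berlin"},
    ]

    def fd_violations(rows, lhs, rhs):
        """Return groups of rows that agree on lhs but disagree on rhs."""
        groups = defaultdict(list)
        for row in rows:
            groups[row[lhs]].append(row)
        return [g for g in groups.values() if len({r[rhs] for r in g}) > 1]

    for group in fd_violations(customers, "zip", "city"):
        print("conflicting records:", group)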

Not all data quality aspects, however, have been covered or explored from a theoretical perspective. In those settings, specialized heuristic methods are developed. In this course, we discuss such heuristic methods for resolving specific kinds of dirty data. Attendees will have the opportunity to gain hands-on experience in data quality during a practical exercise that focuses on the detection of one specific type of data error, namely duplicates, in a relational dataset.
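
As a taste of that exercise (an illustrative sketch only, not the exercise solution), a very basic duplicate detector compares record pairs with a string-similarity measure and flags pairs above a threshold:

    from difflib import SequenceMatcher
    from itertools import combinations

    # Invented records; the threshold value is an assumption.
    records = [
        "John Smith, 12 Main Street, London",
        "Jon Smith, 12 Main St., London",
        "Maria Rossi, Via Roma 3, Milano",
    ]

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def candidate_duplicates(rows, threshold=0.8):
        """Return all pairs of records whose similarity exceeds the threshold."""
        return [(a, b) for a, b in combinations(rows, 2)
                if similarity(a, b) >= threshold]

    for a, b in candidate_duplicates(records):
        print("possible duplicate:", a, "<->", b)

Practical pipelines add value normalization, attribute-wise similarity measures, and blocking so that not every pair has to be compared.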

Data Integration

Alvaro A A Fernandes, University of Manchester, UK 

Domenico Lembo, Università di Roma La Sapienza, Italy 

Data integration is the problem of providing users with a unified, reconciled and value-added view of data stored in diverse and possibly heterogeneous data sources. The theoretical view of this lecture presents an overview of the research work carried out on data integration over the years. Particular attention will be paid to conceptual architectures for information integration and to the main query processing and data exchange techniques proposed in the literature. We will discuss the impact of integrity constraints in the global unified view on query processing, and survey ontology-based data access and integration, a recent form of mediator-based data integration in which the global view is given in terms of an ontology.
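
To fix ideas, here is a toy example of ours (not taken from the lecture) of a global-as-view mediator, where each global relation is defined as a view over the sources and queries are answered by unfolding these definitions:

    \[
    \mathit{Movie}(t, y, d) \leftarrow S_1(t, y, d), \qquad
    \mathit{Movie}(t, y, d) \leftarrow S_2(t, d) \wedge S_3(t, y).
    \]

A query over Movie asking, say, for the titles of movies from 2015 is then answered by unfolding the definitions, i.e., by taking the union of the answers obtained from S1 and from the join of S2 and S3, and evaluating that directly over the sources; integrity constraints on the global schema complicate this simple unfolding, which is one of the issues discussed in the lecture.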

The systems view of data integration will be covered in three parts:

(i) a systems-oriented survey of the literature on data integration focusing on matching and mapping generation (with glances at dependencies and impacts on extraction and preparation, as well as evaluation, de-duplication and fusion), highlighting approaches, techniques and algorithms; (ii) a more detailed exploration of a value-adding strategy in which user and data context inform the decisions made within a data integration process when it is seen as one stage in flexible and dynamic end-to-end data wrangling orchestrations; and (iii) a practical, problem-solving session with the data integration components from the Oxford-Edinburgh-Manchester VADA prototype system.
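
As a tiny, deliberately naive illustration of the matching step (our own sketch, unrelated to the VADA components used in the practical session), candidate correspondences between two schemas can be proposed by comparing attribute names; production matchers combine many more signals, such as instance values, data types and constraints.

    from difflib import SequenceMatcher

    # Invented schemas for two sources to be integrated; threshold is assumed.
    source_schema = ["prod_name", "unit_price", "qty_in_stock"]
    target_schema = ["name", "price", "stock", "supplier"]

    def name_similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def propose_matches(src, tgt, threshold=0.5):
        """Suggest (source, target, score) correspondences above a threshold."""
        matches = []
        for s in src:
            best = max(tgt, key=lambda t: name_similarity(s, t))
            score = name_similarity(s, best)
            if score >= threshold:
                matches.append((s, best, round(score, 2)))
        return matches

    print(propose_matches(source_schema, target_schema))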

Query Answering

Pablo Barcelo, Universidad de Chile

Paolo Guagliardo, University of Edinburgh, UK 

The course will be divided into two parts. The first one, which is more theoretical in spirit, will introduce some of the most common classes of database queries, including first-order logic (or, equivalently, relational algebra) and unions of conjunctive queries, which correspond to the positive fragment of the latter. For each of these languages, we will analyze the computational cost of its associated evaluation problem.
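
For instance (a textbook-style example of ours, not necessarily one used in the course), a conjunctive query asking for persons together with the countries of the cities they live in can be written as a rule or, equivalently, as a projection of a natural join in relational algebra, assuming the two relations share the city attribute:

    \[
    q(x, z) \leftarrow \mathit{Lives}(x, y) \wedge \mathit{Located}(y, z)
    \qquad \equiv \qquad
    \pi_{x, z}\bigl(\mathit{Lives} \bowtie \mathit{Located}\bigr).
    \]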

The main goal of the course is to develop a principled understanding of when query evaluation is computationally hard, and of which kinds of real-world restrictions alleviate the cost of this task.

We will study different notions of complexity, in particular combined, data, and parameterized complexity, which measure in different ways the influence of the size of the data and of the query on the evaluation problem. We will also present a theory of structural decompositions of joins that serves as the basis for the efficient evaluation of the basic class of unions of conjunctive queries.

Answering queries on databases with nulls is an important task in many data management applications. In the second part of the course we will first introduce some basic concepts from the theory of incomplete information: models of incompleteness (open-world and closed-world semantics) and representation systems (Codd tables and conditional tables). We will then focus on the fundamental notion of "certain answers", that is, answers that are true in every possible world represented by an incomplete database. For positive queries these can be found efficiently by naive evaluation, but for queries involving negation the problem becomes intractable (in data complexity). For such queries, we will consider novel approximation schemes with good complexity bounds, and we will discuss how and to what extent these can be applied in practice to SQL queries in real DBMSs.
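
The following small sketch of ours (invented relations, not course code) illustrates naive evaluation over a Codd table, where every null is treated as a value distinct from all others: the positive query is evaluated as usual, answers containing nulls are discarded, and what survives is a certain answer.

    # Person(name, city) and Located(city, country); None plays the role of a null.
    person = [("alice", "paris"), ("bob", None)]
    located = [("paris", "france"), ("rome", "italy")]

    # Query q(name, country) :- Person(name, city), Located(city, country)
    def naive_eval(person, located):
        answers = []
        for name, city in person:
            for city2, country in located:
                # In a Codd table a null is never provably equal to anything,
                # so the join only succeeds on non-null, equal city values.
                if city is not None and city == city2:
                    answers.append((name, country))
        # Keep only null-free answers: they hold in every possible world.
        return [t for t in answers if None not in t]

    print(naive_eval(person, located))  # [('alice', 'france')] -- a certain answer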

Reasoning about Data

Andreas Pieris, University of Oxford, UK 

 

Emanuel Sallinger, University of Oxford, UK 

The need for ontological reasoning has been widely acknowledged, both in the knowledge representation and reasoning community and in the database community. Indeed, ontologies may be used to infer knowledge that is not explicitly stored, which helps overcome incompleteness of the source data. Moreover, they allow users to abstract from the specific way the data is stored and to pose queries at a more conceptual level. Ontological reasoning also serves as the common foundation for many tasks required in data wrangling systems.

In this tutorial, we give an overview of ontological reasoning where ontologies are modeled as logical rules. Such formalisms are known under many different names, among them tuple-generating dependencies, existential rules and Datalog+/- rules. We first introduce the formal setting and the challenges that processing big data poses to reasoning. We then focus on the logical foundations, giving an overview of the different ways such rules are represented in various fields, and introduce the main reasoning tasks that need to be solved for them. After introducing the primary tools needed to understand and handle such rules, we focus on concrete rule-based ontology languages that strike a balance between expressivity on the one hand and computational complexity on the other. We conclude by demonstrating how theoretical results on computational complexity transfer to practical reasoning systems, and we look at simple but concrete examples from data wrangling to give a feeling for how theory and practice meet in this area.
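
For instance (a toy rule of our own, written as a tuple-generating dependency / existential rule), an ontology might state that every employee works in some department:

    \[
    \forall x \, \bigl( \mathit{Employee}(x) \rightarrow \exists y \, ( \mathit{WorksIn}(x, y) \wedge \mathit{Dept}(y) ) \bigr).
    \]

Given only the fact Employee(alice), the rule entails that alice works in some department even though no such tuple is stored; query answering must take such inferred, possibly unnamed objects into account, which is at the heart of the expressivity versus complexity trade-off discussed in the tutorial.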

Data, Responsibly - Fairness, Neutrality and Transparency in Data Analysis 

Serge Abiteboul, Inria, ENS Paris, France

Julia Stoyanovich, Drexel University, Philadelphia, USA 

Big Data technology holds incredible promise of improving people’s lives, accelerating scientific discovery and innovation, and bringing about positive societal change. Yet, if not used responsibly, this technology can propel economic inequality, destabilize global markets and affirm systemic bias. In this course we will focus on the importance of using Big Data technology responsibly, in a manner that adheres to the legal requirements and ethical norms of our society.

The primary goal of this course is to draw the attention of students in data management to the important emerging subject of responsible data management and analysis. We will define key notions such as fairness, diversity, stability, accountability, transparency, and neutrality. We will give examples of concrete situations, many of which were covered in the recent popular press, where reasoning about and enforcing these properties is important. We will then discuss potential algorithmic approaches for quantifying and enforcing responsible practices, using real datasets and application scenarios from criminal sentencing, credit scoring, and homelessness services.
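
As one concrete example of quantification (a widely used measure, though not necessarily the one emphasized in the course), the disparate-impact ratio compares the rate of favourable outcomes across two groups; values well below 1 indicate that one group is being favoured. The groups and outcomes below are invented.

    # Invented outcomes: 1 = favourable decision (e.g. loan granted), keyed by group.
    outcomes = {
        "group_a": [1, 1, 0, 1, 1, 0, 1, 1],
        "group_b": [1, 0, 0, 1, 0, 0, 0, 1],
    }

    def positive_rate(decisions):
        return sum(decisions) / len(decisions)

    def disparate_impact(outcomes, protected, reference):
        """Ratio of positive-outcome rates: protected group vs reference group."""
        return positive_rate(outcomes[protected]) / positive_rate(outcomes[reference])

    ratio = disparate_impact(outcomes, "group_b", "group_a")
    print(round(ratio, 2))  # 0.5; the commonly cited 80% rule would flag this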

The course will be structured in three parts. The first part discusses the importance of stating assumptions and interpreting data analysis results in context. The second part gives a high-level overview of responsible data sharing, acquisition, management, and analysis practices. The third part focuses on fairness.

Intelligent Web Data Extraction

Tim Furche, University of Oxford, UK

Paolo Merialdo, Università di Roma 3, Italy 

We are told that this is the age of “big data” and that accurate and comprehensive data sets are the new gold of this age. In this course, we show you that not all that glitters is gold, but we also provide you with the skills and pointers needed to refine gold out of the vast lakes of documents and data out there.

We focus specifically on web data extraction, though that is hardly a limitation any more these days. This course will provide you with a tour de force through the major aspects of data and information extraction in the last decade. We start at the micro level of individual sites that you may want to scrape with high accuracy, and how to approach such a task. We then turn our gaze to more complex settings where data is spread over multiple sources but is still very much of the same kind and shape. Finally, we end up at the macro level of the entire “web”, where the focus shifts to coarser, less specific extraction systems such as Google’s Knowledge Vault. Where we have to choose, we focus on knowledge-driven approaches from the larger AI and databases community, but we provide pointers to alternatives.
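
To give a feel for the micro level, here is a toy, hand-written wrapper of our own for a single site (the page structure is invented, and real pages rarely parse as clean XML); the course discusses how to approach this task with far greater accuracy and robustness than such hand-coded rules allow.

    import xml.etree.ElementTree as ET

    # A hand-written extraction rule for one (invented) product-listing page.
    page = """
    <ul class="products">
      <li><span class="name">Widget</span><span class="price">9.99</span></li>
      <li><span class="name">Gadget</span><span class="price">24.50</span></li>
    </ul>
    """

    root = ET.fromstring(page)
    records = []
    for item in root.findall("li"):
        name = item.find("span[@class='name']").text
        price = float(item.find("span[@class='price']").text)
        records.append({"name": name, "price": price})

    print(records)  # [{'name': 'Widget', 'price': 9.99}, {'name': 'Gadget', 'price': 24.5}]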

A particular theme of this course is the interchange between academia and industry in this field, told along our personal journeys between the two.

At the end of the course and tutorial, attendees will not only have a good grasp of the underlying concepts but will also be ready to apply them in practice for data collection activities in whatever domain they are excited about.