IEEE ICDE 2020 Tutorials


Ibrahim Sabek (University of Minnesota, Twin Cities); Mohamed Mokbel (University of Minnesota, Twin Cities)*

The proliferation of generated data has propelled the rise of scalable machine learning solutions that efficiently analyze and extract useful insights from such data. Meanwhile, spatial data, e.g., GPS data, has become ubiquitous and has grown sharply in volume in recent years. The applications of big spatial data span a wide spectrum of interests, including tracking infectious diseases, climate change simulation, and drug addiction, among others. Consequently, major research efforts have been devoted to supporting efficient analysis and intelligence inside these applications, either by providing spatial extensions to existing machine learning solutions or by building new solutions from scratch. In this 90-minute tutorial, we comprehensively review the state-of-the-art work at the intersection of machine learning and big spatial data. We cover existing research efforts and challenges in three major areas of machine learning, namely, data analysis, deep learning, and statistical inference. We also discuss existing end-to-end systems, and highlight open problems and challenges for future research in this area.
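As a concrete illustration of the kind of spatial aggregation that many of the analysis pipelines surveyed here start from, the following minimal sketch bins GPS points into coarse grid cells and counts the points per cell. The function names and the sample coordinates are hypothetical, chosen only for illustration:

```python
from collections import Counter

def grid_cell(lat, lon, cell_deg=1.0):
    """Map a GPS coordinate to a coarse grid cell (cell_deg degrees per side)."""
    return (int(lat // cell_deg), int(lon // cell_deg))

def density_per_cell(points, cell_deg=1.0):
    """Count points falling into each grid cell -- a basic spatial
    aggregation step underlying many spatial ML workflows."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)

# Two points near Minneapolis/St. Paul land in the same 1-degree cell.
points = [(44.97, -93.26), (44.95, -93.09), (40.71, -74.00)]
print(density_per_cell(points))
```

Real systems replace the uniform grid with spatial indexes or learned partitionings, but the aggregate-per-region pattern is the same.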

Maria K Krommyda (National Technical University of Athens)*; Verena Kantere (National Technical University of Athens)

The wide adoption of the RDF data model, as well as the Linked Open Data initiative, has made available large linked datasets that have the potential to offer invaluable knowledge. Accessing, evaluating, and understanding these datasets as published, though, requires extensive training and experience in the field of the Semantic Web, making these valuable sources of information inaccessible to a wider audience. In recent years, there have been many efforts to create systems that allow the visualization and exploration of this information. Some of these systems rely on techniques that limit the volume of the displayed information by providing aggregated, filtered, or summarized access to the datasets, while others initialize the exploration of the dataset based on actions performed by the users, such as keyword searches and queries. The underlying technique determines the sustainability of the system, the requirements that the input must comply with, the datasets that can be visualized, and the visualization types provided. We present here a survey of these techniques, their strengths and weaknesses, as well as the datasets that they can support. The survey will provide the reader with a deep understanding of the challenges regarding the visualization of large linked datasets, a categorization of the techniques developed to resolve them, and an overview of the available systems and their functionalities.
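To make the keyword-driven entry point concrete, the sketch below filters a tiny in-memory set of RDF-style triples by a keyword, the kind of action an exploration system might use to seed a visualization. The triples and prefixes here are hypothetical stand-ins; a real system would query an actual RDF store:

```python
# Hypothetical triples using DBpedia-style prefixes; a real system
# would load these from an RDF dataset or SPARQL endpoint.
triples = [
    ("dbr:Athens", "rdf:type", "dbo:City"),
    ("dbr:Athens", "dbo:country", "dbr:Greece"),
    ("dbr:NTUA", "dbo:city", "dbr:Athens"),
]

def keyword_search(triples, keyword):
    """Return triples whose subject, predicate, or object mentions the
    keyword -- a simple entry point for user-driven dataset exploration."""
    kw = keyword.lower()
    return [t for t in triples if any(kw in part.lower() for part in t)]

print(keyword_search(triples, "athens"))
```

The matched triples would then be rendered (e.g., as a node-link subgraph) and expanded incrementally as the user navigates.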

Sebastian Villarroya (Jacobs University Bremen)*; Peter Baumann (Jacobs University Bremen)

Machine Learning is increasingly being applied to many different application domains. From cancer detection to weather forecasting, a large number of applications leverage machine learning algorithms to get faster and more accurate results over huge datasets. Although many of these datasets are composed mainly of array data, the vast majority of machine learning applications do not use array databases. This tutorial focuses on the integration of machine learning algorithms and array databases. By implementing machine learning algorithms in array databases, users can combine the native, efficient array data processing of these systems with machine learning methods to perform fast and accurate array data analytics.
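As a minimal sketch of the kind of ML primitive that could be pushed down next to array data, the following computes an ordinary least-squares linear fit over two 1-D arrays in pure Python. Inside an array database this would run as a built-in aggregate over stored arrays; the function name and sample data are illustrative only:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a*x + b over two 1-D arrays --
    the kind of ML primitive an array database could evaluate natively."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

a, b = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # slope 2.0, intercept 1.0
```

Evaluating such aggregates inside the database avoids exporting large arrays to an external ML tool, which is the efficiency argument the tutorial develops.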

Cuneyt G Akcora (University of Manitoba)*; Yulia Gel (The University of Texas at Dallas); Murat Kantarcioglu (The University of Texas at Dallas)

Over the last couple of years, the Bitcoin cryptocurrency and the Blockchain technology that forms the basis of Bitcoin have attracted unprecedented attention. Designed to facilitate a secure distributed platform without central regulation, Blockchain is heralded as a novel paradigm that will be as powerful as Big Data, Cloud Computing, and Machine Learning. The Blockchain technology garners an ever-increasing interest from researchers in various domains that benefit from scalable cooperation among trustless parties. As Blockchain applications proliferate, so do the complexity and volume of data stored by Blockchains. Analyzing this data has emerged as an important research topic, already leading to methodological advancements in the information sciences. In this tutorial, we offer a holistic view of applied Data Science on Blockchains. Starting with the core components of Blockchain, we will detail the state of the art in Blockchain data analytics for the graph, security, and finance domains. Our examples will answer questions such as: How do we parse, extract, and clean the data stored in Blockchains? How do we store and query Blockchain data? And what features can be computed from Blockchains?
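As a toy example of the feature-extraction question, the sketch below computes per-address in/out degree from a simplified list of (sender, receiver, amount) transactions, one of the simplest graph features derivable from Blockchain data. The transaction tuples and address names are hypothetical; real data would be parsed from raw blocks:

```python
from collections import defaultdict

# Hypothetical simplified transactions (sender, receiver, amount);
# real Blockchain data would be parsed out of raw block files.
txs = [("a1", "a2", 0.5), ("a2", "a3", 0.2), ("a1", "a3", 1.0)]

def degree_features(txs):
    """Per-address in-degree and out-degree -- basic graph features
    computable from Blockchain transaction data."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for sender, receiver, _amount in txs:
        out_deg[sender] += 1
        in_deg[receiver] += 1
    return dict(out_deg), dict(in_deg)

print(degree_features(txs))
```

Richer features (transaction motifs, temporal patterns, topological summaries) build on exactly this kind of pass over the extracted transaction graph.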

Mohammad Javad Amiri (University of California, Santa Barbara)*; Divy Agrawal (University of California, Santa Barbara); Amr El Abbadi (University of California, Santa Barbara)

Large-scale data management systems utilize consensus protocols to provide fault tolerance. Consensus protocols are extensively used in the distributed database infrastructure of large enterprises such as Google, Amazon, and Facebook, as well as in permissioned blockchain systems like IBM's Hyperledger Fabric. In the last four decades, numerous consensus protocols have been proposed to cover a broad spectrum of distributed systems. On one hand, distributed networks might be synchronous, partially synchronous, or asynchronous; on the other hand, a distributed system might include crash-only nodes, Byzantine nodes, or both. In addition, a consensus protocol might follow a pessimistic or optimistic strategy to process transactions. Furthermore, while traditional consensus protocols assume an a priori known set of nodes, in permissionless blockchains, nodes are assumed to be unknown. Finally, consensus protocols have explored a variety of performance trade-offs between the number of phases/messages (latency), the number of required processors, message complexity, and the activity level of participants (replicas and clients). In this tutorial, we discuss existing consensus protocols, classify them into different categories based on their assumptions on network synchrony, failure model of nodes, etc., and elaborate on their main advantages and limitations.
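To give intuition for the crash-fault setting the classification starts from, here is a toy sketch of majority-quorum voting: a value is decided once more than n/2 replicas report it, so a minority of crashed (silent) replicas cannot block agreement. This illustrates only the quorum-intersection idea, not any particular protocol such as Paxos or Raft:

```python
def majority_decision(votes, n):
    """Crash-fault majority voting over n replicas: decide a value once
    more than n//2 replicas report it. None models a crashed (silent)
    replica. A toy sketch of the quorum idea behind consensus protocols."""
    counts = {}
    for v in votes:
        if v is None:          # crashed replica contributes nothing
            continue
        counts[v] = counts.get(v, 0) + 1
        if counts[v] > n // 2:
            return v
    return None                # no quorum reached

# 5 replicas, one crashed: value "x" still reaches a majority (3 > 5 // 2).
print(majority_decision(["x", "x", None, "x", "y"], 5))  # "x"
```

Byzantine settings need larger quorums (e.g., more than 2n/3 of 3f+1 replicas) because faulty nodes can vote inconsistently rather than merely stay silent, which is one axis of the classification this tutorial develops.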

Shantanu Sharma (University of California, Irvine)*; Anton Burtsev (University of California, Irvine); Sharad Mehrotra (University of California, Irvine)

Despite extensive research, secure outsourcing remains an open challenge. This tutorial focuses on recent advances in secure cloud-based data outsourcing based on cryptographic approaches (encryption, secret-sharing, and multi-party computation (MPC)) and hardware-based approaches. We highlight the strengths and weaknesses of state-of-the-art techniques, and conclude that no single approach is likely to emerge as a silver bullet. Thus, the key is to combine different hardware and software techniques through partitioned computing, wherein a computation is carefully split across different cryptographic techniques so as not to compromise security. We highlight some recent work in that direction.
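As a minimal sketch of one of the cryptographic building blocks mentioned above, the following implements additive secret sharing modulo a public prime: each of n servers holds one random-looking share, and the servers can add two secrets share-wise without any server ever seeing either secret. The modulus and the specific values are illustrative choices, not a production scheme:

```python
import random

P = 2**31 - 1  # public prime modulus (illustrative choice)

def share(secret, n=3):
    """Split a secret into n additive shares modulo P; any n-1 shares
    reveal nothing about the secret."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Recombine all shares to recover the secret."""
    return sum(shares) % P

# Servers can add two secrets share-wise without ever seeing them:
a, b = share(20), share(22)
summed = [(x + y) % P for x, y in zip(a, b)]
print(reconstruct(summed))  # 42
```

In a partitioned-computing design, such an MPC-friendly representation would handle only the sensitive portion of a query, while cheaper techniques (or trusted hardware) handle the rest.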