
Industry Track Schedule

Industry 1: High Performance and Scalable Data Platforms (Tue 21st April 10:00-11:30)
Title Authors
I1.1 507 "PSGraph: How Tencent trains large-scale graphs with Spark?" Jiawei Jiang (ETH Zurich); Pin Xiao (Tencent); Lele Yu (Tencent Inc.); Xiaosen Li (Tencent Inc.); Jiefeng Cheng (Tencent Inc.); Xupeng Miao (Peking University); Zhipeng Zhang (Peking University); Bin Cui (Peking University)
Spark has been extensively used in many applications at Tencent, due to its easy deployment, pipeline capability, and close integration with the Hadoop ecosystem. As the graph computing engine of Spark, GraphX is also widely deployed to process large-scale graph data at Tencent. However, when the size of the graph data reaches billion-scale, GraphX suffers serious performance degradation. Worse, GraphX cannot support the rising advancement of graph embedding (GE) and graph neural network (GNN) algorithms. To address these challenges, we develop a new graph processing system, called PSGraph, which uses Spark executors and PyTorch to perform computation, and develops a distributed parameter server to store frequently accessed models. PSGraph can train extremely large-scale graph data at Tencent with the parameter server architecture, and enables the training of GE and GNN algorithms. Moreover, PSGraph still benefits from the advantages of Spark by staying inside the Spark ecosystem, and can directly replace GraphX without modification to the existing application framework. Our experiments show that PSGraph outperforms GraphX significantly.
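The parameter-server pattern the abstract describes, where workers pull frequently accessed model state, compute updates, and push gradients back, can be sketched as follows. This is a minimal single-process illustration, not PSGraph's implementation; the class and function names are invented.

```python
import numpy as np

class ParameterServer:
    """Stores embedding vectors keyed by node id; workers pull and push updates."""
    def __init__(self, dim, lr=0.1):
        self.dim = dim
        self.lr = lr
        self.table = {}  # node id -> embedding vector

    def pull(self, keys):
        # Lazily initialize embeddings for unseen nodes.
        return {k: self.table.setdefault(k, np.zeros(self.dim)) for k in keys}

    def push(self, grads):
        # Apply SGD updates sent back by a worker.
        for k, g in grads.items():
            self.table[k] = self.table[k] - self.lr * g

def worker_step(ps, edge_batch):
    """Toy worker: pull the embeddings touched by a batch of edges, compute a
    gradient that pulls connected nodes together, and push the updates back."""
    keys = {u for edge in edge_batch for u in edge}
    params = ps.pull(keys)
    grads = {k: np.zeros(ps.dim) for k in keys}
    for u, v in edge_batch:
        diff = params[u] - params[v]
        grads[u] += diff
        grads[v] -= diff
    ps.push(grads)
```

In a real deployment the table would be sharded across server processes and many workers would pull/push concurrently; the point here is only the pull/compute/push division of labor.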
I1.2 634 "JUST: JD Urban Spatio-Temporal Data Engine" Ruiyuan Li (Xidian University; JD Finance); Huajun He (Southwest Jiaotong University); Rubin Wang (Southwest Jiaotong University); Yuchuan Huang (JD Intelligent Cities Research); Junwen Liu (JD Intelligent Cities Research); Sijie Ruan (Xidian University; JD Finance); Tianfu He (JD Intelligent Cities Research); Jie Bao (JD Finance); Yu Zheng (JD Finance)
With the prevalence of positioning techniques, a prodigious amount of spatio-temporal data is generated. To effectively support various urban applications based on spatio-temporal data, e.g., location-based services and traffic prediction, an efficient, scalable, and easy-to-use spatio-temporal data management system is desirable. This paper presents JUST, i.e., the JD Urban Spatio-Temporal data engine, which can efficiently manage big spatio-temporal data in a convenient way. JUST incorporates the distributed NoSQL data store Apache HBase as the underlying storage, GeoMesa as the spatio-temporal data indexing tool, and Apache Spark as the execution engine. We design two novel indexing techniques, i.e., Z2T indexing and XZ2T indexing, which accelerate spatio-temporal queries tremendously. Furthermore, we introduce a compression mechanism, which not only greatly reduces the storage cost, but also improves the query efficiency immensely. To make JUST easy to use, we design and implement a complete SQL engine, with which all operations can be performed through a SQL-like language, i.e., JustQL. JUST is deployed as a PaaS in JD with multi-user support. Many applications have been developed based on the SDKs provided by JUST. Extensive experiments are carried out using two real datasets and one synthetic dataset, demonstrating the efficiency and scalability of JUST. The results show that JUST outperforms eight state-of-the-art spatio-temporal data management systems.
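The Z2T and XZ2T schemes are detailed in the paper itself; the space-filling-curve keys that such spatio-temporal indexes build on can be illustrated with a plain Z-order (Morton) encoding. The bit widths and key layout below are illustrative assumptions, not JUST's actual format.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a Z-order (Morton) code,
    so that points close in 2-D space tend to be close in the 1-D key space."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def st_key(lon, lat, t_bucket, bits=16):
    """Illustrative spatio-temporal key: a coarse time bucket as the prefix
    (so time-range pruning happens first), then the spatial Z-order code."""
    # Scale lon/lat from their geographic ranges onto an unsigned integer grid.
    xi = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    yi = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    return (t_bucket << (2 * bits)) | interleave_bits(xi, yi, bits)
```

Keys like these let a key-ordered store such as HBase answer spatio-temporal range queries with a small number of key-range scans instead of a full scan.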
I1.3 911 "Oracle Database In-Memory on Active Data Guard: Real-time Analytics on a Standby Database" Sukhada S Pendse (Oracle America, Inc.); Vasudha Krishnaswamy (Oracle, USA); Kartik Kulkarni (Oracle America, Inc.); Yunrui Li (Oracle America, Inc.); Tirthankar Lahiri (Oracle America); Vivekanandhan Raja (Oracle America, Inc); Jing Zheng (Oracle); Mahesh Girkar (Oracle America, Inc.); Akshay Kulkarni (Oracle America, Inc.)
Oracle Database In-Memory (DBIM) provides orders of magnitude speedup for analytic queries with its highly compressed, transactionally consistent, memory-optimized Column Store. Customers can use Oracle DBIM for making real-time decisions by analyzing vast amounts of data at blazingly fast speeds. Active Data Guard (ADG) is Oracle’s comprehensive solution for high-availability and disaster recovery for the Oracle Database. Oracle ADG eliminates the high cost of idle redundancy by allowing reporting applications, ad-hoc queries and data extracts to be offloaded to the synchronized, physical Standby database replicated using Oracle ADG. In Oracle 12.2, we extended the DBIM advantage to Oracle ADG architecture. DBIM-on-ADG significantly boosts the performance of analytic, read-only workloads running on the physical Standby database, while the Primary database continues to process high-speed OLTP workloads. Customers can partition their data across the In-Memory Column Stores on the Primary and Standby databases based on access patterns, and reap the benefits of fault-tolerance as well as workload isolation without compromising on critical performance SLAs. In this paper, we explore and address the key challenges involved in building the DBIM-on-ADG infrastructure, including synchronized maintenance of the In-Memory Column Store on the Standby database, with high-speed OLTP activity continuously modifying data on the Primary database.
I1.4 914 "Data Sentinel: A Declarative Production-Scale Data Validation Platform " Arun Swami (LinkedIn, Inc.); Sriram Vasudevan (LinkedIn, Inc.); Joojay Huyn (LinkedIn, Inc.)
While many organizations process big data for important business operations and decisions, data quality problems continue to be widespread, costing US businesses an estimated $600 billion annually. To date, addressing data quality in production environments still poses many challenges: easily defining properties of high-quality data; debugging poor quality data; making data quality solutions easy to use, understand, and run; and validating production-scale data in a timely manner. Current data validation systems do not comprehensively address these challenges. To address data quality in production environments at LinkedIn, we developed Data Sentinel, a declarative production-scale data validation platform. To make Data Sentinel easy to use, understand, and run in production environments, we provide Data Sentinel Service (DSS), a complementary system to help specify data checks, schedule and deploy data validation jobs, and tune performance. The contributions of this paper include the following: 1) Data Sentinel, a declarative production-scale data validation platform that has been successfully deployed at LinkedIn; 2) a generic blueprint to build and deploy similar systems for production environments; and 3) experiences and lessons learned that can benefit practitioners with similar objectives.
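The paper's actual check language isn't reproduced here, but the core idea of declarative validation, where checks are expressed as data rather than code so they are easy to specify, inspect, and run, can be sketched as follows. The check vocabulary (`not_null`, `in_range`) is invented for illustration.

```python
# Declarative checks: each is a plain description naming the column,
# the constraint, and its parameters (this vocabulary is hypothetical).
CHECKS = [
    {"column": "user_id", "constraint": "not_null"},
    {"column": "age", "constraint": "in_range", "min": 0, "max": 120},
]

def validate(rows, checks):
    """Run every declared check over the dataset; return one failure record
    per violated check so results are easy to report and debug."""
    failures = []
    for check in checks:
        col = check["column"]
        for i, row in enumerate(rows):
            value = row.get(col)
            if check["constraint"] == "not_null" and value is None:
                failures.append((i, col, "not_null"))
            elif check["constraint"] == "in_range" and value is not None:
                if not (check["min"] <= value <= check["max"]):
                    failures.append((i, col, "in_range"))
    return failures
```

Because checks are data, a service like DSS can store, schedule, and rerun them without users writing or deploying validation code.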
I1.5 893 "Turbine: Facebook's Service Management Platform for Stream Processing" Yuan Mei (Facebook); Luwei Cheng (Facebook); Vanish Talwar (Facebook); Michael Levin (Facebook); Gabriela Jacques-Silva (Facebook); Nikhil Simha (Facebook); Anirban Banerjee (Facebook); Brian Smith (Facebook); Tim Williamson (Facebook); Serhat Yilmaz (Facebook); Weitao Chen (Facebook); Guoqiang Jerry Chen (Facebook)
The demand for stream processing at Facebook has grown as services increasingly rely on real-time signals to speed up decisions and actions. Emerging real-time applications require strict Service Level Objectives (SLOs) with low downtime and processing lag, even in the presence of failures and load variability. Addressing this challenge at Facebook scale led to the development of Turbine, a management platform designed to bridge the gap between the capabilities of existing general-purpose cluster management frameworks and Facebook's stream processing requirements. Specifically, Turbine features a fast and scalable task scheduler; an efficient predictive auto scaler; and an application update mechanism that provides fault-tolerance, atomicity, consistency, isolation and durability. Turbine has been in production for over two years, and is currently deployed on clusters spanning tens of thousands of machines. It manages several thousand streaming pipelines processing hundreds of gigabytes of data per second in real time. Our production experience has validated Turbine's effectiveness: its task scheduler evenly balances workload fluctuation across clusters; its auto scaler effectively and predictively handles unplanned load spikes; and the application update mechanism consistently and efficiently completes high-scale updates within minutes. This paper describes the Turbine architecture, discusses the design choices behind it, and shares several case studies demonstrating Turbine's capabilities in production.
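Turbine's predictive auto scaler is described only at a high level above; as a toy illustration of the general idea, the sketch below fits a linear trend to recent input rates, extrapolates one step ahead, and provisions enough tasks (plus headroom) for the predicted load. The policy, function name, and parameters are all assumptions, not Turbine's algorithm.

```python
import math

def plan_tasks(recent_rates, capacity_per_task, headroom=1.2):
    """Predictive scaling decision: least-squares linear trend over the recent
    rate samples, extrapolated one step ahead, then sized with headroom."""
    n = len(recent_rates)
    if n < 2:
        predicted = recent_rates[-1]
    else:
        xs = range(n)
        mean_x = sum(xs) / n
        mean_y = sum(recent_rates) / n
        slope = sum((x - mean_x) * (y - mean_y)
                    for x, y in zip(xs, recent_rates)) \
                / sum((x - mean_x) ** 2 for x in xs)
        predicted = mean_y + slope * (n - mean_x)  # extrapolate to step n
    return max(1, math.ceil(predicted * headroom / capacity_per_task))
```

A production scaler would combine many more signals (backlog, CPU, memory, daily seasonality), but the shape of the decision, predict then provision ahead of the spike, is the same.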
Industry 2: Information Discovery and Management (Tue 21st April 13:30-15:00)
Title Authors
I2.1 408 "Speed Kit: A Polyglot & GDPR-Compliant Approach For Caching Personalized Content " Wolfram WW Wingerath (Baqend); Felix Gessert (Universität Hamburg); Norbert Ritter (Universität Hamburg); Benjamin Wollmer (Baqend)
Users leave when page loads take too long. This simple fact has complex implications for virtually all modern businesses, because accelerating content delivery through caching is not as simple as it used to be. As a fundamental technical challenge, the high degree of personalization in today's Web has seemingly outgrown the capabilities of traditional content delivery networks (CDNs) which have been designed for distributing static assets under fixed caching times. As an additional legal challenge for services with personalized content, an increasing number of regional data protection laws constrain the ways in which CDNs can be used in the first place. In this paper, we present Speed Kit as a radically different approach for content distribution that combines (1) a polyglot architecture for efficiently caching personalized content with (2) a natively GDPR-compliant client proxy that handles all sensitive information within the user device. We describe the system design and implementation, explain the custom cache coherence protocol to avoid data staleness and achieve delta-atomicity, and we share field experiences from over a year of productive use in the e-commerce industry.
I2.2 410 "De-Health: All Your Online Health Information Are Belong to Us" Shouling Ji (Zhejiang University); Qinchen Gu (Georgia Tech); Haiqin Weng (Ant Financial); Qianjun Liu (Zhejiang University); Pan Zhou (HUST); Jing Chen (Wuhan University); Zhao Li (Alibaba Group); Raheem Beyah (Georgia Institute of Technology); Ting Wang (Lehigh University)
In this paper, we study the privacy of online health data. We present a novel online health data De-Anonymization (DA) framework, named De-Health. Leveraging two real-world online health datasets, WebMD (89,393 users, 506K posts) and HealthBoards, we validate the DA efficacy of De-Health. We also present a linkage attack framework which can link online health/medical information to real-world people. Through a proof-of-concept attack, we link 347 out of 2,805 WebMD users to real-world people, and find the full names, medical/health information, birthdates, phone numbers, and other sensitive information for most of the re-identified users. This clearly illustrates the fragility of the privacy of those who use online health forums.
I2.3 896 "Maxson: Reduce Duplicate Parsing Overhead on Raw Data" Hong Huang (Huazhong University of Science and Technology); Xuanhua Shi (Huazhong University of Science and Technology); Yipeng Zhang (Huazhong University of Science and Technology); Zhenyu Hu (Huazhong University of Science and Technology); Hai Jin (Huazhong University of Science and Technology); Huan Shen (Huazhong University of Science and Technology); Yongluan Zhou (University of Copenhagen); Bingsheng He (National University of Singapore); Ruibo Li (Alibaba); Keyong Zhou (Alibaba)
I2.4 908 "Automatic Calibration of Road Intersection Topology using Trajectories" Lisheng Zhao (East China Normal University); Jiali Mao (East China Normal University); Min Pu (East China Normal University); Cheqing Jin (East China Normal University); Weining Qian (East China Normal University); Aoying Zhou (East China Normal University); Guoping Liu (Didi Chuxing); Xiang Wen (Didi Chuxing); Runbo Hu (Didi Chuxing); Hua Chai (Didi Chuxing)
Inaccurate road intersections in digital road maps can seriously affect mobile navigation and other applications. Massive traveling trajectories from thousands of vehicles enable frequent updating of road intersection topology. In this paper, we first generalize the road intersection detection issue into a topology calibration problem for the road intersection influence zone. Distinct from existing road intersection update methods, we not only determine the location and coverage of a road intersection, but also identify incorrect or missing turning paths within the whole influence zone, based on the trajectories that fail to match the existing map. The main challenges of the calibration problem are that trajectories are mixed with anomalous data, and that road intersections come in different sizes and shapes. To address these challenges, we propose a three-phase calibration framework, called CITT, composed of trajectory quality improvement, core zone detection, and topology calibration within the road intersection influence zone; together, these components automatically produce a high-quality topology of the road intersection influence zone. Extensive experiments against state-of-the-art methods, using trajectory data obtained from Didi Chuxing and Chicago campus shuttles, demonstrate that CITT is stable and robust and significantly outperforms existing methods.
I2.5 894 "SAFE: Scalable Automatic Feature Engineering Framework for Industrial Tasks " Qitao Shi (Ant Financial Services Group); Ya-Lin Zhang (Ant Financial Services Group ); Longfei Li (Ant Financial); Xinxing Yang (Ant Financial Services Group); Meng Li (Ant Financial Services Group); Jun Zhou (Ant Financial)
Machine learning techniques have been widely applied in Internet companies for various tasks, acting as an essential driving force, and feature engineering has been generally recognized as a crucial step when constructing machine learning systems. Recently, a growing effort has been devoted to the development of automatic feature engineering methods, so that the substantial and tedious manual effort can be liberated. However, for industrial tasks, the efficiency and scalability of these methods are still far from satisfactory. In this paper, we propose a staged method named SAFE (Scalable Automatic Feature Engineering), which provides excellent efficiency and scalability, along with promising performance. Extensive experiments are conducted and the results show that the proposed method provides prominent efficiency and competitive effectiveness compared with other methods. Moreover, the adequate scalability of the proposed method allows it to be deployed in large-scale industrial tasks.
Industry 3: Deep Learning and Novel Applications (Wed 22nd April 10:00-11:30)
Title Authors
I3.1 519 "Cross-Graph Convolution Learning for Large-Scale Text-Picture Shopping Guide in E-Commerce Search" Tong Zhang (Nanjing University of Science and Technology); Baoliang Cui (Alibaba Group); Zhen Cui (Nanjing University of Science and Technology); Haikuan Huang (Alibaba); Jian Yang (Nanjing University of Science and Technology); Hongbo Deng (Alibaba Group); Bo Zheng (Alibaba Group)
In this work, a new e-commerce search service named text-picture shopping guide (TPSG) is investigated and applied to Taobao, a popular large-scale online e-commerce platform. Compared to traditional options that only contain textual terms, the TPSG is more user-friendly for interactive understanding through the recommended text-picture options (TPOs), which consist of text terms together with pictures. Aiming to automatically recommend personalized pictures in TPOs, rather than relying on the previous manual selection, we build a large-scale graph model over a gigantic amount of user, picture, and term data, and propose a cross-graph convolution learning (CGCL) method for more accurate and high-speed inference. Instead of an entire mixed relation graph, we model the attributes/relations of users and commodities with a within-user graph, a within-commodity graph, and a resultant cross graph describing the preferences of different users for commodities. To infer on/across these graphs more efficiently, we generalize graph convolution to this setting and propose a new tensor graph convolution across different graphs. In large-scale offline and online tests, we validate the superiority of automatic recommendation over manual selection of pictures in TPOs, and meanwhile demonstrate the effectiveness of our proposed CGCL.
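The paper's cross-graph tensor convolution is its own contribution and is not reproduced here; for readers unfamiliar with the building block it generalizes, a single plain graph-convolution step (neighborhood averaging followed by a linear map and nonlinearity) can be sketched as:

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One standard graph-convolution step: average each node's neighborhood
    (including itself) and apply a linear map + ReLU. This is the generic GCN
    building block, not the paper's cross-graph tensor variant."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)
    h = (a_hat / deg) @ features                # mean over the neighborhood
    return np.maximum(h @ weight, 0.0)          # linear map + ReLU
```

Stacking such layers lets information propagate across multi-hop neighborhoods; CGCL's contribution is extending this propagation across the within-user, within-commodity, and cross graphs jointly.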
I3.2 900 "Billion-scale Recommendation with Heterogeneous Side Information at Taobao " Andreas Pfadler (Alibaba); Huan Zhao (HKUST); Jizhe Wang (Alibaba); Lifeng Wang (Alibaba); Pipei Huang (Alibaba); Dik-Lun LEE (Hong Kong University of Science and Technology, Hong Kong)
In recent years, embedding models based on the skip-gram algorithm have been widely applied to real-world recommendation systems (RSs). When designing embedding-based methods for recommendation at Taobao, there are three main challenges: scalability, sparsity and cold start. The first problem is inherently caused by the extremely large numbers of users and items (on the order of billions), while the remaining two problems are caused by the fact that most items have only very few (or no) user interactions. To address these challenges, in this work, we present a flexible and highly scalable Side Information enhanced Skip-Gram (SISG) framework, which is deployed at Taobao. SISG overcomes the drawbacks of existing embedding-based models by modeling user metadata and capturing asymmetries of user behavior. Furthermore, as training SISG can be performed using any SGNS implementation, we present our production deployment of SISG on a custom-built word2vec engine which allows us to compute item and side-information embedding vectors for billion-scale sets of products in a joint semantic space on a daily basis. Finally, in a number of offline and online experiments we demonstrate the significant superiority of SISG over our previously deployed framework, EGES, and a well-tuned CF baseline, as well as present evidence supporting our scalability claims.
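The abstract does not spell out how side information enters the item representation; one common way to fold side information into skip-gram-style item vectors is to average the item's own embedding with embeddings of its side-information fields, so items sharing a brand or category get related vectors even with few interactions. The aggregation, field names, and cold-start fallback below are illustrative assumptions, not SISG's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Separate embedding tables for items and for each side-information field
# (e.g. brand, category). The tables and field names are illustrative.
item_emb = {i: rng.normal(size=DIM) for i in range(5)}
side_emb = {("brand", b): rng.normal(size=DIM) for b in ["a", "b"]}
side_emb.update({("cat", c): rng.normal(size=DIM) for c in ["x", "y"]})

def item_vector(item_id, side_info):
    """Aggregate an item's embedding with its side-information embeddings.
    A cold-start item (no learned embedding yet) falls back to side info only,
    which is how side information mitigates sparsity and cold start."""
    vecs = [side_emb[f] for f in side_info]
    if item_id in item_emb:
        vecs.append(item_emb[item_id])
    return np.mean(vecs, axis=0)
```

Under this scheme two brand-new items with identical side information start from identical vectors, then differentiate as their own embeddings are learned from interactions.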
I3.3 915 "Hierarchical Bipartite Graph Neural Networks: Towards Large-Scale E-commerce Applications" Zhao Li (Alibaba Group); SHEN Xin (Nanyang Technological University); Yuhang Jiao (Central University of Finance and Economics, China); Xuming Pan (Alibaba Group); Pengcheng Zou (Alibaba Group); Xianling Meng (Zhejiang University); Chengwei Yao (Zhejiang University); Jiajun Bu (Zhejiang University)
E-commerce appeals to a multitude of online shoppers by providing personalized experiences and has become indispensable in our daily life. Accurately predicting user preference and recommending favorable items plays a crucial role in improving key metrics such as Click-Through Rate (CTR) and Conversion Rate (CVR), and thereby commercial value. Some state-of-the-art collaborative filtering methods exploit non-linear interactions on a user-item bipartite graph and are able to learn better user and item representations with Graph Neural Networks (GNNs); however, these methods do not learn hierarchical representations of graphs because they are inherently flat. Hierarchical representation is reportedly favorable for making more personalized item recommendations, in terms of behaviorally similar users in the same community and a context of topic-driven taxonomy. However, existing approaches in this regard either consider only linear interactions, adopt a single-level community, or are computationally expensive. To address these problems, we propose a novel method, Hierarchical bipartite Graph Neural Network (HiGNN), to handle large-scale e-commerce tasks. By alternately stacking multiple GNN modules and a deterministic clustering algorithm, HiGNN efficiently obtains hierarchical user and item embeddings simultaneously, and effectively predicts user preferences on a larger scale. Extensive experiments on real-world e-commerce datasets demonstrate that HiGNN achieves a significant improvement over several popular methods. Moreover, we deploy HiGNN in Taobao, one of the largest e-commerce platforms, with hundreds of millions of users and items, for a series of large-scale item recommendation tasks. The results further demonstrate that HiGNN is promising and scalable in real-world applications.
I3.4 912 "LoCEC: Local Community-based Edge Classification in Large Online Social Networks " Chonggang Song (Tencent); Qian Lin (National University of Singapore); Guohui Ling (Tencent Technology); Zongyi Zhang (Tencent); Hongzhao Chen (Tencent); Jun Liao (Tencent); Chuan Chen (Tencent)
Relationships in online social networks often imply social connections in the real world. An accurate understanding of relationship types, such as family members or colleagues, benefits many applications, e.g., social advertising and recommendation. Some recent attempts classify user relationships into pre-defined types with the help of pre-labeled relationships or abundant interaction features on relationships. Unfortunately, both relationship feature data and label data are very sparse in real social platforms like WeChat, rendering existing methods inapplicable. In this paper, we present an in-depth analysis of WeChat relationships to identify the major challenges for the relationship classification task as well as candidate approaches for tackling them. We propose a Local Community-based Edge Classification (LoCEC) framework that classifies user relationships in a social network into real-world social connection types. LoCEC uses three stages, namely local community detection, community classification and relationship classification, to address the sparsity of relationship features and labels. Moreover, LoCEC is designed to handle large-scale networks by allowing parallel and distributed processing. We conduct extensive experiments on the real-world WeChat network, with hundreds of billions of edges, to validate the effectiveness and efficiency of our proposed approach.
I3.5 633 "APTrace: A Responsive System for Agile Enterprise Level Causality Analysis" Jiaping Gui (NEC Labs); Ding Li (NEC Laboratories America, Inc.); Zhengzhang Chen (NEC Laboratories America, Inc.); Junghwan Rhee (NEC Laboratories America, Inc.); Xusheng Xiao (Case Western Reserve University); Mu Zhang (University of Utah); Kangkook Jee (University of Texas, Dallas); Zhichun Li (Stellar Cyber); Haifeng Chen (NEC Labs)
While backtracking analysis has been successful in assisting the investigation of complex security attacks, it faces a critical dependency explosion problem. To address this problem, security analysts currently need to tune backtracking analysis manually with different case-specific heuristics. However, existing systems fail to fulfill two important requirements for effective backtracking analysis. First, the system needs flexible abstractions to express various types of heuristics. Second, the system needs to be responsive in providing updates so that the progress of backtracking analysis can be frequently inspected, which typically involves multiple rounds of manual tuning. In this paper, we propose a novel system, APTrace, to meet both of the above requirements. As we demonstrate in the evaluation, security analysts can effectively express heuristics to reduce more than 99.5% of irrelevant events in the backtracking analysis of real-world attack cases. To improve the responsiveness of backtracking analysis, we present a novel execution-window partitioning algorithm that significantly reduces the waiting time between two consecutive updates (notably, a 57x reduction for the top 1% of waiting times).
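The execution-window partitioning algorithm itself is in the paper; the underlying idea, splitting a time-ordered event stream into windows so the analysis can report partial results after each window instead of waiting for the whole trace, can be sketched as follows (the event shape and fixed window span are assumptions):

```python
def partition_windows(events, window_span):
    """Split a time-ordered list of (timestamp, event) pairs into fixed-span
    execution windows, so a backtracking pass can emit partial results after
    each window rather than only at the end of the whole trace."""
    windows, current, boundary = [], [], None
    for ts, event in events:
        if boundary is None:
            boundary = ts + window_span
        if ts >= boundary:
            windows.append(current)   # close the current window
            current = []
            boundary = ts + window_span
        current.append((ts, event))
    if current:
        windows.append(current)
    return windows
```

Processing window by window bounds the gap between consecutive updates by the per-window work, which is the responsiveness property the abstract's 57x figure refers to.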