Haoqiong Bian

bianhaoqiong at gmail dot com
Research Area: Database Systems, Big Data Analytics
Also Loves Traveling and Cooking


2012.09 - Now Database Group, DBIIR Lab., Renmin University of China

  • Ph.D. Candidate, Thesis: High Performance Analytical Query Processing of Very Large Log Data.
  • Advisor: Prof. Xiaoyong Du
2015.09 - 2016.10 Interactive Data Systems Group, The Ohio State University

  • Visiting Ph.D. Student, Topic: Interactive Data Analytics.
  • Host Advisor: Prof. Arnab Nandi
2008.09 - 2012.06 Hohai University

  • Bachelor of Engineering, Computer Science and Technology.
Research

2016.11 - Now Adaptive Data Layout Optimization of Very Large and Wide Tables

  • In this ongoing work, emerging hardware and more extensive workloads are considered in the data layout optimization of very large and wide tables. With an adaptive framework, query workloads will benefit from a scalable and self-tuning storage layout in HDFS.
2015.01 - 2016.10 Wide Table Layout Optimization Based on Column Ordering and Duplication

  • In many Internet companies, log data and user features are generally stored as wide tables in columnar formats in HDFS. By reordering the columns and/or duplicating frequently accessed columns in the physical layout of a wide table, the I/O performance of the query workload on the table can be boosted by more than 50%. The paper on this work was accepted by SIGMOD'17. [Slides], [Paper], [Source Code].
2014.10 - 2014.12 Indexing and Ingestion Scheme for Real-time Analytics of Rapidly Growing Log Data

  • Many types of log data, such as Internet access logs, grow rapidly. Indexing and ingesting a large amount of data within a few seconds while supporting real-time query evaluation on that data is a big challenge. Taking lessons from the LSM-Tree, which is widely used in key-value stores, we proposed a log data indexing scheme that can ingest 500+ million tuples per node per second and support real-time lookups on multiple indexed dimensions. [Paper].
2013.06 - 2014.07 Benchmark Evaluations of SQL-on-Hadoop Systems

  • This work was internally called the B52 Plan. We first set up an OpenStack infrastructure (named XingCloud @ RUC) on top of 52 physical servers. We then performed extensive benchmark evaluations (TPC-DS and TPC-H, 300GB-3TB) of popular SQL-on-Hadoop systems, including Hive, Hive on Tez, Shark, Spark SQL, Impala and Presto, on XingCloud at cluster scales of 25, 50, 100 and 200 nodes. I was the technical leader of the team on this work. [Slides (Chinese)], [Paper1], [Paper2].
Projects

2013.12 - 2014.12 Sleeve: A Multi-thread Pipeline Computation Framework

  • Although many streaming engines run on a cluster of machines to provide good scalability, sometimes all we need is an efficient multi-threaded program that puts the idle resources of a single machine to work and finishes our tasks quickly. However, writing an efficient multi-threaded program is not easy for everyone. Sleeve is an easy-to-use framework for quickly assembling multiple threads into an efficient pipeline. It provides a functional API similar to MapReduce. [Poster].
2013.05 - 2013.06 MetKB: An RDF Knowledge-base Enrichment Toolkit

  • RDF datasets such as Yago or DBpedia can be seen as knowledge bases (KBs) that support applications such as entity search and exploration. Unfortunately, these KBs are static and only updated periodically, so much new knowledge is missed and cannot be used by the applications built on top of them. To solve this problem, we designed a tool that keeps crawling HTML tables and mapping them to existing entities in the KB. New content in the mapped tables is then used to enrich the KB. [Paper].
2013.02 - 2013.05 Distributed SPARQL Query Engine

  • RDF datasets are now considerably large (tens or hundreds of GBs) and keep growing. While SPARQL is the standard query language for RDF data, it has become hard for a single computer to query such large datasets efficiently. We took RDF-3X (the state-of-the-art single-node SPARQL query engine at that time) as the storage engine on each node and designed a distributed SPARQL query engine to improve query performance on large RDF datasets. [Paper].
Internship

2014.08 - 2015.04 System and Algorithm Group, Microsoft Research Asia

  • Mentor: Ying Yan
  • Award: Star of Tomorrow Excellent Intern
  • Main Work:
    • Data layout optimization for Bing search log analysis pipeline (shipped),
    • Automatic deployment of big data systems in Azure.
Publications

SIGMOD'17 Wide Table Layout Optimization based on Column Ordering and Duplication [Paper]
  • Haoqiong Bian, Ying Yan, Liang Jeff Chen, Yueguo Chen, Xiaoyong Du, Thomas Moscibroda
SoCC'15 Poster Taming Big Wide Tables: Layout Optimization based on Column Ordering [Poster]
  • Haoqiong Bian, Ying Yan, Liang Jeff Chen, Yueguo Chen, Thomas Moscibroda
APWeb'15 A Fast Data Ingestion and Indexing Scheme for Real-time Log Analytics [Paper]
  • Haoqiong Bian, Yueguo Chen, Xiongpai Qin, Xiaoyong Du
BPOE'14 (VLDB'14 Workshop) A Study of SQL-on-Hadoop Systems [Paper]
  • Yueguo Chen, Xiongpai Qin, Haoqiong Bian, Jun Chen, Zhaoan Dong, Xiaoyong Du, Yanjie Gao, Dehai Liu, Jiaheng Lu, Huijie Zhang
Journal of East China Normal University (Natural Science), No. 5, 2014.09 Equi-join Optimization on Spark (Chinese) [Paper]
  • Haoqiong Bian, Yueguo Chen, Xiaoyong Du, Yanjie Gao
CIKM'13 Demo MetKB: Enriching RDF Knowledge Bases with Web Entity-Attribute Tables [Paper]
  • Haoqiong Bian, Yueguo Chen, Xiaoyong Du, Xiaolu Zhang
IEEE BigData Congress'13 Efficient SPARQL Query Evaluation In a Database Cluster [Paper]
  • Fang Du, Haoqiong Bian, Yueguo Chen, Xiaoyong Du
Contests

The National Contest in Big Data Technical Innovations'14 First Place in Database Track
  • Solved Problem: Index Solution for Real-time Loading and Query of Large Internet Log Data.
  • Haoqiong Bian (team leader), Liping Zhao, Ao Cheng, Peishen Jia
The National Contest in Big Data Technical Innovations'13 First Place in Database Track
  • Solved Problem: Real-time Detection of Black Holes in Telecom Paging Networks.
  • Haoqiong Bian (team leader), Jun Chen, Huijie Zhang
The Interdisciplinary Contest in Modeling (ICM)'11 Honorable Mention
  • Solved Problem: How environmentally and economically sound are electric vehicles? Is their widespread use feasible and practical?
  • Guo Yang, Haoqiong Bian, Chenyu Zhang
Skills

  • Good driving and cooking skills.
  • Design of analytical database systems and key-value storage systems.
  • Deployment and tuning of SQL-on-Hadoop systems.
  • Good teamwork experience and capable of doing research independently.
  • Good English communication and writing skills.
  • Good programming skills in Java and C/C++; also know Scala, JavaScript, Python and C#.
  • Familiar with embedded hardware/software development and Internet of Things.