2012.09 - Now Database Group, DBIIR Lab., Renmin University of China
Ph.D. Candidate, Thesis: High Performance Analytical Query Processing of Very Large Log Data. Advisor: Prof. Xiaoyong Du
2015.09 - 2016.10 Interactive Data Systems Group, The Ohio State University
Visiting Ph.D. Student, Topic: Interactive Data Analytics. Host Advisor: Prof. Arnab Nandi
2008.09 - 2012.06 Hohai University
Bachelor of Engineering, Computer Science and Technology.
2016.11 - Now Adaptive Data Layout Optimization of Very Large and Wide Tables
This ongoing work considers emerging hardware and a broader range of workloads in data layout optimization for very large and wide tables. With an adaptive framework, query workloads will benefit from a scalable, self-tuning storage layout in HDFS.
2015.01 - 2016.10 Wide Table Layout Optimization Based on Column Ordering and Duplication
In many Internet companies, log data and user features are generally stored as wide tables in columnar formats in HDFS. By reordering the columns and/or duplicating frequently accessed columns in the physical layout of a wide table, the I/O performance of the query workload on the table can be improved by more than 50%. The paper on this work was accepted by SIGMOD'17. [Slides (Chinese)], [Poster].
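The intuition behind column reordering can be illustrated with a toy cost model. The sketch below is an assumption for illustration only, not the cost model or algorithm from the paper: the function names, the column sizes, and the idea of charging each query for the bytes of unneeded columns lying between its first and last accessed column are all hypothetical, and the exhaustive search is only practical for a handful of columns.

```python
from itertools import permutations


def seek_cost(order, sizes, query):
    """Bytes of unneeded columns between the first and last accessed column.

    `order` is a sequence of column names, `sizes` maps column name to bytes,
    and `query` is the set of columns the query reads (all hypothetical).
    """
    positions = [i for i, col in enumerate(order) if col in query]
    if not positions:
        return 0
    first, last = positions[0], positions[-1]
    return sum(sizes[col] for col in order[first:last + 1] if col not in query)


def workload_cost(order, sizes, workload):
    """Total toy seek cost of a column order over all queries in the workload."""
    return sum(seek_cost(order, sizes, query) for query in workload)


def best_order(columns, sizes, workload):
    """Brute-force search over all column orders; feasible only for tiny tables."""
    return min(permutations(columns),
               key=lambda order: workload_cost(order, sizes, workload))


# Hypothetical table and workload.
columns = ("a", "b", "c", "d")
sizes = {"a": 10, "b": 20, "c": 30, "d": 40}
workload = [{"a", "d"}, {"b", "c"}]

original_cost = workload_cost(columns, sizes, workload)
reordered = best_order(columns, sizes, workload)
reordered_cost = workload_cost(reordered, sizes, workload)
```

In this example, placing co-accessed columns next to each other removes the skipped bytes entirely. A real wide table has far too many columns for exhaustive search, so a heuristic search would be needed in practice.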
2014.10 - 2014.12 Indexing and Ingestion Scheme for Real-time Analytics of Rapidly Growing Log Data
Many types of log data, such as Internet access logs, grow rapidly. Indexing and ingesting large amounts of data within a few seconds while supporting real-time query evaluation on that data is a major challenge. Drawing on the LSM-Tree, which is widely used in key-value stores, we proposed a log data indexing scheme that can ingest 500+ million tuples per node per second and support real-time lookups on multiple indexed dimensions. [Paper].
2013.06 - 2014.07 Benchmark Evaluations of SQL-on-Hadoop Systems
This work was internally called the B52 Plan. We first set up an OpenStack infrastructure (named XingCloud @ RUC) on top of 52 physical servers. We then ran extensive benchmark evaluations (TPC-DS and TPC-H, 300GB-3TB) of popular SQL-on-Hadoop systems, including Hive, Hive on Tez, Shark, Spark SQL, Impala and Presto, on XingCloud at cluster scales of 25, 50, 100 and 200 nodes. I was the technical leader of the team on this work. [Slides (Chinese)], [Paper1], [Paper2].
2013.12 - 2014.12 Sleeve: A Multi-thread Pipeline Computation Framework
Although many streaming engines run on clusters of machines to provide good scalability, sometimes all we need is an efficient multi-threaded program that makes full use of the idle resources on a single machine to finish a task quickly. However, writing an efficient multi-threaded program is not easy for everyone. Sleeve is an easy-to-use framework for quickly assembling multiple threads into an efficient pipeline. It provides a functional API similar to MapReduce. [Poster].
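The general idea of a multi-threaded pipeline with a functional interface can be sketched as follows. This is not Sleeve's API, which is not shown here. It is a minimal illustration built on Python's standard `threading` and `queue` modules, and `run_pipeline`, `stages`, and `queue_size` are hypothetical names. Each stage runs in its own thread and passes items to the next stage through a bounded queue. Error handling is omitted, so an exception inside a stage function would stall the pipeline.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the stream


def run_pipeline(source, stages, queue_size=64):
    """Run each function in `stages` in its own thread, connected by queues.

    Items from `source` flow through the stages in order. With one thread
    per stage and FIFO queues, output order matches input order.
    """
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(stages) + 1)]

    def worker(fn, inbox, outbox):
        while True:
            item = inbox.get()
            if item is _STOP:
                outbox.put(_STOP)  # propagate end-of-stream downstream
                return
            outbox.put(fn(item))

    def feed():
        for item in source:
            queues[0].put(item)
        queues[0].put(_STOP)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()

    # Drain the last queue in the calling thread until end-of-stream.
    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)

    for t in threads:
        t.join()
    return results


# Hypothetical usage: two map-style stages applied to a small input.
pipeline_output = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2])
```

The bounded queues give simple backpressure: a fast stage blocks when the next stage falls behind, so memory use stays limited. In CPython, threads mainly help with I/O-bound stages because of the global interpreter lock.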
2013.05 - 2013.06 MetKB: An RDF Knowledge-base Enrichment Toolkit
RDF datasets such as Yago or DBpedia can be seen as knowledge bases (KBs) that support applications such as entity search and exploration. Unfortunately, these KBs are static and only updated periodically, so much new knowledge is missed and cannot be used by the applications built on top of them. To solve this problem, we designed a tool that continuously crawls HTML tables and maps them to existing entities in the KB. New content in the mapped tables is then used to enrich the KB. [Paper].
2013.02 - 2013.05 Distributed SPARQL Query Engine
RDF datasets are now considerably large (tens or hundreds of GBs) and keep growing. While SPARQL is the standard query language for RDF data, it is hard for a single computer to run queries efficiently on such large datasets. We took RDF-3X (the state-of-the-art single-node SPARQL query engine at that time) as the storage engine on each node and designed a distributed SPARQL query engine to improve query performance on large RDF datasets. [Paper].
2014.08 - 2015.04 System and Algorithm Group, Microsoft Research Asia
Mentor: Ying Yan. Award: Star of Tomorrow Excellent Intern. Main work:
- Data layout optimization for Bing search log analysis pipeline (shipped),
- Automatic deployment of big data systems in Azure.
SIGMOD'17 Wide Table Layout Optimization based on Column Ordering and Duplication
Haoqiong Bian, Ying Yan, Liang Jeff Chen, Yueguo Chen, Xiaoyong Du, Thomas Moscibroda
SOCC'15 Poster Taming Big Wide Tables: Layout Optimization based on Column Ordering [Poster]
Haoqiong Bian, Ying Yan, Liang Jeff Chen, Yueguo Chen, Thomas Moscibroda
APWeb'15 A Fast Data Ingestion and Indexing Scheme for Real-time Log Analytics [Paper]
Haoqiong Bian, Yueguo Chen, Xiongpai Qin, Xiaoyong Du
BPOE'14 (VLDB'14 Workshop) A Study of SQL-on-Hadoop Systems [Paper]
Yueguo Chen, Xiongpai Qin, Haoqiong Bian, Jun Chen, Zhaoan Dong, Xiaoyong Du, Yanjie Gao, Dehai Liu, Jiaheng Lu, Huijie Zhang
Journal of East China Normal University (Natural Science), No. 5, 2014.09 Equi-join Optimization on Spark (Chinese) [Paper]
Haoqiong Bian, Yueguo Chen, Xiaoyong Du, Yanjie Gao
CIKM'13 Demo MetKB: Enriching RDF Knowledge Bases with Web Entity-Attribute Tables [Paper]
Haoqiong Bian, Yueguo Chen, Xiaoyong Du, Xiaolu Zhang
IEEE BigData Congress'13 Efficient SPARQL Query Evaluation In a Database Cluster [Paper]
Fang Du, Haoqiong Bian, Yueguo Chen, Xiaoyong Du
The National Contest in Big Data Technical Innovations'14 First Place in Database Track
Solved Problem: Index Solution for Real-time Loading and Query of Large Internet Log Data. Haoqiong Bian (team & tech lead), Liping Zhao, Ao Cheng, Peishen Jia
The National Contest in Big Data Technical Innovations'13 First Place in Database Track
Solved Problem: Real-time Detection of Black Holes in Telecom Paging Networks. Haoqiong Bian (team & tech lead), Jun Chen, Huijie Zhang
The Interdisciplinary Contest in Modeling (ICM)'11 Honorable Mention
Solved Problem: How environmentally and economically sound are electric vehicles? Is their widespread use feasible and practical? Guo Yang, Haoqiong Bian, Chenyu Zhang
Good driving and cooking skills.
Design of analytical database systems and key-value storage systems.
Deployment and tuning of SQL-on-Hadoop systems.
Good teamwork experience and capable of doing research independently.
Good English communication and writing skills.
Familiar with embedded hardware/software development and Internet of Things.