Computer science student at Renmin University of China.
Bio. I am a first-year Ph.D. candidate in the DBIIR Lab at Renmin University of China (RUC), working on database systems and advised by Yueguo Chen. I earned my bachelor's degree from Sichuan University in 2015, then joined the combined master's-doctoral program at RUC and spent the last two years as a master's student in computer science. During these two years, I studied Hadoop and popular SQL-on-Hadoop systems such as Hive, Spark SQL and Presto, and I built Paraflow.
My current research investigates how to speed up big data analytics by adaptively optimizing the layout of columnar storage on HDFS. Existing analytical systems widely use columnar storage to speed up queries, and the layout of these columnar files on HDFS can be optimized adaptively to further improve query execution.
Besides research, I enjoy working on open source projects.
The challenge of Big Data has shifted the design of data analytical systems from single machines to large-scale distributed systems. My research focuses on key techniques of big data analytics to improve the performance of distributed analytical systems over big data.
Nowadays, many big data analysis systems share HDFS (Hadoop Distributed File System) as their common underlying storage, and relational tables are stored as columnar files to speed up query execution. The physical layout of these columnar files largely determines I/O performance, which in turn is critical to the overall performance of analytical systems on HDFS. My current work investigates how to optimize the physical layout of columnar files so that it adapts to various workloads and system settings.
In the future, I plan to continue my research in the field of big data analytics, with a focus on building analytical systems that exploit the potential benefits of emerging hardware (such as NVM, GPU and FPGA) and lower the barrier to performance tuning for developers. In my dissertation work, I hope to build a new analytical system that is both efficient and easy to use.
Towards Real-Time Analysis of ID-Associated Data. ER'18 Demo (accepted). Guodong JIN, Yixuan WANG, Xiongpai QIN, Yueguo CHEN, Xiaoyong DU.
Rainbow: Adaptive Layout Optimization for Wide Tables. ICDE'18 Demo. Haoqiong BIAN, Youxian TAO, Guodong JIN, Yueguo CHEN, Xiongpai QIN, Xiaoyong DU. [paper] [poster] [code]
Entity Fiber based Partitioning, no Loss Staging and Fast Loading of Log Data. PDCAT'16. Xiongpai QIN, Yueguo CHEN, Guodong JIN, Yang LIU, Yiming CONG, Xiaoyong DU. [paper] [code]
No Loss Staging and Fast Loading of Log Data. NDBC'16 Demo. Xiongpai QIN, Guodong JIN, Yang LIU, Yiming CONG, Xiaoyong DU. [code]
Pixels. A flexible columnar storage format with adaptive optimization techniques embedded. This project will be open-sourced soon.
Rainbow. A data layout optimization framework for wide tables stored on HDFS.
Paraflow. A real-time analytical system for ID-associated data. This project is under active development.
Pard. A parallel database running like a leopard. This is a course project for Distributed Database Systems.
Claims. A distributed in-memory database system, which I worked on during my internship at InfoSys Bangalore.
Good understanding of Java as a systems programming language.
Familiar with the source code of Facebook Presto and Apache ORC.
Good communication skills in both English and Mandarin Chinese.
Good at cooking, still improving.
Good teamwork spirit and considerable system design experience.
Apr 2018: I'm demonstrating Rainbow: Adaptive Layout Optimization for Wide Tables at ICDE 2018.
Sep 2017: I'm TA'ing the Principles and Implementation of Database System (for graduate students). Working hard!
Jul 2017: We started a new project called Pixels.
Jul 2017: Attending Strata Beijing 2017. Happy to meet new friends!
Dec 2016: I'm joining InfoStep (at Bangalore, India), an internship program hosted by InfoSys, to develop the distributed in-memory database system (called Claims). Three months in India!
Aug 2016: We placed seventeenth in the second round of the Middleware Development Performance Challenge hosted by Alibaba Group. The contest requires developing a system from scratch (with no library dependencies other than the JDK) that loads a 100GB relational dataset and answers queries over it as efficiently as possible on a single cheap server with only 4GB of memory. Java is the only programming language allowed.
Aug 2016: I'm attending Strata+Hadoop World Beijing. Excited to meet Doug Cutting and learn about excellent open source projects.