Youxian Tao
Youxian Tao is currently a master's degree candidate at the School of Information, Renmin University of China. He received his B.S. degree in computer science and technology from Hohai University in 2013. His current research interests are adaptive layout optimization and OLAP systems. He and his team members developed a data layout optimization framework that helps improve the I/O performance of wide tables stored in columnar formats on HDFS, and published a paper at ICDE 2018. Youxian Tao is not a pure programming guy: besides writing code, he also enjoys cooking and travelling.
Research
2020.01 - 2020.04 Layout Optimization Based on Deep Reinforcement Learning
This research will be open-sourced soon.
2017.12 - Present Adaptive Hybrid Storage Optimization for Wide Tables Analysis
This research is currently on hold.
2017.10 - 2018.08 Indexing and Ingestion for Real-time Analytics of ID-associated Data
Many types of log data, such as Internet access logs, grow rapidly. Indexing and ingesting large amounts of data within a few seconds, while supporting real-time query evaluation over that data, is a major challenge. Drawing on the LSM-Tree, which is widely used in key-value stores, we proposed a lightweight yet powerful log data indexing scheme that can ingest 500+ million tuples per node per second and supports real-time lookups on multiple indexed dimensions.
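The general LSM-Tree idea referred to above can be illustrated with a minimal sketch. This is not the indexing scheme from the project; the class name TinyLogIndex, the buffer_limit parameter, and the single indexed key are all assumptions made for illustration. Writes go to an in-memory buffer, the buffer is flushed as an immutable sorted run when it fills up, and lookups binary-search each run and then scan the buffer.

```python
import bisect


class TinyLogIndex:
    """Toy LSM-style index (hypothetical, for illustration only)."""

    def __init__(self, buffer_limit=4):
        self.buffer_limit = buffer_limit  # assumed flush threshold
        self.buffer = []                  # unsorted (key, value) pairs, newest last
        self.runs = []                    # immutable sorted runs, oldest first

    def ingest(self, key, value):
        # Appending to the buffer keeps ingestion cheap; sorting is deferred to flush.
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # sorted() is stable, so equal keys keep their arrival order.
        ordered = sorted(self.buffer, key=lambda kv: kv[0])
        keys = [k for k, _ in ordered]
        values = [v for _, v in ordered]
        self.runs.append((keys, values))
        self.buffer = []

    def lookup(self, key):
        # Return all values for `key`, oldest first.
        hits = []
        for keys, values in self.runs:
            lo = bisect.bisect_left(keys, key)
            hi = bisect.bisect_right(keys, key)
            hits.extend(values[lo:hi])
        # Data still in the buffer is visible immediately (real-time lookup).
        hits.extend(v for k, v in self.buffer if k == key)
        return hits
```

A production design would also merge runs in the background and keep one such structure per indexed dimension; both are omitted here to keep the sketch short.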
2019.09 - 2019.10 Adaptive Data Layout Optimization of Very Large and Wide Tables
In this ongoing work, emerging hardware and more extensive workloads are considered in the data layout optimization of very large and wide tables. With an adaptive framework, query workloads will benefit from a scalable and self-tuning storage layout on HDFS.
Projects
Pixels
- A flexible column storage format with adaptive optimization techniques embedded.
This project will be open-sourced soon.
Paraflow
- A real-time analytical system for ID-associated data.
Paraflow enables users to load data into a data warehouse (such as HDFS) as quickly as possible, and provides real-time analysis over both the data being loaded and the data already in the warehouse.
- Fast loading. Paraflow utilizes a well-designed pipeline for efficient data loading.
- Lossless staging. Kafka is used in the system to stage data without loss.
- Real-time analysis. Lightweight indices are used in Paraflow to speed up queries.
Rainbow
- A data layout optimization framework for wide tables stored on HDFS.
Rainbow is an ETL tool that adaptively improves the I/O performance of HDFS column stores by reducing disk seek costs. Users can interact with Rainbow to monitor the optimization process in an ETL pipeline.
Internship
Intern Development Engineer
- Ant BlockChain Group.
- Mentor: Ying Yan
- Main Work:
- Mycloak: release iterations of the trusted computing plugin; automated testing process.
- Mykms: multiparty key management, private deployment of keys, admin management platform.
Big Data R & D Intern
- Data Platform [Big data query and analysis].
- Mentor: Dongdong Guo
- Main Work:
- Responsible for the ClickHouse service.
- Updates and iterative feature development for the ETL tool.
- Meeting BI query needs; on-call support.
Software Development Intern
- Technology Department.
- Mentor: Ming Li
- Main Work:
- Participated in the "cloud business intelligence prototype system" project.
- Frontend: data visualization display.
- Backend: development of WeChat API interfaces.
Publications
ICDE'18 Demo Rainbow: Adaptive Layout Optimization for Wide Tables.
Contests
Programming The First PolarDB Database Performance Competition
With Range as the core function, designed a storage solution that separates key-value data, divides the data into multiple DB shards, and performs full data scans through a producer-consumer model.
Innovative 2018 Alibaba Cloud Global Blockchain Competition
Designed and implemented a "blockchain-based medical record service applet" that protects user privacy through encryption and decryption mechanisms.