Search, Analysis and Interaction (SAI) Group @ DBIIR

We Talk with Data.

Overview

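Research areas: (1) real-time big data analytics systems; (2) human-computer interaction techniques for big data; (3) knowledge graphs and semantic search; (4) benchmarks for big data management systems.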

Faculty: Xiaoyong Du (杜小勇), Yueguo Chen (陈跃国), Xiongpai Qin (覃雄派)

Students:

Open source: Github

Email: chenyueguo@ruc.edu.cn

Current Research Topics

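  • Vertical-Domain Intelligent Question Answering
    Intelligent question answering (QA) is an important research topic in artificial intelligence. Its task is to find a specific answer to a user's question. For example, given the question "What should I do if my phone accidentally falls into seawater while traveling?", a QA system might answer: "Do not try to power it on or press the power button. Wipe the water off the surface of the phone, then place it in a strongly absorbent environment (such as activated carbon, rice, or desiccant) for 48 hours, and try powering it on once it is dry." Most QA systems target the open domain, where answer relevance is often unsatisfactory. We work on vertical applications instead, helping enterprises address two problems: how to build large-scale collections of question-answer pairs, and how to use existing pairs to answer user questions precisely. We use vertical-domain knowledge graphs for question understanding and retrieval. The work involves techniques from natural language processing, information retrieval, and machine learning.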

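  • Behavior Knowledge Graphs over Government-Governance Big Data
    Fraud is widespread in many areas of the Internet. Our research focuses on Internet finance, linking users' multi-channel behavior data into a behavior knowledge graph. On top of this massive behavior network, we study algorithms for identifying group fraud, data exploration systems that support the discovery of anomalous behavior patterns, and data management systems for large-scale behavior graphs. Students who join this work will have access to user behavior data from real application scenarios, study high-performance feature extraction methods, and gain first-hand data science research experience in real applications.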

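  • Big Data Analytics Systems
    Big data analytics underpins all kinds of big data applications. The key question we study is how to efficiently support large-scale queries and analytics in a big data setting, particularly on distributed Hadoop clusters. We have analyzed existing open-source systems such as Spark and Presto in depth and optimized them around several key techniques: loading data into Hadoop clusters more efficiently, making small queries run faster, and adaptively optimizing data storage according to the query workload. This work led to systems such as ParaFlow and Rainbow. Research on big data analytics systems requires some engineering ability and a programming background, and machine learning techniques can also be incorporated.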

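  • Exploratory Search over RDF Knowledge Graphs
    RDF knowledge graphs contain a rich set of entities and relations and offer users a new way to acquire knowledge. However, when faced with large, multi-domain RDF knowledge graphs, users often cannot retrieve the results they want with a simple query, either because they are unfamiliar with the information space or because their search intent is unclear. This project therefore studies how exploratory search can help users gradually adjust and refine their search goals, so that they can find content of interest in large and complex RDF knowledge graphs more effectively. The project draws on key techniques from information retrieval, recommender systems, and human-computer interaction. The main research topics are: 1) entity-oriented retrieval algorithms; 2) entity-oriented recommendation algorithms; 3) interaction methods for mobile devices.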
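
The idea of answering from existing question-answer pairs, mentioned under the vertical-domain QA topic above, can be illustrated with a minimal sketch. It assumes TF-IDF cosine matching over whitespace tokens, which is a stand-in chosen for illustration and not the group's method; it also leaves out the knowledge graph component. All function names and the sample pairs are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenization is a simplification; real text needs proper segmentation.
    return text.lower().split()

def build_index(qa_pairs):
    """Return per-question term counts and smoothed IDF weights."""
    docs = [Counter(tokenize(q)) for q, _ in qa_pairs]
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    return docs, idf

def weighted_norm(vec, idf):
    # Terms never seen in the stored questions get weight 0.
    return math.sqrt(sum((v * idf.get(t, 0.0)) ** 2 for t, v in vec.items()))

def cosine(a, b, idf):
    dot = sum(a[t] * b[t] * idf.get(t, 0.0) ** 2 for t in a if t in b)
    na, nb = weighted_norm(a, idf), weighted_norm(b, idf)
    return dot / (na * nb) if na and nb else 0.0

def answer(question, qa_pairs, docs, idf, threshold=0.3):
    """Return the stored answer of the closest question, or None if below threshold.

    Assumes qa_pairs is non-empty.
    """
    query = Counter(tokenize(question))
    scores = [cosine(query, doc, idf) for doc in docs]
    best = max(range(len(docs)), key=scores.__getitem__)
    return qa_pairs[best][1] if scores[best] >= threshold else None

# Hypothetical stored pairs for a customer-service setting.
qa_pairs = [
    ("phone fell into sea water", "Do not power it on; dry it for 48 hours, then try again."),
    ("how to reset my password", "Use the reset link on the login page."),
]
docs, idf = build_index(qa_pairs)
print(answer("my phone fell into water", qa_pairs, docs, idf))
```

The threshold lets the sketch decline to answer when no stored question is close enough, which matters more in a vertical domain than returning a loosely related answer.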

Open-Source Systems

  • Pixels (2017 to present)
    Links: Github | Slides | Paper
    Design and optimization of an adaptive storage system for big data analytics; research in progress.
    In many Internet companies, log data and user features are generally stored as wide tables in columnar formats in HDFS. By reordering the columns and/or duplicating frequently accessed columns in the physical layout of a wide table, the I/O performance of a query workload on the table can be boosted by more than 50%. The paper on this work was accepted at SIGMOD'17.

  • SEEDV1.0 ~ SEEDV3.0 (2017 to present)
    Links: Github | SEED [ICDE 2016] | Poster | [IUI 2017]
    We introduce a system called SEED, which is designed to support entity search and exploration in large knowledge graphs. We demonstrate SEED using a dataset of hundreds of thousands of movie-related entities from the DBpedia knowledge graph. The system uses a graph embedding model for ranking entities and their relations, recommending related entities, and explaining their interrelations.


  • Paraflow (2016 to present)
    Links: Github | Homepage
    Log data contains valuable information for decision making. Timely and efficient analysis of log data can bring significant business value. For example, by analyzing the log data of servers and applications, we can infer the root causes of failures. By analyzing the log data of e-commerce sites, we can learn about recent changes in the browsing and purchasing behaviours of specific customers, and on that basis the sites can provide more personalized recommendations.

  • Rainbow (2017)
    Links: Github | Homepage | Video
    We present a data layout optimization tool called Rainbow, which uses workload-driven layout optimization algorithms to adjust data layouts adaptively without interfering with data blocks that have already been stored. We also provide a Web UI for users to interact with the layout optimization process.


  • Entity Search (2016)
    Links: Paper
    In ESearch, we design an effective ranking model of entity types to facilitate blind feedback and user feedback on desired entity types for category matching, so that users can effectively perform entity search without explicitly providing any query entity types as input.
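
The column-reordering idea described under Pixels above can be illustrated with a toy sketch. It assumes a simplified cost model in which a query pays for the bytes it must skip between the columns it reads, and it searches orderings by brute force. Both the cost model and the search are assumptions made for illustration, not the published algorithm, and the column names, sizes, and workload are hypothetical.

```python
from itertools import permutations

def seek_cost(order, queries, sizes):
    """Total bytes skipped between consecutively read columns, summed over queries."""
    offsets, pos = {}, 0
    for col in order:
        offsets[col] = pos
        pos += sizes[col]
    total = 0
    for cols in queries:
        # Read the accessed columns in their physical order on disk.
        read = sorted(cols, key=offsets.__getitem__)
        for prev, nxt in zip(read, read[1:]):
            total += offsets[nxt] - (offsets[prev] + sizes[prev])
    return total

def best_order(columns, queries, sizes):
    """Exhaustive search over orderings; only feasible for a handful of columns."""
    return min(permutations(columns), key=lambda o: seek_cost(o, queries, sizes))

# Hypothetical wide table with four equally sized columns and a three-query workload.
columns = ["a", "b", "c", "d"]
sizes = {c: 10 for c in columns}
queries = [{"a", "c"}, {"a", "c"}, {"b", "d"}]

original_cost = seek_cost(columns, queries, sizes)
optimized = best_order(columns, queries, sizes)
optimized_cost = seek_cost(optimized, queries, sizes)
print(original_cost, optimized, optimized_cost)
```

Placing co-accessed columns next to each other removes the gaps a query would otherwise skip over. Real wide tables have far too many columns for exhaustive search, so a heuristic would be needed in practice.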

Publications

  • Denghao Ma, Yueguo Chen, Xiaoyong Du, Yuanzhe Hao: Interpreting Fine-Grained Categories from Natural Language Queries of Entity Search. DASFAA 2018: 861-877
  • Denghao Ma, Yueguo Chen, Kevin Chen-Chuan Chang, Xiaoyong Du, Chuanfei Xu, Yi Chang: Leveraging Fine-Grained Wikipedia Categories for Entity Search. WWW 2018: 1623-1632
  • Haoqiong Bian, Youxian Tao, Guodong Jin, Yueguo Chen, Xiongpai Qin, Xiaoyong Du: Rainbow: Adaptive Layout Optimization for Wide Tables. ICDE 2018: 1657-1660
  • Zhaoan Dong, Jiaheng Lu, Tok Wang Ling, Ju Fan, Yueguo Chen: Using hybrid algorithmic-crowdsourcing methods for academic knowledge acquisition. Cluster Computing 20(4): 3629-3641 (2017)
  • Xiongpai Qin, Yueguo Chen, Jun Chen, Shuai Li, Jiesi Liu, Huijie Zhang: The Performance of SQL-on-Hadoop Systems - An Experimental Study. BigData Congress 2017: 464-471
  • Xiangling Zhang, Yueguo Chen, Jun Chen, Xiaoyong Du, Ke Wang, Ji-rong Wen: Entity Set Expansion via Knowledge Graphs. SIGIR 2017: 1101-1104
  • Haoqiong Bian, Ying Yan, Wenbo Tao, Liang Jeff Chen, Yueguo Chen, Xiaoyong Du, Thomas Moscibroda: Wide Table Layout Optimization based on Column Ordering and Duplication. SIGMOD Conference 2017: 299-314
  • Denghao Ma, Yueguo Chen, Jun Chen, Xiaoyong Du, Xiangliang Zhang. ESearch: Incorporating Text Corpus and Structured Knowledge for Open Domain Entity Search. WWW (Companion Volume) 2017: 253-256
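  • 张香玲, 陈跃国, 毛文祥, 荣崔田, 杜小勇: A Graph-Model-Based Method for Entity Type Completion (in Chinese). 计算机学报 (Chinese Journal of Computers) 40(10): 2352-2366 (2017)
  • Jun Chen, Giulio Jacucci, Yueguo Chen, Tuukka Ruotsalo: SEED: Entity Oriented Information Search and Exploration. IUI Companion 2017: 137-140
  • 张香玲, 陈跃国, 马登豪, 陈峻, 杜小勇: A Survey of Entity Search (in Chinese). 软件学报 (Journal of Software) 28(6): 1-22 (2017)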
  • Jun Chen, Yueguo Chen, Xiaoyong Du, Xiangling Zhang, Xuan Zhou. SEED: A system for entity exploration and debugging in large-scale knowledge graphs. ICDE 2016: 1350-1353
  • 杜小勇, 陈峻, 陈跃国: Research on Exploratory Search over Big Data (in Chinese). 通信学报 (Journal on Communications) 36(12): 77-88 (2015)
  • Chen Liu, Dongxiang Zhang, Yueguo Chen: Personalized Knowledge Visualization in Twitter. ER 2015: 409-423
  • 杜小勇, 陈跃国, 覃雄派: Big Data and OLAP Systems (in Chinese). 大数据 (Big Data Research), 1(1) 5:1-13 (2015)
  • Yueguo Chen, Xiongpai Qin, Haoqiong Bian, Jun Chen, Zhaoan Dong, Xiaoyong Du, Yanjie Gao, Dehai Liu, Jiaheng Lu, Huijie Zhang: A Study of SQL-on-Hadoop Systems. BPOE@ASPLOS/VLDB 2014: 154-166
  • Yueguo Chen, Lexi Gao, Shuming Shi, Xiaoyong Du, Ji-Rong Wen: Improving Context and Category Matching for Entity Search. AAAI 2014: 16-22
  • Haoqiong Bian, Yueguo Chen, Xiaoyong Du, Xiaolu Zhang: MetKB: enriching RDF knowledge bases with web entity-attribute tables. CIKM 2013: 2461-2464
  • Chuitian Rong, Wei Lu, Xiaoli Wang, Xiaoyong Du, Yueguo Chen, Anthony K. H. Tung: Efficient and Scalable Processing of String Similarity Join. IEEE Trans. Knowl. Data Eng. 25(10): 2217-2230 (2013)
  • 杜方, 陈跃国, 杜小勇: Query Processing Techniques for RDF Data (in Chinese). 软件学报 (Journal of Software), 2013(06): 1222-1242
  • Jiajiu Liu, Zi Huang, Hong Cheng, Yueguo Chen, Heng Tao Shen, Yanchun Zhang. Presenting Diverse Location Views with Real-time Near-duplicate Photo Elimination. ICDE 2013: 505-516
  • Xiaolu Zhang, Yueguo Chen, Jinchuan Chen, Xiaoyong Du, Lei Zou. Mapping Entity-Attribute Web Tables to Web-Scale Knowledge Bases. DASFAA 2013: 108-122
  • Yu Sun, Jin Huang, Yueguo Chen, Rui Zhang and Xiaoyong Du. Location Selection for Utility Maximization with Capacity Constraints. CIKM 2012: 2154-2158
  • Yueguo Chen, Bin Cui, Xiaoyong Du, Anthony K. H. Tung: Efficient approximation of the maximal preference scores by lightweight cubic views. EDBT 2012: 240-251
  • Yueguo Chen, Ke Chen, Mario A. Nascimento. Effective and Efficient Shape-Based Pattern Detection over Streaming Time Series. IEEE Transactions on Knowledge and Data Engineering, 24(2): 265-278 (2012)
  • Fang Du, Yueguo Chen, Xiaoyong Du. Partitioned Index for Efficient SPARQL Query Processing of RDF Data. DASFAA 2012: 141-155
  • Yueguo Chen, Wei Wang, Xiaoyong Du, Xiaofang Zhou. Continuously monitoring the correlations of massive discrete streams. CIKM 2011: 1571-1576
  • Yueguo Chen, Gang Chen, Ke Chen, Beng Chin Ooi: Efficient Processing of Warping Time Series Join of Motion Capture Data. ICDE 2009: 1048-1059
  • Yueguo Chen, Su Chen, Yu Gu, Mei Hui, Feng Li, Chen Liu, Liangxu Liu, Beng Chin Ooi, Xiaoyan Yang, Dongxiang Zhang, Yuan Zhou: MarcoPolo: a community system for sharing and integrating travel information on maps. EDBT 2009: 1148-1151
  • Yueguo Chen, Shouxu Jiang, Beng Chin Ooi, Anthony K. H. Tung: Querying Complex Spatio-Temporal Sequences in Human Motion Databases. ICDE 2008: 90-99
  • Yueguo Chen, Mario A. Nascimento, Beng Chin Ooi, Anthony K. H. Tung: SpADe: On Shape-based Pattern Detection in Streaming Time Series. ICDE 2007: 786-795