Award-Winning Student Papers from the Third Daoyuan Academic Forum

The Third Daoyuan Academic Forum for Doctoral Students and Postdoctoral Fellows was successfully held on January 13. The forum aims to provide a broad platform for exchange that encourages debate, the sharing of experience, and deeper collaboration, and thereby to strengthen the academic atmosphere, innovative capacity, research quality, and research output of the doctoral and postdoctoral community. This year's forum was jointly organized by the Shenzhen Research Institute of Big Data and The Chinese University of Hong Kong, Shenzhen. The call for papers opened in November 2023 and drew active participation and support from doctoral students and postdoctoral fellows across the university's schools, including the School of Data Science, the School of Science and Engineering, and the School of Medicine, as well as from the Shenzhen Research Institute of Big Data.

At the forum, Yingru Li from the School of Science and Engineering (SSE) and Junwen Qiu from the School of Data Science (SDS) won first and second prize, respectively, in the oral presentation category. Shaokui Wei and Ziniu Li, both from SDS, won first and second prize, respectively, in the poster presentation category.

The award winners' papers are summarized below.

In this poster presentation, we study backdoor attacks on machine learning models and introduce a novel method for purifying a backdoored model using a small clean dataset. In a backdoor attack, an adversary poisons part of the training set so that the resulting model behaves normally on benign data but maps inputs carrying a specific trigger to target classes. Our research connects the backdoor risk with the adversarial risk and derives a new upper bound that focuses on shared adversarial examples (SAEs) between the backdoored and purified models. Leveraging this insight, we formulate a bi-level optimization problem that mitigates backdoors using adversarial training techniques. The proposed approach, Shared Adversarial Unlearning (SAU), operates in two stages: first, it generates SAEs; then, it unlearns these SAEs so that they are either correctly classified by the purified model or given different predictions by the two models. This process mitigates the backdoor effect in the original model. Empirical evaluations across multiple benchmark datasets and network architectures demonstrate that SAU achieves state-of-the-art performance in defending against backdoor attacks.
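The notion of a shared adversarial example can be made concrete with a small sketch. Going by the unlearning goal stated above, it assumes that a perturbed input counts as shared when both models assign it the same wrong label. The function name `shared_adversarial_mask` and the label arrays are hypothetical and used only for illustration; they are not taken from the authors' code.

```python
import numpy as np

def shared_adversarial_mask(pred_backdoored, pred_purified, labels):
    """Flag perturbed inputs that fool both models in the same way.

    Assumption: an example is "shared" when the backdoored and purified
    models agree on a prediction and that prediction differs from the
    true label.
    """
    pred_backdoored = np.asarray(pred_backdoored)
    pred_purified = np.asarray(pred_purified)
    labels = np.asarray(labels)
    same_prediction = pred_backdoored == pred_purified
    wrong_prediction = pred_purified != labels
    return same_prediction & wrong_prediction

# Hypothetical predictions on four perturbed inputs.
labels = [0, 1, 2, 1]
pred_backdoored = [1, 1, 0, 2]
pred_purified = [1, 1, 2, 0]
mask = shared_adversarial_mask(pred_backdoored, pred_purified, labels)
```

Only the first input is flagged here: both models predict class 1 while the true label is 0. An unlearning step in this spirit would then push the purified model either to classify that input correctly or to disagree with the backdoored model on it.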

Under resource constraints, reinforcement learning (RL) agents deployed in complex environments need to be simple, efficient, and scalable with respect to (1) large state spaces and (2) the ever-growing data of accumulated interactions. We propose HyperAgent, an RL framework that combines a hypermodel, index sampling schemes, and an incremental update mechanism, enabling computation-efficient sequential posterior approximation and data-efficient action selection under general value function approximation beyond conjugacy. HyperAgent is simple to implement: it adds only one module and one line of code to DDQN.

Practically, HyperAgent demonstrates robust performance on large-scale deep RL benchmarks, with significant efficiency gains in both data and computation. Theoretically, among practically scalable algorithms, HyperAgent is the first to achieve provably scalable per-step computational complexity as well as sublinear regret in tabular RL. The core of our theoretical analysis is the sequential posterior approximation argument. This is made possible by the first analytical tool for sequential random projection, a non-trivial martingale extension of the Johnson-Lindenstrauss lemma, which is of independent interest.

This work bridges the theoretical and practical realms of RL, establishing a new benchmark for RL algorithm design.
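To illustrate the hypermodel and index sampling ideas named in the HyperAgent abstract, the sketch below uses a toy linear hypermodel on a tabular problem. Everything in it is an assumption made for illustration: the class `LinearHypermodel`, the function `select_action`, the Gaussian index distribution, and the linear form of the value estimate are hypothetical and are not the authors' implementation. The incremental update mechanism is not shown.

```python
import numpy as np

class LinearHypermodel:
    """Toy hypermodel: Q(s, a, xi) = mean[s, a] + spread[s, a, :] @ xi.

    Assumption: a random index xi selects one plausible value function,
    so different indices stand in for different posterior samples.
    """

    def __init__(self, mean, spread):
        self.mean = np.asarray(mean, dtype=float)      # shape (S, A)
        self.spread = np.asarray(spread, dtype=float)  # shape (S, A, M)

    def q_values(self, state, index):
        # Value estimate of every action in `state` under one index.
        return self.mean[state] + self.spread[state] @ index

def select_action(model, state, rng):
    """Index sampling: draw one index, then act greedily under it."""
    index_dim = model.spread.shape[-1]
    index = rng.standard_normal(index_dim)
    return int(np.argmax(model.q_values(state, index)))

# One state, two actions, index dimension two.
model = LinearHypermodel(mean=[[0.0, 0.0]],
                         spread=[[[1.0, 0.0], [0.0, 1.0]]])
q = model.q_values(0, np.array([2.0, -1.0]))

# With zero spread the index has no effect and the choice is greedy.
certain = LinearHypermodel(mean=[[0.0, 1.0]], spread=np.zeros((1, 2, 3)))
action = select_action(certain, 0, np.random.default_rng(0))
```

In this toy setup, a larger spread for an action makes its sampled value vary more across indices, so the agent tries it more often; shrinking the spread as data accumulates would reduce that exploration.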

In this work, we present an unbiased stochastic proximal gradient method, namely the normal map-based algorithm (nor-SGD), for nonsmooth nonconvex composite-type optimization problems, and we study its convergence properties. Using a time window-based strategy and a suitable merit function, we first analyze the global convergence behavior of nor-SGD and show that every accumulation point of the generated sequence of iterates is a stationary point almost surely and in an expectation sense. These results hold under standard assumptions and extend the more limited convergence guarantees of the basic proximal stochastic gradient method. In addition, based on the well-known Kurdyka-Lojasiewicz (KL) analysis framework, we provide novel point-wise convergence results for the iterates generated by nor-SGD. We also derive convergence rates that depend on the KL exponent and the step size dynamics. The obtained rates are faster than existing convergence rates for SGD in the nonconvex setting. The techniques studied in this work can potentially be applied to other families of stochastic and simulation-based algorithms.
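A normal map-based step can be sketched on a toy problem. The abstract does not state the update rule, so the sketch assumes the standard normal map F(z) = grad f(x) + (z - x) / lam with x = prox(z), and takes plain stochastic steps on the auxiliary variable z. The function names, the L1 regularizer, and the constant step size are illustrative assumptions, not the authors' setup.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def normal_map_sgd(stoch_grad, z0, lam, mu, step_size, num_steps):
    """Sketch of a normal map-based stochastic method for
    min_x f(x) + mu * ||x||_1.

    Assumed update: z <- z - step_size * (g + (z - x) / lam), where
    x = prox(z) and g is a stochastic gradient of f evaluated at x.
    """
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(num_steps):
        x = soft_threshold(z, lam * mu)   # proximal point of z
        g = stoch_grad(x)                 # (stochastic) gradient of f at x
        z = z - step_size * (g + (z - x) / lam)
    return soft_threshold(z, lam * mu)

# Toy problem: f(x) = 0.5 * ||x - b||^2, with an exact gradient so the
# run is deterministic; a noisy oracle would add zero-mean noise here.
b = np.array([3.0, -0.2, 1.0])
x_final = normal_map_sgd(lambda x: x - b, np.zeros(3),
                         lam=1.0, mu=0.5, step_size=0.5, num_steps=200)
```

For this toy objective the minimizer is the soft-thresholding of `b` at level 0.5, which the iteration recovers. Note that the gradient is evaluated at the proximal point `x`, while the step is taken on `z`.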

Alignment is crucial for training large language models. The predominant strategy is Reinforcement Learning from Human Feedback (RLHF), with Proximal Policy Optimization (PPO) as the de facto algorithm. Yet PPO is known to be computationally inefficient, a challenge that this paper aims to address. We identify three important properties of RLHF tasks that PPO does not exploit: fast simulation, deterministic transitions, and trajectory-level rewards. Based on these properties, we develop ReMax, a new algorithm tailored for RLHF. The design of ReMax builds on the celebrated REINFORCE algorithm and enhances it with a new variance-reduction technique.

ReMax offers three advantages over PPO. First, it is simple to implement, requiring just 6 lines of code, and it eliminates more than 4 hyper-parameters in PPO that are laborious to tune. Second, ReMax reduces memory usage by removing the need for the value model used in PPO. To illustrate, PPO runs out of memory when directly fine-tuning a Llama2-7B model on A100-80GB GPUs, whereas ReMax can support the training. Even when memory-efficient techniques (e.g., ZeRO and offload) are employed to make PPO training feasible, ReMax can use a larger batch size to increase throughput. Third, in terms of wall-clock time, PPO takes about twice as long as ReMax per iteration. Importantly, these improvements do not sacrifice task performance.
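The REINFORCE-with-baseline idea behind ReMax can be sketched with a toy policy. The abstract does not describe the variance-reduction technique, so the sketch assumes it is a baseline equal to the reward of the greedy (argmax) response, subtracted from the reward of the sampled response. The toy policy uses independent per-position logits, and the names `remax_gradient` and `reward_fn` are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def remax_gradient(theta, sampled, reward_fn):
    """REINFORCE gradient estimate with a greedy-response baseline.

    theta:     logits of a toy policy, shape (T, V), one row per position
    sampled:   tokens of the sampled response, shape (T,)
    reward_fn: maps a whole response to one scalar (trajectory-level)
    """
    probs = softmax(theta)
    greedy = probs.argmax(axis=-1)        # greedy response under the policy
    advantage = reward_fn(sampled) - reward_fn(greedy)
    one_hot = np.eye(theta.shape[1])[sampled]
    score = one_hot - probs               # gradient of sum_t log pi(a_t)
    return advantage * score              # ascent direction for theta

# Toy reward: number of positions that match a target response.
target = np.array([0, 1])
reward_fn = lambda tokens: float(np.sum(np.asarray(tokens) == target))

theta = np.zeros((2, 3))                  # uniform policy, greedy = [0, 0]
grad = remax_gradient(theta, np.array([0, 1]), reward_fn)
grad_same = remax_gradient(theta, np.array([0, 0]), reward_fn)
```

Under this assumption, no value model is needed: the baseline costs one extra greedy generation and one extra reward evaluation. When the sampled response equals the greedy one, the estimate is exactly zero.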
