Projects

The first reference-free evaluation metric for visual storytelling
It remains unclear whether conventional automatic evaluation metrics for text generation are applicable to Visual Storytelling (VIST). We collect VHED (VIST Human Evaluation Data), the first dataset to re-purpose human evaluation results for automatic evaluation, and use it to develop VRank (VIST Ranker), a novel reference-free VIST metric. Experimental results show that VRank's predictions align with human evaluation significantly better than other metrics, achieving almost 30% higher accuracy when ranking story pairs. Moreover, only VRank exhibits human-like behavior: it reliably identifies the better story when the quality gap between two stories is large.
[Paper Link] (ACL-IJCNLP'22)
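
A minimal sketch of the core idea, a pairwise ranker learned from human preference pairs; the architecture, dimensions, and data below are placeholders, not the released VRank code:

```python
# Sketch: train a story scorer so the human-preferred story ranks higher.
import torch
import torch.nn as nn

class StoryScorer(nn.Module):
    """Scores a story embedding; higher = better (hypothetical architecture)."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, story_emb):                 # (batch, dim)
        return self.mlp(story_emb).squeeze(-1)    # (batch,)

scorer = StoryScorer()
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-4)
loss_fn = nn.MarginRankingLoss(margin=0.1)

# Dummy batch: embeddings of the human-preferred story and the other story.
better, worse = torch.randn(8, 768), torch.randn(8, 768)
target = torch.ones(8)                            # "first argument should rank higher"
optimizer.zero_grad()
loss = loss_fn(scorer(better), scorer(worse), target)
loss.backward()
optimizer.step()
```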

Modeling Storylines for Visual Storytelling
Writing a coherent and engaging story is not easy, especially for automated visual storytelling models. We introduce PR-VIST, a framework that first represents the input image sequence as a story graph and finds the best path through it to form a storyline. PR-VIST then takes this path and learns to generate and refine the final story via a Transformer with a human-like discriminator. This framework produces stories that are superior in terms of diversity, coherence, and humanness, per both automatic and human evaluations. An ablation study shows that both plotting and reworking contribute to the model's superiority.
[Paper Link] (ACL-IJCNLP'21 Findings)
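
A minimal sketch of the plotting step under toy assumptions: candidate terms per image and a stand-in edge score replace PR-VIST's learned story graph, and the storyline is simply the highest-scoring path:

```python
# Sketch: choose a storyline as the best-scoring path through per-image terms.
import itertools

# One candidate-term list per image in the sequence (toy data).
candidates = [["park", "family"], ["picnic", "dog"], ["sunset", "walk"]]

def edge_score(a, b):
    # Stand-in for a learned relation score between consecutive terms.
    return -abs(len(a) - len(b))

def best_storyline(candidates):
    best_path, best_score = None, float("-inf")
    for path in itertools.product(*candidates):
        score = sum(edge_score(a, b) for a, b in zip(path, path[1:]))
        if score > best_score:
            best_path, best_score = path, score
    return list(best_path), best_score

print(best_storyline(candidates))
```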

Getting Flexible With Visual Stories
Most visual storytelling models are limited to generating stories of a fixed length, and these fixed-length stories carry limited detail and give readers ambiguous textual information. We therefore propose Stretch-VST, which generates prolonged stories by adding appropriate knowledge. The framework distills representative terms from a sequence of images and finds appropriate relations between terms on a knowledge graph via a scoring function. We also design a length-controlled Transformer to generate stories of diverse lengths with better focus and detail than the state of the art.
[Video], [Paper Link] (ACL-IJCNLP'21 Demo)
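
A minimal sketch of the knowledge-enrichment step, using a toy knowledge graph and made-up scores rather than the actual pipeline: the best-scoring relation between consecutive terms is added as extra knowledge for the length-controlled generator:

```python
# Sketch: score candidate relations between story terms on a toy graph.
toy_kg = {
    ("beach", "sunset"): [("AtLocation", 0.6), ("RelatedTo", 0.8)],
    ("family", "picnic"): [("CapableOf", 0.5), ("RelatedTo", 0.9)],
}

def best_relation(a, b):
    edges = toy_kg.get((a, b), [])
    return max(edges, key=lambda e: e[1]) if edges else None

terms = ["family", "picnic", "beach", "sunset"]
enriched = []
for a, b in zip(terms, terms[1:]):
    rel = best_relation(a, b)
    enriched.append((a, rel[0] if rel else None, b))
print(enriched)   # e.g. [('family', 'RelatedTo', 'picnic'), ...]
```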

Conversational Visual Question Generation
Explored a novel scenario: a conversational agent views a set of the user's photos and asks an engaging question to initiate a conversation with the user. Introduced a two-phase framework that first generates a visual story for the photo set and then uses the story to produce an interesting question. Human evaluation shows that our framework generates more response-provoking, conversation-starting questions than other vision-to-question baselines.
[Paper Link] (AAAI'21 workshop)
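
A minimal sketch of the two-phase pipeline with placeholder models (not the paper's implementation): a story is generated from the photos, then a question from the story:

```python
# Sketch: photos -> story -> conversation-starting question.
def generate_story(photos):
    return "The family spent the afternoon hiking up the ridge."  # stand-in for a VIST model

def generate_question(story):
    return "What was the best moment of that hike?"               # stand-in for a question generator

photos = ["img1.jpg", "img2.jpg", "img3.jpg"]
story = generate_story(photos)
print(generate_question(story))
```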

Multi-modal Dialog System
Proposed a multi-step joint-modality attention network based on recurrent neural networks to reason over multiple modalities, including audio, vision, and language. The model jointly considers visual and textual representations in each reasoning step to better integrate information from dynamic scenes.
[Paper Link] (IEEE/ACM TASLP), [Paper Link] (AAAI'20 workshop)
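
A minimal sketch of one joint-modality reasoning step under assumed tensor shapes (not the published model): a query attends over concatenated audio, visual, and textual features and is updated residually:

```python
# Sketch: one reasoning step attending jointly over three modalities.
import torch
import torch.nn.functional as F

def joint_modality_step(query, audio, vision, text):
    """query: (B, d); each modality: (B, T, d). Returns an updated query."""
    memory = torch.cat([audio, vision, text], dim=1)         # (B, T_all, d)
    scores = torch.einsum("bd,btd->bt", query, memory)       # attention logits
    weights = F.softmax(scores / memory.size(-1) ** 0.5, dim=-1)
    context = torch.einsum("bt,btd->bd", weights, memory)
    return query + context                                   # residual update

B, d = 2, 64
q = torch.randn(B, d)
for _ in range(3):                                           # multi-step reasoning
    q = joint_modality_step(q, torch.randn(B, 5, d), torch.randn(B, 7, d), torch.randn(B, 9, d))
print(q.shape)   # torch.Size([2, 64])
```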

Multiview Items Recommendation
Developed a GNN-based recommendation model that provides superior recommendations by describing items from both user and entity angles. Designed user-oriented modules that aggregate features to make personalized recommendations, and a mixing layer that contrasts layer-wise GCN outputs to obtain comprehensive features from internal entity-entity interactions.
[Paper Link] (SIGIR'20)
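
A minimal sketch of the user-oriented aggregation idea with toy tensors (not the published model): an item's neighbors are weighted by their relevance to the user before being merged into the item representation:

```python
# Sketch: weight an item's neighbors by their relevance to the user.
import torch
import torch.nn.functional as F

def user_oriented_aggregate(user_emb, item_emb, neighbor_embs):
    """user_emb: (d,), item_emb: (d,), neighbor_embs: (n, d)."""
    weights = F.softmax(neighbor_embs @ user_emb, dim=0)   # user-specific weights
    neighborhood = weights @ neighbor_embs                 # (d,)
    return item_emb + neighborhood                         # user-aware item view

d, n = 16, 4
enriched = user_oriented_aggregate(torch.randn(d), torch.randn(d), torch.randn(n, d))
print(enriched.shape)   # torch.Size([16])
```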

Stage-Wise Training for GNN-based Recommender Model
Applied stage-wise training to two state-of-the-art recommendation models, RippleNet and Knowledge Graph Convolutional Networks (KGCN), and evaluated performance on six real-world datasets. The experiments showed that the stage-wise training strategy helps both models collect more information from the knowledge graph (KG) and improves recommendation performance.
[Paper Link]
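
A minimal sketch of the stage-wise strategy with a hypothetical trainer: each stage warm-starts from the previous stage's weights while widening the KG neighborhood the model sees:

```python
# Sketch: stage-wise training loop with warm starts between stages.
def train_stage(model_state, kg_hops, epochs):
    # Stand-in for training a KG-based recommender (e.g., RippleNet/KGCN) for one stage.
    print(f"training with {kg_hops}-hop KG neighborhood for {epochs} epochs")
    return dict(model_state, hops=kg_hops)   # pretend-updated weights

state = {"init": True}
for stage, hops in enumerate([1, 2, 3], start=1):
    state = train_stage(state, kg_hops=hops, epochs=5)   # warm start each stage
print(state)
```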

Luminance Variation Resistant Remote-PPG
Collected a dataset of drivers' faces (2.7M continuous images) in different outdoor scenarios, including daytime and nighttime. Developed an Adaptive Neural Network Model Selection algorithm to dynamically select a personalized model and eliminate facial luminance variation noise from the rPPG signal. This work reduced the mean absolute error from 14.71 bpm to 4.51 bpm.
[Paper Link] (IEEE Access), [Demo Video]
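
A minimal sketch of the adaptive selection idea with made-up thresholds (not the deployed system): the mean facial luminance decides which personalized model handles the current frame:

```python
# Sketch: pick an rPPG model based on the face crop's luminance level.
import numpy as np

model_bank = {"low_light": "night_model", "normal": "day_model", "bright": "glare_model"}

def select_model(face_roi):
    """face_roi: HxWx3 uint8 crop of the driver's face (BGR order assumed)."""
    luminance = np.mean(0.299 * face_roi[..., 2] + 0.587 * face_roi[..., 1] + 0.114 * face_roi[..., 0])
    if luminance < 60:
        return model_bank["low_light"]
    if luminance < 180:
        return model_bank["normal"]
    return model_bank["bright"]

frame = np.full((64, 64, 3), 200, dtype=np.uint8)   # dummy bright frame
print(select_model(frame))                          # glare_model
```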

Motion Robust Remote-PPG
Built a face-tracking algorithm to extract the heart-rate signal from the driver's face in a continuous image sequence. Developed a machine learning approach to eliminate rPPG noise caused by the driver's facial motion. This work is the first of its kind, as traditional rPPG work considers only indoor, stable environments.
[Paper Link] (ACCV'16 workshop)
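
A minimal sketch of the rPPG extraction step on a synthetic signal (not the published pipeline): the heart rate is read off as the dominant frequency of the tracked face region's mean green-channel trace:

```python
# Sketch: estimate heart rate from a green-channel trace via FFT peak picking.
import numpy as np

fps = 30.0
t = np.arange(0, 10, 1 / fps)                       # 10 s of frames
green_trace = 0.5 * np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)  # ~72 bpm + noise

spectrum = np.abs(np.fft.rfft(green_trace - green_trace.mean()))
freqs = np.fft.rfftfreq(green_trace.size, d=1 / fps)
band = (freqs > 0.7) & (freqs < 4.0)                # plausible heart-rate band (42-240 bpm)
bpm = 60 * freqs[band][np.argmax(spectrum[band])]
print(f"estimated heart rate: {bpm:.1f} bpm")       # ~72 bpm
```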