VISUAL TEXT CORRECTION

Project Overview

This project builds on the paper Visual Text Correction by Amir Mazaheri and Mubarak Shah [1]. Given a video clip and its textual description, our proposed framework will detect the inaccurate word in the description and replace it with a more proper one. Topics such as constructing a more realistic dataset and improving on an existing NLP result will be explored and discussed in our project.

Summary of previous work

The foundation of our project is the paper by Amir Mazaheri and Mubarak Shah. In that paper, a system uses two modules, an inaccuracy detection module and a correct word prediction module, to find the most inaccurate word in a video description and replace it with a more proper one. The inaccuracy detection module uses a convolutional n-gram network and an LSTM to extract textual information, and a gating module to incorporate visual information; a combination of the two yields the prediction of the most inaccurate word. For the correct word prediction module, the system uses text encoding and video encoding to predict the replacement word.
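As a rough illustration of how such a detection module could be wired together, the PyTorch sketch below runs a convolutional n-gram encoder and a bidirectional LSTM over word embeddings, then gates the fused textual features with a visual feature vector. All layer sizes, the additive fusion of the two textual encodings, and the sigmoid gate are our own assumptions for illustration, not the exact architecture of the paper.

import torch
import torch.nn as nn

class InaccuracyDetector(nn.Module):
    """Sketch of an inaccuracy detection module (all sizes are assumptions)."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, vis_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Convolutional n-gram encoder: a 1D convolution over word embeddings
        # captures local context (kernel_size=3 corresponds to trigrams).
        self.ngram_conv = nn.Conv1d(emb_dim, hidden_dim, kernel_size=3, padding=1)
        # Bidirectional LSTM captures longer-range textual dependencies.
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        # Gating module: visual features modulate each word's representation.
        self.visual_gate = nn.Linear(vis_dim, hidden_dim)
        # Per-word score; the highest-scoring word is flagged as inaccurate.
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, tokens, visual_feats):
        # tokens: (batch, seq_len); visual_feats: (batch, vis_dim)
        e = self.embed(tokens)                                    # (B, T, E)
        ngram = torch.relu(self.ngram_conv(e.transpose(1, 2))).transpose(1, 2)
        lstm_out, _ = self.lstm(e)                                # (B, T, H)
        text = ngram + lstm_out               # fuse the two textual encodings
        gate = torch.sigmoid(self.visual_gate(visual_feats))     # (B, H)
        gated = text * gate.unsqueeze(1)      # apply visual gating per word
        scores = self.scorer(gated).squeeze(-1)                  # (B, T)
        return scores.argmax(dim=1), scores

# Example with random inputs: 2 captions of 12 tokens, 2048-d video features.
detector = InaccuracyDetector(vocab_size=10000)
flagged, scores = detector(torch.randint(0, 10000, (2, 12)), torch.randn(2, 2048))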

Potential solutions

Approach 1: Generate more realistic inaccurate descriptions

The original false dataset is created only according to word frequency, by swapping words in a sentence with words of the same part of speech. As a result, some extremely unrealistic examples, such as "swimming in the kitchen", are added to the dataset. Thus, one of our approaches is to improve the false dataset by adding more metrics to its construction. We can build the false dataset by replacing the original word with words that are more coherent with the location and movement shown in the video, as sketched below. In addition, the possible replacements can be organized in a tree structure from which the false descriptions are generated.
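The sketch below illustrates this idea in Python, using NLTK for part-of-speech tagging (the punkt and averaged_perceptron_tagger data must be downloaded first). The coherence_score function is a hypothetical placeholder; in practice it could measure embedding similarity between a candidate word and the video's detected location and movement labels.

import random
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data

def coherence_score(candidate, context_labels):
    """Hypothetical placeholder: how plausible `candidate` is given the
    video's location/movement labels, e.g. an embedding similarity.
    Stubbed with a random value here."""
    return random.random()

def make_false_description(sentence, vocab_by_pos, context_labels,
                           min_coherence=0.5):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    # Only consider positions for which we have same-POS replacements.
    positions = [i for i, (w, p) in enumerate(tagged) if vocab_by_pos.get(p)]
    if not positions:
        return None
    idx = random.choice(positions)
    word, pos = tagged[idx]
    # Keep candidates that share the original word's part of speech and are
    # coherent with the video context, so that implausible swaps such as
    # "swimming in the kitchen" are filtered out.
    candidates = [w for w in vocab_by_pos[pos]
                  if w != word and
                  coherence_score(w, context_labels) >= min_coherence]
    if not candidates:
        return None  # in practice, retry with another position
    tokens[idx] = random.choice(candidates)
    return " ".join(tokens), idx

# Example: swap the verb of a kitchen scene with a context-coherent one.
vocab = {"VBG": ["cooking", "chopping", "stirring", "washing"]}
print(make_false_description("a man is cooking in the kitchen",
                             vocab, context_labels=["kitchen"]))

Filtering candidates by coherence keeps the replacement grammatical while ruling out implausible combinations, which should produce harder and more realistic negatives for the detection module.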
