Brief Review of the Article — MAT: Mask-Aware Transformer for Large Hole Image Inpainting

Murat Çelik
3 min read · Oct 23, 2023

--

Image inpainting research faces several difficulties. The main ones are the model's ability to understand the semantic structures in images, high computational costs even at low resolution, the size of the holes, and the limited intelligibility of the masked image. Targeting these difficulties, this study introduces MAT, a novel inpainting model.

Visual examples with different style representations.

MAT is the first transformer-based inpainting model that works directly with high-resolution images. With the introduced multi-head contextual attention, it models long-range dependencies while attending only to the visible (valid) tokens. With the modifications made to the transformer blocks, training is more stable on large-hole masks. Through style manipulation, it produces multiple, diverse outputs.
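The core idea of attending only to valid tokens can be sketched as follows. This is a toy single-head NumPy version for intuition only, not the paper's implementation: the real model uses multi-head attention within shifted windows and dynamically updates the mask as holes get filled.

```python
import numpy as np

def mask_aware_attention(x, token_mask):
    """Toy single-head sketch of mask-aware attention.

    x:          (N, C) token features
    token_mask: (N,) 1 = valid (visible) token, 0 = invalid (hole) token
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # (N, N) pairwise similarities
    # Invalid tokens are excluded as keys: their columns get -inf-like scores.
    scores = np.where(token_mask[None, :] == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # outputs mix valid tokens only
```

Because invalid keys receive effectively zero attention weight, features inside the hole cannot contaminate the aggregation for other positions.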

The proposed Mask-Aware Transformer (MAT) for pluralistic inpainting architecture

The focus is on pluralistic generation, since there are many plausible ways to fill a large hole. As seen in the architecture figure, the model consists of a convolutional head, a transformer body, a convolutional tail, and a style manipulation module.

The model can be divided into two stages. The first stage starts with the convolutional head, forms its body with transformer blocks, and ends with the convolutional tail; this stage fills the hole with semantic integrity. The second stage applies both refinement and style manipulation with a Conv-U-Net.

In the first stage, the convolutional head takes the masked image and the mask and produces feature maps at 1/8 of the input resolution. The purpose of this phase is to ease optimization and obtain good visual representations by capturing local semantics early on; the fast downsampling also reduces computational cost. Token processing then proceeds through five transformer blocks, which attend to the valid positions in the incoming feature map to propagate information accurately and stably toward the output. The general transformer architecture is modified: Layer Normalization is removed, and feature fusion is used in place of the usual residual connection. These changes reduce the influence of invalid tokens and make better use of valid ones. Finally, in the tail, noise is injected and the weights are modulated accordingly for style manipulation, which is important for increasing the variety of outputs.
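The block modification can be illustrated with a minimal sketch. Here `w_fuse` is a hypothetical learned (2C, C) projection introduced only for illustration: concatenation followed by a projection replaces the fixed residual sum, so the network can weight the two branches freely rather than always adding them.

```python
import numpy as np

def fusion_block(x, attn_out, w_fuse):
    """Toy sketch of the adjusted block: Layer Normalization is removed,
    and the residual sum is replaced by feature fusion, i.e. concatenation
    followed by a learned projection.

    x:        (N, C) input token features
    attn_out: (N, C) attention output
    w_fuse:   (2C, C) hypothetical fusion projection weights
    """
    fused = np.concatenate([x, attn_out], axis=-1)  # (N, 2C): fuse, don't add
    return fused @ w_fuse                           # project back to (N, C)
```

Unlike a hard-wired `x + attn_out`, the learned projection can suppress either branch, which helps when one branch carries mostly invalid-token information.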
In the second stage, the Conv-U-Net refines the generated image and brings the output to the desired size and high resolution.
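A Conv-U-Net combines downsampling, bottleneck processing, and upsampling, with skip connections that carry fine detail past the bottleneck. A deliberately tiny 1-D sketch of that pattern (not the paper's network):

```python
import numpy as np

def tiny_unet(x):
    """Toy 1-D U-Net sketch: downsample, process, upsample, then fuse
    with the encoder feature via a skip connection. x: (N,) with N even."""
    enc = x                               # encoder feature (skip source)
    down = x.reshape(-1, 2).mean(axis=1)  # 2x downsample (average pool)
    mid = down * 0.5                      # stand-in for bottleneck processing
    up = np.repeat(mid, 2)                # 2x upsample (nearest neighbor)
    return up + enc                       # skip connection restores detail
```

The skip connection is what lets the refinement stage keep high-frequency detail that would otherwise be lost in the downsampled bottleneck.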

The model is trained with perceptual and adversarial losses. The experiments used the entire CelebAMask-HQ dataset and approximately 1 million images from the Places365 dataset.
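The two loss terms can be sketched minimally as follows. In practice the perceptual loss compares features extracted by a fixed pretrained network (e.g. VGG activations) and the adversarial signal comes from a jointly trained discriminator; the NumPy functions below only show the arithmetic shape of each term.

```python
import numpy as np

def perceptual_loss(feat_real, feat_fake):
    """L2 distance between feature maps of real and generated images,
    computed in a fixed feature space."""
    return float(((feat_real - feat_fake) ** 2).mean())

def adversarial_g_loss(d_fake_logits):
    """Non-saturating generator loss: -log sigmoid(D(fake)),
    written as softplus(-logits) for numerical stability."""
    return float(np.mean(np.log1p(np.exp(-d_fake_logits))))
```

The perceptual term pushes the generated content toward semantic similarity with the target, while the adversarial term pushes it toward the distribution of realistic images.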

The results were scored with several metrics alongside models from other studies. MAT is competitive with CoModGAN and LaMa, and it achieves strong results compared to the other models.
