Automatic Image Colorization Using Machine Learning

Image colorization is the process of converting a black-and-white (grayscale) image into a color image. In this project we compare several methods of automatically colorizing images using neural networks.

1. Introduction and Background

Image colorization has traditionally required a large amount of human effort. The goal of this project is to automate the process and to generate results that can fool the human eye. Automatic image colorization often relies on a class of convolutional neural network (CNN) architectures called autoencoders. These networks distill the salient features of an image and then regenerate the image from the learned features. We will compare several variants of image colorization networks built on autoencoders.

We first give a brief introduction to autoencoders. We then introduce color spaces, which, when chosen properly, can improve the training and performance of the neural network. Finally, we give an overview of the vanilla image colorization model.

Autoencoders

Autoencoders were originally designed for unsupervised machine learning, to learn an efficient representation of an image or other type of data. An autoencoder is composed of two parts: the encoder and the decoder. The encoder, through a series of convolutional layers and downsampling, learns a reduced-dimensional representation of the input data. The decoder then, through convolutional layers and upsampling, attempts to regenerate the data from this representation. If a well-trained decoder can regenerate data that is identical, or as close as possible, to the original input, this shows that the encoder has successfully found a compressed representation of the original input data.
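
To make the encoder/decoder split concrete, here is a minimal, purely illustrative fully connected autoencoder in Keras (the layer sizes are arbitrary and unrelated to the colorization model used later):

In [ ]:
from keras.layers import Input, Dense
from keras.models import Model

# encoder: compress a flattened 28x28 image into a 32-dimensional code
inputs = Input(shape=(784,))
code = Dense(32, activation='relu')(inputs)

# decoder: attempt to reconstruct the original 784 pixels from the code
reconstruction = Dense(784, activation='sigmoid')(code)

autoencoder = Model(inputs=inputs, outputs=reconstruction)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')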

Traditionally, autoencoders are used via the encoder output to filter noise, compress data, and perform dimensionality reduction, among other things. More recently, however, the decoder is increasingly used to generate novel data (as a generative model): if a well-trained decoder can generate accurate data from a set of representations, it should also be able to generate data from representations it has never seen before. For example, if the decoder can generate images of "clouds" and "pigs" from their representations, then, given the representation of "cloud + pig", it should be able to generate pig-shaped clouds. Applications such as image colorization and style transfer, where the style of one painting is transferred to another, therefore often have autoencoders embedded inside a deep neural network.

Color Spaces

Color spaces are representations of color. Many color spaces use 3 channels, whose values are combined to reproduce the color of the object. One of the most commonly used color spaces is RGB, in which the "red", "green", and "blue" components each occupy one channel. Each channel is typically represented by 8 bits, giving 256 discrete levels, so the 3 channels combined can represent 256³ ≈ 16.8 million colors.

YCbCr Color Space

Digital image and video schemes such as JPEG and MPEG use another color space, YCbCr. YCbCr (or YCC for short) is a more efficient representation than RGB and is therefore used for digital transmission and storage. The Y channel is the luma, or intensity, channel, which is a grayscale image. The Cb and Cr channels are the blue-difference and red-difference chroma components. This color space has several benefits: since the human eye is more sensitive to luminance than to color information, the Y channel can be sent over a high-bandwidth channel or stored at high resolution, while the Cb and Cr channels can be stored at lower resolution or transmitted over a lower-bandwidth channel.

An image in RGB format can be converted to YCbCr format and vice versa using a simple affine map, i.e., a matrix multiplication plus an offset. The coefficients can be obtained from the official JPEG documentation and are reproduced here.

$$ \left(\begin{matrix}Y\\Cb\\Cr\end{matrix}\right) = \left(\begin{matrix}0.299&0.587&0.114\\-0.1687&-0.3313&0.5\\0.5&-0.4187&-0.0813\end{matrix}\right) \left(\begin{matrix}R\\G\\B\end{matrix}\right) + \left(\begin{matrix}0\\128\\128\end{matrix}\right)$$

$$ \left(\begin{matrix}R\\G\\B\end{matrix}\right) = \left(\begin{matrix}1&0&1.402\\1&-0.34414&-0.71414\\1&1.772&0\end{matrix}\right)\left[\left(\begin{matrix}Y\\Cb\\Cr\end{matrix}\right) - \left(\begin{matrix}0\\128\\128\end{matrix}\right)\right]$$

Note that in broadcast video standards such as ITU-R BT.601, the 8-bit Y channel only takes valid values from 16 to 235, and the Cb and Cr channels from 16 to 240, whereas the JPEG conversion above uses the full 0 to 255 range of each channel, as in RGB. Also, due to the structure of the YCbCr space, even if every channel individually holds a valid value, the combination of the three values may not correspond to a displayable RGB color (this is why the conversion code later clips the RGB output to [0, 255]). Careful verification is needed.
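
As a quick sanity check on the two equations above, the forward and inverse matrices should multiply to (approximately) the identity; the small residual comes from the truncated coefficients:

In [ ]:
import numpy as np

# coefficients copied from the RGB -> YCbCr and YCbCr -> RGB equations above
rgb_to_ycc = np.array([[ .299,   .587,   .114 ],
                       [-.1687, -.3313,  .5   ],
                       [ .5,    -.4187, -.0813]])
ycc_to_rgb = np.array([[1,  0,       1.402  ],
                       [1, -0.34414, -.71414],
                       [1,  1.772,   0      ]])

# should print a matrix close to the 3x3 identity
print(np.round(ycc_to_rgb.dot(rgb_to_ycc), 3))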

CIE LAB Color Space

Another commonly used color space is CIE LAB, designed by the International Commission on Illumination (CIE). The L channel codes "lightness" and ranges from black (0) to white (100). The A channel codes from green (−) to red (+), and the B channel codes from blue (−) to yellow (+); in 8-bit implementations the A/B channels range from roughly −128 to +127. The advantage of CIE LAB over other color spaces is that it is designed to approximate human vision, so the same numerical change in these values corresponds to roughly the same amount of visually perceived change.

In our project, we experiment with both the YCbCr and CIE LAB color spaces. We use these instead of RGB because, by separating out the Y/L (grayscale) component, the neural network only has to learn the remaining two chroma channels for colorization. This reduces the size of the network and speeds up convergence.
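
For reference, here is a small sketch (assuming scikit-image, which the LAB data generator below relies on, is installed) showing the LAB channel ranges and the normalization we apply during training:

In [ ]:
import numpy as np
from skimage import color

# a random RGB image scaled to [0, 1], the input range expected by skimage's rgb2lab
rgb = np.random.rand(224, 224, 3)
lab = color.rgb2lab(rgb)

# L lies in [0, 100]; a and b lie roughly in [-128, +127]
print(lab[:, :, 0].min(), lab[:, :, 0].max())
print(lab[:, :, 1:].min(), lab[:, :, 1:].max())

# normalization used later when training in the LAB space
l_norm = lab[:, :, 0] / 100.0    # lightness scaled to [0, 1]
ab_norm = lab[:, :, 1:] / 128.0  # a/b scaled to roughly [-1, 1], matching the tanh output layer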

2. Vanilla Automatic Image Colorization

We will use a vanilla autoencoder colorization model as the baseline for our comparison. The layers of the autoencoder are listed in the table below. All convolution layers use 3×3 kernels (see the model code in Section 4).

| Layer | Filters | Stride |
| --- | --- | --- |
| input | NA | NA |
| conv | 64 | 2 |
| conv | 128 | 1 |
| conv | 128 | 2 |
| conv | 256 | 1 |
| conv | 256 | 2 |
| conv | 512 | 1 |
| conv | 512 | 1 |
| conv | 256 | 1 |
| conv | 128 | 1 |
| upsample | NA | NA |
| conv | 64 | 1 |
| conv | 64 | 1 |
| upsample | NA | NA |
| conv | 32 | 1 |
| conv | 2 | 1 |
| upsample | NA | NA |
| output | NA | NA |

3. Using A Merged Model to Color Images

In vanilla autoencoders, the encoder is often not deep enough to extract the global features of the image, which are needed to decide how to color certain regions. However, if the encoder is made deep enough to extract these global features, the dimension of the representation is often too small for the decoder to faithfully reproduce the original image. To satisfy these two competing requirements, two different neural pathways can be used on the encoder side: one path to obtain the global features, and another to obtain a rich representation of the image. We introduce two such approaches here.

The first approach is by Iizuka et al. in their paper "Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification". The authors use an 8-layer encoder to extract mid-level representations of the image, but the output of the 6th layer is forked, and another 7-layer network is applied to the forked output to extract the global features of the image. The global features and the mid-level representations are then concatenated and fed into the decoder.

Another approach is by Baldassarre et al. in their paper "Deep Koalarization: Image Colorization using CNNs and Inception-ResNet-v2". The authors use an independent, pretrained deep convolutional network, Inception-ResNet-v2, to generate the global features of the image. The extracted global features are injected into the output of the encoder in the autoencoder network.

4. Model Definition

We construct a vanilla autoencoder and a merged model and compare their performance in coloring images.

Vanilla Autoencoder For Image Colorization

For our vanilla autoencoder, we use an 8-layer encoder and an 8-layer decoder.

In [ ]:
#%% baseline model
from keras.layers import Conv2D
from keras.layers import UpSampling2D
from keras.models import Model
from keras.layers import Input
from keras.utils import plot_model


def define_model_baseline(img_h, img_w):

    # encoder
    
    inputs1 = Input(shape=(img_h, img_w, 3,))
    encoder_output = Conv2D(64, (3,3), activation='relu', padding='same', strides=2)(inputs1)
    encoder_output = Conv2D(128, (3,3), activation='relu', padding='same')(encoder_output)
    encoder_output = Conv2D(128, (3,3), activation='relu', padding='same', strides=2)(encoder_output)
    encoder_output = Conv2D(256, (3,3), activation='relu', padding='same')(encoder_output)
    encoder_output = Conv2D(256, (3,3), activation='relu', padding='same', strides=2)(encoder_output)
    encoder_output = Conv2D(512, (3,3), activation='relu', padding='same')(encoder_output)
    encoder_output = Conv2D(512, (3,3), activation='relu', padding='same')(encoder_output)
    encoder_output = Conv2D(256, (3,3), activation='relu', padding='same')(encoder_output)
    
    # decoder
    
    decoder_output = Conv2D(128, (3,3), activation='relu', padding='same')(encoder_output)
    decoder_output = UpSampling2D((2, 2))(decoder_output)
    decoder_output = Conv2D(64, (3,3), activation='relu', padding='same')(decoder_output)
    decoder_output = UpSampling2D((2, 2))(decoder_output)
    decoder_output = Conv2D(32, (3,3), activation='relu', padding='same')(decoder_output)
    decoder_output = Conv2D(16, (3,3), activation='relu', padding='same')(decoder_output)
    decoder_output = Conv2D(2, (3, 3), activation='tanh', padding='same')(decoder_output)
    decoder_output = UpSampling2D((2, 2))(decoder_output)


    # tie it together 
    model = Model(inputs=inputs1, outputs=decoder_output)
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['acc'])
    # summarize model
    print(model.summary())
    plot_model(model, to_file='autoencoder_colorization_baseline.png', show_shapes=True)
    
    return model

The model is summarized here.
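
Since the data pipeline below resizes every image to 224×224, the baseline model can be instantiated as follows (a usage sketch):

In [ ]:
# 224x224 inputs, matching the target_size used when loading the images below
model = define_model_baseline(224, 224)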

Merged Model for Colorization

Transfer Learning for Feature Extraction

As a comparison, the merged model uses the same autoencoder as the vanilla model. The global features are extracted using VGG16 as the base network. The last fully connected layer of VGG16, together with its softmax, is removed. A dropout layer and a fully connected (fc) layer with 1024 outputs are then attached to the remaining network, and the features feeding the fusion come from this newly added fc layer. During training, the original VGG16 layers are frozen; only the newly added layers are trainable. The resulting features are concatenated with the output of the encoder section of the autoencoder.
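
The notebook later relies on a `feature_extract_model` to compute the `fc2_features`; its construction is not shown explicitly, so the following is only a sketch of how such an extractor could be built, assuming the features are taken from VGG16's `fc2` layer (consistent with the variable name used later):

In [ ]:
from keras.applications.vgg16 import VGG16
from keras.models import Model

# full VGG16 with its ImageNet classifier head
vgg = VGG16(weights='imagenet')

# drop the final softmax ('predictions') layer and expose the 4096-d 'fc2' activations
feature_extract_model = Model(inputs=vgg.input, outputs=vgg.get_layer('fc2').output)

# the pretrained VGG16 layers stay frozen; only the dropout + dense layers added in the
# merged model below are trained
feature_extract_model.trainable = False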

In [ ]:
#%% merge model

from keras.layers import Conv2D
from keras.layers import UpSampling2D
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import RepeatVector
from keras.layers import Reshape
from keras.layers.merge import concatenate
from keras.utils import plot_model
from os import listdir
from pickle import load
import numpy as np

# %%
def define_model(img_h, img_w, feature_size):
    # feature extractor model
    inputs1 = Input(shape=(feature_size,))
    image_feature = Dropout(0.5)(inputs1)
    image_feature = Dense(1024, activation='relu')(image_feature)
    
    
    # encoder
    
    inputs2 = Input(shape=(img_h, img_w, 3,))
    encoder_output = Conv2D(64, (3,3), activation='relu', padding='same', strides=2)(inputs2)
    encoder_output = Conv2D(128, (3,3), activation='relu', padding='same')(encoder_output)
    encoder_output = Conv2D(128, (3,3), activation='relu', padding='same', strides=2)(encoder_output)
    encoder_output = Conv2D(256, (3,3), activation='relu', padding='same')(encoder_output)
    encoder_output = Conv2D(256, (3,3), activation='relu', padding='same', strides=2)(encoder_output)
    encoder_output = Conv2D(512, (3,3), activation='relu', padding='same')(encoder_output)
    encoder_output = Conv2D(512, (3,3), activation='relu', padding='same')(encoder_output)
    encoder_output = Conv2D(256, (3,3), activation='relu', padding='same')(encoder_output)
    
    
    # fusion: tile the image-level feature vector across every spatial location of the
    # encoder output, then concatenate along the channel axis
    concat_shape = (np.uint32(encoder_output.shape[1]),
                    np.uint32(encoder_output.shape[2]),
                    np.uint32(image_feature.shape[-1]))

    # repeat the processed features (dropout + dense), not the raw inputs1,
    # so that the trainable feature layers above actually feed the fusion
    image_feature = RepeatVector(int(concat_shape[0]) * int(concat_shape[1]))(image_feature)
    image_feature = Reshape(concat_shape)(image_feature)

    fusion_output = concatenate([encoder_output, image_feature], axis=3)
    
    decoder_output = Conv2D(128, (3,3), activation='relu', padding='same')(fusion_output)
    decoder_output = UpSampling2D((2, 2))(decoder_output)
    decoder_output = Conv2D(64, (3,3), activation='relu', padding='same')(decoder_output)
    decoder_output = UpSampling2D((2, 2))(decoder_output)
    decoder_output = Conv2D(32, (3,3), activation='relu', padding='same')(decoder_output)
    decoder_output = Conv2D(16, (3,3), activation='relu', padding='same')(decoder_output)
    decoder_output = Conv2D(2, (3, 3), activation='tanh', padding='same')(decoder_output)
    decoder_output = UpSampling2D((2, 2))(decoder_output)


    # tie it together: the model takes the feature vector and the gray image as inputs
    model = Model(inputs=[inputs1, inputs2], outputs=decoder_output)
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['acc'])
    # summarize model
    print(model.summary())
    plot_model(model, to_file='autoencoder_colorization_merged.png', show_shapes=True)
    
    return model

The model is summarized here.
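
As with the baseline, the merged model can be instantiated for 224×224 inputs (a usage sketch; the feature size of 4096 is an assumption based on VGG16's fc2 layer):

In [ ]:
# 4096 = assumed dimensionality of the VGG16 fc2 features
model = define_model(224, 224, feature_size=4096)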

5. Data Set

We use MIT's Places dataset to train and test our models.

When the image files are loaded, each image is in PIL format. After conversion to an array, the color space is RGB. We need to convert the images from RGB to the YCbCr space, and we also convert the color images to grayscale images. The grayscale images have their features extracted by the VGG16 network, and they also serve as the input to our autoencoder. We define some helper functions to convert between color spaces.

In [ ]:
# -*- coding: utf-8 -*-
"""
https://github.com/baldassarreFe/deep-koalarization
"""
from os import listdir
from pickle import dump
from pickle import load
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
import numpy as np
from keras.layers import Input
from keras.layers import Dense
from keras.utils import plot_model
from keras.models import load_model


"""
Turning RGB into YCrCb can be done by a simple matrix multiplication.
The values in the matrix can be obtained from the official Jpeg doc here:
https://www.itu.int/rec/T-REC-T.871-201105-I/en

"""
def RGBtoYCC(arr):
    xform = np.array([[.299, .587, .114], [-.1687, -.3313, .5], [.5, -.4187, -.0813]])
    ycbcr = arr.dot(xform.T)
    ycbcr[:,:,[1,2]] += 128
    return np.uint8(ycbcr)

def YCCtoRGB(arr):
    xform = np.array([[1, 0, 1.402], [1, -0.34414, -.71414], [1, 1.772, 0]])
    rgb = arr.astype(float)
    rgb[:,:,[1,2]] -= 128
    rgb = rgb.dot(xform.T)
    np.putmask(rgb, rgb > 255, 255)
    np.putmask(rgb, rgb < 0, 0)
    return np.uint8(rgb)

def RGBtoGrayYCC(arr):
    #this function takes a RGB input and extracts the Y portion of YCC
    ycc = RGBtoYCC(arr)
    ret = np.zeros_like(ycc[:, :, 0])
    ret[:, :] = ycc[:, :, 0]
    return ret

'''
A note on taking the grayscale of a color image.
To turn a color image into a grayscale image, use the following steps:
    1. Convert to YCC format
    2. Keep the Y (1st) channel, and set the other channels to 128 (**not 0**)
    3. Convert the YCC image back to RGB format

An alternative and equivalent way is to:
    1. Convert to YCC format
    2. Copy the Y (1st) channel 3 times, and use this as the RGB image

The two approaches generate the same matrix, because Y is the luma channel: if all three
RGB channels equal Y, the result is a grayscale version of the original.
'''
def YCCtoGrayRGB(arr):
    #this function takes YCC input and turns it into a RGB gray image
    ret = np.zeros_like(arr)
    ret[:, :, 0] = arr[:, :, 0]
    ret[:, :, 1] = arr[:, :, 0]
    ret[:, :, 2] = arr[:, :, 0]
    return ret
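
A quick usage sketch of these helpers on a single image (the file name here is only a placeholder):

In [ ]:
# load an image at the 224x224 resolution used throughout the notebook
img = load_img('val_256/example.jpg', target_size=(224, 224))  # placeholder file name
img_arr_rgb = img_to_array(img)

img_arr_ycc = RGBtoYCC(img_arr_rgb)           # (224, 224, 3) YCbCr image
img_arr_ycc_gray = RGBtoGrayYCC(img_arr_rgb)  # (224, 224) luma channel only
img_arr_rgb_gray = YCCtoGrayRGB(img_arr_ycc)  # (224, 224, 3) gray image in RGB format
img_back = YCCtoRGB(img_arr_ycc)              # round trip back to RGB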

To speed up training, we first extract the features of the images using VGG16 and save them to disk. When the features are needed during training, they are simply loaded from disk rather than recomputed. The following code is used to extract the features.

In [ ]:
progress=0

for file in listdir('val_256'):
    img_path = 'val_256/' + file
    img = load_img(img_path, target_size=(224, 224)) #input_shape is (224,224,3) for VGG16 if the last 3 fc layers are used. If not used, can use any input_shape
    img_arr_rgb = img_to_array(img) #change to np array
    img_arr_ycc = RGBtoYCC(img_arr_rgb) #for loss
    img_arr_ycc_gray = RGBtoGrayYCC(img_arr_rgb) #for autoencoder input
    img_arr_rgb_gray = YCCtoGrayRGB(img_arr_ycc) #for VGG16 input
    
    x = np.expand_dims(img_arr_rgb_gray, axis=0) #expand to include batch dim at the beginning
    x = preprocess_input(x) #make input conform to VGG16 input format
    fc2_features = feature_extract_model.predict(x)
    
    data=dict()
    data['img_arr_rgb'] = img_arr_rgb
    data['img_arr_ycc'] = img_arr_ycc
    data['img_arr_ycc_gray'] = img_arr_ycc_gray
    data['img_arr_rgb_gray'] = img_arr_rgb_gray
    data['fc2_features'] = fc2_features
    file_save_name = 'val_256/processed/' + file.split('.')[0] +'preproc.pk' #take the file name and use as id in dict
    
    fid = open(file_save_name, 'wb')
    dump(data, fid)
    fid.close()
    
    progress+=1
    if progress % 100 ==0:
        print(file)

6. Training

Due to limited computer memory, we load data from disk on the fly to generate training batches, instead of generating all training data at once and keeping it in memory. This increases training time because of the disk I/O, but it allows a much larger amount of training data to be used.

We create a generator for the baseline model, and two more generators (for the YCbCr and CIE LAB color spaces) for the merged model.

In [ ]:
from skimage import color  # needed for color.rgb2lab in data_generator_lab below

def data_generator_baseline(training_dir, num_train_samples, batch_size):
    # loop for ever over images
    current_batch_size=0
    while 1:
        files = listdir(training_dir) #'coco_images\processed'
        for file_idx in range(num_train_samples):
            # retrieve the photo feature
            if current_batch_size == 0:
                X1, Y = list(), list()
            file = training_dir+ '/' + files[file_idx]
            fid = open(file, 'rb')
            data = load(fid)
            fid.close()
            
            img_arr_ycc_crcb = data['img_arr_ycc'][:,:,1:]
            img_arr_rgb_gray = data['img_arr_rgb_gray']
    
            X1.append(img_arr_rgb_gray/255)
            Y.append(img_arr_ycc_crcb/255)
            current_batch_size += 1
            if current_batch_size == batch_size:
                current_batch_size = 0
                yield [np.array(X1), np.array(Y)]

                
def data_generator_lab(training_img_dir, training_feature_dir, num_train_samples, batch_size):
    # loop for ever over images
    current_batch_size=0
    while 1:
        files_img = listdir(training_img_dir) #'coco_images\processed'
        for file_idx in range(num_train_samples):
            # retrieve the photo feature
            if current_batch_size == 0:
                X1, X2, Y = list(), list(), list()
                
            file_img = training_img_dir+ '/' + files_img[file_idx]
            img = load_img(file_img, target_size=(224, 224))
            img_arr_rgb = img_to_array(img) #change to np array
            img_arr_lab = color.rgb2lab(img_arr_rgb/255.0)
            
            
            file_feature = training_feature_dir + '/' + files_img[file_idx].split('.')[0] +'_grayscale_VGG16_top_features.pk'
            fid = open(file_feature, 'rb')
            fc2_features = load(fid)
            fid.close()

            fc2_features = fc2_features/np.max(fc2_features)
            
            img_arr_lab_ab = img_arr_lab[:,:,1:]/128  #for loss: a/b channels scaled to roughly [-1, 1]
            img_arr_lab_l  = img_arr_lab[:,:,0]/100   #autoencoder input: L channel scaled to [0, 1]
            
            img_arr_lab_l_expandD = np.expand_dims(img_arr_lab_l, axis=2)
            
            X1.append(fc2_features)
            X2.append(img_arr_lab_l_expandD)
            Y.append(img_arr_lab_ab)
            
            current_batch_size += 1
            if current_batch_size == batch_size:
                current_batch_size = 0
                yield [[np.squeeze(np.array(X1)), np.array(X2)], np.array(Y)]  
                
def data_generator_merge(training_img_dir, training_feature_dir, num_train_samples, batch_size):
    # loop for ever over images
    current_batch_size=0
    while 1:
        files_img = listdir(training_img_dir) #'coco_images\processed'
        for file_idx in range(num_train_samples):
            # retrieve the photo feature
            if current_batch_size == 0:
                X1, X2, Y = list(), list(), list()
                
            file_img = training_img_dir+ '/' + files_img[file_idx]
            img = load_img(file_img, target_size=(224, 224))
            img_arr_rgb = img_to_array(img) #change to np array
            img_arr_ycc = RGBtoYCC(img_arr_rgb) 
            img_arr_ycc_gray = RGBtoGrayYCC(img_arr_rgb) 

            
            file_feature = training_feature_dir + '/' + files_img[file_idx].split('.')[0] +'_grayscale_VGG16_features.pk'
            fid = open(file_feature, 'rb')
            fc2_features = load(fid)
            fid.close()
            
            img_arr_ycc_crcb = img_arr_ycc[:,:,1:] #for loss
            fc2_features = fc2_features/(np.max(fc2_features)-np.min(fc2_features))
            img_arr_ycc_gray_expandD = np.expand_dims(img_arr_ycc_gray, axis=2)
            
            X1.append(fc2_features)
            X2.append(img_arr_ycc_gray_expandD/255)
            Y.append(img_arr_ycc_crcb/255)
            
            current_batch_size += 1
            if current_batch_size == batch_size:
                current_batch_size = 0
                yield [[np.squeeze(np.array(X1)), np.array(X2)], np.array(Y)]
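
As a quick sanity check (the directory name is assumed to match the preprocessing step above), one batch from the baseline generator can be inspected like this:

In [ ]:
gen = data_generator_baseline('val_256/processed', num_train_samples=64, batch_size=4)
X_batch, Y_batch = next(gen)
print(X_batch.shape)  # expected (4, 224, 224, 3): gray image replicated across the RGB channels
print(Y_batch.shape)  # expected (4, 224, 224, 2): normalized Cb/Cr targets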

Due to limited hardware resources, we train on only 28,800 samples with a batch size of 32.

In [ ]:
training_img_dir = 'val_256'                # raw training images
training_feature_dir = 'val_256/processed'  # directory holding the preprocessed data / extracted features
num_train_samples = 28800
batch_size = 32
steps_per_epoch = int(num_train_samples / batch_size)
epochs = 3

#%% train the merged model; for the baseline model, use
#   data_generator_baseline(training_feature_dir, num_train_samples, batch_size) instead
for i in range(epochs):
    # create a fresh data generator for each epoch
    generator = data_generator_merge(training_img_dir, training_feature_dir, num_train_samples, batch_size)
    # fit for one epoch
    fit_history = model.fit_generator(generator, epochs=1, steps_per_epoch=steps_per_epoch, verbose=1)
    # save a checkpoint after each epoch
    model.save('model_merge_' + str(i) + '.h5')

7. Coloring Images

Once the model is trained, we download some images from the web and colorize them. The images we use are in color: we obtain a grayscale image from each color image and feed the grayscale image into the model. The output of the model is then compared with the original color image.

In the table below, the grayscale input is shown on the left and the original color image on the right. The second image is the colorization produced by the vanilla autoencoder. The third image is produced by the merged model working in the YCbCr color space; this colorization is unsaturated and the colors are dull. The fourth image is produced by the merged model working in the CIE LAB color space; the colors are much more vibrant and true. This is because the CIE LAB color space is closer to human vision, so changes in the numerical values during optimization translate well into improvements as perceived by the human eye.

Also note that the training data contains many mountain scenes, so the model colors outdoor scenes well: it can identify the sky and color it blue, and the grass and color it green. However, it does not color humans and animals well. Note that it colored the horse in one of the pictures green, probably because the model has not seen enough horses during training. The jacket of the man in one of the images is also colored incorrectly; this is due to the inherent ambiguity of the grayscale image, since from the grayscale image alone even a human would have trouble identifying the color of the jacket.

[Figure: example colorizations. Columns, left to right: Gray Image, Vanilla Autoencoder, Merge Model (YCbCr), Merge Model (LAB), Original.]

8. Coloring Historic Photos

Let's colorize some historical black-and-white photos that do not have a color original. Note that the model colors the sky and the ground better than the buildings.

[Figure: historical original image (left) and colorized image (right).]
In [ ]:
import matplotlib.pyplot as plt

model_merge = load_model('model_merge_1a0.h5')

# img_arr_rgb_gray and img_arr_ycc are assumed to hold the (224, 224) grayscale and
# YCbCr versions of the photo to colorize, prepared with the helper functions above
x = np.expand_dims(img_arr_rgb_gray, axis=0) #expand to include batch dim at the beginning
x = preprocess_input(x) #make input conform to VGG16 input format
fc2_features = feature_extract_model.predict(x)
crcb_output_merge = model_merge.predict([fc2_features, np.reshape(img_arr_rgb_gray, (-1,224,224,3))], verbose=0)
ycc_output_merge = np.copy(img_arr_ycc)
ycc_output_merge[:,:,1:] = 255*np.squeeze(crcb_output_merge) #rescale predicted Cb/Cr back to [0, 255]
rgb_output_merge = YCCtoRGB(ycc_output_merge)
plt.imshow(rgb_output_merge)