Prakanshul Saxena
My name is Prakanshul Saxena, a senior-year Electrical Engineering student at IIT Bhilai. Below is a project report of my work in GSoC '22 @ INCF under the mentorship of Suresh Krishna and Dinesh Sathia Raj.
Sept 12, 2022

Background on Project

Last year, a similar implementation of the Google model was carried out on this project and the corresponding results were obtained. A detailed summary of the project background is available at https://dssr2.github.io/gaze-track/ . Below are explanations of some terms frequently used in this report.

Important Terms

  • .tfrec file

    The TFRecord format is a simple format for storing a sequence of binary records.

  • Gazetrack dataset

    The main dataset whose split and directory structure are explained below.

  • Google Model

    The .tflite model provided by Google for experiment purposes.

  • SVR

    Support Vector Regression, used here for multi-output regression of the (x, y) gaze point.

  • Individual Datapoint (gazetrack)

    The visualization of a tfrec data point is given below.

  • MED - Mean Euclidean Distance (a short computation sketch follows this list).
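
A minimal MED computation, assuming predictions and ground truth are (x, y) gaze points (e.g. in centimetres) stored as NumPy arrays of shape (N, 2):

```python
import numpy as np

def mean_euclidean_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean Distance (MED) between predicted and true gaze points."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

# Toy example: per-point errors are 0.5 and 1.0, so MED = 0.75
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[1.5, 2.0], [3.0, 5.0]])
print(mean_euclidean_distance(pred, gt))
```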

Main Idea

The goal is to perform various experiments on Google's .tflite model to reproduce the results of "Accelerating eye movement research via accurate and affordable smartphone eye tracking" and to analyze the trend of the MED after applying various versions of SVR. Code for the project is available here.

DATASETS

Main Datasets and Splits

This project is based on different versions of the massive GazeCapture dataset from MIT. A major portion of my work has been on a smaller version of the same dataset, gazetrack.tar.gz. The directory structure of the dataset is explained below.

Two splits are available for the same dataset: the MIT split and the Google split. In the MIT split, each individual appears in exactly one of the train, val, or test sets, i.e. all frames of a given person are confined to a single set. In the Google split, frames of the same individual are distributed across the train, val, and test sets.
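The distinction between the two splits can be checked with a small script. The sketch below is illustrative only; it assumes a hypothetical layout in which frames sit under train/, val/, and test/ directories and each filename begins with a participant ID, which is not necessarily how gazetrack.tar.gz is organised.

```python
from pathlib import Path

def participants(root: str, split: str) -> set[str]:
    # Hypothetical filename convention: "<participant_id>__frame_0001.jpg"
    return {p.name.split("__")[0] for p in Path(root, split, "images").glob("*.jpg")}

root = "gazetrack"
train, val, test = (participants(root, s) for s in ("train", "val", "test"))

# MIT split: all three overlaps should be zero (each person in exactly one set).
# Google split: the same person may appear in all three sets.
print("train & val overlap :", len(train & val))
print("train & test overlap:", len(train & test))
print("val & test overlap  :", len(val & test))
```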




Main Tfrec dataset

In order to fully implement the Google model's input pipeline for inference, the dataset has to be saved as a .tfrec file; images are then read from the .tfrec through the tf.train.Example protocol buffer before any further processing is applied. I therefore created a .tfrec-based dataset from the existing gazetrack.tar.gz dataset on the server.

The link for tfrec dataset creation is

https://github.com/prakanshuls22/GSoC_2022_INCF/blob/main/GSoC_22/Tfrec_creation_main.py
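
For reference, the core of creating and reading such a .tfrec file looks roughly like the sketch below; the feature names (image, gaze_x, gaze_y) are assumptions for illustration and may differ from those actually used in Tfrec_creation_main.py.

```python
import tensorflow as tf

def _bytes(v): return tf.train.Feature(bytes_list=tf.train.BytesList(value=[v]))
def _float(v): return tf.train.Feature(float_list=tf.train.FloatList(value=[v]))

def to_example(jpeg_bytes: bytes, gaze_x: float, gaze_y: float) -> tf.train.Example:
    # Pack one frame and its gaze label into a tf.train.Example protocol buffer.
    return tf.train.Example(features=tf.train.Features(feature={
        "image": _bytes(jpeg_bytes),
        "gaze_x": _float(gaze_x),
        "gaze_y": _float(gaze_y),
    }))

# Writing: serialize every example into a single .tfrec file.
frames: list[tuple[bytes, float, float]] = []  # fill with (jpeg_bytes, gaze_x, gaze_y) from the dataset
with tf.io.TFRecordWriter("gazetrack_test.tfrec") as writer:
    for jpeg_bytes, gx, gy in frames:
        writer.write(to_example(jpeg_bytes, gx, gy).SerializeToString())

# Reading: parse the serialized examples back inside a tf.data pipeline.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "gaze_x": tf.io.FixedLenFeature([], tf.float32),
    "gaze_y": tf.io.FixedLenFeature([], tf.float32),
}

def parse(record):
    ex = tf.io.parse_single_example(record, feature_spec)
    img = tf.io.decode_jpeg(ex["image"])
    return img, (ex["gaze_x"], ex["gaze_y"])

ds = tf.data.TFRecordDataset("gazetrack_test.tfrec").map(parse)
```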

Individual and Combined tfrec datasets

Google performed per-user, per-block SVR personalisation. To carry out this per-user SVR training and testing, the outputs of the .tflite model had to be segregated by user, which meant that the .tfrec files had to be arranged per user. I therefore created .tfrecs suited to this, along with a combined .tfrec file containing ten individuals with all of their frames from the train, val, and test portions of gazetrack.tar.gz. These ten individuals were chosen on the basis of the number of frames per user, i.e. they have the highest frame counts among all users in the gazetrack dataset (train, val, and test combined). The results of the .tflite model on this .tfrec were then fed into SVR for per-user personalisation.
The link to this SVR per user personalisation is here.
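For illustration, the selection of these ten individuals can be sketched as follows; the directory layout and the way the participant ID is extracted from filenames are assumptions, not necessarily how gazetrack.tar.gz is organised.

```python
from collections import Counter
from pathlib import Path

# Hypothetical layout: frames under gazetrack/<split>/images/, filenames
# prefixed with the participant ID (e.g. "01840__frame_0001.jpg").
counts = Counter()
for split in ("train", "val", "test"):
    for frame in Path("gazetrack", split, "images").glob("*.jpg"):
        counts[frame.name.split("__")[0]] += 1

# The ten users with the most frames go into the combined .tfrec.
top10 = [user for user, _ in counts.most_common(10)]
print(top10)
```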

Unique 30 dataset (csv)

To repeat an experiment performed last year on the same project, we ran SVR on the Google model's outputs for 30 unique gazetrack points per individual. For this, the unique points of every individual were extracted from all the CSV outputs of the Google model.
The structure of the unique points CSV is elaborated in the image below for better understanding.
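
A minimal sketch of how such per-user unique points might be extracted with pandas is given below; the file name and column names (user_id, gt_x, gt_y) are assumptions for illustration, since the actual CSV structure is the one shown in the image.

```python
import pandas as pd

# Hypothetical columns: one row per frame with the user ID and the
# ground-truth on-screen gaze point produced alongside the model output.
df = pd.read_csv("google_model_outputs.csv")

# Keep one row per unique ground-truth point for every user, yielding
# up to 30 unique calibration points per individual.
unique_points = (
    df.drop_duplicates(subset=["user_id", "gt_x", "gt_y"])
      .groupby("user_id")
      .head(30)
)
unique_points.to_csv("unique_30_per_user.csv", index=False)
```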

MODEL/VERSIONS

Google .tflite version

This version was provided by Google itself. We experimented extensively with it, and the results are visualized later in the report.

Keypoints and Eye bounding box

The image on the left defines the different points used in the table below. A refers to the top-left point of the bounding box, whereas B refers to the bottom-left point. The Eye Dist in the image is represented by 'e' in the table. w and h refer to the width and height of the eye bounding boxes. 'S' in the 'Bounding Box Points Alignment' column denotes swapped keypoints for that input version: the left-eye keypoints are provided for the right eye and vice versa. This is done to test both possible left/right combinations of the eyes, i.e. as viewed by a user through the image versus the original position of the eyes.
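As a rough illustration of how these crop variants can be computed (a sketch under assumptions, not the repository's actual code; the keypoint names are invented here, while the e/2, e/2 crop size and [x, y, w, h] format follow the v3/v4 rows of the table):

```python
import numpy as np

def eye_bboxes(left_eye_xy, right_eye_xy, swap=False, fx=2.0, fy=2.0):
    """Return [x, y, w, h] eye crops sized as fractions of the inter-eye distance e.

    left_eye_xy / right_eye_xy: (x, y) eye-centre keypoints in image pixels.
    fx, fy: crop size is (e / fx) x (e / fy); fx = fy = 2 gives the e/2, e/2 versions.
    swap: feed the left keypoints for the right eye and vice versa ('S' in the table).
    """
    left = np.asarray(left_eye_xy, float)
    right = np.asarray(right_eye_xy, float)
    e = np.linalg.norm(left - right)  # inter-eye ("Eye Dist") distance
    w, h = e / fx, e / fy

    def bbox(center):
        # Top-left corner convention: (x, y) is the upper-left point of the crop.
        return [center[0] - w / 2, center[1] - h / 2, w, h]

    boxes = {"left": bbox(left), "right": bbox(right)}
    if swap:
        boxes["left"], boxes["right"] = boxes["right"], boxes["left"]
    return boxes

print(eye_bboxes((120, 200), (220, 205)))
```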

Experimental Versions
| Version/Modification | Bounding Box Points Alignment | Bounding Box Format | Eye Crop Size | Bounding Box as Input | Bounding Box Normalization | Input Sequence to Model | Dist Function |
| --- | --- | --- | --- | --- | --- | --- | --- |
| v0 | Default | [x, y, w, h] | Original | Bottom Left | By Screen Size | (rb, lb, re, le) | NA |
| v1 | Default | [x, y, w, h] | Original | Bottom Left | By Image Size | (rb, lb, re, le) | NA |
| v2 | Default | [x, y, w, h] | Original | Top Left | By Image Size | (rb, lb, re, le) | NA |
| v3 | Default | [x, y, w, h] | e/2, e/2 | Bottom Left | By Image Size | (rb, lb, re, le) | np.linalg.norm |
| v3b | S | [x, y, w, h] | e/2, e/2 | Top Left | By Image Size | (rb, lb, re, le) | np.linalg.norm |
| v3testing | Default | [x, y, w, h] | e/2, e/2 | Bottom Left | By Image Size | (rb, lb, re, le) | math.dist |
| v4 | Default | [x, y, w, h] | e/2, e/2 | Top Left | By Image Size | (rb, lb, re, le) | np.linalg.norm |
| v4b | S | [x, y, w, h] | e/2, e/2 | Top Left | By Image Size | (rb, lb, re, le) | np.linalg.norm |
| v4testing | Default | [x, y, w, h] | e/2, e/2 | Top Left | By Image Size | (rb, lb, re, le) | np.linalg.norm |
| v5 | Default | [x1, y1, x2, y2] | e/2, e/2 | Top Left | By Image Size | (rb, lb, re, le) | np.linalg.norm |
| v5b | S | [x1, y1, x2, y2] | e/2, e/2 | Top Left | By Image Size | (rb, lb, re, le) | np.linalg.norm |
| vfinal | Default | [x, y, w, h] | e/2.5, e/3 | Top Left | By Image Size | (lb, rb, le, re) | abs |
| vfinal_unif | Default | [x, y, w, h] | e/2.5, e/3 | Top Left | By Image Size | (lb, rb, le, re) | abs |
| vfinal_flipkey | S | [x, y, w, h] | e/2.5, e/3 | Top Left | By Image Size | (lb, rb, le, re) | abs |
| vfinallrnoflip_flipkey | S | [x, y, w, h] | e/2.5, e/3 | Top Left | By Image Size | (rb, lb, re, le) | abs |
| vfinallrnoflip | Default | [x, y, w, h] | e/2.5, e/3 | Top Left | By Image Size | (rb, lb, re, le) | abs |
https://github.com/prakanshuls22/GSoC_2022_INCF/tree/main/GSoC_22/All_versions

App Data Collection

A small round of data collection was also carried out during the project using an Android app (.apk). This data will later be used to fine-tune our own trained model along the same lines as Google, after which a comprehensive comparison can be carried out for an end-to-end gaze-tracking model.

SVR

Google used SVR (sklearn.svm.SVR) to map the model's penultimate-layer output into another tensor space and then used this mapped output to compute the Mean Euclidean Distance. This implementation uses SVR for the same purpose.

Per-user personalisation - In this SVR execution, the outputs for a single individual are consolidated and then split into train and test sets; the SVR is fitted on the train set (against the gaze ground-truth values) and the MED is measured on the test set.

SVR was performed with the dataset splits and fold strategies listed below; a minimal sketch of the procedure follows the list.

  • Folds - 3-fold and 5-fold cross-validation strategies
  • Split - Google Split (fig. reference)
  • Split Sequence

    First - first split the corresponding dataset into a 70/30 train/test ratio, then fit the SVR on the train portion and test it on the test portion.

    Later - fit on the corresponding dataset (under the two fold strategies mentioned above) to extract the best hyperparameters, then retrain with those parameters on a 70% train division and test on the 30% test division.

  • Epsilon Range

    As an experiment, both a lower and a higher epsilon range were swept. The higher range contains more values within the sweep interval, for example 0.1, 0.2, ..., 1, 2, ..., 10, 20, 30, ..., 100, 200, 300, ..., 1000, whereas the lower range contains fewer values, for example 0.1, 1, 10, 100, 1000.
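
A minimal sketch of the per-user procedure described above, using scikit-learn; the feature matrix, the exact hyperparameter grid, and the scoring choice are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def personalise(features: np.ndarray, gaze_xy: np.ndarray, folds: int = 3,
                dense_epsilon: bool = False) -> float:
    """Fit a per-user SVR on 70% of that user's frames and return the MED on the other 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, gaze_xy, test_size=0.3, random_state=0)

    # Sparser ("lower") vs. denser ("higher") epsilon sweep, as described above.
    eps = (np.concatenate([np.arange(0.1, 1, 0.1), np.arange(1, 10),
                           np.arange(10, 100, 10), np.arange(100, 1001, 100)])
           if dense_epsilon else np.array([0.1, 1, 10, 100, 1000]))

    # SVR is single-output, so wrap it to regress (x, y) jointly; k-fold CV picks epsilon.
    model = GridSearchCV(MultiOutputRegressor(SVR(kernel="rbf")),
                         {"estimator__epsilon": eps}, cv=folds,
                         scoring="neg_mean_squared_error")
    model.fit(X_train, y_train)

    pred = model.predict(X_test)
    return float(np.mean(np.linalg.norm(pred - y_test, axis=1)))  # MED on held-out frames
```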

Results and Visualizations

The MED for 10 different users, along with the average over those individuals, was plotted across all the different input versions to the Google model; the resulting plots are shown below.

The legends for all the plots below are:
  • 3 Fold & Split First
  • 5 Fold & Split Later
  • 3 Fold & Split First (Higher Epsilon Range)
  • 5 Fold & Split First
  • 5 Fold & Split First (Higher Epsilon Range)
  • 3 Fold & Split First on Unique 30 points
  • 3 Fold & Split First on Unique 30 points (Higher Epsilon Range)
  • 5 Fold & Split First on Unique 30 points
  • 5 Fold & Split First on Unique 30 points (Higher Epsilon Range)
Average of All Users Plot


The notebooks for the visualisations are available here.

Challenges and Learnings


Challenges

No significant prior experience with TensorFlow.

Learnings

  • Working with larger datasets

  • Working with TensorFlow all the way down to basic implementations

  • Writing and modifying code according to the situation/requirement of the team/organization

  • The working of SVR (fitting and hyperparameter sweeping)

Conclusion and Future Direction

The project is complete in the sense that the major experiments on the Google model have been performed and the input version with the lowest error has been determined. What lies ahead is further testing of the model's accuracy on the calibration data we collected throughout this project. Face-filter and face-tilt features can also be incorporated with the current version of the Google base model to try to further reduce the MED for gaze estimation.