Introduction

Google's Universal Sentence Encoder is a useful tool in the NLP domain. Its main advantages are that it captures context, is trained on a large amount of data, and produces a vector of the same shape for words, sentences, and paragraphs, which makes it easy to compare embeddings and find similar ones in the vector space.
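
To make the comparison idea concrete, here is a minimal sketch of cosine similarity between fixed-length vectors. The three-dimensional vectors and the `cosine` helper are made up for illustration; the real embeddings are 512-dimensional and come from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings"; real ones would come from the encoder.
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]   # same direction as query, different length
far = [0.0, 1.0, 0.0]     # orthogonal to query

sim_close = cosine(query, close)   # approximately 1: same direction
sim_far = cosine(query, far)       # 0: nothing in common
```

Because every input maps to a vector of the same length, the same function can compare a word with a sentence or a sentence with a paragraph.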

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import os
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

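def version_tuple(version):
    # Compare (major, minor) as integers; float("2.10") would wrongly read as 2.1
    return tuple(int(part) for part in version.split('.')[:2])

assert version_tuple(tf.__version__) >= (2, 2), "You need TensorFlow 2.2 or newer"
assert version_tuple(hub.__version__) >= (0, 8), "You need tensorflow-hub 0.8.0 or newer"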

Folder Structure

Model

I omitted Google's Universal Sentence Encoder model due to size constraints. It can be downloaded from https://tfhub.dev/google/universal-sentence-encoder/4?tf-hub-format=compressed

Data source

https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

The training data is about 40 MB. It covers a list of topics (~440). Each topic has multiple contexts, and each context has N question-and-answer pairs whose answers are drawn from that context.

+-- 
+-- Google-Universal-Encoder-v4
|   +-- model files ... 
+-- data
|   +-- train-v2.0.json
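
The nesting of the training file (topic, then contexts, then question-and-answer pairs) can be sketched with a tiny stand-in. The `toy` dictionary below is hypothetical data that only mirrors the keys used later in this notebook; it is not taken from SQuAD.

```python
# Hypothetical miniature of the train-v2.0.json layout, not real data.
toy = {
    "data": [
        {
            "title": "Example_topic",
            "paragraphs": [
                {
                    "context": "Alpha is the first letter. Beta is the second.",
                    "qas": [
                        {
                            "question": "What is the first letter?",
                            "is_impossible": False,
                            "answers": [{"text": "Alpha", "answer_start": 0}],
                        },
                        {
                            "question": "What is the tenth letter?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}

# Flatten to one (topic, question, answer text) row per question.
rows = []
for topic in toy["data"]:
    for paragraph in topic["paragraphs"]:
        for qa in paragraph["qas"]:
            answers = qa["answers"]
            answer_text = answers[0]["text"] if answers else ""
            rows.append((topic["title"], qa["question"], answer_text))
```

Questions flagged with `is_impossible` carry an empty `answers` list, which is why the flattening falls back to an empty string.
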
In [2]:
location = os.path.join(os.getcwd(), 'Google-Universal-Encoder-v4')
model = hub.load(location)

def embed(texts):
    # texts: a list of strings; returns one embedding vector per string
    return model(texts)
In [3]:
# Import data into the notebook
import json
new_path = os.path.join(os.getcwd(), 'data', 'train-v2.0.json')
with open(new_path, encoding='utf-8') as f:
    training_data = json.load(f)
In [4]:
number_of_topics = 50
list_of_topics = [i['title'] for i in training_data['data'][0:number_of_topics]]

print("List of topics examined : ", ", ".join(list_of_topics))

dataset = []

for topic_area in training_data['data'][0:number_of_topics]:
    list_of_paragraphs = topic_area['paragraphs']
    
    for paragraph in list_of_paragraphs:
        context = paragraph['context']
        list_of_questions = paragraph['qas']
        
        for question in list_of_questions:
            list_of_answers = question['answers']
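        # NOTE: this append sits outside the question loop above, so only the
        # last question of each paragraph is stored.
        # Use the first answer from the list of answers if one exists, else "".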
        dataset.append({'topic_area' : topic_area['title'], 
                    'question' : question['question'], 
                    'no_answer' : question['is_impossible'],
                    'context' : context,
                    'answer' : "" if len(list_of_answers) == 0 else list_of_answers[0]})
List of topics examined :  Beyoncé, Frédéric_Chopin, Sino-Tibetan_relations_during_the_Ming_dynasty, IPod, The_Legend_of_Zelda:_Twilight_Princess, Spectre_(2015_film), 2008_Sichuan_earthquake, New_York_City, To_Kill_a_Mockingbird, Solar_energy, Kanye_West, Buddhism, American_Idol, Dog, 2008_Summer_Olympics_torch_relay, Genome, Comprehensive_school, Republic_of_the_Congo, Prime_minister, Institute_of_technology, Wayback_Machine, Dutch_Republic, Symbiosis, Canadian_Armed_Forces, Cardinal_(Catholicism), Iranian_languages, Lighting, Separation_of_powers_under_the_United_States_Constitution, Architecture, Human_Development_Index, Southern_Europe, BBC_Television, Arnold_Schwarzenegger, Plymouth, Heresy, Warsaw_Pact, Materialism, Christian, Sony_Music_Entertainment, Oklahoma_City, Hunter-gatherer, United_Nations_Population_Fund, Russian_Soviet_Federative_Socialist_Republic, Alexander_Graham_Bell, Pub, Internet_service_provider, Comics, Saint_Helena, Aspirated_consonant, Hydrogen
In [5]:
from sklearn.metrics.pairwise import cosine_similarity

n_highest = 10


question_embeddings = embed([item['question'] for item in dataset])

def get_closest_matches(question="Who did Kanye produce Graduation with?"):
    assert isinstance(question, str), "Question must be a string of text"
    new_question_embedding = embed([question])
    similarity = cosine_similarity(new_question_embedding, question_embeddings).flatten()
    top_results = similarity.argsort()[::-1][:n_highest]
    return top_results, similarity

def print_results(top_results, similarity):
    for result in top_results:
        item = dataset[result]
        print("Question : {} with similarity of {}%".format(item['question'], round(similarity[result] * 100, 1)))
        if item['no_answer']:
            print("No answer")
            print("No context")
        else:
            answer = item['answer']
            start = answer['answer_start']
            # Show up to 50 characters of context on each side of the answer start
            snippet = item['context'][max(0, start - 50):start + 50]
            print("Answer : {}".format(answer['text']))
            print("Context : {}".format(snippet))
        print("\n")

        
tr, sim = get_closest_matches("Who did Kanye produce Graduation with?")

print("Question : {}".format("Who did Kanye produce Graduation with?"))
print("-----------------------------------------------------------")
print("Top Results ..... \n ")
print_results(tr, sim)
Question : Who did Kanye produce Graduation with?
-----------------------------------------------------------
Top Results ..... 
 
Question : What music group was in Kanye's first release off of Graduation? with similarity of 72.7%
Answer : Daft Punk
Context : e hit. "Stronger", which samples French house duo Daft Punk, has been accredited to not only encoura


Question : What was the name of Kanye West's high school? with similarity of 72.0%
Answer : Polaris High School
Context : as raised in a middle-class background, attending Polaris High School in suburban Oak Lawn, Illinois


Question : What was the name of the producer that helped Kanye West? with similarity of 71.9%
Answer : No I.D.
Context : upported him. West crossed paths with producer/DJ No I.D., with whom he quickly formed a close frien


Question : Who did Kanye name President of GOOD Music in 2015? with similarity of 71.9%
Answer : Pusha T
Context : he label houses artists including West, Big Sean, Pusha T, Teyana Taylor, Yasiin Bey / Mos Def, D'ba


Question : Who did Kanye West say doesn't care about black people? with similarity of 67.3%
Answer : George Bush
Context : Once it was West's turn to speak again, he said, "George Bush doesn't care about black people." At t


Question : What is the name of Kanye West's food company? with similarity of 66.9%
Answer : KW Foods LLC
Context : ough the process is being finalized. His company, KW Foods LLC, bought the rights to the chain in Ch


Question : Which years was Kanye West mentioned in Time Magazine? with similarity of 66.2%
Answer : 2005 and 2015
Context : f the 100 most influential people in the world in 2005 and 2015.


Question : What was the title of Kanye's sixth album? with similarity of 66.2%
Answer : Yeezus
Context : Describing his sixth studio album Yeezus (2013) as "a protest to music," West embrac


Question : What record company eventually signed Kanye West? with similarity of 66.0%
Answer : Roc-A-Fella
Context : -label head Damon Dash reluctantly signed West to Roc-A-Fella Records. Jay-Z later admitted that Roc


Question : What was the name of the CD that Kanye recorded based on his failed college experience? with similarity of 65.1%
Answer : College Dropout
Context : equire college. For Kanye to make an album called College Dropout it was more about having the guts 


Performance consideration

Below are the results for retrieving the top 10 matching questions, answers, and contexts from a list of 2685 questions.

In [6]:
print(question_embeddings.shape)
%timeit get_closest_matches()
(2685, 512)
5.3 ms ± 85.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
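
If the list of questions grows, one possible refinement is to normalize the stored embeddings once, so that each lookup only needs dot products, and to select the top matches without sorting every score. The sketch below illustrates that idea in plain Python so it runs without the model. It is not what the notebook above does, and `normalize`, `top_k`, and the two-dimensional vectors are hypothetical stand-ins.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, unit_rows, k):
    """Return (index, similarity) pairs for the k most similar rows."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, row)) for row in unit_rows]
    # nlargest keeps only the k best scores instead of sorting all of them
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# Hypothetical 2-d stand-ins for the stored question embeddings.
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
unit_rows = [normalize(row) for row in stored]   # done once, up front

best = top_k([2.0, 0.1], unit_rows, k=2)
best_indices = [index for index, _ in best]
```

With array libraries the same approach becomes a single matrix-vector product followed by a partial selection of the highest scores.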