Word vectors (word embeddings) are a better way of converting a word into a numerical representation.
Stanford makes a pre-trained dataset of these embeddings, GloVe, available → https://nlp.stanford.edu/projects/glove/
You can download the dataset and run the code below to convert any English word into its vector representation.
import numpy as np

def loadGlove(path):
    # Each line of the GloVe file is: <word> <value_1> <value_2> ... <value_n>
    model = {}
    with open(path, 'r', encoding='utf8') as file:
        for line in file:
            parts = line.split()
            word = parts[0]                                        # the token itself
            vector = np.array([float(val) for val in parts[1:]])   # its embedding
            model[word] = vector
    return model

glove = loadGlove('glove.6B.50d.txt')  # 400,000 words, 50-dimensional vectors
glove['python']  # vector embedding for the word 'python'
Output →
array([ 0.5897 , -0.55043 , -1.0106 , 0.41226 , 0.57348 , 0.23464 ,
-0.35773 , -1.78 , 0.10745 , 0.74913 , 0.45013 , 1.0351 ,
0.48348 , 0.47954 , 0.51908 , -0.15053 , 0.32474 , 1.0789 ,
-0.90894 , 0.42943 , -0.56388 , 0.69961 , 0.13501 , 0.16557 ,
-0.063592, 0.35435 , 0.42819 , 0.1536 , -0.47018 , -1.0935 ,
1.361 , -0.80821 , -0.674 , 1.2606 , 0.29554 , 1.0835 ,
0.2444 , -1.1877 , -0.60203 , -0.068315, 0.66256 , 0.45336 ,
-1.0178 , 0.68267 , -0.20788 , -0.73393 , 1.2597 , 0.15425 ,
-0.93256 , -0.15025 ])
glove['neural']
Output →
array([ 0.92803 , 0.29096 , 0.67837 , 1.0444 , -0.72551 , 2.1995 ,
0.88767 , -0.94782 , 0.67426 , 0.24908 , 0.95722 , 0.18122 ,
0.064263, 0.64323 , -1.6301 , 0.94972 , -0.7367 , 0.17345 ,
0.67638 , 0.10026 , -0.033782, -0.76971 , 0.40519 , -0.099516,
0.79654 , 0.1103 , -0.076053, -0.090434, 0.015021, -1.137 ,
1.6803 , -0.34424 , 0.77538 , -1.8718 , -0.17148 , 0.31956 ,
0.093062, 0.004996, 0.25716 , 0.52207 , -0.52548 , -0.93144 ,
-1.0553 , 1.4401 , 0.30807 , -0.84872 , 1.9986 , 0.10788 ,
-0.23633 , -0.17978 ])
Here comes a simple question → How does the computer know that words are similar?
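One common answer (a minimal sketch, not necessarily the only approach) is cosine similarity → the cosine of the angle between two embedding vectors. Words whose vectors point in similar directions score close to 1, while dissimilar words score closer to 0 (or even negative). The helper below, cosineSimilarity, is a name introduced here for illustration; it assumes the glove dictionary loaded earlier.

def cosineSimilarity(a, b):
    # cos(theta) = (a · b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosineSimilarity(glove['python'], glove['neural'])  # similarity between 'python' and 'neural'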