Generate one-hot encodings from dict values

I am trying to build a one-hot array from the characters in my dictionary. First I create a NumPy array of zeros with shape rows x columns (3x7), then I look up the id of each character and set the matching entry in that character's row to 1.

My goal is to give each character its own one-hot row, with 1 meaning "present" and 0 meaning "not present". There are 3 characters, so there should be 3 rows, and the 7 columns correspond to the characters in the dictionary.

However, I get the error "TypeError: only integer scalar arrays can be converted to a scalar index". Can anyone help me with this? Thank you.

To avoid any confusion about my dictionary, here is how I create it:

sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}

My code:

import numpy as np
sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
   a[xx] = aa[xx]
a = {"b":1, "c":2, "e":4}
aa =len(a)

for x,y in a.items():
    aa = np.zeros((aa,aaa))
    aa[y] = 1

print(aa)

Current Error:

TypeError: only integer scalar arrays can be converted to a scalar index

My expected output:

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]]

Note: since this is a dictionary, the actual index arrangement may differ. The positions of the 1s above are only placeholders to illustrate my expected output.


Setting indices

(Comments inlined.)

# Sort and extract the indices.
idx = sorted(a.values())
# Initialise a matrix of zeros.
aa = np.zeros((len(idx), max(idx) + 1))
# Assign 1 to appropriate indices.
aa[np.arange(len(aa)), idx] = 1

print (aa)
array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.]])

numpy.eye
idx = sorted(a.values())
eye = np.eye(max(idx) + 1)    
aa = eye[idx]

print (aa)
array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.]])
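
As for the error itself: the loop in the question reuses the name `aa`. After the first pass, `aa` is no longer the integer 3 but the array returned by `np.zeros`, so the second call `np.zeros((aa, aaa))` receives an array as a dimension and raises the TypeError. In addition, `aa[y] = 1` sets the whole row `y` rather than a single cell. Below is a minimal sketch of the same loop with those two points addressed. The names `char_to_idx`, `selected` and `one_hot` are mine, not from the question. It also uses the full vocabulary size for the width, so the result has 7 columns as in the expected output (the two snippets above size the matrix by `max(idx) + 1`, which gives 5).

```python
import numpy as np

sent = ["a", "b", "c", "d", "e", "f", "g"]
char_to_idx = {ch: i for i, ch in enumerate(sent)}

sentences = ["b", "c", "e"]
selected = {ch: char_to_idx[ch] for ch in sentences}  # {"b": 1, "c": 2, "e": 4}

# Allocate the matrix once, outside the loop, under its own name.
one_hot = np.zeros((len(selected), len(sent)))

# Row = position of the character in `sentences`, column = its dictionary index.
for row, col in enumerate(selected.values()):
    one_hot[row, col] = 1

print(one_hot)
```

Keeping the row counter (`row`) separate from the dictionary index (`col`) is what makes each character land in its own row.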


A one-hot encoding treats a sample as a sequence in which each element is an index into a vocabulary, and the encoded vector indicates whether each vocabulary entry (such as a word or letter) is present in the sample. For example, if your vocabulary is the lower-case alphabet, a one-hot encoding of the word cat might look like:

 [1, 0., 1, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 1, 0., 0., 0., 0., 0., 0.]

Indicating that this word contains the letters c, a, and t.

To make a one-hot encoding you need two things. The first is a vocabulary lookup containing all the possible values. With words, this is why the matrices can get so large: the vocabulary is huge. If you are encoding the lower-case alphabet, you only need 26 entries.

The second is your samples, typically represented as indexes into the vocabulary. So a set of words might look like this:

#bag, cab, fad
sentences = np.array([[1, 0, 6], [2, 0, 1], [5, 0, 3]])

When you one-hot encode that you will get a matrix 3 x 26:

import numpy as np

vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

#bag, cab, fad
sentences = np.array([[1, 0, 6], [2, 0, 1], [5, 0, 3]])

def one_hot(sequences, dimension=len(vocab)):
    # One row per sample, one column per vocabulary entry.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set every column listed in this sample's index sequence to 1.
        results[i, sequence] = 1
    return results

one_hot(sentences)

This results in three one-hot encoded samples with a 26-letter vocabulary, ready to be fed to a neural network:

array([[1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
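
The index rows above were written by hand. As a rough sketch, they can also be derived from the words themselves with a character lookup. The helper names `encode_words` and `letters_in` are hypothetical, introduced here only for illustration. Note that because several positions per row are set, this is strictly a multi-hot (bag-of-letters) encoding: letter order and repetition are lost.

```python
import numpy as np

vocab = list("abcdefghijklmnopqrstuvwxyz")
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def encode_words(words):
    """Return one row per word, with a 1 in the column of every letter it contains."""
    results = np.zeros((len(words), len(vocab)))
    for i, word in enumerate(words):
        results[i, [char_to_idx[ch] for ch in word]] = 1
    return results

def letters_in(row):
    """Recover the (sorted, de-duplicated) letters marked in one encoded row."""
    return [vocab[j] for j in np.flatnonzero(row)]

encoded = encode_words(["bag", "cab", "fad"])
print(encoded.shape)           # (3, 26)
print(letters_in(encoded[0]))  # ['a', 'b', 'g']
```

Building the rows inside a plain Python loop also means the words do not all need the same length, which a rectangular `np.array` of indices would require.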


My solution, for future readers:

I build the dictionary from the "sent" list:

sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}

Then I look up the indices of my own sentences in the dictionary and assign those numerical values to them.

import numpy as np
sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
   a[xx] = aa[xx]
a = {"b":1, "c":2, "e":4}
aa =len(a)

I extract the indices from the new assignment of "a":

index = []
for x,y in a.items():
    index.append(y)

Then I convert the extracted indices into a NumPy array:

index = np.asarray(index)

Now I create a NumPy array of zeros and mark the position of each character:

new = np.zeros((aa,aaa))
new[np.arange(aa), index] = 1

print(new)

Output:

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]]
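
The steps above can be collected into a single helper, sketched below together with a decoder that maps each row back to its character via `argmax`. The function names `one_hot_rows` and `decode_rows` are my own and not part of NumPy. The sketch assumes every character appears in the vocabulary (an unknown character raises a KeyError).

```python
import numpy as np

def one_hot_rows(chars, vocab):
    """One row per character, with a single 1 at that character's vocabulary index."""
    char_to_idx = {ch: i for i, ch in enumerate(vocab)}
    idx = np.array([char_to_idx[ch] for ch in chars], dtype=int)
    out = np.zeros((len(chars), len(vocab)))
    out[np.arange(len(chars)), idx] = 1
    return out

def decode_rows(rows, vocab):
    """Invert the encoding: the column holding the 1 is the vocabulary index."""
    return [vocab[i] for i in rows.argmax(axis=1)]

sent = ["a", "b", "c", "d", "e", "f", "g"]
new = one_hot_rows(["b", "c", "e"], sent)
print(new)
print(decode_rows(new, sent))  # ['b', 'c', 'e']
```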


Here is another approach using sklearn.preprocessing.

The code is somewhat longer and not much different, but it produces the same result.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

sent = ["a", "b", "c", "d", "e", "f", "g"]
# Map every character to its index: {"a": 0, "b": 1, ..., "g": 6}
a = {x: i for i, x in enumerate(sent)}

# OneHotEncoder expects a 2-D array with one column per feature,
# so each index is wrapped in its own list.
index = np.asarray([[y] for y in a.values()])

# Fit on all 7 indices so the encoder knows every category.
enc = OneHotEncoder()
enc.fit(index)

# Encode the indices of "b", "c" and "e".
print(enc.transform([[1], [2], [4]]).toarray())

Output

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]]
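
A shorter variant, sketched here under the assumption that your scikit-learn version supports the `categories` argument of `OneHotEncoder`, passes the characters directly instead of converting them to indices first. The variable names `X`, `one_hot` and `decoded` are mine.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

sent = ["a", "b", "c", "d", "e", "f", "g"]
sentences = ["b", "c", "e"]

# One feature column, whose allowed categories are the 7 characters in `sent`.
enc = OneHotEncoder(categories=[sent])

# The encoder expects a 2-D input: one row per sample, one column per feature.
X = np.array(sentences).reshape(-1, 1)
one_hot = enc.fit_transform(X).toarray()
print(one_hot)

# inverse_transform maps the rows back to the original characters.
decoded = enc.inverse_transform(one_hot).ravel().tolist()
print(decoded)
```

Fixing the categories up front keeps the output at 7 columns even though only 3 of the characters occur in the data.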


I like to use a LabelEncoder with a OneHotEncoder from sklearn.

import sklearn.preprocessing
import numpy as np

texty_data = np.array(["a", "c", "b"])
le = sklearn.preprocessing.LabelEncoder().fit(texty_data)
integery_data = le.transform(texty_data)
ohe = sklearn.preprocessing.OneHotEncoder().fit(integery_data.reshape((-1,1)))
onehot_data = ohe.transform(integery_data.reshape((-1,1)))

The result is stored as a sparse matrix, which is handy. You can also use a LabelBinarizer to streamline this:

import sklearn.preprocessing
import numpy as np

texty_data = np.array(["a", "c", "b"])
lb = sklearn.preprocessing.LabelBinarizer().fit(texty_data)
onehot_data = lb.transform(texty_data)
print(onehot_data, lb.inverse_transform(onehot_data))
