Gender identification in natural language processing


I have written the code below using the Stanford NLP packages.

GenderAnnotator myGenderAnnotation = new GenderAnnotator();

But for the sentence "Annie goes to school", it does not identify the gender of Annie.

The output of the application is:

     [Text=Annie CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Annie NamedEntityTag=PERSON] 
     [Text=goes CharacterOffsetBegin=6 CharacterOffsetEnd=10 PartOfSpeech=VBZ Lemma=go NamedEntityTag=O] 
     [Text=to CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=TO Lemma=to NamedEntityTag=O] 
     [Text=school CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NN Lemma=school NamedEntityTag=O] 
     [Text=. CharacterOffsetBegin=20 CharacterOffsetEnd=21 PartOfSpeech=. Lemma=. NamedEntityTag=O]

What is the correct approach to get the gender?


There are many approaches; one of them is outlined in the NLTK cookbook.

Basically, you build a classifier that extracts some features (first letter, last letter, first two letters, last two letters, and so on) from a name and makes a prediction based on those features.

import nltk
import random

def extract_features(name):
    # Character-level features: name endings and beginnings carry most of the signal.
    name = name.lower()
    return {
        'last_char': name[-1],
        'last_two': name[-2:],
        'last_three': name[-3:],
        'first': name[0],
        'first2': name[:2]
    }

f_names = nltk.corpus.names.words('female.txt')
m_names = nltk.corpus.names.words('male.txt')

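all_names = [(i, 'm') for i in m_names] + [(i, 'f') for i in f_names]

# Shuffle before splitting; otherwise the first 500 entries are all male names.
random.shuffle(all_names)

test_set = all_names[500:]
train_set = all_names[:500]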

test_set_feat = [(extract_features(n), g) for n, g in test_set]
train_set_feat= [(extract_features(n), g) for n, g in train_set]

classifier = nltk.NaiveBayesClassifier.train(train_set_feat)

print(nltk.classify.accuracy(classifier, test_set_feat))

This basic test gives approximately 77% accuracy; the exact figure depends on how the names are split between the training and test sets.
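To see what such a classifier does with those features, here is a minimal, dependency-free sketch of the same idea. It is not NLTK's implementation: `TinyNaiveBayes`, `predict_gender` and the twelve-name `TOY_NAMES` list are hypothetical and only there for illustration, and the add-one smoothing is an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict


def extract_features(name):
    """Character-level features, following the idea described above."""
    name = name.lower()
    return {
        "last_char": name[-1],
        "last_two": name[-2:],
        "first": name[0],
    }


class TinyNaiveBayes:
    """Illustrative Naive Bayes with add-one smoothing (not NLTK's code)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.label_counts = Counter()
        # (label, feature name) -> counts of each observed feature value
        self.feature_counts = defaultdict(Counter)
        # feature name -> set of all values seen during training
        self.values = defaultdict(set)

    def train(self, labeled_features):
        for feats, label in labeled_features:
            self.label_counts[label] += 1
            for key, value in feats.items():
                self.feature_counts[(label, key)][value] += 1
                self.values[key].add(value)
        return self

    def log_scores(self, feats):
        total = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            for key, value in feats.items():
                count = self.feature_counts[(label, key)][value]
                vocab = len(self.values[key]) + 1  # one extra slot for unseen values
                score += math.log((count + self.smoothing) / (n + self.smoothing * vocab))
            scores[label] = score
        return scores

    def classify(self, feats):
        scores = self.log_scores(feats)
        return max(scores, key=scores.get)


# Hypothetical toy data, far too small for real use; the NLTK names corpus
# used in the answer above is the realistic training source.
TOY_NAMES = [
    ("Anna", "f"), ("Maria", "f"), ("Julia", "f"),
    ("Emma", "f"), ("Sophia", "f"), ("Laura", "f"),
    ("John", "m"), ("Peter", "m"), ("Mark", "m"),
    ("David", "m"), ("Robert", "m"), ("Kevin", "m"),
]

model = TinyNaiveBayes().train((extract_features(n), g) for n, g in TOY_NAMES)


def predict_gender(name):
    return model.classify(extract_features(name))


print(predict_gender("Olivia"))
print(predict_gender("Martin"))
```

With so few names the predictions are driven almost entirely by the last letter, which is also why the real corpus-based classifier gets a long way on endings alone.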


The gender annotator doesn't add the information to the text output but you can still access it through code as shown in the following snippet:

Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,parse,gender");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

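Annotation document = new Annotation("Annie goes to school");

// Run the pipeline; without this call no annotations are attached to the document.
pipeline.annotate(document);

for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
  for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.print(token.value());
    System.out.print(", Gender: ");
    // Assumption: the gender is stored under MachineReadingAnnotations.GenderAnnotation;
    // newer CoreNLP releases may expose it as CoreAnnotations.GenderAnnotation instead.
    System.out.println(token.get(MachineReadingAnnotations.GenderAnnotation.class));
  }
}

Output:

Annie, Gender: FEMALE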
goes, Gender: null
to, Gender: null
school, Gender: null
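If all you have is the text dump from the question, the two answers can be combined: pick out the tokens tagged PERSON and hand only those to a name-based classifier. The following is a rough sketch under that assumption; `parse_token_line` and `person_tokens` are hypothetical helper names, and the regular expression only targets the `Key=Value` layout shown in the question's output.

```python
import re

# Token dump in the format printed in the question.
TOKEN_DUMP = """\
[Text=Annie CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Annie NamedEntityTag=PERSON]
[Text=goes CharacterOffsetBegin=6 CharacterOffsetEnd=10 PartOfSpeech=VBZ Lemma=go NamedEntityTag=O]
[Text=to CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=TO Lemma=to NamedEntityTag=O]
[Text=school CharacterOffsetBegin=14 CharacterOffsetEnd=20 PartOfSpeech=NN Lemma=school NamedEntityTag=O]
"""

# Matches whitespace-separated Key=Value pairs.
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_token_line(line):
    """Turn one '[Key=Value ...]' line into a dict of strings."""
    body = line.strip().strip("[]")
    return dict(PAIR.findall(body))


def person_tokens(dump):
    """Return the text of every token whose NamedEntityTag is PERSON."""
    tokens = [parse_token_line(line) for line in dump.splitlines() if line.strip()]
    return [t["Text"] for t in tokens if t.get("NamedEntityTag") == "PERSON"]


print(person_tokens(TOKEN_DUMP))
```

Each returned name can then be passed through a feature extractor and classifier such as the NLTK one above, so that words like "school" are never classified at all.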
