GitHub

namesex_light

Namesex_light is a lighweight package that predicts the gender tendency of Chinese given names. This module comes with a L2 regularized logistic regression trained on 10,730 Chinese given names (in traditional Chinese) with reliable gender lables collected from public data. The predict() function takes a list of names and output predicted gender tendency (1 for male and 0 for female) or probability of being a male name. Namesex_light has a sister project, namesex, that performs similar tasks with higher accuracy.

Additional information about namesex and namesex_light can be found in another document (in Chinese).

The prediction performance evaluated by ten-fold cross validation is:

Metric	Performance	Performance Std. Dev.
Accuracy	0.8957	0.007327
F1	0.8920	0.007873
Precision	0.8852	0.012238
Recall	0.8991	0.008936
Logloss	114.35	6.413972

Use pip/pip3 to install namesex_light.:

pip install namesex_light

To use namesex_light, pass in an array or list of given names to predict(). For each element in the input list, predict() returns 1 or 0 for male or female prediction. Set "predprob = True" to return probability of being a male name. The following is a simple sample code.:

>>> import namesex_light
>>> nsl = namesex_light.namesex_light()
>>> nsl.predict(['民豪', '愛麗', '志明'])
array([1, 0, 1])
>>> nsl.predict(['民豪', '愛麗', '志明'], predprob=True)
array([0.99968932, 0.00530066, 0.9938986 ])

Note that namesex_light was trained using Chinese given names only. However, it may be used to classifier translated names as well:

>>> nsl.predict(['阿波羅', '阿波羅', '雷', '艾美', '布蘭妮', '阿曼達'])
array([1, 1, 1, 0, 0, 1])

This module is intended for a quick plug-and-play. The original training dataset is not included.

Testing Dataset

This package comes with a small testing dataset that was not used for model training. The following sample code illustrate a simple usage.:

>>> testdata = namesex_light.testdata()
>>> nsl = namesex_light.namesex_light()
>>> pred = nsl.predict(testdata.gname)
>>> print("The first 5 given names are: {}".format(testdata.gname[0:5]))
The first 5 given names are: ['翊如', '妤庭', '諆璋', '大閎', '和維']
>>> print("    and their sex:          {}".format(testdata.sex[0:5]))
    and their sex:          [0, 0, 1, 1, 1]
>>> print("    and their predicted sex:{}".format(pred[0:5]))
    and their predicted sex:[0 0 1 1 1]
>>> accuracy = np.sum(pred == testdata.sex) / len(pred)
>>> print(" Prediction accuracy = {}".format(accuracy))
 Prediction accuracy = 0.8627450980392157

Note that the accuracy is slightly lower compared to the accuracy of ten-fold cross valudation. I guess this is normal since this testset is collected from a source that is different from the training dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
namesex_light		namesex_light
.gitignore		.gitignore
LICENSE		LICENSE
MAINFEST.in		MAINFEST.in
README.rst		README.rst
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

namesex_light

Testing Dataset

About

Releases

Packages

Languages

License

hsinmin/namesex_light

Folders and files

Latest commit

History

Repository files navigation

namesex_light

Testing Dataset

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages