Data for Human Identities across Nations of the Earth, Ngram Investigator (HINENI)

This web tool allows for quick exploration of small sets of nations and search terms. You are welcome and encouraged to download the entire HINENI dataset so that you can work independently with tools of your choosing.

I have hosted a copy of the annual, cross-sectional incidences and prevalences on the Open Science Framework. Download the data as hineni.csv from https://osf.io/download/k7bwj/.

The file hineni.csv contains signifier prevalence data from 32 nations for the years 2012-2023.

ngram - A signifier consisting of one to five linguistic tokens (e.g. words, emojis, abbreviations) that was observed in many Twitter users' profile bios. Only ngrams that rise above a threshold of 1 per 10,000 users are included.
nation - A two-letter country code specifying the nation. Counts in other columns are from profiles geocoded to this nation. Uses the ISO 3166 standard.
obsYear - The year over which we have observed profiles. Profiles were observed from tweeting users. One profile bio (selected at random) was retained per user per year.
prevalence - Per 10,000 unique users observed in this nation in this year, the whole number of users one would expect to include ngram in their bio.
numerator - Also called incidence. The raw count of unique users in this nation in this year who were observed with ngram in their bio.
denominator - Also called total accounts. The raw count of unique users observed in this nation in this year.
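
The relationship between the three count columns can be sketched as below. This is a minimal illustration, not the method used to build the dataset: the function name per_10k is hypothetical, the example counts are made up, and how the published prevalence was reduced to a whole number (rounding or truncation) is not stated here, so the sketch returns the unrounded rate.

```python
def per_10k(numerator: int, denominator: int) -> float:
    """Return the unrounded number of users per 10,000 observed with an ngram.

    numerator   -- unique users with the ngram in their bio (incidence)
    denominator -- all unique users observed in that nation and year
    """
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 10_000 * numerator / denominator


# Made-up counts for illustration only; they are not taken from hineni.csv.
rate = per_10k(numerator=250, denominator=100_000)
print(rate)  # 25.0
```

Comparing this value with the prevalence column for a few rows is a quick way to check that a downloaded copy of the file has been read correctly.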

The counts are based on cross-sectional, annual samples of tweeting users.
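
Because each row is one ngram in one nation in one year, a yearly series can be assembled by filtering rows. The sketch below uses only the Python standard library and assumes that hineni.csv has a header row whose column names match the list above; the function name yearly_series and the sample rows are hypothetical and made up for illustration.

```python
import csv
import io


def yearly_series(csv_file, ngram, nation):
    """Return {obsYear: prevalence} for one ngram in one nation, sorted by year.

    csv_file is any open text file or file-like object with a header row
    naming the columns ngram, nation, obsYear and prevalence (an assumption).
    """
    series = {}
    for row in csv.DictReader(csv_file):
        if row["ngram"] == ngram and row["nation"] == nation:
            series[int(row["obsYear"])] = float(row["prevalence"])
    return dict(sorted(series.items()))


# Made-up rows in the assumed layout; these are not real HINENI values.
SAMPLE = """ngram,nation,obsYear,prevalence,numerator,denominator
mother,US,2013,32,320,100000
mother,US,2012,30,300,100000
mother,CA,2012,28,56,20000
"""

print(yearly_series(io.StringIO(SAMPLE), "mother", "US"))
# {2012: 30.0, 2013: 32.0}
```

With the downloaded file, the same function could be called as yearly_series(open("hineni.csv", newline="", encoding="utf-8"), "mother", "US"); the UTF-8 encoding is an assumption. Reading the nation column as plain text, as the csv module does, also avoids a two-letter code such as NA being mistaken for a missing value by tools that treat that string specially.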