Ipseology

the study of human identity using large datasets and computational methods

Introducing Jason Jeffrey Jones Identity Trends V2

Tables and examples describing the data behind the interface.

2023-04-06

Which words do Americans choose to describe themselves? The latest version of Jason Jeffrey Jones Identity Trends is ready to tell you, with several feature improvements over V1.

Of course, you can still search for good old-fashioned single tokens: marvel at the trajectories of vegan, vegetarian and carnivore, or of mom, dad, mother and father.

If you're craving details, check out the tables below or read the original peer-reviewed open-access research article. If you're ready to query your own words/phrases/emojis, start here.

 

How many years, users, words and phrases are in the data?

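| Year | Unique user count | Unique unigram count | Unique bigram count | Unique 3-gram count | Unique 4-gram count | Unique 5-gram count |
|------|-------------------|----------------------|---------------------|---------------------|---------------------|---------------------|
| 2012 | 9,947,225 | 14,310 | 30,517 | 15,175 | 5,070 | 1,916 |
| 2013 | 11,395,106 | 13,279 | 30,134 | 15,178 | 4,812 | 1,728 |
| 2014 | 8,891,764 | 12,987 | 28,268 | 11,999 | 3,368 | 1,098 |
| 2015 | 8,564,955 | 13,200 | 27,696 | 11,096 | 3,014 | 990 |
| 2016 | 10,227,688 | 12,891 | 25,712 | 9,927 | 2,636 | 896 |
| 2017 | 10,638,679 | 13,012 | 24,682 | 9,335 | 2,484 | 893 |
| 2018 | 10,310,854 | 13,016 | 24,087 | 9,087 | 2,379 | 860 |
| 2019 | 9,817,008 | 13,038 | 23,785 | 8,723 | 2,284 | 810 |
| 2020 | 10,181,678 | 13,095 | 23,779 | 8,954 | 2,661 | 1,211 |
| 2021 | 8,170,309 | 13,702 | 24,917 | 8,931 | 2,436 | 912 |
| 2022 | 7,605,856 | 13,843 | 25,287 | 8,958 | 2,393 | 856 |
| 2023 | 3,000,501 | 14,312 | 26,968 | 9,458 | 2,314 | 674 |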

 

Wondering what a prevalence is? Use this reference.

A prevalence is a whole number that tells you how many users per 10,000 include a word, phrase or emoji within their bio. In ipseology it is the preferred measure, because it allows for easy comparison across time and place.
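The arithmetic can be sketched in a few lines. This is only an illustration, not the actual pipeline: the function name `prevalence`, the choice of rounding to the nearest whole number, and the count of 2,282 users are all assumptions made for the example. Only the 2022 user total comes from the table above.

```python
def prevalence(users_with_term: int, total_users: int) -> int:
    """Users per 10,000 whose bio contains the term, as a whole number."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    # Scale the proportion to "per 10,000" and round to a whole number.
    # Rounding to nearest is an assumption; the text only says "whole number".
    return round(10_000 * users_with_term / total_users)


# Hypothetical count: suppose 2,282 of the 7,605,856 users sampled in 2022
# include some term in their bio.
print(prevalence(2_282, 7_605_856))  # 3
```

Because the result is scaled by the number of users observed, values from years with very different sample sizes can be compared directly.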

The prevalence distribution of JJJITV2 has a large head, while most of what you are probably interested in sits in the long tail. I call the head large because more than 50% of the words, phrases and emojis that make it into the data just barely clear the 1 per 10,000 minimum criterion. Within the 2022 data, the first quartile and median prevalence values are both 1, and the third quartile still only reaches 2, even though the mean prevalence is 3.9!
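To see how a mean can sit above the third quartile, here is a minimal sketch on made-up numbers. The list `toy_prevalences` is invented for illustration and is not drawn from the real data; it simply mimics the shape described above, with many values of 1 and a few large ones.

```python
import statistics

# Invented toy values: mostly 1s, plus a short tail of larger prevalences.
toy_prevalences = [1, 1, 1, 1, 1, 1, 2, 2, 8, 30]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(toy_prevalences, n=4, method="inclusive")
mean = statistics.mean(toy_prevalences)

print(f"Q1={q1}, median={median}, Q3={q3}, mean={mean}")
```

A handful of high-prevalence terms pulls the mean well past the third quartile, which is the same pattern the 2022 figures show.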

Terms in the tail have more variance. The table below shows a few examples from 2022 starting at the 81st percentile.

| Ngram examples | Prevalence | Percentile |
|----------------|------------|------------|
| mirror, my children, nft artist, overwatch, usaf vet, you need to know, ♏, 🇺🇸🇺🇸🇺🇸, 🌊🌊🌊 | 2 | 81st |
| cavs, climate change, freelance writer, milwaukee, scifi, school teacher, student athlete, truth seeker, views are mine, 📌, 🙏🏼 | 3 | 86th |
| baptist, cowboy, millennial, nft enthusiast, pro - life, tattoos, taylor swift, traditional, your dreams, 🦄, 🌡, 💀 | 4 | 90th |
| aquarius, believer in, latina, librarian, proud father, punk, usmc, yoga, 😘, ⚽️, 🌴 | 8 | 95th |
| beer, dogs, jesus, lover of, nerd, photographer, she / they, trump, vet, woman, !!!, 🏳️‍🌈, 💕 | 30 or more | 99th |
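One way to read the percentile column is as a percentile rank: the share of all terms whose prevalence is at or below a given value. The sketch below assumes that definition, which the post does not spell out, and uses an invented list of prevalences and a hypothetical helper named `percentile_rank`.

```python
def percentile_rank(values: list[int], x: int) -> float:
    """Percent of values that are less than or equal to x."""
    if not values:
        raise ValueError("values must not be empty")
    # Assumed definition: "at or below"; other conventions exist.
    return 100 * sum(v <= x for v in values) / len(values)


# Invented toy prevalences, not the real 2022 data.
toy_prevalences = [1, 1, 1, 1, 1, 1, 2, 2, 8, 30]
print(percentile_rank(toy_prevalences, 2))  # 80.0
```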

 

But I just want the data.

No problem: download all of the ngram prevalence data for US Twitter users, 2012-2023, to serve your own analyses and visualizations.

 

Ipseology - Read and explore more

If this post intrigued you, check out more ipseology: