Ipseology

the study of human identity
using large datasets
and computational methods

Who-am-I Data Freely and Publicly Available

Created: 2023-09-18

Skip directly to data download links.

I study identity. Some might prefer the term self-concept. Regardless, to study how individuals express their identity or think about their selves, one needs data. The best data is language, and it comes in the form of responses to the prompt Who am I?

The best known of the Who am I? instruments is Kuhn & McPartland's Twenty Statements Test (TST). In their seminal 1954 paper, Kuhn and McPartland say they are measuring "self-attitudes." The TST prompt is as plain and straightforward as one might hope:

"There are twenty numbered blanks on the page below. Please write twenty answers to the simple question 'Who am I?' in the blanks. Just give twenty different answers to this question. Answer as if you were giving the answers to yourself, not to somebody else. Write the answers in the order that they occur to you. Don't worry about logic or 'importance.' Go along fairly fast, for time is limited."

Modern Who Am I Texts

I argue that the Twitter profile bio is the modern-day equivalent of the Who-am-I instrument. The utility of this data has been lessened by the Musk takeover, but that does not change what happened before: for the period 2012-2022, millions of individuals in countries around the world publicly expressed and revised their identities. In my Ipseology white paper, I implore researchers to take advantage of this unprecedented decade.

Start exploring the relative popularity of words, phrases and emojis within Americans' profile bios by using Jason Jeffrey Jones Identity Trends V2. I deliberately patterned this tool after Google Search Trends and Google Ngrams, and I built it so anyone could compare a decade of data for up to 10 keywords. Read more and try some of my favorite searches.

You want .csv files? You can have .csv files. In this table, I link to the most up-to-date data files I have compiled. I have made these freely and publicly available under the terms of the CC BY 4.0 License.

Dataset: Annual Prevalence of American Twitter Users with specified Token in their Profile Bio
Description: Incidence (raw count) and prevalence (normalized proportion) of unique US Twitter user accounts that contain each token. Tokens are mostly words, but also contain abbreviations, emojis and more. TokensAnnualCross.csv is the final, updated version covering 2012 through 2023. README for TokensAnnualCross.csv
Download: Download TokensAnnualCross.csv
Reference: Jones, Jason Jeffrey (2021). A dataset for the study of identity at scale: Annual Prevalence of American Twitter Users with specified Token in their Profile Bio 2015–2020. PLOS ONE, 16(11), e0260185. PDF

Dataset: Longitudinal 2015-2022 US Annual Prevalence Subsample
Description: Subsample of 680,509 unique US accounts that were observed each and every year 2015 through 2022. Incidence (raw count) and prevalence (normalized proportion) of accounts that contain each token. README for TokensAnnualLongi.csv
Download: Download TokensAnnualLongi.csv
Reference: Jones, Jason Jeffrey (2021). A dataset for the study of identity at scale: Annual Prevalence of American Twitter Users with specified Token in their Profile Bio 2015–2020. PLOS ONE, 16(11), e0260185. PDF
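
To make the two measures in these files concrete, here is a minimal sketch of how incidence and prevalence could be computed from a handful of bios. It is an illustration only, not the pipeline that produced the .csv files. The function name token_prevalence, the lowercase-and-split-on-whitespace tokenization, and the toy bios are all assumptions made for this example. Consult the README files for how the real tokens were defined.

```python
from collections import Counter


def token_prevalence(bios):
    """Return {token: (incidence, prevalence)} for a list of profile bios.

    incidence  = number of bios that contain the token at least once
    prevalence = incidence divided by the total number of bios
    """
    total = len(bios)
    incidence = Counter()
    for bio in bios:
        # Assumption: tokens are lowercase, whitespace-separated strings.
        # Using a set means each account counts at most once per token.
        incidence.update(set(bio.lower().split()))
    return {token: (count, count / total) for token, count in incidence.items()}


# Invented placeholder bios, not taken from the dataset.
toy_bios = [
    "Mother wife teacher",
    "teacher and runner",
    "runner runner runner",
    "Proud father",
]

stats = token_prevalence(toy_bios)
print(stats["teacher"])  # (2, 0.5)
print(stats["runner"])   # (2, 0.5)
```

Note that "runner" appears three times in one bio but still counts once for that account, which matches the description of counting unique accounts rather than occurrences.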

Ipseology - Read and explore more

If this post intrigued you, check out more ipseology: