Unsupervised learning and lessons from the science of personality testing

After completing a personality test as part of a job application, I recently became interested in the science and statistics of personality testing. This is part of the field of psychometrics, which is concerned with the measurement of mental traits and aptitudes. Psychometrics is a fascinating field in itself, but the more I read the more I started to see parallels between the challenges of personality testing and those of unsupervised learning.

In this post I’ll discuss some of the things that make personality testing difficult and how the field of personality testing deals with these challenges. While personality testing does have some unique challenges, there are often lessons to be learned from the field that can be applied to unsupervised learning problems more generally.

Caveat 1: My training is not in psychology, so I cannot do justice to the complex and active field of personality testing. What I hope to do here is to lay out the challenges involved that can be generalised to other problems one might encounter as a data scientist and describe some of the ways the field of personality testing handles those.

Caveat 2: The ethics and efficacy of personality testing in job applications is an important topic that has been written about extensively by others, but that will not be the focus of this post. I suggest reading this piece by Cathy O’Neil for an excellent discussion.

What is personality testing?

A personality test is a test (generally a questionnaire) that is used to assess someone’s personality. What exactly is meant by personality is a complicated topic, so this is a purposely broad definition. As it turns out, it’s almost impossible to define personality without imbuing the definition with some kind of theory. Importantly, different personality tests by design measure different things.

The personality test that most people are familiar with is the Myers-Briggs Type Indicator (MBTI), which classifies people into one of sixteen categories on the basis of four dimensions. You may be surprised to learn the MBTI is not widely used by professional psychologists and has limited scientific support (see these articles for more on this topic). The Big Five is another fairly common test that is more accepted and used by psychologists.

What makes personality testing hard?

There are a number of factors that make assessing someone’s personality through a questionnaire difficult. While some of these are fairly specific to personality testing, others are much more general.

One major difficulty is one that is intrinsic to all unsupervised learning problems: since the data don’t come with labels, how can you tell how well your classification has worked? For example, if I design a personality test that tells me someone is an introvert, how can I be confident that result is correct?

A second difficulty is that by necessity personality tests measure personality indirectly. Imagine that you needed to figure out someone’s height but aren’t allowed to measure them or ask them their height. You can only ask questions like “Do you often bump your head on things?” and “Do you often notice dust on top of other people’s refrigerators?” This would make things quite difficult and should reduce your confidence in your results.

This is compounded by the additional problem of how to define personality and what a personality test should even measure. In the case of measuring height, we at least have a common and fairly uncontroversial definition of what height is. This is not the case for personality. Whatever a personality test measures needs to correspond to some ‘real’ aspect of personality as commonly understood. Otherwise, it can hardly be called a personality test.

A final factor that makes personality testing difficult is that personality tests almost always depend upon introspection. That is, personality tests generally ask you to give your own assessment of your personality and behaviour. Intelligence testing suffers from some of the same challenges as personality testing, but at least intelligence tests measure performance. Depending on how reliable you believe people’s assessments of their own behaviour and qualities to be, this could be a major difficulty. This challenge is fairly specific to personality testing, though surveys and questionnaires that ask people to predict their own behaviour likely suffer from the same problems with the reliability of self-assessment.

What is the goal of personality testing?

Before we can even begin to think about what a personality test should measure and how to determine if a test is good or not, we need to have some idea of what we will use the test for. In general, before you set out to measure something, it’s a useful exercise to think about why you’re measuring it and what you’ll do with that information. This can inform the assumptions you allow and the criteria for success.

Different personality tests might be designed with different goals in mind, and one test might be used in different ways despite its intended use. These considerations must affect the design of the tests themselves and how we assess their results.

Here are some of the reasons you might want to use a personality test:

As a tool for introspection and self-reflection
To predict how someone might behave or perform based on their personality
To inform how you interact with others. For example, if you find out a colleague is an introvert, you may treat them differently than you would if you believe they’re an extrovert
To select people who are better suited to a particular job or task
Entertainment (a lot of people take personality tests online for fun, after all)

Each of these reasons has a quite different set of assumptions and values underlying it. For example, if your goal is to use the results of personality testing to decide who to hire, you’re assuming that there is high-quality evidence that personality test results correlate with performance and that people are honest when filling out personality tests as part of job applications.

These different goals will also have different measures of how well a test has worked. This is why it’s important to be explicit about why you’re measuring the thing you’re measuring.

Before beginning an analysis, think carefully about what your goal is and what decisions will be informed by the result.

What assumptions inform what is included in a personality test?

Invariably, the way a personality test is designed will reflect the theories and assumptions of the designer. These theories and assumptions may be explicit or implicit, and the quality of the test will depend (in part) on how valid they are.

In casual speech we often talk about personality types, such as calling someone a ‘type A’ personality. A different way to think of personality is in terms of personality traits rather than personality types. For example, instead of saying someone is a type A personality, we might say they score highly on conscientiousness. An example of a test that relies on the personality type model is MBTI, which classifies people into one of sixteen categories. In contrast, the Big Five gives people a score for each of five qualities but does not categorise.

The distinction between assessing personality types and personality traits may not seem very important, but there are very different assumptions underlying the two approaches. If you believe that for a given personality trait most people will be close to average and very few will be at the extremes, splitting a population down the middle into two categories doesn’t make a lot of sense. For example, calling someone who scores 51/100 on an introversion/extroversion scale an extrovert but someone who scores 49/100 an introvert seems arbitrary. These two people are more similar to each other (at least in terms of this trait) than either is to someone scoring at an extreme of the scale. It really only makes sense to split a population in two like this if the distribution of the trait is bimodal; that is, there are many people close to either extreme and few in the middle.
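
To make this concrete, here is a minimal sketch using simulated scores. The two distributions, the cut at 50 and the 5-point margin are all invented for illustration and are not taken from any real test. It counts how many people sit right next to the dividing line in each case.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def clamp(x, lo=0.0, hi=100.0):
    """Keep a simulated score on the 0-100 scale."""
    return max(lo, min(hi, x))


n = 10_000
# Hypothetical unimodal trait: most people are close to the average of 50.
unimodal = [clamp(random.gauss(50, 10)) for _ in range(n)]
# Hypothetical bimodal trait: two clusters centred on 25 and 75.
bimodal = [clamp(random.gauss(random.choice([25, 75]), 8)) for _ in range(n)]


def share_near_cut(scores, cut=50, margin=5):
    """Fraction of people whose score lies within `margin` points of the cut."""
    return sum(abs(s - cut) <= margin for s in scores) / len(scores)


print(f"unimodal: {share_near_cut(unimodal):.1%} within 5 points of the cut")
print(f"bimodal:  {share_near_cut(bimodal):.1%} within 5 points of the cut")
```

With these made-up parameters, well over a third of the unimodal population falls within 5 points of the cut, so a large share of people would be labelled on the basis of a very small difference. In the bimodal case only a tiny fraction is affected.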

Another assumption underlying tests like MBTI is that the aspects of personality being measured are independent. This must be true to end up with 16 categories based on four traits.

Here are two examples to show why this is the case:

In the case on the left, we have two variables that each have a bimodal distribution (so it does make sense to divide the population into two). When we plot them against each other, we get four pretty convincing looking groups (indicated with ellipses).

However, in the second case we have only two distinct groupings, even though each variable has a bimodal distribution and so we could sensibly divide the population into two categories. This is because the two variables are correlated with each other. So, just about everyone who scores highly in the first personality quality also scores highly in the second. Two of the theoretically possible groupings don’t exist.
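
The two cases can also be reproduced with a small simulation. This is only a sketch under assumed numbers: the cluster centres (25 and 75), the spread and the perfectly shared cluster membership in the correlated case are my own choices for illustration. It splits each variable at 50 and reports how many people land in each of the four possible groups.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the sketch is repeatable


def bimodal_score(high):
    """Draw a score from the high cluster (around 75) or the low cluster (around 25)."""
    return random.gauss(75 if high else 25, 8)


def quadrant(x, y, cut=50):
    """Label a person by which side of the cut they fall on for each variable."""
    return ("high" if x > cut else "low", "high" if y > cut else "low")


def simulate(n, correlated):
    """Return the share of people in each (first, second) group."""
    counts = Counter()
    for _ in range(n):
        first_high = random.random() < 0.5
        # Correlated case: the second variable follows the first.
        # Independent case: the second variable gets its own coin flip.
        second_high = first_high if correlated else random.random() < 0.5
        counts[quadrant(bimodal_score(first_high), bimodal_score(second_high))] += 1
    return {group: count / n for group, count in counts.items()}


independent = simulate(5_000, correlated=False)
correlated = simulate(5_000, correlated=True)

groups = [("low", "low"), ("low", "high"), ("high", "low"), ("high", "high")]
for name, shares in [("independent", independent), ("correlated", correlated)]:
    print(name)
    for group in groups:
        print(f"  {group}: {shares.get(group, 0):.1%}")
```

In the independent case all four groups are well populated. In the correlated case almost everyone is either low on both variables or high on both, even though each variable on its own is still clearly bimodal.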

The assumption that the aspects of personality measured by MBTI are both bimodal and independent is thus central to the design and interpretation of the test.

Inevitably your analysis will include a lot of assumptions, such as which variables may be important. Be aware of these assumptions and make sure they are justified.

How do you decide what qualities to assess in a personality test?

Different personality tests measure different things. For example, MBTI measures introversion/extroversion, sensing/intuition, thinking/feeling, and judging/perception. The Big Five model of personality includes extroversion, conscientiousness, agreeableness, neuroticism, and openness to experience. Out of all the possible personality traits, how are the traits to measure decided upon? In unsupervised learning, this question is analogous to the decision of which variables to measure and eventually include.

If you try to assess dozens of different personality traits and it turns out some are highly correlated and effectively the same thing (for example, kindness and generosity are probably highly correlated), you can exclude some. However, what you never measure can’t be included. If you decide that introversion is not an important aspect of personality and don’t ask any questions about it in your questionnaire, you won’t be able to measure it. There is therefore a lot at stake when it comes to selecting the traits to measure.
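
As a rough sketch of the first point, the snippet below simulates three traits and flags pairs that are so strongly correlated that one of them is probably redundant. The data are synthetic, the shared "warmth" disposition behind kindness and generosity is a hypothetical device to make them correlate, and the 0.8 threshold is an arbitrary choice.

```python
import random
from itertools import combinations
from math import sqrt

random.seed(2)  # fixed seed so the sketch is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


n = 2_000
# Hypothetical shared disposition that drives both kindness and generosity.
warmth = [random.gauss(0, 1) for _ in range(n)]
traits = {
    "kindness": [w + random.gauss(0, 0.3) for w in warmth],
    "generosity": [w + random.gauss(0, 0.3) for w in warmth],
    "orderliness": [random.gauss(0, 1) for _ in range(n)],  # unrelated trait
}

threshold = 0.8  # arbitrary cut-off chosen for illustration
redundant = [
    (a, b)
    for a, b in combinations(traits, 2)
    if abs(pearson(traits[a], traits[b])) > threshold
]
print(redundant)
```

Note what this check cannot do: it can only compare traits that were measured in the first place. A trait left out of the questionnaire never appears in `traits`, so no amount of analysis will bring it back.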

Invariably theory will play a role here. MBTI is informed by Jung, so if you don’t accept Jung’s theory of personality you should perhaps be sceptical of MBTI. Another approach is to measure many different aspects of personality and then deal with the question of how many aspects there really are in the analysis (I’ll discuss this more later). Even in this case, it’s impossible to be truly unbiased in choosing what to measure. The best you can do is to be explicit about your assumptions, make sure what you measure is relevant to your goal, and be mindful that by necessity you can only ever consider a subset of all possible variables.

Decisions about what variables to measure and include are crucial.

What do personality tests actually measure?

I’ve so far glossed over what it is that personality tests actually measure. Clearly personality is not something like height where there is an objectively true answer. Personality is very difficult to directly assess and is unlikely to be reducible to a single value.

Variables that cannot be directly measured are sometimes referred to as latent variables. A personality test might include 100 questions, but how people answer these questions may be driven by a much smaller number of underlying personality traits. The method commonly used to discover these latent variables is factor analysis.
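
The idea of a few latent variables driving many observed answers can be sketched with a simulation. This is not factor analysis itself, only the kind of data-generating picture it assumes: two hypothetical latent traits, six made-up questionnaire items (three per trait), and loadings and noise levels that I picked for illustration.

```python
import random
from math import sqrt

random.seed(3)  # fixed seed so the sketch is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


n_people = 2_000
# Two hypothetical latent traits; in a real test these are never observed.
latent = {
    "extroversion": [random.gauss(0, 1) for _ in range(n_people)],
    "conscientiousness": [random.gauss(0, 1) for _ in range(n_people)],
}

# Six observed items: each is a noisy reflection of exactly one latent trait.
items = {}
for trait, values in latent.items():
    for i in range(3):
        items[f"{trait}_q{i + 1}"] = [0.8 * v + random.gauss(0, 0.6) for v in values]

# Compare correlations between items on the same trait and on different traits.
names = list(items)
within, between = [], []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(items[a], items[b])
        same_trait = a.rsplit("_", 1)[0] == b.rsplit("_", 1)[0]
        (within if same_trait else between).append(r)

print(f"mean within-trait correlation:  {sum(within) / len(within):.2f}")
print(f"mean between-trait correlation: {sum(between) / len(between):.2f}")
```

Items that share a latent trait correlate strongly with each other and barely at all with items from the other trait. That block structure in the correlations between answers is the kind of pattern from which underlying traits are inferred.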

Factor analysis is similar to principal component analysis (PCA), which is a better-known method in unsupervised learning. Like PCA, factor analysis helps to reveal the hidden structure in data. This is accomplished by reducing the number of variables by creating combinations of existing variables. Though there are important methodological and technical differences that I won’t go into (that will be the focus of another post), a major difference between the two is that in factor analysis these new variables are the latent variables that you’re actually interested in. They can’t be measured directly, but we can infer their existence through related variables such as responses to survey questions. Like PCA, factor analysis deals well with correlated variables.

Factor analysis seems to be most commonly used in psychometrics because that field deals a lot with measuring things that can’t be directly measured. PCA is much more common in data science, but depending on your dataset factor analysis may also be applicable.

Consider whether the data you have directly measure what you’re really interested in, and if not, how to deal with this in your analysis.

How do you know if a personality test is good?

If we think about personality tests as analogous to unsupervised learning problems, understanding how the quality of personality tests is assessed can provide a useful example of how to think about assessing other unsupervised analyses.

Since personality tests measure something very abstract, how can we tell if they’re any good or not? After all, people don’t come with labels that you can compare personality test results to. I can’t make any claim to the effect that my test successfully identifies 99% of introverts, because there’s no gold standard way of identifying introverts. On the other hand, this doesn’t mean there are no standards at all. There are definitely some results that would make you think the test was questionable.

One thing that is probably not a good measure of how good a personality test is is how a person feels about their result. While we’ve probably all had the experience of taking a personality test and feeling that the description we get back of our personality type is eerily accurate, this is probably linked to the Forer effect. This refers to people’s tendency to find very vague descriptions of their personality highly specific and insightful.

Some standards that can be applied to personality tests are laid out in the Standards for Educational and Psychological Testing. Reliability and validity are both important: a reliable test will give similar results when administered repeatedly to the same person, whereas a valid test is one that actually measures the thing it’s meant to measure. Ideally a test would be both reliable and valid.
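
The difference between the two can be sketched with simulated data. Everything here is an assumption made for illustration: the three hypothetical tests, the noise levels, and above all the fact that the simulation knows each person’s true trait, which is never the case with a real personality test. Reliability is approximated by the correlation between two sittings of the same test (test-retest), and validity by the correlation with the trait we set out to measure.

```python
import random
from math import sqrt

random.seed(4)  # fixed seed so the sketch is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


n = 2_000
introversion = [random.gauss(0, 1) for _ in range(n)]  # the trait we want to measure
tidiness = [random.gauss(0, 1) for _ in range(n)]      # an unrelated trait


def administer(trait, noise_sd):
    """One sitting of a hypothetical test: the underlying trait plus measurement noise."""
    return [t + random.gauss(0, noise_sd) for t in trait]


# (trait the test actually picks up, amount of measurement noise)
tests = {
    "reliable and valid": (introversion, 0.3),
    "unreliable": (introversion, 2.0),
    "reliable but not valid": (tidiness, 0.3),
}

results = {}
for name, (measured_trait, noise_sd) in tests.items():
    first = administer(measured_trait, noise_sd)
    second = administer(measured_trait, noise_sd)
    results[name] = {
        "test_retest": pearson(first, second),
        "vs_true_trait": pearson(first, introversion),
    }
    print(
        f"{name}: test-retest r = {results[name]['test_retest']:.2f}, "
        f"r with true introversion = {results[name]['vs_true_trait']:.2f}"
    )
```

The third test is the instructive one. It gives almost the same answer every time, so it looks excellent on reliability, yet it tells you next to nothing about introversion. Consistency alone says nothing about whether a test measures the right thing.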

There are several different types of validity. Content validity considers whether or not a test adequately assesses different aspects of the quality it purports to measure. For instance, an intelligence test that only asked mathematics questions would not properly assess other types of intelligence and so would have low content validity.

Construct validity refers to how well a test measures the thing it’s meant to measure. This is obviously pretty hard to determine, though if a test does a very bad job it should be clear. This is where you really need to examine the assumptions behind the test. For example, a personality test that asks you questions that depend upon very specific cultural knowledge would have poor construct validity, since if you don’t have that knowledge your results will be completely different even though that knowledge is irrelevant to the trait being measured.

Criterion validity looks at the link between the test result and some outcome. For example, if you had a test to see if someone is a sociopath, showing that a high proportion of the people your test identifies as sociopaths end up committing a violent crime would suggest good criterion validity. In terms of personality testing, it’s not very clear what outcome you would attempt to link the result to, so this is less often referred to.

Despite these standards, it’s clearly very hard (if not impossible) to assess if a personality test is true. The better question may be whether or not it’s useful, and that will depend on what you’re using it for.

The appropriate way to assess how well an unsupervised analysis has performed will depend on what the goals of the analysis are.

Summary

If we think about personality testing as a specific example of unsupervised learning (albeit with some additional complexities), we can by analogy learn from the challenges involved in designing and analysing personality tests. Here is a summary of some general lessons we can learn from the field of personality testing:

Before beginning an analysis, think carefully about what your goal is and what decisions will be informed by the result.
Inevitably your analysis will have implicit assumptions built in, for example about what variables may be important or the desired result. Be aware of these assumptions where possible and make sure they are justified.
Decisions about what variables to measure and include are crucial.
Consider whether the data you have directly measure what you’re really interested in, and if not, how to deal with this in your analysis.
The appropriate way to assess how well an unsupervised analysis has performed will depend on what the goals of the analysis are. This is a difficult problem that is common to unsupervised learning generally.

Resources

An introduction to psychometric theory with applications in R: This online book delves much deeper into the science and statistics of psychometric theory. See also the related psych package.
psych package: An R package with many useful functions relevant to analysing personality tests.
How algorithms rule our working lives by Cathy O’Neil: A fantastic article that discusses some of the ethical issues related to using personality tests as part of job applications. This is based on a chapter from her book Weapons of Math Destruction, which I also recommend.
Why the Myers-Briggs test is totally meaningless by Joseph Stromberg and Estelle Caswell: Discusses some of the problems with MBTI.
Nothing personal: The questionable Myers-Briggs test by Dean Burnett: Another good piece about MBTI.