They're Watching You at Work

What happens when Big Data meets human resources? The emerging practice of "people analytics" is already transforming how employers hire, fire, and promote.

In 2003, thanks to Michael Lewis and his best seller Moneyball, the general manager of the Oakland A’s, Billy Beane, became a star. The previous year, Beane had turned his back on his scouts and had instead entrusted player-acquisition decisions to mathematical models developed by a young, Harvard-trained statistical wizard on his staff. What happened next has become baseball lore. The A’s, a small-market team with a paltry budget, ripped off the longest winning streak in American League history and rolled up 103 wins for the season. Only the mighty Yankees, who had spent three times as much on player salaries, won as many games. The team’s success, in turn, launched a revolution. In the years that followed, team after team began to use detailed predictive models to assess players’ potential and monetary value, and the early adopters, by and large, gained a measurable competitive edge over their more hidebound peers.

That’s the story as most of us know it. But it is incomplete. What would seem at first glance to be nothing but a memorable tale about baseball may turn out to be the opening chapter of a much larger story about jobs. Predictive statistical analysis, harnessed to big data, appears poised to alter the way millions of people are hired and assessed.

Yes, unavoidably, big data. As a piece of business jargon, and even more so as an invocation of coming disruption, the term has quickly grown tiresome. But there is no denying the vast increase in the range and depth of information that’s routinely captured about how we behave, and the new kinds of analysis that this enables. By one estimate, more than 98 percent of the world’s information is now stored digitally, and the volume of that data has quadrupled since 2007. Ordinary people at work and at home generate much of this data, by sending e-mails, browsing the Internet, using social media, working on crowd-sourced projects, and more—and in doing so they have unwittingly helped launch a grand new societal project. “We are in the midst of a great infrastructure project that in some ways rivals those of the past, from Roman aqueducts to the Enlightenment’s Encyclopédie,” write Viktor Mayer-Schönberger and Kenneth Cukier in their recent book, Big Data: A Revolution That Will Transform How We Live, Work, and Think. “The project is datafication. Like those other infrastructural advances, it will bring about fundamental changes to society.”

Some of the changes are well known, and already upon us. Algorithms that predict stock-price movements have transformed Wall Street. Algorithms that chomp through our Web histories have transformed marketing. Until quite recently, however, few people seemed to believe this data-driven approach might apply broadly to the labor market.

But it now does. According to John Hausknecht, a professor at Cornell’s school of industrial and labor relations, in recent years the economy has witnessed a “huge surge in demand for workforce-analytics roles.” Hausknecht’s own program is rapidly revising its curriculum to keep pace. You can now find dedicated analytics teams in the human-resources departments of not only huge corporations such as Google, HP, Intel, General Motors, and Procter & Gamble, to name just a few, but also companies like McKee Foods, the Tennessee-based maker of Little Debbie snack cakes. Even Billy Beane is getting into the game. Last year he appeared at a large conference for corporate HR executives in Austin, Texas, where he reportedly stole the show with a talk titled “The Moneyball Approach to Talent Management.” Ever since, that headline, with minor modifications, has been plastered all over the HR trade press.

The application of predictive analytics to people’s careers—an emerging field sometimes called “people analytics”—is enormously challenging, not to mention ethically fraught. And it can’t help but feel a little creepy. It requires the creation of a vastly larger box score of human performance than one would ever encounter in the sports pages, or that has ever been dreamed up before. To some degree, the endeavor touches on the deepest of human mysteries: how we grow, whether we flourish, what we become. Most companies are just beginning to explore the possibilities. But make no mistake: during the next five to 10 years, new models will be created, and new experiments run, on a very large scale. Will this be a good development or a bad one—for the economy, for the shapes of our careers, for our spirit and self-worth? Earlier this year, I decided to find out.

Ever since we’ve had companies, we’ve had managers trying to figure out which people are best suited to working for them. The techniques have varied considerably. Near the turn of the 20th century, one manufacturer in Philadelphia made hiring decisions by having its foremen stand in front of the factory and toss apples into the surrounding scrum of job-seekers. Those quick enough to catch the apples and strong enough to keep them were put to work.

In those same times, a different (and less bloody) Darwinian process governed the selection of executives. Whole industries were being consolidated by rising giants like U.S. Steel, DuPont, and GM. Weak competitors were simply steamrolled, but the stronger ones were bought up, and their founders typically were offered high-level jobs within the behemoth. The approach worked pretty well. As Peter Cappelli, a professor at the Wharton School, has written, “Nothing in the science of prediction and selection beats observing actual performance in an equivalent role.”

By the end of World War II, however, American corporations were facing severe talent shortages. Their senior executives were growing old, and a dearth of hiring from the Depression through the war had resulted in a shortfall of able, well-trained managers. Finding people who had the potential to rise quickly through the ranks became an overriding preoccupation of American businesses. They began to devise a formal hiring-and-management system based in part on new studies of human behavior, and in part on military techniques developed during both world wars, when huge mobilization efforts and mass casualties created the need to get the right people into the right roles as efficiently as possible. By the 1950s, it was not unusual for companies to spend days with young applicants for professional jobs, conducting a battery of tests, all with an eye toward corner-office potential. “P&G picks its executive crop right out of college,”

BusinessWeek noted in 1950, in the unmistakable patter of an age besotted with technocratic possibility. IQ tests, math tests, vocabulary tests, professional-aptitude tests, vocational-interest questionnaires, Rorschach tests, a host of other personality assessments, and even medical exams (who, after all, would want to hire a man who might die before the company’s investment in him was fully realized?)—all were used regularly by large companies in their quest to make the right hire.

The process didn’t end when somebody started work, either. In his classic 1956 cultural critique, The Organization Man, the business journalist William Whyte reported that about a quarter of the country’s corporations were using similar tests to evaluate managers and junior executives, usually to assess whether they were ready for bigger roles. “Should Jones be promoted or put on the shelf?” he wrote. “Once, the man’s superiors would have had to thresh this out among themselves; now they can check with psychologists to see what the tests say.”

Remarkably, this regime, so widespread in corporate America at mid-century, had almost disappeared by 1990. “I think an HR person from the late 1970s would be stunned to see how casually companies hire now,” Peter Cappelli told me—the days of testing replaced by a handful of ad hoc interviews, with the questions dreamed up on the fly. Many factors explain the change, he said, and then he ticked off a number of them: Increased job-switching has made it less important and less economical for companies to test so thoroughly. A heightened focus on short-term financial results has led to deep cuts in corporate functions that bear fruit only in the long term. The Civil Rights Act of 1964, which exposed companies to legal liability for discriminatory hiring practices, has made HR departments wary of any broadly applied and clearly scored test that might later be shown to be systematically biased. Instead, companies came to favor the more informal qualitative hiring practices that are still largely in place today.

But companies abandoned their hard-edged practices for another important reason: many of their methods of evaluation turned out not to be very scientific. Some were based on untested psychological theories. Others were originally designed to assess mental illness, and revealed nothing more than where subjects fell on a “normal” distribution of responses—which in some cases had been determined by testing a relatively small, unrepresentative group of people, such as college freshmen. When William Whyte administered a battery of tests to a group of corporate presidents, he found that not one of them scored in the “acceptable” range for hiring. Such assessments, he concluded, measured not potential but simply conformity. Some of them were highly intrusive, too, asking questions about personal habits, for instance, or parental affection. Unsurprisingly, subjects didn’t like being so impersonally poked and prodded (sometimes literally).

For all these reasons and more, the idea that hiring was a science fell out of favor. But now it’s coming back, thanks to new technologies and methods of analysis that are cheaper, faster, and much-wider-ranging than what we had before. For better or worse, a new era of technocratic possibility has begun.

Consider Knack, a tiny start-up based in Silicon Valley. Knack makes app-based video games, among them Dungeon Scrawl, a quest game requiring the player to navigate a maze and solve puzzles, and Wasabi Waiter, which involves delivering the right sushi to the right customer at an increasingly crowded happy hour. These games aren’t just for play: they’ve been designed by a team of neuroscientists, psychologists, and data scientists to suss out human potential. Play one of them for just 20 minutes, says Guy Halfteck, Knack’s founder, and you’ll generate several megabytes of data, exponentially more than what’s collected by the SAT or a personality test. How long you hesitate before taking every action, the sequence of actions you take, how you solve problems—all of these factors and many more are logged as you play, and then are used to analyze your creativity, your persistence, your capacity to learn quickly from mistakes, your ability to prioritize, and even your social intelligence and personality. The end result, Halfteck says, is a high-resolution portrait of your psyche and intellect, and an assessment of your potential as a leader or an innovator.

When Hans Haringa heard about Knack, he was skeptical but intrigued. Haringa works for the petroleum giant Royal Dutch Shell—by revenue, the world’s largest company last year. For seven years he’s served as an executive in the company’s GameChanger unit: a 12-person team that for nearly two decades has had an outsize impact on the company’s direction and performance. The unit’s job is to identify potentially disruptive business ideas. Haringa and his team solicit ideas promiscuously from inside and outside the company, and then play the role of venture capitalists, vetting each idea, meeting with its proponents, dispensing modest seed funding to a few promising candidates, and monitoring their progress. They have a good record of picking winners, Haringa told me, but identifying ideas with promise has proved to be extremely difficult and time-consuming. The process typically takes more than two years, and less than 10 percent of the ideas proposed to the unit actually make it into general research and development.

When he heard about Knack, Haringa thought he might have found a shortcut. What if Knack could help him assess the people proposing all these ideas, so that he and his team could focus only on those whose ideas genuinely deserved close attention? Haringa reached out, and eventually ran an experiment with the company’s help.

Over the years, the GameChanger team had kept a database of all the ideas it had received, recording how far each had advanced. Haringa asked all the idea contributors he could track down (about 1,400 in total) to play Dungeon Scrawl and Wasabi Waiter, and told Knack how well three-quarters of those people had done as idea generators. (Did they get initial funding? A second round? Did their ideas make it all the way?) He did this so that Knack’s staff could develop game-play profiles of the strong innovators relative to the weak ones. Finally, he had Knack analyze the game-play of the remaining quarter of the idea generators, and asked the company to guess whose ideas had turned out to be best.

When the results came back, Haringa recalled, his heart began to beat a little faster. Without ever seeing the ideas, without meeting or interviewing the people who’d proposed them, without knowing their title or background or academic pedigree, Knack’s algorithm had identified the people whose ideas had panned out. The top 10 percent of the idea generators as predicted by Knack were in fact those who’d gone furthest in the process. Knack identified six broad factors as especially characteristic of those whose ideas would succeed at Shell: “mind wandering” (or the tendency to follow interesting, unexpected offshoots of the main task at hand, to see where they lead), social intelligence, “goal-orientation fluency,” implicit learning, task-switching ability, and conscientiousness. Haringa told me that this profile dovetails with his impression of a successful innovator. “You need to be disciplined,” he said, but “at all times you must have your mind open to see the other possibilities and opportunities.”

What Knack is doing, Haringa told me, “is almost like a paradigm shift.” It offers a way for his GameChanger unit to avoid wasting time on the 80 people out of 100—nearly all of whom look smart, well-trained, and plausible on paper—whose ideas just aren’t likely to work out. If he and his colleagues were no longer mired in evaluating “the hopeless folks,” as he put it to me, they could solicit ideas even more widely than they do today and devote much more careful attention to the 20 people out of 100 whose ideas have the most merit.

Haringa is now trying to persuade his colleagues in the GameChanger unit to use Knack’s games as an assessment tool. But he’s also thinking well beyond just his own little part of Shell. He has encouraged the company’s HR executives to think about applying the games to the recruitment and evaluation of all professional workers. Shell goes to extremes to try to make itself the world’s most innovative energy company, he told me, so shouldn’t it apply that spirit to developing its own “human dimension”?

“It is the whole man The Organization wants,” William Whyte wrote back in 1956, when describing the ambit of the employee evaluations then in fashion. Aptitude, skills, personal history, psychological stability, discretion, loyalty—companies at the time felt they had a need (and the right) to look into them all. That ambit is expanding once again, and this is undeniably unsettling. Should the ideas of scientists be dismissed because of the way they play a game? Should job candidates be ranked by what their Web habits say about them? Should the “data signature” of natural leaders play a role in promotion? These are all live questions today, and they prompt heavy concerns: that we will cede one of the most subtle and human of skills, the evaluation of the gifts and promise of other people, to machines; that the models will get it wrong; that some people will never get a shot in the new workforce.

It’s natural to worry about such things. But consider the alternative. A mountain of scholarly literature has shown that the intuitive way we now judge professional potential is rife with snap judgments and hidden biases, rooted in our upbringing or in deep neurological connections that doubtless served us well on the savanna but would seem to have less bearing on the world of work.

What really distinguishes CEOs from the rest of us, for instance? In 2010, three professors at Duke’s Fuqua School of Business asked roughly 2,000 people to look at a long series of photos. Some showed CEOs and some showed nonexecutives, and the participants didn’t know who was who. The participants were asked to rate the subjects according to how “competent” they looked. Among the study’s findings: CEOs look significantly more competent than non-CEOs; CEOs of large companies look significantly more competent than CEOs of small companies; and, all else being equal, the more competent a CEO looked, the fatter the paycheck he or she received in real life. And yet the authors found no relationship whatsoever between how competent a CEO looked and the financial performance of his or her company.

Examples of bias abound. Tall men get hired and promoted more frequently than short men, and make more money. Beautiful women get preferential treatment, too—unless their breasts are too large. According to a national survey by the Employment Law Alliance a few years ago, most American workers don’t believe attractive people in their firms are hired or promoted more frequently than unattractive people, but the evidence shows that they are, overwhelmingly so. Older workers, for their part, are thought to be more resistant to change and generally less competent than younger workers, even though plenty of research indicates that’s just not so. Workers who are too young or, more specifically, are part of the Millennial generation are tarred as entitled and unable to think outside the box.

Malcolm Gladwell recounts a classic example in Blink. Back in the 1970s and ’80s, most professional orchestras transitioned one by one to “blind” auditions, in which each musician seeking a job performed from behind a screen. The move was made in part to stop conductors from favoring former students, which it did. But it also produced another result: the proportion of women winning spots in the most-prestigious orchestras shot up fivefold, notably when they played instruments typically identified closely with men. Gladwell tells the memorable story of Julie Landsman, who, at the time of his book’s publication, in 2005, was playing principal French horn for the Metropolitan Opera, in New York. When she’d finished her blind audition for that role, years earlier, she knew immediately that she’d won. Her last note was so true, and she held it so long, that she heard delighted peals of laughter break out among the evaluators on the other side of the screen. But when she came out to greet them, she heard a gasp. Landsman had played with the Met before, but only as a substitute. The evaluators knew her, yet only when they weren’t aware of her gender—only, that is, when they were forced to make not a personal evaluation but an impersonal one—could they hear how brilliantly she played.

We may like to think that society has become more enlightened since those days, and in many ways it has, but our biases are mostly unconscious, and they can run surprisingly deep. Consider race. For a 2004 study called “Are Emily and Greg More Employable Than Lakisha and Jamal?,” the economists Sendhil Mullainathan and Marianne Bertrand put white-sounding names (Emily Walsh, Greg Baker) or black-sounding names (Lakisha Washington, Jamal Jones) on similar fictitious résumés, which they then sent out to a variety of companies in Boston and Chicago. To get the same number of callbacks, they learned, they needed to either send out half again as many résumés with black names as those with white names, or add eight extra years of relevant work experience to the résumés with black names.

I talked with Mullainathan about the study. All of the hiring managers he and Bertrand had consulted while designing it, he said, told him confidently that Lakisha and Jamal would get called back more than Emily and Greg. Affirmative action guaranteed it, they said: recruiters were bending over backwards in their search for good black candidates. Despite making conscious efforts to find such candidates, however, these recruiters turned out to be excluding them unconsciously at every turn. After the study came out, a man named Jamal sent a thank-you note to Mullainathan, saying that he’d started using only his first initial on his résumé and was getting more interviews.

Perhaps the most widespread bias in hiring today cannot even be detected with the eye. In a recent survey of some 500 hiring managers, undertaken by the Corporate Executive Board, a research firm, 74 percent reported that their most recent hire had a personality “similar to mine.” Lauren Rivera, a sociologist at Northwestern, spent parts of the three years from 2006 to 2008 interviewing professionals from elite investment banks, consultancies, and law firms about how they recruited, interviewed, and evaluated candidates, and concluded that among the most important factors driving their hiring recommendations were—wait for it—shared leisure interests. “The best way I could describe it,” one attorney told her, “is like if you were going on a date. You kind of know when there’s a match.” Asked to choose the most-promising candidates from a sheaf of fake résumés Rivera had prepared, a manager at one particularly buttoned-down investment bank told her, “I’d have to pick Blake and Sarah. With his lacrosse and her squash, they’d really get along [with the people] on the trading floor.” Lacking “reliable predictors of future performance,” Rivera writes, “assessors purposefully used their own experiences as models of merit.” Former college athletes “typically prized participation in varsity sports above all other types of involvement.” People who’d majored in engineering gave engineers a leg up, believing they were better prepared.

Given this sort of clubby, insular thinking, it should come as no surprise that the prevailing system of hiring and management in this country involves a level of dysfunction that should be inconceivable in an economy as sophisticated as ours. Recent survey data collected by the Corporate Executive Board, for example, indicate that nearly a quarter of all new hires leave their company within a year of their start date, and that hiring managers wish they’d never extended an offer to one out of every five members on their team. A survey by Gallup this past June, meanwhile, found that only 30 percent of American workers felt a strong connection to their company and worked for it with passion. Fifty-two percent emerged as “not engaged” with their work, and another 18 percent as “actively disengaged,” meaning they were apt to undermine their company and co-workers, and shirk their duties whenever possible. These headline numbers are skewed a little by the attitudes of hourly workers, which tend to be worse, on average, than those of professional workers. But really, what further evidence do we need of the abysmal status quo?

Because the algorithmic assessment of workers’ potential is so new, not much hard data yet exist demonstrating its effectiveness. The arena in which it has been best proved, and where it is most widespread, is hourly work. Jobs at big-box retail stores and call centers, for example, warm the hearts of would-be corporate Billy Beanes: they’re pretty well standardized, they exist in huge numbers, they turn over quickly (it’s not unusual for call centers, for instance, to experience 50 percent turnover in a single year), and success can be clearly measured (through a combination of variables like sales, call productivity, customer-complaint resolution, and length of tenure). Big employers of hourly workers are also not shy about using psychological tests, partly in an effort to limit theft and absenteeism. In the late 1990s, as these assessments shifted from paper to digital formats and proliferated, data scientists started doing massive tests of what makes for a successful customer-support technician or salesperson. This has unquestionably improved the quality of the workers at many firms.

Teri Morse, the vice president for recruiting at Xerox Services, oversees hiring for the company’s 150 U.S. call and customer-care centers, which employ about 45,000 workers. When I spoke with her in July, she told me that as recently as 2010, Xerox had filled these positions through interviews and a few basic assessments conducted in the office—a typing test, for instance. Hiring managers would typically look for work experience in a similar role, but otherwise would just use their best judgment in evaluating candidates. In 2010, however, Xerox switched to an online evaluation that incorporates personality testing, cognitive-skill assessment, and multiple-choice questions about how the applicant would handle specific scenarios that he or she might encounter on the job. An algorithm behind the evaluation analyzes the responses, along with factual information gleaned from the candidate’s application, and spits out a color-coded rating: red (poor candidate), yellow (middling), or green (hire away). Those candidates who score best, I learned, tend to exhibit a creative but not overly inquisitive personality, and participate in at least one but not more than four social networks, among many other factors. (Previous experience, one of the few criteria that Xerox had explicitly screened for in the past, turns out to have no bearing on either productivity or retention. Distance between home and work, on the other hand, is strongly associated with employee engagement and retention.)

When Xerox started using the score in its hiring decisions, the quality of its hires immediately improved. The rate of attrition fell by 20 percent in the initial pilot period, and over time, the number of promotions rose. Xerox still interviews all candidates in person before deciding to hire them, Morse told me, but, she added, “We’re getting to the point where some of our hiring managers don’t even want to interview anymore”—they just want to hire the people with the highest scores.

The online test that Xerox uses was developed by a small but rapidly growing company based in San Francisco called Evolv. I spoke with Jim Meyerle, one of the company’s co‑founders, and David Ostberg, its vice president of workforce science, who described how modern techniques of gathering and analyzing data offer companies a sharp edge over basic human intuition when it comes to hiring. Gone are the days, Ostberg told me, when, say, a small survey of college students would be used to predict the statistical validity of an evaluation tool. “We’ve got a data set of 347,000 actual employees who have gone through these different types of assessments or tools,” he told me, “and now we have performance-outcome data, and we can split those and slice and dice by industry and location.”

Evolv’s tests allow companies to capture data about everybody who applies for work, and everybody who gets hired—a complete data set from which sample bias, long a major vexation for industrial-organization psychologists, simply disappears. The sheer number of observations that this approach makes possible allows Evolv to say with precision which attributes matter more to the success of retail-sales workers (decisiveness, spatial orientation, persuasiveness) or customer-service personnel at call centers (rapport-building). And the company can continually tweak its questions, or add new variables to its model, to seek out ever stronger correlates of success in any given job. For instance, the browser that applicants use to take the online test turns out to matter, especially for technical roles: some browsers are more functional than others, but it takes a measure of savvy and initiative to download them.

There are some data that Evolv simply won’t use, out of a concern that the information might lead to systematic bias against whole classes of people. The distance an employee lives from work, for instance, is never factored into the score given each applicant, although it is reported to some clients. That’s because different neighborhoods and towns can have different racial profiles, which means that scoring distance from work could violate equal-employment-opportunity standards. Marital status? Motherhood? Church membership? “Stuff like that,” Meyerle said, “we just don’t touch”—at least not in the U.S., where the legal environment is strict. Meyerle told me that Evolv has looked into these sorts of factors in its work for clients abroad, and that some of them produce “startling results.” Citing client confidentiality, he wouldn’t say more.

MIT’s Sandy Pentland has pioneered the use of electronic “badges” that transmit data about employees as they go about their days.

Meyerle told me that what most excites him are the possibilities that arise from monitoring the entire life cycle of a worker at any given company. This is a task that Evolv now performs for Transcom, a company that provides outsourced customer-support, sales, and debt-collection services, and that employs some 29,000 workers globally. About two years ago, Transcom began working with Evolv to improve the quality and retention of its English-speaking workforce, and three-month attrition quickly fell by about 30 percent. Now the two companies are working together to marry pre-hire assessments to an increasing array of post-hire data: about not only performance and duration of service but also who trained the employees; who has managed them; whether they were promoted to a supervisory role, and how quickly; how they performed in that role; and why they eventually left.

The potential power of this data-rich approach is obvious. What begins with an online screening test for entry-level workers ends with the transformation of nearly every aspect of hiring, performance assessment, and management. In theory, this approach enables companies to fast-track workers for promotion based on their statistical profiles; to assess managers more scientifically; even to match workers and supervisors who are likely to perform well together, based on the mix of their competencies and personalities. Transcom plans to do all these things, as its data set grows ever richer. This is the real promise—or perhaps the hubris—of the new people analytics. Making better hires turns out to be not an end but just a beginning. Once all the data are in place, new vistas open up.

For a sense of what the future of people analytics may bring, I turned to Sandy Pentland, the director of the Human Dynamics Laboratory at MIT. In recent years, Pentland has pioneered the use of specialized electronic “badges” that transmit data about employees’ interactions as they go about their days. The badges capture all sorts of information about formal and informal conversations: their length; the tone of voice and gestures of the people involved; how much those people talk, listen, and interrupt; the degree to which they demonstrate empathy and extroversion; and more. Each badge generates about 100 data points a minute.

Pentland’s initial goal was to shed light on what differentiated successful teams from unsuccessful ones. As he described last year in the Harvard Business Review, he tried the badges out on about 2,500 people, in 21 different organizations, and learned a number of interesting lessons. About a third of team performance, he discovered, can usually be predicted merely by the number of face-to-face exchanges among team members. (Too many is as much of a problem as too few.) Using data gathered by the badges, he was able to predict which teams would win a business-plan contest, and which workers would (rightly) say they’d had a “productive” or “creative” day. Not only that, but he claimed that his researchers had discovered the “data signature” of natural leaders, whom he called “charismatic connectors” and all of whom, he reported, circulate actively, give their time democratic