Looking Beyond the Label

I hate shopping.

I love buying things (provided I have the money), but I hate going into shops for clothes, looking at the labels and finding that there isn’t a size to fit me. This is particularly demoralising when I see something I really like, even more so when I realise that I can’t even order it on the internet. You see, I’m not a small man. I am 6 feet 5 inches tall, weigh 16 stone and have an 18 inch neck. When I ask about clothes, I get the response:

“These clothes are for the average man”

Now, don’t get me started on the use of the word average. I guess what they are saying is that they cater for the '95%' of the population, and that by being in the '5%' I am not normal.

It was only as I grew into a full-sized adult that I realised the irony of going to the Menswear department, because it really is the only place that I swear in public! There is plenty of evidence to suggest that the adult male is getting larger (both in height and weight), yet clothes on the rack are getting smaller to fit into the ‘perceived’ norm.

I have a lot of problems with marketing, and with ‘perception’. The media give the impression that men are of the George Clooney, Brad Pitt, Johnny Depp, David Beckham variety, and that this is normal. Given the influence the media have, this becomes what we perceive and therefore becomes a reality. The saying “perception is reality” is now used as a mantra to say that we should act on our perceptions, but the original quote was used as a warning to the marketing industry. In essence, beware of perceptions because they will cause you problems and distort what is true. Your reality becomes something that is unhelpful for others and may put you out of business.

The perception that everyone is like Brad Pitt is not reality. It is false. It doesn’t matter how much the media espouse it, or the marketing departments believe it, or the shop floor caters for it – it is still false. And if you are a man you will have noticed that menswear sections of clothing stores are getting smaller and smaller. They are usually put together with kidswear and household goods, while the rest of the store stocks a wide variety of women’s clothes. The reason for this is that there is a ‘perception’ that men are a standard size and all want to look the same. For many there is the view that men simply don’t care about what they wear – that they hate shopping. And with the smaller selections and the reduction in sizes, the perception that men hate shopping probably is a self-fulfilling reality.

I hate shopping.

But for a male who differs from the ‘average’, my problems don’t end there.

“Go to a big man store”, I will be told.

If you are labelled a ‘big man’ it means the following. You are either very wide, or very wide and very tall. It also usually means that you have given up all right to look in the least bit fashionable. The "big man" label doesn’t really work for people like me – not only do I care about what I wear, I am not majorly overweight. Even though I have a 48 inch chest I don’t have a 56 inch waist. When I ask for a 36 inch waist I get the response:

“Have you tried the standard high street stores?”

Some of the high street stores do stock a 48 inch chest but their long fittings stop at 44 inches. They assume I must be short and fat!

Aaaarrrrrrgggghhhhhhh!!!

So why am I having a rant about shopping? What does it have to do with data, analysis or anything else?

Søren Kierkegaard, the Danish philosopher, theologian, poet, social critic and religious author, once said:

“Once you label me you negate me.”

Labelling me as ‘not normal’ or ‘a big man’ is not helpful. I am a unique individual, just like every other human being, and my views, thoughts and needs are as important as anyone else's. And this is equally true in our professional lives. The skills and experiences that we bring are unique – trying to label individuals in a work setting is often problematic and counter-productive.

As a trainer in statistics (https://www.johnvarlow.com/courses) I usually start off telling students about data types. All data is either nominal, ordinal or scale. One of the things that we do early on is categorise our data. In essence, we put a label on it. We like data that is scale or ordinal (ordered categories) and we try to avoid nominal data (non-orderable categories) if at all possible. But, as an individual, I am a nominal piece of data. The nominal category I am in just includes me. You can put me with a number of other individuals who share a particular characteristic and label us, e.g. “big”, “small”, “black”, “white”, “rich”, “poor”, but that label doesn’t define me, just one particular characteristic of me. You cannot put me in a list and rank me as being ‘better’ or ‘worse’ than anyone else, except on a specific characteristic (sometimes). I am at my best defined when I am not labelled and therefore all my characteristics are equally at play.
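For readers who have not met the three data types before, here is a minimal sketch in Python of how they differ in practice. Everything in it is made up for illustration: the example values, the ShirtSize class and the summarise function are my own hypothetical names, not part of any course material. The point is simply that the label you put on the data limits what you can sensibly do with it.

```python
from enum import IntEnum
from statistics import mean, median_low, mode

# Nominal: categories with no inherent order (made-up values).
eye_colour = ["brown", "blue", "brown", "green"]


# Ordinal: ordered categories; the gaps between levels are not assumed equal.
class ShirtSize(IntEnum):
    S = 1
    M = 2
    L = 3
    XL = 4


sizes = [ShirtSize.M, ShirtSize.XL, ShirtSize.L, ShirtSize.M, ShirtSize.S]

# Scale: numeric measurements where differences are meaningful (made-up values).
chest_inches = [40.0, 48.0, 44.0, 42.0]


def summarise(values, level):
    """Return a summary that is meaningful at the given measurement level."""
    if level == "nominal":
        return mode(values)        # we can only count categories
    if level == "ordinal":
        return median_low(values)  # we can rank, so a middle value exists
    if level == "scale":
        return mean(values)        # arithmetic on the values makes sense
    raise ValueError(f"unknown measurement level: {level}")


print(summarise(eye_colour, "nominal"))
print(summarise(sizes, "ordinal").name)
print(summarise(chest_inches, "scale"))
```

Notice that there is no sensible way to ask for the mean eye colour. Once data is nominal, counting is about all that is left.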

So why do we need to label people at work? It is often to reinforce hierarchy, to suggest that one role is more important, or more skilled than another. The latest popular label is “Data Scientist”. The data scientist is seen to be at the top of the analytical scale. Employees are encouraged to 'step up' from being an analyst, statistician or similar and become a data scientist. If you are a data scientist there are plenty of jobs that you can apply for, there is career progression, there are courses for you, there is job security, there is respect. But ask any data scientist what a data scientist is, and you will get very different responses. For some it is a combination of business analyst and programmer, for others it is someone with statistical training and deep domain expertise. We have a label that defines the data scientist, but nobody really knows what it means.

Everyone knows the line from Romeo and Juliet:

“What's in a name? that which we call a rose by any other name would smell as sweet.”

Shakespeare clearly held the view that it wasn’t the label that was important, but the underlying characteristics. Why don’t we define our job roles based on skills required rather than a label? Why label people at all if we don’t know what the label means? It's because we want to be perceived, as an individual or an organisation, to be ‘with the times’, to be innovative, to be cutting edge, to be competing with everyone else. We have created the hype around data science, and there is now a shared perception that everybody needs a team of data scientists (regardless of what that means). We have gone from the insight of Shakespeare to the fictional response of Anne of Green Gables.

“I read in a book once that a rose by any other name would smell as sweet, but I've never been able to believe it. I don't believe a rose WOULD be as nice if it was called a thistle or a skunk cabbage.”

Did I mention I hate shopping?

The one thing I hate more than anything else is looking at a label and it saying:

“One Size Fits All”

Nonsense! What does this mean? Is one size of clothing really meant to be able to fit everybody? Unfortunately, unless you like tight Lycra, very baggy clothes or are the “average” person, the clothing will not fit you well. We have all seen the ‘one size’ hat wearer, with the hat over their eyes or awkwardly perched on the top of their head. And nothing that is “one size fits all” has ever fitted me. How can one size fit all when everyone's size is different? Heck, even standard glasses don't fit me!

Data Scientist – the one size fits all definition.

Of course, the more general a label is, the better it fits. I could label everyone in the world as “human” and I would be correct. And no-one would object to that definition. It is very general, and gives no specifics about anything except for an indisputable characteristic. If we are going to use labels at work then it is this type of label that we need to use – a label that cannot be disputed.

So what is data science?

Let’s take data first – everything we experience is data, so we could argue the term is moot. However, let’s suggest that we are talking about digital data, data that has been, or will be, entered into a computer in some fashion.

What is a scientist? Someone who conducts research to advance knowledge in an area of interest. In other words, finding out new things using scientific principles.

We have talked in previous blogs about the scientific method: the use of observations to formulate hypotheses, and the testing of hypotheses to form theories. We have also touched on the four elements of the scientific method. These are scepticism, reductionism, determinism and empiricism. In essence, if we have observable data, everything is explainable and follows logical rules, although there is a chance that the explanation may be wrong.

What is a data scientist? Someone who formulates new theories from data. This definition cannot be disputed. It really is a ‘one size fits all’ definition. Unfortunately, this is not the definition we use. We try to define data science by a number of more granular, ‘one size’ labels. Here are a few, and the dangers of taking this approach.

“A data scientist will use R and Python”. R is a great statistical tool, and open source which means that it is available to everyone. It has strengths (price, the online community, specialist techniques) and weaknesses (interface, language). But there are other tools out there that do the same things. SPSS has strengths (interface, language, data manipulation) and weaknesses (price, online community). Both incorporate Python, although I would argue that this isn’t necessary. To suggest that a data scientist must use R is like suggesting that a chef can only use NEFF ovens. In my training I have come across people who have had trouble with me for using output from SPSS rather than R (even though the output would have said exactly the same thing). When producing training materials, and embedding graphics and tables, SPSS is ideal. There is an old saying that applies to both statistical tests and statistical tools: “Use the best tool for the job”.

“A data scientist must have programming or hacking skills”. I have never understood this assertion. Some data scientists will have these skills and some won’t. It depends on the type of data being looked at. You don’t say that you are only a chef if you can make soufflé, or only a carpenter if you can make a dressing table. I have also often seen data scientists using programming to do something that the ‘correct tool’ would have done for them much more simply.

“A data scientist will use algorithms and machine learning to discover patterns in the data.” Again, this is too specific for a ‘one size’ label. Any scientist will use observations to formulate hypotheses. A data scientist should use data observations to formulate hypotheses. There should be no restrictions on how these observations are ‘observed’. Yes, it might be through algorithms, machine learning or visualisations. It might be through segmentation, cluster analysis, factor analysis or other statistical techniques. It might be that an a priori hypothesis is formed and then the correct data sought. A data scientist does not have to be limited to post-hoc hypotheses, and shouldn't be.

“A data scientist will use certain statistical techniques”. I have personally been told that some statistical techniques shouldn’t be used, and that we should concentrate on regression models. Going back to my previous point: “Use the best tool for the job”. Rather than limit the tool or technique we should be encouraging any tool and technique that helps to answer the question in front of us. Yes, this could be a regression model, but equally could be an ETS or ARIMA model. It could be that Statistical Process Control is sufficient for our needs, or it could be that simple univariate analysis is the best technique. This means that it is unlikely that one data scientist will have experience in all these techniques, so they will need to work with others (statisticians, analysts etc.).
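To show how little is sometimes needed, here is a rough sketch in Python of a simplified control-limit check in the spirit of Statistical Process Control. It is an illustration under stated assumptions rather than a full SPC implementation: the limits are simply the baseline mean plus or minus three sample standard deviations, the data is made up, and the function names control_limits and out_of_control are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, centre, upper) limits from a baseline period.

    Simplification: uses the sample standard deviation of the baseline,
    not the moving-range estimate used on a formal individuals chart.
    """
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - sigmas * spread, centre, centre + sigmas * spread


def out_of_control(values, lower, upper):
    """Return (index, value) pairs that fall outside the limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Made-up weekly counts: a stable baseline, then four new observations.
baseline = [50, 52, 49, 51, 50, 48, 52, 50]
new_points = [51, 49, 60, 50]

lower, centre, upper = control_limits(baseline)
print(f"centre={centre:.2f}, limits=({lower:.2f}, {upper:.2f})")
print(out_of_control(new_points, lower, upper))
```

If the question in front of us is only “has something changed?”, a check like this may answer it without a regression model in sight.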

None of this is intended to be a criticism of data scientists. Those that I have met have all been highly skilled individuals. But we do them, and others, a disservice by dictating what they can and can’t do. Providing they use data ethically (see my blog Epistles, Enquiries and Ethics) they should be able to use all the tools at their disposal (including other people who have the skills that they don’t). Any scientist doing medical research will bring in individuals to assist where skills needed are outside their area of expertise. It is no different in data science. Why should a data scientist be expected to be able to do everything themselves? What is the point of having computer programmers, data analysts, statisticians, data managers, business analysts etc? Surely all of these roles are defunct if a data scientist has all these skills?

I suggested earlier that we should concentrate on skills, knowledge and abilities rather than labels. If I want a chef to work in a French restaurant, I don’t care if they can’t make a curry. If I want a carpenter to make chairs, I don’t care if they can’t make a bookcase. Similarly, if I want a data scientist to work with particular types of data, or in a particular setting, I shouldn’t care if they have no knowledge outside of that setting. A surgeon can be experienced in cardiac surgery but not bowel surgery. It doesn’t mean they are not a surgeon.

My own view is that the label of ‘data scientist’ incorporates data analysts, statisticians and similar. We would be better using data scientist as a ‘catch all’ label, and then looking for specific individuals with statistical skills, programming skills if necessary and some domain knowledge. Such an approach would remove the current view that a data scientist is ‘better’ than an analyst, or a statistician needs to ‘become’ a data scientist. Of course, I know that this won’t happen. While labelling is often counter-productive it is there to create a perception and there is no doubt that perception of individuals, organisations and situations is important. To quote Winston Churchill:

“One day President Roosevelt told me that he was asking publicly for suggestions about what the war should be called. I said at once 'The Unnecessary War'.”

That label, while probably true, would not have helped the perception. While World War 1 was labelled 'The Great War', and 'The War to end all Wars', World War 2 never got an alternate label.

So labels are, rightly or wrongly, seen as important tools to influence perception. If we are stuck with the label 'Data Scientist', let’s ensure that we cater for everybody. Let’s realise that too specific a ‘one size’ label excludes, and diminishes, those that have fantastic skills, experiences and abilities. While the desired perception may be that we are a modern, dynamic, cutting edge organisation, the reality may be that we are missing out on, and negating, some fantastic skills and people.