Chapter 1 Introduction

Here is an intro. And more.

1.1 Statistics vs Data Science

1.2 Experiments vs Observations

The type of analysis that is relevant for a given problem depends on the experimental design. More generally, experimental design refers to how the data were collected. How the data were collected has implications for the type of conclusions that can be drawn from the data (ie., you may have heard the phrase, “correlation does not imply causation”) and the subsequent analysis. A extremely sophisticated analysis does not overcome limitations in how the data were collected.

Given that how the data were collected greatly impacts the types of conclusions that can be drawn from the analysis, what are design features are important to consider? This topic is large, nuanced, and an entire series of courses have been created on this topic. The goal here is to think about some concepts that are particularly helpful to consider for any analysis.

The discussion will be framed around articulating whether the data collected were part of an experiment or were simply observed. The simplest definition for observational data are those that were collected without strong consideration about who is or how the data are collected. One example of observational data could be collected information about the shoes that people wear as they are walking in a busy part of town. Given this type of data collected, what could be some limitations about this type of data?

To contrast observational data, experimental data are those in which care was taken to how the data were collected. Within experimental data, it is common for there to be two or more conditions that are being explored. For example, in a clinical trial that is testing whether a vaccine or drug is effective and safe, the new vaccine or drug is often tested against a placebo. The placebo is a harmless substance that does not produce any change to the body, for example, the placebo could be a sugar pill that looks just like a new drug to be tested in every other way. The placebo would simply not contain the active ingredients of the new drug.

When conducting these experiments with two or more groups that are of interest to be compared, those who are participating in the study are randomly assigned or selected to be in one of the groups. Using the placebo vs new drug example, this would mean that each participant is randomly assigned to either receive the placebo or the new drug, but the participant does not know which one they are receiving. Often, those administering the treatment also do not know whether the placebo or active drug is being given as well.

1.2.1 Explore Random Assignment

How does the random assignment of individuals to treatment conditions change the design from an observational to an experimental design? The random process is really the primary differentiator between being an observational vs an experiment. The random assignment to each condition in the study has the ability to, on average, even out differences across the two groups. Since the inclusion of being in one of the two groups is random, if the study has enough participants, it is more likely for the two groups to be as equal as possible across all characteristics for the study participants.

What could random assignment look like in practice? More to come … .

1.2.2 Example: Natural Experiments

1.3 Data Structure

Data are often stored in tabular form for ease of use with common statistical programs, however data need not be in this structure. Data can come from anywhere and could consist of text, numbers representing some quantity, text labels representing groups, or many other formats. This section aims to give an introduction to the form and format of data, both common and uncommon.

1.3.1 Tabular Data

Tabular data are those that are most commonly used in statistics courses. Tabular data are such that rows indicate unique cases of data and the columns represent different attributes for those cases. Table 1.1 shows the an example of tabular data.

Table 1.1: Example of tabular data.
name human fictional height
He-Man Yes Yes 83
She-Ra Yes Yes 96
Voltron No Yes 3936

In this table, each row represents a unique person and the first column would represent the identifying attribute for each unique row. For example, the first row represents characteristics for He-Man and each subsequent column represents specific attributes about them. For example, the second and third columns are Yes/No attributes indicating if the person is human or fictional or not. What are the primary differences between the second and third columns? Notice that the third column, the fictional attribute, is one in which all elements in the current tabular data are the same value. This would be an attribute that does not vary across the different rows in the data. These attributes are not helpful from a statistical perspective in this small data set, but if more data were added in which some elements were not fictional, then this attribute would contain useful information. In statistics terminology, this type of attribute would be called a constant attribute. In statistics, we are interested in attributes that vary across our units in the data. Attributes that vary are often referred to as variables in statistics terminology. Throughout this textbook, the term attribute will be used instead of variable. An attribute will refer to a column of data that carries information about the units in the data.

The final column of data in Table 1.1 is one that represents the approximate height of each character, in inches.1 This attribute is different from the rest of the attributes in that it is a numeric quantity for each unit. Numeric quantities usually carry more information about magnitude differences compared to the text attributes described earlier. More explicitly, differences in the height attributes can be quantified and in this case, with the attribute being represented as the height in inches, each unit represents the same distance across the entire scale. That is, a one inch difference across the entire height scale is the same no matter where that inch occurs on the scale. These types of attributes will be used extensively in this text, most commonly these types of attributes will be used as the attribute of most interest in our analyses. For example, from the data above, it could be asked if there are differences in height of the characters based on if they are human or not. For this question, differences in height for those that are human (ie., human attribute = “Yes”) compared to those that are not human (ie., human attribute = “No”) would be used as the primary comparison. This text will explore questions like this from a descriptive and inferential framework later.

For these data, there are only three rows, but you could imagine adding more rows to these elements for other characters. For example, imagine adding a new row to this data for a rabbit. Take a few minutes to think about what a new row with the rabbit would look like for each column of Table 1.1. Would any of the columns in the table change from a constant to now being a variable?

1.3.2 Non-tabular Data

Data can come in many different formats, this book will not extensively cover data that are not in a tabular format. However, non-tabular data are very common in practical problems. In these situations, these data are often wrangled into a more structured format to conduct a statistical analysis. Some common non-tabular data formats include data coming from text, video, audio, graphics, images, sensors, and even more.

  1. This is based on a search for the character name plus height, for example, “He-Man height.”↩︎