**STATISTICS IN SYSTAT 5.0**

**WHAT IS SYSTAT? Systat 5.0** is a **DOS**-based statistical package. It works
in the plain **DOS** environment (not the colorful Windows environment) and uses **COMMANDS**
to conduct its operations. We will use the commands. But **SYSTAT** also has a menu,
which you can access with the command "**SYSTAT**" from the C:\**SYSTDOS**
folder (we will check out this menu later).

**Note**: Systat 6.0 and up are Windows-based versions of the package. However, they
can also work with commands typed at a prompt in the main screen. The commands will be, in most
cases, the same as in the DOS version. I will mention changes in command syntax
where I have experienced differences.

**PRELIMINARIES**. Before turning on the computer we need to think about what we are
going to do with the data that need analysis. Our first task is to perform an **initial
exploration of batches**. We will start exploring the data we have recorded using **Stem-and-leaf
plots**. We know that these plots are a first approach to establishing patterns or modes in
the sample of artifacts. However, first we need to enter the data in **SYSTAT**, so
that Stem-and-leaf plots can be produced.

**1. ENTERING DATA IN SYSTAT**. The first module we will use is the **EDIT**
module. **EDIT** creates and edits **SYSTAT** files in the Data Editor Worksheet (a
worksheet is a table composed of columns and rows). The first thing
you will do is name the columns with codes for the variables (the first row
of the worksheet, which starts with "Cases", is reserved for variable names).
Numerical and alphabetical data are labeled differently. Codes for numerical data can use
any letters; codes for alphabetical data (where your data are recorded with
letters) need a "$" suffix. Let's say we have data on the length and width
of lithic tools (numerical information with perhaps two decimals, if you used a
very precise caliper for measurement); in addition, we have data on the stone used for the
artifacts (this can be alphabetical data, C, F, and O, or numerical data, 1, 2 and 3, for
Chert, Flint and Obsidian, respectively). The variables row for these data can be **L, W,
M$** or simply **L, W, M**. With the first set we would enter C, F or O in the M$
column rather than numbers.

**Important distinction**: the variable **M$** is a **categorical** or
grouping variable (C, F, and O are the categories). In contrast, **L** and **W** are
**continuous** variables (as they are measurements of things). When we analyze the data it
is important to distinguish **L** and **W** for each of the categories (an analysis
of all the data together is only a first step).

Then you can start entering data. Familiarize yourself with the keys to move around the
Data Editor Worksheet (arrows, PageUp, PageDown). When all your data are entered (and even
before that) you should save the data in a **SYSTAT** file. Hit "ESC" and the
cursor will move to the ">" prompt at the bottom of the worksheet. Type
"**SAVE** filename" (no extension!) and hit "**ENTER**". You
have just saved the data, which are now in "**filename.sys**" (all **SYSTAT**
data files have the **.sys** extension). Now you can proceed to produce more detailed
statistics on the data. Exit the **EDIT** module by typing "**QUIT**".

**MISTAKES?** If something went wrong while entering data, changes can be made
by simply opening your dataset: type "**USE filename**" at the prompt, and
then "**EDIT**". (Or, if you are already in the worksheet, hit "**ESC**"
and at the prompt type "**USE filename**".) It is very easy to move around
the worksheet with the arrows to make the necessary corrections.

**2. EXPLORING THE DATA IN SYSTAT**

**STATISTICS OF A SYSTAT DATA FILE: THE STATS MODULE.** In the **STATS** module
the command **STATS L** (or STATS <var1>,<var2>,<...>) produces basic
statistics (N, minimum, maximum, mean and SD) for each variable listed.

The command **STATS L/all** produces a complete list of all the calculations
possible. More importantly, the command **STATS L/SE** produces only the information
for Standard Error, a number you will need to have for many procedures.

If you have a batch of numbers grouped by materials, you will want to have the
statistics for each group. You will then use the **BY** group command before **STATS**.
But before using the **BY** command (and for other more complex operations) you will
need to sort the file by the group variable, so that the cases of each group are together, in ascending or descending
alphanumeric order. Use the **SORT** command in the **DATA** module.

DATA | [type DATA at the > prompt; this runs the DATA module] |

USE datafile | [reads a data file] |

SORT MAT$ | [the grouping variable in this example is material, or MAT$; specify /A or /D (ascending or descending) for each variable being sorted; sorting of a var$ will be alphabetic] |

RUN | [initiates the sort] |

EDIT | [to see the sorted file: it will be a temporary file, like AABBCCFF] |

SAVE datafile | [save it to the original file, or give the sorted file a new name] |

STATS | [go to the STATS module to start calculations by groups] |

USE datafile | [reads a data file and lists the variables] |

BY MAT$ | [indicates that you want data by MAT$ groups] |

STATS L | [will show numbers for each of the groups] |
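For readers who want to check the grouped output by hand, the sort-then-group logic of the steps above can be sketched in Python. This is only an illustration of the idea, not SYSTAT code; the cases and the variable names are hypothetical.

```python
from collections import defaultdict
import statistics

# Hypothetical cases: (material code, length in cm). Not course data.
cases = [("O", 3.1), ("C", 4.2), ("F", 5.0), ("C", 4.8), ("O", 2.9), ("F", 5.6)]

# Same idea as SORT MAT$: order the cases by the grouping variable.
cases.sort(key=lambda case: case[0])

# Same idea as BY MAT$ followed by STATS L: collect the lengths of each group.
groups = defaultdict(list)
for material, length in cases:
    groups[material].append(length)

# N, minimum, maximum, mean and SD for each group.
summary = {
    material: (len(values), min(values), max(values),
               statistics.mean(values), statistics.stdev(values))
    for material, values in groups.items()
}

for material, (n, low, high, mean, sd) in summary.items():
    print(f"{material}: N={n} min={low} max={high} mean={mean:.2f} SD={sd:.2f}")
```

Because the cases are sorted first, the groups come out in alphabetical order (C, F, O), which mirrors why the file has to be sorted before using **BY**.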

At this point you should copy the information onto your data sheet so you can use the data later.

**Some definitions:**

**MEAN.** The arithmetic mean of a variable is its "average." The sum of
the values is divided by the number of (nonmissing) values.

**MEDIAN.** The median is a description of the center of a distribution. If the data
were sorted in increasing order, the median is the value above which half the values fall.

**SD**. Standard deviation, a measure of spread, is the square root of the sum of the
squared deviations (of the values from the mean) divided by (n-1).

**SEM**. The standard error of the mean is the standard deviation divided by the square
root of the sample size. It is the estimation error, or the average deviation of sample
means from the expected value of a variable.
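The four definitions above can be checked with a few lines of Python. This is a minimal sketch on a made-up batch of lengths (the numbers are hypothetical), written so that each line follows the wording of the definition.

```python
import math
import statistics

# Hypothetical lengths (cm) of seven lithic tools; not data from the course.
lengths = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4]

n = len(lengths)

# MEAN: the sum of the values divided by the number of values.
mean = sum(lengths) / n

# MEDIAN: the middle value once the data are sorted.
median = statistics.median(lengths)

# SD: square root of the sum of squared deviations from the mean, divided by (n - 1).
sd = math.sqrt(sum((x - mean) ** 2 for x in lengths) / (n - 1))

# SEM: the standard deviation divided by the square root of the sample size.
sem = sd / math.sqrt(n)

print(f"N={n} mean={mean:.3f} median={median} SD={sd:.3f} SEM={sem:.3f}")
```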

**THE GRAPH MODULE**

The command **STEM** creates a stem-and-leaf plot for one or more variables in a **SYSTAT**
file. The plot shows the distribution of a variable graphically. Stem-and-leaf plots also
list the median (M in stem), minimum, lower-hinge (H in stem), upper-hinge (H in stem),
and maximum values of the sample. Unlike histograms, stem-and-leaf plots show actual
numeric values to the precision of the leaves. A stem-and-leaf plot is produced by the command:

"**STEM** <var1>,<var2>,<...> / LINES=<#>"

(the "/ LINES=<#>" part is optional: by default **SYSTAT** will use its own stem scale; add LINES to define the number of lines, or scale; the program will not use exactly # lines but an appropriate number close to it)

As with STATS the command **BY** allows you to produce stems for each group (Do not
forget to sort the file).
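To see what a stem-and-leaf plot does with the numbers, here is a rough Python sketch. It assumes non-negative whole numbers, uses the tens as stems and the units as leaves, and does not mark the median or hinges as **SYSTAT** does; the function name and the data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a simple stem-and-leaf plot (tens = stem, units = leaf)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Print every stem between the smallest and largest, even the empty ones,
    # so that gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

# Hypothetical batch of artifact lengths in mm.
batch = [23, 25, 31, 38, 38, 42, 47, 61]
plot = stem_and_leaf(batch)
for line in plot:
    print(line)
```

Each leaf is one case, so the length of a row is the count of cases on that stem, and the actual values can still be read back from the plot.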

Other commands in the **GRAPH** module include **BOX** and **HISTOGRAM**:

The **BOX** command creates box plots. The length of each box shows the range within
which the central 50% of the numbers fall: the midspread, with its borders at the upper
and lower quartiles. The whiskers show the range of values that fall within 1.5
box-lengths beyond the box limits. Values between the inner fences (1.5 box-lengths) and the outer fences
(3 box-lengths) are plotted with asterisks (*). Values outside the outer fences, the OUTLIERS,
are plotted with empty circles (o).
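The fence rules just described can be tried out in Python. This is a sketch under one assumption that should be stated plainly: the box limits are computed here as hinges (the medians of the lower and upper halves of the sorted batch), which may differ slightly from the quartiles **SYSTAT** reports. The function names and the batch are hypothetical.

```python
import statistics

def hinges(values):
    """Lower and upper hinges: medians of the lower and upper halves of the batch."""
    data = sorted(values)
    half = (len(data) + 1) // 2  # both halves include the median when n is odd
    return statistics.median(data[:half]), statistics.median(data[-half:])

def classify(values):
    """Label each value according to the fence rules described above."""
    lower, upper = hinges(values)
    box = upper - lower  # box length (midspread)
    labels = []
    for v in values:
        distance = max(lower - v, v - upper, 0)  # how far the value lies beyond the box
        if distance <= 1.5 * box:
            labels.append((v, "inside"))      # in the box or within the whiskers
        elif distance <= 3 * box:
            labels.append((v, "asterisk"))    # between inner and outer fences
        else:
            labels.append((v, "circle"))      # beyond the outer fence
    return labels

batch = [10, 11, 12, 12, 13, 14, 15, 22, 40]  # hypothetical measurements
print(hinges(batch))
print([item for item in classify(batch) if item[1] != "inside"])
```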

The command **HISTOGRAM** creates a graph that shows the sample density of a
continuous variable with vertical bars. The height of each bar shows the number of cases
whose values fall within an interval of values of the variable:

**HISTOGRAM** <varlist> [draws a histogram for each variable specified].
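What the histogram counts can be illustrated with a short Python sketch. The interval start, width and number of bars are chosen by hand here (an assumption; **SYSTAT** picks its own intervals), and the data are hypothetical.

```python
def histogram_counts(values, start, width, bins):
    """Count how many values fall in each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:
            counts[index] += 1
    return counts

# Hypothetical lengths in cm.
lengths = [3.8, 4.2, 4.4, 4.9, 5.1, 5.5, 6.0]
counts = histogram_counts(lengths, start=3.0, width=1.0, bins=4)

# Draw the bars sideways, one asterisk per case.
for i, count in enumerate(counts):
    low = 3.0 + i * 1.0
    print(f"{low:.1f}-{low + 1.0:.1f} | {'*' * count}")
```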

**PRODUCING ASCII FILES WITH INFORMATION FROM SYSTAT**. The information you see in
your screen can be put into a file that you can open in a word-processor to edit and
produce data like tables for a text. The procedure in **SYSTAT** is the following for
all the modules (we use here the **GRAPH** module):

GRAPH | [you enter the module] |

USE filename | [you call the file with the data; it will list the variables] |

OUTPUT filename.txt | [the file where the information will be written] |

STEM var1,var2 <enter> | [you enter the command for the plot with the variables you need] |

How to see this file: if you are in **C:\SYSTDOS**, type "**cd..**"
<enter>; then in C:\ type "**edit c:\systdos\filename.txt**" <enter>.
You will see the saved output in a basic word processor.

**MANIPULATING A SYSTAT DATA FILE**. The command **IF ... THEN LET ...** is
used for conditional transformations and deletion of data. It allows you to conditionally
transform variables: **IF <condition> THEN LET <var> = <expression>**.
For all cases where the condition is true, **SYSTAT** executes the action. You can use
any mathematically valid combination of variables, numbers, functions, and operators. The **IF
...** command can be run from the prompt in the **EDIT**-worksheet screen.

A simple form of this command makes a copy of one variable in a new column
with a different name: "**LET newvar=oldvar**". This is very handy when
you have to make changes to **oldvar**: **newvar** will be a backup of the data.
Another example: say that all your measurements of length for variable **L** are 0.15
too short (that is, 1.5 mm if **L** is recorded in cm). You can easily remedy this with the **LET** command alone: after copying **L** to a new variable with "**LET L2=L**", write "**LET
L2=L2+.15**" (L2 is the new variable on which you make changes). Commands using
the full **IF...** command would look like this:

"**IF L=11.0 THEN LET L=L+.15**" [to change only those values equal to
11.0];

"**IF L>11.0 THEN LET L=L+.15**" [to change only those GREATER than
11.0];

"**IF CASE > 500 THEN LET x = x^2**" [to change only the cases after case #500]

"**IF AGE > 80 THEN LET AGE$ = 'ELDERLY'**"

"**IF X = 99 THEN LET X = .**"

"**IF SEX$ = 'MALE' AND AGE > 30 THEN LET GROUP = 1**"

"**IF CASE=45 THEN DELETE**" [will delete case #45]

**SAVE** the file with these changes

The **DROP** command can eliminate variables from the worksheet: at the > prompt,
write **DROP** var1.
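The logic of these conditional transformations may be easier to follow in a general-purpose language. The Python sketch below mimics several of the commands above on three made-up cases; it is not SYSTAT syntax, the data are hypothetical, and it assumes that the "." in "LET X = ." stands for a missing value (represented here by `None`).

```python
MISSING = None  # assumption: stands in for the "." used in LET X = .

# Hypothetical cases, one dictionary per row of the worksheet.
cases = [
    {"L": 10.5, "X": 99.0, "AGE": 85, "SEX$": "MALE"},
    {"L": 11.0, "X": 4.0, "AGE": 42, "SEX$": "MALE"},
    {"L": 12.3, "X": 7.0, "AGE": 25, "SEX$": "FEMALE"},
]

for case in cases:
    case["L2"] = case["L"]                # LET L2=L (backup copy of the old variable)
    if case["L"] > 11.0:                  # IF L>11.0 THEN LET L=L+.15
        case["L"] = case["L"] + 0.15
    if case["AGE"] > 80:                  # IF AGE > 80 THEN LET AGE$ = 'ELDERLY'
        case["AGE$"] = "ELDERLY"
    if case["X"] == 99:                   # IF X = 99 THEN LET X = .
        case["X"] = MISSING
    if case["SEX$"] == "MALE" and case["AGE"] > 30:
        case["GROUP"] = 1                 # IF SEX$ = 'MALE' AND AGE > 30 THEN LET GROUP = 1

# IF CASE=2 THEN DELETE (case numbers start at 1).
cases = [case for number, case in enumerate(cases, start=1) if number != 2]

# DROP X: remove a variable from every remaining case.
for case in cases:
    case.pop("X")

print(cases)
```

Note that the action is applied only to the cases where the condition is true; all other cases keep their values, which is why the backup column **L2** still holds the original lengths.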

**MORE STATISTICAL ANALYSIS**

**METHODS FOR COMPARING GROUPS: **

**T-TEST** in the **STATS** module. The two-sample **t test** (or independent
t test) is ideal for comparing the means of two groups of cases. For example, are the floor
areas of structures at the Black and Smith sites significantly different? We will use the dataset
in the Group 3 assignment sheet.

**USE** **G3EX2**

**TTEST A * S$** [the probability **p** tells you the significance of the
difference in mean floor area (**A**) between the two sites (**S$**)]
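To make the t statistic less of a black box, here is a Python sketch of the two-sample t test. It assumes the pooled-variance form of the test, and the floor areas are hypothetical numbers, not the Group 3 data. The probability **p** is not computed here; it has to be looked up in a t-table using the degrees of freedom.

```python
import math
import statistics

def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance, divides by (n - 1)
    var_b = statistics.variance(sample_b)
    df = na + nb - 2
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / df
    # Standard error of the difference between the two means.
    se_diff = math.sqrt(pooled * (1 / na + 1 / nb))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / se_diff
    return t, df

# Hypothetical floor areas (square meters) at two sites.
site_one = [15.2, 17.8, 14.9, 16.3, 18.1]
site_two = [19.4, 21.0, 18.7, 20.2, 22.5]

t, df = pooled_t(site_one, site_two)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

A negative t simply means that the first group has the smaller mean, as in the t = -2.69 quoted below.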

I quote the following paragraphs from Drennan's 1996 *Statistics for Archaeologists:
A Common Sense Approach*, Chapter 11:

*When presenting the result of a significance test, it is always necessary to say
just what significance test was used and to provide the resulting statistic and the
associated probability. For the example in the text, we might say, "The 2.5 m²
difference in mean house floor area between the Black and Smith sites is very significant
(t = -2.69, .01 > p > .005)." This one sentence really says everything that
needs to be said. No further explanation would be necessary if we were writing for a
professional audience whom we can assume to be familiar with basic statistical principles
and practice. The "statistic" in this case is t, and providing its value makes
it clear that significance was evaluated with a t test, which is quite a standard
technique that does not need to be explained anew each time it is used. The probability
that the observed difference between the two samples was just a consequence of the
vagaries of sampling is the significance, or the associated probability. Ordinarily p
stands for this probability so in this case we have provided the information that the
significance is less than 1%. This means the same thing as saying that our confidence in
reporting a difference between the two periods is greater than 99%.*

*If, instead of performing a t test, we simply used the bullet graph to compare
estimates of the mean and their error ranges, as in Figure 11.1, we might say "As
Figure 11.1 shows, we can have greater than 99% confidence that mean house floor area
changed between Formative and Classic periods." The notion of estimates and their
error ranges for different confidence levels is also a very standard one which we do not
need to explain every time we use it. Bullet graphs, however, are less common than, say,
box-and-dot plots, so we cannot assume that everyone will automatically understand the
specific confidence levels of the different widths of the error bars. A key indicating
what the confidence levels are, as in Figure 11.1, is necessary.*

*In an instance like the example in the text, a bullet graph and a t test are
alternative approaches. Using and presenting both in a report qualifies as statistical
overkill. Pick the one approach that makes the simplest, clearest, most relevant statement
of what needs to be said in the context in which you are writing; use it; and go on.
Presentation of statistical results should support the argument you are making, not
interrupt it. The simplest, most straightforward presentation that provides complete
information is the best.*

Let's then prepare a **BULLET GRAPH** with the **MEAN** and the **Standard Error**
of floor area for each site. Create a new data file that looks like the one below and save it (remember
that the SE given by **STATS** is one **SE** and represents **68%** confidence).

M$ | Mean | SE |

Chert | 17.4 | 3.4 |

Obsidian | 12.3 | 2.1 |

Flint | 23.7 | 5.7 |

If you want to graph the error range at higher confidence levels you will need to
create columns **SE2** and **SE3** for the **95%** and **99%** confidence levels,
respectively. Use the **LET** command. The rule of thumb is to multiply the **SE**
by 1.96 and 2.57, respectively. But be aware that this is good only for large
samples. For small samples use the **t-table**: for a confidence level of 99% and 15 **df**
(degrees of freedom, which is **n-1**; for example, the number of obsidian cases minus
1) it tells you to multiply the **SE** by 2.947.
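The arithmetic behind these error ranges is simple enough to check in Python. The sketch below uses the means and SEs from the table above with the large-sample multipliers, and then the t-table multiplier 2.947 for the obsidian row on the assumption (taken from the example in the text, not from real counts) that it has 15 degrees of freedom. The function name is hypothetical.

```python
# Means and standard errors from the table above: material -> (mean, SE).
table = {"Chert": (17.4, 3.4), "Obsidian": (12.3, 2.1), "Flint": (23.7, 5.7)}

# Large-sample multipliers of the SE for each confidence level (rule of thumb).
MULTIPLIERS = {68: 1.0, 95: 1.96, 99: 2.57}

def error_range(mean, se, multiplier):
    """Return (low, high): the mean minus and plus the multiplied standard error."""
    return mean - multiplier * se, mean + multiplier * se

for material, (mean, se) in table.items():
    for level, multiplier in MULTIPLIERS.items():
        low, high = error_range(mean, se, multiplier)
        print(f"{material} {level}%: {low:.2f} to {high:.2f}")

# Small-sample version for obsidian, assuming 15 df as in the example above.
low_t, high_t = error_range(12.3, 2.1, 2.947)
print(f"Obsidian 99% (t-table, 15 df): {low_t:.2f} to {high_t:.2f}")
```

The t-table range is wider than the rule-of-thumb range for the same confidence level, which is the reason the rule of thumb should be kept for large samples.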

After saving, type **SYGRAPH** at the command prompt. The file **filename.sys**
should still be in use; otherwise enter "**USE filename.sys**". To produce
the graph type:

"**CPLOT mean*m$ / error=se**". If you want to graph the 95% or 99% error range, put
the name of that column (**SE2** or **SE3**) in place of **SE**.

Figure comparing graphs produced with STEM&LEAF, BOX-PLOT, and BULLET GRAPH techniques (From Drennan 1996: Figure 1.11).

**ANALYSIS OF VARIANCE or ANOVA** is also a test for comparing groups; it has the
same purpose as the t-test but it is suitable for more than two groups. One of the questions
we can investigate with **ANOVA** is whether there is some preference for making
projectile points of different sizes out of different raw materials (we follow the lithic
example). The independent or grouping variable is **M$** (material); the dependent
variable is **L** (length). First **SORT** the file by the grouping variable in the **DATA**
module. **SAVE** that file (see procedure in **SORT** command). The following
commands produce the analysis of variance:

STATS | [runs the STATS module] |

USE sortedfile | [sorted filename] |

BY groupingvariable | [M$] |

PRINT=LONG | [produces more detailed output for the ANOVA analysis] |

STATISTICS depvariable | [WT; this produces N, Min, Max, Mean & SD for each group] |

The output has the statistics for each of the groups and the following:

The key information here is **P**. A value of 0.01 means that there is 1 chance in 100 of
randomly selecting three subsamples with the means and SDs that these have from three
populations whose means are the same; that is, there is only a minimal possibility that the observed differences
are due to the vagaries of sampling. In this case we have high confidence that the
lithic artifacts made of different stones really do have different mean weights.
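The F statistic on which **P** is based compares the variation between the group means with the variation inside the groups. A minimal Python sketch of a one-way analysis of variance follows; the weights are hypothetical, the function name is an assumption, and the probability itself is not computed (it comes from an F-table or from the program output).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the cases around their own group mean.
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical weights (g) of points made of chert, flint and obsidian.
chert = [12.1, 13.4, 11.8, 12.9]
flint = [15.2, 14.8, 16.1, 15.5]
obsidian = [9.7, 10.4, 9.9, 10.8]

f, df_b, df_w = one_way_anova([chert, flint, obsidian])
print(f"F = {f:.2f} with {df_b} and {df_w} degrees of freedom")
```

A large F means the group means differ by much more than the scatter within the groups would lead us to expect, which is what produces a small **P**.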

**REGRESSION ANALYSIS**. Regression is a statistical procedure for determining the
relationship between a random variable and the corresponding values of an independent
variable. This analysis is good for measuring the relationship between the size (X, in cm) and
weight (Y, in g) of lithic artifacts. We assume that this relationship will be positive,
that is, the longer the artifact the heavier it is. A negative relationship can be expected
from the following premise: "there is a decrease in the number of artifacts (X is
number) as we walk away from the site (Y is m)".

So the regression measures **how much of Y is explained by X** (here, how much of weight
is explained by size). This analysis is also widely used in PREDICTIVE ANALYSIS. After
you analyze a sample of, say, 50 artifacts, you will get an **equation** that will allow
you to predict, with a certain confidence level, the **weight** of an artifact based on its **size**.

First, create the data file in **EDIT** with size (X) and weight (Y). Next, produce
in **GRAPH** a scatter plot of this relationship: "**PLOT** **Y*X**".
For a regression analysis the scatter plot needs to reveal at least a very rough linear shape (an oval distribution, for example). If there is a clear tendency toward a curved pattern it is
recommended to perform transformations in order to straighten that curved pattern (we will not
develop this requisite of the analysis here). Then proceed with the regression analysis,
using the **MGLH** ("Multivariate General Linear Hypothesis") module.

**MGLH**

**USE datafile**

**MODEL depvar (Y)= CONSTANT + indepvar (X)**

**ESTIMATE** [starts calculations and generates output]

The output is:

This output has much more information than you really need here. The information we need for a regression analysis, however, is included. And we can draw the following conclusions:

**1**. The relationship between the variables volume (independent) and number of
artifacts (dependent) is expressed mathematically by the **regression equation** (ideal
straight line): **Y = 34.156 X + 8.223** (this is the equation that is used to predict
Y if we know X).

**2**. **R squared** (the strength of the relationship) is .484, so **48.4% of the
variance of Y is "explained" by X** (or, 51.6% of the variation of Y is
"unexplained" by X).

**3**. The value of **Pearson R: .696** indicates the direction and strength of the
relationship between X and Y: positive, and somewhat below the perfect correlation of r=1.

**4.** Test of significance (on a sample size of 20): F=16.093 and **p=.001**. The
probability value indicates that there is only a **0.1%** chance of getting a sample
with this r2 value (.484) from a universe where there is no relationship between X and Y.
In other words, we can have 99.9% confidence that, in the universe from which the sample was drawn,
there is a relationship between X and Y.
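The quantities in conclusions 1 to 3 can be reproduced with a least-squares calculation. The Python sketch below fits a straight line to a small made-up sample, and then uses the regression equation quoted in conclusion 1 for a prediction; the function names and the sample are hypothetical, and the significance test (F and p) is not computed here.

```python
import math

def linear_regression(xs, ys):
    """Least-squares fit of Y = slope * X + intercept; also returns Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

def predict(x, slope=34.156, intercept=8.223):
    """Predict Y from X; the defaults are the equation quoted in conclusion 1."""
    return slope * x + intercept

# Hypothetical sizes (cm) and weights (g) of six artifacts.
sizes = [2.0, 2.5, 3.1, 3.8, 4.4, 5.0]
weights = [6.1, 9.0, 9.8, 14.2, 13.9, 18.5]

slope, intercept, r = linear_regression(sizes, weights)
print(f"Y = {slope:.3f} X + {intercept:.3f}, r = {r:.3f}, r squared = {r * r:.3f}")
print(f"Predicted Y for X = 2.0 with the equation above: {predict(2.0):.3f}")
```

R squared is simply Pearson R multiplied by itself, which is why .696 and .484 in the output above go together.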

When you are finished you will exit **SYSTAT** with the command "**QUIT**".