Skip to main content

Summary Statistics

Summary statistics help people quickly understand datasets and make informed decisions. Many interesting datasets are stored in spreadsheet files.

SheetJS is a JavaScript library for reading and writing data from spreadsheets.

This demo uses SheetJS to process data in spreadsheets. We'll explore how to extract spreadsheet data and how to compute simple summary statistics. This demo will focus on two general data representations:

The Import Tutorial is a guided example of extracting data from a workbook. It is strongly recommended to review the tutorial first.

Tested Deployments

This browser demo was tested in the following environments:

BrowserDate
Chrome 1192024-01-06

Data Representations

Many worksheets include one header row followed by a number of data rows. Each row is an "observation" and each column is a "variable".

The "Array of Objects" explanations use more idiomatic JavaScript patterns. It is suitable for smaller datasets.

The "Dense Worksheets" approach is more performant, but the code patterns are reminiscent of C. The low-level approach is only encouraged when the traditional patterns are prohibitively slow.

Arrays of Objects

The idiomatic JavaScript representation of the dataset is an array of objects. Variable names are typically taken from the first row. Those names are used as keys in each observation.

SpreadsheetJS Data

pres.xlsx data

[
{ Name: "Bill Clinton", Index: 42 },
{ Name: "GeorgeW Bush", Index: 43 },
{ Name: "Barack Obama", Index: 44 },
{ Name: "Donald Trump", Index: 45 },
{ Name: "Joseph Biden", Index: 46 }
]

The SheetJS sheet_to_json method1 can generate arrays of objects from a worksheet object. For example, the following snippet fetches a test file and creates an array of arrays from the first sheet:

const url = "https://docs.sheetjs.com/typedarray/iris.xlsx";

/* fetch file and pull file data into an ArrayBuffer */
const file = await (await fetch(url)).arrayBuffer();

/* parse workbook */
const workbook = XLSX.read(file, {dense: true});

/* first worksheet */
const first_sheet = workbook.Sheets[workbook.SheetNames[0]];

/* generate array of arrays */
const aoo = XLSX.utils.sheet_to_json(first_sheet);

Dense Worksheets

SheetJS "dense" worksheets2 store cells in an array of arrays. The SheetJS read method3 accepts a special dense option to create dense worksheets.

The following example fetches a file:

/* fetch file and pull file data into an ArrayBuffer */
const url = "https://docs.sheetjs.com/typedarray/iris.xlsx";
const file = await (await fetch(url)).arrayBuffer();

/* parse workbook */
const workbook = XLSX.read(file, {dense: true});

/* first worksheet */
const first_dense_sheet = workbook.Sheets[workbook.SheetNames[0]];

The "!data" property of a dense worksheet is an array of arrays of cell objects4. Cell objects include attributes including data type and value.

Analyzing Variables

Individual variables can be extracted by looping through the array of objects and accessing specific keys. For example, using the Iris dataset:

Iris dataset

The following snippet shows the first entry in the array of objects:

{
"sepal length": 5.1,
"sepal width": 3.5,
"petal length": 1.4,
"petal width": 0.2,
"class ": "Iris-setosa"
}

The values for the sepal length variable can be extracted by indexing each object. The following snippet prints the sepal lengths:

for(let i = 0; i < aoo.length; ++i) {
const row = aoo[i];
const sepal_length = row["sepal length"];
console.log(sepal_length);
}
Result
Loading...
Live Editor
function SheetJSAoOExtractColumn() {
  const [col, setCol] = React.useState([]);

  React.useEffect(() => { (async() => {
    const ab = await (await fetch("/typedarray/iris.xlsx")).arrayBuffer();
    const wb = XLSX.read(ab, {dense: true});
    const aoo = XLSX.utils.sheet_to_json(wb.Sheets[wb.SheetNames[0]]);
    /* store first 5 sepal lengths in an array */
    const col = [];
    for(let i = 0; i < aoo.length; ++i) {
      const row = aoo[i];
      const sepal_length = row["sepal length"];
      col.push(sepal_length); if(col.length >= 5) break;
    }
    setCol(col);
  })(); }, []);

  return ( <>
  <b>First 5 Sepal Length Values</b><br/>
  <table><tbody>
    {col.map(sw => (<tr><td>{sw}</td></tr>))}
  </tbody></table>
  </>
  );
}

Average (Mean)

For a given sequence of numbers x1xcountx_1\mathellipsis x_{count} the mean MM is defined as the sum of the elements divided by the count:

M[x;count]=1counti=1countxiM[x;count] = \frac{1}{count}\sum_{i=1}^{count} x_i

In JavaScript terms, the mean of an array is the sum of the numbers in the array divided by the total number of numeric values.

Non-numeric elements and array holes do not affect the sum and do not contribute to the count. Algorithms are expected to explicitly track the count and cannot assume the array length property will be the correct count.

This definition aligns with the spreadsheet AVERAGE function.

AVERAGEA differs from AVERAGE in its treatment of string and Boolean values: string values are treated as zeroes and Boolean values map to their coerced numeric equivalent (true is 1 and false is 0).

JavaScript Ecosystem

Some JavaScript libraries implement functions for computing array means.

LibraryImplementation
jStat8Textbook sum (divide at end)
simple-statistics9Neumaier compensated sum (divide at end)
stdlib.js10Trial mean (mean) / van Reeken (incrmean)

Textbook Sum

The mean of a sequence of values can be calculated by computing the sum and dividing by the count.

The following function accepts an array of objects and a key.

function aoa_average_of_key(aoo, key) {
let sum = 0, cnt = 0;
for(let R = 0; R < aoo.length; ++R) {
const row = aoo[R];
if(typeof row == "undefined") continue;

const field = row[key];
if(typeof field != "number") continue;

sum += field; ++cnt;
}
return cnt == 0 ? 0 : sum / cnt;
}
Live Demo (click to show)
Result
Loading...
Live Editor
function SheetJSAoOAverageKey() {
  const [avg, setAvg] = React.useState(NaN);

  function aoa_average_of_key(aoo, key) {
    let sum = 0, cnt = 0;
    for(let R = 0; R < aoo.length; ++R) {
      const row = aoo[R];
      if(typeof row == "undefined") continue;

      const field = row[key];
      if(typeof field != "number") continue;

      sum += field; ++cnt;
    }
    return cnt == 0 ? 0 : sum / cnt;
  }

  React.useEffect(() => { (async() => {
    const ab = await (await fetch("/typedarray/iris.xlsx")).arrayBuffer();
    const wb = XLSX.read(ab, {dense: true});
    const aoo = XLSX.utils.sheet_to_json(wb.Sheets[wb.SheetNames[0]]);
    setAvg(aoa_average_of_key(aoo, "sepal length"));
  })(); }, []);

  return ( <b>The average Sepal Length is {avg}</b> );
}

The textbook method suffers from numerical issues when many values of similar magnitude are summed. As the number of elements grows, the absolute value of the sum grows to orders of magnitude larger than the absolute values of the individual values and significant figures are lost.

van Reeken

Some of the issues in the textbook approach can be addressed with a differential technique. Instead of computing the whole sum, it is possible to calculate and update an estimate for the mean.

The van Reeken array mean can be implemented in one line of JavaScript code:

for(var n = 1, mean = 0; n <= x.length; ++n) mean += (x[n-1] - mean)/n;
Math details (click to show)

Let M[x;m]=1mi=1mxmM[x;m] = \frac{1}{m}\sum_{i=1}^{m}x_m be the mean of the first mm elements. Then:

M[x;m+1]M[x;m+1]

=1m+1i=1m+1xi= \frac{1}{m+1}\sum_{i=1}^{m+1} x_i

 

=1m+1i=1mxi+xm+1m+1= \frac{1}{m+1}\sum_{i=1}^{m} x_i + \frac{x_{m+1}}{m+1}

 

=mm+1(1mi=1mxi)+xm+1m+1= \frac{m}{m+1}(\frac{1}{m}\sum_{i=1}^{m} x_i) + \frac{x_{m+1}}{m+1}

 

=mm+1M[x;m]+xm+1m+1= \frac{m}{m+1}M[x;m] + \frac{x_{m+1}}{m+1}

 

=(11m+1)M[x;m]+xm+1m+1= (1 - \frac{1}{m+1})M[x;m] + \frac{x_{m+1}}{m+1}

 

=M[x;m]+xm+1m+11m+1M[x;m]= M[x;m] + \frac{x_{m+1}}{m+1} - \frac{1}{m+1}M[x;m]

 

=M[x;m]+1m+1(xm+1M[x;m])= M[x;m] + \frac{1}{m+1}(x_{m+1}-M[x;m])

new_meannew\_mean

=old_mean+(xm+1old_mean)/(m+1)= old\_mean + (x_{m+1}-old\_mean) / (m+1)

Switching to zero-based indexing, the relation matches the following expression:

new_mean = old_mean + (x[m] - old_mean) / (m + 1);

This update can be succinctly implemented in JavaScript:

mean += (x[m] - mean) / (m + 1);

The following function accepts an array of objects and a key.

function aoa_mean_of_key(aoo, key) {
let mean = 0, cnt = 0;
for(let R = 0; R < aoo.length; ++R) {
const row = aoo[R];
if(typeof row == "undefined") continue;

const field = row[key];
if(typeof field != "number") continue;

mean += (field - mean) / ++cnt;
}
return cnt == 0 ? 0 : mean;
}
Live Demo (click to show)
Result
Loading...
Live Editor
function SheetJSAoOMeanKey() {
  const [avg, setAvg] = React.useState(NaN);

  function aoa_mean_of_key(aoo, key) {
    let mean = 0, cnt = 0;
    for(let R = 0; R < aoo.length; ++R) {
      const row = aoo[R];
      if(typeof row == "undefined") continue;

      const field = row[key];
      if(typeof field != "number") continue;

      mean += (field - mean) / ++cnt;
    }
    return cnt == 0 ? 0 : mean;
  }

  React.useEffect(() => { (async() => {
    const ab = await (await fetch("/typedarray/iris.xlsx")).arrayBuffer();
    const wb = XLSX.read(ab, {dense: true});
    const aoo = XLSX.utils.sheet_to_json(wb.Sheets[wb.SheetNames[0]]);
    setAvg(aoa_mean_of_key(aoo, "sepal length"));
  })(); }, []);

  return ( <b>The average Sepal Length is {avg}</b> );
}
Historical Context

This algorithm is generally attributed to Welford11. However, the original paper does not propose this algorithm for calculating the mean!

Programmers including Neely12 attributed a different algorithm to Welford. van Reeken13 reported success with the algorithm presented in this section.

Knuth14 erroneously attributed this implementation of the mean to Welford.

Footnotes

  1. See sheet_to_json in "Utilities"

  2. See "Dense Mode" in "Utilities"

  3. See read in "Reading Files"

  4. See "Dense Mode" in "Utilities"

  5. See "Cell Types" in "Cell Objects"

  6. See "Underlying Values" in "Cell Objects"

  7. See "Cell Types" in "Cell Objects"

  8. See mean() in the jStat documentation.

  9. See mean in the simple-statistics documentation.

  10. See incrsum in the stdlib.js documentation.

  11. See "Note on a Method for Calculated Corrected Sums of Squares and Products" in Technometrics Vol 4 No 3 (1962 August).

  12. See "Comparison of Several Algorithms for Computation of Means, Standard Deviations and Correlation Coefficients" in CACM Vol 9 No 7 (1966 July).

  13. See "Dealing with Neely's Algorithms" in CACM Vol 11 No 3 (1968 March).

  14. See "The Art of Computer Programming: Seminumerical Algorithms" Third Edition page 232.