Typed Arrays and ML
Machine learning libraries in JS typically use "Typed Arrays". Typed Arrays are not JS Arrays! With some data wrangling, translating between SheetJS worksheets and typed arrays is straightforward.
This demo covers conversions between worksheets and Typed Arrays for use with TensorFlow.js and other ML libraries.
Live code blocks in this page load the standalone build from version 3.18.0
.
For use in web frameworks, the @tensorflow/tfjs
module should be used.
For use in NodeJS, the native bindings module is @tensorflow/tfjs-node
.
CSV Data Interchange
tf.data.csv
generates a Dataset from CSV data. The function expects a URL.
Fortunately blob URLs are supported, making data import straightforward:
function worksheet_to_csv_url(worksheet) {
/* generate CSV */
const csv = XLSX.utils.sheet_to_csv(worksheet);
/* CSV -> Uint8Array -> Blob */
const u8 = new TextEncoder().encode(csv);
const blob = new Blob([u8], { type: "text/csv" });
/* generate a blob URL */
return URL.createObjectURL(blob);
}
This demo mirrors TFjs
docs,
fetching an XLSX export of the example dataset.
TF CSV Demo using XLSX files (click to show)
If the live demo shows a message
ReferenceError: tf is not defined
please refresh the page. This is a known bug in the documentation generator.
In the other direction, XLSX.read
will readily parse CSV exports.
JS Array Interchange
The official Linear Regression tutorial loads data from a JSON file:
[
{
"Name": "chevrolet chevelle malibu",
"Miles_per_Gallon": 18,
"Cylinders": 8,
"Displacement": 307,
"Horsepower": 130,
"Weight_in_lbs": 3504,
"Acceleration": 12,
"Year": "1970-01-01",
"Origin": "USA"
},
{
"Name": "buick skylark 320",
"Miles_per_Gallon": 15,
"Cylinders": 8,
"Displacement": 350,
"Horsepower": 165,
"Weight_in_lbs": 3693,
"Acceleration": 11.5,
"Year": "1970-01-01",
"Origin": "USA"
},
// ...
]
In real use cases, data is stored in spreadsheets
Following the tutorial, the data fetching method is easily adapted. Differences from the official example are highlighted below:
/**
* Get the car data reduced to just the variables we are interested
* and cleaned of missing data.
*/
async function getData() {
/* fetch file */
const carsDataResponse = await fetch('https://sheetjs.com/data/cd.xls');
/* get file data (ArrayBuffer) */
const carsDataAB = await carsDataResponse.arrayBuffer();
/* parse */
const carsDataWB = XLSX.read(carsDataAB);
/* get first worksheet */
const carsDataWS = carsDataWB.Sheets[carsDataWB.SheetNames[0]];
/* generate array of JS objects */
const carsData = XLSX.utils.sheet_to_json(carsDataWS);
const cleaned = carsData.map(car => ({
mpg: car.Miles_per_Gallon,
horsepower: car.Horsepower,
}))
.filter(car => (car.mpg != null && car.horsepower != null));
return cleaned;
}
Low-Level Operations
While it is more efficient to use low-level operations, JS or CSV interchange is strongly recommended when possible.
Data Transposition
A typical dataset in a spreadsheet will start with one header row and represent each data record in its own row. For example, the Iris dataset might look like
XLSX.utils.sheet_to_json
will translate this into an array of row objects:
var aoo = [
{"sepal length": 5.1, "sepal width": 3.5, ...},
{"sepal length": 4.9, "sepal width": 3, ...},
...
];
TF.js and other libraries tend to operate on individual columns, equivalent to:
var sepal_lengths = [5.1, 4.9, ...];
var sepal_widths = [3.5, 3, ...];
When a tensor2d
can be exported, it will look different from the spreadsheet:
var data_set_2d = [
[5.1, 4.9, ...],
[3.5, 3, ...],
...
]
This is the transpose of how people use spreadsheets!
Typed Arrays and Columns
A single typed array can be converted to a pure JS array with Array.from
:
var column = Array.from(dataset_typedarray);
Similarly, Float32Array.from
generates a typed array from a normal array:
var dataset = Float32Array.from(column);
Exporting Datasets to a Worksheet
XLSX.utils.aoa_to_sheet
can generate a worksheet from an array of arrays.
ML libraries typically provide APIs to pull an array of arrays, but it will
be transposed. To export multiple data sets, manually "transpose" the data:
/* assuming data is an array of typed arrays */
var aoa = [];
for(var i = 0; i < data.length; ++i) {
for(var j = 0; j < data[i].length; ++j) {
if(!aoa[j]) aoa[j] = [];
aoa[j][i] = data[i][j];
}
}
/* aoa can be directly converted to a worksheet object */
var ws = XLSX.utils.aoa_to_sheet(aoa);
Importing Data from a Spreadsheet
sheet_to_json
with the option header:1
will generate a row-major array of
arrays that can be transposed. However, it is more efficient to walk the sheet
manually:
/* find worksheet range */
var range = XLSX.utils.decode_range(ws['!ref']);
var out = []
/* walk the columns */
for(var C = range.s.c; C <= range.e.c; ++C) {
/* create the typed array */
var ta = new Float32Array(range.e.r - range.s.r + 1);
/* walk the rows */
for(var R = range.s.r; R <= range.e.r; ++R) {
/* find the cell, skip it if the cell isn't numeric or boolean */
var cell = ws[XLSX.utils.encode_cell({r:R, c:C})];
if(!cell || cell.t != 'n' && cell.t != 'b') continue;
/* assign to the typed array */
ta[R - range.s.r] = cell.v;
}
out.push(ta);
}
If the data set has a header row, the loop can be adjusted to skip those rows.
TF.js Tensors
A single Array#map
can pull individual named fields from the result, which
can be used to construct TensorFlow.js tensor objects:
const aoo = XLSX.utils.sheet_to_json(worksheet);
const lengths = aoo.map(row => row["sepal length"]);
const tensor = tf.tensor1d(lengths);
tf.Tensor
objects can be directly transposed using transpose
:
var aoo = XLSX.utils.sheet_to_json(worksheet);
// "x" and "y" are the fields we want to pull from the data
var data = aoo.map(row => ([row["x"], row["y"]]));
// create a tensor representing two column datasets
var tensor = tf.tensor2d(data).transpose();
// individual columns can be accessed
var col1 = tensor.slice([0,0], [1,tensor.shape[1]]).flatten();
var col2 = tensor.slice([1,0], [1,tensor.shape[1]]).flatten();
For exporting, stack
can be used to collapse the columns into a linear array:
/* pull data into a Float32Array */
var result = tf.stack([col1, col2]).transpose();
var shape = tensor.shape;
var f32 = tensor.dataSync();
/* construct an array of arrays of the data in spreadsheet order */
var aoa = [];
for(var j = 0; j < shape[0]; ++j) {
aoa[j] = [];
for(var i = 0; i < shape[1]; ++i) aoa[j][i] = f32[j * shape[1] + i];
}
/* add headers to the top */
aoa.unshift(["x", "y"]);
/* generate worksheet */
var worksheet = XLSX.utils.aoa_to_sheet(aoa);