Browser Automation
Headless automation involves controlling "headless browsers" to access websites and submit or download data. It is also possible to automate browsers using custom browser extensions.
The SheetJS standalone script can be added to
any website by inserting a SCRIPT
tag. Headless browsers usually provide
utility functions for running custom snippets in the browser and passing data
back to the automation script.
Use Case
This demo focuses on exporting table data to a workbook. Headless browsers do not generally support passing objects between the browser context and the automation script, so the file data must be generated in the browser context and sent back to the automation script for saving in the file system.
Key Steps (click to hide)
1) Launch the headless browser and load the target site.
2) Add the standalone SheetJS build to the page in a SCRIPT
tag.
3) Add a script to the page (in the browser context) that will:
- Make a workbook object from the first table using
XLSX.utils.table_to_book
- Generate the bytes for an XLSB file using
XLSX.write
- Send the bytes back to the automation script
4) When the automation context receives data, save to a file
This demo exports data from https://sheetjs.com/demos/table.
It is also possible to parse files from the browser context, but parsing from the automation context is more efficient and strongly recommended.
Puppeteer
Puppeteer enables headless Chromium automation for NodeJS. Releases ship with an installer script that prepares a compatible browser version.
- NodeJS
- Deno
Binary strings are the favored data type. They can be safely passed from the
browser context to the automation script. NodeJS provides an API to write
binary strings to file (fs.writeFileSync
using encoding binary
).
The key steps are commented below:
const fs = require("fs");
const puppeteer = require('puppeteer');
(async () => {
/* (1) Load the target page */
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.on("console", msg => console.log("PAGE LOG:", msg.text()));
await page.setViewport({width: 1920, height: 1080});
await page.goto('https://sheetjs.com/demos/table');
/* (2) Load the standalone SheetJS build from the CDN */
await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-0.19.3/package/dist/xlsx.full.min.js' });
/* (3) Run the snippet in browser and return data */
const bin = await page.evaluate(() => {
/* NOTE: this function will be evaluated in the browser context.
`page`, `fs` and `puppeteer` are not available.
`XLSX` will be available thanks to step 2 */
/* find first table */
var table = document.body.getElementsByTagName('table')[0];
/* call table_to_book on first table */
var wb = XLSX.utils.table_to_book(table);
/* generate XLSB and return binary string */
return XLSX.write(wb, {type: "binary", bookType: "xlsb"});
});
/* (4) write data to file */
fs.writeFileSync("SheetJSPuppeteer.xlsb", bin, { encoding: "binary" });
await browser.close();
})();
Demo
This demo was last tested on 2023 April 29 against Puppeteer 19.11.1.
1) Install SheetJS and Puppeteer:
npm i --save https://cdn.sheetjs.com/xlsx-0.19.3/xlsx-0.19.3.tgz [email protected]
2) Save the SheetJSPuppeteer.js
code snippet to SheetJSPuppeteer.js
.
3) Run node SheetJSPuppeteer.js
.
When the script finishes, the file SheetJSPuppeteer.xlsb
will be created.
This file can be opened with Excel.
Deno Puppeteer is a fork. It is not officially supported by the Puppeteer team.
Base64 strings are the favored data type. They can be safely passed from the
browser context to the automation script. Deno can decode the Base64 strings
and write the decoded Uint8Array
data to file with Deno.writeFileSync
The key steps are commented below:
import puppeteer from "https://deno.land/x/[email protected]/mod.ts";
import { decode } from "https://deno.land/std/encoding/base64.ts"
/* (1) Load the target page */
const browser = await puppeteer.launch();
const page = await browser.newPage();
page.on("console", msg => console.log("PAGE LOG:", msg.text()));
await page.setViewport({width: 1920, height: 1080});
await page.goto('https://sheetjs.com/demos/table');
/* (2) Load the standalone SheetJS build from the CDN */
await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-0.19.3/package/dist/xlsx.full.min.js' });
/* (3) Run the snippet in browser and return data */
const b64 = await page.evaluate(() => {
/* NOTE: this function will be evaluated in the browser context.
`page`, `fs` and `puppeteer` are not available.
`XLSX` will be available thanks to step 2 */
/* find first table */
var table = document.body.getElementsByTagName('table')[0];
/* call table_to_book on first table */
var wb = XLSX.utils.table_to_book(table);
/* generate XLSB and return binary string */
return XLSX.write(wb, {type: "base64", bookType: "xlsb"});
});
/* (4) write data to file */
Deno.writeFileSync("SheetJSPuppeteer.xlsb", decode(b64));
await browser.close();
Demo
This demo was last tested on 2023 April 29 against deno-puppeteer 16.2.0.
1) Install deno-puppeteer:
env PUPPETEER_PRODUCT=chrome deno run -A --unstable https://deno.land/x/[email protected]/install.ts
2) Save the SheetJSPuppeteer.ts
code snippet to SheetJSPuppeteer.ts
.
3) Run deno run -A --unstable SheetJSPuppeteer.ts
.
When the script finishes, the file SheetJSPuppeteer.xlsb
will be created.
This file can be opened with Excel.
Playwright
Playwright presents a unified scripting framework for Chromium, WebKit, and other browsers. It draws inspiration from Puppeteer. In fact, the example code is almost identical!
Differences from the Puppeteer example are highlighted below:
const fs = require("fs");
const { webkit } = require('playwright'); // import desired browser
(async () => {
/* (1) Load the target page */
const browser = await webkit.launch(); // launch desired browser
const page = await browser.newPage();
page.on("console", msg => console.log("PAGE LOG:", msg.text()));
await page.setViewportSize({width: 1920, height: 1080}); // different name :(
await page.goto('https://sheetjs.com/demos/table');
/* (2) Load the standalone SheetJS build from the CDN */
await page.addScriptTag({ url: 'https://cdn.sheetjs.com/xlsx-0.19.3/package/dist/xlsx.full.min.js' });
/* (3) Run the snippet in browser and return data */
const bin = await page.evaluate(() => {
/* NOTE: this function will be evaluated in the browser context.
`page`, `fs` and the browser engine are not available.
`XLSX` will be available thanks to step 2 */
/* find first table */
var table = document.body.getElementsByTagName('table')[0];
/* call table_to_book on first table */
var wb = XLSX.utils.table_to_book(table);
/* generate XLSB and return binary string */
return XLSX.write(wb, {type: "binary", bookType: "xlsb"});
});
/* (4) write data to file */
fs.writeFileSync("SheetJSPlaywright.xlsb", bin, { encoding: "binary" });
await browser.close();
})();
Demo
This demo was last tested on 2023 April 29 against Playwright 1.33.0.
1) Install SheetJS and Playwright:
npm i --save https://cdn.sheetjs.com/xlsx-0.19.3/xlsx-0.19.3.tgz [email protected]
2) Save the SheetJSPlaywright.js
code snippet to SheetJSPlaywright.js
.
3) Run node SheetJSPlaywright.js
.
When the script finishes, the file SheetJSPlaywright.xlsb
will be created.
This file can be opened with Excel.
PhantomJS
PhantomJS is a headless web browser powered by WebKit.
This information is provided for legacy deployments. PhantomJS development has been suspended and there are known vulnerabilities, so new projects should use alternatives. For WebKit automation, new projects should use Playwright.
Binary strings are the favored data type. They can be safely passed from the
browser context to the automation script. PhantomJS provides an API to write
binary strings to file (fs.write
using mode wb
).
Integration Details and Demo (click to show)
The steps are marked in the comments:
var page = require('webpage').create();
page.onConsoleMessage = function(msg) { console.log(msg); };
/* (1) Load the target page */
page.open('https://sheetjs.com/demos/table', function() {
/* (2) Load the standalone SheetJS build from the CDN */
page.includeJs("https://cdn.sheetjs.com/xlsx-0.19.3/package/dist/xlsx.full.min.js", function() {
/* (3) Run the snippet in browser and return data */
var bin = page.evaluateJavaScript([ "function(){",
/* find first table */
"var table = document.body.getElementsByTagName('table')[0];",
/* call table_to_book on first table */
"var wb = XLSX.utils.table_to_book(table);",
/* generate XLSB file and return binary string */
"return XLSX.write(wb, {type: 'binary', bookType: 'xlsb'});",
"}" ].join(""));
/* (4) write data to file */
require("fs").write("SheetJSPhantomJS.xlsb", bin, "wb");
phantom.exit();
});
});
PhantomJS is very finicky and will hang if there are script errors. It is strongly recommended to add verbose logging and to lint scripts before use.
Demo
This demo was last tested on 2023 April 29 against PhantomJS 2.1.1
1) Download and unzip the PhantomJS release from the official website.
2) Save the SheetJSPhantom.js
code snippet to SheetJSPhantom.js
.
3) Run the command.
In macOS:
./phantomjs-2.1.1-macosx/bin/phantomjs SheetJSPhantom.js
When the script finishes, the file SheetJSPhantomJS.xlsb
will be created.
This file can be opened with Excel.