Sheets in PST Mailboxes
PST (Personal Storage Table) is a common file format for storing messages. Electronic discovery commonly involves extracting data from attached spreadsheets in e-mail messages stored in PST archives.
pst-extractor
1 is a NodeJS module designed for extracting objects from PST
files. It has been used to extract spreadsheets from the Enron Corpus2 and
other large mailboxes.
SheetJS is a JavaScript library for reading and writing data from spreadsheets.
This demo uses pst-extractor
and SheetJS to read spreadsheets. We'll explore
how to load SheetJS in a NodeJS script or website, extract spreadsheets files,
and generate HTML and CSV views of the underlying data.
The "Live Demo" reads PST files. Individual spreadsheets within the file can be downloaded or previewed in the browser.
This demo was last tested on 2024 March 11 against pst-extractor
1.9.0
Overview
The SheetJS NodeJS module can be
imported from scripts that use pst-extractor
.
Parsing PST Files
The pst-extractor
module exposes a PSTFile
class. The constructor requires a
proper NodeJS buffer.
The following snippet reads and parses enron.pst
from the local filesystem.
fs.readFileSync
3 accepts a filename and returns a Buffer:
const fs = require("fs"), PSTExtractor = require("pst-extractor");
const file = fs.readFileSync("enron.pst");
const pst = new (PSTExtractor.PSTFile)(file);
Walking the Tree
pst-extractor
presents a tree-like structure to inspect the contents of the
PST file. It is recommended to use recursive functions to walk the tree.
The following tree walker will collect all XLSX and XLS attachments:
/* walk the PST file and add all attachments to the specified array */
function walk(f,arr) {
if(f.hasSubfolders) for(let sf of f.getSubFolders()) walk(sf,arr);
if(f.contentCount <= 0) return;
for(let e = f.getNextChild(); e != null; e = f.getNextChild()) {
for(let i = 0; i < e.numberOfAttachments; ++i) {
var a = e.getAttachment(i);
/* XLS spreadsheet test by filename */
if(/.xls[xmb]?$/.test(a.filename)) arr.push(a);
}
}
}
/* generate a list of attachments */
const files = [];
walk(pst.getRootFolder(), files);
Generating Buffers
The PSTAttachment
class holds attachment metadata. To avoid loading everything
in memory, the raw data is exposed as a custom stream object. Since the SheetJS
read
function requires data in a Buffer
or Uint8Array
, a helper function
is used to collect the data:
/* collect data from the attachment into a "Buffer" */
function collect(file) {
const strm = file.fileInputStream;
const data = Buffer.alloc(strm._length.low);
strm.readCompletely(data);
return data;
}
/* collect data from the first attachment */
const buf0 = collect(files[0]);
Processing Attachments
Given a NodeJS Buffer, the SheetJS read
method4 parses the data and returns
a workbook object5. Individual worksheets can be extracted from the workbook
and converted to CSV6 or HTML7.
The following example prints the contents of each worksheet in CSV form:
const XLSX = require("xlsx");
/* parse workbook and print CSV contents of each sheet */
const wb = XLSX.read(buf0);
wb.SheetNames.forEach(n => {
const ws = wb.Sheets[n];
const csv = XLSX.utils.sheet_to_csv(ws);
console.log(`#### ${file.filename} ! ${n}`);
console.log(csv);
});
Browser Caveats
The SheetJS Standalone scripts
can be loaded through a SCRIPT
tag.
This demo uses a special pst-extractor
build for the web.
Compared to the NodeJS build, browser scripts require special Buffer wrappers.
For example, the following function will fail since the library does not support
ArrayBuffer
objects:
async function error_fetch_and_parse_pst(url) {
const ab = await (await fetch(url)).arrayBuffer();
// this will throw an error
return new (PSTExtractor.PSTFile)(ab);
}
The browser build exposes the Buffer
object in the PSTExtractor
global:
async function correct_fetch_and_parse_pst(url) {
const ab = await (await fetch(url)).arrayBuffer();
const buf = new PSTExtractor.Buffer(ab);
return new (PSTExtractor.PSTFile)(buf);
}
Browser Build
The pst-extractor
library is designed for NodeJS. Parts of the library expect
a NodeJS Buffer
, which does not exist in the browser. A fake Buffer
can be
added and exposed in a script.
pstextractor.js
is loaded in the demo page.
Build instructions (click to show)
- Initialize a new NodeJS project and install the dependency:
mkdir pstextract
cd pstextract
npm init -y
npm i --save [email protected]
- Save the following to
shim.js
:
const PSTExtractor = require("pst-extractor");
module.exports = PSTExtractor;
module.exports.Buffer = Buffer;
- Build the script:
npx [email protected] -s PSTExtractor -o pstextractor.js shim.js