Modern Spreadsheets in Stata
Stata is a statistical software package. It offers a robust C-based extension system.
SheetJS is a JavaScript library for reading and writing data from spreadsheets.
This demo uses SheetJS to pull data from a spreadsheet for further analysis within Stata. We'll create a Stata native extension that loads the Duktape JavaScript engine and uses the SheetJS library to read data from spreadsheets and converts to a Stata-friendly format.
The demo will read a Numbers workbook and generate variables for each column. A sample Stata session is shown below:
This demo covers Stata extensions. For directly processing Stata DTA files, the "Stata DTA Codec" works in the browser or NodeJS.
This demo was last tested by SheetJS users on 2023 November 15.
Stata has limited support for processing spreadsheets through the import excel
command1. At the time of writing, it lacked support for XLSB, NUMBERS, and
other common spreadsheet formats.
SheetJS libraries help fill the gap by normalizing spreadsheets to a form that Stata can understand.
Integration Details
The current recommendation involves a native plugin that reads arbitrary files and generates clean XLSX files that Stata can import.
The extension function ultimately pairs the SheetJS read
2 and write
3
methods to read data from the old file and write a new file:
var wb = XLSX.read(original_file_data, {type: "buffer"});
var new_file_data = XLSX.write(wb, {type: "array", bookType: "xlsx"});
The extension function cleanfile
will take one or two arguments:
plugin call cleanfile, "pres.numbers"
will generate sheetjs.tmp.xlsx
from
the first argument ("pres.numbers"
) and print instructions to load the file.
plugin call cleanfile, "pres.numbers" verbose
will additionally print CSV
contents of each worksheet in the workbook.
C Extensions
Stata C extensions are shared libraries or DLLs that use special Stata methods for parsing arguments and returning values.
Arguments are passed to the stata_call
function in the DLL.
SF_display
and SF_error
display text and error messages respectively.
Duktape JS Engine
This demo uses the Duktape JavaScript engine. The SheetJS + Duktape demo covers engine integration details in more detail.
The SheetJS Standalone scripts can be loaded in Duktape by reading the source from the filesystem.
Complete Demo
This demo was tested in Windows x64 and macOS x64. The path names and build commands will differ in other platforms and operating systems.
The cleanfile.c
extension defines one plugin
function. It can be chained with import excel
:
program cleanfile, plugin
plugin call cleanfile, "pres.numbers" verbose
program drop cleanfile
import excel "sheetjs.tmp.xlsx", firstrow
Create Plugin
- Linux/MacOS
- Windows
0) Ensure a compatible C compiler (Xcode on macOS) is installed.
1) Open Stata and run the following command:
pwd
The output will be the default data directory. On macOS this is typically
~/Documents/Stata
2) Open a terminal window and create a project folder sheetjs-stata
within the
Stata data directory:
# `cd` to the Stata data directory
cd ~/Documents/Stata
mkdir sheetjs-stata
cd sheetjs-stata
0) Ensure "Windows Subsystem for Linux" (WSL) and Visual Studio are installed.
1) Open a new "x64 Native Tools Command Prompt" window and create a project
folder c:\sheetjs-stata
:
cd c:\
mkdir sheetjs-stata
cd sheetjs-stata
2) Enter WSL:
bash
3) Download stplugin.c
and
stplugin.h
from the Stata website:
curl -LO https://www.stata.com/plugins/stplugin.c
curl -LO https://www.stata.com/plugins/stplugin.h
4) Download Duktape. In Windows, the following commands should be run in WSL. In macOS, the commands should be run in the same Terminal session.
curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
mv duktape-2.7.0/src/*.{c,h} .
5) Download cleanfile.c
.
In Windows, the following commands should be run in WSL. In macOS, the commands should be run in the same Terminal session.
curl -LO https://docs.sheetjs.com/stata/cleanfile.c
- Linux/MacOS
- Windows
6) Observe that macOS does not need a "Linux Subsystem" and move to Step 7.
7) Build the plugin:
gcc -shared -fPIC -DSYSTEM=APPLEMAC stplugin.c duktape.c cleanfile.c -lm -std=c99 -Wall -ocleanfile.plugin
6) Exit WSL:
exit
The window will return to the command prompt.
7) Build the DLL:
cl /LD cleanfile.c stplugin.c duktape.c
Install Plugin
- Linux/MacOS
- Windows
8) Copy the plugin to the Stata data directory:
cp cleanfile.plugin ../
8) Copy the DLL to cleanfile.plugin
in the Stata data directory. For example,
with a shared data directory c:\data
:
mkdir c:\data
copy cleanfile.dll c:\data\cleanfile.plugin
Download SheetJS Scripts
- Linux/MacOS
- Windows
9) Move to the Stata data directory:
cd ..
10) Observe that macOS does not need a "Linux Subsystem" and move to Step 11.
9) Move to the c:\data
directory:
cd c:\data
10) Enter WSL
bash
11) Download SheetJS scripts and the test file.
In Windows, the following commands should be run in WSL. In macOS, the commands should be run in the same Terminal session.
curl -LO https://cdn.sheetjs.com/xlsx-0.20.0/package/dist/shim.min.js
curl -LO https://cdn.sheetjs.com/xlsx-0.20.0/package/dist/xlsx.full.min.js
curl -LO https://sheetjs.com/pres.numbers
Stata Test
The screenshot in the introduction shows the result of steps 13 - 19
12) If it is not currently running, start the Stata application.
- Linux/MacOS
- Windows
13) Run the following command in Stata:
dir
Inspect the output and confirm that cleanfile.plugin
is listed.
13) Move to the c:\data
directory in Stata:
cd c:\data
14) Load the cleanfile
plugin:
program cleanfile, plugin
16) Read the pres.numbers
test file:
plugin call cleanfile, "pres.numbers" verbose
The result will show the data from pres.numbers
:
. plugin call cleanfile, "pres.numbers" verbose
Worksheet 0 Name: Sheet1
Name,Index
Bill Clinton,42
GeorgeW Bush,43
Barack Obama,44
Donald Trump,45
Joseph Biden,46
Saved to `sheetjs.tmp.xlsx`
import excel "sheetjs.tmp.xlsx", firstrow will read the first sheet and use headers
for more help, see import excel
17) Close the plugin:
program drop cleanfile
18) Clear the current session:
clear
19) In the result of Step 16, click the link on import excel "sheetjs.tmp.xlsx", firstrow
Alternatively, manually type the command:
import excel "sheetjs.tmp.xlsx", firstrow
The output will show the import result:
. import excel "sheetjs.tmp.xlsx", firstrow
(2 vars, 5 obs)
20) Open the Data Editor (in Browse or Edit mode) and compare to the screenshot:
browse Name Index
- Run
help import excel
in Stata or see "import excel" in the Stata documentation.↩ - See
read
in "Reading Files"↩ - See
write
in "Writing Files"↩