TabbyXL: Software Platform for Rule-Based Spreadsheet Data Extraction and Transformation*

Abstract Spreadsheets are widely used in science, engineering, business, and other activities. Overall, they conceal a large volume of data in a form intended to be interpreted by humans. We present a novel software platform facilitated for liberating such data. It provides rule-based spreadsheet data extraction and transformation to a structured form. Its core consists of a flexible table object model and a domain-specific rule language for table analysis. They serve to represent knowledge of table layout and content features, as well as their interpretation depending on transformation goals. This enables processing arbitrary tables originating from various domains. Our empirical results demonstrate that one ruleset can be applied to process arbitrary tables having the same features of layout, style, or content. The paper also describes two applications using the software platform to develop programs for rule-based converting data from arbitrary spreadsheet tables.

User-Defined Rules The user-defined rules map the physical structure into the logical structure of a table

Structural Analysis
The actions recover pairs of two kinds: entry-label and label-label <add label> associates an entry with a label <set parent> binds two labels as a parent and its child Example

Interpretation
The actions serve to recover label-category pairs <set category> associates a label with a category <group> places two labels to one group that can be considered as an undefined category Example when label l1: cell.mark== "stub" label l2: cell.mark== "stub", cell.rt== l1.cell.rtthen group l1 with l2

Structural Analysis
Contest task: The extraction of header hierarchies from tables The more detail can be found at https://github.com/tabbydoc/tabbyxl/wiki/statistical-atlas (a and c) to their canonicalized versions (b and d) the cell cleansing (a), role analysis (b, c), structural analysis (d, e), and interpretation (f, g) when cell c: c.text.matches("NA") then set text "" to c b when cell c: (cl % 2) == 0, !blank then new entry c c when cell c: (cl % 2) == 1 then new label c d when entry e label l: cell.cr== e.cell.crthen add label l to e e when entry e label l: cell.rt== e.cell.rt,cell.cl== e.cell.cl-1 then add label l to e f when label l: cell.rt== 1 then set category "A" to l g when label l: cell.rt> 1 then set category "B" to l a This example is reproducible at https://codeocean.com/capsule/5326436 Testing dataset: A random subset of SAUS a Contestant: Senbazuru [Chen and Cafarella, 2014] F-score: Senbazuru -0.8860 vs. TabbyXL -0.8657 a http://dbgroup.eecs.umich.edu/project/sheets/datasets.html -based statistical atlas of the Irkutsk region -(b) via extracting data from government statistical reports -(a)

split | set text | set indent | set mark | new entry
| new label | add label | set parent | set category | group <EOL> {action} merge = 'merge' identifier 'with' identifier split = 'split' identifier set text = 'set text' <a Java string expr> 'to' identifier set indent = 'set indent' <a Java integer expr> 'to' identifier set mark = 'set mark' <a Java string expr> 'to' identifier new entry = 'new entry' identifier ['as' <a Java string expr>] new label = 'new label' identifier ['as' <a Java string expr>] add label = 'add label' identifier | (<a Java string expr> 'of' identifier | <a Java string expr>) 'to' identifier set parent = 'set parent' identifier 'to' identifier set category = 'set category' identifier | <a Java string expr> 'to' identifier group = 'group' identifier 'with' identifier identifier = <a Java identifier>Cell CleansingThe actions correct an inaccurate layout and content of a hand-coded table <merge> combines two adjacent cells when they share one border <split> divides a merged cell that spans n-tiles (row-column intersections) into n-cells <set text> modifies a textual content of a cell <set indent> modifies a text indentation of a cell

Table Analysis and Interpretation) Core TOM Access API Two options are provided
t 1 -a time of parsing and compiling the original ruleset into a Java program ** t 2 -a time of executing the generated Java program † t 1 -a time of parsing the original ruleset and adding the result into a rule engine session ‡ t 2 -a time of asserting facts into the working memory and matching rules against the facts * The more detail can be found at https://github.com/tabbydoc/tabbyxl/wiki/industrial-safety-inspection Table object model associating functional roles with data items Table analysis and interpretation driven by user-defined rules Formulated actions to recover missing semantics of arbitrary tables Translation of rules to executable spreadsheet transformation programs Limitations The inaccurate cell structure prevents the table analysis The very limited interpretation (without external vocabularies)