SPREADSHEETLLM: Encoding Spreadsheets for Large Language Models

Yuzhang Tian*†, Jianbo Zhao*†, Haoyu Dong‡, Junyu Xiong†, Shiyu Xia†, Mengyu Zhou, Yun Lin†, Jose Cambronero, Yeye He, Shi Han, Dongmei Zhang
Microsoft Corporation

* Equal contribution. † Work during internship at Microsoft. ‡ Corresponding author.

Abstract

Spreadsheets are characterized by their extensive two-dimensional grids, flexible layouts, and varied formatting options, which pose significant challenges for large language models (LLMs). In response, we introduce SPREADSHEETLLM, pioneering an efficient encoding method designed to unleash and optimize LLMs' powerful understanding and reasoning capability on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach was limited by LLMs' token constraints, making it impractical for most applications. To tackle this challenge, we develop SHEETCOMPRESSOR, an innovative encoding framework that compresses spreadsheets effectively for LLMs. It comprises three modules: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. It significantly improves performance on the spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT4's in-context learning setting. Moreover, a fine-tuned LLM with SHEETCOMPRESSOR has an average compression ratio of 25× yet achieves a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%. Finally, we propose Chain of Spreadsheet for downstream tasks of spreadsheet understanding and validate it in a new and demanding spreadsheet QA task. We methodically leverage the inherent layout and structure of spreadsheets, demonstrating that SPREADSHEETLLM is highly effective across a variety of spreadsheet tasks.

Figure 1: The SPREADSHEETLLM pipeline.

1 Introduction

Spreadsheets are ubiquitous for data management and extensively utilized within platforms like Microsoft Excel and Google Sheets. Understanding spreadsheet layout and structure (Dong et al., 2019b; Gol et al., 2019; Hulsebos et al., 2019; Dou et al., 2018; Wang et al., 2021; Deng et al., 2022; Chen and Cafarella, 2014), a longstanding challenge for traditional models, is crucial for effective data analysis and intelligent user interaction. Recently, the rapid development of Large Language Models (LLMs) has opened new frontiers in table processing (Li et al., 2023b) and reasoning (Cheng et al., 2022). However, spreadsheets pose unique challenges for LLMs due to their expansive grids, which usually exceed the token limitations of popular LLMs, as well as their inherent two-dimensional layouts and structures, which are poorly suited to linear and sequential input. Furthermore, LLMs often struggle with spreadsheet-specific features such as cell addresses and formats, complicating their ability to effectively parse and utilize spreadsheet data, as detailed in Appendix A.

In this paper, we introduce SPREADSHEETLLM, a pioneering framework to unleash and maximize the potential of LLMs for spreadsheet understanding and reasoning. We initially propose a vanilla encoding method to serialize spreadsheets into sequences, augmenting the Markdown encoding method by including essential cell addresses and (optional) formats. Furthermore, large spreadsheets that exceed the token limits of LLMs not only limit
their processing but also, as observed in prior studies, degrade accuracy as the size increases (Liu et al., 2024). To address this challenge, we propose SHEETCOMPRESSOR, featuring a novel encoding framework comprising three portable modules:

1) Structural Anchors for Efficient Layout Understanding: Observations indicate that large spreadsheets often contain numerous homogeneous rows or columns, which contribute minimally to understanding the layout and structure (see the left panel in Figure 2 (a)). To address this, we identify structural anchors, that is, heterogeneous rows and columns at possible table boundaries that offer substantial layout insights, as depicted in Figure 2 (b). Then we remove distant homogeneous rows and columns, producing a condensed "skeleton" spreadsheet, as illustrated in Figure 2 (c).

2) Inverted-Index Translation for Token Efficiency: The vanilla encoding method becomes token-consuming when handling spreadsheets with numerous empty cells and repetitive values, as shown in Figure 2 (c). To improve efficiency, we depart from traditional row-by-row and column-by-column serialization and employ a lossless inverted-index translation in JSON format. This method creates a dictionary that indexes non-empty cell texts and merges addresses with identical text, optimizing token usage while preserving data integrity.

3) Data Format Aggregation for Numerical Cells: Adjacent numerical cells often share similar number formats. Recognizing that exact numerical values are less crucial for grasping spreadsheet structure, we extract number format strings and data types from these cells. Then adjacent cells with the same formats or types are clustered together. This method is visualized in the right example of Figure 2, where rectangular regions are represented by uniform format strings and data types, streamlining the understanding of numerical data distribution without excessive token expenditure.

We conducted a comprehensive evaluation of our method on a variety of LLMs. Our experiments show that SHEETCOMPRESSOR significantly reduces token usage for spreadsheet encoding by 96%. Moreover, SPREADSHEETLLM has shown exceptional performance in spreadsheet table detection, the foundational task of spreadsheet understanding, surpassing the previous SOTA method by 12.3% (Dong et al., 2019b). We also applied SPREADSHEETLLM to a representative spreadsheet QA task. Inspired by the Chain of Thought (CoT) methodology (Zheng et al., 2023; Jiang et al., 2023b), we propose Chain of Spreadsheet (CoS) to decompose spreadsheet reasoning into a table detection-match-reasoning pipeline. It significantly outperformed existing SOTA methods for table QA (Herzig et al., 2020; Cheng et al., 2022). Our primary contributions are summarized as follows:

• We propose SPREADSHEETLLM, the first work that substantially leverages LLMs for understanding and analyzing spreadsheet data. To address challenges in scale, diversity, and complexity of spreadsheets, we propose SHEETCOMPRESSOR, an innovative encoding framework to compress spreadsheets for LLMs with efficient encoding.

• We fine-tune a variety of cutting-edge LLMs to achieve optimal performance on spreadsheet table detection, and demonstrate the high effectiveness of SPREADSHEETLLM in accurately understanding complex spreadsheet layouts and structures.

• In order to extend the horizontal capabilities of SPREADSHEETLLM to a wide range of downstream tasks, we propose CoS and verify it on Spreadsheet QA, highlighting its potential for intelligent user interaction.

2 Related Work

Spreadsheet Representation. Spreadsheet representation involves converting spreadsheets into specific representations for different models. There are various methods for spreadsheet (table) representation. Dong et al. (2019a, b) enhance Mask-RCNN to leverage spatial and visual information in spreadsheets, and Deng et al. (2024) explore the usage of LLMs to evaluate image tables, but this does not work well for spreadsheet images as input to VLMs (Xia et al., 2024). To capture sequential semantics in rows and columns, LSTMs are further adopted (Nishida et al., 2017; Gol et al., 2019) in row and column directions. Pre-trained LMs (Dong et al., 2022) are then proposed to understand spreadsheets (Wang et al., 2021). Recent studies (Zhang et al., 2023; Li et al., 2023b; Sui et al., 2023) have explored the efficacy of using Markdown and HTML for table representation. However, they are not well suited to spreadsheets due to their single-table input, as experiments show in Appendix B.

Spreadsheet Understanding. While most table LLMs are restricted to single-table settings, spreadsheets with multiple tables typically exceed token
limits. Moreover, the diversity in multi-table layout and structure significantly confounds the problem. Spreadsheet table detection (Dong et al., 2019b; Christodoulakis et al., 2020; Doush and Pontelli, 2010; Vitagliano et al., 2022) aims at identifying all tables on a given sheet and determining their respective ranges. As a fundamental task for spreadsheet understanding, it triggers hundreds of millions of daily average usages in commercial spreadsheet tools (Zhang et al., 2024), and its accuracy still needs improvement due to the flexibility and complexity of spreadsheets.

Figure 2: Illustration of the SHEETCOMPRESSOR framework. The original spreadsheet contains two tables featuring numerous data entries or hierarchical headers, which can be viewed in the supplementary materials. The complete spreadsheet consists of 576 rows and 23 columns, with a vanilla encoding of 61,240 tokens. We first extract cells using structural anchors, rearranging them into a smaller 24×8 sheet. Subsequently, we perform index inversion, removing empty cells. Finally, we aggregate cells based on data formats, achieving an extremely compact representation of the spreadsheet, which contains only 708 tokens.

Spreadsheet Downstream Tasks. Spreadsheet understanding enables a series of spreadsheet tasks such as table question answering/analysis (He et al., 2024; Cheng et al., 2021b, 2022; Jiang et al., 2022; Herzig et al., 2020), table extraction (Chen and Cafarella, 2013, 2014; Li et al., 2024), formula or code generation (Chen et al., 2021; Cheng et al., 2021a; Joshi et al., 2024; Chen et al., 2024; Li et al., 2023a), error detection (Wang and He, 2019; Dou et al., 2016), etc. In this paper, we choose spreadsheet QA, one of the most demanded spreadsheet analysis tasks. It is an extension of the Table QA task to spreadsheet data, with the additional complexity of detecting and matching multiple tables within a spreadsheet.

LLMs' Token Efficiency. Related work suggests that the performance of LLMs degrades significantly with long contexts (Liu et al., 2024; Xu et al., 2023). Efforts to improve model performance and reduce costs have led to the development of compression techniques for long prompts. Some researchers employ information-theory metrics to filter out redundant information (Li, 2023; Jiang et al., 2023a). Additionally, specialized models have been proposed to optimize prompt compression (Pan et al., 2024). However, these strategies primarily address natural language prompts and may not suit tabular data, potentially leading to considerable loss of structure and data information. DBCopilot (Wang et al., 2023) enables text-to-SQL conversion on large databases through schema routing. However, due to LLMs' insufficient ability in understanding inherent multi-table layouts and complex table structures that cannot execute queries similar to SQL, schema routing is impractical, restricting the broader application of cutting-edge tabular works (Cheng et al., 2022; Li et al., 2023b; Sui et al., 2024) on spreadsheet data.

3 Method

We propose a novel spreadsheet encoding framework in a Markdown-like style as text. To achieve a more compact and efficient representation, we introduce three independent yet combinable modules: structural-anchor-based extraction, inverted-index translation, and data-format-aware aggregation, which enable efficient data compression and enhance performance on downstream tasks.

3.1 Vanilla Spreadsheet Encoding with Cell Value, Address, and Format

Due to the absence of standardized practices in spreadsheet encoding for LLMs, we first explore traditional spreadsheet encoding methods. Appendix B presents a comparison of different mainstream tabular data encoding methods, including HTML, XML, and Markdown. Based on the encoding length and performance on spreadsheet understanding tasks, we use a Markdown-like style representation:

$$S = \{\mathrm{Cell}_{i,j}\}_{i \in m,\, j \in n} \tag{1}$$

$$T = \mathrm{markdown}\{\mathrm{encode}(\mathrm{Cell}_{i,j})\} := |\mathrm{Address}_{i,j}, \mathrm{Value}_{i,j}, \mathrm{Format}_{i,j}| \ldots \backslash n \tag{2}$$

where $S \in \mathbb{R}^{m,n}$ denotes the spreadsheet, $T \in \mathbb{R}^{1}$ denotes the text representation of a cell, and $i, j$ and $m, n$ respectively represent the row and column index of the cell and the row and column range of $S$. We also explored the inclusion of cell format information (such as background color, bold font, borders, etc.) in each cell's representation. However, these experiments demonstrated that such detailed encoding adversely affects model performance, due to rapid token limit exceedance and LLMs' inadequate capability to process format information effectively, as detailed in Appendix A. We plan to further explore this in future research, focusing on enhancing the model's ability to understand and utilize format and structural cues.

3.2 Structural-anchor-based Extraction

Large spreadsheets often feature numerous homogeneous rows or columns, which minimally contribute to the understanding of their layout and structure, as depicted in Figure 2 (a). To effectively compress spreadsheets while preserving essential layout and structural information, we propose a novel heuristic-based method, detailed further in Appendix C. This method identifies heterogeneous rows and columns at the edges of table boundaries, termed structural anchors:

$$\mathcal{A} = \{r_p, c_q\}_{p \in m,\, q \in n} \tag{3}$$

where $r_p = \{\mathrm{Cell}_{i,j}\}_{i = p,\, j \in n}$ and $c_q = \{\mathrm{Cell}_{i,j}\}_{i \in m,\, j = q}$. Using these anchor points, our method discards rows and columns that are located more than $k$ units away from any anchor point, because they rarely serve as table boundaries. The parameter $k$ serves as a threshold to control the scope of neighborhood retention, effectively eliminating areas predominantly filled with homogeneous data that do not contribute to an understanding of the spreadsheet's layout and structure. We explored the effects of different $k$ values in an ablation study, as detailed in Appendix D.1.

The extracted rows and columns can be expressed as:

$$\mathcal{A}_{+} = \{r_{p^{+}}, c_{q^{+}}\}_{p^{+} \in m,\, q^{+} \in n} \tag{4}$$

where the extracted "skeletons" are defined as $r_{p^{+}} = \{\mathrm{Cell}_{i,j}\}_{|i - p| \le k,\, j \in n}$ and $c_{q^{+}} = \{\mathrm{Cell}_{i,j}\}_{i \in m,\, |j - q| \le k}$, that is, the rows and columns within $k$ units of an anchor $r_p$ or $c_q$. We then obtain the extracted compact spreadsheet:

$$S_e = \mathrm{extract}(S) = \mathrm{address\_map}(r_{p^{+}} \cap c_{q^{+}}). \tag{5}$$

Based on the compressed spreadsheet $S_e$, we can obtain a much shorter text representation $T_e$. Furthermore, after extraction, we perform a coordinate re-mapping to ensure continuity in cell coordinates, preserving the integrity of data relationships within the compressed spreadsheet. This re-mapping is crucial for maintaining the accuracy of prediction results, ensuring that analyses remain consistent even after compression. This method filters out 75% of spreadsheet content but preserves 97% of the rows and columns at the edges of table boundaries.

3.3 Inverted-index Translation

Spreadsheets often contain numerous empty rows, columns, and scattered cells. The standard encoding method, as detailed in Section 3.1, employs a grid-based method that pairs cell addresses with their contents. This approach necessitates recording empty cells to maintain the spreadsheet's two-dimensional structure, which significantly increases token consumption. Furthermore, cells with identical values are encoded repeatedly, further exacerbating token usage.

To address these inefficiencies, we propose a two-stage Inverted-index-based Translation method. The first stage involves converting the traditional matrix-style encoding into a dictionary format, where cell values serve as keys indexing the addresses. In the second stage, cells sharing the same value are merged, with empty cells excluded
and cell addresses noted as ranges. This method effectively reduces the number of required tokens by eliminating redundancies and simplifying the representation of repeated and empty cells. The translation process is represented mathematically as follows:

$$T_t = \mathrm{invert}(T) := \{\mathrm{Value} : \mathrm{Address}\ \text{or}\ \mathrm{Address\_Region}, \ldots\}. \tag{6}$$

Inverted-index Translation is a lossless compression method general for all spreadsheet understanding tasks, and it remarkably increases SHEETCOMPRESSOR's compression ratio from 4.41 to 14.91. More details can be found in Table 1.

3.4 Data-format-aware Aggregation

In spreadsheets, adjacent cells typically share the same data format. As shown in Figure 2 (3), column C records the sell-in billed revenue for different products. Nonetheless, the concrete numerical values are not essential for understanding the structure and semantics of the spreadsheet (although there might be a loss of fine-grained details of exact quantities, e.g., "18,476" and "18,674", this does not impact our comprehension that this column represents revenue). In contrast, the data type is critical for understanding spreadsheets. On one hand, data types represent fundamental semantic properties, such as "time" or "phone number". This motivates us to implement rules to match the value of a cell to different data types. On the other hand, in contrast to detailed numerical values, identical data types may be compressed through clustering, thereby reducing the number of tokens.

In this section, we introduce Data-format-aware Aggregation for further compression and information integration. Specifically, we employ the Number Format String (NFS), a built-in cell attribute in spreadsheets that describes the format of cell data as a string. NFSs can be extracted by default using tools like ClosedXML or OpenPyXL. For example, the NFS "yyyy-mm-dd" indicates a specific date format. However, spreadsheet users do not always explicitly add NFSs to cells, so NFSs are sometimes absent. As a complement, we propose a rule-based recognizer to map a cell value to a specific predefined data type: Year, Integer, Float, Percentage, Scientific notation, Date, Time, Currency, Email, and Others. The first nine types broadly cover approximately 55% of the cells in our dataset derived from real-world corpora.

Finally, based on the NFSs and data types, the aggregator aggregates the cells by Algorithm 1. This process can be represented as follows:

$$\mathrm{NFSs} = \mathrm{nfs}(\{\mathrm{Cell}_{i,j}\}_{i \in m,\, j \in n}) \tag{7}$$

$$T_a = \mathrm{aggregator}(\{\mathrm{Cell}_{i,j}\}_{i \in m,\, j \in n}, \mathrm{NFSs}, R) \tag{8}$$

where $R$ denotes the predefined rules as detailed above. In this way, we further reduce the number of tokens. The compression ratio of the data regions also increases from 14.91 to 24.79. More detailed compression effects of the different modules are displayed in Table 1.

3.5 Chain of Spreadsheet

To extend the applicability of SPREADSHEETLLM to a broader range of downstream tasks, we introduce the Chain of Spreadsheet (CoS), which unfolds in two stages:

Table Identification and Boundary Detection. Initially, the compressed spreadsheet and the specific task query are input into the LLM. Leveraging the advances in spreadsheet table detection, the model identifies the table that is relevant to the query and determines the precise boundaries of the relevant content. This step ensures that only pertinent data is considered in the subsequent analysis, optimizing processing efficiency and focus.

Response Generation. The query and the identified table section are re-input into the LLM. The model then processes this information to generate an accurate response to the query.

Through the CoS, SPREADSHEETLLM effectively handles complex spreadsheets by breaking down the process into manageable parts, thus enabling precise and context-aware responses. In this paper, we validate its effect on the Spreadsheet QA task, which is detailed in Section 4.2.

4 Experiments

In our experimental evaluation, we first verified the effectiveness of our method in spreadsheet understanding. For this purpose, we chose the classic and foundational task of spreadsheet table detection (Dong et al., 2019b). This task serves as a critical benchmark for assessing the framework's ability to comprehend the underlying structures within spreadsheets. Building upon this foundational understanding, we further explored
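As an illustration of the vanilla encoding of Section 3.1 and the inverted-index translation of Section 3.3, the following is a minimal Python sketch, not the authors' implementation. All names (`col_letter`, `vanilla_encode`, `invert_index`, `merge_row_runs`) are hypothetical, the optional format field is omitted, and range merging is simplified to horizontally adjacent cells within one row, whereas the paper allows general address regions.

```python
from collections import defaultdict


def col_letter(idx: int) -> str:
    """Convert a 0-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    letters = ""
    idx += 1
    while idx > 0:
        idx, rem = divmod(idx - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def vanilla_encode(grid):
    """Markdown-like encoding: every cell, empty or not, becomes |Address,Value|."""
    lines = []
    for r, row in enumerate(grid):
        cells = [f"{col_letter(c)}{r + 1},{value}" for c, value in enumerate(row)]
        lines.append("|" + "|".join(cells) + "|")
    return "\n".join(lines)


def _fmt(start, end):
    """Format one cell or a start:end range from (row, col) pairs."""
    a = f"{col_letter(start[1])}{start[0] + 1}"
    b = f"{col_letter(end[1])}{end[0] + 1}"
    return a if start == end else f"{a}:{b}"


def merge_row_runs(cells):
    """Merge horizontally adjacent cells (given in row-major order) into ranges."""
    ranges = []
    start = prev = cells[0]
    for cell in cells[1:]:
        if cell[0] == prev[0] and cell[1] == prev[1] + 1:
            prev = cell  # extend the current run
            continue
        ranges.append(_fmt(start, prev))
        start = prev = cell
    ranges.append(_fmt(start, prev))
    return ranges


def invert_index(grid):
    """Stage 1: value -> addresses. Stage 2: drop empty cells, merge addresses."""
    index = defaultdict(list)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value != "":  # empty cells are excluded from the index
                index[str(value)].append((r, c))
    return {value: merge_row_runs(cells) for value, cells in index.items()}


# Toy sheet (assumed data, for illustration only).
grid = [
    ["Region", "Q1", "Q2", ""],
    ["North", "10", "10", "10"],
    ["South", "", "7", ""],
]
vanilla_text = vanilla_encode(grid)
inverted = invert_index(grid)
```

On this toy sheet the three repeated "10" cells collapse into the single entry `"10": ["B2:D2"]`, and the empty cells that the vanilla encoding must spell out (for example `D1,`) disappear. The resulting dictionary can be serialized with `json.dumps` to obtain a JSON form of the kind the paper describes.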

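The rule-based recognizer and the aggregation step of Section 3.4 can be sketched in the same spirit. The regular expressions below are assumptions chosen for illustration, not the paper's predefined rules or its Algorithm 1, and the hypothetical `aggregate_column` clusters only vertically adjacent cells of a single column, whereas the paper aggregates rectangular regions.

```python
import re

# Hypothetical rules, tried in order; the first full match wins.
RULES = [
    ("Year", re.compile(r"(19|20)\d{2}")),
    ("Integer", re.compile(r"[+-]?(\d{1,3}(,\d{3})+|\d+)")),
    ("Float", re.compile(r"[+-]?\d+\.\d+")),
    ("Percentage", re.compile(r"[+-]?\d+(\.\d+)?%")),
    ("Scientific notation", re.compile(r"[+-]?\d+(\.\d+)?[eE][+-]?\d+")),
    ("Date", re.compile(r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}")),
    ("Time", re.compile(r"\d{1,2}:\d{2}(:\d{2})?")),
    ("Currency", re.compile(r"[$€£¥]\s?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?")),
    ("Email", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
]


def recognize_type(value: str) -> str:
    """Map a cell's text to one of the predefined data types, else 'Others'."""
    text = value.strip()
    for name, pattern in RULES:
        if pattern.fullmatch(text):
            return name
    return "Others"


def aggregate_column(values, col="C", first_row=1):
    """Cluster vertically adjacent cells of one column that share a data type.

    Returns a list of (data_type, address_or_range) pairs.
    """
    types = [recognize_type(v) for v in values]
    regions = []
    start = 0
    for i in range(1, len(types) + 1):
        # Close the current run at the end of the column or on a type change.
        if i == len(types) or types[i] != types[start]:
            a = f"{col}{first_row + start}"
            b = f"{col}{first_row + i - 1}"
            regions.append((types[start], a if a == b else f"{a}:{b}"))
            start = i
    return regions


# Toy column (assumed data, for illustration only).
column = ["Revenue", "18,476", "18,674", "3.5%", "2024-02-14"]
regions = aggregate_column(column, col="C", first_row=1)
```

Here the two revenue figures are replaced by one `("Integer", "C2:C3")` entry: the exact quantities are lost, but the information that the region holds integers is kept, which is the trade-off the section describes. Rule order matters in this sketch; for example, "2024" is classified as Year because that rule is tried before Integer.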