Tutorial: Table Extraction
Extract structured table data from PDFs using Aryn DocParse
In this example, we’ll use DocParse to extract a cash flow table (shown below) from 3M’s 10-K financial filing and turn it into a pandas DataFrame.
Extracting Table Cells from DocParse
If you inspect the partitioned_file variable, you’ll notice that it’s a large JSON object with details about all the components in the PDF (check out this page to understand the schema of the returned JSON object in detail). Below, we highlight the table element, which contains the information about the table on the page.
In particular, let’s look at the cells field, which is an array of cell objects, one for each cell in the table. Let’s focus on the first element of that list.
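As a rough sketch of how the table element and its first cell can be reached, the snippet below walks a small, hand-written stand-in for the response. The field names used here (elements, type, table, cells, content, rows, cols, is_header, bbox) and the bounding box numbers are assumptions for illustration only; check them against the schema page before relying on them.

```python
# Hypothetical, trimmed-down stand-in for the DocParse response.
# All field names and the bbox numbers are assumptions, not real output.
partitioned_file = {
    "elements": [
        {"type": "Text", "text_representation": "Cash flow summary"},
        {
            "type": "table",
            "table": {
                "cells": [
                    {
                        "content": "Years ended December 31 (Millions)",
                        "rows": [0],
                        "cols": [0],
                        "is_header": True,
                        "bbox": {"x1": 0.1, "y1": 0.2, "x2": 0.5, "y2": 0.23},
                    },
                    {
                        "content": "2018",
                        "rows": [0],
                        "cols": [1],
                        "is_header": True,
                        "bbox": {"x1": 0.5, "y1": 0.2, "x2": 0.6, "y2": 0.23},
                    },
                ]
            },
        },
    ]
}


def find_tables(response):
    """Return every element whose type is 'table' (case-insensitive)."""
    return [
        element
        for element in response.get("elements", [])
        if str(element.get("type", "")).lower() == "table"
    ]


tables = find_tables(partitioned_file)
first_cell = tables[0]["table"]["cells"][0]

# Show the pieces discussed in the text: contents, header flag, bounding box.
print(first_cell["content"])
print(first_cell["is_header"])
print(first_cell["bbox"])
```

Matching on the element type rather than on a fixed list position keeps the lookup working when the table is not the first element on the page.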
Displaying the Table
Here DocParse has detected the first cell along with its bounding box (the cell’s coordinates in the PDF), whether it is a header cell, and its contents. You can process this JSON however you like for further analysis. In the notebook, we use the tables_to_pandas function to turn the JSON into a pandas DataFrame and then perform some analysis on it.
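To give a feel for what such a conversion involves, here is a minimal hand-rolled sketch that arranges cell objects into a grid of rows. It is not the implementation of tables_to_pandas; the helper name cells_to_grid is hypothetical, and the rows and cols fields (lists of zero-based indices per cell) are an assumed shape.

```python
def cells_to_grid(cells):
    """Place each cell's content at its (row, col) positions in a 2D list."""
    n_rows = max(r for cell in cells for r in cell["rows"]) + 1
    n_cols = max(c for cell in cells for c in cell["cols"]) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        # A spanning cell lists several rows/cols; repeat its content in each.
        for r in cell["rows"]:
            for c in cell["cols"]:
                grid[r][c] = cell["content"]
    return grid


# Hypothetical cells covering the top-left corner of the cash flow table.
cells = [
    {"content": "Years ended December 31 (Millions)", "rows": [0], "cols": [0]},
    {"content": "2018", "rows": [0], "cols": [1]},
    {"content": "Net cash provided by operating activities", "rows": [1], "cols": [0]},
    {"content": "$ 6,439", "rows": [1], "cols": [1]},
]

grid = cells_to_grid(cells)
header, body = grid[0], grid[1:]
for row in grid:
    print(row)
```

If pandas is available, a grid like this can be handed to it with pd.DataFrame(body, columns=header).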
The resulting DataFrame is given below:
| | Years ended December 31 (Millions) | 2018 | 2017 | 2016 |
|---|---|---|---|---|
| 0 | Major GAAP Cash Flow Categories | | | |
| 1 | Net cash provided by operating activities | $ 6,439 | 6,240 | $ 6,662 |
| 2 | Net cash provided by (used in) investing activities | 222 | (3,086) | (1,403) |
| 3 | Net cash used in financing activities. | (6,701) | (2,655) | (4,626) |
| 4 | Free Cash Flow (non-GAAP measure) | | | |
| 5 | Net cash provided by operating activities | $ 6,439 | $ 6,240 | 6,662 |
| 6 | Purchases of property, plant and equipment (PP&E | (1,577) | (1,373) | (1,420) |
| 7 | Free cash flow | $ 4,862 | $ 4,867 | $ 5,242 |
| 8 | Net income attributable to 3M | $ 5,349 | $ 4,858 | $ 5,050 |
| 9 | Free cash flow conversion | 91 % | 100 % | 104 % |
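The extracted values are still strings in accounting notation, so any numeric analysis needs a parsing step first. The sketch below is one possible approach, not the notebook's code: parse_amount is a hypothetical helper that handles whole-number amounts only and treats parentheses as negative. It then recomputes rows 7 and 9 from rows 5, 6, and 8 as a sanity check on the extraction.

```python
import re


def parse_amount(text):
    """Convert strings such as '$ 6,439' or '(3,086)' to an int.

    Parentheses mean a negative number (accounting convention).
    Only whole numbers are handled; empty cells return None.
    """
    cleaned = text.strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    digits = re.sub(r"\D", "", cleaned)  # drop '$', commas, spaces, parens
    if not digits:
        return None
    value = int(digits)
    return -value if negative else value


# Values copied from rows 5, 6, and 8 of the table (2018, 2017, 2016).
operating = [parse_amount(s) for s in ("$ 6,439", "$ 6,240", "6,662")]
ppe = [parse_amount(s) for s in ("(1,577)", "(1,373)", "(1,420)")]
net_income = [parse_amount(s) for s in ("$ 5,349", "$ 4,858", "$ 5,050")]

# Free cash flow = operating cash flow plus (negative) PP&E purchases.
free_cash_flow = [o + p for o, p in zip(operating, ppe)]

# Free cash flow conversion = free cash flow / net income, as a percentage.
conversion = [round(100 * f / n) for f, n in zip(free_cash_flow, net_income)]

print(free_cash_flow)  # [4862, 4867, 5242], matching row 7
print(conversion)      # [91, 100, 104], matching row 9
```

Because the recomputed figures agree with the rows in the table, a check like this is a cheap way to catch misread digits or misplaced cells.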