Is it possible to extract table information using Apache Tika?

I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I came across Apache Tika. I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I am expecting 2 columns in a key-value format. I checked most of what is available on the net for a solution but could not find any. Any pointers for this?

Well, I went ahead and implemented it separately using Apache POI for the MS formats, and came back to Tika for PDF. What Tika does with the documents is output them as "SAX based XHTML events".

So basically we can write a custom SAX implementation to parse the file.

The structured text output will be of the following form (metadata details omitted):

<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>

In our SAX implementation we can treat the first part of each paragraph as the key (for my problem I already know the keys and am only looking for the values, so the value is simply the rest of the string).

Override public void characters(char[] ch, int start, int length) with this logic.

Please note that in my case the structure of the content is fixed and I know the keys that are coming in, so it was easy to do it this way. This is not a generic solution.
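
The approach described above could be sketched roughly as follows. This is only an illustration under assumptions, not the code actually used: the class names KeyValueHandler and KeyValueDemo and the extract helper are hypothetical, and the sample is fed through the JDK's own SAX parser so that it runs without Tika on the classpath. The text of each <p> is buffered, because characters() may be called more than once per element, and the known key is matched when the paragraph closes.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects "Key Value" paragraphs for a fixed set of known keys. */
class KeyValueHandler extends DefaultHandler {
    private final List<String> knownKeys;
    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder paragraph = new StringBuilder();
    private boolean insideParagraph;

    KeyValueHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    Map<String, String> getValues() {
        return values;
    }

    // localName is empty when the parser is not namespace-aware, so fall back to qName.
    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName.isEmpty() ? qName : localName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            insideParagraph = true;
            paragraph.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one <p>, so only buffer here.
        if (insideParagraph) {
            paragraph.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!isParagraph(localName, qName)) {
            return;
        }
        insideParagraph = false;
        String line = paragraph.toString().trim();
        for (String key : knownKeys) {
            boolean keyOnly = line.equals(key);
            boolean keyThenSpace = line.startsWith(key)
                    && line.length() > key.length()
                    && Character.isWhitespace(line.charAt(key.length()));
            if (keyOnly || keyThenSpace) {
                // The value is whatever follows the known key.
                values.put(key, line.substring(key.length()).trim());
                break;
            }
        }
    }
}

class KeyValueDemo {
    /** Parses an XHTML string with the JDK SAX parser and returns the values found. */
    static Map<String, String> extract(String xhtml, List<String> knownKeys) {
        KeyValueHandler handler = new KeyValueHandler(knownKeys);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler.getValues();
    }

    public static void main(String[] args) {
        String xhtml = "<body><div class=\"page\"><p/>\n<p>Key1 Value1 </p>\n"
                + "<p>Key2 Value2 </p>\n<p>Key3 Value3</p>\n<p/>\n</div>\n</body>";
        System.out.println(extract(xhtml, Arrays.asList("Key1", "Key2", "Key3")));
    }
}
```

With Tika, the same handler instance would be passed as the ContentHandler argument of the parser's parse(stream, handler, metadata, context) call instead of going through the JDK parser.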

Tika doesn't parse table information. In fact, the confusing part is that it converts table tags to <p>, which means we lose the structure. This is the case up to the current version, 1.14. That may be remedied in the future, but so far there are no plans to work in that direction.

You can refer to the JIRA issue that discusses this shortcoming in Tika. After the JIRA issue was raised, the wiki was also updated to reflect this inadequacy. [Disclaimer: I raised the JIRA issue.]

Now the solution part: in my experience, Aspose.Pdf for Java does a brilliant job of converting PDF into HTML, but it is a licensed product. You can check the quality via the free trial version. Code and example links.

I use a combination of Tika (tika-app-1.19.jar) and Aspose (aspose-pdf-18.9.1.jar).

I first modify the PDF using Aspose so that it has pipes ('|') at the end of the table columns, and then read it into Tika and convert it to text.

InputStream is = part.getInputStream(); // input-stream of PDF or PDF part

// Aspose add pipes ("|")
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Document pdfDocument   = new Document(is);   // load existing PDF file

PageCollection pageCollection = pdfDocument.getPages();
int iNumPages = pageCollection.size();

for(int i = 1; i <= iNumPages; i++)
{
    Page page = pageCollection.get_Item(i);
    TableAbsorber absorber = new TableAbsorber(); // Create a TableAbsorber object to find tables
    absorber.visit(page); // Visit the current page with the absorber

    IGenericList<AbsorbedTable> listTables = absorber.getTableList();

    for(AbsorbedTable absorbedTable : listTables)
    {
        IGenericList<AbsorbedRow> listRows = absorbedTable.getRowList();

        for(AbsorbedRow absorbedRow : listRows)
        {
            IGenericList<AbsorbedCell> listCells = absorbedRow.getCellList();

            for(AbsorbedCell absorbedCell : listCells)
            {
                TextFragmentCollection  collectionTextFrag = absorbedCell.getTextFragments();

                Rectangle rectangle = absorbedCell.getRectangle();

                // Add a pipe ("|") at the upper-right corner of the cell to mark where the cell ends
                TextBuilder  textBuilder  = new TextBuilder(page);
                TextFragment textFragment = new TextFragment("|");
                double x = rectangle.getURX();
                double y = rectangle.getURY();
                textFragment.setPosition(new Position(x, y));
                textBuilder.appendText(textFragment);
            }
        }
    }
}
pdfDocument.save(outputStream);
is = new ByteArrayInputStream(outputStream.toByteArray()); // input stream of the modified PDF with pipes ("|") included

Now the modified PDF input stream, with pipes ("|") at the table cell ends, can be pulled into Tika and converted to text:

BodyContentHandler handler   = new BodyContentHandler();
Metadata           metadata  = new Metadata();
ParseContext       context   = new ParseContext();
PDFParser          pdfParser = new PDFParser();

PDFParserConfig config = pdfParser.getPDFParserConfig();
config.setSortByPosition(true); // needed for text in correct order
pdfParser.setPDFParserConfig(config);

//InputStream stream = new ByteArrayInputStream(sIS.getBytes(StandardCharsets.UTF_8));
pdfParser.parse(is, handler, metadata, context);
String sPdfData = handler.toString();
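
Once the text is in sPdfData, the pipes can be used to cut each line back into cells. The following is a rough sketch only: the class PipeRowSplitter and its splitRows method are hypothetical names, and it assumes that each table row ends up on a single text line, which depends on the PDF and is not guaranteed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits pipe-terminated text lines into table cells. */
class PipeRowSplitter {

    static List<List<String>> splitRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        // \R matches any line break sequence (Java 8 and later).
        for (String line : text.split("\\R")) {
            if (!line.contains("|")) {
                continue; // not part of a table, by the assumption above
            }
            List<String> cells = new ArrayList<>();
            // String.split drops trailing empty strings, so the pipe that
            // closes the last cell does not produce an extra empty cell.
            for (String cell : line.split("\\|")) {
                cells.add(cell.trim());
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample = "Heading\nKey1 | Value1 |\nKey2 | Value2 |\n";
        System.out.println(splitRows(sample));
    }
}
```

Empty cells in the middle of a row are kept as empty strings so that column positions stay aligned.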

I found a very helpful blog article that parses tables using a ContentHandlerDecorator (with Groovy, but similar enough): https://opensource.com/article/17/8/tika-groovy

I adapted it to parse all <td> parts into a tab-separated line and to collect the rows in a List by following the <tr> tags, because I needed the table rows to stay intact but needed no special logic inside the table cells.

You can pass your decorator to the BodyContentHandler, which wraps it as a delegate, like so:

new AutoDetectParser().parse(inputStream,
    new BodyContentHandler(new MyContentHandlerDecorator()),
    new Metadata());
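
A handler along those lines might look like the sketch below. It is not the code from the blog article or from this answer: TableRowCollector and TableRowDemo are hypothetical names, and the class extends the plain SAX DefaultHandler so that the example runs without Tika. Tika's ContentHandlerDecorator is itself a DefaultHandler subclass, so the same three overrides should carry over if you extend it instead. Nested tables and <th> cells are not handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: turns each <tr> into one tab-separated line of its <td> texts. */
class TableRowCollector extends DefaultHandler {
    private final List<String> rows = new ArrayList<>();
    private final List<String> cells = new ArrayList<>();
    private final StringBuilder cell = new StringBuilder();
    private boolean insideCell;

    List<String> getRows() {
        return rows;
    }

    // localName is empty when the parser is not namespace-aware, so fall back to qName.
    private static String name(String localName, String qName) {
        return localName.isEmpty() ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String element = name(localName, qName);
        if ("tr".equals(element)) {
            cells.clear();
        } else if ("td".equals(element)) {
            insideCell = true;
            cell.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insideCell) {
            cell.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String element = name(localName, qName);
        if ("td".equals(element)) {
            insideCell = false;
            cells.add(cell.toString().trim());
        } else if ("tr".equals(element)) {
            rows.add(String.join("\t", cells));
        }
    }
}

class TableRowDemo {
    /** Parses an XHTML string with the JDK SAX parser and returns the collected rows. */
    static List<String> extractRows(String xhtml) {
        TableRowCollector collector = new TableRowCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return collector.getRows();
    }

    public static void main(String[] args) {
        String xhtml = "<table><tr><td>Key1</td><td>Value1</td></tr>"
                + "<tr><td>Key2</td><td> Value2 </td></tr></table>";
        extractRows(xhtml).forEach(System.out::println);
    }
}
```

After parsing with Tika, the rows would be read from the collector instance, so it needs to be kept in a variable rather than created inline in the parse call.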

Comments
  • Hey Rajesh, after a year I am facing the same problem as you :) I would like to know if there is any generic solution to this problem. In my case the PDF files can contain any type of table structure, and I have to make sure that the tables are extracted properly and, if possible, annotate the table captions. Is it possible to do this using Tika? Or is there any other API which can do this?
  • @Shekhar I didn't find any generic solution. Basically you should be able to do this for the MS formats, but I doubt it is possible for PDF (refer: stackoverflow.com/a/9803283/869488). The same thread has a Python solution which might work if you know the table caption. I have not tried it myself.
  • Tabula (tabula.technology) is a free, MIT-licensed option for extracting tables from PDFs. If you'd like us to integrate that with Tika, please open an issue on our JIRA.