Python PDF read straight across as how it looks in the PDF

If I use the code in the answer here: Extracting text from a PDF file using PDFMiner in python?

I can get the text to extract when applying to this pdf: https://www.tencent.com/en-us/articles/15000691526464720.pdf

However, under "CONSOLIDATED INCOME STATEMENT" the text is read down the columns: first the labels (Revenues, VAS, Online advertising, ...) and only later the numbers. I want it to read across each row instead, i.e.:

Revenues 73,528 49,552 73,528 66,392 VAS 46,877 35,108 etc. Is there a way to do this?

I am also open to solutions other than pdfminer.

If I try the following PyPDF2 code, not all of the text even shows up:

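# importing required modules
import PyPDF2

# path to the pdf to read (the file linked above, saved locally)
file = "15000691526464720.pdf"

# opening the pdf in binary mode; the with block closes it afterwards
with open(file, 'rb') as pdfFileObj:
    # creating a pdf reader object
    # (PdfFileReader, numPages, getPage and extractText are the legacy
    # names; they were removed in PyPDF2 3.0)
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # number of pages in pdf file
    a = pdfReader.numPages

    # extracting and printing the text of each page
    for i in range(a):
        pageObj = pdfReader.getPage(i)
        print(pageObj.extractText())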

You can use PDFMiner to do the job and in my experience it works better than other open source Python tools out there.

The key is to specify the laparams parameter correctly and not leave it to its default values. This parameter is used to give PDFMiner more information about the layout of the page. Since the text here corresponds to tables with wide spaces, we need to instruct PDFMiner to use a large character margin (char_margin).

The layout parameters are defined in PDFMiner's layout module (LAParams). Experiment with them to find the values that give the best results for this particular document.
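To see why a large char_margin matters, here is a toy model of the idea. It is not PDFMiner's actual implementation, only a sketch under the assumption that two neighbouring glyphs stay on the same line while the gap between them is at most char_margin times the glyph width. The glyph tuples and the group_into_lines helper are made up for illustration.

```python
# Toy model of the char_margin idea; NOT PDFMiner's real algorithm.
# Each glyph is (x0, x1, text): left edge, right edge, character.

def group_into_lines(glyphs, char_margin):
    """Split glyphs on one baseline into runs, starting a new run whenever
    the gap to the previous glyph exceeds char_margin * previous width."""
    runs = []
    current = ""
    prev = None
    for x0, x1, text in sorted(glyphs):
        if prev is not None:
            prev_x0, prev_x1 = prev
            gap = x0 - prev_x1
            if gap > char_margin * (prev_x1 - prev_x0):
                # gap too wide: close the current run and start a new one
                runs.append(current)
                current = ""
            elif gap > 0:
                # small gap: keep the run but separate with a space
                current += " "
        current += text
        prev = (x0, x1)
    if current:
        runs.append(current)
    return runs

# Hypothetical table row: a label on the left, a figure far to the right.
row = [(0, 10, "V"), (10, 20, "A"), (20, 30, "S"),
       (200, 210, "4"), (210, 220, "6")]

narrow = group_into_lines(row, char_margin=2.0)
wide = group_into_lines(row, char_margin=1000.0)
print(narrow)
print(wide)
```

With the small margin the label and the figure end up in separate runs, which is the "reads down the column" effect from the question; with the large margin they stay together on one line.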

Here is sample code for the PDF in question. I use only a single page for demonstration:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path, pages):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()

    # layout analysis settings; the large char_margin keeps widely spaced
    # table cells on the same line
    laparams = LAParams(all_texts=True, detect_vertical=True,
                        line_overlap=0.5, char_margin=1000.0,
                        line_margin=0.5, word_margin=2,
                        boxes_flow=1)
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    pagenos = set(pages)  # zero-based page indices to extract

    with open(path, 'rb') as fp:
        # maxpages=0 means no limit on the number of pages
        for page in PDFPage.get_pages(fp, pagenos, maxpages=0, password="",
                                      caching=True, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text

pdf_text_page6 = convert_pdf_to_txt("15000691526464720.pdf", pages=[6])

The output for the given page (page index 6, which is page 7 of the document because pagenos is zero-based) looks like the block below. It is not perfect, but every number in the table is captured on the same line as its label.

Page 7 of 11 

  Unaudited    Unaudited 

  1Q2018  1Q2017   1Q2018  4Q2017 

Revenues  73,528  49,552   73,528  66,392 

    VAS   46,877  35,108   46,877  39,947 

   Online advertising   10,689  6,888   10,689  12,361 

    Others  15,962  7,556   15,962  14,084 

Cost of revenues  (36,486)  (24,109)   (36,486)  (34,897) 

Gross profit  37,042  25,443   37,042  31,495 
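Once each row comes out on a single line, turning it into data is mostly string handling. The following is a minimal sketch, not part of PDFMiner: parse_row and the NUMBER pattern are hypothetical helpers, and treating parenthesised figures as negatives is an accounting convention assumed here.

```python
import re

# Matches figures as printed in the table, e.g. 73,528 or (36,486).
NUMBER = re.compile(r"\(?\d[\d,]*\)?")

def parse_row(line):
    """Split one extracted line into (label, [numbers]).

    Tokens that look like figures become ints; parenthesised figures are
    assumed to be negative. Everything else is treated as the row label.
    """
    label_parts = []
    values = []
    for token in line.split():
        if NUMBER.fullmatch(token):
            value = int(token.strip("()").replace(",", ""))
            values.append(-value if token.startswith("(") else value)
        else:
            label_parts.append(token)
    return " ".join(label_parts), values

# A line copied from the extracted output above.
sample = "Cost of revenues  (36,486)  (24,109)   (36,486)  (34,897) "
label, values = parse_row(sample)
print(label, values)
```

Header lines such as "1Q2018  1Q2017" contain no pure figures, so they come back with an empty number list and can be filtered out; lines like "Page 7 of 11" would need their own handling.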


Your issue has more to do with how PDF files are constructed than with PyPDF2. I ran into many of the same problems while parsing PDFs to reconstruct a page layout.

When a PDF is generated, each text block is positioned on the page and rendered according to the font rules applied (similar to constructing an HTML document using nothing but absolute positioning and CSS). A simple PDF library will just return the text from each block in the order the blocks are defined in the file (I've had documents where the pages were generated in reverse, with the last paragraph defined first).

Either you will need to use a more advanced PDF library (likely one that will build on top of the simple libraries) that will take the X, Y location of each text block along with its font information to determine the vertical positioning, or develop this yourself. It looks like the software that JosephA is talking about is doing exactly this.
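If you do develop this yourself, the core idea can be sketched in a few lines. This is only an illustration under assumptions: the blocks are hypothetical (x, y, text) tuples that some lower-level library has already given you, y grows upward as in default PDF coordinates, and rebuild_rows with its y_tolerance bucket size is a made-up helper, not an existing API.

```python
from collections import defaultdict

def rebuild_rows(blocks, y_tolerance=2.0):
    """Group text blocks into rows by vertical position, then order each
    row left to right.

    blocks: iterable of (x, y, text). Blocks whose y values fall into the
    same bucket of height y_tolerance are treated as one row. Bucketing is
    a simplification: two blocks just either side of a bucket boundary
    would be split into different rows.
    """
    rows = defaultdict(list)
    for x, y, text in blocks:
        rows[round(y / y_tolerance)].append((x, text))

    lines = []
    # larger y is higher on the page, so iterate buckets from the top down
    for key in sorted(rows, reverse=True):
        lines.append(" ".join(text for _, text in sorted(rows[key])))
    return lines

# Hypothetical blocks in file order: the label column first, then figures.
blocks = [
    (50, 700.0, "Revenues"), (50, 680.0, "VAS"),
    (300, 700.4, "73,528"), (400, 699.8, "49,552"),
    (300, 680.3, "46,877"), (400, 680.0, "35,108"),
]
rebuilt = rebuild_rows(blocks)
for line in rebuilt:
    print(line)
```

A real document would also need the font size to choose a sensible tolerance, which is where the font information mentioned above comes in.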


I first looked up the extractText function of PyPDF2 and tried to "strip" any new lines from the output to give you the "across" the page one-liner.

The output wasn't so desirable.

Also, it doesn't seem reliable in terms of your output. From the PyPDF2 documentation: "Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated."

So I explored using Tesseract instead. This deviates a bit from using a "PDF extraction library"; it is basically "build your own extraction script".

It's not too difficult once you have a grasp of Tesseract. It took me about an hour's research, with existing knowledge of Tesseract.

Here are my results from my own code extracting the pdf page by page: https://gist.github.com/Benehiko/60862a6be13b3b652b7d506121b95811

Please note my code has a drawback: it does not extract the pages in order, most likely because glob.glob returns the files in arbitrary order.

Just in case the link dies:

from PIL import Image
import pytesseract
import subprocess
import pathlib
import glob
import os

# point Tesseract at the "tessdata" directory next to this script
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
os.environ["TESSDATA_PREFIX"] = ROOT_DIR + "/tessdata"

# render every page of test.pdf to its own TIFF using ImageMagick's convert
pathlib.Path("pages").mkdir(parents=False, exist_ok=True)
params = ['convert', '-density', '300', 'test.pdf', '-depth', '8',
          'pages/test_%02d.tiff']
subprocess.check_call(params)

# OCR each page image and print its text as a single line
for image_path in glob.glob("pages/*.tiff"):
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang='eng', nice=0,
                                       output_type=pytesseract.Output.STRING)
    print(text.replace("\n", " "))

An Explanation of the code:

This first converts the PDF to separate TIFF images, since reading a multi-page TIFF with pytesseract for some reason only reads the first page. The TIFF files are saved in a separate directory called "pages". Pytesseract reads each file and returns its text, which ".replace" then turns into a single line by replacing every newline with a space.
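One possible way around the page-order drawback is to sort the file names by the page number embedded in them before running OCR. This is a small sketch, not something taken from the gist: page_number is a hypothetical helper, and it assumes the page number is the last run of digits in each file name, as with the test_%02d.tiff pattern.

```python
import re

def page_number(path):
    """Return the last run of digits in a file name as an int (-1 if none)."""
    numbers = re.findall(r"\d+", path)
    return int(numbers[-1]) if numbers else -1

# Hypothetical listing, in the arbitrary order a directory scan may return.
unordered = [
    "pages/test_10.tiff",
    "pages/test_02.tiff",
    "pages/test_100.tiff",
    "pages/test_00.tiff",
]
ordered = sorted(unordered, key=page_number)
print(ordered)
```

In the script, this would mean iterating over sorted(glob.glob("pages/*.tiff"), key=page_number). Sorting numerically rather than alphabetically also keeps the order right once a document has more than 99 pages and the two-digit padding runs out.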

A place to start: Tesseract install

Using tesseract in python: pytesseract

Training data used: eng.traineddata

Extra Source: pdf to tiff

Pytesseract: documentation

I hope this helps you. Not sure if this was something you were looking for.
