How to identify PDF files that need OCR?

I have over 30,000 PDF files. Some have already been OCR'd and some have not. Is there a way to find out which files are already OCR'd and which PDFs are image-only?

It would take forever to run every single file through an OCR processor.

I would write a small script to extract the text from the PDF files and see if it is "empty". If there is text, the PDF has already been OCR'd. You could use either Ghostscript or Xpdf to extract the text.

EDIT: This should get you started:

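# Run pdftotext on every PDF in the current directory and show what it extracts
foreach ($pdffile in Get-ChildItem -Filter *.pdf) {
    # The call operator (&) avoids quoting problems with file names;
    # "-" sends the extracted text to stdout, and -join turns the lines into one string
    $pdftext = (& "\path\to\xpdf\pdftotext.exe" $pdffile.FullName "-") -join "`n"
    Write-Host $pdffile.FullName
    Write-Host $pdftext.Length
    Write-Host $pdftext
    Write-Host "-------------------------------"
}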

Unfortunately, even when a PDF contains only images, pdftotext will still extract some text, so you will have to do some more work to check whether the PDF needs OCR.
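One possible form of that "more work" is to ignore whitespace and page-break characters and only count a file as already OCR'd when a reasonable amount of visible text remains. The Python sketch below is an illustration rather than a tested solution for every PDF: the function names and the `min_chars=20` threshold are assumptions, and it expects a `pdftotext` executable to be on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def has_real_text(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the text holds at least min_chars non-whitespace characters."""
    # str.split() with no arguments splits on any whitespace, which also
    # discards form feeds and newlines left over from otherwise empty pages.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def extract_text(pdf_path: Path, pdftotext: str = "pdftotext") -> str:
    """Run pdftotext and return whatever it writes to stdout."""
    # "-" asks pdftotext to write to stdout, as in the PowerShell loop above.
    result = subprocess.run(
        [pdftotext, str(pdf_path), "-"],
        capture_output=True,
        text=True,
        errors="replace",
    )
    return result.stdout


def find_pdfs_needing_ocr(folder: Path) -> List[Path]:
    """List the PDFs in folder whose extracted text is effectively empty."""
    return [
        pdf
        for pdf in sorted(folder.glob("*.pdf"))
        if not has_real_text(extract_text(pdf))
    ]
```

Calling `find_pdfs_needing_ocr(Path("."))` would return the files in the current directory that look image-only under this heuristic. The threshold needs tuning on a sample of your own files, since stamps or headers can add a few characters to a scanned page.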

Xpdf worked for me in a different way, though I am not sure it is the right way.

My image-only PDFs also produced text content, so I used pdffonts.exe to check whether the fonts are embedded in the document. In my case, all image-only files showed 'no' in the emb (embedded) column:

> Config Error: No display font for 'Symbol' 
> Config Error: No display font for 'ZapfDingbats' 
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- --------- 
> Helvetica                            Type 1            no  no  no       7  0

Whereas all searchable PDFs gave 'yes':

> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> ABCDEE+Calibri                       TrueType          yes yes no       7  0
> ABCDEE+Calibri,Bold                  TrueType          yes yes no       9  0
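The check on the emb column can be automated by parsing the pdffonts output. The Python sketch below assumes the column layout shown in the listings above (name, type, emb, in that order, aligned under a dashed rule); the function names are hypothetical. It only encodes this answer's heuristic, so an embedded font is treated as a hint that the file is searchable, not as proof.

```python
import re
from typing import List, Tuple


def parse_pdffonts(output: str) -> List[Tuple[str, bool]]:
    """Return (font name, embedded?) pairs from pdffonts text output."""
    lines = output.splitlines()
    # The dashed rule under the header shows where each column starts and ends.
    rule_index = next(
        (i for i, line in enumerate(lines) if re.fullmatch(r"-+( +-+)+ *", line)),
        None,
    )
    if rule_index is None:
        return []
    spans = [m.span() for m in re.finditer(r"-+", lines[rule_index])]
    if len(spans) < 3:
        return []
    # Assumption: the first column is the font name and the third is "emb".
    name_span, emb_span = spans[0], spans[2]
    fonts = []
    for line in lines[rule_index + 1:]:
        if not line.strip():
            continue
        name = line[name_span[0]:name_span[1]].strip()
        emb = line[emb_span[0]:emb_span[1]].strip()
        fonts.append((name, emb == "yes"))
    return fonts


def has_embedded_font(output: str) -> bool:
    """True if at least one listed font is reported as embedded."""
    return any(embedded for _, embedded in parse_pdffonts(output))
```

Lines before the dashed rule, such as the "Config Error" messages, are skipped. A very long font name could push the later columns out of alignment, in which case slicing by column position would misread the row.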

I found that TotalCmd has a plugin that handles this: https://totalcmd.net/plugring/pdfOCR.html

pdfOCR is a WDX plugin that reports how many pages of each PDF file in the current directory need character recognition (OCR), i.e. how many pages have no searchable text in their layout. This is mostly needed when preparing PDF files for a documentation or archiving system: scanned files generally have to be converted to a text-searchable form before they are included, so that manual or automatic text search works. The pdfOCR plugin for Total Commander presents the number of pages that are images only, with no text, in the column "needOCR". By comparing the needOCR number with the total number of pages, one can decide whether a PDF file needs additional OCR processing.
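A similar per-page count can be approximated from pdftotext output without the plugin. The Python sketch below is not how pdfOCR works internally (that is not described here); it assumes pdftotext's default behaviour of ending every page with a form feed character (i.e. the -nopgbrk option is not used), and the function name is hypothetical.

```python
from typing import Tuple


def count_pages_needing_ocr(pdftotext_output: str) -> Tuple[int, int]:
    """Return (pages without text, total pages) for one pdftotext dump."""
    pages = pdftotext_output.split("\f")
    # Every page ends with a form feed, so the piece after the last
    # form feed is an empty leftover rather than a real page.
    if pages and pages[-1].strip() == "":
        pages.pop()
    # A page counts as "needs OCR" when nothing but whitespace was extracted.
    need_ocr = sum(1 for page in pages if not page.strip())
    return need_ocr, len(pages)
```

Comparing the two numbers gives the same kind of decision as the needOCR column: a result where both values are equal suggests a fully scanned file, while a smaller first value suggests a PDF that mixes text pages and scanned pages.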

Find PDF Files That Need OCR Processing (Houdah Blog): Once a scan has been processed by OCR, the PDF file contains both an image of the document and an invisible text version. In HoudahSpot, select the PDF files you want to process, then select HoudahSpot > Services > OCR PDF Document from the menu. PDFPen will launch in the background, process your files and quit. Once the files have been processed and text content has been found, they will disappear from your HoudahSpot search.

How to OCR Text in PDF and Image Files in Adobe Acrobat: If the report shows a checkmark, then you will want to OCR the document. The script is not effective in these circumstances:
  • The PDF contains a mix of searchable and non-searchable pages, for example if you combined a PDF output from Word with one you scanned.
  • The PDF hasn't been OCR'd, but may be Bates stamped.

PDF Cheat Sheet to OCR Pages: Open the PDF document in Adobe® Acrobat® and try to select any text on the page with a selection tool. If you can highlight a text string and copy/paste it into a text editor (such as Notepad, Microsoft Word or Outlook), then the document does contain searchable text.

Comments
  • Thanks for answering. At least you have given me something to think about. Could a powershell script be constructed with ghostscript or xpdf? Do you have anything handy that I can try? Thanks Again.
  • Added some script to my answer
  • @Fuji-H2O I am looking for the same solution. I need to check whether a PDF has at least one image or not. I know it's a very old question, but if you remember, please help me with the solution.
  • What if the PDF has both text and images?