How to reduce wand memory usage?

I am using wand and pytesseract to get the text of PDFs uploaded to a Django website, like so:

import io
import pytesseract
from PIL import Image as PI
from wand.image import Image

# read_pdf_file holds the bytes of the uploaded PDF; this runs inside the Celery task.
image_pdf = Image(blob=read_pdf_file, resolution=300)
image_png = image_pdf.convert('png')

req_image = []
final_text = []

for img in image_png.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('png'))

for img in req_image:
    txt = pytesseract.image_to_string(PI.open(io.BytesIO(img)).convert('RGB'))
    final_text.append(txt)

return " ".join(final_text)

I have it running in celery on a separate EC2 server. However, because image_pdf grows to approximately 4 GB for even a 13.7 MB pdf file, it is being stopped by the OOM killer. Instead of paying for more RAM, I want to try to reduce the memory used by wand and ImageMagick. Since it is already async, I don't mind increased computation times. I have skimmed this: http://www.imagemagick.org/Usage/files/#massive, but am not sure if it can be implemented with wand. Another possible fix is a way to open a pdf in wand one page at a time rather than putting the full image into RAM at once. Alternatively, how could I interface with ImageMagick directly using python so that I could use these memory limiting techniques?

Remember that the wand library integrates with the MagickWand API, which in turn delegates the PDF encoding/decoding work to ghostscript. Both MagickWand & ghostscript allocate additional memory resources, and do their best to deallocate at the end of each task. However, if routines are initialized by python and held by a variable, it's more than possible to introduce memory leaks.

Here are some tips to ensure memory is managed correctly.

  1. Use with context management for all Wand assignments. This will ensure all resources pass through __enter__ & __exit__ management handlers.

  2. Avoid blob creation for passing data. When creating a file-format blob, MagickWand will allocate additional memory to copy & encode the image, and python will hold the resulting data in addition to the originating wand instance. Usually fine on the dev environment, but can grow out of hand quickly in a production setting.

  3. Avoid Image.sequence. This is another copy-heavy routine, and results in python holding a bunch of memory resources. Remember ImageMagick manages the image stacks very well, so if you're not reordering / manipulating individual frames, it's best to use MagickWand methods & not involve python.

  4. Each task should be an isolated process, and can cleanly shut down on completion. This shouldn't be an issue for you w/ celery as a queue worker, but worth double checking the thread/worker configuration + docs.

  5. Watch out for resolution. A PDF resolution of 300 @ 16Q would result in a massive raster image. With many OCR (tesseract/opencv) techniques, the first step is to pre-process the inbound data to remove extra/unneeded colors, channels, data, etc.

Here's an example of how I would approach this. Note, I'll leverage ctypes to directly manage the image stack w/o additional python resources.

import ctypes
from wand.image import Image
from wand.api import library

# Tell wand about C-API method
library.MagickNextImage.argtypes = [ctypes.c_void_p]
library.MagickNextImage.restype = ctypes.c_int

# ... Skip to calling method ...

final_text = []
with Image(blob=read_pdf_file, resolution=100) as context:
    context.depth = 8
    library.MagickResetIterator(context.wand)
    while(library.MagickNextImage(context.wand) != 0):
        data = context.make_blob("RGB")
        text = pytesseract.image_to_string(data)
        final_text.append(text)
return " ".join(final_text)

Of course your mileage may vary. If you're comfortable with subprocess, you may be able to execute gs & tesseract directly, and eliminate all the python wrappers.
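
If you do go that route, here's a rough sketch of what it might look like; the gs and tesseract flags, and the helper name ocr_pdf_page, are assumptions to adapt to your own setup.

import os
import subprocess
import tempfile

def ocr_pdf_page(pdf_path, page_number):
    # Rasterize a single page with ghostscript, then OCR it with the tesseract CLI.
    with tempfile.TemporaryDirectory() as tmp:
        png_path = os.path.join(tmp, 'page.png')
        # Grayscale at 150 DPI keeps the intermediate raster small.
        subprocess.run([
            'gs', '-dQUIET', '-dNOPAUSE', '-dBATCH',
            '-sDEVICE=pnggray', '-r150',
            '-dFirstPage=%d' % page_number, '-dLastPage=%d' % page_number,
            '-sOutputFile=' + png_path, pdf_path,
        ], check=True)
        # 'stdout' tells tesseract to print the recognized text instead of writing a file.
        result = subprocess.run(['tesseract', png_path, 'stdout'],
                                check=True, stdout=subprocess.PIPE)
        return result.stdout.decode('utf-8')

Only one page's raster exists at a time, and both processes exit between pages, so nothing accumulates inside the long-running worker.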

I was also suffering from memory-leak issues. After some research and tweaking the code implementation, my issues were resolved. I basically got it working correctly using with and the destroy() function.

In some cases I could use with to open and read the files, as in the example below:

with Image(filename = pdf_file, resolution = 300) as pdf:

In this case, using with, the memory and tmp files are correctly managed.

And in another case I had to use the destroy() function, preferably inside a try / finally block, as below:

try:
    for img in pdfImg.sequence:
        # your code
finally:
    pdfImg.destroy()

The second case is an example where I can't use with because I had to iterate over the pages through the sequence, so I already had the file open and was iterating its pages.

This combination of solutions resolved my problems with memory leaks.
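
Here is a minimal sketch of how the two pieces can be combined; pdf_file, the resolution, and the pytesseract call are placeholder assumptions, not a fixed recipe.

import io
import pytesseract
from PIL import Image as PILImage
from wand.image import Image

def pdf_to_text(pdf_file):
    texts = []
    # The outer image is released by the with block ...
    with Image(filename=pdf_file, resolution=150) as pdf:
        for page in pdf.sequence:
            img = Image(image=page)
            try:
                png = img.make_blob('png')
                texts.append(pytesseract.image_to_string(
                    PILImage.open(io.BytesIO(png))))
            finally:
                # ... while each per-page copy is freed explicitly.
                img.destroy()
    return " ".join(texts)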

The code from @emcconville works, and my temp folder is not filling up with magick-* files anymore

I needed to import ctypes, not ctyles as the example originally had it

I also got the error mentioned by @kerthik

I solved it by saving the image and loading it again; it is probably also possible to save it to memory

from PIL import Image as PILImage

...
context.save(filename="temp.jpg")
text = pytesseract.image_to_string(PILImage.open("temp.jpg"))

EDIT: I found the in-memory conversion in How to convert wand.image.Image to PIL.Image?

img_buffer = np.asarray(bytearray(context.make_blob(format='png')), dtype='uint8')
bytesio = io.BytesIO(img_buffer)
text = pytesseract.image_to_string(PILImage.open(bytesio), lang="dan")
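
For what it's worth, the numpy detour is probably not needed; assuming the same context object as above, the blob bytes can go straight into BytesIO:

text = pytesseract.image_to_string(
    PILImage.open(io.BytesIO(context.make_blob(format='png'))), lang="dan")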

I ran into a similar issue.

Found this page interesting: http://www.imagemagick.org/script/architecture.php#tera-pixel

And how to limit the amount of memory used by ImageMagick through wand: http://docs.wand-py.org/en/latest/wand/resource.html

Just adding something like:

from wand.resource import limits

# Use 100MB of ram before writing temp data to disk.
limits['memory'] = 1024 * 1024 * 100

It may increase the computation time (but like you, I don't mind too much), and I actually did not notice much of a difference.

I confirmed using Python's memory-profiler that it is working as expected.
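
For reference, a minimal sketch of that kind of check, assuming memory_profiler is installed; extract_text and sample.pdf are just illustrative names.

from memory_profiler import profile
from wand.image import Image
from wand.resource import limits

# Cap ImageMagick's heap usage before it starts spilling pixel data to disk.
limits['memory'] = 1024 * 1024 * 100

@profile
def extract_text(pdf_path):
    with Image(filename=pdf_path, resolution=100) as pdf:
        return pdf.width  # stand-in for the real OCR work

if __name__ == '__main__':
    # Running the script prints a line-by-line memory report for extract_text.
    extract_text('sample.pdf')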

Comments
  • what is context here?
  • Sharp eye! The context variable should be an instance of the Image object allocated by the with block. I'll update the example.
  • Thank you for the clarification, but pytesseract expects a pillow Image object and so throws TypeError: Unsupported image object
  • This answer addresses memory usage with the Wand library, not the pillow/pytesseract API. I can't speak for those python modules, but I can comment that Tesseract uses Leptonica as its raster I/O, so you can eliminate pillow, PIL, pytesseract, Wand, & ImageMagick entirely. The documentation shows you how, and the internet is full of examples.