How to force Django models to be released from memory

I want to use a management command to run a one-time analysis of the buildings in Massachusetts. I have reduced the offending code to an 8-line snippet that demonstrates the problem I encounter. The comments just explain why I want to do this at all. I am running the code below verbatim, in an otherwise blank management command:

zips = ZipCode.objects.filter(state='MA').order_by('id')
for zip in zips.iterator():
    buildings = Building.objects.filter(boundary__within=zip.boundary)
    important_buildings = []
    for building in buildings.iterator():
        # Some conditionals would go here
        important_buildings.append(building)
    # Several types of analysis would be done on important_buildings, here
    important_buildings = None

When I run this exact code, I find that memory usage steadily increases with each iteration of the outer loop (I use print('mem', process.memory_info().rss) to check memory usage).

It seems like the important_buildings list is hogging up memory, even after going out of scope. If I replace important_buildings.append(building) with _ = building.pk, it no longer consumes much memory, but I do need that list for some of the analysis.

So, my question is: How can I force Python to release the list of Django models when it goes out of scope?

Edit: I feel like there's a bit of a catch 22 on stack overflow -- if I write too much detail, no one wants to take the time to read it (and it becomes a less applicable problem), but if I write too little detail, I risk overlooking part of the problem. Anyway, I really appreciate the answers, and plan to try some of the suggestions out this weekend when I finally get a chance to get back to this!!

You don't provide much information about how big your models are, nor what links there are between them, so here are a few ideas:

By default QuerySet.iterator() will load 2000 elements in memory (assuming you're using django >= 2.0). If your Building model contains a lot of info, this could possibly hog up a lot of memory. You could try changing the chunk_size parameter to something lower.

Does your Building model have links between instances that could cause reference cycles that the gc can't find? You could use gc debug features to get more detail.

Or shortcircuiting the above idea, maybe just call del(important_buildings) and del(buildings) followed by gc.collect() at the end of every loop to force garbage collection?

The scope of your variables is the function, not just the for loop, so breaking up your code into smaller functions might help. Note, though, that even after objects are garbage collected, Python won't always return the memory to the OS, so as explained in this answer you might need to resort to more brutal measures to see the rss go down.
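As a minimal sketch of the reference-cycle idea above: the Node class below is a hypothetical stand-in for a model instance that links to another instance, not anything from the question. It shows that a cycle survives after its names go out of scope until the cycle detector runs, which is what an explicit gc.collect() forces. The immediate behaviour relies on CPython's reference counting.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a model instance that links to another instance."""

    def __init__(self, name):
        self.name = name
        self.partner = None


def make_cycle():
    a, b = Node("a"), Node("b")
    a.partner, b.partner = b, a  # a <-> b reference cycle
    return weakref.ref(a)        # weak reference: does not keep `a` alive


gc.disable()                          # keep the automatic collector out of the demo
probe = make_cycle()
alive_before = probe() is not None    # the cycle keeps both objects alive
found = gc.collect()                  # the cycle detector frees the unreachable pair
alive_after = probe() is not None
gc.enable()

print(alive_before, found, alive_after)
```

If the objects held in your list never reference each other, reference counting alone frees them as soon as the list is dropped, and gc.collect() adds cost without freeing anything extra.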

Hope this helps!

EDIT:

To help you understand what code uses your memory and how much, you could use the tracemalloc module, for instance using the suggested code:

import linecache
import os
import tracemalloc

def display_top(snapshot, key_type='lineno', limit=10):
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)

    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)

    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))

tracemalloc.start()

# ... run your code ...

snapshot = tracemalloc.take_snapshot()
display_top(snapshot)
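For a loop that grows over time, comparing snapshots between iterations can be more telling than a single snapshot. The following is a rough sketch, not part of the original answer: process_batch and the leaked list are hypothetical stand-ins that deliberately retain data, so the comparison has something to report.

```python
import tracemalloc


def process_batch(n):
    """Hypothetical stand-in for one pass of the outer loop."""
    return [str(i) * 10 for i in range(n)]


def take_filtered_snapshot():
    # Exclude tracemalloc's own bookkeeping so it does not dominate the diff.
    snap = tracemalloc.take_snapshot()
    return snap.filter_traces((tracemalloc.Filter(False, tracemalloc.__file__),))


tracemalloc.start()
leaked = []  # deliberately retained, to simulate memory that accumulates
previous = take_filtered_snapshot()
for step in range(3):
    leaked.append(process_batch(10_000))
    current = take_filtered_snapshot()
    # Lines whose allocations changed the most since the previous iteration.
    top = current.compare_to(previous, "lineno")[:3]
    for stat in top:
        print(step, stat)
    previous = current
growth = top[0].size_diff
tracemalloc.stop()
```

A line that keeps showing a positive size_diff on every iteration is a good candidate for whatever is accumulating.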

Very quick answer: the memory is being freed. rss is not an accurate tool for telling where memory is being consumed: it reports the physical memory the process holds from the OS, which can include memory Python has already freed internally, rather than the memory your live objects are using (keep reading to see a demo). You can use the memory-profiler package to check the memory use of your function line by line.

So, how to force Django models to be released from memory? You can't tell whether you have such a problem just by using process.memory_info().rss.

I can, however, propose a solution for you to optimize your code. And write a demo on why process.memory_info().rss is not a very accurate tool to measure memory being used in some block of code.

Proposed solution: as demonstrated later in this same post, applying del to the list is not going to be the solution. Tuning chunk_size for iterator() will certainly help (be aware that the chunk_size option was added in Django 2.0), but the real enemy here is that nasty list.

That said, you can build a list of just the fields you need to perform your analysis (I'm assuming your analysis can't be tackled one building at a time) in order to reduce the amount of data stored in that list.

Try fetching just the attributes you need on the go, and select the targeted buildings using Django's ORM.

for zip in zips.iterator(): # Using chunk_size here if you're working with Django >= 2.0 might help.
    important_buildings = Building.objects.filter(
        boundary__within=zip.boundary,
        # Some conditions here ... 

        # You could even use annotations with conditional expressions
        # as Case and When.

        # Also Q and F expressions.

        # It is very uncommon the use case you cannot address 
        # with Django's ORM.

        # Ultimately you could use raw SQL. Anything to avoid having
        # a list with the whole object.
    )

    # And then just load into the list the data you need
    # to perform your analysis.

    # Analysis according size.
    data = important_buildings.values_list('size', flat=True)

    # Analysis according height.
    data = important_buildings.values_list('height', flat=True)

    # Perhaps you need more than one attribute ...
    # Analysis according to height and size.
    data = important_buildings.values_list('height', 'size')

    # Etc ...

It's very important to note that with a solution like this, the database is only hit when the data queryset is actually evaluated. And of course, you will only have in memory the minimum required to carry out your analysis.

Thinking in advance.

When you hit issues like this, you should start thinking about parallelism, clustering, big data, etc. Read also about ElasticSearch; it has very good analysis capabilities.

Demo
process.memory_info().rss Won't tell you about memory being freed.

I was really intrigued by your question and the fact you describe here:

It seems like the important_buildings list is hogging up memory, even after going out of scope.

Indeed, it seems so, but it is not. Look at the following example:

from psutil import Process

process = Process()  # no pid given: inspects the current process

def memory_test():
    a = []
    for i in range(10000):
        a.append(i)
    del a

print(process.memory_info().rss)  # Prints 29728768
memory_test()
print(process.memory_info().rss)  # Prints 30023680

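So even though the memory held by a is freed, the second number is bigger. That's because memory_info().rss (resident set size) reports the physical memory the process holds from the operating system, and Python does not necessarily hand freed memory back to the OS right away; it is not a measure of the memory your objects are using at that moment. See memory_info in the psutil docs.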

The following image is a plot (memory over time) for the same code as before, but with range(10000000).

I used the mprof script that comes with memory-profiler to generate this graph.

You can see the memory is completely freed, which is not what you see when you profile using process.memory_info().rss.
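A rough way to check the same thing without a plot is the standard library's tracemalloc module, which tracks memory allocated by Python itself rather than what the OS reports for the process. This is a sketch under that assumption, so its numbers are not comparable to rss, and the list size is arbitrary.

```python
import tracemalloc


def memory_test(n=1_000_000):
    a = []
    for i in range(n):
        a.append(i)
    del a  # the list and its integers become garbage here


tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
memory_test()
after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"before={before} B, after={after} B, peak={peak} B")
```

The peak value reflects the list while it existed, while the value read after the call should be back near the starting point, which is the behaviour rss alone does not show.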

If I replace important_buildings.append(building) with _ = building.pk, it no longer consumes much memory

It will always be that way: a list of objects will always use more memory than a single object.

On the other hand, you can also see that the memory used doesn't grow linearly as you might expect. Why?

From this excellent site we can read:

The append method is "amortized" O(1). In most cases, the memory required to append a new value has already been allocated, which is strictly O(1). Once the C array underlying the list has been exhausted, it must be expanded in order to accommodate further appends. This periodic expansion process is linear relative to the size of the new array, which seems to contradict our claim that appending is O(1).

However, the expansion rate is cleverly chosen to be three times the previous size of the array; when we spread the expansion cost over each additional append afforded by this extra space, the cost per append is O(1) on an amortized basis.

It is fast but has a memory cost.
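The over-allocation described in the quote can be observed directly with sys.getsizeof. This is only a sketch: the exact growth pattern is a CPython implementation detail, and in current CPython the extra room is much closer to one eighth of the list's size than to the factor of three mentioned in the quote.

```python
import sys


def count_resizes(n):
    """Count how often the list's allocated size changes across n appends."""
    items = []
    last_size = sys.getsizeof(items)
    resizes = 0
    for i in range(n):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last_size:  # the underlying array was reallocated
            resizes += 1
            last_size = size
    return resizes


resizes = count_resizes(10_000)
print(f"{resizes} reallocations for 10000 appends")
```

Most appends reuse space that was already reserved, which is why memory grows in steps rather than by one slot per append.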

The real problem is not that Django models are not being released from memory. The problem is that the algorithm you've implemented uses too much memory. And of course, the list is the villain.

A golden rule for Django optimization: replace lists with querysets wherever you can.

Laurent S's answer is quite on the point (+1 and well done from me :D).

There are some points to consider in order to cut down on your memory usage:

  1. The iterator usage:

    You can set the chunk_size parameter of the iterator to something as small as you can get away with (e.g. 500 items per chunk). That may make iteration somewhat slower (smaller chunks mean more round trips to fetch rows from the database), but it will cut down on your memory consumption.

  2. The only and defer options:

    defer(): In some complex data-modeling situations, your models might contain a lot of fields, some of which could contain a lot of data (for example, text fields), or require expensive processing to convert them to Python objects. If you are using the results of a queryset in some situation where you don’t know if you need those particular fields when you initially fetch the data, you can tell Django not to retrieve them from the database.

    only(): Is more or less the opposite of defer(). You call it with the fields that should not be deferred when retrieving a model. If you have a model where almost all the fields need to be deferred, using only() to specify the complementary set of fields can result in simpler code.

    Therefore you can cut down on what you are retrieving from your models in each iterator step and keep only the essential fields for your operation.

  3. If your query still remains too memory heavy, you can choose to keep only the building's id in your important_buildings list and then use this list to make the queries you need against your Building model for each of your operations (this will slow down your operations, but it will cut down on the memory usage).

  4. You may be able to improve your queries so much that they solve parts (or even the whole) of your analysis, but with the state of your question at this moment I cannot tell for sure (see the PS at the end of this answer).

Now let's try to bring all the above points together in your sample code:

# You don't use more than the "boundary" field, so why bring more?
# You can even use "values_list('boundary', flat=True)"
# except if you are using more than that (I cannot tell from your sample)
zips = ZipCode.objects.filter(state='MA').order_by('id').only('boundary')
for zip in zips.iterator():
    # I would use "set()" instead of a list to avoid duplicates
    important_buildings = set()

    # Keep only the essential fields for your operations using "only" (or "defer")
    for building in Building.objects.filter(boundary__within=zip.boundary)\
                    .only('essential_field_1', 'essential_field_2', ...)\
                    .iterator(chunk_size=500):
        # Some conditionals would go here
        important_buildings.add(building)

If this still hogs too much memory for your liking you can use the 3rd point above like this:

zips = ZipCode.objects.filter(state='MA').order_by('id').only('boundary')
for zip in zips.iterator():
    important_buildings = set()
    for building in Building.objects.filter(boundary__within=zip.boundary)\
                    .only('pk', 'essential_field_1', 'essential_field_2', ...)\
                    .iterator(chunk_size=500):
        # Some conditionals would go here

        # Create a set containing only the important buildings' ids
        important_buildings.add(building.pk)

and then use that set to query your buildings for the rest of your operations:

# pk__in accepts any iterable, so converting the set to a list is optional
Building.objects.filter(pk__in=list(important_buildings))...

PS: If you can update your question with more specifics, like the structure of your models and some of the analysis operations you are trying to run, we may be able to provide more concrete answers to help you!

Have you considered Union? By looking at the code you posted you are running a lot of queries within that command but you could offload that to the database with Union.

from django.contrib.gis.db.models import Union  # GeoDjango aggregate

combined_area = FooModel.objects.filter(...).aggregate(area=Union('geom'))['area']
final = BarModel.objects.filter(coordinates__within=combined_area)

Tweaking the above could essentially narrow down the queries needed for this function to one.

It's also worth looking at DjangoDebugToolbar, if you haven't looked at it already.

To release memory, you must copy the important details of each of the buildings in the inner loop into a new object, to be used later, while discarding the buildings that are not suitable. In code not shown in the original post, references to objects from the inner loop still exist, hence the memory issues. By copying the relevant fields to new objects, the originals can be deleted as intended.
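A minimal sketch of that idea, with hypothetical names throughout: HeavyBuilding stands in for a full model instance, and BuildingSummary holds only the fields an assumed analysis needs. The weak reference is there only to show that the original object is gone once nothing else refers to it; the immediate release relies on CPython's reference counting.

```python
import weakref
from typing import NamedTuple


class HeavyBuilding:
    """Hypothetical stand-in for a full model instance."""

    def __init__(self, pk, height, size):
        self.pk, self.height, self.size = pk, height, size
        self.payload = bytearray(100_000)  # simulates large fields we don't need


class BuildingSummary(NamedTuple):
    """Only the fields the (assumed) analysis needs."""

    pk: int
    height: float
    size: float


def summarize(buildings):
    # Copy plain values out of each instance; no reference to the instance is kept.
    return [BuildingSummary(b.pk, b.height, b.size) for b in buildings]


heavy = [HeavyBuilding(i, 10.0 + i, 100.0 * i) for i in range(5)]
probe = weakref.ref(heavy[0])
summaries = summarize(heavy)
del heavy  # drop the only strong references to the originals
original_released = probe() is None

print(original_released, summaries[0])
```

With a real queryset, values_list() with the needed field names gives a similar result without creating the full instances in the first place.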

Comments
  • Does your analysis code happen to create references between instances of building so that you'd end up with a reference cycle, preventing gc from doing its work?
  • I've taken out the analysis code. the code above is verbatim what I run
  • Are you running this code with DEBUG=True?
  • The catch-22 is resolved by providing a minimally reproducible sample of your code and the conditions to reproduce the problems. Since you have not provided that, guesses tend to surface. And in SO form the best guess receives your 1/2 bounty.
  • The above code was minimally reproducible. Any django model would have had the effect that I mentioned, because I misunderstood how process.memory_info().rss worked. Turned out there was no memory issue in the above snippet. I awarded the full bounty for that reason
  • rss will often not go down: it measures the physical memory the process currently holds from the OS, which can include memory Python has already freed internally, not the memory your objects are using.
  • isn't it an overhead to call gc.collect() at the end of every loop? as it can take considerable time to evaluate every memory object within a large system
  • The list is not the issue, as it is really quite small in individual passes of the loop, and my issue was about accumulating memory linearly over multiple iterations of the loop. I am still using the list. But the other information you provided, particularly about memory profiling, helped me diagnose the real issue. thanks.
  • I'm glad to help, any time.