Django loaddata - Out of Memory
I made a dump of my DB using dumpdata and it created a 500 MB JSON file.
Now I am trying to use loaddata to restore the DB, but it seems like Django tries to load the entire file into memory before applying it: I get an out-of-memory error and the process is killed.
Isn't there a way to bypass this problem?
loaddata is generally used for fixtures, i.e. a small number of database objects to get your system started and for tests, rather than for large chunks of data. If you're hitting memory limits then you're probably not using it for the right purpose.
If you still have the original database, you should use something more suited to the purpose, like PostgreSQL's pg_dump or MySQL's mysqldump.
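For example, a dump/restore cycle with PostgreSQL's native tools might look like this (the database name, user, and file names are placeholders for your own):

```shell
# Dump in PostgreSQL's custom format (compressed, supports selective restore)
pg_dump -U myuser -Fc mydatabase -f mydatabase.dump

# Restore into a freshly created target database
pg_restore -U myuser -d mydatabase_restored mydatabase.dump
```

Unlike loaddata, these tools stream rows to and from the server, so they never need the whole dump in memory at once.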
As Joe pointed out, PostgreSQL's pg_dump or MySQL's mysqldump is better suited to your case.
In case you have lost your original database, there are two ways you could try to get your data back:
One: Find another machine that has more memory and can access your database. Build your project on that machine, and run the loaddata command there.
I know it sounds silly, but it is the quickest way if you can run Django on your laptop and connect to the DB remotely.
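Concretely, that just means pointing the bigger machine's settings at the remote database; everything below (host, names, credentials) is a hypothetical example:

```python
# settings.py on the machine with enough memory
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'secret',
        'HOST': 'db.example.com',  # the remote database server
        'PORT': '5432',
    }
}
```

Then run ./manage.py loaddata dump.json on that machine.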
Two: Hack the Django source code.
Check the code in django.core.serializers.json:
    def Deserializer(stream_or_string, **options):
        """
        Deserialize a stream or string of JSON data.
        """
        if not isinstance(stream_or_string, (bytes, six.string_types)):
            stream_or_string = stream_or_string.read()
        if isinstance(stream_or_string, bytes):
            stream_or_string = stream_or_string.decode('utf-8')
        try:
            objects = json.loads(stream_or_string)
            for obj in PythonDeserializer(objects, **options):
                yield obj
        except GeneratorExit:
            raise
        except Exception as e:
            # Map to deserializer error
            six.reraise(DeserializationError, DeserializationError(e), sys.exc_info())
The lines below are the problem: the json module in the stdlib only accepts strings and cannot handle streams lazily, so Django loads the entire content of the JSON file into memory.
    stream_or_string = stream_or_string.read()
    objects = json.loads(stream_or_string)
You could optimize that code with py-yajl, which provides alternatives to the built-in json.loads and json.dumps using yajl.
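To illustrate what a streaming parser buys you, here is a small stdlib-only sketch of the idea: json.JSONDecoder.raw_decode decodes one array element at a time, so each object can be processed and discarded instead of materializing the whole list as json.loads does. iter_json_array is a name invented for this sketch; a real streaming parser such as yajl or ijson also reads the input file incrementally rather than requiring the whole string.

```python
import json

def iter_json_array(text):
    """Yield the items of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    idx = text.index('[') + 1
    while True:
        # Skip whitespace and the commas between array elements
        while idx < len(text) and text[idx] in ' \t\r\n,':
            idx += 1
        if idx >= len(text) or text[idx] == ']':
            return
        # raw_decode returns the decoded object and the index where it stopped
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

for row in iter_json_array('[{"pk": 1}, {"pk": 2}, {"pk": 3}]'):
    print(row["pk"])  # prints 1, then 2, then 3
```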
I'd like to add that I was quite successful in a similar use-case with ijson: https://github.com/isagalaev/ijson
In order to get an iterator over the objects in a JSON file produced by django dumpdata, I modified the JSON Deserializer like this (imports elided):
    Serializer = django.core.serializers.json.Serializer

    def Deserializer(stream_or_string, **options):
        if isinstance(stream_or_string, six.string_types):
            stream_or_string = six.BytesIO(stream_or_string.encode('utf-8'))
        try:
            objects = ijson.items(stream_or_string, 'item')
            for obj in PythonDeserializer(objects, **options):
                yield obj
        except GeneratorExit:
            raise
        except Exception as e:
            # Map to deserializer error
            six.reraise(DeserializationError, DeserializationError(e), sys.exc_info())
The problem with using py-yajl as-is is that you still get all the objects in one large array, which uses a lot of memory. This loop uses only as much memory as a single serialized Django object. Also, ijson can still use yajl as a backend.
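A note on wiring such a module in: rather than editing Django's source in place, the documented SERIALIZATION_MODULES setting lets you register a replacement module for the json format. The module path below is a hypothetical example; the module must expose Serializer and Deserializer, like the snippet above.

```python
# settings.py
SERIALIZATION_MODULES = {
    'json': 'myproject.utils.streaming_json',  # hypothetical path to your module
}
```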
I ran into this problem migrating data from Microsoft SQL Server to PostgreSQL, so pg_dump wasn't an option for me. I split my JSON fixtures into chunks that would fit in memory (about 1M rows for a wide table and 64 GB of RAM).
    def dump_json(model, batch_len=1000000):
        "Dump database records to a json file in Django fixture format, one file for each batch of 1M records"
        JSONSerializer = serializers.get_serializer("json")
        jser = JSONSerializer()
        for i, partial_qs in enumerate(util.generate_slices(model.objects.all(), batch_len=batch_len)):
            with open(model._meta.app_label + '--' + model._meta.object_name + '--%04d.json' % i, 'w') as fpout:
                jser.serialize(partial_qs, indent=1, stream=fpout)
You can then load them with manage.py loaddata <app_name>--<model_name>*.json. But in my case I had to first sed the files to change the model and app names so they'd load into the right database. I also nulled the pk, because I'd changed the pk to an AutoField (best practice for Django).
    sed -e 's/^\ \"pk\"\:\ \".*\"\,/"pk": null,/g' -i *.json
    sed -e 's/^\ \"model\"\:\ \"old_app_name\.old_model_name\"\,/\ \"model\"\:\ "new_app_name\.new_model_name\"\,/g' -i *.json
You might find pug useful. It's a FOSS Python package of similarly hacky tools for handling large migration and data-mining tasks in Django.
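The util.generate_slices helper referenced above lives in pug and isn't shown here; a minimal stand-in (my own sketch, not pug's implementation) could look like this:

```python
def generate_slices(sliceable, batch_len=1000000):
    """Yield successive slices of a list or Django QuerySet.

    Slicing a QuerySet translates to LIMIT/OFFSET in SQL, so each
    batch is fetched separately rather than loading every row at once.
    """
    # QuerySets carry a .query attribute; use .count() so we don't
    # evaluate the whole queryset just to measure its length
    if hasattr(sliceable, 'query'):
        total = sliceable.count()
    else:
        total = len(sliceable)
    for start in range(0, total, batch_len):
        yield sliceable[start:start + batch_len]
```

With a plain list, generate_slices(list(range(10)), batch_len=4) yields [0, 1, 2, 3], [4, 5, 6, 7], and [8, 9].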
You can use the XML format for serialization/deserialization. It's implemented internally via file streams and doesn't require a lot of memory, in comparison with JSON. Unfortunately, Django's JSON deserialization doesn't use streams.
So just try:
    ./manage.py dumpdata --format=xml > file.xml
    ./manage.py loaddata file.xml
- What's your Django version? Is settings.DEBUG set to False?
- I didn't know that loaddata was just for fixtures, good to know!
- This only works when the databases match. If I want to migrate from SQLite to PostgreSQL, it isn't working!
- Thank you for your help; fortunately the DB wasn't gone yet, so I ended up using pg_dump!
- I neglected to provide the util.generate_slices function in the answer. It is from the NLP utilities package within pug.
- This worked for me, although in dumpdata you have to specify: --format xml
- I wanted to add one point to my answer: if you use JSONField on your model, you probably cannot serialize it with XML, because the default serializer is not able to do that. So you have two options here: optimize the JSON serializer in Django, OR teach the XML serializer to handle JSONField.