Is it possible to improve Mongoexport speed?

I have a MongoDB 3.6.2.0 collection with 130M documents. It has several simple fields and two fields with nested JSON documents. The data is stored in compressed format (zlib).

I need to export one of the embedded fields to JSON as quickly as possible. However, mongoexport is taking forever: after 12 hours of running it has processed only 5.5% of the data, which is too slow for me.

The CPU is not busy. Mongoexport seems to be single-threaded.

The export command I am using:

mongoexport -c places --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords.json

Under the hood it is actually the getMore command that is unreasonably slow:

2018-05-02T17:59:35.605-0700 I COMMAND  [conn289] command maps.places command: getMore { getMore: 14338659261, collection: "places", $db: "maps" } originatingCommand: { find: "places", filter: {}, sort: {}, projection: { _id: 1, API: 1 }, skip: 0, snapshot: true, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:14338659261 keysExamined:0 docsExamined:5369 numYields:1337 nreturned:5369 reslen:16773797 locks:{ Global: { acquireCount: { r: 2676 } }, Database: { acquireCount: { r: 1338 } }, Collection: { acquireCount: { r: 1338 } } } protocol:op_query 22796ms

I have tried running multiple commands with --skip and --limit options in separate processes, like this:

mongoexport -c places --skip 10000000 --limit 10000000 --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords.json
mongoexport -c places --skip 20000000 --limit 10000000 --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords.json

and so on. But I could not even wait long enough for the first command with a non-zero --skip to start producing output!
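
A range-partitioned variant might look like the sketch below: instead of --skip (which still walks every earlier document), each process gets an _id range, which can use the always-present _id index. This is only a sketch, assuming the _id values are default ObjectIds, that START and END bracket the collection's insert times (so buckets may be uneven), and that this mongoexport build accepts extended JSON {"$oid": ...} in --query; the URI and paths just mirror the ones above.

# Sketch only: parallel mongoexport on _id ranges instead of --skip/--limit.
# Assumes default ObjectId _ids and that --query accepts extended JSON ($oid).
import subprocess
from datetime import datetime
from bson import ObjectId  # ships with pymongo

URI = "mongodb://user:pass@hostIP:hostPort/maps?authSource=admin"
N_PARTS = 6                       # roughly one export process per core
START = datetime(2015, 1, 1)      # placeholder: before the oldest document
END = datetime(2018, 6, 1)        # placeholder: after the newest document

step = (END - START) / N_PARTS
bounds = [ObjectId.from_datetime(START + i * step) for i in range(N_PARTS + 1)]

procs = []
for i in range(N_PARTS):
    query = '{"_id": {"$gte": {"$oid": "%s"}, "$lt": {"$oid": "%s"}}}' % (bounds[i], bounds[i + 1])
    procs.append(subprocess.Popen([
        "mongoexport", "--uri", URI, "-c", "places",
        "--fields", "API", "-q", query,
        "-o", r"D:\APIRecords.%d.json" % i,
    ]))
for p in procs:
    p.wait()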

I have also tried the --forceTableScan option, which did not make any difference.

I have no indexes on the places collection.

My storage configuration:

journal.enabled: false
wiredTiger.collectionConfig.blockCompressor: zlib

Collection stats:

'ns': 'maps.places',
'size': 2360965435671,
'count': 130084054,
'avgObjSize': 18149,
'storageSize': 585095348224.0

My server specs:

Windows Server 2012 R2 x64
10 GB RAM, 4 TB HDD, 6-core Xeon 2.2 GHz

I have run a test, and with an SSD I get the same terrible read throughput as with the HDD.

My question:

Why is reading so slow? Has anyone else experienced the same issue? Can you give me any hints on how to speed up data dumping?

Update

I moved the DB to fast NVMe SSD drives, and I think I can now state my concerns about MongoDB read performance more clearly.

Why does this command, which looks for a chunk of documents that do not have a specific field:

2018-05-05T07:20:46.215+0000 I COMMAND  [conn704] command maps.places command: find { find: "places", filter: { HTML: { $exists: false }, API.url: { $exists: true } }, skip: 9990, limit: 1600, lsid: { id: UUID("ddb8b02c-6481-45b9-9f84-cbafa586dbbf") }, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:15881327065 keysExamined:0 docsExamined:482851 numYields:10857 nreturned:101 reslen:322532 locks:{ Global: { acquireCount: { r: 21716 } }, Database: { acquireCount: { r: 10858 } }, Collection: { acquireCount: { r: 10858 } } } protocol:op_query 177040ms

only put about 50 Mb/sec of read pressure on a fast flash drive? This is clearly the performance of a single-threaded random (scattered) read, whereas I have just verified that the drive easily sustains 1 Gb/sec of read/write throughput.

In terms of Mongo internals, would it not be wiser to read the BSON file sequentially, instead of iterating document after document, and gain a 20x scanning speed improvement? (And, since my blocks are zlib compressed and the server has 16 cores, to decompress the fetched chunks in one or several helper threads?)

I can also confirm that even when I specify no query filters and clearly want to iterate the ENTIRE collection, no fast sequential read of the BSON file takes place.

You can try using the pandas and joblib libraries to export to the JSON file in parts. You can refer to this gist for processing the data in MongoDB.

from pandas import DataFrame
from joblib import Parallel, delayed
from pymongo import MongoClient

def process(idx, cursor):
    # Write the documents from one cursor to their own JSON file.
    file_name = "fileName" + str(idx) + ".json"
    df = DataFrame(list(cursor))
    df.to_json(file_name, orient='records')

# Make a list of cursors; see the parallel_scan API of pymongo.
mongo_collection = MongoClient("mongodb://user:pass@hostIP:hostPort/maps?authSource=admin").maps.places
no_of_parts_of_collection = 4  # how many cursors to request (placeholder value)
cursors = mongo_collection.parallel_scan(no_of_parts_of_collection)

Parallel(n_jobs=4)(delayed(process)(idx, cursor) for idx, cursor in enumerate(cursors))

The n_jobs parameter spawns as many processes as the number specified; each process holds one core. I used 4 since your server has 6 cores available. The parallel_scan() API takes a number and divides the collection into that many parts. You can try higher numbers to break the collection into more evenly divided cursors.

I have tried a similar approach, but the signature and definition of my process function were different. I was able to process 2.5M records in under 20 minutes. You can read this answer of mine to get an idea of what exactly I was trying to achieve.

I don't work with Mongo, but a common trick could be used: make a simple app that efficiently and sequentially queries all the data, filters it, and saves it in the format you want.

If you need to save in a complex format and there are no libraries to work with it (I really doubt that), it might still be efficient to read everything, filter it, put it back into a temporary collection, export that collection fully, and then drop the temporary collection.
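
For example, a minimal pymongo sketch of such an app, assuming the URI from the question, a projection on _id and API, and newline-delimited JSON output (the file path is a placeholder):

# Minimal sketch: stream the collection once with a projection and
# write one JSON document per line. URI and output path are placeholders.
from pymongo import MongoClient
from bson.json_util import dumps

client = MongoClient("mongodb://user:pass@hostIP:hostPort/maps?authSource=admin")
coll = client.maps.places

# A large batch size keeps the number of getMore round trips down.
cursor = coll.find({}, {"_id": 1, "API": 1}, no_cursor_timeout=True).batch_size(1000)
try:
    with open(r"D:\APIRecords.json", "w") as out:
        for doc in cursor:
            out.write(dumps(doc))
            out.write("\n")
finally:
    cursor.close()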

mongoexport is a client tool; it uses the public API and a socket connection to MongoDB itself.

So it does not have access to the document BSON on disk.

Does this look like what you mentioned?

https://docs.mongodb.com/manual/core/backups/#back-up-by-copying-underlying-data-files

mongodump could also be an option for you.
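
If mongodump is acceptable, a possible sketch of that route (the dump command and paths are illustrative, assuming your mongodump accepts --uri like your mongoexport does): dump only the places collection, then stream the resulting .bson file sequentially with pymongo's bson module and pull out the API field.

# Sketch: read a mongodump'ed BSON file sequentially and extract one field.
# First, something like:
#   mongodump --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
#       -c places -o D:\dump
# which should produce D:\dump\maps\places.bson.
from bson import decode_file_iter
from bson.json_util import dumps

with open(r"D:\dump\maps\places.bson", "rb") as dump_file, \
        open(r"D:\APIRecords.json", "w") as out:
    for doc in decode_file_iter(dump_file):  # one document at a time, sequential read
        if "API" in doc:
            out.write(dumps({"_id": doc["_id"], "API": doc["API"]}))
            out.write("\n")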

Comments
  • I use Mongo, but I don't have any knowledge of exporting to be able to help. However, something to try: if you have found that the export binary is single-threaded, could you kick off several parallel exports, with each one specifying a different query? I don't know if that would result in disk thrashing that would make your export speed worse, or whether the operation is sufficiently single-core-bound that it would help.
  • Thanks for your comment halfer. Yes, I tried running parallel exports, each skipping a different number of records (10M, 20M, etc.), but it turned out Mongo can't skip records without actually crawling them one by one (correct me if I am wrong), which resulted in only the 1st export process being active and the rest just 'hanging'.
  • I think index info is not relevant here as I want full collection scan, not a scan of some subset of the collection...
  • Rather than skips, can you construct a query on a string or numeric field to divide the data into ranges? For example, if there is string data in places, then items starting with A first, then B, etc. They might be both parallelisable and benefit from indexing.
  • Oh, that's true, but building such an index (I don't have any) would require a full collection scan first anyway (internally). Thanks for the suggestion.
  • That's very interesting, Yayati, I will give parallel_scan a try!
  • Except that parallel collection scan was only supported under MMAPv1, and this will give you one cursor with WiredTiger.
  • Indeed, Asya is right, for WiredTiger this won't work (at least in 3.6) according to docs.mongodb.com/manual/reference/command/…: This command will not return more than one cursor for the WiredTiger storage engine
  • Thanks for the suggestions; cache size should not matter for this task because every document needs to be accessed only once. Yes, the collection was compressed, but only 1 core was busy decompressing it and total CPU load was ~5%. There were no other operations during the export.
  • It does matter. WiredTiger loads the data from disk, uncompresses it, and loads it into its cache. "Cache" is the WiredTiger term for its working memory. The slow export you're seeing is symptomatic of underprovisioned hardware.
  • Kevin, the average document size is only 16 KB. The server had 60 GB RAM, 50% of which was devoted to the WiredTiger cache. But even 5 GB of RAM would be more than enough to read hundreds of thousands of 16 KB documents from disk, process them, and overwrite them in the cache IF MongoDB had some efficient internal way to do it, but it does not.
  • @AnatolyAlekseev: you should still be able to run comparisons on it. Assuming it's ordinary Latin characters, take a look at the distribution of strings beginning with A, B, C, etc. You can still use gt/lt on strings (I don't know about Mongo specifically, but strings are comparable generally).
  • @halfer good catch, fixed. Btw strings are comparable in MongoDB.
  • Does that mean I need to create my own BSON format parser? Thanks, not the best solution for me.