pyodbc - very slow bulk insert speed

With this table:

CREATE TABLE test_insert (
    col1 INT,
    col2 VARCHAR(10),
    col3 DATE
)

the following code takes 40 seconds to run:

import pyodbc

from datetime import date


conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
    'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')

rows = []
row = [1, 'abc', date.today()]
for i in range(10000):
    rows.append(row)

cursor = conn.cursor()
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)

conn.commit()

The equivalent code with psycopg2 only takes 3 seconds, and I doubt MS SQL Server is really that much slower than PostgreSQL. Any ideas on how to improve the bulk insert speed when using pyodbc?

EDIT: Added some notes following ghoerz's discovery.

In pyodbc, the flow of executemany is:

  • prepare statement
  • loop for each set of parameters
    • bind the set of parameters
    • execute

In ceODBC, the flow of executemany is:

  • prepare statement
  • bind all parameters
  • execute

I was having a similar issue with pyODBC inserting into a SQL Server 2008 DB using executemany(). When I ran a profiler trace on the SQL side, pyODBC was creating a connection, preparing the parametrized insert statement, and executing it for one row. Then it would unprepare the statement, and close the connection. It then repeated this process for each row.

I wasn't able to find any solution in pyODBC that didn't do this. I ended up switching to ceODBC for connecting to SQL Server, and it used the parametrized statements correctly.
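
For reference, here is a minimal sketch of what the ceODBC switch might look like, assuming ceODBC exposes the standard DB-API 2.0 interface (connect() taking an ordinary ODBC connection string, cursor(), executemany()); the driver and credentials are the placeholders from the question:

import ceODBC
from datetime import date

# Assumption: ceODBC.connect() accepts an ODBC connection string,
# just like pyodbc.connect() does.
conn = ceODBC.connect('DRIVER={SQL Server Native Client 10.0};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')

rows = [[i, 'abc', date.today()] for i in range(10000)]

cursor = conn.cursor()
# ceODBC binds the whole parameter array up front, so this is one
# prepare plus one execute rather than one round-trip per row.
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()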

Related: Performance issue · Issue #239 · mkleehammer/pyodbc (GitHub): inserting row by row with code like the above is very slow; the workaround suggested there is to use a bulk copy utility such as Microsoft's bcp.

Tried both ceODBC and mxODBC and both were also painfully slow. Ended up going with an adodb connection with help from http://www.ecp.cc/pyado.html. Total run time improved by a factor of 6!

import win32com.client

# Open an ADO connection; this example uses the Jet OLE DB provider,
# with dbDIR/dbOut pointing at the target database.
comConn = win32com.client.Dispatch(r'ADODB.Connection')
DSN = 'PROVIDER=Microsoft.Jet.OLEDB.4.0;DATA SOURCE=%s%s' % (dbDIR, dbOut)
comConn.Open(DSN)

# Open a keyset recordset (1) on the target table with optimistic locking (3).
rs = win32com.client.Dispatch(r'ADODB.Recordset')
rs.Open('[' + tblName + ']', comConn, 1, 3)

# fldLST is the list of field names; values is the list of row tuples.
for f in values:
    rs.AddNew(fldLST, f)

rs.Update()

Related: Slow Handling of executemany() · Issue #120 · mkleehammer/pyodbc (GitHub): even for routine updates and inserts, the reporter writes the rows to a file and bcp's them in (Microsoft ODBC Driver 13.1 for SQL Server on Linux, CentOS 6.8). See also: How to speed up bulk insert to MS SQL Server from CSV using pyodbc, where row-by-row inserts took up to 40 minutes for ~300,000 rows.

Trying to insert +2M rows into MSSQL using pyodbc was taking an absurdly long amount of time compared to bulk operations in Postgres (psycopg2) and Oracle (cx_Oracle). I did not have the privileges to use the BULK INSERT operation, but was able to solve the problem with the method below.

Many solutions correctly suggested fast_executemany; however, there are some tricks to using it correctly. First, I noticed that pyodbc was committing after each row when autocommit was set to True in the connect method, so it must be set to False. I also observed a non-linear slowdown when inserting more than ~20k rows at a time: inserting 10k rows was sub-second, but 50k took upwards of 20 s. I assume the transaction log gets quite large and slows the whole thing down. Therefore, you must chunk your insert and commit after each chunk. I found 5k rows per chunk delivered good performance, but this obviously depends on many factors (the data, the machine, DB config, etc.).

import pyodbc

CHUNK_SIZE = 5000

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):  # use xrange() instead on Python 2
        yield l[i:i + n]

mssql_conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                            server='<SERVER,PORT>',
                            timeout=1,
                            port=<PORT>,
                            uid=<UNAME>, 
                            pwd=<PWD>,
                            TDS_Version=7.2,
                            autocommit=False) #IMPORTANT

mssql_cur = mssql_conn.cursor()
mssql_cur.fast_executemany = True #IMPORTANT

params = [tuple(x) for x in df.values]  # df is the pandas DataFrame holding the rows to insert

stmt = "truncate table <THE TABLE>"
mssql_cur.execute(stmt)
mssql_conn.commit()

stmt = """
INSERT INTO <THE TABLE> (field1...fieldn) VALUES (?,...,?)
"""
for chunk in chunks(params, CHUNK_SIZE): #IMPORTANT
    mssql_cur.executemany(stmt, chunk)
    mssql_conn.commit()

Using a separate transaction won't make a blind bit of difference. It's all about the number of round-trips to the database server. With a parameter array, there's one round-trip for all the updates. Without a parameter array, there's one round-trip per update row, which is very, very slow.

pyodbc 4.0.19 added a Cursor#fast_executemany option to help address this issue. See this answer for details.
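
Applied to the code in the question, enabling that flag is a one-line change. A minimal sketch, assuming pyodbc 4.0.19+ and a recent Microsoft ODBC driver (fast_executemany may not work with older drivers such as SQL Server Native Client 10.0):

import pyodbc
from datetime import date

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')

rows = [(1, 'abc', date.today()) for _ in range(10000)]

cursor = conn.cursor()
# With fast_executemany the parameters are sent to the server as an
# array in one batch instead of one round-trip per row.
cursor.fast_executemany = True
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()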

Or just export the data to a CSV and then use BULK INSERT (which is very, very fast). You will have to build a format file, but it might be worth it. – bsheehy Apr 21 '15 at 20:24

I wrote the data to a text file and then invoked the BCP utility. Much, much quicker: from about 20 to 30 minutes down to a few seconds.
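
A sketch of that approach: dump the rows to a delimited text file and shell out to bcp. The table, server, credentials and delimiter below are placeholders; -c (character mode), -t (field terminator), -S, -U and -P are standard bcp switches, but check them against your bcp version:

import csv
import subprocess
from datetime import date

rows = [(1, 'abc', date.today().isoformat()) for _ in range(10000)]

# Write the rows to a pipe-delimited text file that bcp can load.
with open('test_insert.dat', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerows(rows)

# Load the file with bcp; server name and credentials are placeholders.
subprocess.run(
    ['bcp', 'test.dbo.test_insert', 'in', 'test_insert.dat',
     '-S', 'localhost', '-U', 'xxx', '-P', 'yyy', '-c', '-t', '|'],
    check=True,
)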

Tested on Ubuntu 16.04 with Python 2.7 and pyodbc 4.0.23 (pymssql+FreeTDS and pypyodbc were also tried) against the Microsoft ODBC driver. Using chunksize helps with the memory, but the speed is still very slow. My impression, looking at the MSSQL database trace, is that the insertion is actually performed one row at a time. The only viable approach now is to dump to a CSV file on a shared folder and use BULK INSERT. But that is very annoying and inelegant!
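
For completeness, a sketch of that BULK INSERT route issued through pyodbc. The share path and file layout are placeholders; FIELDTERMINATOR, ROWTERMINATOR and FIRSTROW are standard T-SQL BULK INSERT options, and the file path is resolved on the SQL Server machine, so it has to be somewhere the server itself can read:

import pyodbc

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy',
                      autocommit=True)

# FIRSTROW = 2 skips a header line in the CSV.
sql = r"""
BULK INSERT dbo.test_insert
FROM '\\fileshare\exports\test_insert.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)
"""
conn.cursor().execute(sql)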

See also Microsoft's Designing Performance-Optimized ODBC Applications, which recommends passing arrays of parameter values for bulk insert operations. Pandas' to_sql lets anyone with a pyodbc engine send their DataFrame into SQL. Unfortunately, this method is really slow: it creates a transaction for every row, which means every insert locks the table. This leads to poor performance (I got about 25 records a second), so I thought I would just use the pyodbc driver directly.
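
If you would rather keep using to_sql instead of dropping to pyodbc directly, newer SQLAlchemy releases (1.3+) can turn on the same flag for the mssql+pyodbc dialect. A minimal sketch, with the connection URL details as placeholders:

import pandas as pd
import sqlalchemy

# fast_executemany=True makes the mssql+pyodbc dialect enable
# cursor.fast_executemany under the hood (SQLAlchemy 1.3+).
engine = sqlalchemy.create_engine(
    'mssql+pyodbc://xxx:yyy@localhost/test?driver=ODBC+Driver+17+for+SQL+Server',
    fast_executemany=True,
)

df = pd.DataFrame({'col1': [1], 'col2': ['abc'], 'col3': [pd.Timestamp.today().date()]})

# chunksize keeps each round of inserts (and the transaction log) bounded,
# as discussed above.
df.to_sql('test_insert', engine, if_exists='append', index=False, chunksize=5000)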

If you have no bulk-insert rights in MS SQL, the commit range must be benchmarked based on database performance, disk I/O, etc. So, for example, have a for loop go through chunks of 10,000 rows (my_data_frame.to_sql(TableName, engine, chunksize=10000)). – bsheehy Apr 21 '15 at 20:20

Comments
  • Try using an explicit transaction.
  • Reading stackoverflow.com/questions/1063770/…, it doesn't seem like pyodbc has support for explicit transactions.
  • That's not the way I read it. You turn off auto-commit, and have to explicitly call rollback or commit. However, I have no idea if it makes a difference or not, but it would be something I would try myself.
  • What you described is exactly what my code does. Autocommit is off by default.
  • I don't see any reason for this to be slow. What version of SQL Server, and is the installation a standard installation, i.e. no funny configs etc? Like running databases from USB etc? You can also try and attach SQL Profiler to the db and see if you can spot where the inefficiency comes from, but your equivalent code in c# executes in less than 3 seconds on my pc.
  • Thanks for confirmation and tips. I have filed this as code.google.com/p/pyodbc/issues/detail?id=250