Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC


I would like to send a large pandas.DataFrame to a remote server running MS SQL. The way I do it now is by converting a data_frame object to a list of tuples and then sending it off with pyODBC's executemany() function. It goes something like this:

 import pyodbc as pdb

 list_of_tuples = convert_df(data_frame)

 connection = pdb.connect(cnxn_str)

 cursor = connection.cursor()
 cursor.fast_executemany = True
 cursor.executemany(sql_statement, list_of_tuples)
 connection.commit()

 cursor.close()
 connection.close()

I then started to wonder whether things could be sped up (or at least made more readable) by using the data_frame.to_sql() method. I came up with the following solution:

 import sqlalchemy as sa

 engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % cnxn_str)
 data_frame.to_sql(table_name, engine, index=False)

Now the code is more readable, but the upload is at least 150 times slower...

Is there a way to flip the fast_executemany flag when using SQLAlchemy?

I am using pandas-0.20.3, pyODBC-4.0.21 and sqlalchemy-1.1.13.

EDIT (2019-03-08): Gord Thompson commented below with good news from the SQLAlchemy changelog: since SQLAlchemy 1.3.0, released 2019-03-04, the mssql+pyodbc dialect supports engine = create_engine(sqlalchemy_url, fast_executemany=True). In other words, it is no longer necessary to define a function and use @event.listens_for(engine, 'before_cursor_execute'); the function below can be removed and only the flag needs to be set in the create_engine call, while still retaining the speed-up.
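For reference, a minimal sketch of the 1.3.0+ usage (sqlalchemy_url, data_frame and table_name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# SQLAlchemy 1.3.0+: enable pyodbc's fast_executemany directly on the engine
engine = create_engine(sqlalchemy_url, fast_executemany=True)
data_frame.to_sql(table_name, engine, index=False)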

Original Post:

Just made an account to post this. I wanted to comment beneath the above thread, as it is a follow-up to the answer already provided. The solution above worked for me with the Version 17 SQL driver, writing to a Microsoft SQL storage from an Ubuntu-based install.

The complete code I used to speed things up significantly (we are talking a >100x speed-up) is below. This is a turn-key snippet, provided that you alter the connection string with your relevant details. To the poster above: thank you very much for the solution, as I had been looking for this for quite some time already.

import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus


# Build the SQLAlchemy engine from an ODBC connection string
conn = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=IP_ADDRESS;DATABASE=DataLake;UID=USER;PWD=PASS"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)


# Switch on pyodbc's fast_executemany for every executemany-style statement
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    print("FUNC call")
    if executemany:
        cursor.fast_executemany = True


table_name = 'fast_executemany_test'
df = pd.DataFrame(np.random.random((10**4, 100)))


# Time the upload of 10**4 rows x 100 columns
s = time.time()
df.to_sql(table_name, engine, if_exists='replace', chunksize=None)
print(time.time() - s)

Based on the comments below, I wanted to take some time to explain some limitations of the pandas to_sql implementation and the way the query is handled. As far as I can tell, there are two things that might cause the MemoryError to be raised:

1) Assuming you're writing to a remote SQL storage: when you try to write a large pandas DataFrame with the to_sql method, it converts the entire DataFrame into a list of values. This transformation takes up far more RAM than the original DataFrame does (on top of it, since the old DataFrame still remains in RAM). This list is handed to the final executemany call of your ODBC connector, and I think the ODBC connector has some trouble handling such large queries. A way to solve this is to give the to_sql method a chunksize argument (10**5 seemed to be around optimal, giving about 600 mbit/s (!) write speeds on a 2-CPU, 7 GB RAM MSSQL storage application from Azure - can't recommend Azure, by the way). So the first limitation, the query size, can be circumvented by providing a chunksize argument, as in the sketch below. However, this still won't let you write a DataFrame of size 10**7 or larger (at least not on the VM I am working with, which has ~55 GB RAM), which is issue nr. 2.
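For illustration, a hedged sketch of passing chunksize (10**5 is simply the value reported above, not a universal optimum; table_name and engine are assumed to exist):

# Write in batches of 10**5 rows so pandas never builds one enormous
# parameter list for a single executemany call.
df.to_sql(table_name, engine, if_exists='replace', index=False, chunksize=10**5)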

The second issue can be circumvented by breaking up the DataFrame with np.split (into chunks of about 10**6 rows each), which can then be written away iteratively. I will try to make a pull request once I have a solution ready for the to_sql method in the core of pandas itself, so you won't have to do this pre-splitting every time. Anyhow, I ended up writing a function similar (not turn-key) to the following:

import math

import numpy as np
import pandas as pd

def write_df_to_sql(df, **kwargs):
    # Split the frame into chunks of at most ~10**6 rows (np.array_split,
    # unlike np.split, does not require the row count to divide evenly)
    # and write each chunk away separately.
    chunks = np.array_split(df, max(1, math.ceil(df.shape[0] / 10**6)))
    for chunk in chunks:
        chunk.to_sql(**kwargs)
    return True

A more complete example of the above snippet can be viewed here: https://gitlab.com/timelord/timelord/blob/master/timelord/utils/connector.py

It's a class I wrote that incorporates the patch and eases some of the overhead that necessarily comes with setting up connections to SQL. I still have to write some documentation. I was also planning to contribute the patch to pandas itself, but haven't found a nice way to do so yet.

I hope this helps.


After contacting the developers of SQLAlchemy, a way to solve this problem has emerged. Many thanks to them for the great work!

One has to use a cursor execution event and check whether the executemany flag has been raised. If that is indeed the case, switch the fast_executemany option on. For example:

from sqlalchemy import event

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

More information on execution events can be found here.
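For completeness, a short usage sketch (data_frame, table_name and engine are placeholders): the listener must be registered on the same engine object that is later handed to to_sql, after which ordinary to_sql calls pick up the fast path automatically.

# The event hook above is already attached to `engine`; any bulk insert
# issued through that engine now runs with cursor.fast_executemany = True.
data_frame.to_sql(table_name, engine, index=False)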


UPDATE: Support for fast_executemany of pyodbc was added in SQLAlchemy 1.3.0, so this hack is no longer necessary.


I ran into the same problem, but using PostgreSQL. Pandas version 0.24.0 has just been released, and it adds a new parameter to the to_sql function called method, which solved my problem.

from sqlalchemy import create_engine

engine = create_engine(your_options)
data_frame.to_sql(table_name, engine, method="multi")

Upload speed is 100x faster for me. I also recommend setting the chunksize parameter if you are going to send lots of data.
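A hedged example combining both parameters (the chunksize value is illustrative, not tuned):

# method="multi" packs many rows into each INSERT statement;
# chunksize bounds how many rows are sent per statement / round trip.
data_frame.to_sql(table_name, engine, method="multi", chunksize=1000, index=False)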


I just wanted to post this full example as an additional, high-performance option for those who can use the new turbodbc library: http://turbodbc.readthedocs.io/en/latest/

There clearly are many options in flux between pandas .to_sql(), triggering fast_executemany through sqlalchemy, using pyodbc directly with tuples/lists/etc., or even trying BULK UPLOAD with flat files.

Hopefully, the following will make life a bit more pleasant as functionality evolves in the current pandas project, or if something like turbodbc integration is included in the future.

import pandas as pd
import numpy as np
from turbodbc import connect, make_options
from io import StringIO

test_data = '''id,transaction_dt,units,measures
               1,2018-01-01,4,30.5
               1,2018-01-03,4,26.3
               2,2018-01-01,3,12.7
               2,2018-01-03,3,8.8'''

df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])

options = make_options(parameter_sets_to_buffer=1000)
conn = connect(driver='{SQL Server}', server='server_nm', database='db_nm', turbodbc_options=options)

test_query = '''DROP TABLE IF EXISTS [db_name].[schema].[test]

                CREATE TABLE [db_name].[schema].[test]
                (
                    id int NULL,
                    transaction_dt datetime NULL,
                    units int NULL,
                    measures float NULL
                )

                INSERT INTO [db_name].[schema].[test] (id,transaction_dt,units,measures)
                VALUES (?,?,?,?) '''

cursor = conn.cursor()
cursor.executemanycolumns(test_query, [df_test['id'].values,
                                       df_test['transaction_dt'].values,
                                       df_test['units'].values,
                                       df_test['measures'].values])
conn.commit()

turbodbc should be VERY fast in many use cases (particularly with numpy arrays). Please observe how straightforward it is to pass the underlying numpy arrays from the dataframe columns as parameters to the query directly. I also believe this helps prevent the creation of intermediate objects that spike memory consumption excessively. Hope this is helpful!


It seems that pandas 0.23.0 and 0.24.0 use multi-values inserts with PyODBC, which prevents fast executemany from helping - a single INSERT ... VALUES ... statement is emitted per chunk. The multi-values insert chunks are an improvement over the old slow executemany default, but at least in simple tests the fast executemany method still prevails, not to mention that it needs no manual chunksize calculations, as multi-values inserts do. Forcing the old behaviour can be done by monkeypatching, if no configuration option is provided in the future:

import pandas.io.sql

def insert_statement(self, data, conn):
    return self.table.insert(), data

pandas.io.sql.SQLTable.insert_statement = insert_statement

The future is here: at least in the master branch the insert method can be controlled using the method= keyword argument of to_sql(). It defaults to None, which forces the executemany method. Passing method='multi' results in using the multi-values insert. It can even be used to implement DBMS-specific approaches, such as PostgreSQL COPY.
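As an example of such a DBMS-specific approach, the pandas documentation sketches a callable for PostgreSQL COPY roughly along these lines (simplified here and only valid for psycopg2-backed engines; treat it as a sketch rather than a drop-in):

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Stream the rows through PostgreSQL's COPY ... FROM STDIN for fast bulk loading.
    dbapi_conn = conn.connection          # raw psycopg2 connection behind SQLAlchemy
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

# Usage: data_frame.to_sql(table_name, engine, method=psql_insert_copy)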


Comments
  • I think it is not related, as the original question was about speeding up the to_sql method. You are now asking about an error with an argument in that same method, which is no longer related to the original question - afaik. Just trying to adhere to the norms of SO that I normally see. Regarding the extra information you've provided now: perhaps the error is raised because the already-present table is of a different size and thus cannot be appended to (type error)? Also, the last code snippet I provided was for illustration purposes; you probably need to alter it somewhat.
  • Not sure why I haven't shared this before but here is the class I use often for getting dataframes in and out of a SQL database: gitlab.com/timelord/timelord/blob/master/timelord/utils/… Enjoy!
  • @erickfis I've updated the class with a proper example. Do note that not every database uses the same driver, and such databases will thus raise an error when using this class. An example database that does not use it is PostgreSQL. I haven't found a fast way yet to insert data into PSQL. One way to still use this class in that case is by explicitly turning the switch off by calling con._init_engine(SET_FAST_EXECUTEMANY_SWITCH=False) after having initialized the class. Good luck.
  • @hetspookjee - Since this is the most popular answer by far, please consider updating it to mention that SQLAlchemy 1.3.0, released 2019-03-04, now supports engine = create_engine(sqlalchemy_url, fast_executemany=True) for the mssql+pyodbc dialect. I.e., it is no longer necessary to define a function and use @event.listens_for(engine, 'before_cursor_execute'). Thanks.
  • Thanks Gord Thompson for the update! I have set your comment to the top and also made a community wiki article out of my post for future updates.
  • Thanks so much for doing the legwork on this. Just for clarity's sake, should this decorator and function be declared before instantiating a SQLAlchemy engine?
  • You're most welcome. I declare it right after instantiating the engine in the constructor of a class.
  • so this removes the need for the pyodbc specific connection code? just need to call to_sql() after this function?
  • I tried just calling to_sql directly after the function, but it didn't speed anything up
  • @J.K. - Please consider updating your answer to mention that SQLAlchemy 1.3.0, released 2019-03-04, now supports engine = create_engine(sqlalchemy_url, fast_executemany=True) for the mssql+pyodbc dialect. I.e., it is no longer necessary to define a function and use @event.listens_for(engine, 'before_cursor_execute'). Thanks.