Python code optimisation for an SQL insert

I have the following code and I'm running it on some big data (2 hours processing time). I'm looking into CUDA for GPU acceleration, but in the meantime, can anyone suggest ways to optimise the following code?

It takes a 3D point from dataset 'T' and finds the point in another point dataset, 'B', with the minimum distance to it.

Is there any time saved by sending the results to a list first and then inserting them into the database table?

All suggestions welcome

    conn = psycopg2.connect("<details>")
    cur = conn.cursor()

    for i in range(len(B)):
        i2 = i + 1
        # point=T[i]
        point = B[i:i2]
        # print(B[i])
        # print(B[i:i2])
        disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
        print("Base: ", end='')
        print(i, end='')
        print(" of ", end='')
        print(len(B), end='')
        print(" ", end='')
        print(disti)

        cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""",
                    (xi[i], yi[i], zi[i], disti))
        conn.commit()

    cur.close()
@@@@@@@@@@@@@@ EDIT @@@@@@@@@@@@@

Code update:

    conn = psycopg2.connect("dbname=kap_pointcloud host=localhost user=postgres password=Gnob2009")
    cur = conn.cursor()

    disti = []

    for i in range(len(T)):
        i2 = i + 1
        point = T[i:i2]
        disti.append(scipy.spatial.distance.cdist(point, B, metric='euclidean').min())
        print("Top: " + str(i) + " of " + str(len(T)))

Insert code to go here once I figure out the syntax

@@@@@@@@ EDIT @@@@@@@@

The solution, with a lot of help from Alex:

    cur = conn.cursor()
    from scipy.spatial.distance import cdist

    # list for accumulating insert-params
    insert_params = []
    for i in range(len(T)):
        XA = [T[i]]
        disti = cdist(XA, B, metric='euclidean').min()
        insert_params.append((xi[i], yi[i], zi[i], disti))
        print("Top: " + str(i) + " of " + str(len(T)))

    # Only one instruction to insert everything
    cur.executemany("INSERT INTO pc_processing.pc_dist_top_tmp (x,y,z,dist) values (%s, %s, %s, %s)",
                    insert_params)
    conn.commit()

For timing comparison:

Initial code took: 0:00:50.225644

Without multiline prints: 0:00:47.934012

Taking commit out of the loop: 0:00:25.411207

I'm assuming the only way to make it faster is to get CUDA working?

OK. Let's accumulate all suggestions from comments.

Suggestion 1. Commit as rarely as possible, and don't print at all
    conn = psycopg2.connect("<details>")
    cur = conn.cursor()

    for i in range(len(B)):
        i2 = i + 1
        point = B[i:i2]
        disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
        cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""", (xi[i], yi[i], zi[i], disti))

    conn.commit()  # Note that you commit only once. Be careful with **really** big chunks of data
    cur.close()

If you really need debug information inside your loops, use logging. You will be able to turn the logging output on and off as needed.
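A minimal sketch of that logging variant (my own illustration, assuming B, T, xi, yi, zi, conn and cur are already set up as in the snippets above; the logger name "pc_dist" is arbitrary):

    import logging
    from scipy.spatial.distance import cdist

    # Configure once; raise the level to logging.WARNING to silence the per-row output
    logging.basicConfig(format="%(asctime)s %(message)s", level=logging.DEBUG)
    log = logging.getLogger("pc_dist")

    for i in range(len(B)):
        point = B[i:i + 1]
        disti = cdist(point, T, metric='euclidean').min()
        log.debug("Base: %d of %d %.6f", i, len(B), disti)
        cur.execute("INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)",
                    (xi[i], yi[i], zi[i], disti))

    conn.commit()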

Suggestion 2. executemany to the rescue
    conn = psycopg2.connect("<details>")
    cur = conn.cursor()
    insert_params = []  # list for accumulating insert-params

    for i in range(len(B)):
        i2 = i + 1
        point = B[i:i2]
        disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
        insert_params.append((xi[i], yi[i], zi[i], disti))

    # Only one instruction to insert everything
    cur.executemany("INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)", insert_params)
    conn.commit()
    cur.close()
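
As a side note (my addition, not part of the original answer): psycopg2 also ships execute_values in psycopg2.extras, which folds many rows into a single multi-row INSERT and is usually faster than executemany. A sketch reusing the insert_params list built above:

    from psycopg2.extras import execute_values

    # One multi-row INSERT per page_size rows instead of one statement per row
    execute_values(cur,
                   "INSERT INTO pc_processing.pc_dist_base_tmp (x, y, z, dist) VALUES %s",
                   insert_params,
                   page_size=1000)
    conn.commit()
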
Suggestion 3. Don't use psycopg2 at all. Use BULK operations

Instead of cur.execute and conn.commit, write a CSV file, and then use COPY FROM the created file.

The bulk solution should provide the best performance, but it needs some effort to get working.
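
A minimal sketch of that COPY approach (my own illustration, assuming the insert_params list from Suggestion 2 has already been built); io.StringIO keeps the "file" in memory, but writing a real CSV to disk and loading it works the same way:

    import io

    # Build an in-memory, tab-separated buffer from the accumulated rows
    buf = io.StringIO()
    for x, y, z, dist in insert_params:
        buf.write("{}\t{}\t{}\t{}\n".format(x, y, z, dist))
    buf.seek(0)

    # Load all rows in a single bulk COPY
    cur.copy_expert("COPY pc_processing.pc_dist_base_tmp (x, y, z, dist) FROM STDIN", buf)
    conn.commit()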

Choose for yourself what is appropriate, depending on how much speed you need.

Good luck

There are two solutions:

1) Try to do a single commit, or commit in chunks if len(B) is very large.

2) You can prepare a list of the data you are inserting and do a bulk insert.

e.g. (one array per column):

    insert into pc_processing.pc_dist_base_tmp (x, y, z, dist)
    select * from unnest(array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4]);
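
A hedged sketch of how that unnest insert could be driven from Python with psycopg2, which adapts plain Python lists to PostgreSQL arrays (the ::float8[] casts assume the columns are double precision; numpy values should be converted to plain floats first, since psycopg2 will not adapt them by default):

    # Assumes xi, yi, zi and dists are equal-length sequences of numbers
    cur.execute(
        """INSERT INTO pc_processing.pc_dist_base_tmp (x, y, z, dist)
           SELECT * FROM unnest(%s::float8[], %s::float8[], %s::float8[], %s::float8[])""",
        ([float(v) for v in xi],
         [float(v) for v in yi],
         [float(v) for v in zi],
         [float(v) for v in dists]))
    conn.commit()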

Try committing when the loop is finished instead of every single iteration

Comments
  • I suppose you're overcommitting here. Can you: a) move conn.commit outside of the loop, b) prepare the data in the loop and then executemany from the prepared data, c) use bulk loading (COPY) from the prepared data?
  • And prints are not free. Remove them, or at least make only 1 print instead of 7.
  • Thanks, moving the print lines to a single line saved 3 seconds for 8,000 records; seeing as I'm using 80,000+ records, that should be a 30+ second saving. But I suspect the real savings will be in the commit stage when communicating with the database.
  • Redirecting to /dev/null will be even faster. BUT the real savings will be: a) commit larger chunks of work, b) executemany, c) BULK operations.
  • I've tried executemany: cur.executemany('insert into pc_processing.pc_dist_base_tmp(x) values (%s)', [(x,) for x in xi]) works, but if I add ...values (%s, %s)', [(x,) for x in xi], [(y,) for y in yi]) it complains about too many variables (see the sketch after these comments).
  • This works with your example, but I'm trying to pass an array and it fails: cur.execute(""" insert into pc_processing.pc_dist_base_tmp(x, y, z, dist) select * from unnest(xi, yi, zi, disti); """). All arrays are the same length, but it complains that the column xi does not exist, which is not the case.
  • xi was a list object, so I did xi = np.asarray(xi), but the error persists.
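
Regarding the "too many variables" error above: executemany expects a single sequence of parameter tuples, one tuple per row, rather than one list per column. A hedged sketch (assuming xi, yi, zi and the accumulated distances are equal-length lists):

    # Zip the per-column lists into one list of row tuples
    rows = list(zip(xi, yi, zi, disti))   # [(x0, y0, z0, d0), (x1, y1, z1, d1), ...]
    cur.executemany(
        "INSERT INTO pc_processing.pc_dist_base_tmp (x, y, z, dist) values (%s, %s, %s, %s)",
        rows)
    conn.commit()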