Faster way to convert binary strings to integers in Python 3.6+?

I have billions of lines of strings like this in memory: 1010101110100111100100101. I need to convert each one into a list of integers. It takes minutes, which seems too slow. My code:

def string2vec(binary_str):
    # Split the 64-character string into eight 8-character chunks, each parsed as base 2
    return [int(binary_str[i:i + 8], base=2) for i in range(0, 64, 8)]

result = [string2vec(l) for l in lines]  # this line is slow


binary_str is 64 characters long, and every 8 binary characters become one integer.

All that string slicing and Python looping is expensive. Use int(s,2) to convert the whole binary string to an integer. Then use the array module to store the results as 64-bit integers and reinterpret them as 8-bit integers. You can decide whether you want big- or little-endian byte order:

import random
import time
import array

ints = [random.randrange(1<<64) for _ in range(1000)] # Make 1000 integers
strs = [f'{n:064b}' for n in ints]                    # Represent as binary strings
print(f'{ints[0]:016X} {strs[0]}')

start = time.perf_counter()
ints2 = [int(s,2) for s in strs]  # convert all the strings to integers
a = array.array('Q',ints2)        # Store the converted integers in an array.  Q = quadwords (64-bit ints)
a.byteswap()                      # Optional if you want the opposite endian-ness of your machine.
b = array.array('B')              # Another array of bytes
b.frombytes(a.tobytes())          # Populate byte array with the bytes from the quadword array.
print(time.perf_counter() - start)

assert ints == ints2
print([hex(n) for n in b[:8]])

Output:

1E27DFA21406A338 0001111000100111110111111010001000010100000001101010001100111000
0.0005346000000372442
['0x1e', '0x27', '0xdf', '0xa2', '0x14', '0x6', '0xa3', '0x38']

My machine is little-endian (most are). This converts one thousand 64-digit binary strings into integers, stores them in an array, byte swaps them to represent big-endian, then remaps the bytes of the array to a byte array...all in 534.6 microseconds on my machine. I've displayed the first 64-character string and its hexadecimal representation, and the first 8 bytes of the final result. If you truly have "billions" of these strings, it'll take about 9 minutes per billion, but don't read them all into memory at once :)
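
For the "don't read them all into memory at once" point, here is a rough sketch (not from the original answer) of applying the same array trick chunk by chunk while streaming from a file; the file name, chunk size, and helper names are made up:

import array

def _to_bytes(ints64):
    a = array.array('Q', ints64)   # 64-bit unsigned integers
    a.byteswap()                   # big-endian byte order on a little-endian machine
    b = array.array('B')
    b.frombytes(a.tobytes())       # reinterpret the quadwords as individual bytes
    return b

def chunks_of_bytes(path, chunk_size=100_000):
    # Hypothetical sketch: read 64-char binary strings from a file, one per line,
    # and yield one flat byte array per chunk
    with open(path) as f:
        batch = []
        for line in f:
            batch.append(int(line.strip(), 2))   # parse each 64-char line as one 64-bit int
            if len(batch) == chunk_size:
                yield _to_bytes(batch)
                batch.clear()
        if batch:
            yield _to_bytes(batch)

for byte_chunk in chunks_of_bytes('lines.txt'):  # 'lines.txt' is a made-up file name
    pass                                         # do something with byte_chunk here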

EDIT: It appears this functionality may be built into Python; see the comments. I'll leave this answer because it provides a minimal working example of a C library for Python that manipulates arrays, which I was not able to find elsewhere online.

I agree with many of the comments that something has clearly gone wrong if you have a bunch of binary strings in human-readable format sitting around in memory. However, if there are reasons outside your control that this can't be avoided, you could try writing the relevant functionality in C. Here's a straightforward example to start from:

#include <Python.h>

static PyObject * binary_string(PyObject * self, PyObject * args);

static PyMethodDef PyBinaryString_methods[] =
{
  { "binary_string", binary_string, METH_VARARGS, "binary string" },
  { NULL, NULL, 0, NULL }
};

static struct PyModuleDef PyBinaryString_module =
{
  PyModuleDef_HEAD_INIT,
  "PyBinaryString",
  "Binary String",
  -1,
  PyBinaryString_methods
};

PyMODINIT_FUNC PyInit_PyBinaryString(void)
{
  return PyModule_Create(&PyBinaryString_module);
}

static PyObject * binary_string(PyObject * self, PyObject * args)
{
  const char * string;

  char buf[8];

  if(!PyArg_ParseTuple(args, "s", &string)) { return NULL; }

  for(int i = 0; i < 8; i++)
  {
    buf[i] = 0;

    for(int j = 0; j < 8; j++)
    {
      buf[i] |= (string[8 * i + j] & 1) << (7 - j);
    }
  }

  return PyByteArray_FromStringAndSize(buf, 8);
}

Here I'm exploiting the fact that the string is going to consist of ASCII '0' and '1' characters exclusively, and that the ASCII code for the former is even whereas the ASCII code for the latter is odd.
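
As a quick sanity check of that parity trick in plain Python (this is just an illustration, not part of the extension):

print(ord('0'), ord('1'))   # 48 49 -- '0' is even, '1' is odd, so (c & 1) recovers the bit

def pack_bits(s):
    # Pure-Python mirror of the C loop: pack a 64-char '0'/'1' string into 8 bytes
    buf = bytearray(8)
    for i in range(8):
        for j in range(8):
            buf[i] |= (ord(s[8 * i + j]) & 1) << (7 - j)
    return bytes(buf)

print(pack_bits("1" * 56 + "0" * 8))   # b'\xff\xff\xff\xff\xff\xff\xff\x00'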

On my system I can compile this via

cc -fPIC -shared -O3 -I/usr/include/python -o PyBinaryString.so PyBinaryString.c

and then use it in Python like so:

>>> from PyBinaryString import binary_string
>>> binary_string("1111111111111111111111111111111111111111111111111111111100000000")
bytearray(b'\xff\xff\xff\xff\xff\xff\xff\x00')

I'm not a Python programmer, so someone might be able to provide a better way of getting data in/out of the python object formats. However on my machine this runs about an order of magnitude faster than the native python version.

If you know more about the layout in memory -- say if you know that all the strings of ASCII '0' and '1' characters are contiguous -- you could have the C code convert everything at once, which would probably speed things up further.
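
As a rough pure-Python analogue of that "convert everything at once" idea (assuming the strings have already been concatenated into one long run of '0' and '1' characters):

# Two 64-character strings joined into one contiguous run (example data)
big = ("1111111111111111111111111111111111111111111111111111111100000000"
       "0000000000000000000000000000000000000000000000000000000011111111")

packed = int(big, 2).to_bytes(len(big) // 8, 'big')   # one parse, one conversion to bytes
print(packed)  # b'\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\xff'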

Since there are only 2^8 = 256 possible values, you could construct a lookup table (in the form of a dict) containing 8-character strings as keys and the corresponding integers as values.
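
A minimal sketch of that idea (the table and function names here are just for illustration):

# Precompute all 256 possible 8-character chunks once
table = {f'{n:08b}': n for n in range(256)}   # '00000000' -> 0, ..., '11111111' -> 255

def string2vec_lut(binary_str):
    # Look up each 8-character slice instead of calling int(s, 2) on it
    return [table[binary_str[i:i + 8]] for i in range(0, 64, 8)]

print(string2vec_lut('00000001' * 8))   # [1, 1, 1, 1, 1, 1, 1, 1]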

Comments
  • Have you tried parsing it 64 bits at a time? Then you can extract the 8-bit chunks from those "large" integers using bit-shift operations (see the sketch after these comments).
  • This is a perfect example of why you do not store numbers as binary (0,1) characters. How do these numbers get into memory? Read from a file? Then you should think about using awk or C to parse your data.
  • You may need to show more of your code; specifically how you load it into memory from files. The bottleneck may not be in the conversion but in how you're "loading it into memory"
  • OK, that means it is almost 60,000,000 lines. Each line creates 8 ints, which means in total you are allocating at least 480MB of data just for the integers. The bottleneck is not really caused by your string-to-int logic then. A simple [ [i, i+1, i+2, i+3, i+4, i+5, i+6, i+7] for i in range(50000000)] is taking me forever to run. Instead of using a list comprehension and creating a whole list (which could cost you over 1GB of memory), would you consider using a generator expression instead?
  • I ran an experiment with 64-bit int parsing + bitwise operations. It takes less than half the time. Can the OP advise in what case the result is not correct?
  • int(s,2).to_bytes(8,'big') does the same thing.
  • Thank you. I have tried it, but it only improved a little.
  • It is not free to calculate the hash code of the string. I doubt the hash lookup is going to be noticeably faster than int(str, base).
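
A quick sketch of the 64-bits-at-a-time idea from the comments above (parse the whole line once, then split it into bytes with bit shifts or int.to_bytes):

s = '00000001' * 8            # any 64-character binary string
n = int(s, 2)                 # a single parse for the whole line

# Extract the eight 8-bit chunks with bit shifts, most significant byte first
print([(n >> shift) & 0xFF for shift in range(56, -8, -8)])   # [1, 1, 1, 1, 1, 1, 1, 1]

# Equivalent one-liner with int.to_bytes, as noted in the comments
print(list(n.to_bytes(8, 'big')))                             # [1, 1, 1, 1, 1, 1, 1, 1]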