Byte Multiplication Trick

Tags

GPU, Mandelbrot, Python code, software design

I’ve been working on an arbitrary-precision numeric class (in Python) that stores numbers in what amounts to base-256 — that is to say, in machine-native binary. It differs from the variable-length integers in Python by supporting fractions (and from Python’s Decimal number type by being binary).

It occurred to me I could implement multiplication with a lookup table rather than actually doing the math (at the CPU level, that may in fact be what is going on). So, I thought I’d compare the two implementations.

One might wonder what the point of an arbitrary-precision binary numeric type is given that Python’s Decimal type is more precise (and no doubt faster). But my goal lies beyond Python. I’m just using Python to develop an algorithm for calculating deep zooms in the Mandelbrot fractal. My idea is to use the many cores in a GPU to do parallel processing. Each pixel of a Mandelbrot rendering stands alone; there is no link to nearby pixels.

Which makes Mandelbrot renderings perfect for parallel processing. I want that processing to be as fast as possible, so I want to program as “close to the metal” as possible. Something like C or C++, if not assembly level. My ~~ultimate goal~~ fantasy is a “Mandelbrot Machine” capable of fairly deep real-time zooms.

[At this point, I’m not sure why something like that wouldn’t already exist. I doubt the idea is unique to me. Using a GPU is too obvious.]

The Mandelbrot equation is simple:

$\displaystyle{z}_{1}=({z}_{0}\times{z}_{0})+{C}$

Both z and C are complex numbers. The latter, C, is the coordinate (aka pixel) being rendered, and is the same throughout the calculation (it is effectively a constant). The former, z, starts at zero. The calculated z is plugged back into the equation to generate another calculated z. This continues until either 2<|z| or an iteration maximum is reached.

The formula normally uses z² rather than z × z, but I used multiplication to show how the algorithm only requires multiplication and addition. The decimal point matters for a correct answer (and to align addition), but we can ignore it when comparing different multiplication techniques.
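As a minimal sketch of that iteration (using ordinary Python floats rather than the arbitrary-precision class, and with the hypothetical names escape_count and max_iter), the loop below splits z into real and imaginary parts so that only multiplication and addition (counting subtraction as adding a negative) are needed:

```python
def escape_count(c_re: float, c_im: float, max_iter: int = 100) -> int:
    """Iterate z = z*z + C from z = 0 and return the iteration at which |z| > 2.

    Returns max_iter if z never escapes within the iteration limit.
    """
    z_re, z_im = 0.0, 0.0
    for n in range(max_iter):
        # |z| > 2 is equivalent to |z|^2 > 4, which avoids a square root.
        if z_re * z_re + z_im * z_im > 4.0:
            return n
        # (a + bi)^2 = (a*a - b*b) + (2*a*b)i, then add C.
        z_re, z_im = z_re * z_re - z_im * z_im + c_re, 2.0 * z_re * z_im + c_im
    return max_iter


print(escape_count(0.0, 0.0))  # C = 0 never escapes, so this hits max_iter
print(escape_count(1.0, 1.0))  # C = 1+1i escapes after a couple of iterations
```

Each call depends only on its own C, which is what makes the per-pixel work independent.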

Addition isn’t possible without aligning the two operands, but multiplication is just the sum of the products of every pairing of their digits (their Cartesian product). The decimal point only matters for setting the result in the right range:

$\displaystyle{123.0}\times{987.0}={121401.0}\\[0.4em]{12.3}\times{9.87}={121.401}\\[0.4em]{0.123}\times{0.987}={0.121401}$

The actual digits of the answer don’t change. Since setting the result decimal point is a simple calculation done after the actual multiplying, we can ignore it here.
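To make the "simple calculation" concrete, here is a small sketch under my own assumptions: each value is held as an integer of digits plus a count of fractional places, and the helper names multiply_scaled and place_point are hypothetical. The digits multiply as plain integers and the fractional places just add:

```python
def multiply_scaled(a_digits: int, a_places: int, b_digits: int, b_places: int):
    """Multiply two fixed-point values given as (digits, fractional places)."""
    # The digit product ignores the point; the fractional places simply add.
    return a_digits * b_digits, a_places + b_places


def place_point(digits: int, places: int) -> str:
    """Render an integer of digits with the point 'places' from the right."""
    s = str(digits).rjust(places + 1, "0")
    return s if places == 0 else f"{s[:-places]}.{s[-places:]}"


# The three examples from the text share the same digits, 123 and 987.
for a_places, b_places in ((0, 0), (1, 2), (3, 3)):
    digits, places = multiply_scaled(123, a_places, 987, b_places)
    print(place_point(digits, places))
```

The same idea carries over to base 256, with byte positions in place of decimal places.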

For a baseline, we’ll start with an algorithm that uses actual multiplication of the values involved:

 def multiply_bs_0 (bs1:bytes, bs2:bytes) -> bytes:
     '''Multiply byte strings. Version 0. (No MulTable)'''
     print(f'multiply0({bs1.hex(".")}, {bs2.hex(".")})')

     # Output buffer…
     buf = [0] * (len(bs1) + len(bs2))

     # Generate Cartesian products…
     for ix1,b1 in enumerate(bs1):
         for ix2,b2 in enumerate(bs2):
             # Buffer index…
             ix = ix1+ix2

             # Determine product and any carry…
             carry,product = divmod(b1*b2, 256)
             print(f'[{ix}]: {b1} * {b2} = {product}, {carry}')

             # Add to result buffer…
             buf[ix] += product
             buf[ix+1] += carry

     # Ripple carry buffer…
     for ix in range(len(buf)):
         carry,value = divmod(buf[ix], 256)
         # Handle overflow…
         if carry:
             buf[ix] = value
             buf[ix+1] += carry

     print(f'multiply0.exit: {buf}')
     return bytes(buf)


 if __name__ == '__main__':
     print()

     bs1 = bytes([255,255])
     bs2 = bytes([255])
     v1 = int.from_bytes(bs1, byteorder='little')
     v2 = int.from_bytes(bs2, byteorder='little')

     res = multiply_bs_0(bs1, bs2)
     val = int.from_bytes(res, byteorder='little')
     flg = 'Ok!' if (v1*v2)==val else str(v1*v2)
     print()

     print(f'{list(bs1)} × {list(bs2)} = {res}')
     print(f'{v1} × {v2} = {val} ({flg})')
     print()



The multiply_bs_0 function takes two parameters, both byte strings (type bytes), multiplies them, and returns the result, also a byte string.

Because we’re using bytes, the values range from 0 to 255. If we multiply the largest value, 255 × 255, we get 65,025 — which is way out of range, so we need to carry to the next byte.

A multiplication result can never have more digits than the sum of the number of digits in the operands. This is true regardless of base. For example, consider 99×99=9801. Nine is the largest possible value for a decimal digit, so this is the largest possible result with two-digit operands. It has four digits.

A multiplication can have fewer digits in the result than the sum of the number of operand digits. For example, 01×01=0001 is a valid two-digit multiplication, but the result (absent leading zeros) has only one digit.
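The same bound can be spot-checked in base 256. This is only a sketch: byte_length is a hypothetical helper (not part of the post's code) that counts how many bytes an integer needs.

```python
def byte_length(n: int) -> int:
    """Number of base-256 digits (bytes) needed for n, with at least one digit."""
    return max(1, (n.bit_length() + 7) // 8)


# Largest two-byte operand times largest one-byte operand: 2 + 1 = 3 bytes.
a, b = 0xFFFF, 0xFF
print(byte_length(a), byte_length(b), byte_length(a * b))

# Spot-check the bound for a handful of operands (not an exhaustive proof).
samples = (1, 255, 256, 65535, 65536, 2**96 - 1)
bound_holds = all(
    byte_length(x * y) <= byte_length(x) + byte_length(y)
    for x in samples
    for y in samples
)
print(bound_holds)
```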

Bottom line, we calculate the size of the result buffer by summing the lengths of the two operands (line #6).

Lines #8 through #20 comprise a nested for loop that generates the Cartesian products and adds them to the buffer. Line #15 uses the built-in divmod function to handle possible overflow (which is added to the next buffer slot).

After adding the products, the buffer values may be too large, so lines #22 to #28 do a ripple carry to ensure all values are legal byte values.
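The two-byte by one-byte example used in this post happens never to overflow a buffer slot, so here is a standalone sketch of the ripple carry on a buffer that does. The function name ripple_carry is hypothetical, and the sketch assumes (as the listing does) that the buffer is sized so the top slot never overflows. The input is the raw buffer you get by summing the byte products of [255,255] × [255,255]:

```python
def ripple_carry(buf: list[int], base: int = 256) -> list[int]:
    """Normalize a little-endian digit buffer so every slot is below base.

    Assumes the top slot cannot overflow, which holds when the buffer length
    is the sum of the operand lengths.
    """
    out = list(buf)
    for ix in range(len(out) - 1):
        # Keep the low digit in place and push the excess into the next slot.
        carry, out[ix] = divmod(out[ix], base)
        out[ix + 1] += carry
    return out


raw = [1, 256, 509, 254]          # un-normalized sums for 65535 * 65535
fixed = ripple_carry(raw)
print(fixed)                      # [1, 0, 254, 255]
print(int.from_bytes(bytes(fixed), byteorder="little") == 65535 * 65535)
```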

Lines #34 through #49 exercise the function and flag a correct result in the display (or print what the value should have been).

When run, this prints:

multiply0(ff.ff, ff)
[0]: 255 * 255 = 1, 254
[1]: 255 * 255 = 1, 254
multiply0.exit: [1, 255, 254]

[255, 255] × [255] = b'\x01\xff\xfe'
65535 × 255 = 16711425 (Ok!)

Ok!

Now let’s try using a multiplication table:

 itobs = lambda n: bytes(reversed(divmod(n,256)))

 MulTable = [[itobs(a*b) for b in range(256)] for a in range(256)]



The itobs (int to bytes) lambda function (line #1) converts an integer value to a pair of bytes.

MulTable is a nested list — essentially a 256×256 grid (65,536 values) — with all possible multiplication outcomes for two bytes. Obtaining an outcome just requires indexing the value:

val = MulTable[oper1][oper2]

Where oper1 and oper2 are each byte values to be multiplied. The result is always two bytes.
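As a quick sanity check (a sketch that rebuilds the table locally rather than importing it, with itobs written as a plain function), every entry can be decoded as a little-endian pair and compared against the real product:

```python
def itobs(n: int) -> bytes:
    """Convert a product in the range 0..65025 to two bytes, low byte first."""
    high, low = divmod(n, 256)
    return bytes((low, high))


MulTable = [[itobs(a * b) for b in range(256)] for a in range(256)]

# Count entries whose decoded value disagrees with ordinary multiplication.
mismatches = sum(
    1
    for a in range(256)
    for b in range(256)
    if int.from_bytes(MulTable[a][b], byteorder="little") != a * b
)
print(len(MulTable) * len(MulTable[0]), mismatches)  # 65536 0
```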

We can demonstrate MulTable:

 from examples import MulTable

 if __name__ == '__main__':
     print()

     #Show off some byte multiplications…
     print(f'{MulTable[0x00][0x00] = }')
     print(f'{MulTable[0x01][0x01] = }')
     print(f'{MulTable[0x02][0x02] = }')
     print(f'{MulTable[0x2a][0x2a] = }')
     print(f'{MulTable[0x45][0x67] = }')
     print(f'{MulTable[0x80][0x80] = }')
     print(f'{MulTable[0xAB][0xCD] = }')
     print(f'{MulTable[0xff][0xff] = }')
     print(f'')



Which when run prints:

MulTable[0x00][0x00] = b'\x00\x00'
MulTable[0x01][0x01] = b'\x01\x00'
MulTable[0x02][0x02] = b'\x04\x00'
MulTable[0x2a][0x2a] = b'\xe4\x06'
MulTable[0x45][0x67] = b'\xc3\x1b'
MulTable[0x80][0x80] = b'\x00@'
MulTable[0xAB][0xCD] = b'\xef\x88'
MulTable[0xff][0xff] = b'\x01\xfe'

All correct answers, so the basic idea works.

Now let’s try using it to multiply byte strings. Here’s my first version:

 from examples import MulTable

 def multiply_bs_1 (bs1:bytes, bs2:bytes) -> bytes:
     '''Multiply byte strings. Version 1.'''
     print(f'multiply1({bs1.hex(".")}, {bs2.hex(".")})')

     # Output buffer…
     buf = [0] * (len(bs1) + len(bs2))

     # Generate Cartesian products…
     for ix1,b1 in enumerate(bs1):
         for ix2,b2 in enumerate(bs2):
             # Buffer index…
             ix = ix1+ix2

             # Get b1*b2 product…
             res = MulTable[b1][b2]
             print(f'[{ix}]: {b1} * {b2} = {res}')

             # Add LSB…
             buf[ix] += res[0]
             # Add MSB…
             buf[ix+1] += res[1]

     # Ripple carry buffer…
     for ix in range(len(buf)):
         carry,value = divmod(buf[ix], 256)
         # Handle overflow…
         if carry:
             buf[ix] = value
             buf[ix+1] += carry

     # Return buffer…
     print(f'multiply1.exit: {buf}')
     return bytes(buf)


 if __name__ == '__main__':
     print()

     bs1 = bytes([255,255])
     bs2 = bytes([255])
     v1 = int.from_bytes(bs1, byteorder='little')
     v2 = int.from_bytes(bs2, byteorder='little')

     res = multiply_bs_1(bs1, bs2)
     val = int.from_bytes(res, byteorder='little')
     flg = 'Ok!' if (v1*v2)==val else str(v1*v2)
     print()

     print(f'{list(bs1)} × {list(bs2)} = {res}')
     print(f'{v1} × {v2} = {val} ({flg})')
     print()



The main difference between this version 1 and version 0 above is in lines #16 to #23 where we get the MulTable result and add its two bytes to the result buffer. As we did in the previous version, we need a ripple carry loop (lines #25 to #31) to ensure we return valid byte values.

The exercise code from line #38 to #53 is the same as used before, and when run, this has (nearly) the same output:

multiply1(ff.ff, ff)
[0]: 255 * 255 = b'\x01\xfe'
[1]: 255 * 255 = b'\x01\xfe'
multiply1.exit: [1, 255, 254]

[255, 255] × [255] = b'\x01\xff\xfe'
65535 × 255 = 16711425 (Ok!)

So far, so good.

I had the idea that, rather than using a list buffer, I could use a dict with the buffer index as the key. I was curious which method of indexing was faster. Is it better to constantly update a list or selectively update dictionary entries?
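That question can also be probed in isolation with the standard timeit module. This is only a rough sketch under my own assumptions: SIZE, bump_list, and bump_dict are hypothetical names, the update pattern merely imitates adding to slots ix and ix+1, and the timings will vary by machine and Python version.

```python
from timeit import timeit

SIZE = 24  # hypothetical buffer size, e.g. two 12-byte operands


def bump_list() -> list:
    """Repeatedly add into adjacent slots of a pre-sized list."""
    buf = [0] * SIZE
    for ix in range(SIZE - 1):
        buf[ix] += 1
        buf[ix + 1] += 1
    return buf


def bump_dict() -> dict:
    """Do the same updates against a dict keyed by the buffer index."""
    terms = {ix: 0 for ix in range(SIZE)}
    for ix in range(SIZE - 1):
        terms[ix] += 1
        terms[ix + 1] += 1
    return terms


t_list = timeit(bump_list, number=2000)
t_dict = timeit(bump_dict, number=2000)
print(f"list: {t_list * 1000:.3f} ms")
print(f"dict: {t_dict * 1000:.3f} ms")
```

A micro-benchmark like this isolates the indexing cost, but it says nothing about the rest of the multiply routine, so the full versions still need timing.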

That led to version 2:

 from examples import MulTable

 def multiply_bs_2 (bs1:bytes, bs2:bytes) -> bytes:
     '''Multiply byte strings. Version 2.'''
     print(f'multiply2({bs1.hex(".")}, {bs2.hex(".")})')

     # Accumulator…
     terms = {ix:0 for ix in range(len(bs1) + len(bs2))}

     # Generate Cartesian products…
     for ix1,b1 in enumerate(bs1):
         for ix2,b2 in enumerate(bs2):
             # Buffer index…
             ix = ix1+ix2

             # Get product and add to terms…
             res = MulTable[b1][b2]
             terms[ix] += res[0]
             terms[ix+1] += res[1]

             answer = [terms[ix] for ix in sorted(terms.keys())]
             print(f'multiply2.products[{b1},{b2}]: {answer}')

     # Ripple carry the terms…
     for ix in sorted(terms.keys()):
         carry,value = divmod(terms[ix], 256)
         if 0 < carry:
             terms[ix] = value
             terms[ix+1] += carry

     # Return the terms…
     answer = [terms[ix] for ix in sorted(terms.keys())]
     print(f'multiply2.exit: {answer}')
     return bytes(answer)


 if __name__ == '__main__':
     print()

     bs1 = bytes([255,255])
     bs2 = bytes([255])
     v1 = int.from_bytes(bs1, byteorder='little')
     v2 = int.from_bytes(bs2, byteorder='little')

     res = multiply_bs_2(bs1, bs2)
     val = int.from_bytes(res, byteorder='little')
     flg = 'Ok!' if (v1*v2)==val else str(v1*v2)
     print()

     print(f'{list(bs1)} × {list(bs2)} = {res}')
     print(f'{v1} × {v2} = {val} ({flg})')
     print()



This is similar to version 1, but the list used as a result buffer is replaced with a dict (line #8). At the end, we generate a list from the dictionary (line #32).

The nested for loop (lines #10 to #22) works differently here. It looks up the product result in MulTable (line #17) and adds the values to the indexed dictionary entries (lines #18 and #19).

The ripple carry (lines #24 to #29) is essentially the same as before but happens to be indexing dictionary entries rather than list slots.

Lines #37 to #52 exercise the code as done with previous versions. When run, this prints:

multiply2(ff.ff, ff)
multiply2.products[255,255]: [1, 254, 0]
multiply2.products[255,255]: [1, 255, 254]
multiply2.exit: [1, 255, 254]

[255, 255] × [255] = b'\x01\xff\xfe'
65535 × 255 = 16711425 (Ok!)

Ok! This version works, too.

Thinking about the two-byte MulTable results, it occurred to me that while the buffer indexes are revisited multiple times, the pair of indexes for the operands are unique. Consider the following multiplication:

The number pairs (4/2, 3/2, etc.) are the unique pairs — ix1 and ix2 in the code above, rather than their sum, ix, which indexes the result buffer. Obviously, that sum repeats, but the pairs do not. As you see in the diagram above, each intermediate result from MulTable has a unique pair indexing it, no repeats.

Note in the upper right the sum of the number of digits in the operands: 5+3=8. The result can have no more than eight digits. The middle right multiplication of those same numbers gives the number of multiplications (or lookups) — the count of the Cartesian products.
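The counting argument can be checked with a few lines of Python. This sketch assumes zero-based digit positions and a five-digit by three-digit multiplication, as in the 5+3=8 example; the variable names are my own.

```python
from collections import Counter

len1, len2 = 5, 3  # operand digit counts (5 + 3 = 8 result digits at most)

# Every (ix1, ix2) pair occurs exactly once...
pairs = [(ix1, ix2) for ix1 in range(len1) for ix2 in range(len2)]
print(len(pairs), len(set(pairs)))  # 15 15

# ...but their sums, which index the result buffer, repeat.
sums = Counter(ix1 + ix2 for ix1, ix2 in pairs)
print(dict(sorted(sums.items())))
```

The 15 pairs are the 5×3 Cartesian products, and the repeated sums show why the result buffer slots are revisited while the pair keys are not.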

Given the unique index pairs for those products, I thought, why not just stash them in a dictionary, keyed by the index pair, and add them up (and do the carry) as a second step? That led to this version:

 from examples import MulTable

 def multiply_bs_3 (bs1:bytes, bs2:bytes) -> bytes:
     '''Multiply byte strings. Version 3.'''
     print(f'multiply3({bs1.hex(".")}, {bs2.hex(".")})')

     # Output buffer…
     buf = [0] * (len(bs1) + len(bs2))

     # Accumulator…
     terms = {}

     # Generate Cartesian products…
     for ix1,b1 in enumerate(bs1):
         for ix2,b2 in enumerate(bs2):
             # Terms index…
             key = (ix1,ix2)

             # Add to terms…
             terms[key] = MulTable[b1][b2]
             print(f'multiply3.product[{b1},{b2}]: {terms[key]}')

     # Sum the terms into the result buffer…
     for key in sorted(terms.keys()):
         ix = sum(key)
         res = terms[key]
         buf[ix] += res[0]
         buf[ix+1] += res[1]

     # Ripple carry buffer…
     for ix in range(len(buf)):
         carry,value = divmod(buf[ix], 256)
         # Handle overflow…
         if carry:
             buf[ix] = value
             buf[ix+1] += carry

     # Return buffer…
     print(f'multiply3.exit: {buf}')
     return bytes(buf)


 if __name__ == '__main__':
     print()

     bs1 = bytes([255,255])
     bs2 = bytes([255])
     v1 = int.from_bytes(bs1, byteorder='little')
     v2 = int.from_bytes(bs2, byteorder='little')

     res = multiply_bs_3(bs1, bs2)
     val = int.from_bytes(res, byteorder='little')
     flg = 'Ok!' if (v1*v2)==val else str(v1*v2)
     print()

     print(f'{list(bs1)} × {list(bs2)} = {res}')
     print(f'{v1} × {v2} = {val} ({flg})')
     print()



Which has noticeable differences from previous versions. Firstly, the nested Cartesian product for loop (lines #13 to #21) now just stores the MulTable result in the terms dictionary under a tuple key of (ix1,ix2). As these keys are unique, there’s no need to initialize terms or sum values into existing keys.

Secondly, a new for loop (lines #23 to #28) does a one-pass sum of terms into the result buffer.

The ripple carry loop in lines #30 to #36 is what we’ve seen before. Likewise, the exercising code from lines #43 to #58.

When run, this prints:

multiply3(ff.ff, ff)
multiply3.product[255,255]: b'\x01\xfe'
multiply3.product[255,255]: b'\x01\xfe'
multiply3.exit: [1, 255, 254]

[255, 255] × [255] = b'\x01\xff\xfe'
65535 × 255 = 16711425 (Ok!)

Ok!

Now that we have four different versions, let’s time them with a long-ish multiplication:

 from sys import argv
 from time import perf_counter_ns
 import examples as eg

 if __name__ == '__main__':
     times = int(argv[1]) if 1 < len(argv) else 1000

     # Operand strings…
     s1 = argv[2] if 2 < len(argv) else '255, … ,255'
     s2 = argv[3] if 3 < len(argv) else '255, … ,255'

     # Convert string to integer values…
     ds1 = [int(d, base=0) for d in s1.split(',')]
     ds2 = [int(d, base=0) for d in s2.split(',')]

     # Make the byte strings for multiplying…
     bs1 = bytes(ds1)
     bs2 = bytes(ds2)
     v1 = int.from_bytes(bs1, byteorder='little')
     v2 = int.from_bytes(bs2, byteorder='little')
     answer = v1 * v2
     print(f'1 = {v1:,}')
     print(f'2 = {v2:,}')
     print(f'A = {answer:,}')
     print()

     # Version 0…
     t0 = perf_counter_ns()
     for _ in range(times):
         res = eg.multiply_bs_0(bs1, bs2)
     t1 = perf_counter_ns()
     val = int.from_bytes(res, byteorder='little')
     print(f'version 0: {(t1-t0)/1000000:.3f} ms')
     print('Ok!' if val==answer else f'OOPS: {val:,}')

     # Version 1…
     t0 = perf_counter_ns()
     for _ in range(times):
         res = eg.multiply_bs_1(bs1, bs2)
     t1 = perf_counter_ns()
     val = int.from_bytes(res, byteorder='little')
     print(f'version 1: {(t1-t0)/1000000:.3f} ms')
     print('Ok!' if val==answer else f'OOPS: {val:,}')

     # Version 2…
     t0 = perf_counter_ns()
     for _ in range(times):
         res = eg.multiply_bs_2(bs1, bs2)
     t1 = perf_counter_ns()
     val = int.from_bytes(res, byteorder='little')
     print(f'version 2: {(t1-t0)/1000000:.3f} ms')
     print('Ok!' if val==answer else f'OOPS: {val:,}')

     # Version 3…
     t0 = perf_counter_ns()
     for _ in range(times):
         res = eg.multiply_bs_3(bs1, bs2)
     t1 = perf_counter_ns()
     val = int.from_bytes(res, byteorder='little')
     print(f'version 3: {(t1-t0)/1000000:.3f} ms')
     print('Ok!' if val==answer else f'OOPS: {val:,}')



This calls each of the four functions a bunch of times (1000 by default — line #6). The input strings on lines #9 and #10 have been shortened to fit this post. Each actually contains 12 occurrences of 255 — the ellipses symbolize the missing 10.

Note how, as in the previous examples, we validate the returned multiplication result against an integer result done on line #21.

When run, this prints:

1 = 79,228,162,514,264,337,593,543,950,335
2 = 79,228,162,514,264,337,593,543,950,335
A = 6,277,101,735,386, ... ,946,612,225

version 0: 19.739 ms
Ok!
version 1: 19.414 ms
Ok!
version 2: 24.090 ms
Ok!
version 3: 33.035 ms
Ok!

The result surprised me a little.

Using multiplication to generate the Cartesian products is just as fast as using a lookup table and adding values to a list. Using a dict is noticeably slower, and the version using tuple keys is slower still.

It would be interesting to convert those first two versions to C++ or assembly and see how they compare without any Python overhead.

[Come to think of it, I did just write a CPU simulator and assembler (in Python). A Python CPU simulator would run much slower than a real CPU, but it should allow comparisons between routines. Hmmm… 🤔]

Link: Zip file containing all code fragments used in this post.

∅

1 thought on “Byte Multiplication Trick”

Wyrd Smythe said:

September 29, 2025 at 11:06 am

ATTENTION: The WordPress Reader strips the style information from posts, which can destroy certain important formatting elements. If you’re reading this in the Reader, I highly recommend (and urge) you to [A] stop using the Reader and [B] always read blog posts on their website.

This post is: Byte Multiplication Trick

The Hard-Core Coder

~ I can't stop writing code!
