Fast multiplication routines

From Retrosoftware

Jump to: navigation, search

Sample Code Library > Fast multiplication routines


Fast multiplication routines in 6502

This is a copy of an article I posted to the Beeb mailing list yonks ago - perhaps it's useful to someone...?


Multiplication by iteration and shifting

Normally, multiplication of two 8-bit numbers can be achieved with a little routine such as this:

.multiply
\ Multiplies A*X, and stores result in product/product+1

CPX #0:BEQ zero
DEX:STX product+1
LSR A:STA product
LDA #0
BCC s1:ADC product+1:.s1 ROR A:ROR product
BCC s2:ADC product+1:.s2 ROR A:ROR product
BCC s3:ADC product+1:.s3 ROR A:ROR product
BCC s4:ADC product+1:.s4 ROR A:ROR product
BCC s5:ADC product+1:.s5 ROR A:ROR product
BCC s6:ADC product+1:.s6 ROR A:ROR product
BCC s7:ADC product+1:.s7 ROR A:ROR product
BCC s8:ADC product+1:.s8 ROR A:ROR product
STA product+1
RTS
.zero
STX product:STX product+1
RTS

This completes in an average of 113 cycles (excluding the multiply by zero special case) - not bad, but it's possible to do better, provided you're happy to set aside some space for some tables...


Faster multiplication by table lookup

There is a method which yields the product of two values from the difference between two squares.

Mathematically:

 (a+b)^2 = a^2 + b^2 + 2ab     --> (I)
 (a-b)^2 = a^2 + b^2 - 2ab     --> (II)

(I) minus (II) gives:

 (a+b)^2 - (a-b)^2 = 4ab

or, in other words:

       (a+b)^2     (a-b)^2
 ab =  -------  -  -------
          4           4

So this means we can store a table of f(n) = n^2 / 4 for n = 0..510, and then can achieve multiplication via a single 16-bit subtract! In reality, we use 4 lookup tables; 2 for the LSB/MSBs of the 16-bit value when n is less than 256, and a further 2 for when n is 256 or greater.

We can divide by 4 without worrying about truncation, because we know that (a+b)^2 - (a-b)^2 will always be a multiple of 4, therefore we lose no information or accuracy by discarding the lower 2 bits in the table itself.

Here's some code:

 10 sqrlo256%=&900
 20 sqrlo512%=&A00
 30 sqrhi256%=&B00
 40 sqrhi512%=&C00
 50 :
 60 FOR N%=0 TO 255
 70 s256%=(N%*N%) DIV 4
 80 s512%=((N%+256)*(N%+256)) DIV 4
 90 N%?sqrlo256%=s256% AND 255
100 N%?sqrhi256%=s256% DIV 256
110 N%?sqrlo512%=s512% AND 255
120 N%?sqrhi512%=s512% DIV 256
130 NEXT
140 :
150 num1=&70:num2=&71:result=&72
160 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%
170 .mult
180 SEC:LDA num1:SBC num2:BCS positive
190 EOR #255:ADC #1:.positive TAY
200 CLC:LDA num1:ADC num2:TAX
210 BCS morethan256:SEC
220 LDA sqrlo256%,X:SBC sqrlo256%,Y:STA result
230 LDA sqrhi256%,X:SBC sqrhi256%,Y:STA result+1
240 RTS
250 .morethan256
260 LDA sqrlo512%,X:SBC sqrlo256%,Y:STA result
270 LDA sqrhi512%,X:SBC sqrhi256%,Y:STA result+1
280 RTS
290 ]:NEXT
300 :
310 !result=0
320 REPEAT
330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult
340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"
350 IF ?num1*?num2<>!result:END
360 UNTIL FALSE

So there's a routine which achieves an 8 bit * 8 bit multiplication in just 55 cycles - over twice the speed of the original routine. The only problem with this is the amount of table space required - 1k.

It turns out that we can reduce the table space by 256 bytes by unifying sqrlo256 and sqrlo512. They are virtually the same, apart from that sqrlo512(n) = sqrlo256(n) EOR &80 when n is odd.

Which leads us to the following space optimisation:

 10 sqrlo%=&900
 20 sqrhi256%=&A00
 30 sqrhi512%=&B00
 40 :
 50 FOR N%=0 TO 255
 60 s256%=(N%*N%) DIV 4
 70 s512%=((N%+256)*(N%+256)) DIV 4
 80 N%?sqrlo%=s256% AND 255
 90 N%?sqrhi256%=s256% DIV 256
100 N%?sqrhi512%=s512% DIV 256
110 NEXT
120 :
130 num1=&70:num2=&71:result=&72
140 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%
150 .mult
160 SEC:LDA num1:SBC num2:BCS positive
170 EOR #255:ADC #1:.positive TAY
180 CLC:LDA num1:ADC num2:TAX
190 BCS morethan256
200 LDA sqrhi256%,X:STA result+1
210 LDA sqrlo%,X:SEC:BCS lessthan256
220 .morethan256
230 LDA sqrhi512%,X:STA result+1
240 TXA:AND #1:BEQ skip:LDA #&80:.skip
250 EOR sqrlo%,X:.lessthan256
260 SBC sqrlo%,Y:STA result
270 LDA result+1:SBC sqrhi256%,Y:STA result+1
280 RTS
290 ]:NEXT
300 :
310 !result=0
320 REPEAT
330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult
340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"
350 IF ?num1*?num2<>!result:END
360 UNTIL FALSE

...which is an average of about 67 cycles, i.e. still substantially better than the original routine.

I wish I'd thought of this trick when I was writing multiplication-intensive demos years ago...