Fast multiplication routines
From Retrosoftware
(New page: Here's a little trick I'd never come across before: doubtless it's well known, but thought I'd share the goodies. Normally, multiplication of two 8-bit numbers can be achieved with a litt...) |
m |
||
| Line 1: | Line 1: | ||
Here's a little trick I'd never come across before: doubtless it's well known, but thought I'd share the goodies. Normally, multiplication of two 8-bit numbers can be achieved with a little routine such as this: | Here's a little trick I'd never come across before: doubtless it's well known, but thought I'd share the goodies. Normally, multiplication of two 8-bit numbers can be achieved with a little routine such as this: | ||
| + | <tt> | ||
.multiply | .multiply | ||
\ Multiplies A*X, and stores result in product/product+1 | \ Multiplies A*X, and stores result in product/product+1 | ||
| Line 21: | Line 22: | ||
STX product:STX product+1 | STX product:STX product+1 | ||
RTS | RTS | ||
| + | </tt> | ||
This completes in an average of 113 cycles (excluding the multiply by zero special case) - not bad, but it's possible to do better, provided you're happy to set aside some space for some tables... | This completes in an average of 113 cycles (excluding the multiply by zero special case) - not bad, but it's possible to do better, provided you're happy to set aside some space for some tables... | ||
| Line 29: | Line 31: | ||
Mathematically: | Mathematically: | ||
| + | <tt> | ||
(a+b)^2 = a^2 + b^2 + 2ab --> (I) | (a+b)^2 = a^2 + b^2 + 2ab --> (I) | ||
(a-b)^2 = a^2 + b^2 - 2ab --> (II) | (a-b)^2 = a^2 + b^2 - 2ab --> (II) | ||
| + | </tt> | ||
(I) minus (II) gives: | (I) minus (II) gives: | ||
| + | <tt> | ||
(a+b)^2 - (a-b)^2 = 4ab | (a+b)^2 - (a-b)^2 = 4ab | ||
| + | </tt> | ||
or, in other words: | or, in other words: | ||
| + | <tt> | ||
(a+b)^2 (a-b)^2 | (a+b)^2 (a-b)^2 | ||
ab = ------- - ------- | ab = ------- - ------- | ||
4 4 | 4 4 | ||
| + | </tt> | ||
So this means we can store a table of f(n) = n^2 / 4 for n = 0..510, and then can achieve multiplication via a single 16-bit subtract! In reality, we use 4 lookup tables; 2 for the LSB/MSBs of the 16-bit value when n is less than 256, and a further 2 for when n is 256 or greater. | So this means we can store a table of f(n) = n^2 / 4 for n = 0..510, and then can achieve multiplication via a single 16-bit subtract! In reality, we use 4 lookup tables; 2 for the LSB/MSBs of the 16-bit value when n is less than 256, and a further 2 for when n is 256 or greater. | ||
| Line 48: | Line 56: | ||
Here's some code: | Here's some code: | ||
| + | <tt> | ||
10 sqrlo256%=&900 | 10 sqrlo256%=&900 | ||
20 sqrlo512%=&A00 | 20 sqrlo512%=&A00 | ||
| Line 84: | Line 93: | ||
350 IF ?num1*?num2<>!result:END | 350 IF ?num1*?num2<>!result:END | ||
360 UNTIL FALSE | 360 UNTIL FALSE | ||
| + | </tt> | ||
So there's a routine which achieves an 8 bit * 8 bit multiplication in just 55 cycles - over twice the speed of the original routine. The only problem with this is the amount of table space required - 1k. | So there's a routine which achieves an 8 bit * 8 bit multiplication in just 55 cycles - over twice the speed of the original routine. The only problem with this is the amount of table space required - 1k. | ||
| Line 91: | Line 101: | ||
Which leads us to the following space optimisation: | Which leads us to the following space optimisation: | ||
| + | <tt> | ||
10 sqrlo%=&900 | 10 sqrlo%=&900 | ||
20 sqrhi256%=&A00 | 20 sqrhi256%=&A00 | ||
| Line 127: | Line 138: | ||
350 IF ?num1*?num2<>!result:END | 350 IF ?num1*?num2<>!result:END | ||
360 UNTIL FALSE | 360 UNTIL FALSE | ||
| + | </tt> | ||
...which is an average of about 67 cycles, i.e. still substantially better than the original routine. | ...which is an average of about 67 cycles, i.e. still substantially better than the original routine. | ||
I wish I'd thought of this trick when I was writing multiplication-intensive demos years ago... | I wish I'd thought of this trick when I was writing multiplication-intensive demos years ago... | ||
Revision as of 07:16, 28 June 2008
Here's a little trick I'd never come across before: doubtless it's well known, but thought I'd share the goodies. Normally, multiplication of two 8-bit numbers can be achieved with a little routine such as this:
.multiply \ Multiplies A*X, and stores result in product/product+1 CPX #0:BEQ zero DEX:STX product+1 LSR A:STA product LDA #0 BCC s1:ADC product+1:.s1 ROR A:ROR product BCC s2:ADC product+1:.s2 ROR A:ROR product BCC s3:ADC product+1:.s3 ROR A:ROR product BCC s4:ADC product+1:.s4 ROR A:ROR product BCC s5:ADC product+1:.s5 ROR A:ROR product BCC s6:ADC product+1:.s6 ROR A:ROR product BCC s7:ADC product+1:.s7 ROR A:ROR product BCC s8:ADC product+1:.s8 ROR A:ROR product STA product+1 RTS .zero STX product:STX product+1 RTS
This completes in an average of 113 cycles (excluding the multiply by zero special case) - not bad, but it's possible to do better, provided you're happy to set aside some space for some tables...
There is a method which yields the product of two values from the difference between two squares.
Mathematically:
(a+b)^2 = a^2 + b^2 + 2ab --> (I) (a-b)^2 = a^2 + b^2 - 2ab --> (II)
(I) minus (II) gives:
(a+b)^2 - (a-b)^2 = 4ab
or, in other words:
(a+b)^2 (a-b)^2
ab = ------- - -------
4 4
So this means we can store a table of f(n) = n^2 / 4 for n = 0..510, and then can achieve multiplication via a single 16-bit subtract! In reality, we use 4 lookup tables; 2 for the LSB/MSBs of the 16-bit value when n is less than 256, and a further 2 for when n is 256 or greater.
Curiously, the integer division by 4 - and hence, truncation of the values stored in the tables - doesn't appear to sacrifice any accuracy. It's as if there's never any 'borrow' from bit 2 to bit 1 in any of the subtractions it ever performs, but I haven't investigated why this should be. Perhaps someone else can explain? I honestly expected to have to implement some fiddle factor to make up for truncation when I was writing this, and yet it seems totally unnecessary.
Here's some code:
10 sqrlo256%=&900
20 sqrlo512%=&A00
30 sqrhi256%=&B00
40 sqrhi512%=&C00
50 :
60 FOR N%=0 TO 255
70 s256%=(N%*N%) DIV 4
80 s512%=((N%+256)*(N%+256)) DIV 4
90 N%?sqrlo256%=s256% AND 255
100 N%?sqrhi256%=s256% DIV 256
110 N%?sqrlo512%=s512% AND 255
120 N%?sqrhi512%=s512% DIV 256
130 NEXT
140 :
150 num1=&70:num2=&71:result=&72
160 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%
170 .mult
180 SEC:LDA num1:SBC num2:BCS positive
190 EOR #255:ADC #1:.positive TAY
200 CLC:LDA num1:ADC num2:TAX
210 BCS morethan256:SEC
220 LDA sqrlo256%,X:SBC sqrlo256%,Y:STA result
230 LDA sqrhi256%,X:SBC sqrhi256%,Y:STA result+1
240 RTS
250 .morethan256
260 LDA sqrlo512%,X:SBC sqrlo256%,Y:STA result
270 LDA sqrhi512%,X:SBC sqrhi256%,Y:STA result+1
280 RTS
290 ]:NEXT
300 :
310 !result=0
320 REPEAT
330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult
340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"
350 IF ?num1*?num2<>!result:END
360 UNTIL FALSE
So there's a routine which achieves an 8 bit * 8 bit multiplication in just 55 cycles - over twice the speed of the original routine. The only problem with this is the amount of table space required - 1k.
It turns out that we can reduce the table space by 256 bytes by unifying sqrlo256 and sqrlo512. They are virtually the same, apart from that sqrlo512(n) = sqrlo256(n) EOR &80 when n is odd.
Which leads us to the following space optimisation:
10 sqrlo%=&900
20 sqrhi256%=&A00
30 sqrhi512%=&B00
40 :
50 FOR N%=0 TO 255
60 s256%=(N%*N%) DIV 4
70 s512%=((N%+256)*(N%+256)) DIV 4
80 N%?sqrlo%=s256% AND 255
90 N%?sqrhi256%=s256% DIV 256
100 N%?sqrhi512%=s512% DIV 256
110 NEXT
120 :
130 num1=&70:num2=&71:result=&72
140 FOR N%=0 TO 2 STEP 2:P%=&700:[OPT N%
150 .mult
160 SEC:LDA num1:SBC num2:BCS positive
170 EOR #255:ADC #1:.positive TAY
180 CLC:LDA num1:ADC num2:TAX
190 BCS morethan256
200 LDA sqrhi256%,X:STA result+1
210 LDA sqrlo%,X:SEC:BCS lessthan256
220 .morethan256
230 LDA sqrhi512%,X:STA result+1
240 TXA:AND #1:BEQ skip:LDA #&80:.skip
250 EOR sqrlo%,X:.lessthan256
260 SBC sqrlo%,Y:STA result
270 LDA result+1:SBC sqrhi256%,Y:STA result+1
280 RTS
290 ]:NEXT
300 :
310 !result=0
320 REPEAT
330 ?num1=RND(256)-1:?num2=RND(256)-1:CALL mult
340 PRINT;?num1;"*";?num2;"=";!result;" (";?num1*?num2;")"
350 IF ?num1*?num2<>!result:END
360 UNTIL FALSE
...which is an average of about 67 cycles, i.e. still substantially better than the original routine.
I wish I'd thought of this trick when I was writing multiplication-intensive demos years ago...