Just thought I'd add a few more thoughts about optimisation.
Lookup tables can sometimes save you time, if you can afford the memory.
e.g. the multiply by 64 can just become:
Code:
; X = number to be multiplied by 64
LDA mult64lo,X:STA val
LDA mult64hi,X:STA val+1
This is much quicker again than doing the maths by hand.
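The tables themselves have to come from somewhere. Here is a rough sketch of a one-off setup loop that fills them at startup, assuming mult64lo and mult64hi are each 256 bytes reserved elsewhere. The label buildmult64 is just a name I made up for illustration:

```
; One-off setup, not speed critical: fill both tables for X = 0..255
; Assumes mult64lo and mult64hi are 256-byte tables reserved elsewhere
LDX #0
.buildmult64            ; hypothetical label
TXA
ASL A:ASL A:ASL A
ASL A:ASL A:ASL A       ; A = (X*64) AND &FF, the low byte
STA mult64lo,X
TXA
LSR A:LSR A             ; A = X DIV 4, the high byte of X*64
STA mult64hi,X
INX
BNE buildmult64         ; X wraps to 0 after 255, ending the loop
```

For example X=5 gives &40 in mult64lo and &01 in mult64hi, i.e. &0140 = 320. If memory is tight you could equally precalculate the tables at assembly time.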
In your sprite routine, if you are using colour 0 (edited from colour 8) to represent a transparent pixel, you can speed up the masking using a table, if you dare to allocate 256 bytes (gasp) as a mask table lookup:
Code:
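LDA (sprite),Y        ; fetch the sprite byte (two Mode 2 pixels)
STA pokeme+1          ; poke it into the low byte of the AND operand below
LDA screen,X          ; fetch the screen byte
.pokeme AND masktable ; must be 256-byte aligned; keeps screen pixels where the sprite is colour 0
ORA pokeme+1          ; merge in the sprite byte, which is still sitting in the operand
STA screen,X          ; write the result back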
...or that kind of thing, where masktable(n) contains &FF, &AA, &55 or &00, depending on whether both, only the left, only the right, or neither of the two pixels in Mode 2 byte n is colour 0 (transparent).
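In case it helps, here is a sketch of how the mask table could be built at startup. It assumes the usual Mode 2 layout (left pixel in bits 7,5,3,1 and right pixel in bits 6,4,2,0) and that masktable is 256 bytes, page aligned, reserved elsewhere. The labels buildmask, leftdone and rightdone are made up for illustration, and Y gets clobbered:

```
; One-off setup: masktable(n) = screen bits to keep for sprite byte n
; Assumes Mode 2 layout: left pixel = bits 7,5,3,1, right pixel = bits 6,4,2,0
LDX #0
.buildmask              ; hypothetical label
LDY #0                  ; Y accumulates the mask for sprite byte X
TXA:AND #&AA            ; any left pixel bits set?
BNE leftdone            ; yes: left pixel is solid, contributes nothing
LDY #&AA                ; no: left pixel is colour 0, keep the screen's left pixel
.leftdone
TXA:AND #&55            ; any right pixel bits set?
BNE rightdone           ; yes: right pixel is solid
TYA:ORA #&55:TAY        ; no: keep the screen's right pixel as well
.rightdone
TYA
STA masktable,X
INX
BNE buildmask           ; X wraps to 0 after 255, ending the loop
```

So entry &00 comes out as &FF (sprite byte fully transparent, screen untouched), and any byte with both pixels non-zero comes out as &00.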
Even better of course is to allocate a separate mask for each sprite, but I don't think it's so feasible to double the size of your sprite data in memory just for this. In general, the balance to be struck is between speed and memory usage. Tricky one.
Always remember X and Y. If they're free, it's quicker to preserve A with TAX .... TXA (4 cycles in total) than with STA zp .... LDA zp (6 cycles). The stack is the slowest: PHA .... PLA costs 7 cycles, so avoid it in critical code (unless you really need the stack for reentrancy), because STA ... LDA does just as good a job. If I don't want to work out which zp location is free, I sometimes use:
Code:
STA preserveme+1
...
...
.preserveme
LDA #0
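which is just as fast as using the zero page (STA absolute is 4 cycles and LDA immediate is 2, so 6 in total), as long as the code is running from RAM.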
Remember that JSR/RTS has an overhead: 6 cycles each, so 12 cycles per call. If you have a very small subroutine that is called from a critical section, consider moving its code into the body of the calling code - e.g. don't JSR mult64, but repeat the code from my last post everywhere you need it (it's only 11 bytes long).
I'll add more as I think of them....