by hamster » Fri Apr 29, 2011 2:50 am
Hi,
You haven't really given enough info...
One possible way is to use the IP core generator to generate a single multiplier IP block, then put multiplexers on its inputs to select the correct 'a' and 'b' operands. The products then need to be added and stored in the appropriate destination. This would give you an output matrix every 64 cycles, as a 4x4 matrix multiply needs 64 multiplications (16 output elements, 4 products each).
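To make the cycle count concrete, here is a rough behavioural sketch in plain Python (not HDL) of the single-multiplier scheme. The function name, the accumulator variable and the loop order are my own assumptions for illustration; one inner-loop iteration stands in for one clock cycle, and nothing is pipelined.

```python
def matmul_single_multiplier(a, b):
    """Model a 4x4 matrix multiply that shares one multiplier.

    Hypothetical model: one product is computed per 'cycle'.
    """
    n = 4
    result = [[0] * n for _ in range(n)]
    cycles = 0
    for row in range(n):
        for col in range(n):
            acc = 0  # accumulator for the current output element
            for k in range(n):
                # The input multiplexers select a[row][k] and b[k][col]
                # for the single shared multiplier.
                product = a[row][k] * b[k][col]
                acc += product
                cycles += 1
            # Store the finished sum in its destination element.
            result[row][col] = acc
    return result, cycles
```

Calling it on any pair of 4x4 lists of numbers returns the product matrix together with a cycle count of 64.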
Or, if you have specialised requirements, you could use 64 multipliers and 48 adders (three per output element) to give you the result in one operation. This could give you an output matrix every cycle, but it would need a lot of resources, and the fan-out on the inputs could greatly limit performance.
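The resource figures for the fully parallel version follow from simple counting. A small sketch, with a made-up function name and assuming each output element gets its own multipliers and adders with no sharing:

```python
def fully_parallel_resources(n=4):
    """Count the arithmetic blocks for a fully parallel n x n multiply.

    Hypothetical helper; assumes no resource sharing at all.
    """
    multipliers = n * n * n   # n products for each of the n*n outputs
    adders = n * n * (n - 1)  # n - 1 adders to sum n products per output
    fanout_per_input = n      # each input element feeds n multipliers
    return multipliers, adders, fanout_per_input
```

For n = 4 this gives 64 multipliers and 48 adders, with every input element driving four multiplier inputs.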
The most balanced way would probably be to have four multipliers and feed their products into a tree of three adders to give the result. This computes one element of the output matrix at a time, just by selecting which input row and column drive the multipliers' 'a' and 'b' inputs. As each 'a' and 'b' input would then only need a 4-input MUX to select the correct value, it would be much nicer to implement. This could calculate an output matrix every 16 cycles (if nothing is pipelined).
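Again as a behavioural sketch only (plain Python, hypothetical names, one loop iteration per clock cycle), the four-multiplier arrangement with its adder tree could be modelled like this:

```python
def matmul_four_multipliers(a, b):
    """Model a 4x4 matrix multiply using four multipliers and three adders.

    Hypothetical model: one output element is produced per 'cycle'.
    """
    n = 4
    result = [[0] * n for _ in range(n)]
    cycles = 0
    for row in range(n):
        for col in range(n):
            # Multiplier k always sees column k of 'a' and row k of 'b';
            # a 4-input mux picks the row of 'a', another picks the column of 'b'.
            p = [a[row][k] * b[k][col] for k in range(n)]
            # Tree of three adders: two in the first level, one in the second.
            s0 = p[0] + p[1]
            s1 = p[2] + p[3]
            result[row][col] = s0 + s1
            cycles += 1
    return result, cycles
```

The returned cycle count is 16, one per output element. In real hardware you would likely register the multiplier and adder outputs, which adds latency but should not change the throughput.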
Mike