The transpose of a matrix interchanges its rows and columns. Here is a simple C loop to show the transpose:

for (i = 0; i < 3; i++) {
    for (j = 0; j < 3; j++) {
        output[j][i] = input[i][j];
    }
}

Assume both the input and output matrices are stored in row-major order (row-major order means that the row index changes fastest). Assume you are executing a 256 × 256 double-precision transpose on a processor with a 16 KB fully associative (so you don’t have to worry about cache conflicts) level 1 data cache with LRU replacement and 64-byte blocks. Assume level 1 cache misses or prefetches require 16 cycles and always hit in the level 2 cache, and that the level 2 cache can process a request every 2 processor cycles. Assume each iteration of the inner loop above requires 4 cycles if the data is present in the level 1 cache. Assume the cache has a write-allocate, fetch-on-write policy for write misses. Unrealistically, assume writing back dirty cache blocks requires 0 cycles.
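A few derived quantities follow directly from the stated parameters and are handy for the parts below. The sketch here only restates that arithmetic; the constant and function names are hypothetical and chosen for illustration.

```c
#include <stddef.h>

/* Parameters taken from the problem statement. */
enum {
    ELEM_BYTES  = 8,          /* one double-precision value */
    LINE_BYTES  = 64,         /* level 1 cache block size */
    L1_BYTES    = 16 * 1024,  /* level 1 data cache capacity */
    MATRIX_DIM  = 256         /* matrix is MATRIX_DIM x MATRIX_DIM */
};

/* Number of doubles that share one cache block: 64 / 8 = 8. */
size_t elems_per_line(void) { return LINE_BYTES / ELEM_BYTES; }

/* Number of blocks the level 1 cache can hold: 16384 / 64 = 256. */
size_t lines_in_l1(void) { return L1_BYTES / LINE_BYTES; }

/* Storage for one matrix: 256 * 256 * 8 = 524288 bytes (512 KB). */
size_t matrix_bytes(void) { return (size_t)MATRIX_DIM * MATRIX_DIM * ELEM_BYTES; }
```

Each matrix is therefore 32 times larger than the 16 KB cache, which is why the access order matters so much here.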

For the simple implementation given above, this execution order would be nonideal for the input matrix. However, applying a loop interchange optimization would create a nonideal order for the output matrix. Because loop interchange alone is not sufficient to improve performance, the loop must be blocked instead.
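As a rough illustration of what blocking means here, the following is a minimal sketch of a tiled transpose. The names `TN`, `TILE`, and `transpose_blocked` are hypothetical, the tile side of 32 is an assumed value rather than something the problem gives, and the sketch assumes the matrix dimension is a multiple of the tile side.

```c
#include <stddef.h>

#define TN   256  /* matrix dimension from the problem statement */
#define TILE 32   /* tile side: an assumed value, must divide TN */

/* Transpose input into output one TILE x TILE tile at a time, so that
 * the working set at any moment is one input tile and one output tile. */
void transpose_blocked(double input[TN][TN], double output[TN][TN])
{
    for (size_t ii = 0; ii < TN; ii += TILE) {
        for (size_t jj = 0; jj < TN; jj += TILE) {
            /* Transpose the tile whose top-left corner is (ii, jj). */
            for (size_t i = ii; i < ii + TILE; i++) {
                for (size_t j = jj; j < jj + TILE; j++) {
                    output[j][i] = input[i][j];
                }
            }
        }
    }
}
```

The two outer loops walk over tiles and the two inner loops perform the original element-wise transpose within a tile, so neither matrix is traversed with a long stride for more than `TILE` steps.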

a. (5 pts) What block size should be used to completely fill the data cache with one input block and one output block if the level 1 cache is a fully associative 64 KB cache?
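One way to approach part (a) is to find the largest square tile side b for which two b × b tiles of 8-byte elements fit in the cache, that is, 2 · b · b · 8 ≤ cache size. The helper below is a hypothetical sketch of that arithmetic, not an official solution; it assumes square tiles and ignores alignment of tile rows to 64-byte cache blocks.

```c
#include <stddef.h>

/* Largest tile side b such that one input tile and one output tile,
 * each b x b elements of elem_bytes bytes, fit in cache_bytes. */
size_t max_tile_side(size_t cache_bytes, size_t elem_bytes)
{
    size_t b = 0;
    while (2 * (b + 1) * (b + 1) * elem_bytes <= cache_bytes) {
        b++;
    }
    return b;
}
```

Under these assumptions, `max_tile_side(64 * 1024, 8)` gives 64, since 2 · 64 · 64 · 8 = 65536 bytes fills a 64 KB cache exactly; for the 16 KB cache in the setup the same calculation gives 32.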

b. (5 pts) What is the minimum associativity required of the level 1 cache for consistent performance independent of both arrays’ position in memory?

c. (10 pts) Assume you are designing a hardware prefetcher for the unblocked matrix transposition code above. The simplest type of hardware prefetcher only prefetches sequential cache blocks after a miss. More complicated "nonunit stride" hardware prefetchers can analyze a miss reference stream, and detect and prefetch nonunit strides. Assume prefetches write directly into the cache and cause no pollution (that is, they do not overwrite data that needs to be used before the prefetched data). For best performance given a nonunit stride prefetcher, in the steady state of the inner loop, how many prefetches need to be outstanding at a given time?
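A common back-of-the-envelope estimate for part (c) divides the miss latency by the time between new block requests. The sketch below assumes the strided stream touches a new cache block on every inner-loop iteration (4 cycles) and that each prefetch takes 16 cycles; the function name is hypothetical, and this is an estimate under those assumptions rather than an official answer.

```c
/* Number of prefetches that must be in flight so that a block requested
 * now arrives by the time the loop needs it: ceil(latency / interval). */
unsigned prefetches_in_flight(unsigned miss_latency_cycles,
                              unsigned cycles_between_requests)
{
    return (miss_latency_cycles + cycles_between_requests - 1)
           / cycles_between_requests;
}
```

With the stated numbers, `prefetches_in_flight(16, 4)` evaluates to 4 for the strided stream. The unit-stride stream needs a new block only once every 8 iterations, and a level 2 cache that accepts a request every 2 cycles can keep up with roughly one request every 4 cycles, so bandwidth should not be the limit under these assumptions.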
