## Homework Assignments

Assignment 1 (due September 5, 2012):
1. Patterson and Hennessy, Revised Fourth Edition, Exercise 1.2
2. Patterson and Hennessy, Revised Fourth Edition, Exercise 1.7
3. Use the table for Patterson and Hennessy, Revised Fourth Edition, Exercise 1.16.1, to calculate the execution time for a single processor for routines A-E.
Assignment 2 (due September 10, 2012):

Read Dennard et al., IEEE Journal of Solid-State Circuits, Vol. SC-9, pp. 256-268 (1974)

1. Patterson and Hennessy, Revised Fourth Edition, Exercise 1.4
2. Patterson and Hennessy, Revised Fourth Edition, Exercise 1.6
3. Patterson and Hennessy, Revised Fourth Edition, Exercise 2.1
4. Patterson and Hennessy, Revised Fourth Edition, Exercise 2.7
5. The following questions concern simple applications of queueing theory:
1. If an M/M/1 queue in a server has task arrivals at a rate of 30 per second and serves at a rate of 50 per second, how many tasks are in the system (waiting and being served) on average?
2. In part (a), how many tasks are being served, on average?
3. What is the utilization of an M/M/1 queue in a server that has four tasks waiting on average?
6. A certain program takes 26.67 seconds to run on 3 processors and 16 s to run on 7 processors. Find the execution time on one processor, the fraction of the program that can be parallelized, the theoretical execution time on an infinite number of processors, and P1/2.
7. Evaluate the average gate capacitance of, and power dissipated by, a processor with the following properties:
• Gate oxide thickness = 1.4 nm
• Gate length = 65 nm
• Gate area = (65 nm)×(65 nm)
• Permittivity of free space = 8.85×10-12 Farads/m
• Relative permittivity of SiO2 = 3.9
• Rail-to-rail voltage = 1.1 V
• Clock frequency = 3.6 GHz
• Switching probability = 1.0
• Number of transistors per die = 1.3×109
Assignment 3 (due September 17, 2012):
1. Use the method on p. 16 of Professor Cantrell's notes on floating-point data representations to convert the decimal fraction 0.1416 to a correctly rounded binary fraction with 10 bits. Using a calculator, find the numerical error, i.e., the difference (in base 10) between 0.1416 and the binary fraction that you have computed.
2. The following numbers are hexadecimal representations of IEEE-754 single-precision floating-point numbers. (Note that these representations use an implicit 1 bit.) In each case, give the binary equivalent of the hexadecimal form, the unpacked binary form (that is, the binary form in which the implicit 1 bit is written out explicitly), and the decimal equivalent:
```BF 80 00 00
3F 80 00 00
40 00 00 00
40 40 00 00
```
3. Write and run a program for your computer that uses IEEE-754 single-precision floating-point arithmetic to:
1. Initialize the value of the floating-point variable x to 1.0, and
2. divide x by 2.0 repeatedly, writing the result at each step, until the result underflows to zero.
Explain your output in detail. Does your program allow for computation with denormalized numbers?
4. Download the QtSpim simulator for the MIPS R3000 architecture and install it on your computer. Refer to Dr. Dodge's QtSpim installation notes, but ignore his comments about QtSpim on the Mac. Run the test programs test1.s and test2.s to verify that your installation has succeeded. Submit a screenshot of QtSpim taken after running test2.s.
Assignment 4 (due September 24, 2012):
Read Patterson and Hennessy, Revised Fourth Edition, Chapter 3 and Appendices C and D (on the CD-ROM included with the book)
1. Patterson and Hennessy, Revised Fourth Edition, Exercise 2.6, parts 1, 2, 3
2. Patterson and Hennessy, Revised Fourth Edition, Exercise 2.10, all parts
3. Patterson and Hennessy, Revised Fourth Edition, Exercise 2.13, parts 1, 2, 3
4. Patterson and Hennessy, Revised Fourth Edition, Exercise 2.16, part 1
5. Patterson and Hennessy, Revised Fourth Edition, Exercise 2.27, parts 1, 2, 3
6. What is the ALU result if the 4-bit ALU Control signal is 0100? What happens if the ALU Control signal is 0101?
7. Download and install on your computer the PathSim simulator for the single-cycle implementation of the MIPS subset discussed in Sections 4.1 through 4.4 of the text. Work through Lab A (that accompanies the simulator) for the instruction lw \$a2, 8(\$t1) and include the answer sheet in your homework. (To get to the documentation and the lab exercises, you have to click on "About PathSim".)
Assignment 5 (due October 1, 2012):
Read Patterson and Hennessy, Revised Fourth Edition, Chapter 4
1. The ALU for bit 31 shown in the book and in class supported the set on less than (slt) instruction using just the sign bit of Result31. Try this operation on the decimal values a = -7 and b = 6, using a 4-bit, 2's complement representation. Show that this example produces an incorrect value of the Set output. Therefore the Overflow output must also be used to compute Set. Modify ALU31 to handle slt correctly. Show your design on a copy of the figure. (P&H, Second Edition, Problem 4.23)
2. Write a sequence of MIPS instructions to determine if there is a carryout from the addition of two registers, such as \$t3 and \$t4. Put a 0 or 1 in register \$t2 if the carryout is 0 or 1, respectively. [Hint: This can be done in two instructions.] (P&H, Second Edition, Problem 4.17)
3. Patterson and Hennessy, Revised Fourth Edition, Exercise 3.12.2.
4. On a copy of the datapath for R-type instructions, show the hexadecimal values of all data signals for the instruction sub \$5, \$4, \$3, assuming that register 3 contains the two's complement representative of -13 (base 10) and register 4 contains the two's complement representative of 4 (base 10).
5. On a copy of the datapath for store instructions, show the hexadecimal values of all data signals for the instruction sw \$4, -24(\$29), assuming that register 29 contains 0x7ffffffc (base 16) and register 4 contains the unsigned integer representative of 268,468,224(base 10).
Assignment 6 (due October 8, 2012):
Read Patterson and Hennessy, Revised Fourth Edition, Chapter 4
1. Patterson and Hennessy, Revised Fourth Edition, Exercise 4.1
2. On a copy of the datapath and control for the multicycle implementation, show the hexadecimal values of all data and control signals in clock period 13 of execution of the following code:
```       lw \$t2, 0(\$t3)
lw \$t3, 4(\$t3)
beq \$t2, \$t3, Label # Assume branch not taken
sw \$t5, 8(\$t3)
```
Assume that register t3 contains 0x10000000, M[0x10000000] = 2310, and M[0x10000004] = -2110.
3. Repeat Problem 2 for clock period 21.
4. What is the least significant bit of the exception cause register after the control detects that Current State[3-0] = 0001 and Opcode[5-0] = 010101?
5. Patterson and Hennessy, Revised Fourth Edition, Exercises 4.10.1-4.10.3 (assume the single-cycle implementation).
6. In this exercise, you will take the first steps toward adding the instruction
`addi`
(add immediate) to the multicycle implementation discussed in the notes. This instruction is documented in the SPIM instruction reference. Add any necessary datapaths and control signals to the datapath and control for the multicycle implementation, and add any necessary states to the multicycle finite state machine.
Assignment 7 (due October 15, 2012):
Review Patterson and Hennessy, Revised Fourth Edition, Chapter 4. Begin reading Chapter 5.
1. We wish to add the datapath parts and control signals needed to implement the jal instruction in the multi-cycle datapath and control shown in the figure. On a copy of the figure, show the necessary additions to the datapath and control lines. There are multiple solutions; choose the solution that minimizes the number of clock periods needed for this additional instruction. This instruction is documented in the SPIM instruction reference.
2. Show the additions to the multicycle finite state machine that will be necessary to implement the jal instruction.
3. Make a table that shows the values of all datapath and control signals in all clock periods of multicycle execution of the instruction j Imm, where the 26-bit immediate value Imm is equal to the least significant 26 bits of 0x2bf 800d. Assume that the 4 most significant bits of the program counter are 0000. It will be helpful to refer to the table of steps in executing an instruction.
4. Make a table that shows the values of all datapath and control signals in all clock periods of multicycle execution of the instruction add \$5, \$4, \$3, assuming that register 3 contains the two's complement representative of -13 (base 10) and register 4 contains the two's complement representative of 4 (base 10).
5. Make a table that shows the values of all datapath and control signals in all clock periods of multicycle execution of the instruction lw \$5, -32(\$11), where register 11 contains 0x1000 20ec.
Assignment 8 (due October 22, 2012):
Finish reading Patterson and Hennessy, Revised Fourth Edition, Chapter 5.
1. On copies of pages 32-37 of the notes on pipelining, show the values of all data signals. Assume that (\$1) = 0x1000 0000, (M[0x1000 0014]) = 0x7fff fffc, (\$2) = 0x0000 000e, (\$3) = 0x0000 0008.
2. On a copy of the figure of the MIPS pipeline datapath and control (without forwarding), show the binary or hexadecimal values of all data and control signals in the ninth clock period of execution of the following instruction sequence:
```     lw \$10, 68(\$24)
lw \$11, 324(\$24)
sll \$0, \$0, 0
sll \$0, \$0, 0
sll \$0, \$0, 0
sll \$0, \$0, 0
sll \$0, \$0, 0
sll \$0, \$0, 0
sll \$0, \$0, 0
sll \$0, \$0, 0
sw \$12, 580(\$24)
```
Assume that (\$24) = 0x10000000, (M[0x1000 0044]) = 0x7fff fff9, (M[0x1000 0144]) = 0x0000 000d.
Extra credit: What plausible engineering application do the load, add and store instructions in this code have? [Hint: Indexed addressing is often used to access elements of an array or stack.] Also, what is the hexadecimal encoding of the instruction sll \$0, \$0, 0?
Assignment 9 (due October 29, 2012):
Study Patterson and Hennessy, Revised Fourth Edition, Chapter 5.
1. [30] How many clock periods does the following code segment take given a) no pipeline, b) a 5 stage pipeline assuming one clock period per instruction, but no data forwarding, c) a 5 stage pipeline with data forwarding as described in P&H, Chapter 4, section 7.
```         or   \$t1,\$0,\$a2
or   \$t3,\$0,\$a0
or   \$t4,\$0,\$a1
lw   \$t5,0(\$t7)
lw   \$t6,0(\$t8)
mul  \$t2,\$t5,\$t6

```
2. [20] Find the data hazards in the following code segment, and show the hazards on a pipeline diagram, as in P&H, Fig. 4.52:
```    la   \$t1,i1
or   \$t2,\$t3,\$t1
ori  \$a0,\$a0,42
```
3. [10] Reorder the following code segment to remove the data hazards. Assume that data forwarding takes place:
```     lw  \$t0,24(\$a0)
sub \$t4,\$t4,\$t0
sub \$t8,\$t8,\$t3
mul \$t7,\$t7,\$t1
```
4. [10] What is the CPI for the reordered sequence of instructions in the preceding problem?
5. [20] Find the control hazards in the following code segment, and show the hazards on a pipeline diagram::
```        la   \$t0, ar2
lw   \$t1, size
lw   \$t2, nrows
lw   \$t3, ncols
addi \$t4, \$t2, -1  # nrmax
addi \$t5, \$t3, -1  # ncmax
ori  \$t6, \$0, 0    # initialize row index to 0
lwc1 \$f0, val
mfc1 \$s4, \$0
rloop:  mul  \$t9, \$t6, \$t3 # multiply rindex by ncols
mul  \$t9, \$t9, \$t1 # multiply by size of one array element to get roffset
ori  \$t7, \$0, 0    # initialize column index to 0
cloop:  mul  \$s0, \$t7, \$t1 # multiply cindex by size to get coffset
add  \$s1, \$s0, \$t9 # offset of ar2[rindex][cindex] = roffset + coffset
add  \$t8, \$s1, \$t0 # address of ar2[rindex][cindex] = offset + base
sw   \$s4, 0(\$t8)   # store val in ar2[rindex][cindex]
addi \$t7, \$t7, 1   # increment the column index
sub  \$s2, \$t5, \$t7 # nc = ncmax - cindex
bgez \$s2, cloop    # branch back to cloop if nc >= 0
addi \$t6, \$t6, 1   # increment the row index
sub  \$s3, \$t4, \$t6 # nr = nrmax - rindex
bgez \$s3, rloop    # branch back to rloop if nr >= 0
ori  \$v0, \$0, 10   # reach here if row loop is done
syscall            # end of program!
```
6. [10] Which instruction could do useful work in the branch delay slot, without changing the computational results in any way?
```          or    \$t0,\$0,\$a0          # Reg. t0 points to the array element
or    \$t1,\$0,\$a2          # Reg. t1 is a counter
loop:     sw    \$a3,0(\$t0)          # Store the value into the array element
add   \$t0,\$a1,\$t0         # Increment the pointer by the value of size
addi  \$t1,\$t1,-1          # Decrement the counter
bgtz  \$t1, loop           # branch back to loop if counter >= 0
#  (since we store at the head of the
#   loop, we compute one more address
#   than necessary just to reduce
#   the number of compares & branches)
beamup: jr \$ra                      # Beam me up....
```
7. [20] Find the number of clock periods required for execution of the following code segment, assuming both data forwarding and early branch detection::
```lw \$t2, 0(\$t3)
lw \$t3, 4(\$t3)
beq \$t2, \$t3, Label # Assume branch not taken
sw \$t5, 8(\$t3)
```
Comment: Pipelining this code segment is tricky because a branch follows a load. You will need to refer to the detailed diagrams of forwarding after a load in the lecture slides, as well as slide 102. You will also need to decide whether the branch should be detected in the RF stage or the EX stage.
Assignment 10 (due November 5, 2012):
Study Patterson and Hennessy, Revised Fourth Edition, Chapter 5. Start reading Chapter 6.
1. Study the very last slide in the notes on pipelining. Your task is to determine as much as you can about the five instructions in the five pipeline stages. If you do not have enough information to fill in a field of an instruction, explain why. For some fields, it will be easiest to decode the machine instruction into assembly language using the SPIM instruction reference. For other fields it will be easiest to look at the control line settings on pp. 40 and 42 of the pipelining notes. (This exercise in reverse engineering is adapted from Patterson and Hennessy, 2nd edition, problem 6.9.)
2. You are given the information that a particular program has stride 10 in an 8-way interleaved memory system and a bank cycle time of 5 clock periods after an access occurs. Complete the following table:

 Memory address Bank accessed Clock cycle when accessed 8001 1 1 8011 3 2 8021 5 3 8031 7 4 8041 1 7 8051 8061 8071 8081 8091 8101 8111
Verify that, for a stream of addresses with stride 10, each access to bank 1 after the first one is delayed by 2 clock periods. Also verify that the average access time is 1.5 clock periods per memory access.
Assignment 11 (due November 12, 2012):
1. Design a 2-way set-associative cache with 16-byte cache lines and a data capacity of 16 kB (kilobytes).
2. For the cache that you designed in Problem 1, find the values of c, d, and s, as defined on p. 35 of the slides on hierarchical memory. Also find the number of bits in the tag, index and byte offset fields, assuming 32-bit addressing.
3. Assume a 1-bit valid field, and find the total size in bytes of the cache that you designed in Problem 1, including the valid, tag and data bits. Draw a diagram of your design similar to p. 38 of the slides on hierarchical memory.
4. This problem explores the performance consequences of having only a single level of cache in a modern processor with 40 ns DRAM column access time and a clock period of 400 ps (clock frequency 2.5 GHz). You are given the following data: A particular processor with separate data and instruction caches would have a CPI of 1.00 if all data and instructions were instantly available. 20% of the instructions executed are reads and 20% are writes. The penalty for main memory access is 100 clock periods for both data and instructions. The probability of an instruction cache miss is 2%. The probability of a data cache read miss is 5%. The data cache has a write-back, write-allocate organization. Find the resulting CPI including cache miss penalties.
Assignment 12 (due November 26, 2012):
1. A hypothetical processor has a cache that is connected to main memory over a 16×PCIe bus. One lane of a PCIe bus has a bandwidth of 250 MB/s in each direction, i.e., 250 MB/s for reads and 250 MB/s for writes.
The hit rate of this particular cache is 90%. On average, there are 2×109 cache references per second, of which 75% are reads and 25% are writes. At any given time, on average, 25% of the lines in the cache are "dirty", i.e., have been modified but not written to main memory. In this architecture, a word is 32 bits long and a cache line is 16 bytes (4 words) long. The address length in this architecture is 32 bits (1 word). Calculate what percentage of the available bus bandwidth the traffic between the cache and memory consumes, assuming that the cache is
1. write-back/write-allocate, or
2. write-through/write-allocate.
Considering the bus as a queueing system, calculate the mean number of items in the queueing system in the two cases above. Discuss whether it would be feasible to add another high-bandwidth device on the bus.
2. Derive a formula for the MTTF (mean time to failure) of a hard drive in terms of the probability of failure per unit time, assuming that failures obey a Poisson distribution as described on slide 63 of the notes on performance analysis.
3. Show that the MTTF for N disks is equal to 1/N times the MTTF for 1 disk, using the approach of the previous problem.
4. Application programmers usually worry about network packet loss probabilities, not bit error rates. The purpose of this exercise is to relate the packet loss rate (due to bit errors) to the bit error rate. (Packets can be lost for other reasons, such as buffer spill, router crashes, etc.)

Let the bit error probability on a certain point-to-point link be b. We will assume that b is very small compared to 1. We will assume, also, that errors in different bits of a frame are uncorrelated, and that all frames have exactly N bits.

1. Derive a formula in terms of N and b for the probability (let's call it P) of getting exactly one error in a frame of N bits.
2. Show that, if N is very large compared to 1 and Nb is of order 1 or smaller, then the probability of getting an error in exactly one bit of an N-bit frame is approximately equal to Nbe-Nb.
Hint: You will need to use a formula from the limits unit of your calculus class.
3. Assuming that N = 12000 (which is correct for 1500-octet frames), evaluate P for the following cases:
1. b = 10-5 (e.g., a modem link)
2. b = 10-9 (e.g., an old, slow fiberoptic link)
3. b = 10-12 (e.g., a fast fiberoptic link)
4. Estimate the number of bad frames per second, assuming that the bit rates listed in part (c) are as follows:
1. 53 kb/s
2. 1 Gb/s
3. 100 Gb/s
For which case does the bit error rate most need improvement?
Assignment 13 (due December 3, 2012):
Finish reading Patterson and Hennessy, Revised Fourth Edition, Chapter 7.
1. You are running a long, time-consuming simulation on a multicore server that is heavily used by all the engineers in your design group. Wanting to make optimal use of this shared resource, you make some measurements to determine how many cores your simulation can effectively use. You find that one run of your program takes 36 hours to run on 2 cores and 15 hours to run on 8 cores. How many cores should you specify for your program to use, and how long will one run take on that number of cores?
2. Patterson and Hennessy, Revised Fourth Edition, Problem 7.3.
3. Patterson and Hennessy, Revised Fourth Edition, Problem 7.6.
4. (Modified from Patterson and Hennessy, Revised Fourth Edition, Problem 7.12)
Consider the following three CPU organizations:

SS CPU: A 2-core superscalar microprocessor that provides out-of-order issue capabilities on two functional units (FUs) per core. Only a single thread can run on each core at a time.

MT CPU: A fine-grained multithreaded processor that allows instructions from two threads to run concurrently on two functional units. However, only instructions from a single thread can be issued on any cycle.

SMT CPU: A symmetric multithreaded processor that allows instructions from two threads to run concurrently on two functional units. Instructions from either or both threads can be issued to run on any clock period.

Assume that we have two threads, X and Y, to run on these CPUs. The threads include the following instructions, each of which takes one clock period (CP) to execute unless otherwise noted, or unless there is a hazard.

 Thread X Thread Y A1 -- takes 2 CPs to execute B1 -- no dependencies A2 -- depends on the result of A1 B2 -- conflicts for a FU with B1 A3 -- conflicts for a FU with A2 B3 -- no dependencies A4 -- depends on the result of A2 B4 -- depends on the result of B2

1. Assume that you have one SS CPU. How many CPs will it take to execute these two threads?
2. Next assume that you have one MT CPU. How many CPs will it take to execute these two threads?
3. Next assume that you have one SMT CPU. How many CPs will it take to execute these two threads?

 | Home | Prof. Cantrell | Teaching | Research | PhoTEC | Vita | Contact | PGP key | Office | Links |

 | EE 2310 | EE 4301 | EEOP 6334 | EEDG 6345 | EERF 6351 | EEGR 6481 | Ph.D. Graduates | Other courses | Recent courses |