University of Melbourne Department of Computer Science and Software Engineering 433-313 Computer Design Semester 2, 2003 Project 2 Aim The aim of this project is to give you experience with advanced processor implementation techniques and with low level code optimization issues. Deadline The deadline for this project is 11pm on Thursday, 23 October. Specifications Your task in this project is to optimize a small program fragment for execution on four different implementations of the DLX architecture. The four implementations are named Machine PipeMem, Machine PipeFP, Machine SuperMem and Machine SuperFP. The machines with Pipe in their name are only pipelined, while the machines with Super in their name are pipelined and superscalar. The machines with FP in their name have relatively better floating point hardware, while the machines with Mem in their name have relatively better memory systems. All four machines have the usual five-stage DLX pipeline, complete with all possible bypass paths. All four machines have the same set of four functional units. These are the following. o The integer ALU executes all integer arithmetic instructions and comparisons, and computes memory addresses for loads and stores. It takes one cycle for all operations. o The floating-point no-op unit copies its one input to its output. It is used in the implementation of the instruction (movf) that copies one FP register to another, taking one cycle. o The floating-point adder performs FP additions and subtractions. In Machine PipeFP and Machine SuperFP, the adder takes two cycles to produce its result, and is fully pipelined, i.e. it is able to start a new operation every cycle. In Machine PipeMem and Machine SuperMem, the adder takes two cycles to produce its result, but is not pipelined, i.e. it cannot start a new operation while another one is in progress. o The floating-point multiplier performs FP multiplications. In Machine PipeFP and Machine SuperFP, the multipler takes four cycles to produce its result, and is partially pipelined, which in this case means that it is able to start a new operation only two cycles after starting the previous one. In Machine PipeMem and Machine SuperMem, the multipler takes six cycles to produce its result, and is not pipelined. Each machine has exactly one of each of these functional units; no functional unit is duplicated. For instructions that use one of the three floating point functional units, the execute stage is followed immediately by the writeback stage, as shown on slide 26 of slide set 4 of the lecture notes. Loads and stores use the integer ALU to compute the memory address. The amount of time loads spend in the ME stage depends on the machine type: Machine PipeMem and SuperMem spend five cycles, whereas Machine PipeFP and SuperFP spend seven cycles. This is meant to reflect the time taken to get the items required from L2 cache, with the L2 cache being faster in the machines that emphasize the memory system. This is realistic to the extent that most number crunching programs work with data that is too big to fit into primary caches, and they have low enough locality that by the time an item that was loaded into the L1 cache once upon a time is accessed again, it has been evicted from the L1 cache. The assumption that all accesses take the same amount of time is of course unrealistic, but it is required by the limited capabilities of the simulator discussed below. (In particular, by that fact that the simulator only simulates pipelines, not caches.) Stores spend only one cycle in the ME stage in all machines, since the write through to the L2 cache and (if necessary) to memory can be handled by a write buffer. At most one load or store instruction can be in its ME stage at any one point in time. o Machine PipeMem and Machine PipeFP can each fetch, decode and issue only one instruction each cycle. This means that at any point in time, only one instruction can be in the IF stage, only one instruction can be in the ID stage, and only one instruction can be in the first cycle of its EX stage. However, since the machine has more than one functional unit and some instructions take more than one cycle in the execute stage, it is possible even on these two machines that e.g. an FP multiplication, an FP addition, and an integer operation are all in their EX stage at the same time. (Since the integer operation spends only one cycle in the EX stage, the FP multiplication and addition would have to be in later (non-first) cycles of their EX stages.) o Machine SuperMem and SuperFP can each fetch, decode and issue two instructions each cycle. This means that at any point in time, up to two instructions can be in the IF stage, up to two instructions can be in the ID stage, and up to two instructions can be in the first cycle of their EX stages, provided of course that their EX stages do not use the same functional unit. There are no restrictions on dual instruction issue except those imposed by structural constraints and the availability of data; e.g. there is no requirement that the first instruction be an integer instruction, and there is no requirement that the first instruction have an address that is divisible by eight. In all these machines, if a stall prevents an instruction from progressing from one stage to the next, it will stay in the stage it is now, and its presence in that stage may prevent the instructions following it from progressing to that stage. The writeback stage is an exception from this rule; we assume that the register files have enough write ports to handle in parallel all the writebacks that may be needed in a given cycle. Another effect you need to take into consideration is that DLX has delayed branches, with the branch delay being fixed as one instruction. (This means that if the branch is taken, DLX always executes exactly one instruction after the branch.) If a branch instruction is in its ID stage during cycle N, and the branch is taken, then the fetch of the instruction being branched to cannot start until cycle N + 1. This may cause stalls in superscalar machines. Programs are not allowed to put a branch instruction into the delay slot of another branch instruction. The simulator To help you in your work on this project, we have made available a simulator for the family of machines to which PipeFP, PipeMem, SuperFP and SuperMem belong. This simulator includes a function for each kind of DLX instruction that you may use in this project. Each of these functions simulates the effect of executing instructions of the corresponding type, and also calculates in which cycle(s) the various stages of the instruction will take place. You can get a copy of this simulator by executing the command "/home/subjects/313/local/p2/.create" in your 313/p2 directory. This command will copy a shell script named 313proj2 and some other files to your current directory, and it will create a subdirectory called ARENA used by the 313proj2 script. The input of the simulator implemented by the 313proj2 script consists of a file containing a C code fragment, in the format shown by /home/subjects/313/local/p2/example. This C code fragment contains, for each DLX instruction in the program, a call to the function that simulates the operation of that DLX instruction; the arguments of the call specify the instruction's operands. However, the DLX simulator functions for branch instructions do not implement branching themselves. Instead, the functions simulating the conditional branch instructions return a boolean that states whether the branch should be taken or not; performing the branch is the job of the C code. Since DLX branch instructions are delayed branches, they must take place after the execution of the instruction that follows the branch instruction itself. For an example of how this should be done, see below. The files in the ARENA subdirectory should work on all the departmental machines you have access to. If you switch machine types and thus compilers, you will also need to recompile everything; a "make clobber" followed by a "make" in ARENA will do this. The program fragment The program fragment this exercise is concerned with is the following: #define SIZE 128 float r[SIZE], x[SIZE], y[SIZE]; float xavg, ydiff; int i; ... for (i = 1; i < SIZE - 1 ; i++) { xavg = 0.5 * (x[i+1] + x[i-1]); ydiff = y[i] - y[i-1]; r[i] = (x[i] + xavg) * ydiff; } A simple compiler could translate the loop into the following DLX code, shown here in the format required by the simulator. /* ** on entry to each loop iteration, ** r1 contains the address of x[i-1] (initially x[0], since initially i=1) ** r2 contains the address of y[i-1] (initially y[0], since initially i=1) ** r3 contains the address of r[i-1] (initially r[0], since initially i=1) ** r4 contains the address of r[SIZE-2] ** f0 contains 0.5 ** ** inside each loop iteration ** f1 contains x[i-1] ** f2 contains x[i] ** f3 contains x[i+1] ** f4 contains y[i-1] ** f5 contains y[i] ** f6 contains x[i+1] + x[i-1] ** f7 contains xavg ** f8 contains ydiff ** f9 contains x[i] + xavg ** f10 contains r[i] */ loop: /* xavg = 0.5 * (x[i+1] + x[i-1]); */ lf(f3, 8, r1); /* load x[i+1] into f3 */ lf(f1, 0, r1); /* load x[i-1] into f1 */ addf(f6, f3, f1); /* put x[i+1] + x[i-1] into f6 */ multf(f7, f0, f6); /* put xavg into f7 */ /* ydiff = y[i] - y[i-1]; */ lf(f5, 4, r2); /* load y[i] into f5 */ lf(f4, 0, r2); /* load y[i-1] into f4 */ subf(f8, f5, f4); /* put ydiff into f8 */ /* r[i] = (x[i] + xavg) * ydiff; */ lf(f2, 4, r1); /* load x[i] into f2 */ addf(f9, f2, f7); /* put x[i] + xavg into f9 */ multf(f10, f9, f8); /* compute the future value of r[i] */ sf(4, r3, f10); /* store f10 into r[i] */ /* loop control */ addi(r1, r1, 4); /* point r1 at next element of x */ addi(r2, r2, 4); /* point r2 at next element of y */ addi(r3, r3, 4); /* point r3 at next element of r */ sge(r5, r3, r4); /* set r5 to nonzero iff r3 >= r4 */ branch_taken = beq(r5);/* branch to loop if r5 is zero */ nop(); if (branch_taken) goto loop;/* branch delay slot */ Note that the function simulating the conditional branch instruction is the only one that returns a value. This value is a boolean, and it is true if the branch should be taken. You act on this return value by adding a conditional goto in your C code after the instruction in the delay slot of the branch, this goto specifying the target of the conditional branch instruction. (If you do not put the conditional goto after the instruction in the delay slot of the branch, the results of the simulation will be invalid.) The simulation ends when execution reaches the end of your C code fragment. A reminder: in DLX, floats are four bytes and are stored in the FP registers fN, while integers, booleans and pointers are four bytes and are stored in the general purpose registers rN, with r0 being hard-wired to zero. The mnemonics of the instructions used above are the following: lf: load float, sf: store float, addf: add float, subf: subtract float, multf: multiply float, addi: add (integer) immediate, sge: set if greater than or equal, beq: branch if equal to zero, nop: no operation. The mnenomics of all the instructions you may use are listed in the file /home/subjects/313/local/p2/instrs.) Hints The first question of the set of tutorial questions dealing with pipelining looks at a similar problem. The answers for all the tutorial questions are on the 313 web page. Questions If you have any questions about any aspect of the project, you should post them to the newsgroup cs.313 instead of sending email, so that other students can also read the answers. Submission and marking You should submit four files via the command submit 313 2. The four files you must submit are pipe_fp,v, pipe_mem,v, super_fp,v and super_mem,v. They should be RCS files containing your DLX code for the program fragment above optimized for the respective machines. While we will normally look at only the last version of each of these files, we reserve the right to look at the development histories of the RCS files in cases of disputed authorship. (Your submission is of course supposed to reflect your own unaided work.) Each file should start with a comment identifying the author, describing how you transformed the original code into your code. and identifying the main factor limiting the speed of your code. This should be followed by your DLX code itself in a format appropriate for input to the simulator. This means that the C language constructs you may use are restricted to labels, calls to functions that simulate DLX instructions, and conditional gotos. (During development, you may of course insert any other code, such as debugging printf statements, into the code as well.) You may of course provide other comments, but these should not be necessary; your comments at the top should explain everything noteworthy about your code. The first criterion of assessment is correctness; incorrect code will get no marks. The second criterion is performance; the fewer cycles your programs takes to compute the right results, the more marks they will get. We will announce the level of performance required to get maximum marks later in the semester. This level will not require the loop body to be unrolled more than once. This project is worth 12% of the marks for the unit, divided up as 4 marks apiece for pipe_fp and pipe_mem, and 2 marks apiece for super_fp and super_mem. Late submissions will incur a penalty of 2 marks per working day unless you get permission for a late submission in advance. Note that machine breakdowns (within reason) are an accepted fact of life, and you must learn to deal with them. (One way you can do this is by starting work on your projects as soon as possible.) Therefore they will not be acceptable as an excuse for not having a project completed by the due date.