University of Melbourne Department of Computer Science and Software Engineering 433-313 Computer Design Semester 2, 2003 Project 1 Aim The aim of this project is to give you experience with the simulation of cache systems. Deadline The deadline for this project is 11pm on Thursday, 18 September. Specifications You must write the main part of a program that simulates the operation of a cache and prints out statistics gathered by the simulation. Such programs are often used by computer designers to decide which of several cache designs performs best under a given workload. This of course requires the program to be able to simulate all the cache designs being considered. The family of designs you should simulate all consist of a two-level cache system with a split primary cache and a unified secondary cache. The primary instruction cache is of course-read only (you need not consider self modifying code). The primary data cache and the secondary cache both use write allocate, with the primary data cache being write through and the secondary cache being write back. In all these caches, the replacement policy has two parts. First, if one or more of the cache blocks you can choose to evict is invalid, then you should choose one of those invalid blocks; if there is more than one, then choose the one in the lowest numbered bank. Second, if all of the cache blocks you can choose to evict are valid, then you should get the provided replacement function to choose the block to evict. The different members of the family are distinguished by their use of different values for each of the following parameters, which can be set independently for each of three caches in the system (L1 instruction, L1 data, L2 unified). o The block size of each of the three caches can be any power of 2 between 4 bytes and 64 bytes, subject to the constraint that the size of a block in one of the primary caches cannot be bigger than the size of a block in the secondary cache. o The number of blocks in each bank can be any power of 2 from 1 upwards. All the banks in a given cache contain the same number of blocks. o The number of banks in each cache (the cache's associativity) can be any number from 1 to 4. o The index function used by each bank of each cache can be chosen independently from a supplied set of index functions. Skewed associativity, the use of different index functions for different banks of a cache, is a relatively new cache design technique. It is based on the observation that in (e.g.) a conventional 2-way associative cache frequent access to 3 or more memory blocks whose addresses differ by a multiple of the cache size will cause many conflict misses, since they map to the same set. In some kinds of programs, including programs that use several data structures (e.g. arrays) whose size is a power 2, this happens relatively frequently. One part of the problem is that if two memory blocks map to the same cache block in one cache bank, with a conventional associative cache they also map to the same block in all other banks as well. A skewed associative cache avoids this by having different index functions for each cache bank, and choosing these functions so that two memory blocks that map to the same cache block in one bank have a low probability of mapping to the same cache block in other banks. This can make it possible for a skewed two-way associative cache to have a lower miss rate than a conventional three- or four-way associative cache. This can be especially important for off-chip caches since the connection between a CPU and an off-chip cache may have enough pins available if the off-chip cache is two-way associative (whether skewed or not) but not if its associativity is higher than that. It can also be important for an on chip cache, because lower associativity requires less multiplexing to select the cache block containing the right memory block, and thus reduces the hit time. This potential reduction in multiplexing time must be balanced against the increase in time required to compute the index function, which is also part of the hit time. The index function should therefore to have simple hardware implementations that introduce very few gates or delays. Suppose a cache block number has n bits, while memory block numbers have 2n or more bits. Then one good class of index functions is functions following the scheme cache block number in bank k = rotate(k, permute1(group1)) xor permute2(group2) where group1 and group2 are two distincts groups of n bits from the block number, permute1 and permute2 are two different n-bit randomizing permutation functions, and rotate is a function that rotates an n bit number by k bits (whether it rotates it left or right does not matter in this case). The permutations and the rotation require no gates and add no delays; the exclusive OR requires only n gates and only one level of gate delays. Software implementation of the permutations may be more expensive, but one can minimize the cost by precomputing each function's value for all possible inputs into a table, and then just doing table lookups. (This approach also allows the table to be printed, which can be useful in checking that the function is implemented correctly.) Software architecture The code in the simulator can be divided into three layers: top, middle and bottom. We supply the top and bottom layers; you must write the middle layer. The supplied top layer o processes the command line arguments, which specify the characteristics of the cache system to be simulated on this run, as well as information about the costs of certain operations for use in gathering statistics; o reads a trace, which is a sequence of the memory accesses performed by one particular run of one particular program; and o prints statistics about cache performance once the entire trace has been processed. The middle layer, which you will write, should contain the following two functions: o initialize is called by the top layer just once, at the start of the simulation. It tells the middle layer the characteristics of the cache system to be simulated on this run. o process_access is called by the top layer once for each memory access in the trace. It specifies whether the access is an instruction read, data read, or data write, it gives the address at which memory is being accessed, and it gives the number of bytes being accessed starting at that address. You may assume that the access size will always be either 1, 2 or 4, and that the access will always be aligned, i.e. the address will always be a multiple of the access size. In simulating the behavior of the cache, your process_access function should call the functions in the bottom layer supplied by us. These functions perform three tasks: o They collect statistics about the performance of the cache, to be printed out by the top layer at the end of the simulation. o They can be asked to print a record of the actions performed by the cache, which can help you in debugging your code. o This record of the actions performed by your code can be used to help us mark your code. The functions in the bottom layer that your code should call in the relevant circumstances are the following. o mem_to_l2 should be called when you perform a load from main memory into the L2 cache. Its parameters specify the memory address you start loading from, the number of bytes you are loading, and which block of which bank of the L2 cache you are loading into. o l2_to_l1 should be called when you perform a load from the L2 cache into one of the L1 caches. Its parameters specify which block of which bank of the L2 cache you are loading from and starting at what offset inside that L2 cache block, the number of bytes you are loading, and which block of which bank of which L1 cache you are loading to. o update should be called when you update the contents of cache blocks in the L1 data cache and in the L2 cache as part of the simulation of a memory access that is a data write. Its parameters specify, for each of the L1 data cache and the L2 cache, how many bytes you are modifying, starting at what offset, in which block of which bank of that cache. o evict should be called when you evict a cache block. Its parameters specify which block of which bank of which cache is being evicted, and whether the eviction requires a writeback to memory of the block being evicted. o replacement should be called when all the blocks in given cache into which you can put a memory block are full (i.e. they are valid), and you need to decide which block's contents you will evict. Its parameters specify which cache you are performing the replacement in and how many banks it has; its return value gives the number of the bank that the data being loaded should be put into. o The bottom layer also contains some index functions, which take as their arguments a memory block number and the number of blocks in a cache bank, and return the number of the one block in that cache bank into which the specified memory block can be placed. (Note that the block number of a given address in memory depends on the block size, which can be different in different caches.) Your code will not call these index functions by name. Instead, the data structures passed to the initialize function will give, for each bank of each cache, a pointer to one of these functions. Your code must therefore call these functions indirectly, through these pointers. Your code must always evict a cache block before loading some other value into it (using a mem_to_l2 or l2_to_l1 operation). Your code must also perform all the loads required by the write allocate policy before it performs the update operation required by a data write memory access. Note that if a data write access will overwrite the entire contents of a cache block, then the write allocate policy does not require you to load anything into a cache block first, since whether you loaded the relevant memory block into that cache block will not make any difference to the final result. For this project, you should make sure that write allocate loads data into cache blocks (in either the L1 or L2 cache) only when necessary. Files You will be able to get a copy of the supplied code of this project by executing the command "/home/subjects/313/local/p1/create", anytime after the availability of this code is announced on the newsgroup cs.313, which will be sometime before the end of wednesday, august 20. That command will create a directory called "p1" in the current directory and copy several files into it. One of these files will be a README file describing which files contain what, and direct you to the file you must modify. Another of these files will be a Makefile, which describes the relationships between the various files needed for the project and the commands needed to make one file from others. The program "make" uses this information to decide what files are out of date and what needs to be recompiled or relinked. After modifying one or more source files, you should run "make proj1" to recompile them and relink proj1. The new p1 directory will also contain a link to a directory containing test data. (You do not need your own copies of the test data files.) You should test your program using the smaller datasets first. Only when you are reasonably sure of the correctness of your program should you try running it on the larger datasets. The supplied files should work on all the CS department machines you have access to, including the Dell servers running Solaris. If you switch machine types and thus compilers, you may need to change the compiler-related definitions in the Makefile. You will also need to recompile everything; a "make clobber" followed by a "make" will do this. Simulations using large input files can take large amounts of CPU time. To minimize the impact on your fellow students, we strongly recommend that you do your work on the multimedia lab PCs, which are used by a lot fewer people at any one time than the CASE lab Suns. Questions If you have any questions about any aspect of the project, you should post them to the newsgroup cs.313 instead of sending email, so that other students can also read the answers. Submission and marking Your code, the middle layer of the simulator, must be contained in a single file called sim.c, which implements the interface described in the supplied file sim.h. You must not modify sim.h or any of the other supplied files. You must use RCS to record your progress during the development of sim.c, and the file you must submit is the RCS file sim.c,v. While we will normally look at only the last version of sim.c in sim.c,v, we reserve the right to look at the development history of the RCS file in cases of disputed authorship, and we will be actively looking for submissions that are too similar. (Your submission is of course supposed to reflect your own unaided work.) RCS is covered in the subject 433-252 Software Development Principles and Tools, which is a prerequisite for 433-313 Computer Design. If you need to refresh your memory, look at slide set #8 and lab class #9 on the 252 web page. You should submit sim.c,v by running the command submit 313 1. To get full marks, you must submit your work before the deadline. If you submit your work after the deadline, you will need to use the command submit 313 1.late. Late submissions will incur a penalty of 1 marks per day (or part thereof) for the first two days after the deadline and a penalty of 2 marks per day (or part thereof) after that. unless you get permission for a late submission in advance. Note that machine breakdowns (within reason) are an accepted fact of life, and you must learn to deal with them. (One way you can do this is by starting work on your projects as soon as possible.) Therefore they will not be acceptable as an excuse for not having a project completed by the due date. We will supply a list of the test cases and a script to run your program on all those test cases. The marking scheme for this project allocates most of the marks to how many test cases your code passes. A working program, while important, is not the only criterion for assessment. The marking scheme allocates the rest of the marks for this project to the design of your solution; your programming style with respect to clarity, logical structure, and modularity; the robustness of your program; and your standard of documentation. The last is especially important: it is your responsibility to make us understand your programs. This project is worth 13% of the marks for the unit.