Showing posts with label Assembly. Show all posts
Showing posts with label Assembly. Show all posts

Thursday, 14 May 2020

Homework on Textbook Sections 4.1 to 4.6

For Questions 1 to 11: Questions 1 to 11 deal with the processor datapath and control shown in Figure 4.17 of the textbook, executing the following instruction, located at address 0x0000000000204014 in the instruction memory:
sd x7, 0x18(x24)

Question 1: What is the value of the Branch control signal?

Solution:


0

Question 2:  What is the value of the MemRead control signal?

Solution:


0

Question 3: What is the value of the MemtoReg control signal?

Solution:


X

Question 4: What is the value of the ALUOp control signal, expressed in binary?

Solution:


00

Question 5: What is the value of the MemWrite control signal?

Solution:


1

Question 6: What is the value of the ALUSrc control signal?

Solution:


1

Question 7: What is the value of the RegWrite control signal?

Solution:


0

Question 8: What is the output value of the ALU Control block, expressed in binary?

Solution:


0010

Question 9: What is the output value of the ALU source multiplexer, expressed in hex?

Solution:


0X0000000000000018

Question 10: What is the value of the Zero output flag of the ALU?

Solution:

0

Question 11: What is the output value of the branch target adder, expressed in hex?

Solution:


0X0000000000204044

For Questions 12 and 13: Suppose the blocks in Figure 4.17 of the textbook have latencies shown in the following table.

 Block Latency
 PC read 20ps
 PC setup 15ps
 I-Mem 250ps
 Add 70ps
 Shift-left-1
 5ps
 Mux 20ps
 Regs read 150ps
 Regs setup 15ps
 Imm Gen 10ps
 ALU 100ps
 D-Mem 350ps
 Control 80ps
 ALU Ctrl 30ps

Question 12: What is the minimum clock period for this processor (in ps)?

Solution:


905

Question 13: Adding a multiplier to the ALU results in 100ps additional latency. However, it reduces the number of instructions needed in a program, since multiplications no longer need to be emulated in software. What fraction of the original instruction count must the reduced instruction count be to match the performance of the original processor?

Solution:


0.9005

For Questions 14 to 16: A 5-stage pipelined version of the RISC-V processor has the following latencies for the stages, including the overhead for pipeline registers between stages:

 IF ID EX MEM WB
 500ps 300ps 400ps 600ps 100ps

Question 14: What is the maximum clock frequency (in GHz) for this processor?

Solution:


1.67

Question 15: Suppose you can divide the IF stage into two stages, each with latency 300ps, but with a cost increase of 5%. Similarly, you can divide the MEM stage into two stages, each with 350ps latency, but with cost increase of 5%. If you are only allowed to chose one of these options, what would the resulting maximum clock frequency be (in GHz)?

Solution:


2

Question 16: If you are allowed to chose both options from Question 15 based on best cost/performance, decide whether to chose one or both options. What is the maximum clock frequency (in GHz) based on your choice?

Solution:


2.5

Question 20: In Section 4.5 of the textbook, on page 272, it is suggested that the hardware for branch processing (register comparison, target address calculation and PC update) all be included in the second stage of the pipeline (the ID stage). What is the speedup of this hardware change compared to the pipelined datapath using the ALU and adder in the EX stage and completing the branch in the MEM stage? Assume that branches account for 17% of instructions, as suggested on page 270, that no branch prediction is used, that the average CPI for all other instructions is 1.2, and that there is no effect on clock period.

Solution:


1.25

Wednesday, 22 April 2020

Quiz on Textbook Sections 4.1 to 4.4

Q2: Why are multiplexers required, as shown in Figure 4.2? 
Solution: We cannot wire data lines from multiple sources together. Instead, we require a logic circuit element to select between the sources.

Q3: Match the logic design terms to their corresponding definitions. 
Solution:
Combinational element: The outputs depend only on the current inputs
State element: The outputs depend on current inputs and internal stored values
Edge-triggered clocking: All state changes occur on a clock edge
Control signal: Directs operation of a functional unit or selection by a multiplexer
Data signal: Contains information that is operated on by a functional unit

Q4: The register file is a state element that consists of a set of registers that can be read and written by supplying a register number to be accessed. 
Solution: True

Q6: The single-cycle datapath conceptually described in Section 4.3 must have separate instruction and data memories, because 
 Solution: the processor operates in one cycle and cannot use a single-ported memory for two different accesses within that cycle.

Q8: How does the datapath in Figure 4.17 determine the outcome of a beq instruction? 
Solution: The ALU subtracts the operands and asserts the Zero output if the difference is zero. That, ANDed with the Branch decode signal, controls the multiplexer to select the next PC value.

Q10: The clock cycle time for the datapath described in Section 4.4 is determined by the longest chain of functional units used for any instruction. All instructions thus take that amount of time to execute. For this reason, single-cycle implementations are the main form of processor used today. 
Solution: False 

Homework on Textbook Sections 2.9 to 2.14, 2.17, 2.19, 3.1 to 3.7, 3.9

Q1: Write a C function equivalent to the following RISC-V assembly language code.

clear: addi  x5, x0, 0
       addi  x7, x0, '-'
       jal   x0, L2
L1:    sb    x7, 0(x6)
       addi  x5, x5, 1
L2:    add   x6, x10, x5
       lbu   x28, 0(x6)
       bne   x28, x0, L1
       jalr  x0, 0(x1)

Solution: void clear (char s[]){
int i;
i=0;
while (s[i] != '\0'){
s[i] = '-';
i++;
}
}

Q2: What doubleword, in hexadecimal, is placed in register x8 by the following instructions?

    lui  x8, 0xA9344
    ori  x8, x8, 0x01C

Solution: 0xFFFFFFFFA934401C

Q3: What word, in hexadecimal, encodes the branch instruction in the following code sequence?

    beq   x9, x24, L1
    slli  x8, x7, 2
    add   x8, x8, x15
    ld    x8, 0(x8)
    sd    x8, 0(x23)
L1: addi  x7, x7, 1


Solution: 0x01848A63

Q4: Suppose the following code sequence is executed on two processors in parallel without synchronization:

ld   x6, 0(x10)  // load x
add  x6, x6, x6  // double x
sd   x6, 0(x10)  // store x

If the variable x is initially 8, what are the possible final values for x? Assume that memory only does one load or store at a time, and that a pending load or store waits until the memory is not busy.


Solution:  16 or 32
 
Q5: The swap procedure on page 135 of the textbook is called by the sort procedure in Figure 2.25 on page 139. By how much would the dynamic instruction count change per iteration of the inner loop if the swap procedure were inlined? 

Solution: 4 fewer instructions
 
Q6: Measurements of programs running on a processor show the following relative frequencies of execution and CPI values:
  • Arithmetic instructions: 50%, 1 cycle
  • Load instructions: 15%, 3 cycles
  • Store instructions: 10%, 2 cycles
  • Branch instructions: 25%: 2 cycles
The clock frequency of the processor is 2GHz.
Suppose we augment a processor’s instruction set by adding a load indexed instruction that forms the effective address by adding two register values. The new instruction allows a sequence such as:
add  rtmp, rs1, rs2
ld   rd, 0(rtmp)
to be replaced by a single instruction
ldx  rd, (rs1+rs2)   // Load from address rs1+rs2 to rd
20% of the loads in the original processor are preceded by add instructions such that the pair can be replaced by a ldx instruction in the new processor. The CPI for the ldx instruction is 3 cycles, but its inclusion slows down the clock frequency of the processor.
What is the minimum clock frequency in GHz required for the new processor to ensure its performance is at least that of the original processor? 


Solution: 1.963

Q7: Consider a 32-bit multiplier organized in a similar way to Figure 3.7 in the textbook. Suppose the time required for each adder is 4ns. How long (in ns) does the multiplier take to multiply two 32-bit operands? 
Solution: 20

Q8: Use the RISC-V instructions described in the textbook (Sections 3.3 and 3.4 and the green card) to write assembly language code for the following C statement. Assume all variables are of type int, with a, b, c and d in x10, x11, x12 and x18, respectively.
d = (a * b) % c;


Solution: mul x18, x10, x11
rem x18, x18, x12

Q9: Write the hexadecimal word for the IEEE 754 single-precision representation of the decimal +81.75. 

Solution: 0x42A38000

Q10: What decimal number does the hexadecimal word 0xBFF00000 represent as an IEEE 754 single precision floating point value? 

Solution: -1.875

Q11: Write RISC-V instructions for the following C statement, assuming the variables y and a are of type double and are in floating-point registers, x is an array of double with the base address in x9, and i is of type int in x18.
y = y + a * x[i];


Solution: 
Assumption: a is in f9, y is in f8 and f0 is used as temporary storage
slli x5, x18, 3
add x5, x5, x9
fld f0, 0(x5)
fmul.d f0, f9, f0
fadd.d f8, f8, f0

Quiz on Textbook Sections 3.1 to 3.7, 3.9

Q1: For what combinations of operands can signed addition overflow? 
Solution: Both operands positive or both operands negative
 
Q2: What usually happens when an overflow occurs during addition to calculate an address in a program? 
Solution: Nothing - the overflow is ignored.
 
Q3: How many clock cycles would the sequential multiplier shown in Figure 3.5 of the textbook take to multiply 64-bit operands? 
Solution: 64

Q4: A parallel multiplier using tree-structured stack of adders is much faster than a sequential multiplier, but at the cost of significantly more hardware resources. 
Solution: True

Q5: Can we construct a fast parallel divider is a similar way to the way we make a fast parallel multiplier? 
Solution: No, because we need to use the sign of the difference calculated at each step in order to perform the next step.

Q6: How many bits are used for single precision floating-point values? 
Solution:  32 bits: 1 sign bit, 8 exponent bits, and 23 fraction bits

Q7: The revised IEEE 754-2008 standard added a 16-bit floating-point format with five exponent bits. What do you think is the likely range of numbers it could represent?
 Solution: ±1.0000 0000 00×2^−14 to ±1.1111 1111 11×2^15, ±0, ±∞, NaN

Q8: The hardware for floating-point operations is significantly more complex than that for integer operations.
Solution: True



Q9: From the statements below, select those that correctly describe floating point instructions in the RISC-V instruction set. 
Solution: Floating-point arithmetic instructions operate on different registers to integer instructions

Q10: The Intel SE2 instruction extensions provide subword parallelism by what means? 
Solution: By having 128-bit wide registers and a floating-point operations on short vectors of 4 single-precision or 2 double-precision elements.

Quiz on Textbook Sections 2.9 to 2.14, 2.17, 2.19

Q1: Which of the following statements about characters and strings in C and Java is true? 
Solution: A string in C takes about half the memory as the same string in Java.

Q2: If memory address 0x0000000000000020 contains the doubleword 0x8786858483828180, what value is loaded into register x9 by the following instruction?

lb  x9, 0x21(x0)

Recall that RISC-V uses little-endian addressing.

Solution: 0xffffffffffffff81
 
Q6: When do you use primitives like load reserved and store conditional? 
Solution: 
  1. When cooperating threads of a parallel program need to synchronize to get proper behavior for reading and writing shared data
  2. When cooperating processes on a uniprocessor need to synchronize for reading and writing shared data
Q8: Using pointers in C usually gives better performance than using arrays. 
Solution: False

Q9: How many bytes are used to encode x86 instructions? 
 Solution: From 1 to 15 bytes
 
Q10: It is generally best to write programs in a high level language, since they're more intelligible and maintainable, and a compiler can generate more efficient code than an assembly language programmer. 
Solution: True 

Homework on Textbook Sections 1.1 to 1.4, 1.6 to 1.9, 2.1 to 2.8

Q1: HTC’s VIVE VR headset, comprising two displays for stereo video, has the following specifications:

  •     Resolution: 1080 × 1200 per eye
  •     Refresh Rate: 90Hz
Assuming 24bit/pixel video, what data rate in Gbps (Gigabits/sec) is required on the HDMI cable for the device?

Solution: 5.5987

Q2: A processor, which has a clock frequency of 1.3GHz, take 10s to run a program of 6×10^9 instructions. What is the average number of cycles per instruction? 

Solution: 2.167

Q3: Suppose, for the program in Question 2, the instructions are composed as follows:


Instruction class
Instruction count
CPI
Arithmetic/logic
4×10^9
1
Load/store
1×10^9
6
Branch
1×10^9
3
We are trying to redesign the processor to increase performance by a factor of 1.25, but this would lead to an increase in the CPI for load/store instructions from 6 to 8. What clock frequency in GHz would be needed for the redesigned processor?
Solution: 1.875

Q4: Instead of trying to increase the performance of the single processor, we can consider using multiple processor cores in a computer. Suppose, for the program in Question 2, parallelizing the program to use p processor cores divides the number of arithmetic/logic instructions by 0.7p, the number of load/store instructions by 0.8p, and the number of branches by 0.9p. What is the minimum number of processor cores required to improve performance by a factor of 3? 
 Solution: 4

Q5: What is the actual speedup achieved with the number of processor cores you identified in Question 4? 
 Solution: 3.14

Q6: What RISC-V instruction is encoded by the hex word 0x0051E933? 
Solution: or x18, x3, x5

Q7: What hex word encodes the RISC-V instruction ld x9, -24(x10)? 
Solution: 0xFE853483

Q8: If x9 initially contains the value 0xC445028461001003, what value (in hex) is placed in x18 by the following instruction?

srai x18, x9, 6?

Solution: 0xFF11140A11840040




Q9: Which of the following RISC-V instruction sequences extracts the 6-bit field from bits 4 to 9 of x10 and places it in the least-significant 6 bits of x7? 
 Solution: andi x7, x10, 0x3F0
srli x7, x7, 4
 
Q10: Write RISC-V instructions for the following C statements, assuming a is in x9 and b is in x18:

if (a == b)
  a = a + 1;
else
  a = a - 1;
 

Solution: bne x9, x18, L1
addi x9, x9, 1
beq x0, x0, L2
L1: addi x9, x9, -1
L2:

Q11: What C statements are encoded by the following, assuming x10 contains the signed int variable m?

      addi x5, x0, 20
      bgeu x10, x5, skip
      jal  x1, my_func
skip:
 

Solution: if (m >= 0 && m < 20) my_func(m);

Q12: Write RISC-V instructions for the following leaf function:

int min(int a, b) {
  return a < b ? a : b;
}

Solution: min: bge x10, x11, L1
           add x10, x0, x10
           jalr x0, 0(x1)
L1: add x10, x0, x11
        jalr x0, 0(x1)

Q13: Write RISC-V instructions for the following recursive function, without eliminating the recursion (i.e., without replacing it by a loop):

int sum(int n) {
  if (n == 0)
    return 0;
  else
    return n + sum(n – 1);
}

Solution: sum: addi sp, sp, -16
            sd x1, 8(sp)
            sd x8, 0(sp)
            beq x10, x0, ret
            add x8, x0, x10
            addi x10, x10, -1
             jal x1, sum
             add x10, x8, x10
ret: ld x8, 0(sp)
         ld x1, 8(sp)
         addi sp, sp, 16
         jalr x0, 0(x1)




Quiz on Textbook Sections 2.1 to 2.8

Q1: What C statement is represented by the following RISC-V assembly code?

add  x5, x8, x9
add  x6, x20, x11
sub  x8, x5, x6
Solution:
a = (a + b) - (c + d);

Q2: Which of the following RISC-V instruction implements the C assignment

n = A[3];

assuming n is in x9, the base address of A is in x10, and A is an array of uint64_t?

Solution: ld  x9, 24(x10)

Q3: The statement that RISC-V is little-endian means that the least-significant byte of a doubleword in memory is at the lowest address and the most-significant byte the doubleword is at the highest address. 
Solution: True

Q4: What range of values can be used as the immediate operand in a RISC-V addi instruction? 
Solution:  −2048 to +2047
 
Q5: What RISC-V instruction is encode by the hexadecimal word 04533023? 
Solution:  sd  x5, 0x040(x6)

Q6: If the register x8 contains the value 0x804600B91200C00F, what value is placed in x5 by the instruction

srli x5, x8, 4

Solution: 0x0804600B91200C00


Q7: What C statement is implemented by the following RISC-V assembly code, assuming a is in x9 and b is in x18?

    blt  x18, x9, L1
    addi x9, x9, 1
L1:

Solution: if (a <= b) a = a + 1;


Q8: Which of the RISC-V code sequences correctly implements the C statements

i = 0;
while (A[i] != 0) i++;

assuming i is in x9 and the address of A is in x18

Solution:

    addi x9, x0, 0
L1: slli x5, x9, 3
    add  x5, x5, x18
    ld   x5, 0(x5)
    beq  x5, x0, L2
    addi x9, x9, 1
    j    L1
L2:

Q10: The RISC-V calling convention uses registers x10 to x17 for parameter passing. If a procedure has more than eight parameters, where are the additional parameters passed? 
 Solution:
They are pushed onto the stack by the caller.