TAGORE ENGINEERING COLLEGE
RATHINAMANGALAM, CHENNAI - 600 127
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIT TEST-II
CS6801 – MULTI-CORE ARCHITECTURES AND PROGRAMMING
ANSWER KEY
PART-A
1. List the different challenges of parallel programming design.
i) Synchronization challenge
ii) Communication challenge
iii) Load balancing challenge
iv) Scalability challenge
2. What is a memory fence?
A memory fence, also called a memory barrier, is a processor-dependent operation that ensures the memory operations performed by one thread before the fence are visible to the other threads before any memory operations performed after it.
3. What is partitioning? Explain the ways of partitioning.
Partitioning performs load balancing by dividing the computation and the data into pieces. There are two ways of partitioning.
i) Data-centric partitioning (domain decomposition): A parallel design method that divides the data of the serial program into small pieces and then determines how to associate the computations with the data.
ii) Computation-centric partitioning (functional decomposition): A method that divides the computation of the program into pieces and then analyzes how to associate the data with the individual computations.
4. Give the isoefficiency relation.
If a parallel system exhibits the efficiency ε(n,p), define
    C = ε(n,p) / (1 - ε(n,p))
    T0(n,p) = (p-1)ɕ(n) + pα(n,p)
For the parallel system to scale, that is, to maintain this efficiency as processors are added, it should satisfy the condition
    T(n,1) ≥ C T0(n,p)
The isoefficiency relation is used to determine the range of processors over which the efficiency can be maintained.
5. Define deadlock and livelock.
Deadlock: A deadlock arises when two or more threads each wait for a resource that is already locked by another waiting thread, so none of them can proceed.
Livelock: A livelock occurs when two threads continuously conflict with each other and back off, so both remain active but neither makes progress.
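As an illustration of the deadlock definition above, the following Python sketch shows one common way to avoid it: every thread acquires the locks in the same fixed order. The names lock_a, lock_b, balance, transfer and run_transfers are hypothetical and chosen only for this example. A deadlock could arise if one thread took lock_a and then lock_b while another took lock_b and then lock_a.

```python
import threading

# Hypothetical shared state guarded by two locks.
lock_a = threading.Lock()
lock_b = threading.Lock()
balance = {"a": 100, "b": 100}


def transfer(src, dst, amount):
    # Both locks are always taken in the same global order (lock_a, then
    # lock_b), whatever the direction of the transfer. No thread ever holds
    # lock_b while waiting for lock_a, so a circular wait cannot form.
    with lock_a:
        with lock_b:
            balance[src] -= amount
            balance[dst] += amount


def run_transfers(rounds=1000):
    def a_to_b():
        for _ in range(rounds):
            transfer("a", "b", 1)

    def b_to_a():
        for _ in range(rounds):
            transfer("b", "a", 1)

    threads = [threading.Thread(target=a_to_b),
               threading.Thread(target=b_to_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # returns only if no thread is stuck waiting on a lock
    return dict(balance)


if __name__ == "__main__":
    print(run_transfers())
```

Because the two threads move the same total amount in opposite directions, the balances end where they started; the point of the sketch is that both joins complete.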
6. List the steps to avoid data races.
i) Confirm that only one thread can update the variable at a time.
ii) Place a synchronization lock around every access to that variable.
iii) Ensure that a thread acquires the lock before referencing the variable.
7. What is a mutex?
A mutex (mutual exclusion lock) is the simplest method of providing synchronization and the simplest lock implementation that can be used in a program. Only one thread in the program can hold a given mutex lock at a time.
8. List the different types of locks.
i) Mutex locks
ii) Recursive locks
iii) Reader-writer locks
iv) Spin locks
9. What is a spin lock? List its advantages.
A spin lock is a mutual exclusion lock in which a waiting thread does not go to sleep. When one thread has locked the data and is continuing its work, every other thread that needs the lock keeps checking it in a loop (spinning) until the holder unlocks the data.
Advantage:
A waiting thread acquires the lock as soon as the data is released by the other thread, without the delay of being woken up.
10. What is a barrier?
In parallel programming, a barrier is a mechanism that synchronizes multiple threads. Each thread that reaches the barrier has to wait until all the other threads have reached it as well; only then do the threads proceed to the next step of execution.
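The barrier behaviour described above can be sketched with Python's threading.Barrier. In this minimal example, which assumes four worker threads and uses the hypothetical names partial, totals, worker and run_phases, no thread starts phase 2 until every thread has finished phase 1.

```python
import threading

N_THREADS = 4  # assumed thread count for this sketch
barrier = threading.Barrier(N_THREADS)
partial = [0] * N_THREADS  # phase 1 results, one slot per thread
totals = [0] * N_THREADS   # phase 2 results, one slot per thread


def worker(i):
    # Phase 1: each thread writes only its own slot.
    partial[i] = (i + 1) * 10
    # Wait here until all N_THREADS threads have finished phase 1.
    barrier.wait()
    # Phase 2: safe to read every slot, since all writes are complete.
    totals[i] = sum(partial)


def run_phases():
    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(totals)


if __name__ == "__main__":
    print(run_phases())
```

Without the barrier.wait() call, a fast thread could sum partial before the slower threads had written their slots.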
PART-B
1. Explain the challenges in parallel programming design.
Parallel programming aims to improve system performance by using threads. However, threading adds complexity to the program, and that complexity increases when the program performs more than one function at a time. There are four challenges in parallel programming.
i) Synchronization challenge: Synchronization is the process by which two or more threads coordinate their functions and activities. For example, one thread waits for another thread to complete its task before continuing its own operation. There are two synchronization operations.
a) Mutual exclusion: One thread locks the critical section and operates on the shared data; the other threads have to wait until the thread holding the critical section completes its task.
b) Condition synchronization: A thread is blocked until the system reaches a specific condition. The thread waits to enter the critical section until the defined condition is reached.
The synchronization primitives are semaphores, locks and condition variables.
ii) Communication challenge: Message passing is the method of communication used to transfer information from one node to another. The three concerns of message communication are
a) Multi-granularity
b) Multi-threading
c) Multi-tasking
The different kinds of communication for message passing are
1) Inter-process communication: The two communicating threads reside in two different processes.
2) Intra-process communication: The two communicating threads exchange messages and reside in the same process.
3) Process-to-process communication: Two different processes communicate through messages.
iii) Load balancing challenge: Load balancing is a major need of parallel programming. It can be done effectively by using appropriate loop scheduling and partitioning. Partitioning performs load balancing by dividing the computation and the data into pieces. There are two ways of partitioning.
a) Data-centric partitioning (domain decomposition): Determines how to associate the computations with the data.
b) Computation-centric partitioning (functional decomposition): Analyzes how to associate the data with the individual computations.
Load balancing can be implemented in two ways.
1) Static load balancing: The tasks are mapped to the processors in advance, in a way that minimizes the communication overhead of the parallel program.
2) Dynamic load balancing: Dynamic load balancing algorithms analyze the tasks at run time and keep the mapping of tasks to processors up to date.
iv) Scalability challenge: Scalability is the ability of a parallel program to increase its performance as the number of processors increases. It is limited by hardware interaction, where the presence of multiple threads causes the hardware to become less effective. It is also limited by software, where the synchronization overhead becomes a larger issue.
2. Analyze the performance of a parallel program by deriving Amdahl’s law and the Gustafson-Barsis law.
(i) Amdahl’s Law:
Amdahl’s law gives the limit on the benefit of increasing the number of processors. It determines the asymptotic speedup achievable as the number of processors increases.
Definition: Let g be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ g ≤ 1. The maximum speedup ¥ achievable by a parallel computer with p processors performing the computation is
    ¥(n,p) ≤ 1 / (g + (1-g)/p)
Derivation:
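The speedup of parallel program execution is
    ¥(n,p) ≤ (ɕ(n) + Ø(n)) / (ɕ(n) + Ø(n)/p + α(n,p))
where ɕ(n) is the sequential part of the computation, Ø(n) is the part that can be executed in parallel, and α(n,p) is the parallel overhead. Since α(n,p) ≥ 0, the bound still holds when the overhead is dropped:
    ¥(n,p) ≤ (ɕ(n) + Ø(n)) / (ɕ(n) + Ø(n)/p)
Let g be the sequential fraction of the computation:
    g = ɕ(n) / (ɕ(n) + Ø(n))
Dividing the numerator and the denominator by ɕ(n) + Ø(n), we can write
    ¥(n,p) ≤ 1 / (g + (1-g)/p)
(ii) Gustafson-Barsis Law: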
The Gustafson-Barsis law starts from the parallel computation and estimates how much faster it runs than the same program would when executed on a single processor.
Definition: Given a parallel program solving a problem of size n using p processors, let T denote the fraction of the total execution time spent in serial code. The maximum speedup ¥ achievable is
    ¥(n,p) ≤ p + (1-p)T
Derivation:
WKT, since α(n,p) ≥ 0, the speedup satisfies
    ¥(n,p) ≤ (ɕ(n) + Ø(n)) / (ɕ(n) + Ø(n)/p) -----------------(A)
Let T denote the fraction of the parallel execution time spent in serial code, so that the parallel operations take the fraction 1-T.
    T = ɕ(n) / (ɕ(n) + Ø(n)/p) -------------------(1)
    1-T = (Ø(n)/p) / (ɕ(n) + Ø(n)/p) -------------------(2)
From eqn (2)
    Ø(n) = (ɕ(n) + Ø(n)/p)(1-T)p -----------(3)
From eqn (1)
    ɕ(n) = (ɕ(n) + Ø(n)/p)T ----------------(4)
Substituting equations (3) and (4) in equation (A), the common factor ɕ(n) + Ø(n)/p cancels and we get
    ¥(n,p) ≤ T + (1-T)p
(or)
    ¥(n,p) ≤ p + (1-p)T
3. Derive the Karp-Flatt metric used to analyze the performance of a parallel program.
Both Amdahl’s law and the Gustafson-Barsis law ignore the parallel overhead α(n,p); the Karp-Flatt metric takes it into account, which helps in designing high-performance parallel programs.
Definition: Given a parallel computation exhibiting speedup ¥ on p processors, where p > 1, the experimentally determined serial fraction e is
    e = ((1/¥) - (1/p)) / (1 - (1/p))
Derivation:
WKT, the execution time of the parallel program is
    T(n,p) = ɕ(n) + (Ø(n)/p) + α(n,p) --------------(1)
The serial program does not have any interprocessor communication or overhead, so its execution time is
    T(n,1) = ɕ(n) + Ø(n) ---------------------(2)
The experimentally determined serial fraction e is defined by
    ɕ(n) + α(n,p) = T(n,1)e ------------------(3)
Substitute equation (3) in (1):
    T(n,p) = T(n,1)e + (Ø(n)/p) -----------------(4)
From equation (3), ɕ(n) = T(n,1)e - α(n,p). In the serial program parallel overhead is not possible, so α(n,p) = 0 there. Therefore,
    ɕ(n) = T(n,1)e -------------------------(5)
Substitute equation (5) in (2):
    Ø(n) = T(n,1)(1-e) --------------------------(6)
Substitute equation (6) in (4) to get the parallel execution time:
    T(n,p) = T(n,1)e + (T(n,1)(1-e))/p -------------------(7)
WKT, the speedup is
    ¥ = T(n,1) / T(n,p)
Let us normalize the time so that T(n,p) = 1; then T(n,1) = ¥, and equation (7) becomes
    1 = ¥e + ¥(1-e)/p
Dividing by ¥,
    1/¥ = e + (1-e)/p = e(1 - (1/p)) + (1/p)
Finally we get
    e(1 - (1/p)) = (1/¥) - (1/p)
where e is the experimentally determined serial fraction.
4. Derive the isoefficiency relation from the scalability of the parallel program.
The scalability of a parallel system is the measure of its ability to increase performance as the number of processors increases. Here, an isoefficiency relation is formalized to keep the efficiency constant as the system grows.
Derivation:
WKT, the speedup is
    ¥(n,p) ≤ (ɕ(n) + Ø(n)) / (ɕ(n) + Ø(n)/p + α(n,p))
Multiplying the numerator and the denominator by p,
    ¥(n,p) ≤ p(ɕ(n) + Ø(n)) / (pɕ(n) + Ø(n) + pα(n,p)) ----------------(1)
WKT, pɕ(n) can be written as
    pɕ(n) = ɕ(n) + (p-1)ɕ(n) --------------------(2)
Substitute (2) in (1):
    ¥(n,p) ≤ p(ɕ(n) + Ø(n)) / (ɕ(n) + Ø(n) + (p-1)ɕ(n) + pα(n,p)) ----------------(3)
We already know that
    T0(n,p) = (p-1)ɕ(n) + pα(n,p) --------------------- (4)
where T0(n,p) is the total time spent by all the processes doing work that the sequential program does not do: the time the other (p-1) processes spend while the sequential code is executing, plus the parallel overhead.
Substitute eqn (4) in (3):
    ¥(n,p) ≤ p(ɕ(n) + Ø(n)) / (ɕ(n) + Ø(n) + T0(n,p))
WKT, the efficiency is the speedup divided by the number of processors, i.e.,
    ε(n,p) ≤ (ɕ(n) + Ø(n)) / (ɕ(n) + Ø(n) + T0(n,p)) ------------------- (5)
Divide the numerator and the denominator of eqn (5) by T(n,1), since T(n,1) = ɕ(n) + Ø(n):
    ε(n,p) ≤ 1 / (1 + (T0(n,p)/T(n,1)))
Rearranging,
    T(n,1) ≥ (ε(n,p) / (1 - ε(n,p))) T0(n,p)
Here, for a constant level of efficiency, the isoefficiency relation uses
    C = ε(n,p) / (1 - ε(n,p))
and
    T0(n,p) = (p-1)ɕ(n) + pα(n,p)
For the parallel system to scale, that is, to maintain this efficiency as processors are added, it should satisfy the condition
    T(n,1) ≥ C T0(n,p)
The isoefficiency relation is used to determine the range of processors over which the efficiency can be maintained.
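The results of questions 2 to 4 can be checked numerically. The Python sketch below is only an illustration: the function names and the input values (a serial fraction of 0.1, 8 processors, and the timings passed to the isoefficiency check) are assumptions made for this example, not data from the course.

```python
def amdahl_speedup(g, p):
    """Amdahl bound: g is the sequential fraction, p the processor count."""
    return 1.0 / (g + (1.0 - g) / p)


def gustafson_speedup(t, p):
    """Gustafson-Barsis bound: t is the serial fraction of parallel run time."""
    return p + (1 - p) * t


def karp_flatt(speedup, p):
    """Experimentally determined serial fraction e, for p > 1 processors."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)


def meets_isoefficiency(t_serial, t_overhead, efficiency):
    """Check T(n,1) >= C * T0(n,p), with C = efficiency / (1 - efficiency)."""
    c = efficiency / (1.0 - efficiency)
    return t_serial >= c * t_overhead


if __name__ == "__main__":
    p = 8                       # assumed processor count
    s = amdahl_speedup(0.1, p)  # assumed sequential fraction g = 0.1
    print("Amdahl bound:", s)
    print("Gustafson-Barsis bound:", gustafson_speedup(0.1, p))
    print("Karp-Flatt e from the Amdahl speedup:", karp_flatt(s, p))
    print("Isoefficiency condition met:",
          meets_isoefficiency(100.0, 20.0, 0.8))
```

When there is no parallel overhead, applying karp_flatt to the Amdahl speedup returns the sequential fraction g again (up to rounding), which is a quick consistency check between the two formulas.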