Lecture 7

Nikolaus Huber

Optimization

Outline

  • Recap on Complexity
  • Space-Time spectrum
  • Measuring & Profiling
  • Low-level optimization

Premature optimization is the root of all evil

Donald Knuth

Asymptotic complexity

  • How does our code scale for different inputs?
  • How do our algorithms scale?
  • Are we interested in worst, average, or best case?
  • Are we interested in processing time or memory requirement?

Recap: Landau Notation

\[ O(g(n)) = \{ f: \mathbb{N} \rightarrow \mathbb{N} \;|\; \exists M, n_0 : \forall n \geq n_0 : f(n) \leq M g(n) \} \]
  • $O(g(n))$ is the set of all functions that grow at most as quickly as $g$
  • For example: $2n^2 + n + 42 \in O(n^2)$
  • More commonly written (abusing notation) as: $2n^2 + n + 42 = O(n^2)$
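
A short worked check of the example, using the symbols $M$ and $n_0$ from the definition. The constants chosen here are only one possible choice; any larger $M$ works as well. For all $n \geq 1$ we have $n \leq n^2$ and $42 \leq 42 n^2$, so

\[ 2n^2 + n + 42 \;\leq\; 2n^2 + n^2 + 42n^2 \;=\; 45 n^2 \quad \text{for all } n \geq 1 \]

Hence $M = 45$ and $n_0 = 1$ witness $2n^2 + n + 42 \in O(n^2)$.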

Example: Sorting algorithms

Algorithm       Worst-case runtime
Bubble-sort     $O(n^2)$
Insertion-sort  $O(n^2)$
Quick-sort      $O(n^2)$
Merge-sort      $O(n \log n)$
Tim-sort        $O(n \log n)$

O - Notation

  • Is independent of the unit used to measure time
    • Absolute runtime (ms, µs, ...)
    • Number of instructions
    • Number of cycles
    • These differ only by constant factors
  • Only captures scalability
    • Gives idea of performance when input becomes arbitrarily large
    • Does not tell anything about runtime for fixed input
    • Worst-case often extremely unrealistic (but important for real-time)

Average vs Worst case

  • Average runtime
    • Expected runtime for an algorithm given a randomly chosen input of size n
  • Tends to be more difficult to derive than worst-case
  • Very often determined through simulation and measurement

Example: Sorting algorithms

Algorithm       Worst-case runtime  Average-case runtime
Bubble-sort     $O(n^2)$            $O(n^2)$
Insertion-sort  $O(n^2)$            $O(n^2)$
Quick-sort      $O(n^2)$            $O(n \log n)$
Merge-sort      $O(n \log n)$       $O(n \log n)$
Tim-sort        $O(n \log n)$       $O(n \log n)$

Algorithm vs problem complexity

  • Complexity of a problem = complexity of an optimal algorithm solving it
  • E.g., comparison-based sorting has complexity $O(n \log n)$
  • No comparison-based algorithm can do asymptotically better than Merge-sort/Tim-sort in the general case
    • For special inputs we can do better
    • Restricting the data domain can help (e.g. Dutch Flag Sorting Algorithm)
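
As an illustration of how a restricted data domain helps, the following is a minimal sketch of the Dutch national flag idea: if the array is known to contain only three distinct values (assumed here to be 0, 1 and 2), it can be sorted in a single pass, i.e. in $O(n)$. The function name `dutch_flag_sort` is chosen for this example only.

```c
#include <stddef.h>

/* Sorts an array that contains only the values 0, 1 and 2 in one pass. */
static void dutch_flag_sort(int *a, size_t n)
{
    size_t low = 0;   /* a[0..low)   holds only 0s */
    size_t mid = 0;   /* a[low..mid) holds only 1s */
    size_t high = n;  /* a[high..n)  holds only 2s */

    while (mid < high) {
        if (a[mid] == 0) {
            /* move the 0 to the front region */
            int tmp = a[low];
            a[low] = a[mid];
            a[mid] = tmp;
            low++;
            mid++;
        } else if (a[mid] == 1) {
            /* 1s stay in the middle region */
            mid++;
        } else {
            /* move the 2 to the back region; a[mid] is re-examined next */
            high--;
            int tmp = a[mid];
            a[mid] = a[high];
            a[high] = tmp;
        }
    }
}
```

Each iteration either advances `mid` or decreases `high`, so the loop runs at most n times and needs no extra memory.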

Data structures vs algorithms

  • DS complementary aspect to algorithm complexity
  • DS are often considered more important in algorithm design
    • Performance of DS operations has drastic impact on algorithm performance
    • Often many choices (e.g. doubly linked lists, arrays, trees, hash tables, ...)
    • Generally more difficult to change DS later in development

Data structures

  • Each DS comes with a set of operations
  • Complexity of these operations often depends on the size of the DS

Operation            List of size n          Array of size n
Read k-th element    $O(k) \approx O(n)$     $O(1)$
Update k-th element  $O(k) \approx O(n)$     $O(1)$
Insert k-th element  $O(k) \approx O(n)$     -
Append element       $O(1)$                  -

Conclusion

  • There is no general best algorithm or best data structure
  • Have to make decisions based on
    • Concrete requirements
    • Expected data size
    • Expected data distribution
    • Expected update/insert rate
    • ...

Space - Time - Tradeoff

  • We can often choose whether an implementation
    • Uses more memory, but needs less computation
    • Uses more computation, but needs less memory
  • Choice has to be informed by context, no general solution
  • Most of the following aspects can be understood as space-time-tradeoffs

Look - Up Tables

  • Pre-computed and stored in memory
  • Data is accessed often
  • Examples
    • Trigonometric functions (sin, cos)
    • Hash-code tables
    • ...
  • More space, less computation
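
A minimal sketch of a look-up table for the sine function. The table size (16 entries), the scaling to the range -127..127, and the names `sine_lut` and `sine_lookup` are assumptions made for this example; a real application would pick resolution and scaling to match its accuracy needs.

```c
#include <stdint.h>

/* Pre-computed values of 127 * sin(2*pi*i/16), rounded to the nearest integer.
 * Declared const so that it can be placed in read-only memory. */
static const int8_t sine_lut[16] = {
       0,   49,   90,  117,  127,  117,   90,   49,
       0,  -49,  -90, -117, -127, -117,  -90,  -49
};

/* phase: 0..255 represents one full turn (0..2*pi).
 * The upper four bits of the phase select the table entry. */
static int8_t sine_lookup(uint8_t phase)
{
    return sine_lut[phase >> 4];
}
```

The table costs 16 bytes of memory, but each call is reduced to one shift and one memory read instead of a numerical sine computation.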

Images, Fonts, Bitmaps, ...

  • Can be stored in different ways
    • Raw format
    • Compressed format
  • Similar for other kinds of data (audio, text, ...)

Packed/Padded Representation

  • Individual bits can be stored in different ways
    • Packed: bits are grouped together
    • Padded: reserve whole byte/word for each bit
  • Bit banding is similar to padded representation
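
The two representations can be sketched as follows for eight boolean flags. All names (`padded_flags`, `packed_flags`, and the accessor functions) are hypothetical and only serve this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Padded: one whole byte per flag, 8 bytes in total. Access is a plain load/store. */
static uint8_t padded_flags[8];

/* Packed: all eight flags share one byte. Access needs shifting and masking. */
static uint8_t packed_flags;

static void padded_set(unsigned i, bool value)
{
    padded_flags[i] = value ? 1u : 0u;
}

static bool padded_get(unsigned i)
{
    return padded_flags[i] != 0;
}

static void packed_set(unsigned i, bool value)
{
    if (value) {
        packed_flags |= (uint8_t)(1u << i);   /* set bit i */
    } else {
        packed_flags &= (uint8_t)~(1u << i);  /* clear bit i */
    }
}

static bool packed_get(unsigned i)
{
    return ((packed_flags >> i) & 1u) != 0;
}
```

The packed variant uses an eighth of the memory, while the padded variant avoids the read-modify-write sequence on every update.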

Bit fields

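  • Feature of C to store bit-vectors of a chosen width in a struct (at most the width of the underlying type)
  • Normally uses packed representation (the exact layout is implementation-defined)
  • E.g., the following typically only needs one 32-bit word: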

		struct X {
			int x:10; // 10-bit int
			int y:20; // 20-bit int
		}; 
	

Related topic: Alignment

  • On some architectures, addresses of int/short/char variables must be aligned to word boundaries
  • This is taken care of by the compiler
  • The C standard forbids the compiler from reordering struct fields, though

		struct X {
			short	a; // 2 bytes 
			char 	b; // 1 byte 
			int 	c; // 4 bytes 
		}; 
	

Alignment

  • On fully aligned architectures, variables need to be aligned to word boundaries
    • I.e., on 32-bit machines, all addresses are multiples of 4
    • Mostly outdated today
  • On self-aligned architectures, addresses of n-byte types have to be multiples of n
    • int addresses => 4, short => 2, char => 1
    • To save space, fields in struct can be sorted (from biggest to smallest)
    • Again, compiler is not allowed to re-order the fields (must be done manually)
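
A small sketch of the effect of manual field ordering. The struct names are made up for this example, and the byte counts in the comments assume a self-aligned platform with 4-byte int and 2-byte short; actual sizes are implementation-defined and can be checked with `sizeof`.

```c
/* Fields in an unfavourable order: padding is needed before `b` and after `c`. */
struct unsorted {
    char  a;  /* 1 byte, typically followed by 3 padding bytes */
    int   b;  /* 4 bytes */
    short c;  /* 2 bytes, typically followed by 2 padding bytes */
};            /* typically 12 bytes in total */

/* Same fields, sorted manually from biggest to smallest. */
struct sorted {
    int   b;  /* 4 bytes */
    short c;  /* 2 bytes */
    char  a;  /* 1 byte, typically followed by 1 padding byte */
};            /* typically 8 bytes in total */
```

Under these assumptions, an array of 1000 such structs shrinks from about 12 kB to about 8 kB without changing any of the data stored.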

Alignment on ARM Cortex

  • Can also handle unaligned variables
  • Load / store to unaligned variables are slower than accesses to aligned variables

Size of machine code

  • How many bytes does it take to encode one machine instruction?
  • It depends; ARM has different instruction sets:
    • Original ARM 32-bit ISA
    • Reduced Thumb 16-bit ISA
    • Thumb-2, mixed size (16 + 32 bit)

Memory and caches

  • Some architectures use multiple levels of caches to speed up memory access
  • As said before, not that relevant for us
    • Smaller µCs usually don't have complicated cache structure
    • Caches make WCET prediction even harder
    • Sometimes: Scratchpad memories

Optimizing Code

  • Efficiency normally considered late in development
  • Better to focus on readability and maintainability first
  • Once a program is functionally complete, we can focus on optimization
  • Not always possible in that way, some decisions have to be made up front

What to optimize?

  • Optimization should always be informed
  • Can use a profiler
    • Tool to measure time spent in individual functions, blocks, ...
  • Can instrument the code
    • Sometimes some support from the (RT)OS
  • Conduct experiments
    • Remove or duplicate code sections

Profilers (1)

  • gprof, Tracealyzer, ..., sometimes built into the IDE
  • Instrumentation based
    • Profiler adds statements to capture time at various code locations
    • Affects measured times (maybe drastically)
  • Sampling based
    • Profiler stops program periodically and records current program counter
    • Less precise

Profilers (2)

  • Simulation based
    • Usually quite slow
    • Needs a cycle-accurate simulator for the platform
    • Often works with simplified assumptions about the HW
  • In-circuit / tracing
    • Needs hardware support
    • Often implemented internally by sampling

How to eliminate bottlenecks?

  • Choose a better suited algorithm or data structure
  • Perform low-level optimizations
  • Optimization is related to refactoring
    • We want to improve the design
    • Modifications should not impact functionality of the systems

Low-level optimizations

  • Use faster instructions (e.g. bit-shifting instead of multiplication)
  • Use right arithmetic data-types
    • Floating point, Fixed point, int
    • Can have a significant impact (factor 10)
  • Loop optimizations
  • Sub-expression elimination
  • Inlining
  • Algebraic simplifications
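
Two of these ideas in a minimal sketch: replacing a multiplication by a power of two with a shift, and using fixed-point instead of floating-point arithmetic. The Q16.16 format (16 integer bits, 16 fractional bits) and all names are assumptions chosen for this example. Modern compilers usually perform the shift replacement on their own.

```c
#include <stdint.h>

/* Multiplication by 8 expressed as a left shift by 3 (unsigned values only). */
static uint32_t times8(uint32_t x)
{
    return x << 3;
}

/* Hypothetical Q16.16 fixed-point type: the stored integer is value * 65536. */
typedef int32_t q16_16;

#define Q16_ONE ((q16_16)65536)  /* the value 1.0 in Q16.16 */

/* Fixed-point multiplication using only integer instructions.
 * The product is widened to 64 bits so the intermediate result cannot overflow.
 * Assumes an arithmetic right shift for negative values, as on common compilers. */
static q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
}
```

For instance, multiplying 3.0 (`3 * Q16_ONE`) by 0.5 (`Q16_ONE / 2`) yields the Q16.16 representation of 1.5 without any floating-point operation.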

Low-level optimizations

  • A lot is handled by the compiler
  • Sometimes we need to give hints to the compiler
  • E.g., C keywords like inline and register
  • ISO/IEC TR 18037:2008: C extensions to support embedded processors ("Embedded C")
Schkufza, Eric, Rahul Sharma, and Alex Aiken. "Stochastic superoptimization." ACM SIGARCH Computer Architecture News 41.1 (2013): 305-316.

Preparation for lab 3

Zephyr Sensor Subsystem

  • Zephyr tries to describe applications on a high level
  • By developing against Zephyr API => high portability
  • Provides unified API for accessing sensors
  • Works on channels
    • Measurable quantity
    • Examples: SENSOR_CHAN_ACCEL_X, SENSOR_CHAN_VOLTAGE, SENSOR_CHAN_AMBIENT_TEMP
    • Every channel has a defined unit

Thanks for today!
