Lecture 7

Nikolaus Huber

Optimization

Outline

Recap on Complexity
Space-Time spectrum
Measuring & Profiling
Low-level optimization

Premature optimization is the root of all evil

Donald Knuth

Asymptotic complexity

How does our code scale for different inputs?
How do our algorithms scale?
Are we interested in worst, average, or best case?
Are we interested in processing time or memory requirement?

Recap: Landau Notation

\[ O(g(n)) = \{ f: \mathbb{N} \rightarrow \mathbb{N} \;|\; \exists M > 0, n_0 \in \mathbb{N} : \forall n \geq n_0 : f(n) \leq M g(n) \} \]

$O(g(n))$ is the set of all functions that grow at most as quickly as g
For example: $2n^2 + n + 42 \in O(n^2)$
More commonly written (with slight abuse of notation) as: $2n^2 + n + 42 = O(n^2)$
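
As a quick check of the example against the definition, one possible choice of constants (many others work) is $M = 3$ and $n_0 = 7$:

\[ n \geq 7 \;\Rightarrow\; n + 42 \leq n^2 \;\Rightarrow\; 2n^2 + n + 42 \leq 3n^2 \]

The first step holds because $n^2 - n - 42 = (n - 7)(n + 6) \geq 0$ for all $n \geq 7$.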

Example: Sorting algorithms

Algorithm	Worst-case runtime
Bubble-sort	$O(n^2)$
Insertion-sort	$O(n^2)$
Quick-sort	$O(n^2)$
Merge-sort	$O(n \log n)$
Tim-sort	$O(n \log n)$

O - Notation

Is independent of the unit used to measure time

Absolute runtime (ms, µs, ...)
Number of instructions
Number of cycles
All of these differ only by constant factors

Only captures scalability

Gives idea of performance when input becomes arbitrarily large
Says nothing about the runtime for a fixed input size
Worst-case often extremely unrealistic (but important for real-time)

Average vs Worst case

Average runtime

Expected runtime for an algorithm given a randomly chosen input of size n

Tends to be more difficult to derive than worst-case
Very often determined through simulation and measurement

Example: Sorting algorithms

Algorithm	Worst-case runtime	Average-case runtime
Bubble-sort	$O(n^2)$	$O(n^2)$
Insertion-sort	$O(n^2)$	$O(n^2)$
Quick-sort	$O(n^2)$	$O(n \log n)$
Merge-sort	$O(n \log n)$	$O(n \log n)$
Tim-sort	$O(n \log n)$	$O(n \log n)$
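
To make the dependence on the input concrete, here is a minimal sketch of insertion sort that counts its comparisons. The function name `insertion_sort_count` and the use of comparisons as the cost measure are choices made for this illustration, not part of the lecture.

```c
#include <stddef.h>

/* Sorts a[0..n-1] in place with insertion sort and returns the number
 * of element comparisons performed (an illustrative cost measure). */
unsigned long insertion_sort_count(int *a, size_t n)
{
    unsigned long comparisons = 0;

    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;

        /* Shift larger elements one slot to the right until key fits. */
        while (j > 0) {
            comparisons++;
            if (a[j - 1] <= key) {
                break;
            }
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
    return comparisons;
}
```

On an already sorted array of size n this performs n - 1 comparisons, on a reversed array n(n - 1)/2, which is the quadratic worst case from the table.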

Algorithm vs problem complexity

Complexity of a problem = complexity of an optimal algorithm solving it
E.g., comparison-based sorting has complexity $O(n \log n)$
No comparison-based algorithm can do asymptotically better than Merge-sort/Tim-sort in the general case

For special inputs we can do better
Restricting the data domain can help (e.g. the Dutch national flag algorithm)
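
A minimal sketch of the Dutch national flag idea, assuming the array holds only the values 0, 1 and 2. The names `dutch_flag_sort` and `swap_int` are chosen for this example. Because the data domain is restricted, a single pass with three indices is enough, so the runtime is linear.

```c
#include <stddef.h>

static void swap_int(int *x, int *y)
{
    int tmp = *x;
    *x = *y;
    *y = tmp;
}

/* Sorts an array containing only 0, 1 and 2 in one pass.
 * Invariant: a[0..low) are 0s, a[low..mid) are 1s, a[high..n) are 2s,
 * and a[mid..high) has not been inspected yet. */
void dutch_flag_sort(int *a, size_t n)
{
    size_t low = 0;
    size_t mid = 0;
    size_t high = n;

    while (mid < high) {
        if (a[mid] == 0) {
            swap_int(&a[low], &a[mid]);
            low++;
            mid++;
        } else if (a[mid] == 1) {
            mid++;
        } else {
            high--;
            swap_int(&a[mid], &a[high]);
        }
    }
}
```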

Data structures vs algorithms

Data structures (DS) are a complementary aspect to algorithm complexity
DS are often considered more important than the algorithm itself in algorithm design

Performance of DS operations has drastic impact on algorithm performance
Often many choices (e.g. doubly linked lists, arrays, trees, hash tables, ...)
Generally more difficult to change DS later in development

Data structures

Each DS comes with a set of operations
Complexity of these operations often depends on the size of the DS

Operation	List of size n	Array of size n
Read k-th element	$O(k) \approx O(n)$	$O(1)$
Update k-th element	$O(k) \approx O(n)$	$O(1)$
Insert k-th element	$O(k) \approx O(n)$	-
Append element	$O(1)$	-
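
The first row of the table can be sketched as follows, assuming a singly linked list. The names `struct node`, `list_get` and `array_get` are hypothetical and only serve this illustration.

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Reading the k-th list element has to follow k links: O(k).
 * Returns NULL if the list has fewer than k + 1 elements. */
const struct node *list_get(const struct node *head, size_t k)
{
    while (head != NULL && k > 0) {
        head = head->next;
        k--;
    }
    return head;
}

/* Reading the k-th array element is a single address computation: O(1).
 * The caller must make sure that k is within bounds. */
int array_get(const int *a, size_t k)
{
    return a[k];
}
```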

Conclusion

There is no general best algorithm or best data structure
Have to make decisions based on

Concrete requirements
Expected data size
Expected data distribution
Expected update/insert rate
...

Space - Time - Tradeoff

We can often choose whether an implementation

Uses more memory, but needs less computation
Uses more computation, but needs less memory

Choice has to be informed by context, no general solution
Most of the following aspects can be understood as space-time-tradeoffs

Look - Up Tables

Pre-computed and stored in memory
Data is accessed often
Examples

Trigonometric functions (sin, cos)
Hash-code tables
...

More space, less computation
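
A small sketch of the trade-off. Instead of a trigonometric table, it uses a hand-written 16-entry table of bit counts, chosen here only because it is short and easy to verify. The names `BITS_IN_NIBBLE`, `popcount8_lut` and `popcount8_loop` are assumptions of this example.

```c
#include <stdint.h>

/* Number of set bits for every 4-bit value, pre-computed and stored. */
static const uint8_t BITS_IN_NIBBLE[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* More space, less computation: two table look-ups per byte. */
unsigned popcount8_lut(uint8_t x)
{
    return BITS_IN_NIBBLE[x & 0x0F] + BITS_IN_NIBBLE[x >> 4];
}

/* Less space, more computation: inspects the bits one by one. */
unsigned popcount8_loop(uint8_t x)
{
    unsigned count = 0;

    while (x != 0) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}
```

Whether the table version is actually faster on a given target is something to measure, not to assume.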

Images, Fonts, Bitmaps, ...

Can be stored in different ways

Raw format
Compressed format

Similar for other kinds of data (audio, text, ...)

Packed/Padded Representation

Individual bits can be stored in different ways

Packed: bits are grouped together
Padded: reserve whole byte/word for each bit

Bit banding is similar to padded representation
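
The two representations could look roughly like this for 64 flags. The type and function names (`padded_flags`, `packed_flags`, `packed_set`, `packed_get`) are made up for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_FLAGS 64

/* Padded: one whole byte per flag, direct access without masking. */
struct padded_flags {
    uint8_t flag[NUM_FLAGS];
};

/* Packed: eight flags share one byte, access needs shift and mask. */
struct packed_flags {
    uint8_t bits[NUM_FLAGS / 8];
};

void packed_set(struct packed_flags *f, size_t i, bool value)
{
    uint8_t mask = (uint8_t)(1u << (i % 8));

    if (value) {
        f->bits[i / 8] |= mask;
    } else {
        f->bits[i / 8] &= (uint8_t)~mask;
    }
}

bool packed_get(const struct packed_flags *f, size_t i)
{
    return ((f->bits[i / 8] >> (i % 8)) & 1u) != 0;
}
```

The packed variant needs an eighth of the memory, but every access costs a few extra instructions, and a write is a read-modify-write of the shared byte.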

Bit fields

Feature of C to store integer fields with a chosen number of bits in a struct
Normally uses packed representation
E.g., the following only needs one 32-bit word:


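		struct X {
			int x:10; // 10-bit field (signedness of plain int bit-fields is implementation-defined)
			int y:20; // 20-bit field; 10 + 20 = 30 bits fit into one 32-bit word
		};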
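
As a hedged illustration of how such fields behave, the sketch below uses unsigned bit-fields and compares the struct against an unpacked variant. The names `coords_bitfield`, `coords_plain` and `make_coords` are hypothetical, and the exact layout of bit-fields is implementation-defined, so the size comparison is what compilers typically produce, not a guarantee.

```c
#include <stddef.h>

/* Packed: both fields typically share a single 32-bit word. */
struct coords_bitfield {
    unsigned int x : 10;
    unsigned int y : 20;
};

/* Unpacked: each field gets a full unsigned int. */
struct coords_plain {
    unsigned int x;
    unsigned int y;
};

/* Builds a packed value; bits that do not fit the field width are
 * masked off explicitly to make the truncation visible. */
struct coords_bitfield make_coords(unsigned int x, unsigned int y)
{
    struct coords_bitfield c;

    c.x = x & 0x3FFu;    /* keep the low 10 bits */
    c.y = y & 0xFFFFFu;  /* keep the low 20 bits */
    return c;
}
```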

Alignment

On fully aligned architectures, variables need to be aligned to word boundaries

I.e., on 32-bit machines, all variable addresses are multiples of 4
Mostly outdated today

On self-aligned architectures, addresses of n-byte types have to be multiples of n

E.g., int addresses are multiples of 4, short addresses multiples of 2, char addresses multiples of 1
To save space, fields in a struct can be sorted (from biggest to smallest)
The C compiler is not allowed to re-order the fields (must be done manually)
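
A sketch of the effect of field order, with two hypothetical structs holding the same four fields. On typical self-aligned targets the first one occupies 12 bytes and the second 8, but the exact sizes depend on the ABI, so they should be checked with `sizeof` on the actual platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Declaration order forces padding after a (3 bytes) and after c (1 byte)
 * on a self-aligned target. */
struct unsorted {
    uint8_t  a;
    uint32_t b;
    uint8_t  c;
    uint16_t d;
};

/* Same fields, sorted from biggest to smallest: no padding needed
 * between them on a self-aligned target. */
struct sorted {
    uint32_t b;
    uint16_t d;
    uint8_t  a;
    uint8_t  c;
};
```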

Alignment on ARM Cortex

Can also handle unaligned variables
Load / store to unaligned variables are slower than accesses to aligned variables

Size of machine code

How many bytes are needed to store one machine instruction?
Depends, different ARM instruction sets:

Original ARM 32-bit ISA
Reduced Thumb 16-bit ISA
Thumb-2, mixed size (16 + 32 bit)

Memory and caches

Some architectures use multiple levels of caches to speed up memory access
As said before, not that relevant for us

Smaller µCs usually don't have complicated cache structure
Caches make WCET prediction even harder
Sometimes: Scratchpad memories

Optimizing Code

Efficiency normally considered late in development
Better to focus on readability and maintainability first
Once a program is functionally complete, we can focus on optimization
Not always possible in that way, some decisions have to be made up front

What to optimize?

Optimization should always be informed
Can use a profiler

Tool to measure time spent in individual functions, blocks, ...

Can instrument the code

Sometimes some support from the (RT)OS

Conduct experiments

Remove or duplicate code sections
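
Manual instrumentation can be as simple as reading a clock before and after the section of interest. The sketch below uses the standard C `clock()` function as a stand-in; on a microcontroller one would more likely read a hardware cycle counter or an (RT)OS timer. `busy_work` and `time_busy_work` are hypothetical names for this example.

```c
#include <time.h>

/* Hypothetical workload standing in for the code section under test. */
unsigned long busy_work(unsigned long iterations)
{
    volatile unsigned long sum = 0;  /* volatile keeps the loop from being optimized away */

    for (unsigned long i = 0; i < iterations; i++) {
        sum += i;
    }
    return sum;
}

/* Runs busy_work and returns the CPU time it took in seconds.
 * The workload's result is passed back through *result. */
double time_busy_work(unsigned long iterations, unsigned long *result)
{
    clock_t start = clock();
    *result = busy_work(iterations);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Note that the measurement itself costs time, and that the resolution of `clock()` may be too coarse for short sections, in which case the section can be repeated many times inside the measured region.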

Profilers (1)

gprof, Tracealyzer, ..., sometimes built into the IDE
Instrumentation based

Profiler adds statements to capture time at various code locations
Affects measured times (maybe drastically)

Sampling based

Profiler stops program periodically and records current program counter
Less precise

Profilers (2)

Simulation based

Usually quite slow
Needs a cycle-accurate simulator for the platform
Often works with simplified assumptions about the HW

In-circuit / tracing

Needs hardware support
Often implemented internally by sampling

How to eliminate bottlenecks?

Choose a better suited algorithm or data structure
Perform low-level optimizations
Optimization is related to refactoring

We want to improve the design
Modifications should not impact the functionality of the system

Low-level optimizations

Use faster instructions (e.g. bit-shifting instead of multiplication)
Use the right arithmetic data types

Floating point, Fixed point, int
Can have a significant impact (factor 10)

Loop optimizations
Sub-expression elimination
Inlining
Algebraic simplifications
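
As one example of choosing the arithmetic data type, here is a minimal fixed-point sketch, assuming a Q16.16 format (16 integer bits, 16 fractional bits). The names `q16_16`, `Q16_ONE`, `q16_from_int` and `q16_mul` are made up for this illustration, and overflow handling is left out.

```c
#include <stdint.h>

/* Q16.16: the real value v is stored as the integer v * 2^16. */
typedef int32_t q16_16;

#define Q16_ONE ((q16_16)1 << 16)

/* Converts a small integer to Q16.16 (no overflow check). */
q16_16 q16_from_int(int32_t v)
{
    return (q16_16)(v * Q16_ONE);
}

/* Multiplies two Q16.16 values using only integer operations.
 * The product is formed in 64 bits and then scaled back by 2^16;
 * compilers typically implement the division by a power of two
 * with shift instructions. */
q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) / Q16_ONE);
}
```

For instance, `q16_mul(q16_from_int(3), Q16_ONE / 2)` represents 3 * 0.5 and yields the Q16.16 encoding of 1.5.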

Low-level optimizations

A lot is handled by the compiler
Sometimes we need to give hints to the compiler
E.g., C keywords like `inline` and `register`
ISO/IEC TR 18037:2008: C extensions to support embedded processors (Embedded C)
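
A short sketch of what such hints look like in standard C. The functions `clamp_u32` and `sum_clamped` are hypothetical. Both keywords are only requests: the compiler is free to ignore them, and modern optimizing compilers usually make these decisions on their own.

```c
#include <stdint.h>

/* `static inline` suggests expanding the body at each call site
 * instead of emitting a function call. */
static inline uint32_t clamp_u32(uint32_t value, uint32_t lo, uint32_t hi)
{
    if (value < lo) {
        return lo;
    }
    if (value > hi) {
        return hi;
    }
    return value;
}

/* Sums the samples after clamping each one to [lo, hi]. */
uint32_t sum_clamped(const uint32_t *samples, uint32_t count,
                     uint32_t lo, uint32_t hi)
{
    register uint32_t sum = 0;  /* hint: keep the accumulator in a register */

    for (uint32_t i = 0; i < count; i++) {
        sum += clamp_u32(samples[i], lo, hi);
    }
    return sum;
}
```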

Schkufza, Eric, Rahul Sharma, and Alex Aiken. "Stochastic superoptimization." ACM SIGARCH Computer Architecture News 41.1 (2013): 305-316.

Preparation for lab 3

Zephyr Sensor Subsystem

Zephyr tries to describe applications on a high level
By developing against Zephyr API => high portability
Provides unified API for accessing sensors
Works on channels
- Measurable quantity
- Examples: SENSOR_CHAN_ACCEL_X, SENSOR_CHAN_VOLTAGE, SENSOR_CHAN_AMBIENT_TEMP
- Every channel has a defined unit

Thanks for today!
