Hardware Architecture - S6
Year: 2022-2023 (Semester 6)
Credits: 3 ECTS
Type: Architecture and Embedded Systems
PART A: GENERAL OVERVIEW
Course Objectives
This course deepens the study of modern processor architectures, focusing on ARM and x86 architectures, assembly language programming, pipeline mechanisms, and hardware security aspects. The emphasis is on low-level understanding of processor operation and hardware vulnerabilities.
Target Skills
- Understand ARM and x86/x64 architectures
- Program in ARM and x86 assembly
- Master pipeline and parallelism mechanisms
- Analyze processor performance
- Understand memory hierarchy and caches
- Identify hardware vulnerabilities
- Analyze side-channel attacks
- Optimize code for the target architecture
- Understand energy/performance trade-offs
Organization
- Hours: Lectures and practical lab sessions
- Assessment: Written exam + lab report
- Semester: 6 (2022-2023)
- Prerequisites: Basic architecture, C programming, operating systems
PART B: EXPERIENCE, CONTEXT AND FUNCTION
Pedagogical Content
The course covers modern architectures and their security implications.
1. Assembly Language Programming
Introduction to assembly:
Assembly is the lowest-level language a programmer still writes by hand: its instructions map almost one-to-one onto the binary machine code the processor executes.
Advantages:
- Total control of the processor
- Optimal performance
- Understanding of hardware operation
- Low-level debugging
- Reverse engineering
Uses:
- Operating system kernels
- Device drivers
- Performance-critical code
- Embedded systems
- Security and cryptography
Structure of an assembly program:
Typical sections:
- .data section: initialized data
- .bss section: uninitialized data
- .text section: executable code
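As a rough illustration, here is how C-level definitions typically map onto those sections with a GCC/ELF toolchain (a sketch; exact placement depends on the compiler and flags):

```c
/* Sketch: typical section placement for a GCC/ELF build. */
int counter = 42;             /* initialized global      -> .data   */
int buffer[1024];             /* zero/uninitialized data -> .bss    */
const char greeting[] = "hi"; /* read-only constant      -> .rodata */

int main(void) {              /* machine code            -> .text   */
    return counter + buffer[0] + greeting[0];
}
```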
Registers:
Ultra-fast memory inside the processor.
Types:
- General-purpose registers (computations, data)
- Stack Pointer (SP)
- Program Counter (PC)
- Status register (flags)
2. ARM Architecture
ARM characteristics:
ARM (Advanced RISC Machine) is the dominant architecture in embedded and mobile.
RISC principles (Reduced Instruction Set Computer):
- Simple and uniform instructions
- Fast execution (most instructions complete in one cycle)
- Many registers
- Load/store architecture
ARM registers:
| Register | Name | Function |
|---|---|---|
| R0-R12 | General-purpose registers | Computations and data |
| R13 (SP) | Stack Pointer | Stack pointer |
| R14 (LR) | Link Register | Return address |
| R15 (PC) | Program Counter | Current instruction address |
| CPSR | Current Program Status | Status flags |
Basic ARM instructions:
; Data movement
MOV R0, #5 ; R0 = 5
LDR R1, [R2] ; R1 = memory[R2]
STR R0, [R1] ; memory[R1] = R0
; Arithmetic
ADD R0, R1, R2 ; R0 = R1 + R2
SUB R0, R1, #10 ; R0 = R1 - 10
MUL R0, R1, R2 ; R0 = R1 x R2
; Logic
AND R0, R1, R2 ; R0 = R1 AND R2
ORR R0, R1, R2 ; R0 = R1 OR R2
EOR R0, R1, R2 ; R0 = R1 XOR R2
; Comparison and branching
CMP R0, R1 ; Compare R0 and R1
BEQ label ; Branch if equal
BNE label ; Branch if not equal
BL function ; Function call
ARM calling convention:
- R0-R3: first 4 arguments
- R0: return value
- R4-R11: saved by the called function
- LR (R14): return address
ARM function example:
; Addition function: int add(int a, int b)
add:
PUSH {LR} ; Save LR
ADD R0, R0, R1 ; R0 = R0 + R1 (result)
POP {PC} ; Return (restore PC)
; Function call
MOV R0, #5
MOV R1, #3
BL add ; Call add(5, 3)
; R0 contains 8
3. x86/x64 Architecture
x86 characteristics:
x86 (Intel/AMD) is the dominant architecture on PCs and servers.
CISC principles (Complex Instruction Set Computer):
- Complex and varied instructions
- Variable execution time
- Fewer registers
- Directly accessible memory
x86-64 registers:
| Register (64 bits) | 32 bits | 16 bits | 8 bits | Usage |
|---|---|---|---|---|
| RAX | EAX | AX | AL | Accumulator, return value |
| RBX | EBX | BX | BL | Base |
| RCX | ECX | CX | CL | Counter |
| RDX | EDX | DX | DL | Data |
| RSI | ESI | SI | SIL | Source index |
| RDI | EDI | DI | DIL | Destination index |
| RBP | EBP | BP | BPL | Base pointer (stack frame) |
| RSP | ESP | SP | SPL | Stack pointer |
| R8-R15 | R8D-R15D | R8W-R15W | R8B-R15B | Additional registers |
(The 8-bit forms SIL, DIL, BPL, SPL and the R8-R15 family are only addressable in 64-bit mode.)
Basic x86 instructions:
; Data movement
mov rax, 5 ; rax = 5
mov rbx, [rax] ; rbx = memory[rax]
lea rax, [rbx+8] ; rax = address rbx+8
; Arithmetic
add rax, rbx ; rax = rax + rbx
sub rax, 10 ; rax = rax - 10
imul rax, rbx ; rax = rax x rbx
idiv rcx ; signed divide of rdx:rax by rcx: quotient in rax, remainder in rdx (sign-extend rax into rdx with cqo first)
; Logic
and rax, rbx ; rax = rax AND rbx
or rax, rbx ; rax = rax OR rbx
xor rax, rax ; rax = 0 (common idiom)
not rax ; rax = NOT rax
; Stack
push rax ; Push rax
pop rbx ; Pop into rbx
; Branching
cmp rax, rbx ; Compare
je label ; Jump if equal
jne label ; Jump if not equal
jmp label ; Unconditional jump
call function ; Function call
ret ; Return
x64 calling convention (System V):
Arguments in order:
- RDI
- RSI
- RDX
- RCX
- R8
- R9
- Then on the stack
Return value: RAX
x64 function example:
; Function: int add(int a, int b)
add:
push rbp ; Prologue
mov rbp, rsp
mov eax, edi ; a in eax
add eax, esi ; eax += b
pop rbp ; Epilogue
ret
ARM vs x86 comparison:
| Aspect | ARM | x86 |
|---|---|---|
| Philosophy | RISC | CISC |
| Instructions | Simple, regular | Complex, varied |
| Instruction length | Fixed, 32 bits (16/32 bits with Thumb) | Variable (1-15 bytes) |
| Registers | 16 (ARM32) | 16 (x64) |
| Power consumption | Low | High |
| Performance/Watt | Excellent | Average |
| Usage | Mobile, embedded | PC, servers |
4. Pipeline and Parallelism
Instruction pipeline:
Technique allowing multiple instructions to be executed in parallel.
Classic stages (5 stages):
- IF (Instruction Fetch): read instruction
- ID (Instruction Decode): decoding
- EX (Execute): execution
- MEM (Memory): memory access
- WB (Write Back): write result
Without pipeline: 5 cycles per instruction.
With pipeline: 1 instruction/cycle at steady state.
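A quick back-of-the-envelope model (assuming an ideal 5-stage pipeline with no hazards) shows where that speedup comes from:

```c
#include <stdio.h>

/* Idealized pipeline model: 5 stages, one stage per cycle, no stalls. */
int main(void) {
    long n = 1000;                      /* number of instructions      */
    long stages = 5;
    long no_pipe = n * stages;          /* each instruction runs alone */
    long pipe    = stages + (n - 1);    /* fill once, then 1 per cycle */
    printf("without pipeline: %ld cycles, with pipeline: %ld cycles (speedup %.2fx)\n",
           no_pipe, pipe, (double)no_pipe / pipe);
    return 0;
}
```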
Pipeline hazards:
Structural hazards:
Conflict on a hardware resource.
Solution: duplicate resources.
Data hazards:
An instruction depends on a result that is not yet available.
Example:
ADD R1, R2, R3 ; R1 = R2 + R3
SUB R4, R1, R5 ; R4 = R1 - R5 (depends on R1!)
Solutions:
- Stall (bubbles): wait
- Forwarding: transmit result directly
- Reordering: scheduler rearranges instructions
Control hazards:
Conditional branches disrupt the pipeline.
The processor does not know which instruction to fetch after a branch.
Solutions:
- Branch prediction: guess the direction
- Speculative execution: execute both paths
- Branch delay slot: instruction after branch always executed
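The cost of mispredictions can be felt from C with a small experiment (a sketch; compile without aggressive optimization, since the compiler may otherwise turn the branch into a conditional move): summing the values above a threshold is noticeably faster once the array is sorted, because the branch becomes predictable.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_int(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

/* Time the same branchy loop on random data, then on sorted data. */
static long sum_above(const int *d, int n) {
    long s = 0;
    for (int r = 0; r < 100; r++)
        for (int i = 0; i < n; i++)
            if (d[i] >= 128) s += d[i];   /* the branch under test */
    return s;
}

int main(void) {
    enum { N = 1 << 20 };
    static int data[N];
    for (int i = 0; i < N; i++) data[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_above(data, N);         /* random order: ~50% mispredictions    */
    clock_t t1 = clock();

    qsort(data, N, sizeof data[0], cmp_int);
    clock_t t2 = clock();
    long s2 = sum_above(data, N);         /* sorted: branch is highly predictable */
    clock_t t3 = clock();

    printf("unsorted: %ld ticks, sorted: %ld ticks (sums %ld / %ld)\n",
           (long)(t1 - t0), (long)(t3 - t2), s1, s2);
    return 0;
}
```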
Superscalar architectures:
Multiple pipelines in parallel → multiple instructions/cycle.
Examples: modern processors (4-6 instructions/cycle).
Out-of-order execution:
The processor reorders instructions to maximize functional unit utilization.
Hides latencies and improves IPC (Instructions Per Cycle).
Instruction-level parallelism (ILP):
Automatic exploitation of parallelism in sequential code.
Techniques:
- Pipeline
- Superscalar
- Out-of-order execution
- Branch prediction
- Register renaming
5. Memory Hierarchy
Memory pyramid:
| Level | Size | Latency | Cost |
|---|---|---|---|
| Registers | A few bytes | < 1 ns | Very high |
| L1 Cache | 32-64 KB | 1-2 ns | High |
| L2 Cache | 256 KB - 1 MB | 5-10 ns | Medium |
| L3 Cache | 4-32 MB | 20-40 ns | Low |
| RAM | 4-64 GB | 50-100 ns | Very low |
| SSD | 256 GB - 2 TB | 0.1 ms | Minimal |
| HDD | 1-10 TB | 10 ms | Minimal |
Locality principle:
- Temporal: recently accessed data will likely be accessed again
- Spatial: nearby data will likely be accessed together
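Spatial locality is easy to observe from C (a sketch assuming the usual row-major layout of C arrays): walking a matrix row by row reuses every cache line fully, while walking it column by column touches a new line on almost every access.

```c
#include <stdio.h>
#include <time.h>

#define N 4096
static int m[N][N];     /* row-major: m[i][0..N-1] are contiguous in memory */

int main(void) {
    long sum = 0;

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)          /* row-major walk: sequential addresses */
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    clock_t t1 = clock();

    for (int j = 0; j < N; j++)          /* column-major walk: large strides,    */
        for (int i = 0; i < N; i++)      /* one element used per line fetched    */
            sum += m[i][j];
    clock_t t2 = clock();

    printf("row-major: %ld ticks, column-major: %ld ticks (sum=%ld)\n",
           (long)(t1 - t0), (long)(t2 - t1), sum);
    return 0;
}
```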
Cache memory:
Fast memory between processor and RAM.
Organization:
Direct-mapped: each RAM address has only one possible location in cache.
- Simple but frequent conflicts
Set-associative: each address has N possible locations.
- Good compromise (2-way, 4-way, 8-way common)
Fully-associative: address can go anywhere.
- Flexible but complex and expensive
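To make the direct-mapped case concrete, this sketch (with assumed parameters: a 32 KB cache with 64-byte lines) shows how an address is split into offset, set index and tag:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed parameters: 32 KB direct-mapped cache, 64-byte lines -> 512 sets. */
#define LINE_SIZE  64u
#define CACHE_SIZE (32u * 1024u)
#define NUM_SETS   (CACHE_SIZE / LINE_SIZE)

int main(void) {
    uint64_t addr   = 0x7ffd1234abcdULL;             /* arbitrary example address   */
    uint64_t offset = addr % LINE_SIZE;               /* byte within the cache line  */
    uint64_t index  = (addr / LINE_SIZE) % NUM_SETS;  /* which set the line maps to  */
    uint64_t tag    = addr / (LINE_SIZE * NUM_SETS);  /* identifies the memory block */
    printf("offset=%llu  index=%llu  tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)index,
           (unsigned long long)tag);
    return 0;
}
```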
Replacement policies:
When the cache is full, which line to evict?
- LRU (Least Recently Used): least recently used
- FIFO: oldest
- Random: random
- LFU (Least Frequently Used): least frequently used
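A toy simulation (assuming a tiny 4-line fully-associative cache) shows how LRU chooses its victim:

```c
#include <stdio.h>

#define LINES 4

/* Toy LRU cache: 4 fully-associative lines, "tags" are just block numbers. */
int main(void) {
    long tag[LINES], last_use[LINES];
    for (int i = 0; i < LINES; i++) { tag[i] = -1; last_use[i] = -1; }

    long accesses[] = {1, 2, 3, 4, 1, 5, 2};        /* blocks accessed in order */
    int n = sizeof accesses / sizeof accesses[0];

    for (int t = 0; t < n; t++) {
        long blk = accesses[t];
        int hit = -1, victim = 0;
        for (int i = 0; i < LINES; i++) {
            if (tag[i] == blk) hit = i;
            if (last_use[i] < last_use[victim]) victim = i;  /* least recently used */
        }
        if (hit >= 0) {
            last_use[hit] = t;
            printf("access %ld: hit on line %d\n", blk, hit);
        } else {
            printf("access %ld: miss, evict line %d (old tag %ld)\n", blk, victim, tag[victim]);
            tag[victim] = blk;
            last_use[victim] = t;
        }
    }
    return 0;
}
```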
Write policies:
Write-through:
- Simultaneous write to cache + RAM
- Guaranteed coherence
- Slow
Write-back:
- Write only to cache
- RAM updated upon eviction
- Fast but complex
- "Dirty" bit to track modifications
Cache coherence:
In a multiprocessor system, how to ensure all caches see the same data?
MESI, MOESI protocols: cache line states (Modified, Exclusive, Shared, Invalid).
6. Hardware Security and Attacks
Hardware vulnerabilities:
Hardware is not infallible and can be exploited.
Side-channel attacks:
Exploitation of indirect information (timing, power consumption, electromagnetic emissions).
Timing attack:
Measuring execution time to deduce secret information.
Example: cache timing.
If data is in cache, access is fast.
If not, access is slow.
By measuring time, an attacker can deduce which data has been accessed.
Spectre attack:
Exploits speculative execution and branch prediction.
Principle:
- Train the branch predictor
- Cause speculative execution of code that accesses sensitive data
- Observe side effects via cache timing
- Recover secret data
Impact: information leakage between processes, bypassing memory isolation.
Meltdown attack:
Exploits the delay between permission checking and cancellation of a speculative instruction.
Allows a user process to read kernel memory.
Major impact: all recent Intel processors were vulnerable.
Mitigation: KPTI (Kernel Page Table Isolation) with performance cost.
Power analysis attack:
Measuring electrical power consumption to extract cryptographic keys.
Types:
- SPA (Simple Power Analysis): direct observation
- DPA (Differential Power Analysis): statistical analysis
Prime targets: smart cards, embedded systems.
Countermeasures:
- Masking (randomization)
- Power balancing
- Constant-time operations
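A minimal sketch of the last point: a naive byte comparison returns at the first mismatch, so its running time leaks how long the matching prefix is, while a constant-time version always inspects every byte.

```c
#include <stddef.h>
#include <stdio.h>

/* Leaky: exits at the first differing byte; timing reveals the match length. */
static int leaky_equal(const unsigned char *a, const unsigned char *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}

/* Constant-time: always touches all n bytes, accumulating differences. */
static int ct_equal(const unsigned char *a, const unsigned char *b, size_t n) {
    unsigned char diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= (unsigned char)(a[i] ^ b[i]);
    return diff == 0;
}

int main(void) {
    const unsigned char secret[16] = "correct-horse!!";
    const unsigned char guess[16]  = "correct-zebra!!";
    printf("leaky=%d  constant-time=%d\n",
           leaky_equal(secret, guess, 16), ct_equal(secret, guess, 16));
    return 0;
}
```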
Electromagnetic attacks:
Eavesdropping on processor EM emissions.
Similar to power analysis attacks but without contact.
General countermeasures:
- Secure hardware design
- Microcode patches
- OS modifications (KPTI)
- Constant-time code
- Masking and randomization
- Secure enclaves (Intel SGX, ARM TrustZone)
7. Multi-core Processors
Evolution towards parallelism:
End of Dennard scaling: clock frequencies have plateaued at roughly 4-5 GHz even though transistor counts keep growing.
Solution: multiply cores.
Multi-core architectures:
| Number of cores | Typical usage |
|---|---|
| 2-4 | Laptop, mobile |
| 4-8 | Desktop |
| 8-64 | Server |
| 64+ | High-performance computing |
Symmetric (SMP):
All cores identical, sharing memory.
Asymmetric (AMP):
Cores of different types (e.g., ARM big.LITTLE).
Fast cores (big) for heavy tasks.
Efficient cores (LITTLE) for light tasks.
Optimizes performance/power ratio.
Simultaneous Multithreading (SMT, sold by Intel as Hyper-Threading):
One physical core appears as multiple logical cores.
The core's functional units are shared between hardware threads.
Typical gain: on the order of 20-30%, depending on the workload.
Processor affinity:
Binding a process to a specific core.
Advantages:
- Better cache utilization
- Real-time predictability
- Isolation for security
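On Linux, affinity can be set from C with sched_setaffinity (a minimal, Linux-specific sketch):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Pin the calling process to CPU core 0 (Linux-specific API). */
int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                    /* allow core 0 only       */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {   /* pid 0 = current process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d now runs only on core 0\n", (int)getpid());
    return 0;
}
```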
PART C: TECHNICAL ASPECTS
Lab Sessions
Lab: ARM vs x86/x64 comparison
Objective: understand architectural differences through practice.
Typical exercises:
1. Simple program in ARM and x86:
Write the same function in both assembly languages.
Example: factorial computation.
ARM:
factorial:
PUSH {R4, LR}
MOV R4, R0           ; Save n
CMP R0, #1
MOVLE R0, #1         ; Base case: factorial(0) = factorial(1) = 1
BLE end_fact
SUB R0, R0, #1
BL factorial         ; Recursive call with n-1
MUL R0, R4, R0       ; n * factorial(n-1)
end_fact:
POP {R4, PC}
x86-64:
factorial:
push rbp
mov rbp, rsp
mov rax, 1           ; Base case result: factorial(0) = factorial(1) = 1
cmp rdi, 1
jle end_fact
push rdi
dec rdi
call factorial       ; Recursive call with rdi-1, result in rax
pop rdi
imul rax, rdi        ; factorial(n-1) * n
end_fact:
pop rbp
ret
2. Performance analysis:
Measure execution time of different implementations.
Compare:
- Optimized C code
- Hand-written assembly
- Different optimizations
3. Cache usage:
Write code that uses the cache well vs poorly.
Good usage: sequential array traversal.
Bad usage: random access.
4. Vulnerability study:
Implement a simple cache timing attack.
Observe the time difference between:
- Data in cache
- Data not in cache
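A possible starting point for this exercise (x86-only sketch, assuming GCC/Clang and the _mm_clflush / __rdtsc intrinsics from <x86intrin.h>): time one access to a buffer while it is cached, flush it, then time the access again.

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence (GCC/Clang, x86 only) */

static char probe[4096];

/* Time a single read of *p in TSC cycles, with fences to order the measurement. */
static uint64_t time_access(const char *p) {
    _mm_mfence();
    uint64_t start = __rdtsc();
    volatile char v = *p;               /* the memory access being timed */
    (void)v;
    _mm_mfence();
    return __rdtsc() - start;
}

int main(void) {
    volatile char warm = probe[0];      /* bring the line into the cache        */
    (void)warm;
    uint64_t hot = time_access(probe);  /* expected: a handful of cycles        */

    _mm_clflush(probe);                 /* evict the line from all cache levels */
    _mm_mfence();
    uint64_t cold = time_access(probe); /* expected: main-memory latency        */

    printf("cached: ~%llu cycles, flushed: ~%llu cycles\n",
           (unsigned long long)hot, (unsigned long long)cold);
    return 0;
}
```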
Tools and Environment
Assemblers:
# ARM
arm-none-eabi-as program.s -o program.o
arm-none-eabi-ld program.o -o program
# x86-64
nasm -f elf64 program.asm
ld program.o -o program
# Or via GCC
gcc -S program.c # Generate assembly
gcc -c program.s # Assemble
Disassembly:
objdump -d program # Disassemble
objdump -S program # With interleaved source code
gdb program # Debugger
Performance analysis:
perf stat ./program # Performance statistics
perf record ./program # Record profile
perf report # Analyze profile
# Hardware counters
perf stat -e cache-misses,cache-references ./program
Simulation:
qemu-arm program # Emulate ARM
qemu-x86_64 program # Emulate x86-64
Assembly Optimizations
Common techniques:
Loop unrolling:
// Original
for(i=0; i<100; i++)
a[i] = b[i] + c[i];
// Unrolled
for(i=0; i<100; i+=4) {
a[i] = b[i] + c[i];
a[i+1] = b[i+1] + c[i+1];
a[i+2] = b[i+2] + c[i+2];
a[i+3] = b[i+3] + c[i+3];
}
Advantages: fewer tests, better pipeline utilization.
Vectorization (SIMD):
Process multiple data simultaneously.
ARM NEON, x86 SSE/AVX: operations on 128-512 bits.
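As an illustration (x86 SSE intrinsics via <immintrin.h>; a sketch, not the only way, since compilers can also auto-vectorize such loops), four float additions collapse into a single instruction:

```c
#include <immintrin.h>   /* SSE intrinsics; available by default with x86-64 compilers */
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float r[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats into one 128-bit register  */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);   /* 4 additions performed by one instruction */
    _mm_storeu_ps(r, vr);

    printf("%.1f %.1f %.1f %.1f\n", r[0], r[1], r[2], r[3]);
    return 0;
}
```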
Cache-friendly reordering:
Access data in the order of their layout in memory.
Branch elimination:
Replace if with arithmetic/logical computations.
Example:
// With branch
if(x > 0) y = a; else y = b;
// Without branch (x86)
mov eax, a ; y = a by default
mov ebx, b
cmp x, 0
cmovle eax, ebx ; Conditional move: if x <= 0, y = b (result y in eax)
PART D: ANALYSIS AND REFLECTION
Acquired Skills
Low-level programming:
- Proficiency in ARM and x86 assembly
- Understanding the link between C and assembly
- Critical code optimization
- Hardware-level debugging
Architecture:
- Understanding modern pipelines
- Knowledge of memory hierarchies
- Cache operation
- Hardware parallelism and multi-core
Security:
- Identifying hardware vulnerabilities
- Understanding side-channel attacks
- Awareness of security/performance trade-offs
- Hardware-level risk analysis
Practical Applications
Hardware architecture impacts all areas of computing:
Embedded systems:
- ARM programming for microcontrollers
- Optimization under constraints (memory, energy)
- Critical real-time systems
- IoT and connected objects
Security:
- Malware analysis (reverse engineering)
- Cryptography resistant to physical attacks
- Secure systems (smart cards, TPM)
- Hardware attack detection
Performance:
- Critical code optimization (gaming, scientific computing)
- Efficient hardware utilization (caches, SIMD)
- Multi-core parallelization
- Energy consumption reduction
System development:
- Operating system kernels
- Device drivers
- Bootloaders and firmware
- Hypervisors and virtualization
Links with Other Courses
| Course | Link |
|---|---|
| Computer Architecture (S5) | Architecture fundamentals |
| Operating Systems (S5) | Link with system software |
| C Language (S5) | Compilation to assembly |
| Microcontroller (S6) | Practical ARM programming |
| Hardware Security (S7) | Security deep dive |
| Real-Time Systems (S8) | Optimization and predictability |
Architecture Evolution
Current trends:
Energy efficiency:
Performance per Watt is becoming critical.
ARM dominates mobile, is gaining ground in datacenters (AWS Graviton), and is moving into general-purpose computing (Apple M1/M2).
Heterogeneous architectures:
Combination of different processors:
- General-purpose CPUs
- GPUs for parallel computation
- NPU (Neural Processing Unit) for AI
- Specialized accelerators (crypto, codecs)
RISC-V:
Open-source architecture alternative to ARM and x86.
Growing adoption in embedded and research.
Quantum computing:
Radically different architectures.
Still experimental but promising for certain problems.
Non-volatile memory:
Emerging technologies (MRAM, ReRAM, 3D XPoint).
Blurring the distinction between RAM and storage.
Security: An Ongoing Challenge
Lessons from recent vulnerabilities:
Spectre/Meltdown revealed:
- Performance optimizations create security flaws
- Software fixes are costly in terms of performance
- Need to rethink hardware design
Secure design:
Emerging principles:
- Security by design
- Reinforced hardware isolation
- Secure enclaves
- Formal verification
Inevitable trade-offs:
Performance vs Security:
- Disabling features (hyperthreading)
- Costly isolation (KPTI)
- Constant-time operations are slower
My Opinion
This course is essential for understanding how computers actually work.
Strengths:
- Concrete view of hardware
- Understanding of compiler optimizations
- Awareness of security challenges
- Educational assembly programming
Professional importance:
This knowledge is critical for:
- Highly constrained embedded systems
- High-performance code optimization
- Computer security (analysis, design)
- Understanding emerging architectures
Assembly today:
Although rarely written directly, understanding assembly allows:
- Reading compiler-generated code
- Optimizing critical sections
- Debugging low-level issues
- Analyzing binaries (reverse engineering)
ARM vs x86: an interesting battle:
ARM:
- Dominates mobile/embedded
- Breaking into datacenters
- Superior energy efficiency
- Impressive Apple M1/M2
x86:
- Still dominant on PC/servers
- High raw power
- Mature ecosystem
- Valuable backward compatibility
Hardware security: growing priority:
Hardware attacks are becoming increasingly sophisticated.
Requires:
- Developer training
- Adapted analysis tools
- Risk-aware design
- Constant technology watch
Future of architectures:
Towards more:
- Specialization (AI accelerators, crypto)
- Heterogeneity (generalized big.LITTLE)
- Energy efficiency
- Integrated security
Personal assessment: This course provided an in-depth understanding of low-level operation of modern processors. ARM and x86 assembly programming allowed seeing concretely how code executes on hardware. Awareness of hardware vulnerabilities (Spectre, Meltdown) and side-channel attacks is particularly important for designing secure systems. This knowledge is directly applicable in embedded development, performance optimization, and security analysis. The ARM/x86 comparison sheds light on architectural trade-offs and the future evolution of computing.
Course Documents
Reports and Projects
Lab Report - Hardware Architecture
Lab report on the comparison of ARM and x86/x64 architectures, assembly programming and performance analysis.
Introduction to Assembly
Complete course on assembly languages, their role and their use in processor architecture.
ARM vs x86 Comparison
Comparative study of ARM and x86/x64 architectures: instructions, registers, calling conventions and performance.
Introduction to Hardware Attacks
Overview of hardware vulnerabilities and side-channel attacks (Spectre, Meltdown, timing attacks).
Power Consumption Attacks
Detailed analysis of SPA and DPA attacks on cryptographic circuits via electrical power consumption analysis.