Hardware Architecture - S6

Year: 2022-2023 (Semester 6)
Credits: 3 ECTS
Type: Architecture and Embedded Systems


PART A: GENERAL OVERVIEW

Course Objectives

This course deepens the study of modern processor architectures, focusing on ARM and x86 architectures, assembly language programming, pipeline mechanisms, and hardware security aspects. The emphasis is on low-level understanding of processor operation and hardware vulnerabilities.

Target Skills

  • Understand ARM and x86/x64 architectures
  • Program in ARM and x86 assembly
  • Master pipeline and parallelism mechanisms
  • Analyze processor performance
  • Understand memory hierarchy and caches
  • Identify hardware vulnerabilities
  • Analyze side-channel attacks
  • Optimize code for the target architecture
  • Understand energy/performance trade-offs

Organization

  • Hours: Lectures and practical lab sessions
  • Assessment: Written exam + lab report
  • Semester: 6 (2022-2023)
  • Prerequisites: Basic architecture, C programming, operating systems

PART B: EXPERIENCE, CONTEXT AND FUNCTION

Pedagogical Content

The course covers modern architectures and their security implications.

1. Assembly Language Programming

Introduction to assembly:

Assembly is the lowest-level human-readable language, one step above machine code.

Advantages:

  • Total control of the processor
  • Optimal performance
  • Understanding of hardware operation
  • Low-level debugging
  • Reverse engineering

Uses:

  • Operating system kernels
  • Device drivers
  • Performance-critical code
  • Embedded systems
  • Security and cryptography

Structure of an assembly program:

Typical sections:

  • .data section: initialized data
  • .bss section: uninitialized data
  • .text section: executable code
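
As a rough illustration, here is how common C constructs typically end up in these sections when compiled to an ELF object with GCC (a sketch; exact placement depends on the toolchain and flags, and read-only data usually goes to the related .rodata section):

/* sections.c -- illustrative mapping of C objects to sections */
int counter = 42;                /* initialized global        -> .data   */
int buffer[1024];                /* uninitialized global      -> .bss    */
static const char msg[] = "hi";  /* read-only constant        -> .rodata */

int increment(void)              /* machine code of functions -> .text   */
{
    return ++counter;
}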

Registers:

Ultra-fast memory inside the processor.

Types:

  • General-purpose registers (computations, data)
  • Stack Pointer (SP)
  • Program Counter (PC)
  • Status register (flags)

2. ARM Architecture

ARM characteristics:

ARM (Advanced RISC Machine) is the dominant architecture in embedded and mobile.

RISC principles (Reduced Instruction Set Computer):

  • Simple and uniform instructions
  • Fast execution (typically one instruction per cycle)
  • Many registers
  • Load/store architecture

ARM registers:

Register   | Name                   | Function
R0-R12     | General-purpose        | Computations and data
R13 (SP)   | Stack Pointer          | Stack pointer
R14 (LR)   | Link Register          | Return address
R15 (PC)   | Program Counter        | Current instruction address
CPSR       | Current Program Status | Status flags

Basic ARM instructions:

; Data movement
MOV R0, #5          ; R0 = 5
LDR R1, [R2]        ; R1 = memory[R2]
STR R0, [R1]        ; memory[R1] = R0

; Arithmetic
ADD R0, R1, R2      ; R0 = R1 + R2
SUB R0, R1, #10     ; R0 = R1 - 10
MUL R0, R1, R2      ; R0 = R1 x R2

; Logic
AND R0, R1, R2      ; R0 = R1 AND R2
ORR R0, R1, R2      ; R0 = R1 OR R2
EOR R0, R1, R2      ; R0 = R1 XOR R2

; Comparison and branching
CMP R0, R1          ; Compare R0 and R1
BEQ label           ; Branch if equal
BNE label           ; Branch if not equal
BL function         ; Function call

ARM calling convention:

  • R0-R3: first 4 arguments
  • R0: return value
  • R4-R11: saved by the called function
  • LR (R14): return address

ARM function example:

; Addition function: int add(int a, int b)
add:
    PUSH {LR}           ; Save LR
    ADD R0, R0, R1      ; R0 = R0 + R1 (result)
    POP {PC}            ; Return (restore PC)

; Function call
MOV R0, #5
MOV R1, #3
BL add                  ; Call add(5, 3)
; R0 contains 8

3. x86/x64 Architecture

x86 characteristics:

x86 (Intel/AMD) is the dominant architecture on PCs and servers.

CISC principles (Complex Instruction Set Computer):

  • Complex and varied instructions
  • Variable execution time
  • Fewer registers
  • Directly accessible memory

x86-64 registers:

Register | 64 bits | 32 bits | 16 bits | 8 bits | Usage
RAX      | RAX     | EAX     | AX      | AL     | Accumulator, return
RBX      | RBX     | EBX     | BX      | BL     | Base
RCX      | RCX     | ECX     | CX      | CL     | Counter
RDX      | RDX     | EDX     | DX      | DL     | Data
RSI      | RSI     | ESI     | SI      | -      | Source index
RDI      | RDI     | EDI     | DI      | -      | Destination index
RBP      | RBP     | EBP     | BP      | -      | Base pointer (stack)
RSP      | RSP     | ESP     | SP      | -      | Stack pointer
R8-R15   | -       | -       | -       | -      | Additional registers

Basic x86 instructions:

; Data movement
mov rax, 5          ; rax = 5
mov rbx, [rax]      ; rbx = memory[rax]
lea rax, [rbx+8]    ; rax = address rbx+8

; Arithmetic
add rax, rbx        ; rax = rax + rbx
sub rax, 10         ; rax = rax - 10
imul rax, rbx       ; rax = rax x rbx
idiv rcx            ; rdx:rax / rcx -> quotient in rax, remainder in rdx (sign-extend with cqo first)

; Logic
and rax, rbx        ; rax = rax AND rbx
or rax, rbx         ; rax = rax OR rbx
xor rax, rax        ; rax = 0 (common idiom)
not rax             ; rax = NOT rax

; Stack
push rax            ; Push rax
pop rbx             ; Pop into rbx

; Branching
cmp rax, rbx        ; Compare
je label            ; Jump if equal
jne label           ; Jump if not equal
jmp label           ; Unconditional jump
call function       ; Function call
ret                 ; Return

x64 calling convention (System V):

Arguments in order:

  1. RDI
  2. RSI
  3. RDX
  4. RCX
  5. R8
  6. R9
  7. Then on the stack

Return value: RAX

x64 function example:

; Function: int add(int a, int b)
add:
    push rbp            ; Prologue
    mov rbp, rsp

    mov eax, edi        ; a in eax
    add eax, esi        ; eax += b

    pop rbp             ; Epilogue
    ret

ARM vs x86 comparison:

Aspect             | ARM               | x86
Philosophy         | RISC              | CISC
Instructions       | Simple, regular   | Complex, varied
Instruction length | Fixed (32 bits)   | Variable (1-15 bytes)
Registers          | 16 (ARM32)        | 16 (x64)
Power consumption  | Low               | High
Performance/Watt   | Excellent         | Average
Usage              | Mobile, embedded  | PC, servers

4. Pipeline and Parallelism

Instruction pipeline:

Technique allowing multiple instructions to be executed in parallel.

Classic stages (5 stages):

  1. IF (Instruction Fetch): read instruction
  2. ID (Instruction Decode): decoding
  3. EX (Execute): execution
  4. MEM (Memory): memory access
  5. WB (Write Back): write result

Without pipeline: 5 cycles per instruction.
With pipeline: 1 instruction/cycle at steady state.
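
A quick back-of-the-envelope check (assuming an ideal pipeline with no stalls): the first instruction needs one cycle per stage, then one instruction completes every cycle.

/* Illustrative only: cycles to run n instructions on an ideal k-stage pipeline. */
unsigned long pipeline_cycles(unsigned int k, unsigned long n)
{
    return (n == 0) ? 0 : k + (n - 1);
}
/* Example: k = 5, n = 1000 -> 1004 cycles, i.e. about 1 cycle per instruction. */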

Pipeline hazards:

Structural hazards:

Conflict on a hardware resource.
Solution: duplicate resources.

Data hazards:

An instruction depends on a result that is not yet available.

Example:

ADD R1, R2, R3    ; R1 = R2 + R3
SUB R4, R1, R5    ; R4 = R1 - R5 (depends on R1!)

Solutions:

  • Stall (bubbles): wait
  • Forwarding: transmit result directly
  • Reordering: scheduler rearranges instructions

Control hazards:

Conditional branches disrupt the pipeline.

The processor does not know which instruction to fetch after a branch.

Solutions:

  • Branch prediction: guess the direction
  • Speculative execution: execute the predicted path before the branch resolves, and discard the work if the prediction was wrong
  • Branch delay slot: instruction after branch always executed

Superscalar architectures:

Multiple pipelines in parallel → multiple instructions/cycle.

Examples: modern processors (4-6 instructions/cycle).

Out-of-order execution:

The processor reorders instructions to maximize functional unit utilization.

Hides latencies and improves IPC (Instructions Per Cycle).

Instruction-level parallelism (ILP):

Automatic exploitation of parallelism in sequential code.

Techniques:

  • Pipeline
  • Superscalar
  • Out-of-order execution
  • Branch prediction
  • Register renaming

5. Memory Hierarchy

Memory pyramid:

Level     | Size           | Latency   | Cost
Registers | A few bytes    | < 1 ns    | Very high
L1 Cache  | 32-64 KB       | 1-2 ns    | High
L2 Cache  | 256 KB - 1 MB  | 5-10 ns   | Medium
L3 Cache  | 4-32 MB        | 20-40 ns  | Low
RAM       | 4-64 GB        | 50-100 ns | Very low
SSD       | 256 GB - 2 TB  | 0.1 ms    | Minimal
HDD       | 1-10 TB        | 10 ms     | Minimal

Locality principle:

  • Temporal: recently accessed data will likely be accessed again
  • Spatial: nearby data will likely be accessed together
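
A small C sketch of spatial locality in practice (the matrix size is arbitrary): iterating a 2D array row by row follows the memory layout, while traversing the same data column first jumps across memory and misses the cache far more often.

#define N 1024
static double m[N][N];

double sum_row_major(void)       /* cache-friendly: follows the memory layout */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

double sum_col_major(void)       /* cache-unfriendly: stride of N * sizeof(double) */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}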

Cache memory:

Fast memory between processor and RAM.

Organization:

Direct-mapped: each RAM address has only one possible location in cache.

  • Simple but frequent conflicts

Set-associative: each address has N possible locations.

  • Good compromise (2-way, 4-way, 8-way common)

Fully-associative: address can go anywhere.

  • Flexible but complex and expensive
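
To make the mapping concrete, a minimal sketch of how an address is split (the parameters are made-up values for a 32 KB, 8-way cache: 64-byte lines and 64 sets):

#include <stdint.h>

#define LINE_SIZE 64u   /* bytes per cache line (assumption) */
#define NUM_SETS  64u   /* number of sets (assumption)       */

typedef struct { uint64_t tag; uint32_t set; uint32_t offset; } cache_addr;

cache_addr split_address(uint64_t addr)
{
    cache_addr a;
    a.offset = addr % LINE_SIZE;              /* byte within the line            */
    a.set    = (addr / LINE_SIZE) % NUM_SETS; /* which set the line must go to   */
    a.tag    = addr / (LINE_SIZE * NUM_SETS); /* identifies the line in that set */
    return a;
}

With a direct-mapped cache there is a single way per set; with a fully-associative cache there is a single set and the lookup is done on tags alone.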

Replacement policies:

When the cache is full, which line to evict?

  • LRU (Least Recently Used): least recently used
  • FIFO: oldest
  • Random: random
  • LFU (Least Frequently Used): least frequently used
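
A minimal sketch of LRU victim selection within one set, assuming each way stores a "last used" timestamp (real hardware usually uses cheaper approximations such as pseudo-LRU):

#define WAYS 4
typedef struct { int valid; unsigned long last_used; } way_state;

int choose_victim(const way_state set[WAYS])
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)       /* prefer an invalid way, else the oldest one */
        if (!set[w].valid ||
            (set[victim].valid && set[w].last_used < set[victim].last_used))
            victim = w;
    return victim;
}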

Write policies:

Write-through:

  • Simultaneous write to cache + RAM
  • Guaranteed coherence
  • Slow

Write-back:

  • Write only to cache
  • RAM updated upon eviction
  • Fast but complex
  • "Dirty" bit to track modifications
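
A minimal sketch of the write-back rule (illustrative fields only; the write-back helper is hypothetical): a store only marks the line dirty, and memory is updated when a dirty line gets evicted.

#include <stdint.h>

typedef struct {
    int      valid;
    int      dirty;          /* set on write, cleared after write-back */
    uint64_t tag;
    uint8_t  data[64];
} cache_line;

void on_evict(cache_line *line)
{
    if (line->valid && line->dirty) {
        /* write_back_to_memory(line);  -- hypothetical helper */
        line->dirty = 0;
    }
    line->valid = 0;
}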

Cache coherence:

In a multiprocessor system, how to ensure all caches see the same data?

MESI and MOESI protocols track each cache line in one of several states (Modified, Exclusive, Shared, Invalid; MOESI adds an Owned state).

6. Hardware Security and Attacks

Hardware vulnerabilities:

Hardware is not infallible and can be exploited.

Side-channel attacks:

Exploitation of indirect information (timing, power consumption, electromagnetic emissions).

Timing attack:

Measuring execution time to deduce secret information.

Example: cache timing.

If data is in cache, access is fast.
If not, access is slow.

By measuring time, an attacker can deduce which data has been accessed.

Spectre attack:

Exploits speculative execution and branch prediction.

Principle:

  1. Train the branch predictor
  2. Cause speculative execution of code that accesses sensitive data
  3. Observe side effects via cache timing
  4. Recover secret data
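
For reference, the well-known bounds-check-bypass (Spectre variant 1) pattern looks roughly like this; the array names, sizes and the 4096-byte stride are illustrative choices, not the exact listing from the original paper:

#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
uint8_t array2[256 * 4096];          /* probe array: one page per possible byte value */
volatile uint8_t sink;

void victim_function(size_t x)
{
    if (x < 16) {                     /* the branch the attacker mistrains            */
        uint8_t secret = array1[x];   /* may execute speculatively even when x >= 16  */
        sink = array2[secret * 4096]; /* loads a cache line that depends on 'secret'  */
    }
}
/* Step 3: the attacker times accesses to array2 to find which line became cached. */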

Impact: information leakage between processes, bypassing memory isolation.

Meltdown attack:

Exploits the delay between permission checking and cancellation of a speculative instruction.

Allows a user process to read kernel memory.

Major impact: virtually all Intel processors released in the preceding two decades were affected, along with some other designs.

Mitigation: KPTI (Kernel Page Table Isolation) with performance cost.

Power analysis attack:

Measuring electrical power consumption to extract cryptographic keys.

Types:

  • SPA (Simple Power Analysis): direct observation
  • DPA (Differential Power Analysis): statistical analysis

Prime targets: smart cards, embedded systems.

Countermeasures:

  • Masking (randomization)
  • Power balancing
  • Constant-time operations
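
As an illustration of the last point, a constant-time comparison avoids the data-dependent early exit of a naive memcmp; this is a common sketch, not a vetted cryptographic implementation:

#include <stddef.h>
#include <stdint.h>

/* Returns 0 if the buffers are equal, non-zero otherwise. The running time does
   not depend on where the first difference occurs (no early return). */
int ct_memcmp(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];
    return diff;
}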

Electromagnetic attacks:

Eavesdropping on processor EM emissions.

Similar to power analysis attacks but without contact.

General countermeasures:

  • Secure hardware design
  • Microcode patches
  • OS modifications (KPTI)
  • Constant-time code
  • Masking and randomization
  • Secure enclaves (Intel SGX, ARM TrustZone)

7. Multi-core Processors

Evolution towards parallelism:

End of frequency scaling (the breakdown of Dennard scaling): clock rates plateaued at roughly 4-5 GHz even as transistor counts kept growing.

Solution: multiply cores.

Multi-core architectures:

Number of cores | Typical usage
2-4             | Laptop, mobile
4-8             | Desktop
8-64            | Server
64+             | High-performance computing

Symmetric (SMP):

All cores identical, sharing memory.

Asymmetric (AMP):

Cores of different types (e.g., ARM big.LITTLE).

Fast cores (big) for heavy tasks.
Efficient cores (LITTLE) for light tasks.

Optimizes performance/power ratio.

Hyperthreading (SMT):

One physical core appears as multiple logical cores.

Sharing functional units between threads.

Typical gain: 20-30% in throughput, depending on the workload.

Processor affinity:

Binding a process to a specific core.

Advantages:

  • Better cache utilization
  • Real-time predictability
  • Isolation for security
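
On Linux, affinity can be set with sched_setaffinity; a minimal sketch (the core number is arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling process */
}

int main(void)
{
    if (pin_to_core(2) != 0)
        perror("sched_setaffinity");
    /* ... the workload now runs on core 2 only ... */
    return 0;
}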

PART C: TECHNICAL ASPECTS

Lab Sessions

Lab: ARM vs x86/x64 comparison

Objective: understand architectural differences through practice.

Typical exercises:

1. Simple program in ARM and x86:

Write the same function in both assembly languages.

Example: factorial computation.

ARM:

factorial:
    PUSH {R4, LR}
    MOV R4, R0          ; Save n
    CMP R0, #1
    MOVLE R0, #1        ; Base case: return 1 for n <= 1
    BLE end_fact
    SUB R0, R0, #1
    BL factorial        ; Recursive call
    MUL R0, R4, R0      ; n * fact(n-1)
end_fact:
    POP {R4, PC}

x86-64:

factorial:
    push rbp
    mov rbp, rsp
    mov rax, 1          ; base case result: fact(n) = 1 for n <= 1
    cmp rdi, 1
    jle end_fact
    push rdi            ; save n across the recursive call
    dec rdi
    call factorial
    pop rdi
    imul rax, rdi       ; fact(n) = n * fact(n-1)
end_fact:
    pop rbp
    ret

2. Performance analysis:

Measure execution time of different implementations.

Compare:

  • Optimized C code
  • Hand-written assembly
  • Different optimizations

3. Cache usage:

Write code that uses the cache well vs poorly.

Good usage: sequential array traversal.
Bad usage: random access.

4. Vulnerability study:

Implement a simple cache timing attack.

Observe the time difference between:

  • Data in cache
  • Data not in cache
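
A minimal measurement sketch for this exercise on x86 with GCC/Clang, using the time-stamp counter and clflush (results are noisy; real attacks average many samples):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence */

static uint8_t probe[4096];

static uint64_t time_access(volatile uint8_t *p)
{
    _mm_mfence();
    uint64_t t0 = __rdtsc();
    (void)*p;                       /* the timed load */
    _mm_mfence();
    return __rdtsc() - t0;
}

int main(void)
{
    time_access(&probe[0]);                  /* warm up: bring the line into the cache */
    uint64_t hit = time_access(&probe[0]);

    _mm_clflush(&probe[0]);                  /* evict the line */
    _mm_mfence();
    uint64_t miss = time_access(&probe[0]);

    printf("cached: ~%llu cycles, uncached: ~%llu cycles\n",
           (unsigned long long)hit, (unsigned long long)miss);
    return 0;
}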

Tools and Environment

Assemblers:

# ARM
arm-none-eabi-as program.s -o program.o
arm-none-eabi-ld program.o -o program

# x86-64
nasm -f elf64 program.asm
ld program.o -o program

# Or via GCC
gcc -S program.c          # Generate assembly
gcc -c program.s          # Assemble

Disassembly:

objdump -d program        # Disassemble
objdump -S program        # With interleaved source code
gdb program               # Debugger

Performance analysis:

perf stat ./program       # Performance statistics
perf record ./program     # Record profile
perf report               # Analyze profile

# Hardware counters
perf stat -e cache-misses,cache-references ./program

Simulation:

qemu-arm program          # Emulate ARM
qemu-x86_64 program       # Emulate x86-64

Assembly Optimizations

Common techniques:

Loop unrolling:

// Original
for(i=0; i<100; i++)
    a[i] = b[i] + c[i];

// Unrolled
for(i=0; i<100; i+=4) {
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
    a[i+3] = b[i+3] + c[i+3];
}

Advantages: fewer tests, better pipeline utilization.

Vectorization (SIMD):

Process multiple data simultaneously.

ARM NEON, x86 SSE/AVX: operations on 128-512 bits.
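
A small SSE example in C (intrinsics from xmmintrin.h, x86 only): four float additions performed by a single instruction.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float r[4];

    __m128 va = _mm_loadu_ps(a);             /* load 4 floats (unaligned) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(r, _mm_add_ps(va, vb));    /* 4 additions at once */

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}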

Cache-friendly reordering:

Access data in the order of their layout in memory.

Branch elimination:

Replace if with arithmetic/logical computations.

Example:

// With branch
if(x > 0) y = a; else y = b;

; Without branch (x86), with x in ecx
mov eax, a          ; y = a by default
mov ebx, b
cmp ecx, 0
cmovle eax, ebx     ; conditional move: if x <= 0, y = b

PART D: ANALYSIS AND REFLECTION

Acquired Skills

Low-level programming:

  • Proficiency in ARM and x86 assembly
  • Understanding the link between C and assembly
  • Critical code optimization
  • Hardware-level debugging

Architecture:

  • Understanding modern pipelines
  • Knowledge of memory hierarchies
  • Cache operation
  • Hardware parallelism and multi-core

Security:

  • Identifying hardware vulnerabilities
  • Understanding side-channel attacks
  • Awareness of security/performance trade-offs
  • Hardware-level risk analysis

Practical Applications

Hardware architecture impacts all areas of computing:

Embedded systems:

  • ARM programming for microcontrollers
  • Optimization under constraints (memory, energy)
  • Critical real-time systems
  • IoT and connected objects

Security:

  • Malware analysis (reverse engineering)
  • Cryptography resistant to physical attacks
  • Secure systems (smart cards, TPM)
  • Hardware attack detection

Performance:

  • Critical code optimization (gaming, scientific computing)
  • Efficient hardware utilization (caches, SIMD)
  • Multi-core parallelization
  • Energy consumption reduction

System development:

  • Operating system kernels
  • Device drivers
  • Bootloaders and firmware
  • Hypervisors and virtualization

Links with Other Courses

Course                     | Link
Computer Architecture (S5) | Architecture fundamentals
Operating Systems (S5)     | Link with system software
C Language (S5)            | Compilation to assembly
Microcontroller (S6)       | Practical ARM programming
Hardware Security (S7)     | Security deep dive
Real-Time Systems (S8)     | Optimization and predictability

Architecture Evolution

Current trends:

Energy efficiency:

Performance per Watt is becoming critical.

ARM dominates mobile and is gaining ground in datacenters (AWS Graviton) and in general-purpose computers (Apple M1/M2).

Heterogeneous architectures:

Combination of different processors:

  • General-purpose CPUs
  • GPUs for parallel computation
  • NPU (Neural Processing Unit) for AI
  • Specialized accelerators (crypto, codecs)

RISC-V:

An open, royalty-free instruction set architecture, positioned as an alternative to ARM and x86.

Growing adoption in embedded and research.

Quantum computing:

Radically different architectures.

Still experimental but promising for certain problems.

Non-volatile memory:

Emerging technologies (MRAM, ReRAM, 3D XPoint).

Blurring the distinction between RAM and storage.

Security: An Ongoing Challenge

Lessons from recent vulnerabilities:

Spectre/Meltdown revealed:

  • Performance optimizations create security flaws
  • Software fixes are costly in terms of performance
  • Need to rethink hardware design

Secure design:

Emerging principles:

  • Security by design
  • Reinforced hardware isolation
  • Secure enclaves
  • Formal verification

Inevitable trade-offs:

Performance vs Security:

  • Disabling features (hyperthreading)
  • Costly isolation (KPTI)
  • Constant-time operations are slower

My Opinion

This course is essential for understanding how computers actually work.

Strengths:

  • Concrete view of hardware
  • Understanding of compiler optimizations
  • Awareness of security challenges
  • Educational assembly programming

Professional importance:

This knowledge is critical for:

  • Highly constrained embedded systems
  • High-performance code optimization
  • Computer security (analysis, design)
  • Understanding emerging architectures

Assembly today:

Although rarely written directly, understanding assembly allows:

  • Reading compiler-generated code
  • Optimizing critical sections
  • Debugging low-level issues
  • Analyzing binaries (reverse engineering)

ARM vs x86: an interesting battle:

ARM:

  • Dominates mobile/embedded
  • Breaking into datacenters
  • Superior energy efficiency
  • Impressive Apple M1/M2

x86:

  • Still dominant on PC/servers
  • High raw power
  • Mature ecosystem
  • Valuable backward compatibility

Hardware security: growing priority:

Hardware attacks are becoming increasingly sophisticated.

Requires:

  • Developer training
  • Adapted analysis tools
  • Risk-aware design
  • Constant technology watch

Future of architectures:

Towards more:

  • Specialization (AI accelerators, crypto)
  • Heterogeneity (generalized big.LITTLE)
  • Energy efficiency
  • Integrated security

Personal assessment: This course provided an in-depth understanding of low-level operation of modern processors. ARM and x86 assembly programming allowed seeing concretely how code executes on hardware. Awareness of hardware vulnerabilities (Spectre, Meltdown) and side-channel attacks is particularly important for designing secure systems. This knowledge is directly applicable in embedded development, performance optimization, and security analysis. The ARM/x86 comparison sheds light on architectural trade-offs and the future evolution of computing.


Course Documents

Reports and Projects

Lab Report - Hardware Architecture

Lab report on the comparison of ARM and x86/x64 architectures, assembly programming and performance analysis.

Download PDF report

Introduction to Assembly

Complete course on assembly languages, their role and their use in processor architecture.

Download

ARM vs x86 Comparison

Comparative study of ARM and x86/x64 architectures: instructions, registers, calling conventions and performance.

Download

Introduction to Hardware Attacks

Overview of hardware vulnerabilities and side-channel attacks (Spectre, Meltdown, timing attacks).

Download

Power Consumption Attacks

Detailed analysis of SPA and DPA attacks on cryptographic circuits via electrical power consumption analysis.

Download