Computer Hardware Architecture - S5

Year: 2022-2023 (Semester 5)
Credits: 3 ECTS
Type: Computer Science / System Architecture


PART A: GENERAL OVERVIEW

Course Objectives

The "Computer Hardware Architecture" course provides an in-depth understanding of the organization and operation of computer systems at the hardware level. It covers the entire memory hierarchy, processor architecture, and performance optimization mechanisms. This course is fundamental for understanding how hardware influences software performance and for designing efficient embedded systems.


Targeted Skills

  • Master processor architecture concepts (RISC, pipeline, functional units)
  • Understand the memory hierarchy (physical memory, virtual memory, caches)
  • Analyze computer system performance
  • Program in MIPS assembly for low-level operations
  • Optimize code based on the hardware architecture
  • Understand paging and memory management mechanisms

Organization

  • Course hours: 30h (Lectures: 18h, Tutorials: 12h)
  • Assessment: Written exam + Graded tutorials + Assembly lab
  • Semester: 5 (2022-2023)
  • Prerequisites: Sequential logic, digital systems

PART B: EXPERIENCE, CONTEXT AND PURPOSE

Course Content

1. General Introduction to Architectures

Fundamental concepts:

  • Von Neumann vs Harvard architecture
  • Von Neumann model: single memory for data and instructions
  • Harvard architecture: physical separation of instructions/data
  • System bus: addresses, data, control
  • Execution cycle: Fetch-Decode-Execute

Historical evolution:

  • From early computers to modern architectures
  • Moore's Law and its current limitations
  • Transition from single-processor to multi-processor
  • RISC (Reduced Instruction Set Computer) vs CISC (Complex) architectures

Performance measurement:

  • CPI (Cycles Per Instruction)
  • MIPS (Millions of Instructions Per Second)
  • CPU execution time
  • Benchmarking and profiling

Course materials: General introduction to architectures (PDF)

2. Physical Memory

Von Neumann Architecture

Figure: Von Neumann Architecture - Classic model with shared buses

Memory organization:

  • Memory hierarchy: registers → cache → RAM → disk
  • Memory technologies: SRAM, DRAM, ROM, Flash
  • Access time and bandwidth
  • Locality principle (temporal and spatial)

Memory addressing:

  • Linear address space
  • Word-addressable vs byte-addressable
  • Memory alignment and padding
  • Endianness (big-endian vs little-endian)

Data organization:

32-bit memory organization example:
Address    |  Content (hex)
-----------|-----------------
0x00000000 |  0xAABBCCDD
0x00000004 |  0x11223344
0x00000008 |  0xFFFF0000

Course materials: Physical memory (PDF)

3. Virtual Memory

Virtualization concepts:

  • Separation of virtual address / physical address
  • Per-process address space
  • Memory protection and isolation
  • Memory sharing between processes

Paging mechanism:

  • Virtual pages and physical frames
  • Typical page size: 4 KB, 2 MB (large pages)
  • Page table
  • TLB (Translation Lookaside Buffer): cache for address translations

Translation formulas:

Virtual address = (Virtual page number << offset_bits) | Offset
Physical address = (Physical frame number << offset_bits) | Offset
(the page/frame number and the offset are concatenated, not arithmetically added)

Example with 4 KB pages (2^12 bytes):
Virtual address: 32 bits
  - 20 bits: page number (2^20 pages)
  - 12 bits: offset within the page (4096 bytes)

Page fault handling:

  • Page fault
  • Replacement algorithms: LRU, FIFO, Clock
  • Swapping and demand paging
  • Working set and thrashing

Multi-level page tables:

  • 2-level page table (x86)
  • 3/4-level page table (x86-64)
  • Reducing memory space for tables
  • Inverted page table

Course materials: Virtual memory (PDF)

4. Cache Memory

Cache principle:

  • Cache located between CPU and RAM
  • Exploits spatial and temporal locality
  • Reduces average access time
  • Hierarchy: L1 (fastest), L2, L3

Cache organization:

Direct-mapped cache:

Memory address decomposed into:
  Tag | Index | Offset

Example: 16 KB cache, 64-byte lines
  Offset: 6 bits (64 = 2^6)
  Index: 8 bits (256 lines)
  Tag: 18 bits (for 32-bit address)

Set-associative cache:

  • N-way set-associative (2-way, 4-way, 8-way)
  • Compromise between direct-mapped and fully associative
  • Replacement policy: LRU, Random, FIFO

Performance formulas:

Average access time = Hit_time + Miss_rate x Miss_penalty

Hit rate = Number of hits / Total number of accesses
Miss rate = 1 - Hit rate

Example:
  Hit time = 1 cycle
  Miss penalty = 100 cycles
  Miss rate = 2%

  Average time = 1 + 0.02 x 100 = 3 cycles

Types of misses:

  • Compulsory miss (cold miss): first access
  • Capacity miss: cache too small
  • Conflict miss: collision in direct-mapped

Cache coherence:

  • Multiprocessor challenge
  • MESI, MOESI protocols
  • Write-through vs write-back
  • Invalidation vs update

Course materials: Caches (PDF)

5. Processor Architecture

RISC Processor (MIPS):

  • Reduced and regular instruction set
  • Fixed-size instructions (32 bits)
  • Load/Store architecture
  • Efficient pipeline

MIPS Registers:

$zero ($0): always 0
$at ($1): reserved for assembler
$v0-$v1 ($2-$3): return values
$a0-$a3 ($4-$7): function arguments
$t0-$t9 ($8-$15, $24-$25): temporaries
$s0-$s7 ($16-$23): saved registers
$k0-$k1 ($26-$27): reserved for OS
$gp ($28): global pointer
$sp ($29): stack pointer
$fp ($30): frame pointer
$ra ($31): return address

Instruction formats:

R-format (Register): arithmetic/logic operations

| op (6) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6) |
Example: add $13, $11, $12

I-format (Immediate): load/store, branches, constants

| op (6) | rs (5) | rt (5) | immediate (16) |
Example: addi $10, $zero, 53

J-format (Jump): unconditional jumps

| op (6) | address (26) |
Example: j add_32bits

Processor pipeline:

  • IF (Instruction Fetch): load instruction
  • ID (Instruction Decode): decode and read registers
  • EX (Execute): ALU execution
  • MEM (Memory): memory access (load/store)
  • WB (Write Back): write result to register

Pipeline hazards:

  • Data hazards: RAW (Read After Write), WAR, WAW
  • Control hazards: branches and jumps
  • Structural hazards: resource conflicts

Hazard solutions:

  • Forwarding (bypassing)
  • Stalling (pipeline bubbles)
  • Branch prediction
  • Delayed branch

Course materials: Processor architecture (PDF)


PART C: TECHNICAL ASPECTS

This section presents the technical elements learned through lab sessions and practical exercises.

MIPS Assembly Programming

Lab 1: 32-bit Addition and Time Conversion

Assembly code written:

        .data
var:    .word 0xFFFF0000         # 32-bit test variable

        .text
main:
# Variable initialization ($1 is $at, reserved for the assembler, so $19 holds minutes)
        addi $10, $zero, 53      # Seconds = 53
        addi $19, $zero, 27      # Minutes = 27
        addi $2, $zero, 3        # Hours = 3

# Time conversion to total seconds (falls through to add_32bits afterwards)
conv_secondes:
        addi $4, $zero, 3600     # $4 = 3600 (seconds/hour)
        addi $5, $zero, 60       # $5 = 60 (seconds/minute)
        mul  $6, $2, $4          # $6 = hours x 3600
        mul  $7, $19, $5         # $7 = minutes x 60
        add  $8, $7, $6          # $8 = (hours x 3600) + (minutes x 60)
        add  $9, $8, $10         # $9 = total + seconds

# 32-bit addition with carry-out detection
add_32bits:
        lw   $11, var            # Load variable a
        lw   $12, var            # Load variable b (same test value)

        addu $13, $12, $11       # c = a + b (unsigned addition, no trap on overflow)
        and  $14, $11, $12       # e = a AND b (carry generate)
        xor  $15, $11, $12       # x = a XOR b (sum without carry)

        not  $18, $13            # NOT c
        and  $16, $18, $15       # x AND (NOT c)
        or   $17, $16, $14       # Carry-out vector: (a AND b) OR (x AND NOT c)
        srl  $21, $17, 31        # Bit 31 = carry out of the MSB
                                 # $21 holds the carry (unsigned overflow) flag

        addi $2, $zero, 10       # Exit cleanly (MARS/SPIM syscall 10)
        syscall

Applied concepts:

  1. Arithmetic instructions:
    • addi: immediate addition (with constant)
    • add / addu: signed / unsigned addition
    • mul: multiplication
  2. Logical instructions:
    • and: bitwise AND
    • or: bitwise OR
    • xor: bitwise exclusive OR
    • not: inversion (one's complement)
  3. Memory instructions:
    • lw (load word): 32-bit load from memory
    • .word directive: 32-bit data declaration
  4. Control instructions:
    • j (jump): unconditional jump
    • Labels for addresses
  5. Overflow handling:
    • Detecting carry and overflow on addition
    • Using Boolean logic to compute the carry-out vector
    • Bit 31 of that vector is the carry out of the word (unsigned overflow)
    • Signed overflow formula: Overflow = Cout XOR Cin at the sign bit

Analysis of the Overflow Detection Algorithm

Mathematical principle:

To detect a carry out of a 32-bit addition:
  c = a + b
  e = a AND b  (positions where a=1 and b=1, guaranteed carry)
  x = a XOR b  (sum without carry)

  Carry-out vector = (x AND NOT(c)) OR e

  If bit 31 of the carry-out vector = 1 → carry out of the word
  (unsigned overflow); signed overflow is Cout XOR Cin at bit 31

Numerical example:

a = 0xFFFF0000 (a negative number in two's complement)
b = 0xFFFF0000
c = 0xFFFE0000 (32-bit result; the true sum 0x1FFFE0000 is truncated)

Bit-by-bit analysis of the MSB (bit 31):
  a[31] = 1, b[31] = 1
  c[31] = 1

  Carry out of bit 31 = 1 → unsigned overflow (carry)
  Negative + negative gives a negative result here, so the sign is
  correct and there is no signed overflow

Performance Calculation Exercises

Execution Time Calculation

Fundamental formula:

CPU_Time = Number_of_instructions x CPI x Clock_period

Where:
  CPI = Cycles Per Instruction
  Period = 1 / Frequency

Exercise example:

Program: 1 million instructions
Average CPI: 2.5
CPU frequency: 2 GHz

CPU_Time = 10^6 x 2.5 x (1 / 2x10^9)
         = 2.5x10^6 / 2x10^9
         = 1.25 ms

Cache Performance

Typical exercise:

L1 Cache: 32 KB, 4-way associative, 64-byte line
  Hit time: 1 cycle
  Hit rate: 95%

RAM:
  Latency: 100 cycles

Average access time calculation (here a miss is counted as the full 100-cycle
RAM latency, so the hit time is not added on top of the miss penalty):
  T_avg = 0.95 x 1 + 0.05 x 100
        = 0.95 + 5
        = 5.95 cycles

Improvement compared to no cache (every access paying the RAM latency):
  Speedup = 100 / 5.95 ≈ 16.8x

Paging Calculation

Address translation exercise:

Configuration:
  - 4 KB pages (4096 bytes = 2^12)
  - 32-bit virtual addresses
  - 30-bit physical address

Virtual address: 0x12345678

Decomposition:
  Bits 31-12: virtual page number = 0x12345
  Bits 11-0:  offset = 0x678

If page table indicates: VPN 0x12345 → PFN 0x00ABC

Physical address:
  = (0x00ABC << 12) | 0x678
  = 0x00ABC678

Code Optimization

Cache Exploitation

Poor usage (frequent cache misses):

// Column-by-column traversal (poor spatial locality)
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += matrix[i][j];  // Non-contiguous accesses

Good usage (cache friendly):

// Row-by-row traversal (good spatial locality)
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += matrix[i][j];  // Contiguous accesses

Explanation:

  • Matrices in C are stored row by row (row-major order)
  • Cache lines: 64 bytes = 16 integers (4 bytes each)
  • Row traversal → all elements of a row loaded together
  • Column traversal → each access loads a new cache line

Blocking/Tiling for Matrices

Blocking technique:

// Matrix multiplication with blocking
#define BLOCK_SIZE 32
#define MIN(a, b) ((a) < (b) ? (a) : (b))

for (int ii = 0; ii < N; ii += BLOCK_SIZE)
    for (int jj = 0; jj < N; jj += BLOCK_SIZE)
        for (int kk = 0; kk < N; kk += BLOCK_SIZE)
            // Block of BLOCK_SIZE x BLOCK_SIZE
            for (int i = ii; i < MIN(ii + BLOCK_SIZE, N); i++)
                for (int j = jj; j < MIN(jj + BLOCK_SIZE, N); j++)
                    for (int k = kk; k < MIN(kk + BLOCK_SIZE, N); k++)
                        C[i][j] += A[i][k] * B[k][j];

Advantage: reuse of data in L1 cache before eviction


PART D: ANALYSIS AND REFLECTION

Knowledge and Skills Mobilized

  • System architecture: deep understanding of hardware operation
  • Low-level programming: mastery of MIPS assembly
  • Optimization: ability to write efficient code accounting for hardware
  • Performance analysis: computing and optimizing system performance
  • Memory management: understanding of paging and caching mechanisms

Self-Assessment

This course was fundamental for my understanding of computer systems. I particularly appreciated:

Strengths:

  • Theory/practice connection: MIPS assembly labs made abstract concepts tangible
  • Hardware understanding: clear vision of what happens "under the hood"
  • Optimization: ability to understand why certain code is faster
  • Debugging: better understanding of errors (segfault, cache miss, etc.)

Difficulties encountered:

  • MIPS assembly: syntax and conventions different from C
  • Pipeline and hazards: abstract concepts requiring visualization
  • Cache calculations: numerous formulas and special cases

Practical applications:

  • Code optimization for embedded systems
  • Understanding real-time constraints
  • Choosing algorithms suited to the architecture

My Opinion

This course is essential for any engineer in computer science or embedded systems. Even in the era of high-level languages, understanding hardware architecture enables:

  1. Better performance: code optimized for cache and pipeline
  2. Effective debugging: understanding memory errors and slowdowns
  3. Technical decisions: selecting the right processor for an application
  4. Embedded systems: designing systems with strict constraints

Connections with other courses:

  • Operating Systems (S5): virtual memory management, scheduling
  • Microcontroller (S6): practical application on STM32
  • Real-Time Systems (S8): timing constraints related to hardware
  • Embedded Systems (S7): optimization for limited resources

Evolution of architectures:

Today, challenges have evolved:

  • Multiprocessor: parallelism, complex cache coherence
  • Heterogeneous: CPU + GPU + accelerators (TPU, NPU)
  • Non-volatile memory: new technologies (3D XPoint, ReRAM)
  • Security: Spectre, Meltdown → performance/security trade-offs

Moore's Law is slowing down, but innovation continues:

  • 3D architecture (stacking)
  • Near-memory computing (in-memory computing)
  • Neuromorphic architectures
  • Quantum computing (future)

Recommendations for success:

  1. Practice assembly: write MIPS code, analyze C disassembly
  2. Visualize: draw pipelines, caches, page tables
  3. Calculate: do and redo performance exercises
  4. Read specifications: processor datasheets (ARM, x86)
  5. Profile: use tools (perf, valgrind) to observe cache behavior

Professional applications:

This knowledge is used in:

  • Compiler development: machine-specific optimizations
  • Operating systems: schedulers, memory managers
  • Video games: advanced optimizations for high FPS
  • HPC (High Performance Computing): supercomputers
  • IoT: resource-constrained microcontrollers
  • Automotive: safety-critical real-time ECUs
  • Security: hardware vulnerability analysis

In conclusion, this course provides the essential foundations for understanding how software interacts with hardware. It is an investment that pays off throughout one's career, as the fundamental principles (memory hierarchy, pipeline, cache) remain valid even as technologies evolve.


Course Documents

Here are the course materials in PDF format to deepen your understanding of computer hardware architecture:

Introduction to Architectures

Overview of computer architectures, historical evolution and fundamental concepts.

Download PDF

Physical Memory

Physical memory organization, memory types (RAM, ROM, Flash) and memory hierarchy.

Download PDF

Virtual Memory

Virtual memory management, paging, segmentation and address translation.

Download PDF

Cache Memory

How caches work, replacement policies, cache coherence and performance optimization.

Download PDF

Processor

Processor architecture, pipeline, instruction-level parallelism and hardware optimizations.

Download PDF