
Analysis of the Linux Memory Management Mechanism

This article gives a brief analysis of the Linux memory management mechanism, aiming to help you quickly understand its main concepts and make effective use of some of its management methods.

Starting with version 2.6, Linux supports the NUMA (Non-Uniform Memory Access) memory management model. In a multi-CPU system, memory is divided into Nodes, one per CPU, and accessing the local Node is much faster than accessing the Node attached to another CPU.

NUMA hardware information can be viewed with numactl -H, which shows the size of each node, the CPU cores belonging to it, and the distances between CPUs and nodes. As shown below, the distance from a CPU to the remote node is more than twice the distance to its local node.


[root@localhost ~]# numactl -H

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15870 MB
node 0 free: 13780 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16384 MB
node 1 free: 15542 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

NUMA statistics can be viewed with numastat, including the numbers of memory allocation hits and misses and the numbers of local and remote allocations.

[root@localhost ~]# numastat

                           node0           node1
numa_hit              2351854045      3021228076
numa_miss               22736854         2976885
numa_foreign             2976885        22736854
interleave_hit             14144           14100
local_node            2351844760      3021220020
other_node              22746139         2984941
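
NUMA-aware applications can also allocate memory on a specific node from user space, for example through libnuma. The following is a minimal sketch (not from the article; it assumes libnuma is installed and the program is linked with -lnuma):

#include <numa.h>      /* libnuma; link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Allocate 16MB directly on node 0; local accesses avoid the
     * remote-node penalty shown by the distance table above. */
    size_t size = 16UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(size, 0);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    memset(buf, 0, size);      /* touching the pages places them on node 0 */
    numa_free(buf, size);
    return 0;
}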

Zone

Each Node is further divided into one or more Zones. There are two reasons for Zones: 1. DMA devices can only access a limited range of memory (ISA devices can only access the first 16MB); 2. the address space of 32-bit x86 systems is limited (a 32-bit address covers at most 4GB), so the HIGHMEM mechanism is needed to make use of more memory.

ZONE_DMA

The lowest memory area in the address space, used for DMA (Direct Memory Access) by ISA (Industry Standard Architecture) devices. On the x86 architecture, this Zone is limited to 16MB.

ZONE_DMA32

This Zone is used for DMA devices with a 32-bit address bus and exists only on 64-bit systems.

ZONE_NORMAL

The memory in this Zone is directly mapped by the kernel into its linear address space and can be used directly. On x86-32, the Zone covers the range 16MB~896MB. On x86-64, all memory apart from the DMA and DMA32 Zones is managed in the NORMAL Zone.

ZONE_HIGHMEM

This Zone exists only on 32-bit systems and maps the memory above 896MB through temporary page-table mappings: the mapping between an address range and the memory is established when access is needed and released when the access ends, so the address range can be reused to map other HIGHMEM pages.

Zone-related information can be viewed through /proc/zoneinfo. As shown below, this x86-64 system has two Nodes: Node 0 has three Zones (DMA, DMA32, and Normal), while Node 1 has only a Normal Zone.

[root@localhost ~]# cat /proc/zoneinfo |grep -E "zone| free|managed"

Node 0, zone DMA
  pages free 3700
        managed 3975
Node 0, zone DMA32
  pages free 291250
        managed 326897
Node 0, zone Normal
  pages free 3232166
        managed 3604347
Node 1, zone Normal
  pages free 3980110
        managed 4128056

Page

The Page is the basic unit of low-level Linux memory management, and its size is 4KB. A Page maps to a contiguous piece of physical memory, and memory allocation and release are done in units of Pages. The mapping from process virtual addresses to physical addresses is also done through the Page Table: each page-table entry records the physical address corresponding to the virtual address of one Page.
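
As a quick check from user space, the page size the kernel exposes to a process can be queried with sysconf(); a minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Query the page size the kernel exposes to user space
     * (typically 4096 bytes on x86). */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("Page size: %ld bytes\n", page_size);
    return 0;
}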

TLB

Every memory access needs to find the Page corresponding to the address, and this information is recorded in the page table. Since every access to a memory address must first consult the page table, the page table is the most frequently accessed data structure.

To speed up page-table lookups, the TLB (Translation Lookaside Buffer) mechanism was introduced: it caches page-table entries inside the CPU. TLB misses in the L1/L2 caches are therefore an important item in CPU performance statistics. On a large-memory system the page table itself becomes huge: 256GB of memory contains 256GB/4KB = 67,108,864 page-table entries.

If each entry occupies 16 bytes, the page tables require 1GB, which clearly cannot fit entirely in the CPU cache. If memory accesses are spread over a wide range, TLB misses become frequent and increase access latency.

Hugepages

To reduce the probability of TLB misses, Linux introduced the Hugepages mechanism, which allows the Page size to be set to 2MB or 1GB. With 2MB hugepages, the same 256GB of memory needs only 256GB/2MB = 131,072 page-table entries, which occupy only 2MB, so the hugepage page table can be cached entirely in the CPU cache.

Running sysctl -w vm.nr_hugepages=1024 sets the number of hugepages to 1024, a total of 2GB at 2MB per hugepage. Note that this reserves 2MB memory blocks from the system and keeps them; they can no longer be used for normal memory requests. If the system has been running for a while and memory is heavily fragmented, reserving hugepages may fail.

The commands to configure and mount hugepages are shown below. After mounting, an application uses mmap to map a file under the mount path in order to use the hugepages.

sysctl -w vm.nr_hugepages=1024
mkdir -p /mnt/hugepages
mount -t hugetlbfs hugetlbfs /mnt/hugepages
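
As a rough illustration of the mmap step, a program could map a file under the hugetlbfs mount like this (a minimal sketch; the file name under /mnt/hugepages is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SIZE (2UL * 1024 * 1024)   /* one 2MB hugepage */

int main(void)
{
    /* The file name under the hugetlbfs mount is arbitrary (hypothetical here). */
    int fd = open("/mnt/hugepages/example", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Mapping a hugetlbfs file gives memory backed by hugepages. */
    void *addr = mmap(NULL, HUGEPAGE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    memset(addr, 0, HUGEPAGE_SIZE);   /* touch the memory */

    munmap(addr, HUGEPAGE_SIZE);
    close(fd);
    return 0;
}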

Buddy System

The Linux Buddy System is designed to address the memory fragmentation caused by allocating memory in units of Pages: over time the system runs short of contiguous Pages, and requests that need physically contiguous Pages can no longer be satisfied.

The principle is very simple: contiguous Pages are grouped into Blocks for allocation, and Blocks are organized into 11 lists by power-of-two size, corresponding to 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous Pages. When the Buddy System is asked for memory, it finds the most suitable Block based on the requested size.
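
Inside the kernel, the Buddy System is reached through the page allocator. The following minimal kernel-module sketch (an illustration, not from the article) requests and releases a block of 2^2 = 4 contiguous pages:

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

static int __init buddy_demo_init(void)
{
    unsigned int order = 2;                      /* 2^2 = 4 contiguous pages */
    struct page *page = alloc_pages(GFP_KERNEL, order);

    if (!page)
        return -ENOMEM;                          /* no free block of that order */

    memset(page_address(page), 0, PAGE_SIZE << order);   /* touch the block */
    __free_pages(page, order);                   /* give it back to the Buddy System */
    return 0;
}

static void __exit buddy_demo_exit(void)
{
}

module_init(buddy_demo_init);
module_exit(buddy_demo_exit);
MODULE_LICENSE("GPL");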

The following shows the basic Buddy System information for each Zone. The last 11 columns are the numbers of available Blocks in the 11 Block lists.

[root@localhost ~]# cat /proc/buddyinfo

Node 0, zone      DMA      0      0      1      0      1      1      1      0      0      1      3
Node 0, zone    DMA32    102     79    179    229    230    166    251    168    107     78    169
Node 0, zone   Normal   1328    900   1985   1920   2261   1388    798    972    539    324   2578
Node 1, zone   Normal    466   1476   2133   7715   6026   4737   2883   1532    778    490   2760

Slab

The Buddy System deals with large allocations, but most requests need very little memory, for example the common data structures of a few hundred bytes; allocating a whole Page for each of these would be very wasteful. To satisfy such small and irregular memory allocation requests, Linux designed the Slab allocator.

The principle is simply to create a memcache for a specific data structure, obtain Pages from the Buddy System, and divide each Page into multiple Objects according to the size of the data structure. When a data structure is requested from the memcache, the user is handed one Object.
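
The memcache API can be used from kernel code roughly as follows; this is a minimal sketch, and the structure and cache names are made up for illustration:

#include <linux/module.h>
#include <linux/slab.h>

/* A hypothetical small data structure we want to allocate frequently. */
struct demo_item {
    int id;
    char name[32];
};

static struct kmem_cache *demo_cache;

static int __init slab_demo_init(void)
{
    struct demo_item *item;

    /* Create a dedicated memcache; each Object is sizeof(struct demo_item) bytes. */
    demo_cache = kmem_cache_create("demo_item_cache",
                                   sizeof(struct demo_item), 0,
                                   SLAB_HWCACHE_ALIGN, NULL);
    if (!demo_cache)
        return -ENOMEM;

    /* Take one Object from the cache and return it. */
    item = kmem_cache_alloc(demo_cache, GFP_KERNEL);
    if (item) {
        item->id = 1;
        kmem_cache_free(demo_cache, item);
    }
    return 0;
}

static void __exit slab_demo_exit(void)
{
    kmem_cache_destroy(demo_cache);
}

module_init(slab_demo_init);
module_exit(slab_demo_exit);
MODULE_LICENSE("GPL");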

The following shows how to view slab information in Linux:

[root@localhost ~]# cat /proc/slabinfo

slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fat_inode_cache 90 90 720 45 8 : tunables 0 0 0 : slabdata 2 2 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu 0 0 16576 1 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_mmu_page_header 0 0 168 48 2 : tunables 0 0 0 : slabdata 0 0 0
ext4_groupinfo_4k 4440 4440 136 30 1 : tunables 0 0 0 : slabdata 148 148 0
ext4_inode_cache 63816 65100 1032 31 8 : tunables 0 0 0 : slabdata 2100 2100 0
ext4_xattr 1012 1012 88 46 1 : tunables 0 0 0 : slabdata 22 22 0
ext4_free_data 16896 17600 64 64 1 : tunables 0 0 0 : slabdata 275 275 0

Usually we use the slabtop command to view the sorted slab information:

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
352014 352014 100% 0.10K 9026 39 36104K buffer_head
 93492 93435 99% 0.19K 2226 42 17808K dentry
 65100 63816 98% 1.01K 2100 31 67200K ext4_inode_cache
 48128 47638 98% 0.06K 752 64 3008K kmalloc-64
 47090 43684 92% 0.05K 554 85 2216K shared_policy_node
 44892 44892 100% 0.11K 1247 36 4988K sysfs_dir_cache
 43624 43177 98% 0.07K 779 56 3116K Acpi-ParseExt
 43146 42842 99% 0.04K 423 102 1692K ext4_extent_status

kmalloc

Just as glibc provides malloc(), the kernel provides kmalloc() for allocating memory of arbitrary size. Similarly, if arbitrary-size allocations were carved directly out of Pages, the Pages would suffer from internal fragmentation.

To address this internal fragmentation problem, Linux implements kmalloc memory allocation on top of the Slab mechanism. The principle is similar to the Buddy System: a set of power-of-two-sized Slab pools is created, and each kmalloc request is served from the best-fitting Slab.
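
A minimal kernel-module sketch of kmalloc()/kfree() usage; the 100-byte size is just an example and would be served from the kmalloc-128 pool shown below:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/string.h>

static int __init kmalloc_demo_init(void)
{
    /* A 100-byte request is rounded up and served from the kmalloc-128 Slab. */
    void *buf = kmalloc(100, GFP_KERNEL);

    if (!buf)
        return -ENOMEM;

    memset(buf, 0, 100);
    kfree(buf);
    return 0;
}

static void __exit kmalloc_demo_exit(void)
{
}

module_init(kmalloc_demo_init);
module_exit(kmalloc_demo_exit);
MODULE_LICENSE("GPL");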

The following are the Slabs for kmalloc allocation:

[root@localhost ~]# cat /proc/slabinfo

slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-8192 196 200 8192 4 8 : tunables 0 0 0 : slabdata 50 50 0
kmalloc-4096 1214 1288 4096 8 8 : tunables 0 0 0 : slabdata 161 161 0
kmalloc-2048 2861 2928 2048 16 8 : tunables 0 0 0 : slabdata 183 183 0
kmalloc-1024 7993 8320 1024 32 8 : tunables 0 0 0 : slabdata 260 260 0
kmalloc-512 6030 6144 512 32 4 : tunables 0 0 0 : slabdata 192 192 0
kmalloc-256 7813 8576 256 32 2 : tunables 0 0 0 : slabdata 268 268 0
kmalloc-192 15542 15750 192 42 2 : tunables 0 0 0 : slabdata 375 375 0
kmalloc-128 16814 16896 128 32 1 : tunables 0 0 0 : slabdata 528 528 0
kmalloc-96 17507 17934 96 42 1 : tunables 0 0 0 : slabdata 427 427 0
kmalloc-64 48590 48704 64 64 1 : tunables 0 0 0 : slabdata 761 761 0
kmalloc-32 7296 7296 32 128 1 : tunables 0 0 0 : slabdata 57 57 0
kmalloc-16 14336 14336 16 256 1 : tunables 0 0 0 : slabdata 56 56 0
kmalloc-8 21504 21504 8 512 1 : tunables 0 0 0 : slabdata 42 42 0

Kernel parameters

Linux provides a number of memory-management kernel parameters, which can be viewed in the /proc/sys/vm directory or via sysctl -a | grep vm:

[root@localhost vm]# sysctl -a |grep vm

vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.drop_caches = 1
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 1024000
vm.min_slab_ratio = 1
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 60
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

vm.drop_caches

vm.drop_caches is the most commonly used of these parameters, because Linux's Page cache mechanism causes a large amount of memory to be used for file system caching, including both data and metadata (dentry, inode) caches. When memory is insufficient, this parameter can be used to quickly release the file system cache:

To free pagecache:

    echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes):

    echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache:

    echo 3 > /proc/sys/vm/drop_caches

vm.min_free_kbytes

vm.min_free_kbytes determines the free-memory threshold below which the memory reclaim mechanism starts (reclaiming the file system cache mentioned above and the reclaimable Slabs mentioned below). The default value is rather small; on a system with plenty of memory, setting it to a larger value (such as 1GB) triggers reclaim automatically before memory becomes really scarce. It should not be set too large, however, or applications may frequently be killed by the OOM killer.

sysctl -w vm.min_free_kbytes=1024000

vm.min_slab_ratio

vm.min_slab_ratio determines what percentage of a Zone the reclaimable Slab space must reach before Slab reclaim is performed. The default is 5%. In the author's experiments, however, Slab reclaim is not triggered while memory is plentiful; it only happens when free memory falls to the min_free_kbytes watermark described above. The minimum value is 1%:

sysctl -w vm.min_slab_ratio=1

Conclusion: This article has briefly described the Linux memory management mechanism and several commonly used memory-management kernel parameters. We hope the concepts are now clear. If you have any questions, please leave a comment in the comment box below and we will get back to you as soon as possible. Happy learning.
