mm patches #1

Rasenkai · 2023-09-19T16:03:11Z

No description provided.

What watermark boosting does is preemptively fire up kswapd to free memory when there hasn't been an allocation failure. It does this by increasing kswapd's high watermark goal and then firing up kswapd. The reason why this causes freezes is because, with the increased high watermark goal, kswapd will steal memory from processes that need it in order to make forward progress. These processes will, in turn, try to allocate memory again, which will cause kswapd to steal necessary pages from those processes again, in a positive feedback loop known as page thrashing. When page thrashing occurs, your system is essentially livelocked until the necessary forward progress can be made to stop processes from trying to continuously allocate memory and trigger kswapd to steal it back. This problem already occurs with kswapd *without* watermark boosting, but it's usually only encountered on machines with a small amount of memory and/or a slow CPU. Watermark boosting just makes the existing problem worse enough to notice on higher spec'd machines. Disable watermark boosting by default since it's a total dumpster fire. I can't imagine why anyone would want to explicitly enable it, but the option is there in case someone does. Signed-off-by: Sultan Alsawaf <[email protected]>

Keeping kswapd running when all the failed allocations that invoked it are satisfied incurs a high overhead due to unnecessary page eviction and writeback, as well as spurious VM pressure events to various registered shrinkers. When kswapd doesn't need to work to make an allocation succeed anymore, stop it prematurely to save resources. Signed-off-by: Sultan Alsawaf <[email protected]>

The page allocator wakes all kswapds in an allocation context's allowed nodemask in the slow path, so it doesn't make sense to have the kswapd- waiter count per each NUMA node. Instead, it should be a global counter to stop all kswapds when there are no failed allocation requests. Signed-off-by: Sultan Alsawaf <[email protected]>

Throttled direct reclaimers will wake up kswapd and wait for kswapd to satisfy their page allocation request, even when the failed allocation lacks the __GFP_KSWAPD_RECLAIM flag in its gfp mask. As a result, kswapd may think that there are no waiters and thus exit prematurely, causing throttled direct reclaimers lacking __GFP_KSWAPD_RECLAIM to stall on waiting for kswapd to wake them up. Incrementing the kswapd_waiters counter when such direct reclaimers become throttled fixes the problem. Signed-off-by: Sultan Alsawaf <[email protected]>

On-demand compaction works fine assuming that you don't have a need to spam the page allocator nonstop for large order page allocations. Signed-off-by: Sultan Alsawaf <[email protected]>

There is noticeable scheduling latency and heavy zone lock contention stemming from rmqueue_bulk's single hold of the zone lock while doing its work, as seen with the preemptoff tracer. There's no actual need for rmqueue_bulk() to hold the zone lock the entire time; it only does so for supposed efficiency. As such, we can relax the zone lock and even reschedule when IRQs are enabled in order to keep the scheduling delays and zone lock contention at bay. Forward progress is still guaranteed, as the zone lock can only be relaxed after page removal. With this change, rmqueue_bulk() no longer appears as a serious offender in the preemptoff tracer, and system latency is noticeably improved. Signed-off-by: Sultan Alsawaf <[email protected]>

Allocating pages with __get_free_page is slower than going through the slab allocator to grab free pages out from a pool. These are the results from running the code at the bottom of this message: [ 1.278602] speedtest: __get_free_page: 9 us [ 1.278606] speedtest: kmalloc: 4 us [ 1.278609] speedtest: kmem_cache_alloc: 4 us [ 1.278611] speedtest: vmalloc: 13 us kmalloc and kmem_cache_alloc (which is what kmalloc uses for common sizes behind the scenes) are the fastest choices. Use kmalloc to speed up sg list allocation. This is the code used to produce the above measurements: static int speedtest(void *data) { static const struct sched_param sched_max_rt_prio = { .sched_priority = MAX_RT_PRIO - 1 }; volatile s64 ctotal = 0, gtotal = 0, ktotal = 0, vtotal = 0; struct kmem_cache *page_pool; int i, j, trials = 1000; volatile ktime_t start; void *ptr[100]; sched_setscheduler_nocheck(current, SCHED_FIFO, &sched_max_rt_prio); page_pool = kmem_cache_create("pages", PAGE_SIZE, PAGE_SIZE, SLAB_PANIC, NULL); for (i = 0; i < trials; i ) { start = ktime_get(); for (j = 0; j < ARRAY_SIZE(ptr); j ) while (!(ptr[j] = kmem_cache_alloc(page_pool, GFP_KERNEL))); ctotal = ktime_us_delta(ktime_get(), start); for (j = 0; j < ARRAY_SIZE(ptr); j ) kmem_cache_free(page_pool, ptr[j]); start = ktime_get(); for (j = 0; j < ARRAY_SIZE(ptr); j ) while (!(ptr[j] = (void *)__get_free_page(GFP_KERNEL))); gtotal = ktime_us_delta(ktime_get(), start); for (j = 0; j < ARRAY_SIZE(ptr); j ) free_page((unsigned long)ptr[j]); start = ktime_get(); for (j = 0; j < ARRAY_SIZE(ptr); j ) while (!(ptr[j] = __kmalloc(PAGE_SIZE, GFP_KERNEL))); ktotal = ktime_us_delta(ktime_get(), start); for (j = 0; j < ARRAY_SIZE(ptr); j ) kfree(ptr[j]); start = ktime_get(); *ptr = vmalloc(ARRAY_SIZE(ptr) * PAGE_SIZE); vtotal = ktime_us_delta(ktime_get(), start); vfree(*ptr); } kmem_cache_destroy(page_pool); printk("%s: __get_free_page: %lld us\n", __func__, gtotal / trials); printk("%s: __kmalloc: %lld us\n", __func__, ktotal / trials); printk("%s: kmem_cache_alloc: %lld us\n", __func__, ctotal / trials); printk("%s: vmalloc: %lld us\n", __func__, vtotal / trials); complete(data); return 0; } static int __init start_test(void) { DECLARE_COMPLETION_ONSTACK(done); BUG_ON(IS_ERR(kthread_run(speedtest, &done, "malloc_test"))); wait_for_completion(&done); return 0; } late_initcall(start_test); Signed-off-by: Sultan Alsawaf <[email protected]>

The RCU read lock isn't necessary in list_lru_count_one() when the condition that requires RCU (CONFIG_MEMCG && !CONFIG_SLOB) isn't met. The highly-frequent RCU lock and unlock adds measurable overhead to the shrink_slab() path when it isn't needed. As such, we can simply omit the RCU read lock in this case to improve performance. Signed-off-by: Sultan Alsawaf <[email protected]> Signed-off-by: Kazuki Hashimoto <[email protected]>

[ Upstream commit 7962ef1 ] In 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") it only was freeing if strcmp(evsel->tp_format->system, "syscalls") returned zero, while the corresponding initialization of evsel->priv was being performed if it was _not_ zero, i.e. if the tp system wasn't 'syscalls'. Just stop looking for that and free it if evsel->priv was set, which should be equivalent. Also use the pre-existing evsel_trace__delete() function. This resolves these leaks, detected with: $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin ================================================================= ==481565==ERROR: LeakSanitizer: detected memory leaks Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8 0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf 0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 torvalds#6 0x540e8b in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3212 torvalds#7 0x540e8b in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 torvalds#8 0x540e8b in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 torvalds#9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 torvalds#10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 torvalds#11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 torvalds#12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6 0x2750f) Direct leak of 40 byte(s) in 1 object(s) allocated from: #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8 0xba097) #1 0x987966 in zalloc (/home/acme/bin/perf 0x987966) #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307 #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333 #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458 #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480 torvalds#6 0x540dd1 in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3205 torvalds#7 0x540dd1 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891 torvalds#8 0x540dd1 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156 torvalds#9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323 torvalds#10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377 torvalds#11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421 torvalds#12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537 torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6 0x2750f) SUMMARY: AddressSanitizer: 80 byte(s) leaked in 2 allocation(s). [root@quaco ~]# With this we plug all leaks with "perf trace sleep 1". Fixes: 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv") Acked-by: Ian Rogers <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Riccardo Mancini <[email protected]> Link: https://lore.kernel.org/lkml/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit ef23cb5 ] While debugging a segfault on 'perf lock contention' without an available perf.data file I noticed that it was basically calling: perf_session__delete(ERR_PTR(-1)) Resulting in: (gdb) run lock contention Starting program: /root/bin/perf lock contention [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". failed to open perf.data: No such file or directory (try 'perf record' first) Initializing perf session failed Program received signal SIGSEGV, Segmentation fault. 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 2858 if (!session->auxtrace) (gdb) p session $1 = (struct perf_session *) 0xffffffffffffffff (gdb) bt #0 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 #1 0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300 #2 0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161 #3 0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604 #4 0x0000000000501466 in run_builtin (p=0xe597a8 <commands 552>, argc=2, argv=0x7fffffffe200) at perf.c:322 #5 0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375 torvalds#6 0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419 torvalds#7 0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535 (gdb) So just set it to NULL after using PTR_ERR(session) to decode the error as perf_session__delete(NULL) is supported. The same problem was found in 'perf top' after an audit of all perf_session__new() failure handling. Fixes: 6ef81c5 ("perf session: Return error code for perf_session__new() function on failure") Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Alexey Budankov <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jeremie Galarneau <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kate Stewart <[email protected]> Cc: Mamatha Inamdar <[email protected]> Cc: Mukesh Ojha <[email protected]> Cc: Nageswara R Sastry <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ravi Bangoria <[email protected]> Cc: Shawn Landden <[email protected]> Cc: Song Liu <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tzvetomir Stoyanov <[email protected]> Link: https://lore.kernel.org/lkml/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit abaf1e0 ] While debugging a segfault on 'perf lock contention' without an available perf.data file I noticed that it was basically calling: perf_session__delete(ERR_PTR(-1)) Resulting in: (gdb) run lock contention Starting program: /root/bin/perf lock contention [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". failed to open perf.data: No such file or directory (try 'perf record' first) Initializing perf session failed Program received signal SIGSEGV, Segmentation fault. 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 2858 if (!session->auxtrace) (gdb) p session $1 = (struct perf_session *) 0xffffffffffffffff (gdb) bt #0 0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858 #1 0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300 #2 0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161 #3 0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604 #4 0x0000000000501466 in run_builtin (p=0xe597a8 <commands 552>, argc=2, argv=0x7fffffffe200) at perf.c:322 #5 0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375 torvalds#6 0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419 torvalds#7 0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535 (gdb) So just set it to NULL after using PTR_ERR(session) to decode the error as perf_session__delete(NULL) is supported. Fixes: eef4fee ("perf lock: Dynamically allocate lockhash_table") Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: K Prateek Nayak <[email protected]> Cc: Kan Liang <[email protected]> Cc: Leo Yan <[email protected]> Cc: Mamatha Inamdar <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ravi Bangoria <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Steven Rostedt (VMware) <[email protected]> Cc: Tiezhu Yang <[email protected]> Cc: Yang Jihong <[email protected]> Link: https://lore.kernel.org/lkml/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 82ba0ff ] We should not call trace_handshake_cmd_done_err() if socket lookup has failed. Also we should call trace_handshake_cmd_done_err() before releasing the file, otherwise dereferencing sock->sk can return garbage. This also reverts 7afc6d0 ("net/handshake: Fix uninitialized local variable") Unable to handle kernel paging request at virtual address dfff800000000003 KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [dfff800000000003] address between user and kernel address ranges Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 5986 Comm: syz-executor292 Not tainted 6.5.0-rc7-syzkaller-gfe4469582053 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 pstate: 80400005 (Nzcv daif PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : handshake_nl_done_doit 0x198/0x9c8 net/handshake/netlink.c:193 lr : handshake_nl_done_doit 0x180/0x9c8 sp : ffff800096e37180 x29: ffff800096e37200 x28: 1ffff00012dc6e34 x27: dfff800000000000 x26: ffff800096e373d0 x25: 0000000000000000 x24: 00000000ffffffa8 x23: ffff800096e373f0 x22: 1ffff00012dc6e38 x21: 0000000000000000 x20: ffff800096e371c0 x19: 0000000000000018 x18: 0000000000000000 x17: 0000000000000000 x16: ffff800080516cc4 x15: 0000000000000001 x14: 1fffe0001b14aa3b x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000003 x8 : 0000000000000003 x7 : ffff800080afe47c x6 : 0000000000000000 x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800080a88078 x2 : 0000000000000001 x1 : 00000000ffffffa8 x0 : 0000000000000000 Call trace: handshake_nl_done_doit 0x198/0x9c8 net/handshake/netlink.c:193 genl_family_rcv_msg_doit net/netlink/genetlink.c:970 [inline] genl_family_rcv_msg net/netlink/genetlink.c:1050 [inline] genl_rcv_msg 0x96c/0xc50 net/netlink/genetlink.c:1067 netlink_rcv_skb 0x214/0x3c4 net/netlink/af_netlink.c:2549 genl_rcv 0x38/0x50 net/netlink/genetlink.c:1078 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast 0x660/0x8d4 net/netlink/af_netlink.c:1365 netlink_sendmsg 0x834/0xb18 net/netlink/af_netlink.c:1914 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg 0x56c/0x840 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmsg 0x26c/0x33c net/socket.c:2577 __do_sys_sendmsg net/socket.c:2586 [inline] __se_sys_sendmsg net/socket.c:2584 [inline] __arm64_sys_sendmsg 0x80/0x94 net/socket.c:2584 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall 0x98/0x2b8 arch/arm64/kernel/syscall.c:51 el0_svc_common 0x130/0x23c arch/arm64/kernel/syscall.c:136 do_el0_svc 0x48/0x58 arch/arm64/kernel/syscall.c:155 el0_svc 0x58/0x16c arch/arm64/kernel/entry-common.c:678 el0t_64_sync_handler 0x84/0xfc arch/arm64/kernel/entry-common.c:696 el0t_64_sync 0x190/0x194 arch/arm64/kernel/entry.S:591 Code: 12800108 b90043e8 910062b3 d343fe68 (387b6908) Fixes: 3b3009e ("net/handshake: Create a NETLINK service for handling handshake requests") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Cc: Chuck Lever <[email protected]> Reviewed-by: Michal Kubiak <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit a96a44a ] './test_progs -t test_local_storage' reported a splat: [ 27.137569] ============================= [ 27.138122] [ BUG: Invalid wait context ] [ 27.138650] 6.5.0-03980-gd11ae1b16b0a torvalds#247 Tainted: G O [ 27.139542] ----------------------------- [ 27.140106] test_progs/1729 is trying to lock: [ 27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire 0x9/0x130 [ 27.141834] other info that might help us debug this: [ 27.142437] context-{5:5} [ 27.142856] 2 locks held by test_progs/1729: [ 27.143352] #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire 0x4/0x40 [ 27.144492] #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update 0x39e/0x8e0 [ 27.145855] stack backtrace: [ 27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G O 6.5.0-03980-gd11ae1b16b0a torvalds#247 [ 27.147550] Hardware name: QEMU Standard PC (i440FX PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 27.149127] Call Trace: [ 27.149490] <TASK> [ 27.149867] dump_stack_lvl 0x130/0x1d0 [ 27.152609] dump_stack 0x14/0x20 [ 27.153131] __lock_acquire 0x1657/0x2220 [ 27.153677] lock_acquire 0x1b8/0x510 [ 27.157908] local_lock_acquire 0x29/0x130 [ 27.159048] obj_cgroup_charge 0xf4/0x3c0 [ 27.160794] slab_pre_alloc_hook 0x28e/0x2b0 [ 27.161931] __kmem_cache_alloc_node 0x51/0x210 [ 27.163557] __kmalloc 0xaa/0x210 [ 27.164593] bpf_map_kzalloc 0xbc/0x170 [ 27.165147] bpf_selem_alloc 0x130/0x510 [ 27.166295] bpf_local_storage_update 0x5aa/0x8e0 [ 27.167042] bpf_fd_sk_storage_update_elem 0xdb/0x1a0 [ 27.169199] bpf_map_update_value 0x415/0x4f0 [ 27.169871] map_update_elem 0x413/0x550 [ 27.170330] __sys_bpf 0x5e9/0x640 [ 27.174065] __x64_sys_bpf 0x80/0x90 [ 27.174568] do_syscall_64 0x48/0xa0 [ 27.175201] entry_SYSCALL_64_after_hwframe 0x6e/0xd8 [ 27.175932] RIP: 0033:0x7effb40e41ad [ 27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8 [ 27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141 [ 27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad [ 27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002 [ 27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788 [ 27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000 [ 27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000 [ 27.184958] </TASK> It complains about acquiring a local_lock while holding a raw_spin_lock. It means it should not allocate memory while holding a raw_spin_lock since it is not safe for RT. raw_spin_lock is needed because bpf_local_storage supports tracing context. In particular for task local storage, it is easy to get a "current" task PTR_TO_BTF_ID in tracing bpf prog. However, task (and cgroup) local storage has already been moved to bpf mem allocator which can be used after raw_spin_lock. The splat is for the sk storage. For sk (and inode) storage, it has not been moved to bpf mem allocator. Using raw_spin_lock or not, kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context. However, the local storage helper requires a verifier accepted sk pointer (PTR_TO_BTF_ID), it is hypothetical if that (mean running a bpf prog in a kzalloc unsafe context and also able to hold a verifier accepted sk pointer) could happen. This patch avoids kzalloc after raw_spin_lock to silent the splat. There is an existing kzalloc before the raw_spin_lock. At that point, a kzalloc is very likely required because a lookup has just been done before. Thus, this patch always does the kzalloc before acquiring the raw_spin_lock and remove the later kzalloc usage after the raw_spin_lock. After this change, it will have a charge and then uncharge during the syscall bpf_map_update_elem() code path. This patch opts for simplicity and not continue the old optimization to save one charge and uncharge. This issue is dated back to the very first commit of bpf_sk_storage which had been refactored multiple times to create task, inode, and cgroup storage. This patch uses a Fixes tag with a more recent commit that should be easier to do backport. Fixes: b00fa38 ("bpf: Enable non-atomic allocations in local storage") Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected] Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit f4f8a78 ] The opt_num field is controlled by user mode and is not currently validated inside the kernel. An attacker can take advantage of this to trigger an OOB read and potentially leak information. BUG: KASAN: slab-out-of-bounds in nf_osf_match_one 0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88 Read of size 2 at addr ffff88804bc64272 by task poc/6431 CPU: 1 PID: 6431 Comm: poc Not tainted 6.0.0-rc4 #1 Call Trace: nf_osf_match_one 0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88 nf_osf_find 0x186/0x2f0 net/netfilter/nfnetlink_osf.c:281 nft_osf_eval 0x37f/0x590 net/netfilter/nft_osf.c:47 expr_call_ops_eval net/netfilter/nf_tables_core.c:214 nft_do_chain 0x2b0/0x1490 net/netfilter/nf_tables_core.c:264 nft_do_chain_ipv4 0x17c/0x1f0 net/netfilter/nft_chain_filter.c:23 [..] Also add validation to genre, subtype and version fields. Fixes: 11eeef4 ("netfilter: passive OS fingerprint xtables match") Reported-by: Lucas Leong <[email protected]> Signed-off-by: Wander Lairson Costa <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 9b5ba5c ] Deliver audit log from __nf_tables_dump_rules(), table dereference at the end of the table list loop might point to the list head, leading to this crash. [ 4137.407349] BUG: unable to handle page fault for address: 00000000001f3c50 [ 4137.407357] #PF: supervisor read access in kernel mode [ 4137.407359] #PF: error_code(0x0000) - not-present page [ 4137.407360] PGD 0 P4D 0 [ 4137.407363] Oops: 0000 [#1] PREEMPT SMP PTI [ 4137.407365] CPU: 4 PID: 500177 Comm: nft Not tainted 6.5.0 torvalds#277 [ 4137.407369] RIP: 0010:string 0x49/0xd0 [ 4137.407374] Code: ff 77 36 45 89 d1 31 f6 49 01 f9 66 45 85 d2 75 19 eb 1e 49 39 f8 76 02 88 07 48 83 c7 01 83 c6 01 48 83 c2 01 4c 39 cf 74 07 <0f> b6 02 84 c0 75 e2 4c 89 c2 e9 58 e5 ff ff 48 c7 c0 0e b2 ff 81 [ 4137.407377] RSP: 0018:ffff8881179737f0 EFLAGS: 00010286 [ 4137.407379] RAX: 00000000001f2c50 RBX: ffff888117973848 RCX: ffff0a00ffffff04 [ 4137.407380] RDX: 00000000001f3c50 RSI: 0000000000000000 RDI: 0000000000000000 [ 4137.407381] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff [ 4137.407383] R10: ffffffffffffffff R11: ffff88813584d200 R12: 0000000000000000 [ 4137.407384] R13: ffffffffa15cf709 R14: 0000000000000000 R15: ffffffffa15cf709 [ 4137.407385] FS: 00007fcfc18bb580(0000) GS:ffff88840e700000(0000) knlGS:0000000000000000 [ 4137.407387] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4137.407388] CR2: 00000000001f3c50 CR3: 00000001055b2001 CR4: 00000000001706e0 [ 4137.407390] Call Trace: [ 4137.407392] <TASK> [ 4137.407393] ? __die 0x1b/0x60 [ 4137.407397] ? page_fault_oops 0x6b/0xa0 [ 4137.407399] ? exc_page_fault 0x60/0x120 [ 4137.407403] ? asm_exc_page_fault 0x22/0x30 [ 4137.407408] ? string 0x49/0xd0 [ 4137.407410] vsnprintf 0x257/0x4f0 [ 4137.407414] kvasprintf 0x3e/0xb0 [ 4137.407417] kasprintf 0x3e/0x50 [ 4137.407419] nf_tables_dump_rules 0x1c0/0x360 [nf_tables] [ 4137.407439] ? __alloc_skb 0xc3/0x170 [ 4137.407442] netlink_dump 0x170/0x330 [ 4137.407447] __netlink_dump_start 0x227/0x300 [ 4137.407449] nf_tables_getrule 0x205/0x390 [nf_tables] Deliver audit log only once at the end of the rule dump reset for consistency with the set dump reset. Ensure audit reset access to table under rcu read side lock. The table list iteration holds rcu read lock side, but recent audit code dereferences table object out of the rcu read lock side. Fixes: ea078ae ("netfilter: nf_tables: Audit log rule reset") Fixes: 7e9be11 ("netfilter: nf_tables: Audit log setelem reset") Signed-off-by: Pablo Neira Ayuso <[email protected]> Acked-by: Phil Sutter <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

commit 768d612 upstream. Yikebaer reported an issue: ================================================================== BUG: KASAN: slab-use-after-free in ext4_es_insert_extent 0xc68/0xcb0 fs/ext4/extents_status.c:894 Read of size 4 at addr ffff888112ecc1a4 by task syz-executor/8438 CPU: 1 PID: 8438 Comm: syz-executor Not tainted 6.5.0-rc5 #1 Call Trace: [...] kasan_report 0xba/0xf0 mm/kasan/report.c:588 ext4_es_insert_extent 0xc68/0xcb0 fs/ext4/extents_status.c:894 ext4_map_blocks 0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0 0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate 0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] Allocated by task 8438: [...] kmem_cache_zalloc include/linux/slab.h:693 [inline] __es_alloc_extent fs/ext4/extents_status.c:469 [inline] ext4_es_insert_extent 0x672/0xcb0 fs/ext4/extents_status.c:873 ext4_map_blocks 0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0 0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate 0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] Freed by task 8438: [...] kmem_cache_free 0xec/0x490 mm/slub.c:3823 ext4_es_try_to_merge_right fs/ext4/extents_status.c:593 [inline] __es_insert_extent 0x9f4/0x1440 fs/ext4/extents_status.c:802 ext4_es_insert_extent 0x2ca/0xcb0 fs/ext4/extents_status.c:882 ext4_map_blocks 0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0 0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate 0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] ================================================================== The flow of issue triggering is as follows: 1. remove es raw es es removed es1 |-------------------| -> |----|.......|------| 2. insert es es insert es1 merge with es es1 merge with es and free es1 |----|.......|------| -> |------------|------| -> |-------------------| es merges with newes, then merges with es1, frees es1, then determines if es1->es_len is 0 and triggers a UAF. The code flow is as follows: ext4_es_insert_extent es1 = __es_alloc_extent(true); es2 = __es_alloc_extent(true); __es_remove_extent(inode, lblk, end, NULL, es1) __es_insert_extent(inode, &newes, es1) ---> insert es1 to es tree __es_insert_extent(inode, &newes, es2) ext4_es_try_to_merge_right ext4_es_free_extent(inode, es1) ---> es1 is freed if (es1 && !es1->es_len) // Trigger UAF by determining if es1 is used. We determine whether es1 or es2 is used immediately after calling __es_remove_extent() or __es_insert_extent() to avoid triggering a UAF if es1 or es2 is freed. Reported-by: Yikebaer Aizezi <[email protected]> Closes: https://lore.kernel.org/lkml/CALcu4raD4h9coiyEBL4Bm0zjDwxC2CyPiTwsP3zFuhot6y9Beg@mail.gmail.com Fixes: 2a69c45 ("ext4: using nofail preallocation in ext4_es_insert_extent()") Cc: [email protected] Signed-off-by: Baokun Li <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit a3ab557 upstream. Let's flush the inode being aborted atomic operation to avoid stale dirty inode during eviction in this call stack: f2fs_mark_inode_dirty_sync 0x22/0x40 [f2fs] f2fs_abort_atomic_write 0xc4/0xf0 [f2fs] f2fs_evict_inode 0x3f/0x690 [f2fs] ? sugov_start 0x140/0x140 evict 0xc3/0x1c0 evict_inodes 0x17b/0x210 generic_shutdown_super 0x32/0x120 kill_block_super 0x21/0x50 deactivate_locked_super 0x31/0x90 cleanup_mnt 0x100/0x160 task_work_run 0x59/0x90 do_exit 0x33b/0xa50 do_group_exit 0x2d/0x80 __x64_sys_exit_group 0x14/0x20 do_syscall_64 0x3b/0x90 entry_SYSCALL_64_after_hwframe 0x63/0xcd This triggers f2fs_bug_on() in f2fs_evict_inode: f2fs_bug_on(sbi, is_inode_flag_set(inode, FI_DIRTY_INODE)); This fixes the syzbot report: loop0: detected capacity change from 0 to 131072 F2FS-fs (loop0): invalid crc value F2FS-fs (loop0): Found nat_bits in checkpoint F2FS-fs (loop0): Mounted with checkpoint version = 48b305e4 ------------[ cut here ]------------ kernel BUG at fs/f2fs/inode.c:869! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 5014 Comm: syz-executor220 Not tainted 6.4.0-syzkaller-11479-g6cd06ab12d1a #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 RIP: 0010:f2fs_evict_inode 0x172d/0x1e00 fs/f2fs/inode.c:869 Code: ff df 48 c1 ea 03 80 3c 02 00 0f 85 6a 06 00 00 8b 75 40 ba 01 00 00 00 4c 89 e7 e8 6d ce 06 00 e9 aa fc ff ff e8 63 22 e2 fd <0f> 0b e8 5c 22 e2 fd 48 c7 c0 a8 3a 18 8d 48 ba 00 00 00 00 00 fc RSP: 0018:ffffc95003a6fa00 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 RDX: ffff8880273b8000 RSI: ffffffff83a2bd0d RDI: 0000000000000007 RBP: ffff888077db91b0 R08: 0000000000000007 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff888029a3c000 R13: ffff888077db9660 R14: ffff888029a3c0b8 R15: ffff888077db9c50 FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1909bb9500 CR3: 00000000276a9500 CR4: 0000000000350ef0 Call Trace: <TASK> evict 0x2ed/0x6b0 fs/inode.c:665 dispose_list 0x117/0x1e0 fs/inode.c:698 evict_inodes 0x345/0x440 fs/inode.c:748 generic_shutdown_super 0xaf/0x480 fs/super.c:478 kill_block_super 0x64/0xb0 fs/super.c:1417 kill_f2fs_super 0x2af/0x3c0 fs/f2fs/super.c:4704 deactivate_locked_super 0x98/0x160 fs/super.c:330 deactivate_super 0xb1/0xd0 fs/super.c:361 cleanup_mnt 0x2ae/0x3d0 fs/namespace.c:1254 task_work_run 0x16f/0x270 kernel/task_work.c:179 exit_task_work include/linux/task_work.h:38 [inline] do_exit 0xa9a/0x29a0 kernel/exit.c:874 do_group_exit 0xd4/0x2a0 kernel/exit.c:1024 __do_sys_exit_group kernel/exit.c:1035 [inline] __se_sys_exit_group kernel/exit.c:1033 [inline] __x64_sys_exit_group 0x3e/0x50 kernel/exit.c:1033 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64 0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe 0x63/0xcd RIP: 0033:0x7f309be71a09 Code: Unable to access opcode bytes at 0x7f309be719df. RSP: 002b:00007fff171df518 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00007f309bef7330 RCX: 00007f309be71a09 RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001 RBP: 0000000000000001 R08: ffffffffffffffc0 R09: 00007f309bef1e40 R10: 0000000000010600 R11: 0000000000000246 R12: 00007f309bef7330 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:f2fs_evict_inode 0x172d/0x1e00 fs/f2fs/inode.c:869 Code: ff df 48 c1 ea 03 80 3c 02 00 0f 85 6a 06 00 00 8b 75 40 ba 01 00 00 00 4c 89 e7 e8 6d ce 06 00 e9 aa fc ff ff e8 63 22 e2 fd <0f> 0b e8 5c 22 e2 fd 48 c7 c0 a8 3a 18 8d 48 ba 00 00 00 00 00 fc RSP: 0018:ffffc95003a6fa00 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 RDX: ffff8880273b8000 RSI: ffffffff83a2bd0d RDI: 0000000000000007 RBP: ffff888077db91b0 R08: 0000000000000007 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff888029a3c000 R13: ffff888077db9660 R14: ffff888029a3c0b8 R15: ffff888077db9c50 FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1909bb9500 CR3: 00000000276a9500 CR4: 0000000000350ef0 Cc: <[email protected]> Reported-and-tested-by: syzbot [email protected] Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 5c13e23 upstream. ====================================================== WARNING: possible circular locking dependency detected 6.5.0-rc5-syzkaller-00353-gae545c3283dc #0 Not tainted ------------------------------------------------------ syz-executor273/5027 is trying to acquire lock: ffff888077fe1fb0 (&fi->i_sem){ . .}-{3:3}, at: f2fs_down_write fs/f2fs/f2fs.h:2133 [inline] ffff888077fe1fb0 (&fi->i_sem){ . .}-{3:3}, at: f2fs_add_inline_entry 0x300/0x6f0 fs/f2fs/inline.c:644 but task is already holding lock: ffff888077fe07c8 (&fi->i_xattr_sem){. . }-{3:3}, at: f2fs_down_read fs/f2fs/f2fs.h:2108 [inline] ffff888077fe07c8 (&fi->i_xattr_sem){. . }-{3:3}, at: f2fs_add_dentry 0x92/0x230 fs/f2fs/dir.c:783 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&fi->i_xattr_sem){. . }-{3:3}: down_read 0x9c/0x470 kernel/locking/rwsem.c:1520 f2fs_down_read fs/f2fs/f2fs.h:2108 [inline] f2fs_getxattr 0xb1e/0x12c0 fs/f2fs/xattr.c:532 __f2fs_get_acl 0x5a/0x900 fs/f2fs/acl.c:179 f2fs_acl_create fs/f2fs/acl.c:377 [inline] f2fs_init_acl 0x15c/0xb30 fs/f2fs/acl.c:420 f2fs_init_inode_metadata 0x159/0x1290 fs/f2fs/dir.c:558 f2fs_add_regular_entry 0x79e/0xb90 fs/f2fs/dir.c:740 f2fs_add_dentry 0x1de/0x230 fs/f2fs/dir.c:788 f2fs_do_add_link 0x190/0x280 fs/f2fs/dir.c:827 f2fs_add_link fs/f2fs/f2fs.h:3554 [inline] f2fs_mkdir 0x377/0x620 fs/f2fs/namei.c:781 vfs_mkdir 0x532/0x7e0 fs/namei.c:4117 do_mkdirat 0x2a9/0x330 fs/namei.c:4140 __do_sys_mkdir fs/namei.c:4160 [inline] __se_sys_mkdir fs/namei.c:4158 [inline] __x64_sys_mkdir 0xf2/0x140 fs/namei.c:4158 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64 0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe 0x63/0xcd -> #0 (&fi->i_sem){ . .}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire 0x2e3d/0x5de0 kernel/locking/lockdep.c:5144 lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire 0x1ae/0x510 kernel/locking/lockdep.c:5726 down_write 0x93/0x200 kernel/locking/rwsem.c:1573 f2fs_down_write fs/f2fs/f2fs.h:2133 [inline] f2fs_add_inline_entry 0x300/0x6f0 fs/f2fs/inline.c:644 f2fs_add_dentry 0xa6/0x230 fs/f2fs/dir.c:784 f2fs_do_add_link 0x190/0x280 fs/f2fs/dir.c:827 f2fs_add_link fs/f2fs/f2fs.h:3554 [inline] f2fs_mkdir 0x377/0x620 fs/f2fs/namei.c:781 vfs_mkdir 0x532/0x7e0 fs/namei.c:4117 ovl_do_mkdir fs/overlayfs/overlayfs.h:196 [inline] ovl_mkdir_real 0xb5/0x370 fs/overlayfs/dir.c:146 ovl_workdir_create 0x3de/0x820 fs/overlayfs/super.c:309 ovl_make_workdir fs/overlayfs/super.c:711 [inline] ovl_get_workdir fs/overlayfs/super.c:864 [inline] ovl_fill_super 0xdab/0x6180 fs/overlayfs/super.c:1400 vfs_get_super 0xf9/0x290 fs/super.c:1152 vfs_get_tree 0x88/0x350 fs/super.c:1519 do_new_mount fs/namespace.c:3335 [inline] path_mount 0x1492/0x1ed0 fs/namespace.c:3662 do_mount fs/namespace.c:3675 [inline] __do_sys_mount fs/namespace.c:3884 [inline] __se_sys_mount fs/namespace.c:3861 [inline] __x64_sys_mount 0x293/0x310 fs/namespace.c:3861 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64 0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe 0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(&fi->i_xattr_sem); lock(&fi->i_sem); lock(&fi->i_xattr_sem); lock(&fi->i_sem); Cc: <[email protected]> Reported-and-tested-by: syzbot [email protected] Fixes: 5eda1ad "f2fs: fix deadlock in i_xattr_sem and inode page lock" Tested-by: Guenter Roeck <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 6f0df8e upstream. In the eviction recency check, we attempt to retrieve the memcg to which the folio belonged when it was evicted, by the memcg id stored in the shadow entry. However, there is a chance that the retrieved memcg is not the original memcg that has been killed, but a new one which happens to have the same id. This is a somewhat unfortunate, but acceptable and rare inaccuracy in the heuristics. However, if we retrieve this new memcg between its allocation and when it is properly attached to the memcg hierarchy, we could run into the following NULL pointer exception during the memcg hierarchy traversal done in mem_cgroup_get_nr_swap_pages(): [ 155757.793456] BUG: kernel NULL pointer dereference, address: 00000000000000c0 [ 155757.807568] #PF: supervisor read access in kernel mode [ 155757.818024] #PF: error_code(0x0000) - not-present page [ 155757.828482] PGD 401f77067 P4D 401f77067 PUD 401f76067 PMD 0 [ 155757.839985] Oops: 0000 [#1] SMP [ 155757.887870] RIP: 0010:mem_cgroup_get_nr_swap_pages 0x3d/0xb0 [ 155757.899377] Code: 29 19 4a 02 48 39 f9 74 63 48 8b 97 c0 00 00 00 48 8b b7 58 02 00 00 48 2b b7 c0 01 00 00 48 39 f0 48 0f 4d c6 48 39 d1 74 42 <48> 8b b2 c0 00 00 00 48 8b ba 58 02 00 00 48 2b ba c0 01 00 00 48 [ 155757.937125] RSP: 0018:ffffc9002ecdfbc8 EFLAGS: 00010286 [ 155757.947755] RAX: 00000000003a3b1c RBX: 000007ffffffffff RCX: ffff888280183000 [ 155757.962202] RDX: 0000000000000000 RSI: 0007ffffffffffff RDI: ffff888bbc2d1000 [ 155757.976648] RBP: 0000000000000001 R08: 000000000000000b R09: ffff888ad9cedba0 [ 155757.991094] R10: ffffea0039c07900 R11: 0000000000000010 R12: ffff888b23a7b000 [ 155758.005540] R13: 0000000000000000 R14: ffff888bbc2d1000 R15: 000007ffffc71354 [ 155758.019991] FS: 00007f6234c68640(0000) GS:ffff88903f9c0000(0000) knlGS:0000000000000000 [ 155758.036356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 155758.048023] CR2: 00000000000000c0 CR3: 0000000a83eb8004 CR4: 00000000007706e0 [ 155758.062473] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 155758.076924] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 155758.091376] PKRU: 55555554 [ 155758.096957] Call Trace: [ 155758.102016] <TASK> [ 155758.106502] ? __die 0x78/0xc0 [ 155758.112793] ? page_fault_oops 0x286/0x380 [ 155758.121175] ? exc_page_fault 0x5d/0x110 [ 155758.129209] ? asm_exc_page_fault 0x22/0x30 [ 155758.137763] ? mem_cgroup_get_nr_swap_pages 0x3d/0xb0 [ 155758.148060] workingset_test_recent 0xda/0x1b0 [ 155758.157133] workingset_refault 0xca/0x1e0 [ 155758.165508] filemap_add_folio 0x4d/0x70 [ 155758.173538] page_cache_ra_unbounded 0xed/0x190 [ 155758.182919] page_cache_sync_ra 0xd6/0x1e0 [ 155758.191738] filemap_read 0x68d/0xdf0 [ 155758.199495] ? mlx5e_napi_poll 0x123/0x940 [ 155758.207981] ? __napi_schedule 0x55/0x90 [ 155758.216095] __x64_sys_pread64 0x1d6/0x2c0 [ 155758.224601] do_syscall_64 0x3d/0x80 [ 155758.232058] entry_SYSCALL_64_after_hwframe 0x46/0xb0 [ 155758.242473] RIP: 0033:0x7f62c29153b5 [ 155758.249938] Code: e8 48 89 75 f0 89 7d f8 48 89 4d e0 e8 b4 e6 f7 ff 41 89 c0 4c 8b 55 e0 48 8b 55 e8 48 8b 75 f0 8b 7d f8 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 45 f8 e8 e7 e6 f7 ff 48 8b [ 155758.288005] RSP: 002b:00007f6234c5ffd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011 [ 155758.303474] RAX: ffffffffffffffda RBX: 00007f628c4e70c0 RCX: 00007f62c29153b5 [ 155758.318075] RDX: 000000000003c041 RSI: 00007f61d2986000 RDI: 0000000000000076 [ 155758.332678] RBP: 00007f6234c5fff0 R08: 0000000000000000 R09: 0000000064d5230c [ 155758.347452] R10: 000000000027d450 R11: 0000000000000293 R12: 000000000003c041 [ 155758.362044] R13: 00007f61d2986000 R14: 00007f629e11b060 R15: 000000000027d450 [ 155758.376661] </TASK> This patch fixes the issue by moving the memcg's id publication from the alloc stage to online stage, ensuring that any memcg acquired via id must be connected to the memcg tree. Link: https://lkml.kernel.org/r/[email protected] Fixes: f78dfc7 ("workingset: fix confusion around eviction vs refault container") Signed-off-by: Johannes Weiner <[email protected]> Co-developed-by: Nhat Pham <[email protected]> Signed-off-by: Nhat Pham <[email protected]> Acked-by: Shakeel Butt <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit e7f1326 upstream. One of the CI runs triggered the following panic assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229 ------------[ cut here ]------------ kernel BUG at fs/btrfs/subpage.c:229! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 0 PID: 923660 Comm: btrfs Not tainted 6.5.0-rc3 #1 pstate: 61400005 (nZCv daif PAN -UAO -TCO DIT -SSBS BTYPE=--) pc : btrfs_subpage_assert 0xbc/0xf0 lr : btrfs_subpage_assert 0xbc/0xf0 sp : ffff800093213720 x29: ffff800093213720 x28: ffff8000932138b4 x27: 000000000c280000 x26: 00000001b5d00000 x25: 000000000c281000 x24: 000000000c281fff x23: 0000000000001000 x22: 0000000000000000 x21: ffffff42b95bf880 x20: ffff42b9528e0000 x19: 0000000000001000 x18: ffffffffffffffff x17: 667274622f736620 x16: 6e69202c65746176 x15: 0000000000000028 x14: 0000000000000003 x13: 00000000002672d7 x12: 0000000000000000 x11: ffffcd3f0ccd9204 x10: ffffcd3f0554ae50 x9 : ffffcd3f0379528c x8 : ffff800093213428 x7 : 0000000000000000 x6 : ffffcd3f091771e8 x5 : ffff42b97f333948 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff42b9556cde80 x0 : 000000000000004f Call trace: btrfs_subpage_assert 0xbc/0xf0 btrfs_subpage_set_dirty 0x38/0xa0 btrfs_page_set_dirty 0x58/0x88 relocate_one_page 0x204/0x5f0 relocate_file_extent_cluster 0x11c/0x180 relocate_data_extent 0xd0/0xf8 relocate_block_group 0x3d0/0x4e8 btrfs_relocate_block_group 0x2d8/0x490 btrfs_relocate_chunk 0x54/0x1a8 btrfs_balance 0x7f4/0x1150 btrfs_ioctl 0x10f0/0x20b8 __arm64_sys_ioctl 0x120/0x11d8 invoke_syscall.constprop.0 0x80/0xd8 do_el0_svc 0x6c/0x158 el0_svc 0x50/0x1b0 el0t_64_sync_handler 0x120/0x130 el0t_64_sync 0x194/0x198 Code: 91098021 b0007fa0 91346000 97e9c6d2 (d4210000) This is the same problem outlined in 17b17fc ("btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand") , and the fix is the same. I originally looked for the same pattern elsewhere in our code, but mistakenly skipped over this code because I saw the page cache readahead before we set_page_extent_mapped, not realizing that this was only in the !page case, that we can still end up with a !uptodate page and then do the btrfs_read_folio further down. The fix here is the same as the above mentioned patch, move the set_page_extent_mapped call to after the btrfs_read_folio() block to make sure that we have the subpage blocksize stuff setup properly before using the page. CC: [email protected] # 6.1 Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit f1187ef upstream. Fix a goof where KVM tries to grab source vCPUs from the destination VM when doing intrahost migration. Grabbing the wrong vCPU not only hoses the guest, it also crashes the host due to the VMSA pointer being left NULL. BUG: unable to handle page fault for address: ffffe38687000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 39 PID: 17143 Comm: sev_migrate_tes Tainted: GO 6.5.0-smp--fff2e47e6c3b-next torvalds#151 Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 34.28.0 07/10/2023 RIP: 0010:__free_pages 0x15/0xd0 RSP: 0018:ffff923fcf6e3c78 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffe38687000000 RCX: 0000000000000100 RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffffe38687000000 RBP: ffff923fcf6e3c88 R08: ffff923fcafb0000 R09: 0000000000000000 R10: 0000000000000000 R11: ffffffff83619b90 R12: ffff923fa9540000 R13: 0000000000080007 R14: ffff923f6d35d000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff929d0d7c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffe38687000000 CR3: 0000005224c34005 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> sev_free_vcpu 0xcb/0x110 [kvm_amd] svm_vcpu_free 0x75/0xf0 [kvm_amd] kvm_arch_vcpu_destroy 0x36/0x140 [kvm] kvm_destroy_vcpus 0x67/0x100 [kvm] kvm_arch_destroy_vm 0x161/0x1d0 [kvm] kvm_put_kvm 0x276/0x560 [kvm] kvm_vm_release 0x25/0x30 [kvm] __fput 0x106/0x280 ____fput 0x12/0x20 task_work_run 0x86/0xb0 do_exit 0x2e3/0x9c0 do_group_exit 0xb1/0xc0 __x64_sys_exit_group 0x1b/0x20 do_syscall_64 0x41/0x90 entry_SYSCALL_64_after_hwframe 0x63/0xcd </TASK> CR2: ffffe38687000000 Fixes: 6defa24 ("KVM: SEV: Init target VMCBs in sev_migrate_from") Cc: [email protected] Cc: Peter Gonda <[email protected]> Reviewed-by: Peter Gonda <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

[ Upstream commit 2810c1e ] Inject fault while probing kunit-example-test.ko, if kstrdup() fails in mod_sysfs_setup() in load_module(), the mod->state will switch from MODULE_STATE_COMING to MODULE_STATE_GOING instead of from MODULE_STATE_LIVE to MODULE_STATE_GOING, so only kunit_module_exit() will be called without kunit_module_init(), and the mod->kunit_suites is no set correctly and the free in kunit_free_suite_set() will cause below wild-memory-access bug. The mod->state state machine when load_module() succeeds: MODULE_STATE_UNFORMED ---> MODULE_STATE_COMING ---> MODULE_STATE_LIVE ^ | | | delete_module ---------------- MODULE_STATE_GOING <--------- The mod->state state machine when load_module() fails at mod_sysfs_setup(): MODULE_STATE_UNFORMED ---> MODULE_STATE_COMING ---> MODULE_STATE_GOING ^ | | | ----------------------------------------------- Call kunit_module_init() at MODULE_STATE_COMING state to fix the issue because MODULE_STATE_LIVE is transformed from it. Unable to handle kernel paging request at virtual address ffffff341e942a88 KASAN: maybe wild-memory-access in range [0x0003f9a0f4a15440-0x0003f9a0f4a15447] Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000441ea000 [ffffff341e942a88] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP Modules linked in: kunit_example_test(-) cfg80211 rfkill 8021q garp mrp stp llc ipv6 [last unloaded: kunit_example_test] CPU: 3 PID: 2035 Comm: modprobe Tainted: G W N 6.5.0-next-20230828 torvalds#136 Hardware name: linux,dummy-virt (DT) pstate: a0000005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : kfree 0x2c/0x70 lr : kunit_free_suite_set 0xcc/0x13c sp : ffff8000829b75b0 x29: ffff8000829b75b0 x28: ffff8000829b7b90 x27: 0000000000000000 x26: dfff800000000000 x25: ffffcd07c82a7280 x24: ffffcd07a50ab300 x23: ffffcd07a50ab2e8 x22: 1ffff00010536ec0 x21: dfff800000000000 x20: ffffcd07a50ab2f0 x19: ffffcd07a50ab2f0 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: ffffcd07c24b6764 x14: ffffcd07c24b63c0 x13: ffffcd07c4cebb94 x12: ffff700010536ec7 x11: 1ffff00010536ec6 x10: ffff700010536ec6 x9 : dfff800000000000 x8 : 00008fffefac913a x7 : 0000000041b58ab3 x6 : 0000000000000000 x5 : 1ffff00010536ec5 x4 : ffff8000829b7628 x3 : dfff800000000000 x2 : ffffff341e942a80 x1 : ffffcd07a50aa000 x0 : fffffc0000000000 Call trace: kfree 0x2c/0x70 kunit_free_suite_set 0xcc/0x13c kunit_module_notify 0xd8/0x360 blocking_notifier_call_chain 0xc4/0x128 load_module 0x382c/0x44a4 init_module_from_file 0xd4/0x128 idempotent_init_module 0x2c8/0x524 __arm64_sys_finit_module 0xac/0x100 invoke_syscall 0x6c/0x258 el0_svc_common.constprop.0 0x160/0x22c do_el0_svc 0x44/0x5c el0_svc 0x38/0x78 el0t_64_sync_handler 0x13c/0x158 el0t_64_sync 0x190/0x194 Code: aa0003e1 b25657e0 d34cfc42 8b021802 (f9400440) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs Kernel Offset: 0x4d0742200000 from 0xffff800080000000 PHYS_OFFSET: 0xffffee43c0000000 CPU features: 0x88000203,3c020000,1000421b Memory Limit: none Rebooting in 1 seconds.. Fixes: 3d6e446 ("kunit: unify module and builtin suite definitions") Signed-off-by: Jinjie Ruan <[email protected]> Reviewed-by: Rae Moar <[email protected]> Reviewed-by: David Gow <[email protected]> Signed-off-by: Shuah Khan <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

…n smcr_port_add [ Upstream commit f5146e3 ] While doing smcr_port_add, there maybe linkgroup add into or delete from smc_lgr_list.list at the same time, which may result kernel crash. So, use smc_lgr_list.lock to protect smc_lgr_list.list iterate in smcr_port_add. The crash calltrace show below: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 559726 Comm: kworker/0:92 Kdump: loaded Tainted: G Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 Workqueue: events smc_ib_port_event_work [smc] RIP: 0010:smcr_port_add 0xa6/0xf0 [smc] RSP: 0000:ffffa5a2c8f67de0 EFLAGS: 00010297 RAX: 0000000000000001 RBX: ffff9935e0650000 RCX: 0000000000000000 RDX: 0000000000000010 RSI: ffff9935e0654290 RDI: ffff9935c8560000 RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9934c0401918 R10: 0000000000000000 R11: ffffffffb4a5c278 R12: ffff99364029aae4 R13: ffff99364029aa00 R14: 00000000ffffffed R15: ffff99364029ab08 FS: 0000000000000000(0000) GS:ffff994380600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000f06a10003 CR4: 0000000002770ef0 PKRU: 55555554 Call Trace: smc_ib_port_event_work 0x18f/0x380 [smc] process_one_work 0x19b/0x340 worker_thread 0x30/0x370 ? process_one_work 0x340/0x340 kthread 0x114/0x130 ? __kthread_cancel_work 0x50/0x50 ret_from_fork 0x1f/0x30 Fixes: 1f90a05 ("net/smc: add smcr_port_add() and smcr_link_up() processing") Signed-off-by: Guangguan Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 403f0e7 ] macb_set_tx_clk() is called under a spinlock but itself calls clk_set_rate() which can sleep. This results in: | BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 | pps pps1: new PPS source ptp1 | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 40, name: kworker/u4:3 | preempt_count: 1, expected: 0 | RCU nest depth: 0, expected: 0 | 4 locks held by kworker/u4:3/40: | #0: ffff000003409148 | macb ff0c0000.ethernet: gem-ptp-timer ptp clock registered. | ((wq_completion)events_power_efficient){ . .}-{0:0}, at: process_one_work 0x14c/0x51c | #1: ffff8000833cbdd8 ((work_completion)(&pl->resolve)){ . .}-{0:0}, at: process_one_work 0x14c/0x51c | #2: ffff000004f01578 (&pl->state_mutex){ . .}-{4:4}, at: phylink_resolve 0x44/0x4e8 | #3: ffff000004f06f50 (&bp->lock){....}-{3:3}, at: macb_mac_link_up 0x40/0x2ac | irq event stamp: 113998 | hardirqs last enabled at (113997): [<ffff800080e8503c>] _raw_spin_unlock_irq 0x30/0x64 | hardirqs last disabled at (113998): [<ffff800080e84478>] _raw_spin_lock_irqsave 0xac/0xc8 | softirqs last enabled at (113608): [<ffff800080010630>] __do_softirq 0x430/0x4e4 | softirqs last disabled at (113597): [<ffff80008001614c>] ____do_softirq 0x10/0x1c | CPU: 0 PID: 40 Comm: kworker/u4:3 Not tainted 6.5.0-11717-g9355ce8b2f50-dirty torvalds#368 | Hardware name: ... ZynqMP ... (DT) | Workqueue: events_power_efficient phylink_resolve | Call trace: | dump_backtrace 0x98/0xf0 | show_stack 0x18/0x24 | dump_stack_lvl 0x60/0xac | dump_stack 0x18/0x24 | __might_resched 0x144/0x24c | __might_sleep 0x48/0x98 | __mutex_lock 0x58/0x7b0 | mutex_lock_nested 0x24/0x30 | clk_prepare_lock 0x4c/0xa8 | clk_set_rate 0x24/0x8c | macb_mac_link_up 0x25c/0x2ac | phylink_resolve 0x178/0x4e8 | process_one_work 0x1ec/0x51c | worker_thread 0x1ec/0x3e4 | kthread 0x120/0x124 | ret_from_fork 0x10/0x20 The obvious fix is to move the call to macb_set_tx_clk() out of the protected area. This seems safe as rx and tx are both disabled anyway at this point. It is however not entirely clear what the spinlock shall protect. It could be the read-modify-write access to the NCFGR register, but this is accessed in macb_set_rx_mode() and macb_set_rxcsum_feature() as well without holding the spinlock. It could also be the register accesses done in mog_init_rings() or macb_init_buffers(), but again these functions are called without holding the spinlock in macb_hresp_error_task(). The locking seems fishy in this driver and it might deserve another look before this patch is applied. Fixes: 633e98a ("net: macb: use resolved link config in mac_link_up()") Signed-off-by: Sascha Hauer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 910e230 ] Macro symbol_put() is defined as __symbol_put(__stringify(x)) ksym_name = "jiffies" symbol_put(ksym_name) will be resolved as __symbol_put("ksym_name") which is clearly wrong. So symbol_put must be replaced with __symbol_put. When we uninstall hw_breakpoint.ko (rmmod), a kernel bug occurs with the following error: [11381.854152] kernel BUG at kernel/module/main.c:779! [11381.854159] invalid opcode: 0000 [#2] PREEMPT SMP PTI [11381.854163] CPU: 8 PID: 59623 Comm: rmmod Tainted: G D OE 6.2.9-200.fc37.x86_64 #1 [11381.854167] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B360M-HDV, BIOS P3.20 10/23/2018 [11381.854169] RIP: 0010:__symbol_put 0xa2/0xb0 [11381.854175] Code: 00 e8 92 d2 f7 ff 65 8b 05 c3 2f e6 78 85 c0 74 1b 48 8b 44 24 30 65 48 2b 04 25 28 00 00 00 75 12 48 83 c4 38 c3 cc cc cc cc <0f> 0b 0f 1f 44 00 00 eb de e8 c0 df d8 00 90 90 90 90 90 90 90 90 [11381.854178] RSP: 0018:ffffad8ec6ae7dd0 EFLAGS: 00010246 [11381.854181] RAX: 0000000000000000 RBX: ffffffffc1fd1240 RCX: 000000000000000c [11381.854184] RDX: 000000000000006b RSI: ffffffffc02bf7c7 RDI: ffffffffc1fd001c [11381.854186] RBP: 000055a38b76e7c8 R08: ffffffff871ccfe0 R09: 0000000000000000 [11381.854188] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [11381.854190] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [11381.854192] FS: 00007fbf7c62c740(0000) GS:ffff8c5badc00000(0000) knlGS:0000000000000000 [11381.854195] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [11381.854197] CR2: 000055a38b7793f8 CR3: 0000000363e1e001 CR4: 00000000003726e0 [11381.854200] DR0: ffffffffb3407980 DR1: 0000000000000000 DR2: 0000000000000000 [11381.854202] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [11381.854204] Call Trace: [11381.854207] <TASK> [11381.854212] s_module_exit 0xc/0xff0 [symbol_getput] [11381.854219] __do_sys_delete_module.constprop.0 0x198/0x2f0 [11381.854225] do_syscall_64 0x58/0x80 [11381.854231] ? exit_to_user_mode_prepare 0x180/0x1f0 [11381.854237] ? syscall_exit_to_user_mode 0x17/0x40 [11381.854241] ? do_syscall_64 0x67/0x80 [11381.854245] ? syscall_exit_to_user_mode 0x17/0x40 [11381.854248] ? do_syscall_64 0x67/0x80 [11381.854252] ? exc_page_fault 0x70/0x170 [11381.854256] entry_SYSCALL_64_after_hwframe 0x72/0xdc Signed-off-by: Rong Tao <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit b97f96e ] When compiling the kernel with clang and having HARDENED_USERCOPY enabled, the liburing openat2.t test case fails during request setup: usercopy: Kernel memory overwrite attempt detected to SLUB object 'io_kiocb' (offset 24, size 24)! ------------[ cut here ]------------ kernel BUG at mm/usercopy.c:102! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 3 PID: 413 Comm: openat2.t Tainted: G N 6.4.3-g6995e2de6891-dirty torvalds#19 Hardware name: QEMU Standard PC (i440FX PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 RIP: 0010:usercopy_abort 0x84/0x90 Code: ce 49 89 ce 48 c7 c3 68 48 98 82 48 0f 44 de 48 c7 c7 56 c6 94 82 4c 89 de 48 89 c1 41 52 41 56 53 e8 e0 51 c5 00 48 83 c4 18 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 41 57 41 56 RSP: 0018:ffffc950016b3da0 EFLAGS: 00010296 RAX: 0000000000000062 RBX: ffffffff82984868 RCX: 4e9b661ac6275b00 RDX: ffff8881b90ec580 RSI: ffffffff82949a64 RDI: 00000000ffffffff RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000 R10: ffffc950016b3c88 R11: ffffc950016b3c30 R12: 00007ffe549659e0 R13: ffff888119014000 R14: 0000000000000018 R15: 0000000000000018 FS: 00007f862e3ca680(0000) GS:ffff8881b90c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005571483542a8 CR3: 0000000118c11000 CR4: 00000000003506e0 Call Trace: <TASK> ? __die_body 0x63/0xb0 ? die 0x9d/0xc0 ? do_trap 0xa7/0x180 ? usercopy_abort 0x84/0x90 ? do_error_trap 0xc6/0x110 ? usercopy_abort 0x84/0x90 ? handle_invalid_op 0x2c/0x40 ? usercopy_abort 0x84/0x90 ? exc_invalid_op 0x2f/0x40 ? asm_exc_invalid_op 0x16/0x20 ? usercopy_abort 0x84/0x90 __check_heap_object 0xe2/0x110 __check_object_size 0x142/0x3d0 io_openat2_prep 0x68/0x140 io_submit_sqes 0x28a/0x680 __se_sys_io_uring_enter 0x120/0x580 do_syscall_64 0x3d/0x80 entry_SYSCALL_64_after_hwframe 0x46/0xb0 RIP: 0033:0x55714834de26 Code: ca 01 0f b6 82 d0 00 00 00 8b ba cc 00 00 00 45 31 c0 31 d2 41 b9 08 00 00 00 83 e0 01 c1 e0 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> 66 0f 1f 84 00 00 00 00 00 89 30 eb 89 0f 1f 40 00 8b 00 a8 06 RSP: 002b:00007ffe549659c8 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa RAX: ffffffffffffffda RBX: 00007ffe54965a50 RCX: 000055714834de26 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000008 R10: 0000000000000000 R11: 0000000000000246 R12: 000055714834f057 R13: 00007ffe54965a50 R14: 0000000000000001 R15: 0000557148351dd8 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- when it tries to copy struct open_how from userspace into the per-command space in the io_kiocb. There's nothing wrong with the copy, but we're missing the appropriate annotations for allowing user copies to/from the io_kiocb slab. Allow copies in the per-command area, which is from the 'file' pointer to when 'opcode' starts. We do have existing user copies there, but they are not all annotated like the one that openat2_prep() uses, copy_struct_from_user(). But in practice opcodes should be allowed to copy data into their per-command area in the io_kiocb. Reported-by: Breno Leitao <[email protected]> Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 2319b9c ] The device may be scheduled during the resume process, so this cannot appear in atomic operations. Since pm_runtime_set_active will resume suppliers, put set active outside the spin lock, which is only used to protect the struct cdns data structure, otherwise the kernel will report the following warning: BUG: sleeping function called from invalid context at drivers/base/power/runtime.c:1163 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 651, name: sh preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 CPU: 0 PID: 651 Comm: sh Tainted: G WC 6.1.20 #1 Hardware name: Freescale i.MX8QM MEK (DT) Call trace: dump_backtrace.part.0 0xe0/0xf0 show_stack 0x18/0x30 dump_stack_lvl 0x64/0x80 dump_stack 0x1c/0x38 __might_resched 0x1fc/0x240 __might_sleep 0x68/0xc0 __pm_runtime_resume 0x9c/0xe0 rpm_get_suppliers 0x68/0x1b0 __pm_runtime_set_status 0x298/0x560 cdns_resume 0xb0/0x1c0 cdns3_controller_resume.isra.0 0x1e0/0x250 cdns3_plat_resume 0x28/0x40 Signed-off-by: Xiaolei Wang <[email protected]> Acked-by: Peter Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit af42269 ] For cases where icc_bw_set() can be called in callbaths that could deadlock against shrinker/reclaim, such as runpm resume, we need to decouple the icc locking. Introduce a new icc_bw_lock for cases where we need to serialize bw aggregation and update to decouple that from paths that require memory allocation such as node/link creation/ destruction. Fixes this lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.2.0-rc8-debug torvalds#554 Not tainted ------------------------------------------------------ ring0/132 is trying to acquire lock: ffffff80871916d0 (&gmu->lock){ . .}-{3:3}, at: a6xx_pm_resume 0xf0/0x234 but task is already holding lock: ffffffdb5aee57e8 (dma_fence_map){ }-{0:0}, at: msm_job_run 0x68/0x150 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (dma_fence_map){ }-{0:0}: __dma_fence_might_wait 0x74/0xc0 dma_resv_lockdep 0x1f4/0x2f4 do_one_initcall 0x104/0x2bc kernel_init_freeable 0x344/0x34c kernel_init 0x30/0x134 ret_from_fork 0x10/0x20 -> #3 (mmu_notifier_invalidate_range_start){ . .}-{0:0}: fs_reclaim_acquire 0x80/0xa8 slab_pre_alloc_hook.constprop.0 0x40/0x25c __kmem_cache_alloc_node 0x60/0x1cc __kmalloc 0xd8/0x100 topology_parse_cpu_capacity 0x8c/0x178 get_cpu_for_node 0x88/0xc4 parse_cluster 0x1b0/0x28c parse_cluster 0x8c/0x28c init_cpu_topology 0x168/0x188 smp_prepare_cpus 0x24/0xf8 kernel_init_freeable 0x18c/0x34c kernel_init 0x30/0x134 ret_from_fork 0x10/0x20 -> #2 (fs_reclaim){ . .}-{0:0}: __fs_reclaim_acquire 0x3c/0x48 fs_reclaim_acquire 0x54/0xa8 slab_pre_alloc_hook.constprop.0 0x40/0x25c __kmem_cache_alloc_node 0x60/0x1cc __kmalloc 0xd8/0x100 kzalloc.constprop.0 0x14/0x20 icc_node_create_nolock 0x4c/0xc4 icc_node_create 0x38/0x58 qcom_icc_rpmh_probe 0x1b8/0x248 platform_probe 0x70/0xc4 really_probe 0x158/0x290 __driver_probe_device 0xc8/0xe0 driver_probe_device 0x44/0x100 __driver_attach 0xf8/0x108 bus_for_each_dev 0x78/0xc4 driver_attach 0x2c/0x38 bus_add_driver 0xd0/0x1d8 driver_register 0xbc/0xf8 __platform_driver_register 0x30/0x3c qnoc_driver_init 0x24/0x30 do_one_initcall 0x104/0x2bc kernel_init_freeable 0x344/0x34c kernel_init 0x30/0x134 ret_from_fork 0x10/0x20 -> #1 (icc_lock){ . .}-{3:3}: __mutex_lock 0xcc/0x3c8 mutex_lock_nested 0x30/0x44 icc_set_bw 0x88/0x2b4 _set_opp_bw 0x8c/0xd8 _set_opp 0x19c/0x300 dev_pm_opp_set_opp 0x84/0x94 a6xx_gmu_resume 0x18c/0x804 a6xx_pm_resume 0xf8/0x234 adreno_runtime_resume 0x2c/0x38 pm_generic_runtime_resume 0x30/0x44 __rpm_callback 0x15c/0x174 rpm_callback 0x78/0x7c rpm_resume 0x318/0x524 __pm_runtime_resume 0x78/0xbc adreno_load_gpu 0xc4/0x17c msm_open 0x50/0x120 drm_file_alloc 0x17c/0x228 drm_open_helper 0x74/0x118 drm_open 0xa0/0x144 drm_stub_open 0xd4/0xe4 chrdev_open 0x1b8/0x1e4 do_dentry_open 0x2f8/0x38c vfs_open 0x34/0x40 path_openat 0x64c/0x7b4 do_filp_open 0x54/0xc4 do_sys_openat2 0x9c/0x100 do_sys_open 0x50/0x7c __arm64_sys_openat 0x28/0x34 invoke_syscall 0x8c/0x128 el0_svc_common.constprop.0 0xa0/0x11c do_el0_svc 0xac/0xbc el0_svc 0x48/0xa0 el0t_64_sync_handler 0xac/0x13c el0t_64_sync 0x190/0x194 -> #0 (&gmu->lock){ . .}-{3:3}: __lock_acquire 0xe00/0x1060 lock_acquire 0x1e0/0x2f8 __mutex_lock 0xcc/0x3c8 mutex_lock_nested 0x30/0x44 a6xx_pm_resume 0xf0/0x234 adreno_runtime_resume 0x2c/0x38 pm_generic_runtime_resume 0x30/0x44 __rpm_callback 0x15c/0x174 rpm_callback 0x78/0x7c rpm_resume 0x318/0x524 __pm_runtime_resume 0x78/0xbc pm_runtime_get_sync.isra.0 0x14/0x20 msm_gpu_submit 0x58/0x178 msm_job_run 0x78/0x150 drm_sched_main 0x290/0x370 kthread 0xf0/0x100 ret_from_fork 0x10/0x20 other info that might help us debug this: Chain exists of: &gmu->lock --> mmu_notifier_invalidate_range_start --> dma_fence_map Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(dma_fence_map); lock(mmu_notifier_invalidate_range_start); lock(dma_fence_map); lock(&gmu->lock); *** DEADLOCK *** 2 locks held by ring0/132: #0: ffffff8087191170 (&gpu->lock){ . .}-{3:3}, at: msm_job_run 0x64/0x150 #1: ffffffdb5aee57e8 (dma_fence_map){ }-{0:0}, at: msm_job_run 0x68/0x150 stack backtrace: CPU: 7 PID: 132 Comm: ring0 Not tainted 6.2.0-rc8-debug torvalds#554 Hardware name: Google Lazor (rev1 - 2) with LTE (DT) Call trace: dump_backtrace.part.0 0xb4/0xf8 show_stack 0x20/0x38 dump_stack_lvl 0x9c/0xd0 dump_stack 0x18/0x34 print_circular_bug 0x1b4/0x1f0 check_noncircular 0x78/0xac __lock_acquire 0xe00/0x1060 lock_acquire 0x1e0/0x2f8 __mutex_lock 0xcc/0x3c8 mutex_lock_nested 0x30/0x44 a6xx_pm_resume 0xf0/0x234 adreno_runtime_resume 0x2c/0x38 pm_generic_runtime_resume 0x30/0x44 __rpm_callback 0x15c/0x174 rpm_callback 0x78/0x7c rpm_resume 0x318/0x524 __pm_runtime_resume 0x78/0xbc pm_runtime_get_sync.isra.0 0x14/0x20 msm_gpu_submit 0x58/0x178 msm_job_run 0x78/0x150 drm_sched_main 0x290/0x370 kthread 0xf0/0x100 ret_from_fork 0x10/0x20 Signed-off-by: Rob Clark <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Georgi Djakov <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit bc056e7 ] When we calculate the end position of ext4_free_extent, this position may be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not the first case of adjusting the best extent, that is, new_bex_end > 0, the following BUG_ON will be triggered: ========================================================= kernel BUG at fs/ext4/mballoc.c:5116! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1 torvalds#279 RIP: 0010:ext4_mb_new_inode_pa 0xc5/0x430 Call Trace: <TASK> ext4_mb_use_best_found 0x203/0x2f0 ext4_mb_try_best_found 0x163/0x240 ext4_mb_regular_allocator 0x158/0x1550 ext4_mb_new_blocks 0x86a/0xe10 ext4_ext_map_blocks 0xb0c/0x13a0 ext4_map_blocks 0x2cd/0x8f0 ext4_iomap_begin 0x27b/0x400 iomap_iter 0x222/0x3d0 __iomap_dio_rw 0x243/0xcb0 iomap_dio_rw 0x16/0x80 ========================================================= A simple reproducer demonstrating the problem: mkfs.ext4 -F /dev/sda -b 4096 100M mount /dev/sda /tmp/test fallocate -l1M /tmp/test/tmp fallocate -l10M /tmp/test/file fallocate -i -o 1M -l16777203M /tmp/test/file fsstress -d /tmp/test -l 0 -n 100000 -p 8 & sleep 10 && killall -9 fsstress rm -f /tmp/test/tmp xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192" We simply refactor the logic for adjusting the best extent by adding a temporary ext4_free_extent ex and use extent_logical_end() to avoid overflow, which also simplifies the code. Cc: [email protected] # 6.4 Fixes: 93cdf49 ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") Signed-off-by: Baokun Li <[email protected]> Reviewed-by: Ritesh Harjani (IBM) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

… tree" commit 3c70de9 upstream. This reverts commit 06f4543. John Ogness reports the case that the allocation is in atomic context under acquired spin-lock. [ 12.555784] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:306 [ 12.555808] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 70, name: kworker/1:2 [ 12.555814] preempt_count: 1, expected: 0 [ 12.555820] INFO: lockdep is turned off. [ 12.555824] irq event stamp: 208 [ 12.555828] hardirqs last enabled at (207): [<c00000000111e414>] ._raw_spin_unlock_irq 0x44/0x80 [ 12.555850] hardirqs last disabled at (208): [<c00000000110ff94>] .__schedule 0x854/0xfe0 [ 12.555859] softirqs last enabled at (188): [<c000000000f73504>] .addrconf_verify_rtnl 0x2c4/0xb70 [ 12.555872] softirqs last disabled at (182): [<c000000000f732b0>] .addrconf_verify_rtnl 0x70/0xb70 [ 12.555884] CPU: 1 PID: 70 Comm: kworker/1:2 Tainted: G S 6.6.0-rc1 #1 [ 12.555893] Hardware name: PowerMac7,2 PPC970 0x390202 PowerMac [ 12.555898] Workqueue: firewire_ohci .bus_reset_work [firewire_ohci] [ 12.555939] Call Trace: [ 12.558634] [c000000009677830] [c0000000010d83c0] .dump_stack_lvl 0x8c/0xd0 (unreliable) [ 12.555963] [c0000000096778b0] [c000000000140270] .__might_resched 0x320/0x340 [ 12.555978] [c000000009677940] [c000000000497600] .__kmem_cache_alloc_node 0x390/0x460 [ 12.555993] [c000000009677a10] [c0000000003fe620] .__kmalloc 0x70/0x310 [ 12.556007] [c000000009677ac0] [c0003d00004e2268] .fw_core_handle_bus_reset 0x2c8/0xba0 [firewire_core] [ 12.556060] [c000000009677c20] [c0003d0000491190] .bus_reset_work 0x330/0x9b0 [firewire_ohci] [ 12.556079] [c000000009677d10] [c00000000011d0d0] .process_one_work 0x280/0x6f0 [ 12.556094] [c000000009677e10] [c00000000011d8a0] .worker_thread 0x360/0x500 [ 12.556107] [c000000009677ef0] [c00000000012e3b4] .kthread 0x154/0x160 [ 12.556120] [c000000009677f90] [c00000000000bfa8] .start_kernel_thread 0x10/0x14 Cc: [email protected] Reported-by: John Ogness <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/raw Signed-off-by: Takashi Sakamoto <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

[ Upstream commit e8c526f ] Martin KaFai Lau reported use-after-free [0] in reqsk_timer_handler(). """ We are seeing a use-after-free from a bpf prog attached to trace_tcp_retransmit_synack. The program passes the req->sk to the bpf_sk_storage_get_tracing kernel helper which does check for null before using it. """ The commit 83fccfc ("inet: fix potential deadlock in reqsk_queue_unlink()") added timer_pending() in reqsk_queue_unlink() not to call del_timer_sync() from reqsk_timer_handler(), but it introduced a small race window. Before the timer is called, expire_timers() calls detach_timer(timer, true) to clear timer->entry.pprev and marks it as not pending. If reqsk_queue_unlink() checks timer_pending() just after expire_timers() calls detach_timer(), TCP will miss del_timer_sync(); the reqsk timer will continue running and send multiple SYN ACKs until it expires. The reported UAF could happen if req->sk is close()d earlier than the timer expiration, which is 63s by default. The scenario would be 1. inet_csk_complete_hashdance() calls inet_csk_reqsk_queue_drop(), but del_timer_sync() is missed 2. reqsk timer is executed and scheduled again 3. req->sk is accept()ed and reqsk_put() decrements rsk_refcnt, but reqsk timer still has another one, and inet_csk_accept() does not clear req->sk for non-TFO sockets 4. sk is close()d 5. reqsk timer is executed again, and BPF touches req->sk Let's not use timer_pending() by passing the caller context to __inet_csk_reqsk_queue_drop(). Note that reqsk timer is pinned, so the issue does not happen in most use cases. [1] [0] BUG: KFENCE: use-after-free read in bpf_sk_storage_get_tracing 0x2e/0x1b0 Use-after-free read at 0x00000000a891fb3a (in kfence-#1): bpf_sk_storage_get_tracing 0x2e/0x1b0 bpf_prog_5ea3e95db6da0438_tcp_retransmit_synack 0x1d20/0x1dda bpf_trace_run2 0x4c/0xc0 tcp_rtx_synack 0xf9/0x100 reqsk_timer_handler 0xda/0x3d0 run_timer_softirq 0x292/0x8a0 irq_exit_rcu 0xf5/0x320 sysvec_apic_timer_interrupt 0x6d/0x80 asm_sysvec_apic_timer_interrupt 0x16/0x20 intel_idle_irq 0x5a/0xa0 cpuidle_enter_state 0x94/0x273 cpu_startup_entry 0x15e/0x260 start_secondary 0x8a/0x90 secondary_startup_64_no_verify 0xfa/0xfb kfence-#1: 0x00000000a72cc7b6-0x00000000d97616d9, size=2376, cache=TCPv6 allocated by task 0 on cpu 9 at 260507.901592s: sk_prot_alloc 0x35/0x140 sk_clone_lock 0x1f/0x3f0 inet_csk_clone_lock 0x15/0x160 tcp_create_openreq_child 0x1f/0x410 tcp_v6_syn_recv_sock 0x1da/0x700 tcp_check_req 0x1fb/0x510 tcp_v6_rcv 0x98b/0x1420 ipv6_list_rcv 0x2258/0x26e0 napi_complete_done 0x5b1/0x2990 mlx5e_napi_poll 0x2ae/0x8d0 net_rx_action 0x13e/0x590 irq_exit_rcu 0xf5/0x320 common_interrupt 0x80/0x90 asm_common_interrupt 0x22/0x40 cpuidle_enter_state 0xfb/0x273 cpu_startup_entry 0x15e/0x260 start_secondary 0x8a/0x90 secondary_startup_64_no_verify 0xfa/0xfb freed by task 0 on cpu 9 at 260507.927527s: rcu_core_si 0x4ff/0xf10 irq_exit_rcu 0xf5/0x320 sysvec_apic_timer_interrupt 0x6d/0x80 asm_sysvec_apic_timer_interrupt 0x16/0x20 cpuidle_enter_state 0xfb/0x273 cpu_startup_entry 0x15e/0x260 start_secondary 0x8a/0x90 secondary_startup_64_no_verify 0xfa/0xfb Fixes: 83fccfc ("inet: fix potential deadlock in reqsk_queue_unlink()") Reported-by: Martin KaFai Lau <[email protected]> Closes: https://lore.kernel.org/netdev/[email protected]/ Link: https://lore.kernel.org/netdev/[email protected]/ [1] Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Martin KaFai Lau <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

@flags

[ Upstream commit 56440d7 ] While running net selftests with CONFIG_PROVE_RCU_LIST=y I saw one lockdep splat [1]. genlmsg_mcast() uses for_each_net_rcu(), and must therefore hold RCU. Instead of letting all callers guard genlmsg_multicast_allns() with a rcu_read_lock()/rcu_read_unlock() pair, do it in genlmsg_mcast(). This also means the @flags parameter is useless, we need to always use GFP_ATOMIC. [1] [10882.424136] ============================= [10882.424166] WARNING: suspicious RCU usage [10882.424309] 6.12.0-rc2-virtme #1156 Not tainted [10882.424400] ----------------------------- [10882.424423] net/netlink/genetlink.c:1940 RCU-list traversed in non-reader section!! [10882.424469] other info that might help us debug this: [10882.424500] rcu_scheduler_active = 2, debug_locks = 1 [10882.424744] 2 locks held by ip/15677: [10882.424791] #0: ffffffffb6b491b0 (cb_lock){ }-{3:3}, at: genl_rcv (net/netlink/genetlink.c:1219) [10882.426334] #1: ffffffffb6b49248 (genl_mutex){ . .}-{3:3}, at: genl_rcv_msg (net/netlink/genetlink.c:61 net/netlink/genetlink.c:57 net/netlink/genetlink.c:1209) [10882.426465] stack backtrace: [10882.426805] CPU: 14 UID: 0 PID: 15677 Comm: ip Not tainted 6.12.0-rc2-virtme #1156 [10882.426919] Hardware name: QEMU Standard PC (i440FX PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [10882.427046] Call Trace: [10882.427131] <TASK> [10882.427244] dump_stack_lvl (lib/dump_stack.c:123) [10882.427335] lockdep_rcu_suspicious (kernel/locking/lockdep.c:6822) [10882.427387] genlmsg_multicast_allns (net/netlink/genetlink.c:1940 (discriminator 7) net/netlink/genetlink.c:1977 (discriminator 7)) [10882.427436] l2tp_tunnel_notify.constprop.0 (net/l2tp/l2tp_netlink.c:119) l2tp_netlink [10882.427683] l2tp_nl_cmd_tunnel_create (net/l2tp/l2tp_netlink.c:253) l2tp_netlink [10882.427748] genl_family_rcv_msg_doit (net/netlink/genetlink.c:1115) [10882.427834] genl_rcv_msg (net/netlink/genetlink.c:1195 net/netlink/genetlink.c:1210) [10882.427877] ? __pfx_l2tp_nl_cmd_tunnel_create (net/l2tp/l2tp_netlink.c:186) l2tp_netlink [10882.427927] ? __pfx_genl_rcv_msg (net/netlink/genetlink.c:1201) [10882.427959] netlink_rcv_skb (net/netlink/af_netlink.c:2551) [10882.428069] genl_rcv (net/netlink/genetlink.c:1220) [10882.428095] netlink_unicast (net/netlink/af_netlink.c:1332 net/netlink/af_netlink.c:1357) [10882.428140] netlink_sendmsg (net/netlink/af_netlink.c:1901) [10882.428210] ____sys_sendmsg (net/socket.c:729 (discriminator 1) net/socket.c:744 (discriminator 1) net/socket.c:2607 (discriminator 1)) Fixes: 33f72e6 ("l2tp : multicast notification to the registered listeners") Signed-off-by: Eric Dumazet <[email protected]> Cc: James Chapman <[email protected]> Cc: Tom Parkin <[email protected]> Cc: Johannes Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 1ab6032 ] When using encryption, either enforced by the server or when using 'seal' mount option, the client will squash all compound request buffers down for encryption into a single iov in smb2_set_next_command(). SMB2_ioctl_init() allocates a small buffer (448 bytes) to hold the SMB2_IOCTL request in the first iov, and if the user passes an input buffer that is greater than 328 bytes, smb2_set_next_command() will end up writing off the end of @rqst->iov[0].iov_base as shown below: mount.cifs //srv/share /mnt -o ...,seal ln -s $(perl -e "print('a')for 1..1024") /mnt/link BUG: KASAN: slab-out-of-bounds in smb2_set_next_command.cold 0x1d6/0x24c [cifs] Write of size 4116 at addr ffff8881148fcab8 by task ln/859 CPU: 1 UID: 0 PID: 859 Comm: ln Not tainted 6.12.0-rc3 #1 Hardware name: QEMU Standard PC (Q35 ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 Call Trace: <TASK> dump_stack_lvl 0x5d/0x80 ? smb2_set_next_command.cold 0x1d6/0x24c [cifs] print_report 0x156/0x4d9 ? smb2_set_next_command.cold 0x1d6/0x24c [cifs] ? __virt_addr_valid 0x145/0x310 ? __phys_addr 0x46/0x90 ? smb2_set_next_command.cold 0x1d6/0x24c [cifs] kasan_report 0xda/0x110 ? smb2_set_next_command.cold 0x1d6/0x24c [cifs] kasan_check_range 0x10f/0x1f0 __asan_memcpy 0x3c/0x60 smb2_set_next_command.cold 0x1d6/0x24c [cifs] smb2_compound_op 0x238c/0x3840 [cifs] ? kasan_save_track 0x14/0x30 ? kasan_save_free_info 0x3b/0x70 ? vfs_symlink 0x1a1/0x2c0 ? do_symlinkat 0x108/0x1c0 ? __pfx_smb2_compound_op 0x10/0x10 [cifs] ? kmem_cache_free 0x118/0x3e0 ? cifs_get_writable_path 0xeb/0x1a0 [cifs] smb2_get_reparse_inode 0x423/0x540 [cifs] ? __pfx_smb2_get_reparse_inode 0x10/0x10 [cifs] ? rcu_is_watching 0x20/0x50 ? __kmalloc_noprof 0x37c/0x480 ? smb2_create_reparse_symlink 0x257/0x490 [cifs] ? smb2_create_reparse_symlink 0x38f/0x490 [cifs] smb2_create_reparse_symlink 0x38f/0x490 [cifs] ? __pfx_smb2_create_reparse_symlink 0x10/0x10 [cifs] ? find_held_lock 0x8a/0xa0 ? hlock_class 0x32/0xb0 ? __build_path_from_dentry_optional_prefix 0x19d/0x2e0 [cifs] cifs_symlink 0x24f/0x960 [cifs] ? __pfx_make_vfsuid 0x10/0x10 ? __pfx_cifs_symlink 0x10/0x10 [cifs] ? make_vfsgid 0x6b/0xc0 ? generic_permission 0x96/0x2d0 vfs_symlink 0x1a1/0x2c0 do_symlinkat 0x108/0x1c0 ? __pfx_do_symlinkat 0x10/0x10 ? strncpy_from_user 0xaa/0x160 __x64_sys_symlinkat 0xb9/0xf0 do_syscall_64 0xbb/0x1d0 entry_SYSCALL_64_after_hwframe 0x77/0x7f RIP: 0033:0x7f08d75c13bb Reported-by: David Howells <[email protected]> Fixes: e77fe73 ("cifs: we can not use small padding iovs together with encryption") Signed-off-by: Paulo Alcantara (Red Hat) <[email protected]> Signed-off-by: Steve French <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit d62b140 ] Command bitmask have a dedicated bit for MANAGE_PAGES command, this bit isn't Initialize during command bitmask Initialization, only during MANAGE_PAGES. In addition, mlx5_cmd_trigger_completions() is trying to trigger completion for MANAGE_PAGES command as well. Hence, in case health error occurred before any MANAGE_PAGES command have been invoke (for example, during mlx5_enable_hca()), mlx5_cmd_trigger_completions() will try to trigger completion for MANAGE_PAGES command, which will result in null-ptr-deref error.[1] Fix it by Initialize command bitmask correctly. While at it, re-write the code for better understanding. [1] BUG: KASAN: null-ptr-deref in mlx5_cmd_trigger_completions 0x1db/0x600 [mlx5_core] Write of size 4 at addr 0000000000000214 by task kworker/u96:2/12078 CPU: 10 PID: 12078 Comm: kworker/u96:2 Not tainted 6.9.0-rc2_for_upstream_debug_2024_04_07_19_01 #1 Hardware name: QEMU Standard PC (Q35 ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Workqueue: mlx5_health0000:08:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core] Call Trace: <TASK> dump_stack_lvl 0x7e/0xc0 kasan_report 0xb9/0xf0 kasan_check_range 0xec/0x190 mlx5_cmd_trigger_completions 0x1db/0x600 [mlx5_core] mlx5_cmd_flush 0x94/0x240 [mlx5_core] enter_error_state 0x6c/0xd0 [mlx5_core] mlx5_fw_fatal_reporter_err_work 0xf3/0x480 [mlx5_core] process_one_work 0x787/0x1490 ? lockdep_hardirqs_on_prepare 0x400/0x400 ? pwq_dec_nr_in_flight 0xda0/0xda0 ? assign_work 0x168/0x240 worker_thread 0x586/0xd30 ? rescuer_thread 0xae0/0xae0 kthread 0x2df/0x3b0 ? kthread_complete_and_exit 0x20/0x20 ret_from_fork 0x2d/0x70 ? kthread_complete_and_exit 0x20/0x20 ret_from_fork_asm 0x11/0x20 </TASK> Fixes: 9b98d39 ("net/mlx5: Start health poll at earlier stage of driver load") Signed-off-by: Shay Drory <[email protected]> Reviewed-by: Moshe Shemesh <[email protected]> Reviewed-by: Saeed Mahameed <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 73f3508 ] When creating a trace_probe we would set nr_args prior to truncating the arguments to MAX_TRACE_ARGS. However, we would only initialize arguments up to the limit. This caused invalid memory access when attempting to set up probes with more than 128 fetchargs. BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7 torvalds#8 Hardware name: QEMU Standard PC (i440FX PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:__set_print_fmt 0x134/0x330 Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return an error when there are too many arguments instead of silently truncating. Link: https://lore.kernel.org/all/[email protected]/ Fixes: 035ba76 ("tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init") Signed-off-by: Mikel Rychliski <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 47dd544 ] The variable wwan_rtnl_link_ops assign a *bigger* maxtype which leads to a global out-of-bounds read when parsing the netlink attributes. Exactly same bug cause as the oob fixed in commit b33fb5b ("net: qualcomm: rmnet: fix global oob in rmnet_policy"). ================================================================== BUG: KASAN: global-out-of-bounds in validate_nla lib/nlattr.c:388 [inline] BUG: KASAN: global-out-of-bounds in __nla_validate_parse 0x19d7/0x29a0 lib/nlattr.c:603 Read of size 1 at addr ffffffff8b09cb60 by task syz.1.66276/323862 CPU: 0 PID: 323862 Comm: syz.1.66276 Not tainted 6.1.70 #1 Hardware name: QEMU Standard PC (i440FX PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl 0x177/0x231 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:284 [inline] print_report 0x14f/0x750 mm/kasan/report.c:395 kasan_report 0x139/0x170 mm/kasan/report.c:495 validate_nla lib/nlattr.c:388 [inline] __nla_validate_parse 0x19d7/0x29a0 lib/nlattr.c:603 __nla_parse 0x3c/0x50 lib/nlattr.c:700 nla_parse_nested_deprecated include/net/netlink.h:1269 [inline] __rtnl_newlink net/core/rtnetlink.c:3514 [inline] rtnl_newlink 0x7bc/0x1fd0 net/core/rtnetlink.c:3623 rtnetlink_rcv_msg 0x794/0xef0 net/core/rtnetlink.c:6122 netlink_rcv_skb 0x1de/0x420 net/netlink/af_netlink.c:2508 netlink_unicast_kernel net/netlink/af_netlink.c:1326 [inline] netlink_unicast 0x74b/0x8c0 net/netlink/af_netlink.c:1352 netlink_sendmsg 0x882/0xb90 net/netlink/af_netlink.c:1874 sock_sendmsg_nosec net/socket.c:716 [inline] __sock_sendmsg net/socket.c:728 [inline] ____sys_sendmsg 0x5cc/0x8f0 net/socket.c:2499 ___sys_sendmsg 0x21c/0x290 net/socket.c:2553 __sys_sendmsg net/socket.c:2582 [inline] __do_sys_sendmsg net/socket.c:2591 [inline] __se_sys_sendmsg 0x19e/0x270 net/socket.c:2589 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64 0x45/0x90 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe 0x63/0xcd RIP: 0033:0x7f67b19a24ad RSP: 002b:00007f67b17febb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f67b1b45f80 RCX: 00007f67b19a24ad RDX: 0000000000000000 RSI: 0000000020005e40 RDI: 0000000000000004 RBP: 00007f67b1a1e01d R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd2513764f R14: 00007ffd251376e0 R15: 00007f67b17fed40 </TASK> The buggy address belongs to the variable: wwan_rtnl_policy 0x20/0x40 The buggy address belongs to the physical page: page:ffffea00002c2700 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xb09c flags: 0xfff00000001000(reserved|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000001000 ffffea00002c2708 ffffea00002c2708 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner info is not present (never set?) Memory state around the buggy address: ffffffff8b09ca00: 05 f9 f9 f9 05 f9 f9 f9 00 01 f9 f9 00 01 f9 f9 ffffffff8b09ca80: 00 00 00 05 f9 f9 f9 f9 00 00 03 f9 f9 f9 f9 f9 >ffffffff8b09cb00: 00 00 00 00 05 f9 f9 f9 00 00 00 00 f9 f9 f9 f9 ^ ffffffff8b09cb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== According to the comment of `nla_parse_nested_deprecated`, use correct size `IFLA_WWAN_MAX` here to fix this issue. Fixes: 88b7105 ("wwan: add interface creation support") Signed-off-by: Lin Ma <[email protected]> Reviewed-by: Loic Poulain <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

…n_net [ Upstream commit d5ff2fb ] In the normal case, when we excute `echo 0 > /proc/fs/nfsd/threads`, the function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will release all resources related to the hashed `nfs4_client`. If the `nfsd_client_shrinker` is running concurrently, the `expire_client` function will first unhash this client and then destroy it. This can lead to the following warning. Additionally, numerous use-after-free errors may occur as well. nfsd_client_shrinker echo 0 > /proc/fs/nfsd/threads expire_client nfsd_shutdown_net unhash_client ... nfs4_state_shutdown_net /* won't wait shrinker exit */ /* cancel_work(&nn->nfsd_shrinker_work) * nfsd_file for this /* won't destroy unhashed client1 */ * client1 still alive nfs4_state_destroy_net */ nfsd_file_cache_shutdown /* trigger warning */ kmem_cache_destroy(nfsd_file_slab) kmem_cache_destroy(nfsd_file_mark_slab) /* release nfsd_file and mark */ __destroy_client ==================================================================== BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on __kmem_cache_shutdown() -------------------------------------------------------------------- CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3 #1 dump_stack_lvl 0x53/0x70 slab_err 0xb0/0xf0 __kmem_cache_shutdown 0x15c/0x310 kmem_cache_destroy 0x66/0x160 nfsd_file_cache_shutdown 0xac/0x210 [nfsd] nfsd_destroy_serv 0x251/0x2a0 [nfsd] nfsd_svc 0x125/0x1e0 [nfsd] write_threads 0x16a/0x2a0 [nfsd] nfsctl_transaction_write 0x74/0xa0 [nfsd] vfs_write 0x1a5/0x6d0 ksys_write 0xc1/0x160 do_syscall_64 0x5f/0x170 entry_SYSCALL_64_after_hwframe 0x76/0x7e ==================================================================== BUG nfsd_file_mark (Tainted: G B W ): Objects remaining nfsd_file_mark on __kmem_cache_shutdown() -------------------------------------------------------------------- dump_stack_lvl 0x53/0x70 slab_err 0xb0/0xf0 __kmem_cache_shutdown 0x15c/0x310 kmem_cache_destroy 0x66/0x160 nfsd_file_cache_shutdown 0xc8/0x210 [nfsd] nfsd_destroy_serv 0x251/0x2a0 [nfsd] nfsd_svc 0x125/0x1e0 [nfsd] write_threads 0x16a/0x2a0 [nfsd] nfsctl_transaction_write 0x74/0xa0 [nfsd] vfs_write 0x1a5/0x6d0 ksys_write 0xc1/0x160 do_syscall_64 0x5f/0x170 entry_SYSCALL_64_after_hwframe 0x76/0x7e To resolve this issue, cancel `nfsd_shrinker_work` using synchronous mode in nfs4_state_shutdown_net. Fixes: 7c24fa2 ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker") Signed-off-by: Yang Erkun <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

commit 6889cd2 upstream. During fuzz testing, the following issue was discovered: BUG: KMSAN: kernel-infoleak in _copy_to_iter 0x598/0x2a30 _copy_to_iter 0x598/0x2a30 __skb_datagram_iter 0x168/0x1060 skb_copy_datagram_iter 0x5b/0x220 netlink_recvmsg 0x362/0x1700 sock_recvmsg 0x2dc/0x390 __sys_recvfrom 0x381/0x6d0 __x64_sys_recvfrom 0x130/0x200 x64_sys_call 0x32c8/0x3cc0 do_syscall_64 0xd8/0x1c0 entry_SYSCALL_64_after_hwframe 0x79/0x81 Uninit was stored to memory at: copy_to_user_state_extra 0xcc1/0x1e00 dump_one_state 0x28c/0x5f0 xfrm_state_walk 0x548/0x11e0 xfrm_dump_sa 0x1e0/0x840 netlink_dump 0x943/0x1c40 __netlink_dump_start 0x746/0xdb0 xfrm_user_rcv_msg 0x429/0xc00 netlink_rcv_skb 0x613/0x780 xfrm_netlink_rcv 0x77/0xc0 netlink_unicast 0xe90/0x1280 netlink_sendmsg 0x126d/0x1490 __sock_sendmsg 0x332/0x3d0 ____sys_sendmsg 0x863/0xc30 ___sys_sendmsg 0x285/0x3e0 __x64_sys_sendmsg 0x2d6/0x560 x64_sys_call 0x1316/0x3cc0 do_syscall_64 0xd8/0x1c0 entry_SYSCALL_64_after_hwframe 0x79/0x81 Uninit was created at: __kmalloc 0x571/0xd30 attach_auth 0x106/0x3e0 xfrm_add_sa 0x2aa0/0x4230 xfrm_user_rcv_msg 0x832/0xc00 netlink_rcv_skb 0x613/0x780 xfrm_netlink_rcv 0x77/0xc0 netlink_unicast 0xe90/0x1280 netlink_sendmsg 0x126d/0x1490 __sock_sendmsg 0x332/0x3d0 ____sys_sendmsg 0x863/0xc30 ___sys_sendmsg 0x285/0x3e0 __x64_sys_sendmsg 0x2d6/0x560 x64_sys_call 0x1316/0x3cc0 do_syscall_64 0xd8/0x1c0 entry_SYSCALL_64_after_hwframe 0x79/0x81 Bytes 328-379 of 732 are uninitialized Memory access of size 732 starts at ffff88800e18e000 Data copied to user address 00007ff30f48aff0 CPU: 2 PID: 18167 Comm: syz-executor.0 Not tainted 6.8.11 #1 Hardware name: QEMU Standard PC (i440FX PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Fixes copying of xfrm algorithms where some random data of the structure fields can end up in userspace. Padding in structures may be filled with random (possibly sensitve) data and should never be given directly to user-space. A similar issue was resolved in the commit 8222d59 ("xfrm: Zero padding when dumping algos and encap") Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: c7a5899 ("xfrm: redact SA secret with lockdown confidentiality") Cc: [email protected] Co-developed-by: Boris Tonofa <[email protected]> Signed-off-by: Boris Tonofa <[email protected]> Signed-off-by: Petr Vaganov <[email protected]> Signed-off-by: Steffen Klassert <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

Currently, when configuring TMU (Time Management Unit) mode of a given router, we take into account only its own TMU requirements ignoring other routers in the domain. This is problematic if the router we are configuring has lower TMU requirements than what is already configured in the domain. In the scenario below, we have a host router with two USB4 ports: A and B. Port A connected to device router #1 (which supports CL states) and existing DisplayPort tunnel, thus, the TMU mode is HiFi uni-directional. 1. Initial topology [Host] A/ / [Device #1] / Monitor 2. Plug in device #2 (that supports CL states) to downstream port B of the host router [Host] A/ B\ / \ [Device #1] [Device #2] / Monitor The TMU mode on port B and port A will be configured to LowRes which is not what we want and will cause monitor to start flickering. To address this we first scan the domain and search for any router configured to HiFi uni-directional mode, and if found, configure TMU mode of the given router to HiFi uni-directional as well. Cc: [email protected] Signed-off-by: Gil Fine <[email protected]> Signed-off-by: Mika Westerberg <[email protected]>

…ing to satisfy some BPF verifiers In a RHEL8 kernel (4.18.0-513.11.1.el8_9.x86_64), that, as enterprise kernels go, have backports from modern kernels, the verifier complains about lack of bounds check for the index into the array of syscall arguments, on a BPF bytecode generated by clang 17, with: ; } else if (size < 0 && size >= -6) { /* buffer */ 116: (b7) r1 = -6 117: (2d) if r1 > r6 goto pc-30 R0=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=inv-6 R2=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3=inv(id=0) R5=inv40 R6=inv(id=0,umin_value=18446744073709551610,var_off=(0xffffffff00000000; 0xffffffff)) R7=map_value(id=0,off=56,ks=4,vs=8272,imm=0) R8=invP6 R9=map_value(id=0,off=20,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16=map_value fp-24=map_value fp-32=inv40 fp-40=ctx fp-48=map_value fp-56=inv1 fp-64=map_value fp-72=map_value fp-80=map_value ; index = -(size 1); 118: (a7) r6 ^= -1 119: (67) r6 <<= 32 120: (77) r6 >>= 32 ; aug_size = args->args[index]; 121: (67) r6 <<= 3 122: (79) r1 = *(u64 *)(r10 -24) 123: (0f) r1 = r6 last_idx 123 first_idx 116 regs=40 stack=0 before 122: (79) r1 = *(u64 *)(r10 -24) regs=40 stack=0 before 121: (67) r6 <<= 3 regs=40 stack=0 before 120: (77) r6 >>= 32 regs=40 stack=0 before 119: (67) r6 <<= 32 regs=40 stack=0 before 118: (a7) r6 ^= -1 regs=40 stack=0 before 117: (2d) if r1 > r6 goto pc-30 regs=42 stack=0 before 116: (b7) r1 = -6 R0_w=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=inv1 R2_w=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3_w=inv(id=0) R5_w=inv40 R6_rw=invP(id=0,smin_value=-2147483648,smax_value=0) R7_w=map_value(id=0,off=56,ks=4,vs=8272,imm=0) R8_w=invP6 R9_w=map_value(id=0,off=20,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16_w=map_value fp-24_r=map_value fp-32_w=inv40 fp-40=ctx fp-48=map_value fp-56_w=inv1 fp-64_w=map_value fp-72=map_value fp-80=map_value parent didn't have regs=40 stack=0 marks last_idx 110 first_idx 98 regs=40 stack=0 before 110: (6d) if r1 s> r6 goto pc 5 regs=42 stack=0 before 109: (b7) r1 = 1 regs=40 stack=0 before 108: (65) if r6 s> 0x1000 goto pc 7 regs=40 stack=0 before 98: (55) if r6 != 0x1 goto pc 9 R0_w=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=invP12 R2_w=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3_rw=inv(id=0) R5_w=inv24 R6_rw=invP(id=0,smin_value=-2147483648,smax_value=2147483647) R7_w=map_value(id=0,off=40,ks=4,vs=8272,imm=0) R8_rw=invP4 R9_w=map_value(id=0,off=12,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16_rw=map_value fp-24_r=map_value fp-32_rw=invP24 fp-40_r=ctx fp-48_r=map_value fp-56_w=invP1 fp-64_rw=map_value fp-72_r=map_value fp-80_r=map_value parent already had regs=40 stack=0 marks 124: (79) r6 = *(u64 *)(r1 16) R0=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=map_value(id=0,off=0,ks=4,vs=8272,umax_value=34359738360,var_off=(0x0; 0x7fffffff8),s32_max_value=2147483640,u32_max_value=-8) R2=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3=inv(id=0) R5=inv40 R6_w=invP(id=0,umax_value=34359738360,var_off=(0x0; 0x7fffffff8),s32_max_value=2147483640,u32_max_value=-8) R7=map_value(id=0,off=56,ks=4,vs=8272,imm=0) R8=invP6 R9=map_value(id=0,off=20,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16=map_value fp-24=map_value fp-32=inv40 fp-40=ctx fp-48=map_value fp-56=inv1 fp-64=map_value fp-72=map_value fp-80=map_value R1 unbounded memory access, make sure to bounds check any such access processed 466 insns (limit 1000000) max_states_per_insn 2 total_states 20 peak_states 20 mark_read 3 If we add this line, as used in other BPF programs, to cap that index: index &= 7; The generated BPF program is considered safe by that version of the BPF verifier, allowing perf to collect the syscall args in one more kernel using the BPF based pointer contents collector. With the above one-liner it works with that kernel: [root@dell-per740-01 ~]# uname -a Linux dell-per740-01.khw.eng.rdu2.dc.redhat.com 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Thu Dec 7 03:06:13 EST 2023 x86_64 x86_64 x86_64 GNU/Linux [root@dell-per740-01 ~]# ~acme/bin/perf trace -e *sleep* sleep 1.234567890 0.000 (1234.704 ms): sleep/3863610 nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 234567890 }) = 0 [root@dell-per740-01 ~]# As well as with the one in Fedora 40: root@number:~# uname -a Linux number 6.11.3-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Oct 10 22:31:19 UTC 2024 x86_64 GNU/Linux root@number:~# perf trace -e *sleep* sleep 1.234567890 0.000 (1234.722 ms): sleep/14873 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 234567890 }, rmtp: 0x7ffe87311a40) = 0 root@number:~# Song Liu reported that this one-liner was being optimized out by clang 18, so I suggested and he tested that adding a compiler barrier before it made clang v18 to keep it and the verifier in the kernel in Song's case (Meta's 5.12 based kernel) also was happy with the resulting bytecode. I'll investigate using virtme-ng[1] to have all the perf BPF based functionality thoroughly tested over multiple kernels and clang versions. [1] https://kernel-recipes.org/en/2024/virtme-ng/ Cc: Adrian Hunter <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Andrea Righi <[email protected]> Cc: Howard Chu <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Clark <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kan Liang <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Song Liu <[email protected]> Link: https://lore.kernel.org/lkml/Zw7JgJc0LOwSpuvx@x1 Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

The purpose of btrfs_bbio_propagate_error() shall be propagating an error of split bio to its original btrfs_bio, and tell the error to the upper layer. However, it's not working well on some cases. * Case 1. Immediate (or quick) end_bio with an error When btrfs sends btrfs_bio to mirrored devices, btrfs calls btrfs_bio_end_io() when all the mirroring bios are completed. If that btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to increase the error count. That works well in most cases. However, if the end_io is called enough fast, orig_bbio's (remaining part after split) bio context may not be properly set at that time. Since the bio context is set when the orig_bbio (the last btrfs_bio) is sent to devices, that might be too late for earlier split btrfs_bio's completion. That will result in NULL pointer dereference. That bug is easily reproducible by running btrfs/146 on zoned devices [1] and it shows the following trace. [1] You need raid-stripe-tree feature as it create "-d raid0 -m raid1" FS. BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS torvalds#474 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-5) RIP: 0010:btrfs_bio_end_io 0xae/0xc0 [btrfs] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 RSP: 0018:ffffc9500006f248 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9500006f1dc RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080 RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001 R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58 R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158 FS: 0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body.cold 0x19/0x26 ? page_fault_oops 0x13e/0x2b0 ? _printk 0x58/0x73 ? do_user_addr_fault 0x5f/0x750 ? exc_page_fault 0x76/0x240 ? asm_exc_page_fault 0x22/0x30 ? btrfs_bio_end_io 0xae/0xc0 [btrfs] ? btrfs_log_dev_io_error 0x7f/0x90 [btrfs] btrfs_orig_write_end_io 0x51/0x90 [btrfs] dm_submit_bio 0x5c2/0xa50 [dm_mod] ? find_held_lock 0x2b/0x80 ? blk_try_enter_queue 0x90/0x1e0 __submit_bio 0xe0/0x130 ? ktime_get 0x10a/0x160 ? lockdep_hardirqs_on 0x74/0x100 submit_bio_noacct_nocheck 0x199/0x410 btrfs_submit_bio 0x7d/0x150 [btrfs] btrfs_submit_chunk 0x1a1/0x6d0 [btrfs] ? lockdep_hardirqs_on 0x74/0x100 ? __folio_start_writeback 0x10/0x2c0 btrfs_submit_bbio 0x1c/0x40 [btrfs] submit_one_bio 0x44/0x60 [btrfs] submit_extent_folio 0x13f/0x330 [btrfs] ? btrfs_set_range_writeback 0xa3/0xd0 [btrfs] extent_writepage_io 0x18b/0x360 [btrfs] extent_write_locked_range 0x17c/0x340 [btrfs] ? __pfx_end_bbio_data_write 0x10/0x10 [btrfs] run_delalloc_cow 0x71/0xd0 [btrfs] btrfs_run_delalloc_range 0x176/0x500 [btrfs] ? find_lock_delalloc_range 0x119/0x260 [btrfs] writepage_delalloc 0x2ab/0x480 [btrfs] extent_write_cache_pages 0x236/0x7d0 [btrfs] btrfs_writepages 0x72/0x130 [btrfs] do_writepages 0xd4/0x240 ? find_held_lock 0x2b/0x80 ? wbc_attach_and_unlock_inode 0x12c/0x290 ? wbc_attach_and_unlock_inode 0x12c/0x290 __writeback_single_inode 0x5c/0x4c0 ? do_raw_spin_unlock 0x49/0xb0 writeback_sb_inodes 0x22c/0x560 __writeback_inodes_wb 0x4c/0xe0 wb_writeback 0x1d6/0x3f0 wb_workfn 0x334/0x520 process_one_work 0x1ee/0x570 ? lock_is_held_type 0xc6/0x130 worker_thread 0x1d1/0x3b0 ? __pfx_worker_thread 0x10/0x10 kthread 0xee/0x120 ? __pfx_kthread 0x10/0x10 ret_from_fork 0x30/0x50 ? __pfx_kthread 0x10/0x10 ret_from_fork_asm 0x1a/0x30 </TASK> Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl CR2: 0000000000000020 * Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is called last among split bios. In that case, btrfs_orig_write_end_io() sets the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2]. Otherwise, the increased orig_bio's bioc->error is not checked by anyone and return BLK_STS_OK to the upper layer. [2] Actually, this is not true. Because we only increases orig_bioc->errors by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors" is still not met if only one split btrfs_bio fails. * Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios In contrast to the above case, btrfs_bbio_propagate_error() is not working well if un-mirrored orig_bbio is completed last. It sets orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily over-written by orig_bbio's completion status. If the status is BLK_STS_OK, the upper layer would not know the failure. * Solution Considering the above cases, we can only save the error status in the orig_bbio (remaining part after split) itself as it is always available. Also, the saved error status should be propagated when all the split btrfs_bios are finished (i.e, bbio->pending_ios == 0). This commit introduces "status" to btrfs_bbio and saves the first error of split bios to original btrfs_bio's "status" variable. When all the split bios are finished, the saved status is loaded into original btrfs_bio's status. With this commit, btrfs/146 on zoned devices does not hit the NULL pointer dereference anymore. Fixes: 852eee6 ("btrfs: allow btrfs_submit_bio to split bios") CC: [email protected] # 6.6 Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>

Running rcutorture scenario TREE05, the below warning is triggered. [ 32.604863] WARNING: suspicious RCU usage [ 32.605928] 6.11.0-rc5-00040-g4ba4f1afb6a9 #55238 Not tainted [ 32.607812] ----------------------------- [ 32.609140] kernel/events/core.c:13946 RCU-list traversed in non-reader section!! [ 32.611595] other info that might help us debug this: [ 32.614247] rcu_scheduler_active = 2, debug_locks = 1 [ 32.616392] 3 locks held by cpuhp/4/35: [ 32.617687] #0: ffffffffb666a650 (cpu_hotplug_lock){ }-{0:0}, at: cpuhp_thread_fun 0x4e/0x200 [ 32.620563] #1: ffffffffb666cd20 (cpuhp_state-down){ . .}-{0:0}, at: cpuhp_thread_fun 0x4e/0x200 [ 32.623412] #2: ffffffffb677c288 (pmus_lock){ . .}-{3:3}, at: perf_event_exit_cpu_context 0x32/0x2f0 In perf_event_clear_cpumask(), uses list_for_each_entry_rcu() without an obvious RCU read-side critical section. Either pmus_srcu or pmus_lock is good enough to protect the pmus list. In the current context, pmus_lock is already held. The list_for_each_entry_rcu() is not required. Fixes: 4ba4f1a ("perf: Generic hotplug support for a PMU with a scope") Closes: https://lore.kernel.org/lkml/2b66dff8-b827-494b-b151-1ad8d56f13e6@paulmck-laptop/ Closes: https://lore.kernel.org/oe-lkp/[email protected] Reported-by: "Paul E. McKenney" <[email protected]> Reported-by: kernel test robot <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: "Paul E. McKenney" <[email protected]> Link: https://lore.kernel.org/r/[email protected]

Add check for the return value of spi_get_csgpiod() to avoid passing a NULL pointer to gpiod_direction_output(), preventing a crash when GPIO chip select is not used. Fix below crash: [ 4.251960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 4.260762] Mem abort info: [ 4.263556] ESR = 0x0000000096000004 [ 4.267308] EC = 0x25: DABT (current EL), IL = 32 bits [ 4.272624] SET = 0, FnV = 0 [ 4.275681] EA = 0, S1PTW = 0 [ 4.278822] FSC = 0x04: level 0 translation fault [ 4.283704] Data abort info: [ 4.286583] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 4.292074] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 4.297130] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 4.302445] [0000000000000000] user address but active_mm is swapper [ 4.308805] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 4.315072] Modules linked in: [ 4.318124] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc4-next-20241023-00008-ga20ec42c5fc1 torvalds#359 [ 4.328130] Hardware name: LS1046A QDS Board (DT) [ 4.332832] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 4.339794] pc : gpiod_direction_output 0x34/0x5c [ 4.344505] lr : gpiod_direction_output 0x18/0x5c [ 4.349208] sp : ffff80008003b8f0 [ 4.352517] x29: ffff80008003b8f0 x28: 0000000000000000 x27: ffffc96bcc7e9068 [ 4.359659] x26: ffffc96bcc6e00b0 x25: ffffc96bcc598398 x24: ffff447400132810 [ 4.366800] x23: 0000000000000000 x22: 0000000011e1a300 x21: 0000000000020002 [ 4.373940] x20: 0000000000000000 x19: 0000000000000000 x18: ffffffffffffffff [ 4.381081] x17: ffff44740016e600 x16: 0000000500000003 x15: 0000000000000007 [ 4.388221] x14: 0000000000989680 x13: 0000000000020000 x12: 000000000000001e [ 4.395362] x11: 0044b82fa09b5a53 x10: 0000000000000019 x9 : 0000000000000008 [ 4.402502] x8 : 0000000000000002 x7 : 0000000000000007 x6 : 0000000000000000 [ 4.409641] x5 : 0000000000000200 x4 : 0000000002000000 x3 : 0000000000000000 [ 4.416781] x2 : 0000000000022202 x1 : 0000000000000000 x0 : 0000000000000000 [ 4.423921] Call trace: [ 4.426362] gpiod_direction_output 0x34/0x5c (P) [ 4.431067] gpiod_direction_output 0x18/0x5c (L) [ 4.435771] dspi_setup 0x220/0x334 Fixes: 9e264f3 ("spi: Replace all spi->chip_select and spi->cs_gpiod references with function call") Cc: [email protected] Signed-off-by: Frank Li <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]>

… non-PCI device The function cxl_endpoint_gather_bandwidth() invokes pci_bus_read/write_XXX(), however, not all CXL devices are presently implemented via PCI. It is recognized that the cxl_test has realized a CXL device using a platform device. Calling pci_bus_read/write_XXX() in cxl_test will cause kernel panic: platform cxl_host_bridge.3: host supports CXL (restricted) Oops: general protection fault, probably for non-canonical address 0x3ef17856fcae4fbd: 0000 [#1] PREEMPT SMP PTI Call Trace: <TASK> ? __die_body.cold 0x19/0x27 ? die_addr 0x38/0x60 ? exc_general_protection 0x1f5/0x4b0 ? asm_exc_general_protection 0x22/0x30 ? pci_bus_read_config_word 0x1c/0x60 pcie_capability_read_word 0x93/0xb0 pcie_link_speed_mbps 0x18/0x50 cxl_pci_get_bandwidth 0x18/0x60 [cxl_core] cxl_endpoint_gather_bandwidth.constprop.0 0xf4/0x230 [cxl_core] ? xas_store 0x54/0x660 ? preempt_count_add 0x69/0xa0 ? _raw_spin_lock 0x13/0x40 ? __kmalloc_cache_noprof 0xe7/0x270 cxl_region_shared_upstream_bandwidth_update 0x9c/0x790 [cxl_core] cxl_region_attach 0x520/0x7e0 [cxl_core] store_targetN 0xf2/0x120 [cxl_core] kernfs_fop_write_iter 0x13a/0x1f0 vfs_write 0x23b/0x410 ksys_write 0x53/0xd0 do_syscall_64 0x62/0x180 entry_SYSCALL_64_after_hwframe 0x76/0x7e And Ying also reported a KASAN error with similar calltrace. Reported-by: Huang, Ying <[email protected]> Closes: http://lore.kernel.org/[email protected] Fixes: a5ab0de ("cxl: Calculate region bandwidth of targets with shared upstream link") Signed-off-by: Li Zhijian <[email protected]> Reviewed-by: Dan Williams <[email protected]> Tested-by: Huang, Ying <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Ira Weiny <[email protected]>

In support of investigating an initialization failure report [1], cxl_test was updated to register mock memory-devices after the mock root-port/bus device had been registered. That led to cxl_test crashing with a use-after-free bug with the following signature: cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 [..] cxld_unregister: cxl decoder14.0: cxl_region_decode_reset: cxl_region region3: mock_decoder_reset: cxl_port port3: decoder3.0 reset 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 cxl_endpoint_decoder_release: cxl decoder14.0: [..] cxld_unregister: cxl decoder7.0: 3) cxl_region_decode_reset: cxl_region region3: Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI [..] RIP: 0010:to_cxl_port 0x8/0x60 [cxl_core] [..] Call Trace: <TASK> cxl_region_decode_reset 0x69/0x190 [cxl_core] cxl_region_detach 0xe8/0x210 [cxl_core] cxl_decoder_kill_region 0x27/0x40 [cxl_core] cxld_unregister 0x5d/0x60 [cxl_core] At 1) a region has been established with 2 endpoint decoders (7.0 and 14.0). Those endpoints share a common switch-decoder in the topology (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits the "out of order reset case" in the switch decoder. The effect though is that region3 cleanup is aborted leaving it in-tact and referencing decoder14.0. At 3) the second attempt to teardown region3 trips over the stale decoder14.0 object which has long since been deleted. The fix here is to recognize that the CXL specification places no mandate on in-order shutdown of switch-decoders, the driver enforces in-order allocation, and hardware enforces in-order commit. So, rather than fail and leave objects dangling, always remove them. In support of making cxl_region_decode_reset() always succeed, cxl_region_invalidate_memregion() failures are turned into warnings. Crashing the kernel is ok there since system integrity is at risk if caches cannot be managed around physical address mutation events like CXL region destruction. A new device_for_each_child_reverse_from() is added to cleanup port->commit_end after all dependent decoders have been disabled. In other words if decoders are allocated 0->1->2 and disabled 1->2->0 then port->commit_end only decrements from 2 after 2 has been disabled, and it decrements all the way to zero since 1 was disabled previously. Link: http://lore.kernel.org/[email protected] [1] Cc: [email protected] Fixes: 176baef ("cxl/hdm: Commit decoder state to hardware") Reviewed-by: Jonathan Cameron <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Alison Schofield <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Zijun Hu <[email protected]> Signed-off-by: Dan Williams <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Link: https://patch.msgid.link/172964782781.81806.17902885593105284330.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny <[email protected]>

Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to fail even though free pages are available in the highatomic reserves. GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock() since it's only run from reclaim. Given that such allocations will pass the watermarks in __zone_watermark_unusable_free(), it makes sense to fallback to highatomic reserves the same way that ALLOC_OOM can. This fixes order-0 page allocation failures observed on Cloudflare's fleet when handling network packets: kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0-7 CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G O 6.6.43-CUSTOM #1 Hardware name: MACHINE Call Trace: <IRQ> dump_stack_lvl 0x3c/0x50 warn_alloc 0x13a/0x1c0 __alloc_pages_slowpath.constprop.0 0xc9d/0xd10 __alloc_pages 0x327/0x340 __napi_alloc_skb 0x16d/0x1f0 bnxt_rx_page_skb 0x96/0x1b0 [bnxt_en] bnxt_rx_pkt 0x201/0x15e0 [bnxt_en] __bnxt_poll_work 0x156/0x2b0 [bnxt_en] bnxt_poll 0xd9/0x1c0 [bnxt_en] __napi_poll 0x2b/0x1b0 bpf_trampoline_6442524138 0x7d/0x1000 __napi_poll 0x5/0x1b0 net_rx_action 0x342/0x740 handle_softirqs 0xcf/0x2b0 irq_exit_rcu 0x6c/0x90 sysvec_apic_timer_interrupt 0x72/0x90 </IRQ> [[email protected]: update comment] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytzWJwQ@mail.gmail.com/ Fixes: 1d91df8 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs") Signed-off-by: Matt Fleming <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

walk_system_ram_res_rev() erroneously discards resource flags when passing the information to the callback. This causes systems with IORESOURCE_SYSRAM_DRIVER_MANAGED memory to have these resources selected during kexec to store kexec buffers if that memory happens to be at placed above normal system ram. This leads to undefined behavior after reboot. If the kexec buffer is never touched, nothing happens. If the kexec buffer is touched, it could lead to a crash (like below) or undefined behavior. Tested on a system with CXL memory expanders with driver managed memory, TPM enabled, and CONFIG_IMA_KEXEC=y. Adding printk's showed the flags were being discarded and as a result the check for IORESOURCE_SYSRAM_DRIVER_MANAGED passes. find_next_iomem_res: name(System RAM (kmem)) start(10000000000) end(1034fffffff) flags(83000200) locate_mem_hole_top_down: start(10000000000) end(1034fffffff) flags(0) [.] BUG: unable to handle page fault for address: ffff89834ffff000 [.] #PF: supervisor read access in kernel mode [.] #PF: error_code(0x0000) - not-present page [.] PGD c04c8bf067 P4D c04c8bf067 PUD c04c8be067 PMD 0 [.] Oops: 0000 [#1] SMP [.] RIP: 0010:ima_restore_measurement_list 0x95/0x4b0 [.] RSP: 0018:ffffc950000d3a80 EFLAGS: 00010286 [.] RAX: 0000000000001000 RBX: 0000000000000000 RCX: ffff89834ffff000 [.] RDX: 0000000000000018 RSI: ffff89834ffff000 RDI: ffff89834ffff018 [.] RBP: ffffc950000d3ba0 R08: 0000000000000020 R09: ffff888132b8a900 [.] R10: 4000000000000000 R11: 000000003a616d69 R12: 0000000000000000 [.] R13: ffffffff8404ac28 R14: 0000000000000000 R15: ffff89834ffff000 [.] FS: 0000000000000000(0000) GS:ffff893d44640000(0000) knlGS:0000000000000000 [.] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [.] ata5: SATA link down (SStatus 0 SControl 300) [.] CR2: ffff89834ffff000 CR3: 000001034d00f001 CR4: 0000000000770ef0 [.] PKRU: 55555554 [.] Call Trace: [.] <TASK> [.] ? __die 0x78/0xc0 [.] ? page_fault_oops 0x2a8/0x3a0 [.] ? exc_page_fault 0x84/0x130 [.] ? asm_exc_page_fault 0x22/0x30 [.] ? ima_restore_measurement_list 0x95/0x4b0 [.] ? template_desc_init_fields 0x317/0x410 [.] ? crypto_alloc_tfm_node 0x9c/0xc0 [.] ? init_ima_lsm 0x30/0x30 [.] ima_load_kexec_buffer 0x72/0xa0 [.] ima_init 0x44/0xa0 [.] __initstub__kmod_ima__373_1201_init_ima7 0x1e/0xb0 [.] ? init_ima_lsm 0x30/0x30 [.] do_one_initcall 0xad/0x200 [.] ? idr_alloc_cyclic 0xaa/0x110 [.] ? new_slab 0x12c/0x420 [.] ? new_slab 0x12c/0x420 [.] ? number 0x12a/0x430 [.] ? sysvec_apic_timer_interrupt 0xa/0x80 [.] ? asm_sysvec_apic_timer_interrupt 0x16/0x20 [.] ? parse_args 0xd4/0x380 [.] ? parse_args 0x14b/0x380 [.] kernel_init_freeable 0x1c1/0x2b0 [.] ? rest_init 0xb0/0xb0 [.] kernel_init 0x16/0x1a0 [.] ret_from_fork 0x2f/0x40 [.] ? rest_init 0xb0/0xb0 [.] ret_from_fork_asm 0x11/0x20 [.] </TASK> Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 7acf164 ("resource: add walk_system_ram_res_rev()") Signed-off-by: Gregory Price <[email protected]> Reviewed-by: Dan Williams <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: AKASHI Takahiro <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Ilpo Järvinen <[email protected]> Cc: Mika Westerberg <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

The following BUG was triggered: ============================= [ BUG: Invalid wait context ] 6.12.0-rc2-XXX torvalds#406 Not tainted ----------------------------- kworker/1:1/62 is trying to lock: ffffff8801593030 (&cpc_ptr->rmw_lock){ . .}-{3:3}, at: cpc_write 0xcc/0x370 other info that might help us debug this: context-{5:5} 2 locks held by kworker/1:1/62: #0: ffffff897ef5ec98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested 0x2c/0x50 #1: ffffff880154e238 (&sg_policy->update_lock){....}-{2:2}, at: sugov_update_shared 0x3c/0x280 stack backtrace: CPU: 1 UID: 0 PID: 62 Comm: kworker/1:1 Not tainted 6.12.0-rc2-g9654bd3e8806 torvalds#406 Workqueue: 0x0 (events) Call trace: dump_backtrace 0xa4/0x130 show_stack 0x20/0x38 dump_stack_lvl 0x90/0xd0 dump_stack 0x18/0x28 __lock_acquire 0x480/0x1ad8 lock_acquire 0x114/0x310 _raw_spin_lock 0x50/0x70 cpc_write 0xcc/0x370 cppc_set_perf 0xa0/0x3a8 cppc_cpufreq_fast_switch 0x40/0xc0 cpufreq_driver_fast_switch 0x4c/0x218 sugov_update_shared 0x234/0x280 update_load_avg 0x6ec/0x7b8 dequeue_entities 0x108/0x830 dequeue_task_fair 0x58/0x408 __schedule 0x4f0/0x1070 schedule 0x54/0x130 worker_thread 0xc0/0x2e8 kthread 0x130/0x148 ret_from_fork 0x10/0x20 sugov_update_shared() locks a raw_spinlock while cpc_write() locks a spinlock. To have a correct wait-type order, update rmw_lock to a raw spinlock and ensure that interrupts will be disabled on the CPU holding it. Fixes: 60949b7 ("ACPI: CPPC: Fix MASK_VAL() usage") Signed-off-by: Pierre Gondois <[email protected]> Link: https://patch.msgid.link/[email protected] [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>

I got a syzbot report without a repro [1] crashing in nf_send_reset6() I think the issue is that dev->hard_header_len is zero, and we attempt later to push an Ethernet header. Use LL_MAX_HEADER, as other functions in net/ipv6/netfilter/nf_reject_ipv6.c. [1] skbuff: skb_under_panic: text:ffffffff89b1d008 len:74 put:14 head:ffff88803123aa00 data:ffff88803123a9f2 tail:0x3c end:0x140 dev:syz_tun kernel BUG at net/core/skbuff.c:206 ! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 0 UID: 0 PID: 7373 Comm: syz.1.568 Not tainted 6.12.0-rc2-syzkaller-00631-g6d858708d465 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 RIP: 0010:skb_panic net/core/skbuff.c:206 [inline] RIP: 0010:skb_under_panic 0x14b/0x150 net/core/skbuff.c:216 Code: 0d 8d 48 c7 c6 60 a6 29 8e 48 8b 54 24 08 8b 0c 24 44 8b 44 24 04 4d 89 e9 50 41 54 41 57 41 56 e8 ba 30 38 02 48 83 c4 20 90 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 RSP: 0018:ffffc950045269b0 EFLAGS: 00010282 RAX: 0000000000000088 RBX: dffffc0000000000 RCX: cd66dacdc5d8e800 RDX: 0000000000000000 RSI: 0000000000000200 RDI: 0000000000000000 RBP: ffff88802d39a3d0 R08: ffffffff8174afec R09: 1ffff920008a4ccc R10: dffffc0000000000 R11: fffff520008a4ccd R12: 0000000000000140 R13: ffff88803123aa00 R14: ffff88803123a9f2 R15: 000000000000003c FS: 00007fdbee5ff6c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000005d322000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> skb_push 0xe5/0x100 net/core/skbuff.c:2636 eth_header 0x38/0x1f0 net/ethernet/eth.c:83 dev_hard_header include/linux/netdevice.h:3208 [inline] nf_send_reset6 0xce6/0x1270 net/ipv6/netfilter/nf_reject_ipv6.c:358 nft_reject_inet_eval 0x3b9/0x690 net/netfilter/nft_reject_inet.c:48 expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline] nft_do_chain 0x4ad/0x1da0 net/netfilter/nf_tables_core.c:288 nft_do_chain_inet 0x418/0x6b0 net/netfilter/nft_chain_filter.c:161 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_slow 0xc3/0x220 net/netfilter/core.c:626 nf_hook include/linux/netfilter.h:269 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] br_nf_pre_routing_ipv6 0x63e/0x770 net/bridge/br_netfilter_ipv6.c:184 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_bridge_pre net/bridge/br_input.c:277 [inline] br_handle_frame 0x9fd/0x1530 net/bridge/br_input.c:424 __netif_receive_skb_core 0x13e8/0x4570 net/core/dev.c:5562 __netif_receive_skb_one_core net/core/dev.c:5666 [inline] __netif_receive_skb 0x12f/0x650 net/core/dev.c:5781 netif_receive_skb_internal net/core/dev.c:5867 [inline] netif_receive_skb 0x1e8/0x890 net/core/dev.c:5926 tun_rx_batched 0x1b7/0x8f0 drivers/net/tun.c:1550 tun_get_user 0x3056/0x47e0 drivers/net/tun.c:2007 tun_chr_write_iter 0x10d/0x1f0 drivers/net/tun.c:2053 new_sync_write fs/read_write.c:590 [inline] vfs_write 0xa6d/0xc90 fs/read_write.c:683 ksys_write 0x183/0x2b0 fs/read_write.c:736 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64 0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe 0x77/0x7f RIP: 0033:0x7fdbeeb7d1ff Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 c9 8d 02 00 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 1c 8e 02 00 48 RSP: 002b:00007fdbee5ff000 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007fdbeed36058 RCX: 00007fdbeeb7d1ff RDX: 000000000000008e RSI: 0000000020000040 RDI: 00000000000000c8 RBP: 00007fdbeebf12be R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000008e R11: 0000000000000293 R12: 0000000000000000 R13: 0000000000000000 R14: 00007fdbeed36058 R15: 00007ffc38de06e8 </TASK> Fixes: c8d7b98 ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>

Hou Tao says: ==================== The patch set fixes several issues in bits iterator. Patch #1 fixes the kmemleak problem of bits iterator. Patch #2~#3 fix the overflow problem of nr_bits. Patch #4 fixes the potential stack corruption when bits iterator is used on 32-bit host. Patch #5 adds more test cases for bits iterator. Please see the individual patches for more details. And comments are always welcome. --- v4: * patch #1: add ack from Yafang * patch #3: revert code-churn like changes: (1) compute nr_bytes and nr_bits before the check of nr_words. (2) use nr_bits == 64 to check for single u64, preventing build warning on 32-bit hosts. * patch #4: use "BITS_PER_LONG == 32" instead of "!defined(CONFIG_64BIT)" v3: https://lore.kernel.org/bpf/[email protected]/T/#t * split the bits-iterator related patches from "Misc fixes for bpf" patch set * patch #1: use "!nr_bits || bits >= nr_bits" to stop the iteration * patch #2: add a new helper for the overflow problem * patch #3: decrease the limitation from 512 to 511 and check whether nr_bytes is too large for bpf memory allocator explicitly * patch #5: add two more test cases for bit iterator v2: http://lore.kernel.org/bpf/[email protected] ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>

Petr Machata says: ==================== mlxsw: Fixes In this patchset: - Tx header should be pushed for each packet which is transmitted via Spectrum ASICs. Patch #1 adds a missing call to skb_cow_head() to make sure that there is both enough room to push the Tx header and that the SKB header is not cloned and can be modified. - Commit b5b60bb ("mlxsw: pci: Use page pool for Rx buffers allocation") converted mlxsw to use page pool for Rx buffers allocation. Sync for CPU and for device should be done for Rx pages. In patches #2 and #3, add the missing calls to sync pages for, respectively, CPU and the device. - Patch #4 then fixes a bug to IPv6 GRE forwarding offload. Patch #5 adds a generic forwarding test that fails with mlxsw ports prior to the fix. ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>

When we compile and load lib/slub_kunit.c,it will cause a panic. The root cause is that __kmalloc_cache_noprof was directly called instead of kmem_cache_alloc,which resulted in no alloc_tag being allocated.This caused current->alloc_tag to be null,leading to a null pointer dereference in alloc_tag_ref_set. Despite the fact that my colleague Pei Xiao will later fix the code in slub_kunit.c,we still need fix null pointer check logic for ref and tag to avoid panic caused by a null pointer dereference. Here is the log for the panic: [ 74.779373][ T2158] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 [ 74.780130][ T2158] Mem abort info: [ 74.780406][ T2158] ESR = 0x0000000096000004 [ 74.780756][ T2158] EC = 0x25: DABT (current EL), IL = 32 bits [ 74.781225][ T2158] SET = 0, FnV = 0 [ 74.781529][ T2158] EA = 0, S1PTW = 0 [ 74.781836][ T2158] FSC = 0x04: level 0 translation fault [ 74.782288][ T2158] Data abort info: [ 74.782577][ T2158] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 74.783068][ T2158] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 74.783533][ T2158] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 74.784010][ T2158] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000105f34000 [ 74.784586][ T2158] [0000000000000020] pgd=0000000000000000, p4d=0000000000000000 [ 74.785293][ T2158] Internal error: Oops: 0000000096000004 [#1] SMP [ 74.785805][ T2158] Modules linked in: slub_kunit kunit ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle 4 [ 74.790661][ T2158] CPU: 0 UID: 0 PID: 2158 Comm: kunit_try_catch Kdump: loaded Tainted: G W N 6.12.0-rc3 #2 [ 74.791535][ T2158] Tainted: [W]=WARN, [N]=TEST [ 74.791889][ T2158] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 74.792479][ T2158] pstate: 40400005 (nZcv daif PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 74.793101][ T2158] pc : alloc_tagging_slab_alloc_hook 0x120/0x270 [ 74.793607][ T2158] lr : alloc_tagging_slab_alloc_hook 0x120/0x270 [ 74.794095][ T2158] sp : ffff800084d33cd0 [ 74.794418][ T2158] x29: ffff800084d33cd0 x28: 0000000000000000 x27: 0000000000000000 [ 74.795095][ T2158] x26: 0000000000000000 x25: 0000000000000012 x24: ffff80007b30e314 [ 74.795822][ T2158] x23: ffff000390ff6f10 x22: 0000000000000000 x21: 0000000000000088 [ 74.796555][ T2158] x20: ffff000390285840 x19: fffffd7fc3ef7830 x18: ffffffffffffffff [ 74.797283][ T2158] x17: ffff8000800e63b4 x16: ffff80007b33afc4 x15: ffff800081654c00 [ 74.798011][ T2158] x14: 0000000000000000 x13: 205d383531325420 x12: 5b5d383734363537 [ 74.798744][ T2158] x11: ffff800084d337e0 x10: 000000000000005d x9 : 00000000ffffffd0 [ 74.799476][ T2158] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008219d188 x6 : c0000000ffff7fff [ 74.800206][ T2158] x5 : ffff0003fdbc9208 x4 : ffff800081edd188 x3 : 0000000000000001 [ 74.800932][ T2158] x2 : 0beaa6dee1ac5a00 x1 : 0beaa6dee1ac5a00 x0 : ffff80037c2cb000 [ 74.801656][ T2158] Call trace: [ 74.801954][ T2158] alloc_tagging_slab_alloc_hook 0x120/0x270 [ 74.802494][ T2158] __kmalloc_cache_noprof 0x148/0x33c [ 74.802976][ T2158] test_kmalloc_redzone_access 0x4c/0x104 [slub_kunit] [ 74.803607][ T2158] kunit_try_run_case 0x70/0x17c [kunit] [ 74.804124][ T2158] kunit_generic_run_threadfn_adapter 0x2c/0x4c [kunit] [ 74.804768][ T2158] kthread 0x10c/0x118 [ 74.805141][ T2158] ret_from_fork 0x10/0x20 [ 74.805540][ T2158] Code: b9400a80 11000400 b9500a80 97ffd858 (f94012d3) [ 74.806176][ T2158] SMP: stopping secondary CPUs [ 74.808130][ T2158] Starting crashdump kernel... Link: https://lkml.kernel.org/r/[email protected] Fixes: e0a955b ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: Hao Ge <[email protected]> Acked-by: Suren Baghdasaryan <[email protected]> Suggested-by: Suren Baghdasaryan <[email protected]> Acked-by: Yu Zhao <[email protected]> Cc: Kent Overstreet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>

[ Upstream commit 4ed234f ] I got a syzbot report without a repro [1] crashing in nf_send_reset6() I think the issue is that dev->hard_header_len is zero, and we attempt later to push an Ethernet header. Use LL_MAX_HEADER, as other functions in net/ipv6/netfilter/nf_reject_ipv6.c. [1] skbuff: skb_under_panic: text:ffffffff89b1d008 len:74 put:14 head:ffff88803123aa00 data:ffff88803123a9f2 tail:0x3c end:0x140 dev:syz_tun kernel BUG at net/core/skbuff.c:206 ! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 0 UID: 0 PID: 7373 Comm: syz.1.568 Not tainted 6.12.0-rc2-syzkaller-00631-g6d858708d465 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 RIP: 0010:skb_panic net/core/skbuff.c:206 [inline] RIP: 0010:skb_under_panic 0x14b/0x150 net/core/skbuff.c:216 Code: 0d 8d 48 c7 c6 60 a6 29 8e 48 8b 54 24 08 8b 0c 24 44 8b 44 24 04 4d 89 e9 50 41 54 41 57 41 56 e8 ba 30 38 02 48 83 c4 20 90 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 RSP: 0018:ffffc950045269b0 EFLAGS: 00010282 RAX: 0000000000000088 RBX: dffffc0000000000 RCX: cd66dacdc5d8e800 RDX: 0000000000000000 RSI: 0000000000000200 RDI: 0000000000000000 RBP: ffff88802d39a3d0 R08: ffffffff8174afec R09: 1ffff920008a4ccc R10: dffffc0000000000 R11: fffff520008a4ccd R12: 0000000000000140 R13: ffff88803123aa00 R14: ffff88803123a9f2 R15: 000000000000003c FS: 00007fdbee5ff6c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000005d322000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> skb_push 0xe5/0x100 net/core/skbuff.c:2636 eth_header 0x38/0x1f0 net/ethernet/eth.c:83 dev_hard_header include/linux/netdevice.h:3208 [inline] nf_send_reset6 0xce6/0x1270 net/ipv6/netfilter/nf_reject_ipv6.c:358 nft_reject_inet_eval 0x3b9/0x690 net/netfilter/nft_reject_inet.c:48 expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline] nft_do_chain 0x4ad/0x1da0 net/netfilter/nf_tables_core.c:288 nft_do_chain_inet 0x418/0x6b0 net/netfilter/nft_chain_filter.c:161 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_slow 0xc3/0x220 net/netfilter/core.c:626 nf_hook include/linux/netfilter.h:269 [inline] NF_HOOK include/linux/netfilter.h:312 [inline] br_nf_pre_routing_ipv6 0x63e/0x770 net/bridge/br_netfilter_ipv6.c:184 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_bridge_pre net/bridge/br_input.c:277 [inline] br_handle_frame 0x9fd/0x1530 net/bridge/br_input.c:424 __netif_receive_skb_core 0x13e8/0x4570 net/core/dev.c:5562 __netif_receive_skb_one_core net/core/dev.c:5666 [inline] __netif_receive_skb 0x12f/0x650 net/core/dev.c:5781 netif_receive_skb_internal net/core/dev.c:5867 [inline] netif_receive_skb 0x1e8/0x890 net/core/dev.c:5926 tun_rx_batched 0x1b7/0x8f0 drivers/net/tun.c:1550 tun_get_user 0x3056/0x47e0 drivers/net/tun.c:2007 tun_chr_write_iter 0x10d/0x1f0 drivers/net/tun.c:2053 new_sync_write fs/read_write.c:590 [inline] vfs_write 0xa6d/0xc90 fs/read_write.c:683 ksys_write 0x183/0x2b0 fs/read_write.c:736 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64 0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe 0x77/0x7f RIP: 0033:0x7fdbeeb7d1ff Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 c9 8d 02 00 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 1c 8e 02 00 48 RSP: 002b:00007fdbee5ff000 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00007fdbeed36058 RCX: 00007fdbeeb7d1ff RDX: 000000000000008e RSI: 0000000020000040 RDI: 00000000000000c8 RBP: 00007fdbeebf12be R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000008e R11: 0000000000000293 R12: 0000000000000000 R13: 0000000000000000 R14: 00007fdbeed36058 R15: 00007ffc38de06e8 </TASK> Fixes: c8d7b98 ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 1c10941 ] The following BUG was triggered: ============================= [ BUG: Invalid wait context ] 6.12.0-rc2-XXX torvalds#406 Not tainted ----------------------------- kworker/1:1/62 is trying to lock: ffffff8801593030 (&cpc_ptr->rmw_lock){ . .}-{3:3}, at: cpc_write 0xcc/0x370 other info that might help us debug this: context-{5:5} 2 locks held by kworker/1:1/62: #0: ffffff897ef5ec98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested 0x2c/0x50 #1: ffffff880154e238 (&sg_policy->update_lock){....}-{2:2}, at: sugov_update_shared 0x3c/0x280 stack backtrace: CPU: 1 UID: 0 PID: 62 Comm: kworker/1:1 Not tainted 6.12.0-rc2-g9654bd3e8806 torvalds#406 Workqueue: 0x0 (events) Call trace: dump_backtrace 0xa4/0x130 show_stack 0x20/0x38 dump_stack_lvl 0x90/0xd0 dump_stack 0x18/0x28 __lock_acquire 0x480/0x1ad8 lock_acquire 0x114/0x310 _raw_spin_lock 0x50/0x70 cpc_write 0xcc/0x370 cppc_set_perf 0xa0/0x3a8 cppc_cpufreq_fast_switch 0x40/0xc0 cpufreq_driver_fast_switch 0x4c/0x218 sugov_update_shared 0x234/0x280 update_load_avg 0x6ec/0x7b8 dequeue_entities 0x108/0x830 dequeue_task_fair 0x58/0x408 __schedule 0x4f0/0x1070 schedule 0x54/0x130 worker_thread 0xc0/0x2e8 kthread 0x130/0x148 ret_from_fork 0x10/0x20 sugov_update_shared() locks a raw_spinlock while cpc_write() locks a spinlock. To have a correct wait-type order, update rmw_lock to a raw spinlock and ensure that interrupts will be disabled on the CPU holding it. Fixes: 60949b7 ("ACPI: CPPC: Fix MASK_VAL() usage") Signed-off-by: Pierre Gondois <[email protected]> Link: https://patch.msgid.link/[email protected] [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

commit 3cea8af upstream. Currently, when configuring TMU (Time Management Unit) mode of a given router, we take into account only its own TMU requirements ignoring other routers in the domain. This is problematic if the router we are configuring has lower TMU requirements than what is already configured in the domain. In the scenario below, we have a host router with two USB4 ports: A and B. Port A connected to device router #1 (which supports CL states) and existing DisplayPort tunnel, thus, the TMU mode is HiFi uni-directional. 1. Initial topology [Host] A/ / [Device #1] / Monitor 2. Plug in device #2 (that supports CL states) to downstream port B of the host router [Host] A/ B\ / \ [Device #1] [Device #2] / Monitor The TMU mode on port B and port A will be configured to LowRes which is not what we want and will cause monitor to start flickering. To address this we first scan the domain and search for any router configured to HiFi uni-directional mode, and if found, configure TMU mode of the given router to HiFi uni-directional as well. Cc: [email protected] Signed-off-by: Gil Fine <[email protected]> Signed-off-by: Mika Westerberg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

commit 101c268 upstream. In support of investigating an initialization failure report [1], cxl_test was updated to register mock memory-devices after the mock root-port/bus device had been registered. That led to cxl_test crashing with a use-after-free bug with the following signature: cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1 cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1 cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0 1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1 [..] cxld_unregister: cxl decoder14.0: cxl_region_decode_reset: cxl_region region3: mock_decoder_reset: cxl_port port3: decoder3.0 reset 2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1 cxl_endpoint_decoder_release: cxl decoder14.0: [..] cxld_unregister: cxl decoder7.0: 3) cxl_region_decode_reset: cxl_region region3: Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI [..] RIP: 0010:to_cxl_port 0x8/0x60 [cxl_core] [..] Call Trace: <TASK> cxl_region_decode_reset 0x69/0x190 [cxl_core] cxl_region_detach 0xe8/0x210 [cxl_core] cxl_decoder_kill_region 0x27/0x40 [cxl_core] cxld_unregister 0x5d/0x60 [cxl_core] At 1) a region has been established with 2 endpoint decoders (7.0 and 14.0). Those endpoints share a common switch-decoder in the topology (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits the "out of order reset case" in the switch decoder. The effect though is that region3 cleanup is aborted leaving it in-tact and referencing decoder14.0. At 3) the second attempt to teardown region3 trips over the stale decoder14.0 object which has long since been deleted. The fix here is to recognize that the CXL specification places no mandate on in-order shutdown of switch-decoders, the driver enforces in-order allocation, and hardware enforces in-order commit. So, rather than fail and leave objects dangling, always remove them. In support of making cxl_region_decode_reset() always succeed, cxl_region_invalidate_memregion() failures are turned into warnings. Crashing the kernel is ok there since system integrity is at risk if caches cannot be managed around physical address mutation events like CXL region destruction. A new device_for_each_child_reverse_from() is added to cleanup port->commit_end after all dependent decoders have been disabled. In other words if decoders are allocated 0->1->2 and disabled 1->2->0 then port->commit_end only decrements from 2 after 2 has been disabled, and it decrements all the way to zero since 1 was disabled previously. Link: http://lore.kernel.org/[email protected] [1] Cc: [email protected] Fixes: 176baef ("cxl/hdm: Commit decoder state to hardware") Reviewed-by: Jonathan Cameron <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Alison Schofield <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Zijun Hu <[email protected]> Signed-off-by: Dan Williams <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Link: https://patch.msgid.link/172964782781.81806.17902885593105284330.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

[ Upstream commit d48e1de ] The purpose of btrfs_bbio_propagate_error() shall be propagating an error of split bio to its original btrfs_bio, and tell the error to the upper layer. However, it's not working well on some cases. * Case 1. Immediate (or quick) end_bio with an error When btrfs sends btrfs_bio to mirrored devices, btrfs calls btrfs_bio_end_io() when all the mirroring bios are completed. If that btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to increase the error count. That works well in most cases. However, if the end_io is called enough fast, orig_bbio's (remaining part after split) bio context may not be properly set at that time. Since the bio context is set when the orig_bbio (the last btrfs_bio) is sent to devices, that might be too late for earlier split btrfs_bio's completion. That will result in NULL pointer dereference. That bug is easily reproducible by running btrfs/146 on zoned devices [1] and it shows the following trace. [1] You need raid-stripe-tree feature as it create "-d raid0 -m raid1" FS. BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS torvalds#474 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-5) RIP: 0010:btrfs_bio_end_io 0xae/0xc0 [btrfs] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 RSP: 0018:ffffc9500006f248 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9500006f1dc RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080 RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001 R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58 R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158 FS: 0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body.cold 0x19/0x26 ? page_fault_oops 0x13e/0x2b0 ? _printk 0x58/0x73 ? do_user_addr_fault 0x5f/0x750 ? exc_page_fault 0x76/0x240 ? asm_exc_page_fault 0x22/0x30 ? btrfs_bio_end_io 0xae/0xc0 [btrfs] ? btrfs_log_dev_io_error 0x7f/0x90 [btrfs] btrfs_orig_write_end_io 0x51/0x90 [btrfs] dm_submit_bio 0x5c2/0xa50 [dm_mod] ? find_held_lock 0x2b/0x80 ? blk_try_enter_queue 0x90/0x1e0 __submit_bio 0xe0/0x130 ? ktime_get 0x10a/0x160 ? lockdep_hardirqs_on 0x74/0x100 submit_bio_noacct_nocheck 0x199/0x410 btrfs_submit_bio 0x7d/0x150 [btrfs] btrfs_submit_chunk 0x1a1/0x6d0 [btrfs] ? lockdep_hardirqs_on 0x74/0x100 ? __folio_start_writeback 0x10/0x2c0 btrfs_submit_bbio 0x1c/0x40 [btrfs] submit_one_bio 0x44/0x60 [btrfs] submit_extent_folio 0x13f/0x330 [btrfs] ? btrfs_set_range_writeback 0xa3/0xd0 [btrfs] extent_writepage_io 0x18b/0x360 [btrfs] extent_write_locked_range 0x17c/0x340 [btrfs] ? __pfx_end_bbio_data_write 0x10/0x10 [btrfs] run_delalloc_cow 0x71/0xd0 [btrfs] btrfs_run_delalloc_range 0x176/0x500 [btrfs] ? find_lock_delalloc_range 0x119/0x260 [btrfs] writepage_delalloc 0x2ab/0x480 [btrfs] extent_write_cache_pages 0x236/0x7d0 [btrfs] btrfs_writepages 0x72/0x130 [btrfs] do_writepages 0xd4/0x240 ? find_held_lock 0x2b/0x80 ? wbc_attach_and_unlock_inode 0x12c/0x290 ? wbc_attach_and_unlock_inode 0x12c/0x290 __writeback_single_inode 0x5c/0x4c0 ? do_raw_spin_unlock 0x49/0xb0 writeback_sb_inodes 0x22c/0x560 __writeback_inodes_wb 0x4c/0xe0 wb_writeback 0x1d6/0x3f0 wb_workfn 0x334/0x520 process_one_work 0x1ee/0x570 ? lock_is_held_type 0xc6/0x130 worker_thread 0x1d1/0x3b0 ? __pfx_worker_thread 0x10/0x10 kthread 0xee/0x120 ? __pfx_kthread 0x10/0x10 ret_from_fork 0x30/0x50 ? __pfx_kthread 0x10/0x10 ret_from_fork_asm 0x1a/0x30 </TASK> Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl CR2: 0000000000000020 * Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is called last among split bios. In that case, btrfs_orig_write_end_io() sets the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2]. Otherwise, the increased orig_bio's bioc->error is not checked by anyone and return BLK_STS_OK to the upper layer. [2] Actually, this is not true. Because we only increases orig_bioc->errors by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors" is still not met if only one split btrfs_bio fails. * Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios In contrast to the above case, btrfs_bbio_propagate_error() is not working well if un-mirrored orig_bbio is completed last. It sets orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily over-written by orig_bbio's completion status. If the status is BLK_STS_OK, the upper layer would not know the failure. * Solution Considering the above cases, we can only save the error status in the orig_bbio (remaining part after split) itself as it is always available. Also, the saved error status should be propagated when all the split btrfs_bios are finished (i.e, bbio->pending_ios == 0). This commit introduces "status" to btrfs_bbio and saves the first error of split bios to original btrfs_bio's "status" variable. When all the split bios are finished, the saved status is loaded into original btrfs_bio's status. With this commit, btrfs/146 on zoned devices does not hit the NULL pointer dereference anymore. Fixes: 852eee6 ("btrfs: allow btrfs_submit_bio to split bios") CC: [email protected] # 6.6 Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 25f00a1 ] Add check for the return value of spi_get_csgpiod() to avoid passing a NULL pointer to gpiod_direction_output(), preventing a crash when GPIO chip select is not used. Fix below crash: [ 4.251960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 4.260762] Mem abort info: [ 4.263556] ESR = 0x0000000096000004 [ 4.267308] EC = 0x25: DABT (current EL), IL = 32 bits [ 4.272624] SET = 0, FnV = 0 [ 4.275681] EA = 0, S1PTW = 0 [ 4.278822] FSC = 0x04: level 0 translation fault [ 4.283704] Data abort info: [ 4.286583] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 4.292074] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 4.297130] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 4.302445] [0000000000000000] user address but active_mm is swapper [ 4.308805] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 4.315072] Modules linked in: [ 4.318124] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc4-next-20241023-00008-ga20ec42c5fc1 torvalds#359 [ 4.328130] Hardware name: LS1046A QDS Board (DT) [ 4.332832] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 4.339794] pc : gpiod_direction_output 0x34/0x5c [ 4.344505] lr : gpiod_direction_output 0x18/0x5c [ 4.349208] sp : ffff80008003b8f0 [ 4.352517] x29: ffff80008003b8f0 x28: 0000000000000000 x27: ffffc96bcc7e9068 [ 4.359659] x26: ffffc96bcc6e00b0 x25: ffffc96bcc598398 x24: ffff447400132810 [ 4.366800] x23: 0000000000000000 x22: 0000000011e1a300 x21: 0000000000020002 [ 4.373940] x20: 0000000000000000 x19: 0000000000000000 x18: ffffffffffffffff [ 4.381081] x17: ffff44740016e600 x16: 0000000500000003 x15: 0000000000000007 [ 4.388221] x14: 0000000000989680 x13: 0000000000020000 x12: 000000000000001e [ 4.395362] x11: 0044b82fa09b5a53 x10: 0000000000000019 x9 : 0000000000000008 [ 4.402502] x8 : 0000000000000002 x7 : 0000000000000007 x6 : 0000000000000000 [ 4.409641] x5 : 0000000000000200 x4 : 0000000002000000 x3 : 0000000000000000 [ 4.416781] x2 : 0000000000022202 x1 : 0000000000000000 x0 : 0000000000000000 [ 4.423921] Call trace: [ 4.426362] gpiod_direction_output 0x34/0x5c (P) [ 4.431067] gpiod_direction_output 0x18/0x5c (L) [ 4.435771] dspi_setup 0x220/0x334 Fixes: 9e264f3 ("spi: Replace all spi->chip_select and spi->cs_gpiod references with function call") Cc: [email protected] Signed-off-by: Frank Li <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit 281dd25 ] Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to fail even though free pages are available in the highatomic reserves. GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock() since it's only run from reclaim. Given that such allocations will pass the watermarks in __zone_watermark_unusable_free(), it makes sense to fallback to highatomic reserves the same way that ALLOC_OOM can. This fixes order-0 page allocation failures observed on Cloudflare's fleet when handling network packets: kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0-7 CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G O 6.6.43-CUSTOM #1 Hardware name: MACHINE Call Trace: <IRQ> dump_stack_lvl 0x3c/0x50 warn_alloc 0x13a/0x1c0 __alloc_pages_slowpath.constprop.0 0xc9d/0xd10 __alloc_pages 0x327/0x340 __napi_alloc_skb 0x16d/0x1f0 bnxt_rx_page_skb 0x96/0x1b0 [bnxt_en] bnxt_rx_pkt 0x201/0x15e0 [bnxt_en] __bnxt_poll_work 0x156/0x2b0 [bnxt_en] bnxt_poll 0xd9/0x1c0 [bnxt_en] __napi_poll 0x2b/0x1b0 bpf_trampoline_6442524138 0x7d/0x1000 __napi_poll 0x5/0x1b0 net_rx_action 0x342/0x740 handle_softirqs 0xcf/0x2b0 irq_exit_rcu 0x6c/0x90 sysvec_apic_timer_interrupt 0x72/0x90 </IRQ> [[email protected]: update comment] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytzWJwQ@mail.gmail.com/ Fixes: 1d91df8 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs") Signed-off-by: Matt Fleming <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

[ Upstream commit b125a0d ] walk_system_ram_res_rev() erroneously discards resource flags when passing the information to the callback. This causes systems with IORESOURCE_SYSRAM_DRIVER_MANAGED memory to have these resources selected during kexec to store kexec buffers if that memory happens to be at placed above normal system ram. This leads to undefined behavior after reboot. If the kexec buffer is never touched, nothing happens. If the kexec buffer is touched, it could lead to a crash (like below) or undefined behavior. Tested on a system with CXL memory expanders with driver managed memory, TPM enabled, and CONFIG_IMA_KEXEC=y. Adding printk's showed the flags were being discarded and as a result the check for IORESOURCE_SYSRAM_DRIVER_MANAGED passes. find_next_iomem_res: name(System RAM (kmem)) start(10000000000) end(1034fffffff) flags(83000200) locate_mem_hole_top_down: start(10000000000) end(1034fffffff) flags(0) [.] BUG: unable to handle page fault for address: ffff89834ffff000 [.] #PF: supervisor read access in kernel mode [.] #PF: error_code(0x0000) - not-present page [.] PGD c04c8bf067 P4D c04c8bf067 PUD c04c8be067 PMD 0 [.] Oops: 0000 [#1] SMP [.] RIP: 0010:ima_restore_measurement_list 0x95/0x4b0 [.] RSP: 0018:ffffc950000d3a80 EFLAGS: 00010286 [.] RAX: 0000000000001000 RBX: 0000000000000000 RCX: ffff89834ffff000 [.] RDX: 0000000000000018 RSI: ffff89834ffff000 RDI: ffff89834ffff018 [.] RBP: ffffc950000d3ba0 R08: 0000000000000020 R09: ffff888132b8a900 [.] R10: 4000000000000000 R11: 000000003a616d69 R12: 0000000000000000 [.] R13: ffffffff8404ac28 R14: 0000000000000000 R15: ffff89834ffff000 [.] FS: 0000000000000000(0000) GS:ffff893d44640000(0000) knlGS:0000000000000000 [.] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [.] ata5: SATA link down (SStatus 0 SControl 300) [.] CR2: ffff89834ffff000 CR3: 000001034d00f001 CR4: 0000000000770ef0 [.] PKRU: 55555554 [.] Call Trace: [.] <TASK> [.] ? __die 0x78/0xc0 [.] ? page_fault_oops 0x2a8/0x3a0 [.] ? exc_page_fault 0x84/0x130 [.] ? asm_exc_page_fault 0x22/0x30 [.] ? ima_restore_measurement_list 0x95/0x4b0 [.] ? template_desc_init_fields 0x317/0x410 [.] ? crypto_alloc_tfm_node 0x9c/0xc0 [.] ? init_ima_lsm 0x30/0x30 [.] ima_load_kexec_buffer 0x72/0xa0 [.] ima_init 0x44/0xa0 [.] __initstub__kmod_ima__373_1201_init_ima7 0x1e/0xb0 [.] ? init_ima_lsm 0x30/0x30 [.] do_one_initcall 0xad/0x200 [.] ? idr_alloc_cyclic 0xaa/0x110 [.] ? new_slab 0x12c/0x420 [.] ? new_slab 0x12c/0x420 [.] ? number 0x12a/0x430 [.] ? sysvec_apic_timer_interrupt 0xa/0x80 [.] ? asm_sysvec_apic_timer_interrupt 0x16/0x20 [.] ? parse_args 0xd4/0x380 [.] ? parse_args 0x14b/0x380 [.] kernel_init_freeable 0x1c1/0x2b0 [.] ? rest_init 0xb0/0xb0 [.] kernel_init 0x16/0x1a0 [.] ret_from_fork 0x2f/0x40 [.] ? rest_init 0xb0/0xb0 [.] ret_from_fork_asm 0x11/0x20 [.] </TASK> Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 7acf164 ("resource: add walk_system_ram_res_rev()") Signed-off-by: Gregory Price <[email protected]> Reviewed-by: Dan Williams <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: AKASHI Takahiro <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Ilpo Järvinen <[email protected]> Cc: Mika Westerberg <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Sasha Levin <[email protected]>

kerneltoast added 8 commits September 19, 2023 21:24

mm: Disable proactive compaction by default

a43aaaf

On-demand compaction works fine assuming that you don't have a need to spam the page allocator nonstop for large order page allocations. Signed-off-by: Sultan Alsawaf <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mm patches #1

mm patches #1

Rasenkai commented Sep 19, 2023

mm patches #1

mm patches #1

Conversation

Rasenkai commented Sep 19, 2023