arm64 provides an always-working implementation of futex_atomic_cmpxchg_inatomic(),
so there is no need to check it at runtime.
Change-Id: Id4b9ba07d979fddbdac9f2aaa5250b1487ff9042
Reported-by: Piyush swami <Piyush.swami@arm.com>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: atndko <z1281552865@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
It is probably safe to assume that all Armv8-A implementations have a
multiplier whose efficiency is comparable to or better than a sequence of
three or so register-dependent arithmetic instructions. Select
ARCH_HAS_FAST_MULTIPLIER to get ever-so-slightly nicer codegen in the
few dusty old corners which care.
In a contrived benchmark calling hweight64() in a loop, this does indeed
turn out to be a small win overall, with no measurable impact on
Cortex-A57 but about 5% performance improvement on Cortex-A53.
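For the curious, the difference shows up in the generic hweight path; a
simplified sketch of the lib/hweight.c idiom (kernel u64 type assumed):

static inline unsigned int hweight64_sketch(u64 w)
{
        /* Pairwise reduction: per-2-bit, per-4-bit, then per-byte counts. */
        w -= (w >> 1) & 0x5555555555555555ULL;
        w  = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
        w  = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
        /* One multiply sums the eight byte counts into the top byte. */
        return (w * 0x0101010101010101ULL) >> 56;
#else
        /* Otherwise: a few more register-dependent shifts and adds. */
        w += w >> 8;
        w += w >> 16;
        return (w + (w >> 32)) & 0xff;
#endif
}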
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Throwing our __uint128_t idioms at csum_ipv6_magic() makes it
about 1.3x-2x faster across a range of microarchitecture/compiler
combinations. Not much in absolute terms, but every little helps.
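As a rough illustration of the idiom (not the actual arm64 routine): the
64-bit halves of both addresses, plus the length/protocol words, are added
into a 128-bit accumulator, and the carries are folded only once at the end:

static inline unsigned short csum128_fold_sketch(unsigned __int128 sum)
{
        unsigned long long lo = (unsigned long long)sum;
        unsigned long long hi = (unsigned long long)(sum >> 64);

        lo += hi;               /* add the high half into the low half... */
        lo += (lo < hi);        /* ...with end-around carry */
        lo  = (lo & 0xffffffff) + (lo >> 32);   /* fold 64 -> 32 */
        lo  = (lo & 0xffffffff) + (lo >> 32);
        lo  = (lo & 0xffff) + (lo >> 16);       /* fold 32 -> 16 */
        lo  = (lo & 0xffff) + (lo >> 16);
        return (unsigned short)~lo;             /* one's-complement result */
}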
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
In validating the checksumming results of the new routine, I sadly
neglected to test its not-checksumming results. Thus it slipped through
that the one case where @buff is already dword-aligned and @len = 0
manages to defeat the tail-masking logic and behave as if @len = 8.
For a zero length it doesn't make much sense to dereference @buff anyway,
so just add an early return (which has essentially zero impact on
performance).
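The fix amounts to a guard at the top of the routine, along these lines:

        if (unlikely(len == 0)) /* nothing to sum: don't touch @buff */
                return 0;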
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Apparently there exist certain workloads which rely heavily on software
checksumming, for which the generic do_csum() implementation becomes a
significant bottleneck. Therefore let's give arm64 its own optimised
version - for ease of maintenance this foregoes assembly or intrinsics,
and is thus not actually arm64-specific, but does rely heavily on C
idioms that translate well to the A64 ISA and the typical load/store
capabilities of most ARMv8 CPU cores.
The resulting increase in checksum throughput scales nicely with buffer
size, tending towards 4x for a small in-order core (Cortex-A53), and up
to 6x or more for an aggressive big core (Ampere eMAG).
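A minimal sketch of the underlying idiom (not the actual routine, which
also handles unaligned heads/tails and unrolls the main loop): accumulate
64-bit loads with end-around carry, then fold the result down to 16 bits.

#include <stdint.h>
#include <stddef.h>

/* One's-complement add: fold the carry out of the top back into bit 0. */
static inline uint64_t accumulate(uint64_t sum, uint64_t data)
{
        sum += data;
        return sum + (sum < data);
}

/* Sketch: assumes an 8-byte-aligned buffer whose length is a multiple of 8. */
static uint16_t csum_sketch(const uint64_t *p, size_t len)
{
        uint64_t sum = 0;

        for (; len >= 8; len -= 8)
                sum = accumulate(sum, *p++);

        sum = (sum & 0xffffffff) + (sum >> 32); /* fold 64 -> 32 */
        sum = (sum & 0xffffffff) + (sum >> 32);
        sum = (sum & 0xffff) + (sum >> 16);     /* fold 32 -> 16 */
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}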
Reported-by: Lingyan Huang <huanglingyan2@huawei.com>
Tested-by: Lingyan Huang <huanglingyan2@huawei.com>
Change-Id: I42f718428ee872541006b3932dc010dd3f8b0f28
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
A lot of CPU time is wasted on allocating, populating, and copying
debug names back and forth with userspace when they're not actually
needed. We can't just remove the name buffers from the various sync data
structures though, because we must preserve ABI compatibility with
userspace; instead we can just pretend the name fields of the
user-shared structs aren't there. This massively reduces the size of the
memory allocated for these data structures and the amount of data copied
to and from userspace, and eliminates a kzalloc() entirely from
sync_file_ioctl_fence_info(), thus improving graphics performance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This kernel won't be used on devices with 7nm, 14nm, or 28nm PLLs, nor
will it be used with the 10nm DisplayPort PLL, since our only display is
connected via DSI.
Don't compile support for PLLs we won't use.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
[dereference23: Adapted for atoll]
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
This kernel will not be used on devices with other GPUs.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: I205e625dc1520c8baef622bc6fd303f714c6f7d6
This unused debug print wastes CPU time when writing to registers,
resulting in perf top reporting a decent chunk of time spent inside
sde_reg_write(). Removing the debug print gets sde_reg_write() off perf
top's radar.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
These debug logs are everywhere and not only bloat the driver, but add
latency everywhere they're used because they're not compiled out. Since
they serve no purpose for us as we're not debugging SDE, compile them
out.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
A measurably significant amount of CPU time is spent in these routines
while the camera is open. These are also responsible for a grotesque
amount of dmesg spam, so let's nuke them.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
While Android userspace (e.g. storaged) does use iostats via
/proc/diskstats, init will explicitly enable iostats for the devices on
which it is primarily used - sda and sdf. Avoid the 0.5-1% overhead for
block devices that do not need it.
Signed-off-by: kdrag0n <dragon@khronodragon.com>
CRC errors on the SPI bus usually mean there is something wrong with the
hardware (unstable voltage, wiring, etc.).
Disable SPI CRC in favor of improving performance, as the cost of
detecting hardware errors this way is too high and the detection is not
all that useful.
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Change-Id: I5d7ef9dedbddf8d7f4c4911788051a7753eb67d8
Binder code is very hot, so checking frequently to see if a debug
message should be printed is a waste of cycles. We're not debugging
binder, so just stub out the debug prints to compile them out entirely.
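The stubbing boils down to something like the following (illustrative; the
real macro also keys off a debug mask), so format strings are still
type-checked but dead-code elimination removes the call entirely:

#define binder_debug(mask, fmt, ...)                             \
        do {                                                     \
                if (0)                                           \
                        pr_info_ratelimited(fmt, ##__VA_ARGS__); \
        } while (0)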
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Sometimes, it may be desirable to use CPU frequency tables different
from the ones in the hardware's OSM LUTs. This commit adds support for
overriding each CPU's frequency table with a list of allowed frequencies
defined in the OSM driver's DT node.
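A hedged sketch of the DT plumbing; the property name
"qcom,cpufreq-table-0" and the helper below are illustrative, not
necessarily what the patch uses (needs <linux/of.h> and <linux/slab.h>):

/* Returns the number of override entries, 0 if none, or a -errno. */
static int osm_read_override_table(struct device_node *np, u32 **freqs)
{
        int n = of_property_count_u32_elems(np, "qcom,cpufreq-table-0");
        int ret;

        if (n <= 0)
                return 0;       /* no override: keep the hardware LUT */

        *freqs = kcalloc(n, sizeof(**freqs), GFP_KERNEL);
        if (!*freqs)
                return -ENOMEM;

        ret = of_property_read_u32_array(np, "qcom,cpufreq-table-0", *freqs, n);
        return ret ? ret : n;
}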
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Joel Fernandes found that synchronize_rcu_tasks() was taking a
significant amount of time. He demonstrated it with the following test:
# cd /sys/kernel/tracing
# while [ 1 ]; do x=1; done &
# echo '__schedule_bug:traceon' > set_ftrace_filter
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m1.064s
user 0m0.000s
sys 0m0.004s
Where it takes a little over a second to perform the synchronize,
because there's a loop that waits 1 second at a time for tasks to get
through their quiescent points when there's a task that must be waited
for.
After discussion we came up with a simple way to wait for holdouts:
increase the wait time on each iteration of the loop, but by no more
than a full second.
With the new patch we have:
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m0.131s
user 0m0.000s
sys 0m0.004s
Which drops it down to 13% of what the original wait time was.
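Schematically, the waiting loop now backs off gradually instead of
sleeping a fixed second per pass (simplified; the per-task holdout scan is
elided into a hypothetical check_holdouts() helper):

        /* Start with a HZ/10 wait and slowly back off to a 1-second wait. */
        int fract = 10;

        while (!list_empty(&rcu_tasks_holdouts)) {
                schedule_timeout_interruptible(HZ / fract);
                if (fract > 1)
                        fract--;
                check_holdouts();       /* report/clear quiescent tasks */
        }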
Link: http://lkml.kernel.org/r/20180523063815.198302-2-joel@joelfernandes.org
Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Make sure the compiler optimises away conditions that are always false
since commit b91319892e ("cpufreq: schedutil: Don't jump to max frequency for RT tasks").
Change-Id: I7a108ff1a4ba09f2cb82ea8a82bd15967e724709
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Currently, the raw cache is reset when next_f is changed after get_next_freq, for correctness. However, this may cost extra cycles on later frequency updates. This patch maintains the cached value instead of dropping it.
Bug: 159936782
Bug: 158863204
Signed-off-by: Wei Wang <wvw@google.com>
[dereference23: Backport to 4.14]
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: I519ca02dd2e6038e3966e1f68fee641628827c82
The cpufreq_schedutil governor keeps a cache of the last
raw frequency that was mapped to a supported device frequency.
If the next request for a frequency matches the cached
value, the policy's next_freq value is reused. But there
are paths that can update the raw cached value without
updating the next_freq value, and there are paths that
can set the next_freq value without setting the raw
cached value. On those paths, the cached value
must be reset.
The case that has been observed is when a frequency request
reaches sugov_update_commit but is then rejected by the
sugov_up_down_rate_limit check.
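Schematically, the fix clears the cached value on that rejection path
(field names follow schedutil; the exact placement is illustrative):

        if (sugov_up_down_rate_limit(sg_policy, time, next_freq)) {
                /* next_freq was never applied, so the cached raw value
                 * no longer corresponds to sg_policy->next_freq. */
                sg_policy->cached_raw_freq = 0;
                return;
        }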
Bug: 116279565
Change-Id: I7c585339a04ff1732054d6e5b36a57e2d41266aa
Signed-off-by: John Dias <joaodias@google.com>
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Part of the fix from commit d86ab9cff8 ("cpufreq: schedutil: use now
as reference when aggregating shared policy requests") is reversed in
commit 05d2ca2420 ("cpufreq: schedutil: Ignore CPU load older than
WALT window size") due to a porting mistake. Restore it while keeping
the relevant change from the latter patch.
Bug: 117438867
Test: build & boot
Change-Id: I21399be760d7c8e2fff6c158368a285dc6261647
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Bug: 120438505
Change-Id: I59e3675a320ce71c3c90be3904756b125300ba6b
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
This reverts commit b8b6f565c0.
CAF's hispeed boost and predicted load features aren't any good. Remove
them entirely to prevent userspace from trying to enable them
(specifically pl) and to reduce useless overhead in schedutil, since it
runs *very* often.
Change-Id: I0446b49a59e5dce8e1b7712bdb654c9a5e6ff0ed
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
This is tuned to match energy model characteristics and scheduler
efficiency enhancements.
Change-Id: Ia60e1ea888457fa1c0c0273cdd4b0180f0a87abf
Co-authored-by: Diep Quynh <remilia.1505@gmail.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Qualcomm's LLCC controller does not have an error IRQ line on lito and
instead polls to check memory banks for errors every 5 seconds, which is
inefficient and will add to system jitter.
The generic Kryo CPU cache controller does have error IRQ lines so it
doesn't need to use polling, but EDAC in general is fairly useless in
its current state anyway because Google disabled the option to panic on
uncorrectable error. Let's follow their decision and just disable EDAC
entirely, as well as its placeholder RAS dependency.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
__rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
__rmqueue_cma_fallback() are all in the page allocator's hot path and
should complete as quickly as possible. One way to make them faster is
by making them inline. But as Andrew Morton and Andi Kleen pointed out:
https://lkml.org/lkml/2017/10/10/1252
https://lkml.org/lkml/2017/10/10/1279
To make sure they are inlined, we should use __always_inline for them.
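The mechanics are just a stronger hint. Illustratively (the real patch
applies this to the four helpers above; <linux/compiler.h> provides the
attribute):

/*
 * "inline" is only a suggestion that the compiler may ignore at -O2/-Os;
 * __always_inline forces expansion at every call site.
 */
static __always_inline unsigned long buddy_hot_helper(unsigned int order)
{
        return 1UL << order;    /* illustrative stand-in for __rmqueue() etc. */
}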
With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
processes to stress buddy, the results for will-it-scale.processes with
and without the patch are:
On a 2-sockets Intel-Skylake machine:
compiler base head
gcc-4.4.7 6496131 6911823 +6.4%
gcc-4.9.4 7225110 7731072 +7.0%
gcc-5.4.1 7054224 7688146 +9.0%
gcc-6.2.0 7059794 7651675 +8.4%
On a 4-sockets Intel-Skylake machine:
compiler base head
gcc-4.4.7 13162890 13508193 +2.6%
gcc-4.9.4 14997463 15484353 +3.2%
gcc-5.4.1 14708711 15449805 +5.0%
gcc-6.2.0 14574099 15349204 +5.3%
The above 4 compilers are used because I've done the tests through
Intel's Linux Kernel Performance (LKP) infrastructure and they are the
compilers available there.
The benefit is smaller on the 4-socket machine because the lock
contention there (perf-profile/native_queued_spin_lock_slowpath = 81%)
is less severe than on the 2-socket machine (85%).
What the benchmark does is: it forks nr_cpu processes and then each
process does the following:
1 mmap() 128M anonymous space;
2 writes to each page there to trigger actual page allocation;
3 munmap() it.
in a loop.
https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
Binary size wise, I have locally built them with different compilers:
[aaron@aaronlu obj]$ size */*/mm/page_alloc.o
text data bss dec hex filename
37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o
38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o
37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o
38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o
37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o
38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o
36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o
37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o
Text size increased about 800 bytes for mm/page_alloc.o.
[aaron@aaronlu obj]$ size */*/vmlinux
text data bss dec hex filename
10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux
10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux
10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux
10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux
10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux
10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux
10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux
10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux
Text size for vmlinux has no change though, probably due to function
alignment.
Link: http://lkml.kernel.org/r/20171013063111.GA26032@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Commit 60a3cdd063 ("x86: add optimized inlining") introduced
CONFIG_OPTIMIZE_INLINING, but it has been available only for x86.
The idea is obviously arch-agnostic. This commit moves the config entry
from arch/x86/Kconfig.debug to lib/Kconfig.debug so that all
architectures can benefit from it.
This can make a huge difference in kernel image size especially when
CONFIG_OPTIMIZE_FOR_SIZE is enabled.
For example, I got a 3.5% smaller arm64 kernel for v5.1-rc1.
dec file
18983424 arch/arm64/boot/Image.before
18321920 arch/arm64/boot/Image.after
This also slightly improves the "Kernel hacking" Kconfig menu, as
e61aca5158 ("Merge branch 'kconfig-diet' from Dave Hansen") suggested;
this config option would be a good fit in the "compiler option" menu.
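In simplified terms, the option gates how the kernel defines "inline" (a
sketch; the exact attribute list in include/linux/compiler_types.h varies
by version):

#ifdef CONFIG_OPTIMIZE_INLINING
/* Trust the compiler's own inlining heuristics (smaller images with -Os). */
#define inline  inline notrace
#else
/* Historical behavior: every "inline" function is forcibly inlined. */
#define inline  inline __attribute__((__always_inline__)) notrace
#endif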
Link: http://lkml.kernel.org/r/20190423034959.13525-12-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[kdrag0n: Backported to k4.14]
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
-O3 is much more stable with modern compilers these days than it was a
decade ago. Using -O3 on the kernel results in significantly improved
hackbench performance, which is a sign that overall performance in the
kernel is improved. It works especially well in conjunction with LTO.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There's plenty of room on the stack for a few more inlined bytes here
and there. The measured stack usage at runtime is still safe without
this, and performance is surely improved at a microscopic level, so
remove it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
SD720G doesn't have a 2400000 kHz frequency; it only goes up to 2323200.
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
[dereference23: Adapted for atoll]
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Now that pstore_register() can correctly pass max_reason to the kmsg
dump facility, introduce a new "max_reason" module parameter and
"max-reason" Device Tree field.
The "dump_oops" module parameter and "dump-oops" Device
Tree field are now considered deprecated, but are automatically
converted to their corresponding max_reason values when present, though
the new max_reason setting has precedence.
For struct ramoops_platform_data, the "dump_oops" member is entirely
replaced by a new "max_reason" member, with the only existing user
updated in place.
Additionally remove the "reason" filter logic from ramoops_pstore_write(),
as that is not specifically needed anymore, though technically
this is a change in behavior for any ramoops users also setting the
printk.always_kmsg_dump boot param, which will cause ramoops to behave as
if max_reason was set to KMSG_DUMP_MAX.
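The compatibility rule boils down to something like this in the ramoops
parameter parsing (a sketch; variable names are illustrative and -1 means
"not set"):

        if (ramoops_max_reason >= 0)            /* new knob takes precedence */
                pdata.max_reason = ramoops_max_reason;
        else if (ramoops_dump_oops >= 0)        /* deprecated knob as fallback */
                pdata.max_reason = ramoops_dump_oops ? KMSG_DUMP_OOPS
                                                     : KMSG_DUMP_PANIC;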
Co-developed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Add a new member to struct pstore_info for passing information about
the kmsg dump maximum reason. This allows finer control of which kmsg
dumps are sent to pstore storage backends.
Those backends that do not explicitly set this field (keeping it equal to
0) get the default behavior: store only Oopses and Panics, or everything
if the printk.always_kmsg_dump boot param is set.
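The default then amounts to roughly the following in the pstore core (a
sketch, not the verbatim code; always_kmsg_dump stands for the
printk.always_kmsg_dump boot parameter):

        /* Backends that leave max_reason at 0 keep the historical policy. */
        if (psinfo->max_reason <= 0)
                psinfo->max_reason = always_kmsg_dump ? KMSG_DUMP_MAX
                                                      : KMSG_DUMP_OOPS;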
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-5-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Since the header has a small, fixed maximum size, just use a stack variable
to avoid memory allocation in the write path.
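A sketch of the idea (buffer size and format are illustrative of the
ramoops header, not copied verbatim):

static size_t write_kmsg_hdr_sketch(struct persistent_ram_zone *prz,
                                    struct pstore_record *record)
{
        char hdr[36];   /* "====" + seconds + '.' + microseconds + "-C\n" */
        size_t len;

        len = scnprintf(hdr, sizeof(hdr), "====%lld.%06lu-%c\n",
                        (long long)record->time.tv_sec,
                        (unsigned long)(record->time.tv_nsec / 1000),
                        record->compressed ? 'C' : 'D');
        persistent_ram_write(prz, hdr, len);
        return len;
}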
Signed-off-by: Kees Cook <keescook@chromium.org>
If a zero-length header is produced in ramoops_write_kmsg_hdr(), we will
not be able to read back the dmesg record later, since it will be
treated as an invalid header in ramoops_pstore_read(). So we should not
execute the following code but return an error instead.
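In other words, the dmesg write path should bail out rather than append a
record body after a failed header write; roughly:

        hlen = ramoops_write_kmsg_hdr(prz, record);
        if (!hlen)
                return -ENOMEM;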
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Since only a single ramoops area is allowed at a time, additional probes
(e.g. from the device tree) are meaningless and only waste CPU
resources. So let's check whether it is already initialized first.
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Sometimes pstore_console_write() will write zero-size records to the
persistent ram zone, which is unnecessary and only increases resource
consumption. Also adjust ramoops_write_kmsg_hdr() to apply the same
logic if memory allocation fails.
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>