Binder code is very hot, so checking frequently to see if a debug
message should be printed is a waste of cycles. We're not debugging
binder, so just stub out the debug prints to compile them out entirely.
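Roughly, the change turns the conditional print into a no-op (sketch
below; binder_debug() is the macro in drivers/android/binder.c, shown
simplified):

  /* Before: every call checks the debug mask at runtime. */
  #define binder_debug(mask, x...) \
          do { \
                  if (binder_debug_mask & mask) \
                          pr_info_ratelimited(x); \
          } while (0)

  /* After: compiled out entirely; no_printk() keeps format checking. */
  #define binder_debug(mask, x...) no_printk(x)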
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Sometimes, it may be desirable to use CPU frequency tables different
from the ones in the hardware's OSM LUTs. This commit adds support for
overriding each CPU's frequency table with a list of allowed frequencies
defined in the OSM driver's DT node.
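A minimal sketch of the override path, assuming a hypothetical
per-policy "qcom,freq-table-<n>" property (the real property name may
differ):

  static int osm_read_dt_freq_override(struct device_node *np, int index,
                                       u32 *freqs, int max_entries)
  {
          char prop[32];
          int count;

          snprintf(prop, sizeof(prop), "qcom,freq-table-%d", index);
          count = of_property_count_u32_elems(np, prop);
          if (count <= 0)
                  return 0;       /* no override: keep the OSM LUT table */

          count = min(count, max_entries);
          if (of_property_read_u32_array(np, prop, freqs, count))
                  return 0;

          return count;           /* number of allowed frequencies */
  }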
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Joel Fernandes found that synchronize_rcu_tasks() was taking a
significant amount of time. He demonstrated it with the following test:
# cd /sys/kernel/tracing
# while [ 1 ]; do x=1; done &
# echo '__schedule_bug:traceon' > set_ftrace_filter
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m1.064s
user 0m0.000s
sys 0m0.004s
Where it takes a little over a second to perform the synchronization,
because there's a loop that waits one second at a time for tasks to get
through their quiescent points when there's a task that must be waited
for.
After discussion, we came up with a simple way to wait for holdouts:
increase the wait time on each iteration of the loop, but never wait
longer than a full second.
With the new patch we have:
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m0.131s
user 0m0.000s
sys 0m0.004s
Which drops it down to 13% of what the original wait time was.
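The new loop looks roughly like this (simplified sketch of the change
in kernel/rcu/update.c):

  int fract = 10;         /* first sleep is HZ/10 */

  while (!list_empty(&rcu_tasks_holdouts)) {
          schedule_timeout_interruptible(HZ / fract);
          if (fract > 1)
                  fract--;        /* sleep longer next time, capped at HZ */

          /* ... recheck holdouts and report stalls as before ... */
  }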
Link: http://lkml.kernel.org/r/20180523063815.198302-2-joel@joelfernandes.org
Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Make sure the compiler optimises away conditions that are always false
since commit b91319892e ("cpufreq: schedutil: Don't jump to max
frequency for RT tasks").
Change-Id: I7a108ff1a4ba09f2cb82ea8a82bd15967e724709
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Currently, the raw cache is reset when next_f is changed after
get_next_freq() for correctness. However, that reset can cost extra
cycles in those cases. Change it to maintain the cached value instead
of dropping it.
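A simplified sketch of the idea in sugov_update_single(): save the raw
value up front and restore it when next_f is overridden, instead of
resetting the cache:

  unsigned int cached_freq = sg_policy->cached_raw_freq;

  next_f = get_next_freq(sg_policy, util, max);
  if (busy && next_f < sg_policy->next_freq) {
          next_f = sg_policy->next_freq;
          /* Keep cached_raw_freq consistent with the reused next_f. */
          sg_policy->cached_raw_freq = cached_freq;
  }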
Bug: 159936782
Bug: 158863204
Signed-off-by: Wei Wang <wvw@google.com>
[dereference23: Backport to 4.14]
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: I519ca02dd2e6038e3966e1f68fee641628827c82
The cpufreq_schedutil governor keeps a cache of the last
raw frequency that was mapped to a supported device frequency.
If the next request for a frequency matches the cached
value, the policy's next_freq value is reused. But there
are paths that can update the raw cached value without
updating the next_freq value, and there are paths that
can set the next_freq value without setting the raw
cached value. On those paths, the cached value
must be reset.
The case that has been observed is when a frequency request reaches
sugov_update_commit() but is then rejected by the
sugov_up_down_rate_limit() check.
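A simplified sketch of the fix in sugov_update_commit(): when the rate
limiter rejects the request, next_freq stays unchanged, so the raw
cache must be invalidated to keep the two in sync:

  if (sugov_up_down_rate_limit(sg_policy, time, next_freq)) {
          sg_policy->cached_raw_freq = 0;
          return;
  }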
Bug: 116279565
Change-Id: I7c585339a04ff1732054d6e5b36a57e2d41266aa
Signed-off-by: John Dias <joaodias@google.com>
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Part of the fix from commit d86ab9cff8 ("cpufreq: schedutil: use now
as reference when aggregating shared policy requests") is reversed in
commit 05d2ca2420 ("cpufreq: schedutil: Ignore CPU load older than
WALT window size") due to a porting mistake. Restore it while keeping
the relevant change from the latter patch.
Bug: 117438867
Test: build & boot
Change-Id: I21399be760d7c8e2fff6c158368a285dc6261647
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Bug: 120438505
Change-Id: I59e3675a320ce71c3c90be3904756b125300ba6b
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
This reverts commit b8b6f565c0.
CAF's hispeed boost and predicted load features aren't any good. Remove
them entirely to prevent userspace from trying to enable them
(specifically pl) and to reduce useless overhead in schedutil, since it
runs *very* often.
Change-Id: I0446b49a59e5dce8e1b7712bdb654c9a5e6ff0ed
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
This is tuned to match energy model characteristics and scheduler
efficiency enhancements.
Change-Id: Ia60e1ea888457fa1c0c0273cdd4b0180f0a87abf
Co-authored-by: Diep Quynh <remilia.1505@gmail.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Qualcomm's LLCC controller does not have an error IRQ line on lito and
instead polls to check memory banks for errors every 5 seconds, which is
inefficient and will add to system jitter.
The generic Kryo CPU cache controller does have error IRQ lines so it
doesn't need to use polling, but EDAC in general is fairly useless in
its current state anyway because Google disabled the option to panic on
uncorrectable error. Let's follow their decision and just disable EDAC
entirely, as well as its placeholder RAS dependency.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
__rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
__rmqueue_cma_fallback() are all in the page allocator's hot path and
had better finish as soon as possible. One way to make them faster is
to make them inline. But as Andrew Morton and Andi Kleen pointed out:
https://lkml.org/lkml/2017/10/10/1252
https://lkml.org/lkml/2017/10/10/1279
To make sure they are inlined, we should use __always_inline for them.
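The change itself is mechanical; a representative hunk looks roughly
like this (body elided):

  static __always_inline
  struct page *__rmqueue(struct zone *zone, unsigned int order,
                         int migratetype)
  {
          /* ... unchanged hot-path body ... */
  }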
With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
processes to stress buddy, the results for will-it-scale.processes with
and without the patch are:
On a 2-socket Intel Skylake machine:

  compiler        base       head
  gcc-4.4.7    6496131    6911823  +6.4%
  gcc-4.9.4    7225110    7731072  +7.0%
  gcc-5.4.1    7054224    7688146  +9.0%
  gcc-6.2.0    7059794    7651675  +8.4%

On a 4-socket Intel Skylake machine:

  compiler        base       head
  gcc-4.4.7   13162890   13508193  +2.6%
  gcc-4.9.4   14997463   15484353  +3.2%
  gcc-5.4.1   14708711   15449805  +5.0%
  gcc-6.2.0   14574099   15349204  +5.3%
The above four compilers are used because I've done the tests through
Intel's Linux Kernel Performance (LKP) infrastructure and they are the
compilers available there.
The benefit is smaller on the 4-socket machine because the lock
contention there (perf-profile/native_queued_spin_lock_slowpath = 81%)
is less severe than on the 2-socket machine (85%).
What the benchmark does: it forks nr_cpu processes, and each process
runs the following in a loop:
1. mmap() 128M of anonymous space;
2. write to each page there to trigger actual page allocation;
3. munmap() it.
https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
Binary size wise, I have locally built them with different compilers:
[aaron@aaronlu obj]$ size */*/mm/page_alloc.o
text data bss dec hex filename
37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o
38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o
37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o
38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o
37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o
38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o
36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o
37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o
Text size increased about 800 bytes for mm/page_alloc.o.
[aaron@aaronlu obj]$ size */*/vmlinux
text data bss dec hex filename
10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux
10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux
10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux
10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux
10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux
10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux
10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux
10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux
Text size for vmlinux has no change though, probably due to function
alignment.
Link: http://lkml.kernel.org/r/20171013063111.GA26032@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Commit 60a3cdd063 ("x86: add optimized inlining") introduced
CONFIG_OPTIMIZE_INLINING, but it has been available only for x86.
The idea is obviously arch-agnostic. This commit moves the config entry
from arch/x86/Kconfig.debug to lib/Kconfig.debug so that all
architectures can benefit from it.
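A simplified sketch of what the option controls (the real definitions
in include/linux/compiler_types.h carry a few more attributes):

  #ifdef CONFIG_OPTIMIZE_INLINING
  /* 'inline' is only a hint; the compiler may still decide not to. */
  #define inline inline notrace
  #else
  /* 'inline' is forced, as it has always been on non-x86. */
  #define inline inline __attribute__((__always_inline__)) notrace
  #endif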
This can make a huge difference in kernel image size especially when
CONFIG_OPTIMIZE_FOR_SIZE is enabled.
For example, I got 3.5% smaller arm64 kernel for v5.1-rc1.
dec file
18983424 arch/arm64/boot/Image.before
18321920 arch/arm64/boot/Image.after
This also slightly improves the "Kernel hacking" Kconfig menu as
e61aca5158 ("Merge branch 'kconfig-diet' from Dave Hansen") suggested;
this config option would be a good fit in the "compiler option" menu.
Link: http://lkml.kernel.org/r/20190423034959.13525-12-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[kdrag0n: Backported to k4.14]
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
-O3 is much more stable with modern compilers these days than it was a
decade ago. Using -O3 on the kernel results in significantly improved
hackbench performance, which is a sign that overall performance in the
kernel is improved. It works especially well in conjunction with LTO.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
There's plenty of room on the stack for a few more inlined bytes here
and there. The measured stack usage at runtime is still safe without
this, and performance is surely improved at a microscopic level, so
remove it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
SD720G doesn't have a 2400000 KHz frequency; it only goes up to
2323200 KHz.
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
[dereference23: Adapted for atoll]
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Now that pstore_register() can correctly pass max_reason to the kmsg
dump facility, introduce a new "max_reason" module parameter and
"max-reason" Device Tree field.
The "dump_oops" module parameter and "dump-oops" Device Tree field are
now considered deprecated, but are automatically converted to their
corresponding max_reason values when present, though the new max_reason
setting has precedence.
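A sketch of the compatibility logic (variable names illustrative): the
new max_reason setting wins, and the deprecated dump_oops flag is
mapped onto an equivalent max_reason value when that's all we have:

  if (ramoops_max_reason >= 0)
          max_reason = ramoops_max_reason;
  else if (ramoops_dump_oops)
          max_reason = KMSG_DUMP_OOPS;
  else
          max_reason = KMSG_DUMP_PANIC;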
For struct ramoops_platform_data, the "dump_oops" member is entirely
replaced by a new "max_reason" member, with the only existing user
updated in place.
Additionally remove the "reason" filter logic from ramoops_pstore_write(),
as that is not specifically needed anymore, though technically
this is a change in behavior for any ramoops users also setting the
printk.always_kmsg_dump boot param, which will cause ramoops to behave as
if max_reason was set to KMSG_DUMP_MAX.
Co-developed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-6-keescook@chromium.org/
Signed-off-by: Kees Cook <keescook@chromium.org>
Add a new member to struct pstore_info for passing information about
the kmsg dump maximum reason. This allows finer control of which kmsg
dumps are sent to pstore storage backends.
Those backends that do not explicitly set this field (keeping it equal
to 0) get the default behavior: store only Oopses and Panics, or
everything if the printk.always_kmsg_dump boot param is set.
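The resulting filter looks roughly like this (simplified sketch of the
kmsg dump path):

  enum kmsg_dump_reason max_reason = dumper->max_reason;

  /* 0 == KMSG_DUMP_UNDEF: the backend did not pick one, use defaults. */
  if (max_reason == KMSG_DUMP_UNDEF)
          max_reason = always_kmsg_dump ? KMSG_DUMP_MAX : KMSG_DUMP_OOPS;

  if (reason > max_reason)
          return;         /* this backend does not want this dump */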
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-5-keescook@chromium.org/
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Since the header is a fixed small maximum size, just use a stack variable
to avoid memory allocation in the write path.
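A sketch of the write path after the change (buffer size illustrative,
bounded by the fixed header format):

  char hdr[36];   /* "===="(4) + %lld(20) + "."(1) + %06lu(6) + "-%c\n"(3) */
  size_t len;

  len = scnprintf(hdr, sizeof(hdr), RAMOOPS_KERNMSG_HDR "%lld.%06lu-%c\n",
                  (time64_t)record->time.tv_sec,
                  record->time.tv_nsec / 1000,
                  record->compressed ? 'C' : 'D');
  persistent_ram_write(prz, hdr, len);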
Signed-off-by: Kees Cook <keescook@chromium.org>
If a zero-length header is written in ramoops_write_kmsg_hdr(), we
will not be able to read the dmesg record back later, since it will be
treated as an invalid header in ramoops_pstore_read(). So we should not
execute the following code, but return an error instead.
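A simplified sketch of the fix in ramoops_pstore_write(): stop early
when the header could not be written:

  hlen = ramoops_write_kmsg_hdr(prz, record);
  if (!hlen)
          return -ENOMEM;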
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Since only a single ramoops area is allowed at a time, additional
probes (e.g. from the device tree) are meaningless and only waste CPU
resources. So check whether we are already initialized first.
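A simplified sketch of the early exit in ramoops_probe(); the single
static context doubles as an "already initialized" flag:

  if (cxt->max_dump_cnt) {
          pr_err("already initialized\n");
          err = -EBUSY;
          goto fail_out;
  }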
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Sometimes pstore_console_write() will write zero-size records to the
persistent ram zone, which is unnecessary; it only increases resource
consumption. Also adjust ramoops_write_kmsg_hdr() to follow the same
logic when memory allocation fails.
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
The UFS callback is the most time-consuming callback in the dpm_resume
phase of kernel resume, taking around 30 ms. Making it async improves
resume latency by around 20 ms, and helps decrease suspend times as
well.
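The mechanism isn't shown in this log; one plausible sketch is to mark
the UFS device so the PM core runs its callbacks asynchronously (hba
being the UFS host controller instance):

  device_enable_async_suspend(hba->dev);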
Bug: 134704391
Change-Id: I708c8a7bc8f2250d6b2365971ccc394c7fbf8896
Signed-off-by: Vincent Palomares <paillon@google.com>
Although try_to_freeze_tasks() stops when there's a wakeup, it doesn't
return an error when it successfully freezes everything it wants to freeze.
As a result, the suspend attempt can continue even after a wakeup is
issued. Although the wakeup will be eventually caught later in the suspend
process, kicking the can down the road is suboptimal; when there's a wakeup
detected, suspend should be immediately aborted by returning an error
instead. Make try_to_freeze_tasks() do just that, and also move the wakeup
check above the `todo` check so that we don't miss a wakeup from a process
that successfully froze.
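A simplified sketch of the reordered loop exit in
try_to_freeze_tasks():

  if (pm_wakeup_pending()) {
          wakeup = true;
          break;
  }
  if (!todo || time_after(jiffies, end_time))
          break;

  /* ... and after the loop ... */
  if (wakeup)
          return -EBUSY;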
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I6d0ff54b1e1e143df2679d3848019590725c6351
PM callbacks can be used as an indicator that the system is suspending, and
as such, a wakeup may be issued outside the notifier in order to cancel an
ongoing suspend. Clearing wakeups after running the PM callbacks can thus
cause wakeups to be missed. Fix it by simply clearing wakeups prior to
running PM callbacks.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I49b6f900284ff42b245be77548f6ca756cf585ee
Calling s2idle_wake() once after pm_abort_suspend is incremented is all
that's needed to wake up from s2idle. Avoid multiple unnecessary attempts
to wake from s2idle by only doing the wakeup when pm_abort_suspend hits 1.
The s2idle machinery already provides the synchronization needed to make
this safe.
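A simplified sketch of the change in pm_system_wakeup(): only kick the
s2idle machinery on the 0 -> 1 transition of the counter:

  void pm_system_wakeup(void)
  {
          if (atomic_inc_return(&pm_abort_suspend) == 1)
                  s2idle_wake();
  }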
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I789857351353737f7199a8df5c5ba210050352e6
Freezing processes on Android usually takes less than 100 ms, and if it
takes longer than that to the point where the 20 second freeze timeout is
reached, it's because the remaining processes to be frozen are deadlocked
waiting for something from a process which is already frozen. There's no
point in burning power trying to freeze for that long, so reduce the freeze
timeout to a very generous 1 second for Android and don't let anything mess
with it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
When some CPUs cycle out of s2idle due to non-wakeup IRQs, it's possible
for them to run while the CPU responsible for jiffies updates remains idle.
This can delay the execution of timers indefinitely until the CPU managing
the jiffies updates finally wakes up, by which point everything could be
dead if enough time passes.
Fix it by handing off timekeeping duties when the timekeeping CPU enters
s2idle and freezes its tick. When all CPUs are in s2idle, the first one to
wake up for any reason (either from a wakeup IRQ or non-wakeup IRQ) will
assume responsibility for the timekeeping tick.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: Idbb2e825c489e174d5701e0c315b51a3149bfe49
Replace 'schedule(); try_to_freeze();' with a call to freezable_schedule().
Tasks calling freezable_schedule() set the PF_FREEZER_SKIP flag
before calling schedule(). Unlike tasks calling 'schedule();
try_to_freeze();', tasks calling freezable_schedule() are not awakened
by try_to_freeze_tasks(). Instead they call try_to_freeze() when they
wake up, if the freeze is still underway.
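For reference, freezable_schedule() from include/linux/freezer.h is
roughly:

  static inline void freezable_schedule(void)
  {
          freezer_do_not_count();  /* sets PF_FREEZER_SKIP */
          schedule();
          freezer_count();         /* clears it and calls try_to_freeze() */
  }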
It is not a problem since sleeping tasks can't do anything which isn't
allowed for a frozen task while sleeping.
The result is a potential performance gain during freeze, since fewer
tasks have to be awakened.
For instance, on a bare Debian VM running a 4.19 stable kernel, the
number of tasks skipped in freeze_task() went up from 12 without the
patch to 32 with the patch (out of 448), an increase of more than 2.5x.
Signed-off-by: Hugo Lefeuvre <hle@owl.eu.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190207200352.GA27859@behemoth.owl.eu.com.local
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change-Id: I1b90daf9ebc8cdfaffc5fe8fe8f1883b9588a6c6
* This wakelock is spurious and does not get destroyed, which leaves
it pending. Remove this code to save power.
* To enable these wakelocks again, pass `-DIPA_WAKELOCKS` as a
cflag.
Signed-off-by: Vaisakh Murali <mvaisakh@statixos.com>
Change-Id: I4a424f7bddbfb793edbdeb3b6ac1b69d2f4a676e
The original dedup code does not handle collisions, based on the
observation that they practically do not happen.
For additional peace of mind, use a bigger hash size to reduce the
possibility of a collision even further.
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: I4543ce22d69a21b2edcc72a72790d2c55a9e312c
Reference: 59e1a2f4bf83 ("ksm: replace jhash2 with xxhash")
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: Icb6221a96d0d59e6bbc1c908ec1cd5e59d4a7117
xxhash's performance depends heavily on compiler optimizations,
including inlining. Follow upstream's behavior and inline those helper
functions.
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: Ieee439865dc48f21a97032a8009941331c302227
These simply wrap memcpy().
Replace them with macros so that they are naturally inlined.
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: I67887a706816aff46339d43b2f28e97813c1d4af
A measurably significant amount of CPU time is spent on logging events
for debugging purposes in lpm_cpuidle_enter. Kill the useless logging to
reduce overhead.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Change-Id: I9d34ece7e910d92e6f1a55fa85b3db5828fbbb9e
Waking the GPU upon touch wastes power when the screen is being touched
in a way that does not induce animation or any actual need for GPU usage.
Instead of preemptively waking the GPU on touch input, wake it up upon
receiving an IOCTL_KGSL_GPU_COMMAND ioctl, since it is a sign that the GPU
will soon be needed.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I6387083562578b229ea0913b5b2fa6562d4a85e9
When the rotator is actually used (still an unsolved question in
computer science), these PM QoS requests block some CPUs in the LITTLE
cluster from entering deep idle because the driver assumes that display
rotating work occurs on a hardcoded set of CPUs, which is false. We
already have the IRQ PM QoS machinery for display rendering operations
that actually matter, so this cruft is unneeded.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I30542fa3009dd46ff0c21058f14c8923d7fcb892