Initializing debufs under monc->mutex introduces a lock dependency for
sb->s_type->i_mutex_key, which (combined with several other dependencies)
leads to an annoying lockdep warning. There's no particular reason to do
the debugfs setup under this lock, so move it out.
It used to be the case that our first monmap could come from the OSD; that
is no longer the case with recent servers, so we will reliably set up the
client entry during the initial authentication.
We don't have to worry about racing with debugfs teardown by
ceph_debugfs_client_cleanup() because ceph_destroy_client() calls
ceph_msgr_flush() first, which will wait for the message dispatch work
to complete (and the debugfs init to complete).
Fixes: #1940
Signed-off-by: Sage Weil <sage@newdream.net>
Use kmemdup rather than duplicating its implementation
The semantic patch that makes this change is available
in scripts/coccinelle/api/memdup.cocci.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Sage Weil <sage@newdream.net>
ceph_osd_request struct allocates a 40-byte buffer for object names.
RBD image names can be up to 96 chars long (100 with the .rbd suffix),
which results in the object name for the image being truncated, and a
subsequent map failure.
Increase the oid buffer in request messages, in order to avoid the
truncation.
Signed-off-by: Stratos Psomadakis <psomas@grnet.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
If we skip over one or more map epochs, we need to resend all osd requests
because it is possible they remapped to other servers and then back.
Signed-off-by: Sage Weil <sage@newdream.net>
Change ceph_parse_ips to take either names given as
IP addresses or standard hostnames (e.g. localhost).
The DNS lookup is done using the dns_resolver facility
similar to its use in AFS, NFS, and CIFS.
This patch defines CONFIG_CEPH_LIB_USE_DNS_RESOLVER
that controls if this feature is on or off.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Any non-masked msg allocation failure should generate a warning and stack
trace to the console. All of these need to eventually be replaced by
safe preallocation or msgpools.
Signed-off-by: Sage Weil <sage@newdream.net>
The pool allocation failures are masked by the pool; there is no need to
spam the console about them. (That's the whole point of having the pool
in the first place.)
Mark msg allocations whose failure is safely handled as such.
Signed-off-by: Sage Weil <sage@newdream.net>
Allocate the mon connection on init. We already reuse it across
reconnects. Remove now unnecessary (and incomplete) NULL checks.
Signed-off-by: Sage Weil <sage@newdream.net>
This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.
Signed-off-by: Sage Weil <sage@newdream.net>
The incremental map updates have a record for each pg_temp mapping that is
to be add/updated (len > 0) or removed (len == 0). The old code was
written as if the updates were a complete enumeration; that was just wrong.
Update the code to remove 0-length entries and drop the rbtree traversal.
This avoids misdirected (and hung) requests that manifest as server
errors like
[WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11
Signed-off-by: Sage Weil <sage@newdream.net>
We need to apply the modulo pg_num calculation before looking up a pgid in
the pg_temp mapping rbtree. This fixes pg_temp mappings, and fixes
(some) misdirected requests that result in messages like
[WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11
on the server and stall make the client block without getting a reply (at
least until the pg_temp mapping goes way, but that can take a long long
time).
Reorder calc_pg_raw() a bit to make more sense.
Signed-off-by: Sage Weil <sage@newdream.net>
The r_req_lru_item list node moves between several lists, and that cycle
is not directly related (and does not begin) with __register_request().
Initialize it in the request constructor, not __register_request(). This
fixes later badness (below) when OSDs restart underneath an rbd mount.
Crashes we've seen due to this include:
[ 213.974288] kernel BUG at net/ceph/messenger.c:2193!
and
[ 144.035274] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[ 144.035278] IP: [<ffffffffa036c053>] con_work+0x1463/0x2ce0 [libceph]
Signed-off-by: Sage Weil <sage@newdream.net>
ceph_destroy_options does not free opt->mon_addr that
is allocated in ceph_parse_options.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Commit 4cf9d54463 recorded when an outgoing ceph message was ACKed,
in order to avoid unnecessary connection resets when an OSD is busy.
However, ack_stamp is uninitialized, so there is a window between
when the message is sent and when it is ACKed in which handle_timeout()
interprets the unitialized value as an expired timeout, and resets
the connection unnecessarily.
Close the window by initializing ack_stamp.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
There were several problems here:
1- we weren't tagging allocations with the pool, so they were never
returned to the pool.
2- msgpool_put didn't add back to the mempool, even it were called.
3- msgpool_release didn't clear the pool pointer, so it would have looped
had #1 not been broken.
These may or may not have been responsible for #1136 or #1381 (BUG due to
non-empty mempool on umount). I can't seem to trigger the crash now using
the method I was using before.
Signed-off-by: Sage Weil <sage@newdream.net>
Keep track of when an outgoing message is ACKed (i.e., the server fully
received it and, presumably, queued it for processing). Time out OSD
requests only if it's been too long since they've been received.
This prevents timeouts and connection thrashing when the OSDs are simply
busy and are throttling the requests they read off the network.
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
open(2) must always include one of O_RDONLY, O_WRONLY, or O_RDWR. No need
for any O_APPEND special case.
Passing O_WRONLY|O_RDWR is undefined according to the man page, but the
Linux VFS interprets this as O_RDWR, so we'll do the same.
This fixes open(2) with flags O_RDWR|O_APPEND, which was incorrectly being
translated to readonly.
Reported-by: Fyodor Ustinov <ufm@ufm.su>
Signed-off-by: Sage Weil <sage@newdream.net>
Unnecessary casts of void * clutter the code.
These are the remainder casts after several specific
patches to remove netdev_priv and dev_priv.
Done via coccinelle script:
$ cat cast_void_pointer.cocci
@@
type T;
T *pt;
void *pv;
@@
- pt = (T *)pv;
+ pt = pv;
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
Set the page count correctly for non-page-aligned IO. We were already
doing this correctly for alignment, but not the page count. Fixes
DIRECT_IO writes from unaligned pages.
Signed-off-by: Sage Weil <sage@newdream.net>
If we cancel a write, trigger the safe completions to prevent a sync from
blocking indefinitely in ceph_osdc_sync().
Signed-off-by: Sage Weil <sage@newdream.net>
When the cluster is marked full, subscribe to subsequent map updates to
ensure we find out promptly when it is no longer full. This will prevent
us from spewing ENOSPC for (much) longer than necessary.
Signed-off-by: Sage Weil <sage@newdream.net>
Old incrementals encode a 0 value (nearly always) when an osd goes down.
Change that to allow any state bit(s) to be flipped. Special case 0 to
mean flip the CEPH_OSD_UP bit to mimic the old behavior.
Signed-off-by: Sage Weil <sage@newdream.net>
Since we pass the nofail arg, we should never get an error; BUG if we do.
(And fix the function to not return an error if __map_request fails.)
Signed-off-by: Sage Weil <sage@newdream.net>
If we get a WAIT as a client something went wrong; error out. And don't
fall through to an unrelated case.
Signed-off-by: Sage Weil <sage@newdream.net>
If there is no get_authorizer method we set the out_kvec to a bogus
pointer. The length is also zero in that case, so it doesn't much matter,
but it's better not to add the empty item in the first place.
Signed-off-by: Sage Weil <sage@newdream.net>
If a connection is closed and/or reopened (ceph_con_close, ceph_con_open)
it can race with a callback. con_work does various state checks for
closed or reopened sockets at the beginning, but drops con->mutex before
making callbacks. We need to check for state bit changes after retaking
the lock to ensure we restart con_work and execute those CLOSED/OPENING
tests or else we may end up operating under stale assumptions.
In Jim's case, this was causing 'bad tag' errors.
There are four cases where we re-take the con->mutex inside con_work: catch
them all and return EAGAIN from try_{read,write} so that we can restart
con_work.
Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
If memory allocation failed, calling ceph_msg_put() will cause GPF
since some of ceph_msg variables are not initialized first.
Fix Bug #970.
Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Fix the request transition from linger -> normal request. The key is to
preserve r_osd and requeue on the same OSD. Reregister as a normal request,
add the request to the proper queues, then unregister the linger. Fix the
unregister helper to avoid clearing r_osd (and also simplify the parallel
check in __unregister_request()).
Reported-by: Henry Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
This allows us to use existence of the key type as a feature test,
from userspace.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
This makes the base64 logic be contained in mount option parsing,
and prepares us for replacing the homebew key management with the
kernel key retention service.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
We should only clear r_osd if we are neither registered as a linger or a
regular request. We may unregister as a linger while still registered as
a regular request (e.g., in reset_osd). Incorrectly clearing r_osd there
leads to a null pointer dereference in __send_request.
Also simplify the parallel check in __unregister_request() where we just
removed r_osd_item and know it's empty.
Signed-off-by: Sage Weil <sage@newdream.net>
There was a missing unlock on the error path if __map_request() failed.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
This patch fixes 'event_work' dereference before it is checked for NULL.
Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl>
Signed-off-by: Sage Weil <sage@newdream.net>
The release method for mds connections uses a backpointer to the
mds_client, so we need to flush the workqueue of any pending work (and
ceph_connection references) prior to freeing the mds_client. This fixes
an oops easily triggered under UML by
while true ; do mount ... ; umount ... ; done
Also fix an outdated comment: the flush in ceph_destroy_client only flushes
OSD connections out. This bug is basically an artifact of the ceph ->
ceph+libceph conversion.
Signed-off-by: Sage Weil <sage@newdream.net>
Lingering requests are requests that are sent to the OSD normally but
tracked also after we get a successful request. This keeps the OSD
connection open and resends the original request if the object moves to
another OSD. The OSD can then send notification messages back to us
if another client initiates a notify.
This framework will be used by RBD so that the client gets notification
when a snapshot is created by another node or tool.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
If we send a request to osd A, and the request's pg remaps to osd B and
then back to A in quick succession, we need to resend the request to A. The
old code was only calling kick_requests after processing all incremental
maps in a message, so it was very possible to not resend a request that
needed to be resent. This would make the osd eventually time out (at least
with the current default of osd timeouts enabled).
The correct approach is to scan requests on every map incremental. This
patch refactors the kick code in a few ways:
- all requests are either on req_lru (in flight), req_unsent (ready to
send), or req_notarget (currently map to no up osd)
- mapping always done by map_request (previous map_osds)
- if the mapping changes, we requeue. requests are resent only after all
map incrementals are processed.
- some osd reset code is moved out of kick_requests into a separate
function
- the "kick this osd" functionality is moved to kick_osd_requests, as it
is unrelated to scanning for request->pg->osd mapping changes
Signed-off-by: Sage Weil <sage@newdream.net>
It used to return -EINVAL because it thought the end was not aligned
to 4 bytes.
Clean up superfluous src < end test in if, the while itself guarantees
that.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>