Odd Directory Server problem
So, I’ve run into a bit of an odd problem with Sun’s LDAP Directory Server (the latest version - 5.2P3). Occasionally, Linux clients have their authentications hang. Everything will be fine, and then you’ll request another shell (or need a process created or whatever) and on the surface at least it seems that the connection the client opens to the LDAP server just stalls. Often the first time you log into a box the auth will never succeed, but any subsequent attempts will fail.
Now, fortunately whenever this happens I get what you’d think would be a fairly useful error message on the server:
Jun 14 09:55:04 ldap01 ns-slapd[21169]: [ID 555009 user.alert] crypt: fdopen(266) failed: Too many open files
The curious part is that AFAIK the ns-slapd process has setrlimit()ed its file descriptor max up to 65K:
bash-3.00# plimit /proc/21169 21169: ./ns-slapd -D /var/opt/mps/serverroot/slapd-ldap01 -i /var/opt/mps/ser resource current maximum time(seconds) unlimited unlimited file(blocks) unlimited unlimited data(kbytes) unlimited unlimited stack(kbytes) 8480 130336 coredump(blocks) unlimited unlimited nofiles(descriptors) 65536 65536 vmemory(kbytes) unlimited unlimited
The Directory Server Console app also reports that it has 65K connections available, yet when I start seeing the problem the tool reports “Open Connections” in the 200+ range, but never more than 256. It’s almost like the file descriptor soft limit (which defaults to 256) is in fact being enforced somehow. A normal shell does report a limit of 256:
bash-3.00# ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited file size (blocks, -f) unlimited open files (-n) 256 pipe size (512 bytes, -p) 10 stack size (kbytes, -s) 8480 cpu time (seconds, -t) unlimited max user processes (-u) 8357 virtual memory (kbytes, -v) unlimited
So. Who’s telling the truth? Everything points to the process hitting the 256 open file soft limit, yet plimit and the tool itself claim they’ve got all 65K descriptors available.
While this problem started with a “stock” Directory Server 5.2 running Solaris 9, I’ve since moved to the latest Patch 3 release of Directory Server running on Solaris 10 just to see if perhaps there was some weird bug in Solaris 9 x86. But it’s the exact same behaviour on both setups.
I suppose the next step is to set rlim_fd_cur to something like 2048, reboot, and see if the problem magically goes away. I just hate it when I get an apparently inconsistent view of the system, though. So perhaps before I do the /etc/system update I’ll do some DTrace reading and see if there’s something there that might enlighten me.
Of course, the other interesting thing to ponder is why these Linux pam_ldap clients seem to open so many freakin’ connections to the LDAP server. For example, just from the mail server:
bash-3.00# netstat -a|grep ESTABLISHED|grep mercury|wc -l 59
60 connections. A relatively idle shell server has 24 connections open right now. Seems like a lot. Zope and its LDAPUserFolder, by contrast, only opens at most 3 connections yet it probably does as many or more auths per second as the Linux shell server. Is this normal?
Update: So on a hunch that ns-slapd was somehow reverting back to a 256 open file limit after it switches uid to the ldap user, I decided to update /etc/system and “set rlim_fd_cur=65536“. After reboot now even the unpriveleged ldap account should be able to open 65K files:
-bash-3.00$ id uid=389(ldap) gid=389(ldap) -bash-3.00$ ulimit -n 65536
Yet I still see the same problem, and again I *never* see more than 230 or so open connections by the ns-slapd process even when it’s spewing the “too many open files” errors. Frustrating problem, to say the least. The bit that has me curious now is what role “crypt” has to play in all of this. Most of my passwords are indeed in crypt format. Perhaps the open file limit is being hit in making calls to crypt() on password checks?
Update 2: Well just out of curiosity I wanted to see what ns-slapd was up to. So I ran a “prstat -m -p 525” on it and saw that most of the time was spent on systems calls and voluntary context switches (the two go hand in hand of course). Curious what the syscall load was I found this little DTrace script in the DTrace Guide:
root@ldap01:/var/opt/mps# dtrace -n syscall:::entry'/pid == 525/{ @syscalls[probefunc] = count(); }'
dtrace: description 'syscall:::entry' matched 229 probes
^C
lwp_exit 1
fstat64 1
lwp_sigmask 1
schedctl 1
setcontext 1
lwp_continue 1
lwp_create 1
lwp_kill 1
close 3
setsockopt 3
accept 3
getsockname 3
statvfs64 33
read 100
gtime 134
lwp_park 192
write 474
pollsys 615
So it’s really nothing surprising. The main thread is poll()ing on the socket and of course writing some data back over the network to clients. So no big break yet. Tomorrow I’m going to see if I can convert some of the passwords to native MD5 (vs. crypted MD5s) and do a few specific connection tests with a LDAP load script I wrote up in Ruby today.
Update 3: So my hunch that crypt was somehow the culprit was right. The reason, however, wasn’t what I expected. And determing the problem without the existence of OpenSolaris probably wouldn’t have been possible. After reading this warning from “man fdopen” I knew it was something in the stdio library that was getting me:
The minimum number of streams that a process can have open
at one time is FOPEN_MAX. For 32-bit applications, however,
the underlying ABIs require that no file descriptor used to
access the file underlying an stdio STREAM have a value
greater than 255.
fdopen() uses a stdio STREAM, so that’s where I’m hitting the wall. Now, why would crypt be using fdopen() from stdio at all? I floated the question in #opensolaris and happily dprice and sommerfeld offered insight. First of all, these guys are very approachable, and open about the limits of the AT&T stdio implementation that Solaris uses. After a little grumbling/joking talk about “BSD got it right the first time”, and “Thank you for using.. AT&T”, I found out that as I expected, the 255 fd limit on the Solaris 32-bit ABI has a “long, and very sad saga behind it”.
That’s not surprising, of course. What was surprising was just talking to kernel engineers about the issue, how willing they were to help, and how they openly discussed how they might work around it. So refreshing. So… fantastic, really. Back to why ns-slapd was running into fdopen() limits. Dan looked at crypt.c and had a hunch that it was trying to read /etc/security/crypt.conf and that was the problem. He dug up a Dtrace script for me to get a stack trace when the fdopen() call returned null. It didn’t take long to get a result:
root@ldap01:/etc/security# dtrace -p 525 -n 'pid$target::fdopen:return/arg1 == 0/{ustack();}'
dtrace: description 'pid$target::fdopen:return' matched 1 probe
Jun 16 16:04:59 ldap01 ns-slapd[525]: [ID 555009 user.alert] crypt: fdopen(294) failed: Too many open files
CPU ID FUNCTION:NAME
0 41216 fdopen:return
libc.so.1`0xcd78422a
libc.so.1`crypt+0x88
pwdstorage-plugin.so`crypt_pw_cmp+0x2b
libslapd.so.1`slapi_pw_find_sv+0x6a
libback-ldbm.so`ldbm_back_bind+0x140
libfe.so`bind_core_pb+0x56d
libfe.so`dispatch_operation_core_pb+0xdb
libfe.so`0xce6a024b
libfe.so`ldap_frontend_main_using_core_api+0xa7
libfe.so`generic_handle_read_data+0x95
libfe.so`0xce6a1244
libnspr4.so`_pt_root+0xc4
ns-slapd`0x81078e0
0x4d580000
The top call is to getalgbyname(), which reads crypt.conf to figure out which .so library to load for the algorithm in question. So there was the culprit! Phew! So now I have bug 6287034 open, but of course a kernel workaround isn’t going to magically appear tomorrow which puts me back at square one. I.e. how do I get this LDAP server fixed and into production?
Right now the plan is pretty simple: I’ll bring a shell server over to pam_ldap but leave SSL off so I can sniff out the LDAP packets and grab user’s plain text passwords. A little python or ruby script can do the filtering for me. Once I have the passwords I’ll put them into the directory server as “native” SSHA hashes. That should eliminate the crypt_unix() calls that ns-slapd makes to check a user’s password.









9 Comments
Jump to comment form | comments rss [?] | trackback uri [?]