Bug found in userdel: affects HP-UX 11i v1, v2 and v3

This is the first bug I’ve found in HP-UX. In the following story I have modified some data to protect our customer’s privacy. According to our customer’s white book, a user integrated into an HA package needs to be created with a symlinked home directory like this:

# ll -d /home/user1
lrwxr-xr-x 1 root root 27 Jul 21 2006 /home/user1 -> /package/home/user1
#


This setup recently caused us some problems. I needed to delete a user account on several machines. I did it the way I usually do, with a for loop started on our Ignite server. (We use key-based ssh connections to speed up such rollouts.) But the userdel command hung on two machines, both nodes of the same cluster. I interrupted userdel with ^C, but that only dropped the ssh connection; the userdel process itself kept running, using 100% of a CPU core. I only noticed the heavy CPU usage later, and started looking for the root cause. I analyzed the running userdel with the tusc tool:

# tusc -p 17635
( Attached to process 17635 ("userdel -r user123") [32-bit] )
[17635] lstat(0x4001b890, 0x7bff07c4) .................................................... [running]
[17635] lstat("/package/home/user1", 0x7bff07c4) ................................ = 0
[17635] readlink("/package/home/user1", "/home/user1", 1024) ................. = 14
[17635] lstat("/home/user1", 0x7bff07c4) .............................................. = 0
[17635] readlink("/home/user1", "/package/home/user1", 1024) ................. = 28
[17635] lstat("/package/home/user1", 0x7bff07c4) ................................ = 0
[17635] readlink("/package/home/user1", "/home/user1", 1024) ................. = 14

At this point the home directory of user1 looked suspicious to me, so I checked it:

# ll -d /home/user1
lrwxr-xr-x 1 root root 28 Jul 25 2006 /home/user1 -> /package/home/user1
# ll -d /package/home/user1
lrwxr-xr-x 1 root root 14 Jul 20 2006 /package/home/user1 -> /home/user1
#

As you can see, the home of user1 was invalid: it pointed to a symlink that pointed back to it. This sent userdel into an endless loop of lstat and readlink calls, which used up 100% CPU. The strange thing is that I never wanted to touch the user “user1”. I only wanted to delete user123, along with his home directory.

After that, I opened a software call with HP. They could reproduce the behavior on all three versions. (We only have 11i v1.) They gave me an internal change request number, QXCR1001051729. I was told that it won’t be fixed in 11i v1 since it’s not a critical bug, but I was promised that I will be informed when a patch for it comes out. On our side, the solution was to fix the broken home directory entry: I created the /package/home/user1 directory. I didn’t need to check the other machines for broken links, because userdel finished there within a second.
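Had a check of the other machines been necessary, something like the following could list symlinks that cannot be resolved, whether they are dangling or loop back on themselves. This is only a sketch and not something used in the incident above; the function name find_bad_links is made up for illustration, and it assumes a POSIX shell and find.

```shell
#!/bin/sh
# List symbolic links under a directory that cannot be resolved:
# either the target is missing, or the chain loops (a -> b -> a).
# find_bad_links is a hypothetical helper name, not an HP-UX tool.
find_bad_links() {
    # find does not follow symlinks by default, so a loop cannot trap it
    find "$1" -type l -print | while IFS= read -r link; do
        # test -e follows the link chain; it fails on dangling or looping links
        if [ ! -e "$link" ]; then
            printf '%s\n' "$link"
        fi
    done
}
```

Calling `find_bad_links /home` would then print one suspicious link per line. File names containing newlines are not handled.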

About the broken links: as far as I remember, it was mentioned on the sysnet course that such links are handled with a counter. If the counter exceeds a limit (e.g. 32), an error like this is displayed:

# ll cat dog
lrwxr-xr-x 1 root root 3 Jun 29 16:41 cat -> dog
lrwxr-xr-x 1 root root 3 Jun 29 16:41 dog -> cat
# cd cat
ksh: cat: bad directory
# cat dog
cat: Cannot open dog: Too many levels of symbolic links
#
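The counter idea can be illustrated in user space with a small shell function that follows a link chain by hand and gives up after a fixed number of hops. This is a sketch of the general technique, not the kernel's or userdel's actual code. The name resolve_link is hypothetical, and the limit of 32 is just the example value mentioned above. It only looks at the last path component, and it reads the link target from ls output, so it breaks on names containing " -> ".

```shell
#!/bin/sh
# Follow a chain of symbolic links by hand, giving up after MAX_HOPS links.
MAX_HOPS=32   # example value from the text, not a verified kernel limit

# Print the final non-symlink path, or return 1 if the chain is too long.
# resolve_link is a hypothetical helper name used for illustration.
resolve_link() {
    path=$1
    hops=0
    while [ -h "$path" ]; do
        hops=$((hops + 1))
        if [ "$hops" -gt "$MAX_HOPS" ]; then
            echo "$1: too many levels of symbolic links" >&2
            return 1
        fi
        # Extract the link target from ls output (the text after " -> ")
        target=$(ls -ld "$path" | sed 's/^.* -> //')
        case $target in
            /*) path=$target ;;                      # absolute target
            *)  path=$(dirname "$path")/$target ;;   # relative to the link's directory
        esac
    done
    printf '%s\n' "$path"
}
```

With the cat/dog pair above, `resolve_link cat` would stop after 32 hops and return a non-zero status instead of looping forever, which is the behavior userdel was missing.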

It is strange that I got two different error messages. I think it’s because of a caching mechanism, or maybe it is related to the fastlink mechanism. (Or is that feature only in HFS?) Never mind, the kernel parameter for this is unset:

# kmtune -q create_fastlinks
Parameter Current Dyn Planned Module Version
===============================================================================
create_fastlinks 0 - 0
#

So now I am waiting for the patch. Alas, our main customer has nothing but 11i v1, but I am still curious about the bug report and the fix.
