Installing Solaris

Every once in a while -- every year or two, as it happens -- I need to install Solaris on a SPARC machine here at home. From a Linux server.

This, you might think, shouldn't be too hard a thing to do. And you'd be right: it's not. What is hard to do, however, is configure your Linux server correctly, as the various scripts and whatnot that Sun provide aren't entirely suited to the GNU tools, and have a tendency to break.

Not good.

So, here is a shortlist of issues you may face when attempting this. Google is not always helpful, and to save me the usual three hours or so it takes to work all this lot out again from first principles, here's a list of what goes wrong, how to fix it, and how to identify the problems. I make no apologies for being somewhat terse and assuming you know what you're doing; you're installing Solaris: if you don't understand how to dd something to somewhere, you're in the wrong job.

Please note that for the purposes of this demonstration, I'm installing Solaris 10.hw0606 as it happens to have a SCSI driver for the machine I'm eventually going to be installing on that doesn't barf on a 2TB LUN. These hints and problems aren't in any particular order, as the thought to write them all down has come somewhat late in the day and I've almost finished it now. Server is a home-built AMD64 PC named 'newdesktop', running Debian testing/unstable with a custom 2.6.18.1 kernel (at present). Client is a Netra T1 105, with an entire 256MB RAM and a couple of old SCSI discs, to be called 'newmach' (for reasons which are unlikely to become clear, although they do exist).

Without further ado:

Minicom: set it up 9600/8/N/1. Use a Cisco serial console cable (the cyan-coloured flat cat-5-alike RJ45 cable) with a 9-pin D-type bodge box thing on the end of it -- the ones Cisco supplies with the 800-series routers.
Better: screen -fn /dev/ttyS0 9600 in a handy screen(1) session.
apt-get install atftpd. Start with 'atftpd --daemon /tftpboot'.
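What goes in /tftpboot isn't spelled out here. A SPARC doing a net boot is generally understood to ask the TFTP server for a file named after its own IP address in upper-case hex, so the boot image copied off the media wants to be reachable under that name. A minimal sketch for working the name out, assuming that convention holds for your machine; hex_ip is a made-up helper name:

```shell
#!/bin/sh
# Print the upper-case hex form of a dotted-quad IPv4 address.
# hex_ip is a hypothetical helper, not part of any Sun or Debian tool.
hex_ip() {
    # Split the argument on dots, then restore the caller's IFS.
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

hex_ip 172.29.23.54    # prints AC1D1736
```

Symlink or copy the boot image to that name inside /tftpboot and watch the atftpd logs to confirm the client really asks for it.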
apt-get install rarpd. Add your client's MAC address to /etc/ethers:
```
8:0:20:da:11:4e 172.29.23.54
```
apt-get install bootparamd. This is where the fun really starts.
/etc/bootparams: this file is superficially easy, however there are a number of gotchas about the Linux version that don't seem to apply to the Solaris version:
- The boottype and rootopts parameters must have the server IP address before the value, ie: boottype=172.29.23.206:in and rootopts=172.29.23.206:rsize=32768; the daemon returns:
```
	getfile got question for "newmach" and file "rootopts"
	getfile failed for newmach
```
  otherwise.
- It seems that the machine needs to appear in its own line, even if all the options are correctly set in the wildcard value at the top. I may be wrong, but I'm too lazy to check this now.
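Putting the two gotchas above together, an entry might be generated like this. It's a sketch, not gospel: bootparams_entry is a made-up helper, /install is assumed to be where the media was copied, and Solaris_10/Tools/Boot is assumed to be where the boot root lives beneath it. Adjust to taste.

```shell
#!/bin/sh
# Emit one /etc/bootparams line for a single client.
# $1 = client hostname, $2 = server IP, $3 = install directory on the server.
# Assumption: the boot root sits in Solaris_10/Tools/Boot under $3.
bootparams_entry() {
    client=$1 server=$2 base=$3
    # Every value carries the server IP in front, including boottype
    # and rootopts, which the Linux daemon insists on.
    printf '%s root=%s:%s/Solaris_10/Tools/Boot install=%s:%s boottype=%s:in rootopts=%s:rsize=32768\n' \
        "$client" "$server" "$base" "$server" "$base" "$server" "$server"
}

bootparams_entry newmach 172.29.23.206 /install
```

Giving the client its own line, rather than relying on a wildcard, sidesteps the second gotcha whether or not it's strictly necessary.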
The error:
```
WARNING: /pci@7c0/pci@0/pci@1/pci@0,2/scsi@1/sd@0,0 (sd1):
        disk capacity is too large for current cdb length
```
means that you're trying to install an older Solaris onto a machine with an MPT SCSI controller (eg. a T2000) with a LUN attached to it that's >2TB. Bad luck. Try another version.

Be careful: some versions will appear to work, but won't: they just fail to complain until you use 2TB of it, when further accesses throw an IO error. Thanks, Sun...

I have, finally, found a fix for this. The latest Solaris Express (as of 20070828) release seems to work with these devices. I suggest using either that, or OpenSolaris, if appropriate to your environment.
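The arithmetic behind that limit, as a sketch: assuming the older driver only speaks 10-byte CDBs, which carry a 32-bit block address, and assuming 512-byte sectors, the ceiling works out at about 2TB. fits_32bit_lba is a hypothetical helper, not anything Solaris ships.

```shell
#!/bin/sh
# Assumptions: 512-byte sectors and a 32-bit block address.
SECTOR_BYTES=512
MAX_SECTORS=4294967296    # 2^32 addressable sectors

# Capacity ceiling in bytes (2 TiB).
echo $((MAX_SECTORS * SECTOR_BYTES))    # prints 2199023255552

# Succeeds if a LUN of $1 sectors fits within 32-bit addressing.
fits_32bit_lba() {
    [ "$1" -le "$MAX_SECTORS" ]
}
```

Feed it the sector count that format(1M) or your array reports; if it fails, expect the warning above or, worse, the silent variety.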
Your jumpstart installation directory must be exported read-only. This is critical. If it's read-write, it'll be trashed by the install process and you'll need to copy it from media again. This isn't an obvious error: the installation will fail, and subsequent attempts to fiddle and try again will also fail with interesting results. As I've not done this one in a while I don't have any symptoms for you to see.
Don't use the install scripts to install the server, or add_install_client to set up the boot environment. They'll fail. Just copy the contents from the DVD (exercise left to the reader), and edit /etc/ethers, /etc/bootparams, and /etc/exports appropriately.
Use nfs-user-server not nfs-kernel-server. The kernel server exports over NFS 3 by default, and this will fail. I'm not entirely sure why, but I believe it's something to do with NFS 3 and ACL support; Linux's ACL support is subtly different to Solaris' ($current-contract had some fun with this causing a reboot of their fileserver some time back) and either way it doesn't work. Setting rootopts to mount it NFS 2 will probably work, but I wouldn't like to guarantee your sanity when it comes to installing packages later. Best be safe. A reader pointed me at this blog post and suggested it works for him. I'm not convinced, but haven't tried it. YMMV.

Alex Stram tells me that Ubuntu no longer has nfs-user-server, and says that the blog post above works, with the addition of :vers=2 to the rootopts.
If you can get the bloody jumpstart to honour netmasks, you've managed better than me. Well done. Let me know how.
```
ifconfig eth1:1 netmask 255.255.0.0 broadcast 172.29.255.255
```
worked for me, but then I can get away with that here. Ask your network admin nicely. In theory the client sends an ICMP request for the network's netmask, but I've yet to see it.

This is important, as you'll get:
```
ERROR: bpgetfile unable to access network
/sbin/install-discovery: information: not found
#
```
(and it'll bomb out at that root prompt) as the broadcast RPC for the bootparams install_config file will fail. You can test things using bpgetfile:
```
# bpgetfile rootopts
 0.0.0.0 rsize=32768,proto=udp
#
```
for example. This also comes in handy when trying to debug the /etc/bootparams gotchas earlier in this list.
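If you do go down the ifconfig route above, the broadcast address has to agree with the netmask you pick. A minimal sketch of the sum, with broadcast_addr being a made-up helper name:

```shell
#!/bin/sh
# Compute the IPv4 broadcast address for an address and netmask.
# $1 = dotted-quad address, $2 = dotted-quad netmask.
broadcast_addr() {
    old_ifs=$IFS; IFS=.
    set -- $1 $2    # eight octets: four of address, then four of mask
    IFS=$old_ifs
    # Keep the network bits, set every host bit to 1.
    printf '%d.%d.%d.%d\n' \
        $(( ($1 & $5) | (255 - $5) )) \
        $(( ($2 & $6) | (255 - $6) )) \
        $(( ($3 & $7) | (255 - $7) )) \
        $(( ($4 & $8) | (255 - $8) ))
}

broadcast_addr 172.29.23.206 255.255.0.0    # prints 172.29.255.255
```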
Related, if you do manage to get your client to send the ICMP request, and you're attempting this on FreeBSD (possibly OpenBSD and NetBSD too, I have no idea), Alexandre Snarskii discovered that it will hang at:
```
Configured interface eri0
```
(or similar interface name) when installing Solaris 8. This is caused by FreeBSD ignoring the ICMP netmask request, so Solaris just sits waiting. The solution is apparently:
```
sysctl -w net.inet.icmp.maskrepl=1
```
and restart the installation. Please note that I haven't verified this.
Annoying boot errors like:
```
boot: cannot open kernel/sparcv9/unix
Enter filename [kernel/sparcv9/unix]:
```
can be fixed by ensuring your client can NFS mount your server. Cryptic error message, I know. Might also be related to NFS 3 problems.
clntbtcp_call: rpc cansend error is, I think, the same problem.
```
Using rules.ok from 132.185.128.21:/install/jumpstart/dickonhtest.
Checking rules.ok file...
awk: division by zero
 record number 17
/sbin/install-solaris: test: unknown operator 1
#
```
means you've got a disc with an EFI bootlabel on it that Jumpstart is attempting to consider for use as a root disc. Unfortunately, there's a bug in Jumpstart where the prtvtoc command prints a different set of details about EFI-formatted devices, and the awk script doesn't take this into account.

To fix this, dd /dev/zero over the start of /dev/rdsk/cxtxdx, and another load over the EFI bootlabel at the end. Note that as there's no device associated with cxtxdxs8 you can't access it directly, so you'll have to compute the correct seek values for dd yourself. This isn't hard: just run format(1M) on it and use the values in the slice 8 section.

You may have some luck using format -e and installing an SMI bootlabel on the disc. I've yet to try this.

Please note that creating a device with a minor number one bigger than s7's is likely to be a Bad Idea, as these things tend to be s0 of the following drive, or the following drive itself...
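The dd arithmetic described above can be sketched as follows. This only prints the commands, it doesn't run them, and it leans on assumptions you should check against your own disc: 512-byte sectors, 34 sectors being enough to cover the label at the start, and the backup label occupying the 33 sectors immediately after slice 8. efi_wipe_cmds is a made-up helper, and the device name is left for you to fill in.

```shell
#!/bin/sh
# Print (not run) dd commands to zero both ends of an EFI-labelled disc.
# $1 = raw whole-disc device
# $2 = first sector of slice 8, as shown by format(1M)
# $3 = sector count of slice 8, as shown by format(1M)
# Assumptions: 512-byte sectors; 34 sectors covers the primary label;
# the backup label is the 33 sectors following slice 8.
efi_wipe_cmds() {
    dev=$1 s8_first=$2 s8_count=$3
    echo "dd if=/dev/zero of=$dev bs=512 count=34"
    echo "dd if=/dev/zero of=$dev bs=512 seek=$s8_first count=$((s8_count + 33))"
}
```

Read the output twice before pasting it anywhere; dd will not ask whether you meant that disc.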

This little mess:

```
svc.configd: smf(5) database integrity check of:

    /etc/svc/repository.db

  failed. The database might be damaged or a media error might have
  prevented it from being verified.  Additional information useful to
  your service provider is in:

    /etc/svc/volatile/db_errors

  The system will not be able to boot until you have restored a working
  database.  svc.startd(1M) will provide a sulogin(1M) prompt for recovery
  purposes.  The command:

    /lib/svc/bin/restore_repository

  can be run to restore a backup version of your repository.  See
  http://sun.com/msg/SMF-8000-MY for more information.

Requesting System Maintenance Mode
(See /lib/svc/share/README for more information.)
svc.configd exited with status 102 (database initialization failure)
```

means that your filesystem is exported with your client's root user squashed to a nobody value. You can confirm this by trying to read /etc/svc/repository.db, which is 0600 root / sys. Ensure your exports line in /etc/exports reads something akin to:

```
/install        *(insecure,sync,ro,no_root_squash)
```

Don't ask how long it took me to find that one.
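A quick sanity check for the two options that matter most in that line (ro, from the read-only warning earlier, and no_root_squash) might look like this. It's a sketch only: export_opts_ok is a hypothetical helper that takes the comma-separated option list, not the whole exports line.

```shell
#!/bin/sh
# Succeeds only if the comma-separated export options in $1
# include both ro and no_root_squash.
export_opts_ok() {
    # Pad with commas so each option can be matched as a whole word.
    case ",$1," in *,ro,*) ;; *) return 1 ;; esac
    case ",$1," in *,no_root_squash,*) ;; *) return 1 ;; esac
}

export_opts_ok insecure,sync,ro,no_root_squash && echo "export options look sane"
```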

Incidentally, if you ifconfig up another interface on a Solaris box, don't expect it to NFS export everything from it immediately. Or, indeed, after restarting all the NFS and RPC services using svcadm. You need to re-share(1M) the filesystem first. Very odd symptoms occur otherwise, although I forget what. I gave up, and got my sysadmin to track it down. How he found it I don't know.
```
rtioctl: kstr_ioctl failed: error 128
whoami: couldn't add route: error 128.
WARNING: hme0: no response from interface
```
appears to be an error produced by the boot sequence when rpc.bootparam feeds it a default route it can't get to (eg., when you're booting it from a secondary interface and the server is sending back the address of that interface as its default route). This can be seen by:
```
bootparamd: whoami got question for 172.29.23.54
This is host newmach
Returning newmach   (none)    8x.158.xx.xxx
```
(ie, my public interface) in the rpc.bootparamd logs; avoid it by starting it as 'rpc.bootparamd -r 172.29.23.1'. On the client, the error shows up immediately after the copyright notice and 'Use is subject to license terms.' line. It appears to be harmless, but if you're installing packages from a machine which isn't on the local LAN (either by HTTP or NFS), you'll probably hit problems.
If you get a metric shedload of the following:
```
Cannot open pkginfo file /cdrom/Solaris_10/Patches/./xxxxxx-xx/pkginfo
```
then you need to ensure your Jumpstart's Patches directory only contains patch directories, and no files. Yup, odd, especially considering that Solaris patches are order-sensitive. Apparently Jumpstart installs them in mtime order (according to something I read somewhere) rather than `cat patch_order` (which install_cluster in a *_Recommended does), although I've not verified this myself. By 'shedload', I mean 'at 9600 baud it's time to go for lunch'.
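Finding the offenders is a one-liner with GNU find on the Linux server. A minimal sketch; stray_patch_files is a made-up name, and the directory you point it at depends on where you copied the media:

```shell
#!/bin/sh
# List anything sitting directly in a Patches directory that is not
# itself a directory. Uses GNU find's -mindepth/-maxdepth.
stray_patch_files() {
    find "$1" -mindepth 1 -maxdepth 1 ! -type d
}

# Example (path is an assumption):
#   stray_patch_files /install/Solaris_10/Patches
```

Move whatever it lists somewhere else rather than deleting it, in case you want the patch order back later.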

That's it for the moment. Any questions, corrections, or clarifications, let me know. No HTML mail, please; my spam filter has something of a dislike of it, and generally files it appropriately.