Monday, February 18, 2008

iSCSI vs NFS

[Image: VM measurement during iSCSI transfer]
[Image: Host measurement during iSCSI transfer]


Lately I've been measuring the performance of iSCSI and NFS. I transferred a 3 GB ISO file to each storage pool, testing the iSCSI pool first. As you can see above, there is a bit more CPU activity with iSCSI due to all the extra network activity. That would concern me a bit in a large-scale ESX environment. I'll be onsite in a few days at a place that uses iSCSI for the whole site, and I'm interested to see how it performs.
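
If you want to reproduce a rough version of this comparison from a shell rather than from the performance graphs, something like the following works. The ISO path and mount points here are placeholders, not necessarily what I used:

time cp /tmp/big.iso /mnt/iscsi/     # push the ISO to the iSCSI pool and time it
time cp /tmp/big.iso /mnt/nfs/       # same thing against the NFS pool
vmstat 1 > cpu-during-copy.txt       # in a second terminal, capture CPU activity while the copy runs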

Saturday, February 2, 2008


Setting up ESX from scratch

I am going to attempt to document my steps to build an ESX cluster here. A good friend of mine suggested I do so, so here goes. The goal is to build a fully functional two-host ESX 3.5 cluster with an iSCSI array holding and serving the VM images.

I started out with my good old ASUS P4 3.0 GHz. The ESX 3.5 boot CD immediately kernel panicked, so that one was out right out of the gate. I went to my local supplier and picked up a new Intel D945GCNL with a Core2Duo 2.2 GHz proc and 2 gigs of RAM. I finally managed to get ESX loaded on it, but it was very unstable. That's what I get for using a desktop board; ESX is really picky about hardware. This is one product where you really need to follow the VMware HCL. So after looking around some more here, I decided not to mess around anymore and just drop the dough on a setup I know will work. I'm going to order the ASUS P5M2/SAS on Monday. In the meantime I'm setting up my storage array, and I've decided to make the new Intel system the array. Below are the very rough steps I took so far; using them, the array is now up and running. I tried bonding two of the Intel gig network interfaces together to increase throughput, though it turned out later that VMware Server does not take too kindly to that. In the box are two SATA II 160 GB drives configured for RAID0. So far a pretty nice box.

Building an iSCSI array
Install Fedora 8
fdisk the new drives (rough commands sketched below)
format ext3
yum install mc (a good console-based file management tool)
transfer files to test locally: 56-61 MB/s according to Midnight Commander
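
For reference, a minimal sketch of the fdisk/format steps above. The device names match the ones that show up in the mdadm command further down; adjust to your box:

fdisk /dev/sdb        # one primary partition, type fd (Linux raid autodetect) since it will join an md array
fdisk /dev/sdc        # same on the second drive
mkfs.ext3 /dev/sdb1   # quick ext3 format to sanity-check each drive
mkfs.ext3 /dev/sdc1
mount /dev/sdb1 /mnt/test   # /mnt/test is a throwaway mount point for the local transfer test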

Create a RAID0 array on the two 160 GB drives
mdadm --create --verbose /dev/md0 --level=0 --raid-devices=2 /dev/sdb1 /dev/sdc1
fdisk the array
format the array (see the sketch below)
transfer files to test locally to the array: 58-62 MB/s according to Midnight Commander
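
A minimal sketch of getting a filesystem onto the new array and mounted. I ended up with the filesystem on /dev/md0 itself, which is what the fstab line further down reflects:

mkfs.ext3 /dev/md0        # put ext3 straight on the md device
mkdir -p /mnt/array
mount /dev/md0 /mnt/array
df -h /mnt/array          # sanity check before the local transfer test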

Set up RAID0 permanently for this box
The below will set up the /etc/mdadm.conf file:
# mdadm --detail --scan
ARRAY /dev/md0 level=raid0 num-devices=2 UUID=c0137e30:dbc0a7f8:02efaf64:0df28dad
Copy and paste the above line into /etc/mdadm.conf (or use the one-liner below).
update /etc/fstab so the array mounts at boot time.
/dev/md0 /mnt/array ext3 defaults 1 1
reboot to test
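
If you'd rather not copy and paste, the same thing can be done in one shot, and mount -a is a cheap way to check the new fstab entry before trusting a reboot:

mdadm --detail --scan >> /etc/mdadm.conf   # append the ARRAY line directly
mkdir -p /mnt/array
mount -a                                   # anything wrong with the fstab entry shows up here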

Set up NIC bonding
NIC bonding is sort of like RAID for the network: you are teaming the NICs together to share the load and provide failover capability. There are lots of different terms used to describe the same thing. Linux uses the term "bonding". In the Windows world the word "teaming" is often used. Cisco calls it "channeling", and Sun calls it "trunking". Cisco also has "trunking", but that is NOT the same thing; trunking, according to Cisco, is when a particular port on the switch is given access to several networks or VLANs. Since I'm at home I don't have to deal with this, but I do deal with it daily at work, and it's good to know all of this going in. Follow the instructions from this link. They are spot on and worked the first time for me. Thanks nixCraft!
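
For reference, this is roughly the shape of the config the nixCraft guide walks you through on Fedora. The bonding mode, IP address, and interface names here are illustrative assumptions, not a copy of my exact files:

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=balance-alb miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.21
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for ifcfg-eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

# then bring it all up
service network restart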


Test file transfers across the newly bonded network via NFS
I'm getting speeds of about 35-38 MB/s when pushing a 3 GB ISO to the array. Not bad, I guess. I posted over at linuxquestions.org to see where I stand. One thing to watch out for when bonding: at first I did all this using cheap Netgear gig NICs I already had. The bonding steps above appeared to work, but when I actually looked at the NIC activity I noticed that only one NIC was active. The active load balancing used by the Linux bonding module is basically round robin, which means that when one adapter is busy the other takes the load. The Intel cards already running in the other box were both active, so I've since replaced the Netgear NICs with Intel ones. NIC activity on both boxes is the same now.
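
A quick way to check whether both slaves are actually carrying traffic is to look at the bonding status in /proc and watch the per-interface byte counters while a big copy is running:

cat /proc/net/bonding/bond0      # shows the bonding mode and the state of each slave
watch -n1 cat /proc/net/dev      # eth0 and eth1 counters should both be climbing during a transfer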

Performance Tweaks
Since I'm using NFS, I used the noatime option to improve performance; the array is also mounted on the host box with noatime. This really helped improve performance. I'm currently using Bonnie to get disk baselines. Bonnie would not let me use files bigger than 2 GB because the system is 32-bit, and the -v switch that is supposed to get around that did not work, so I was limited to 2 GB. I performed this over NFS from a P4 to the box with the array.
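
For reference, this is roughly what the export and the noatime mount look like. The export options and the array's IP are reasonable guesses for illustration rather than a copy of my exact setup:

# /etc/exports on the array box
/mnt/array 192.168.1.0/24(rw,async,no_root_squash)

# on the array box, re-read the exports
exportfs -ra

# on the host box; 192.168.1.10 stands in for the array's address
mount -t nfs -o noatime,rsize=32768,wsize=32768 192.168.1.10:/mnt/array /mnt/homeserver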

[root@vmserver ~]# bonnie -d /mnt/homeserver/tmp -s 2000
Bonnie 1.4: File '/mnt/homeserver/tmp/Bonnie.2385', size: 2097152000, volumes: 1
Writing with putc()... done: 32780 kB/s 91.1 %CPU
Rewriting... done: 21012 kB/s 14.9 %CPU
Writing intelligently... done: 48896 kB/s 20.7 %CPU
Reading with getc()... done: 31673 kB/s 85.1 %CPU
Reading intelligently... done: 48975 kB/s 17.1 %CPU
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
vmserv 1*2000 32780 91.1 48896 20.7 21012 14.9 31673 85.1 48975 17.1 1955.9 8.7

Here are the results locally on the array itself. The results are a bit skewed because I have more RAM in the array box.

[root@homeserver array]# bonnie -d /mnt/array/tmp -s 2000 -m homeserver
Bonnie: Warning: You have 2016MB RAM, but you test with only 2000MB datasize!
Bonnie: This might yield unrealistically good results,
Bonnie: for reading and seeking and writing.
Bonnie 1.4: File '/mnt/array/tmp/Bonnie.4864', size: 2097152000, volumes: 1
Writing with putc()... done: 53798 kB/s 99.9 %CPU
Rewriting... done: 89594 kB/s 23.8 %CPU
Writing intelligently... done: 117621 kB/s 37.4 %CPU
Reading with getc()... done: 57670 kB/s 99.5 %CPU
Reading intelligently... done: 346498 kB/s 32.8 %CPU
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
homese 1*2000 53798 99.9 117621 37.4 89594 23.8 57670 99.5 346498 32.8 3785.4 6.6

Next I'll need to set up VMware to run the VSA from LeftHand Networks. They offer a demo of their product. This software will be used to create an iSCSI target that will store the VMs.

Setting up the iSCSI initiator
So I've spent all morning running into problems. The LeftHand demo needs to run inside of VMware, so I've installed VMware Server 1.0.4 on my array box and set up the VSA with a 20 GB virtual array. I bridged the VM's network to my two bonded NICs. Apparently VMware or the VSA is not too happy with that, because it does not stay on the network for long, and when it does I get DUP packets until it just drops off. Consequently I cannot run the management console to set up the VSA.

[root@homeserver LeftHand]# ping iscsi
PING iscsi (10.10.10.10) 56(84) bytes of data.
64 bytes from iscsi (10.10.10.10): icmp_seq=6 ttl=64 time=0.188 ms
64 bytes from iscsi (10.10.10.10): icmp_seq=7 ttl=64 time=0.176 ms
64 bytes from iscsi (10.10.10.10): icmp_seq=7 ttl=64 time=0.259 ms (DUP!)
64 bytes from iscsi (10.10.10.10): icmp_seq=8 ttl=64 time=0.201 ms
64 bytes from iscsi (10.10.10.10): icmp_seq=8 ttl=64 time=0.257 ms (DUP!)
64 bytes from iscsi (10.10.10.10): icmp_seq=9 ttl=64 time=0.260 ms
64 bytes from iscsi (10.10.10.10): icmp_seq=9 ttl=64 time=0.360 ms (DUP!)

--- iscsi ping statistics ---
9 packets transmitted, 4 received, +3 duplicates, 55% packet loss, time 7999ms
rtt min/avg/max/mdev = 0.176/0.243/0.360/0.058 ms


I've since switched the VSA over to the third NIC, which is bridged to another network. I'm really not sure whether my problems come from doing all of this on a single switch; maybe I need to get another switch. For now I'm just going to run it over the single gig NIC. The network issues are gone with the third NIC, and I am now able to run the VSA console and configure the virtual array.

Now that I've configured my target array, I'm setting up my initiator on my old P4. First I needed to install the iSCSI initiator. For that I used yum:

[root@vmserver mnt]# yum search iscsi
netbsd-iscsi.i386 : User-space implementation of iSCSI target from NetBSD project
scsi-target-utils.i386 : The SCSI target daemon and utility programs
iscsi-initiator-utils.i386 : iSCSI daemon and utility programs
[root@vmserver mnt]# yum -y install iscsi-initiator-utils.i386
Setting up Install Process
Parsing package install arguments
Resolving Dependencies
--> Running transaction check
---> Package iscsi-initiator-utils.i386 0:6.2.0.865-0.2.fc8 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

Installed: iscsi-initiator-utils.i386 0:6.2.0.865-0.2.fc8
Complete!

You gotta love yum! Configuring iSCSI is pretty easy. I played around with this last week and was able to do it using this site. With the Red Hat/Fedora iSCSI initiator there are two files to watch out for.

/etc/iscsi/initiatorname.iscsi Open this file and edit it. You'll need to make sure that you have the correct iSCSI target in here. Here are the contents of my file for reference. You'll need to put your target IQN in this file, otherwise it will not work.

[root@vmserver ~]# cat /etc/iscsi/initiatorname.iscsi
InitiatorName=iqn.2003-10.com.lefthandnetworks:home:11:homevol1
InitiatorAlias=iqn.2003-10.com.lefthandnetworks:home:11:homevol1

I needed both lines in there to make it work.

/etc/iscsi/iscsid.conf If you are using CHAP authentication, you'll put the login credentials here. Otherwise, leave it alone at first.
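
For reference, these are the CHAP-related settings in iscsid.conf. The username and password here are obviously placeholders:

# /etc/iscsi/iscsid.conf
node.session.auth.authmethod = CHAP
node.session.auth.username = chapuser
node.session.auth.password = chappassword
discovery.sendtargets.auth.authmethod = CHAP
discovery.sendtargets.auth.username = chapuser
discovery.sendtargets.auth.password = chappassword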

These two files are read by the iSCSI initiator on startup. If the information is incorrect you'll get:

iscsiadm: discovery login to 192.168.1.20 rejected: initiator error (02/02), non-retryable, giving up

This is one of those tidbits of information that I was unable to find on the web but did find in the man pages. There will be very little in the way of logging. It drove me crazy for about an hour, which is why I am writing all of this down. I was assuming that the "discovery" would actually discover the iSCSI target; after all, I was specifying the IP of the array in the command. What it was actually doing was reading /etc/iscsi/initiatorname.iscsi and using the entry there as what to look for. When it could not find any targets named what was in that file, it failed. Once those two files were correct I was able to discover the target.

Later on in this document I found out that it's best to use CHAP if you plan on connecting more than one host to an iSCSI volume, at least if you use the LeftHand VSA product.

[root@vmserver ~]# service iscsi start
Run this if it's not started already; it should start itself when you install the package via yum.

[root@vmserver iscsi]# iscsiadm -m discovery -tst -p 192.168.1.20:3260
192.168.1.20:3260,1 iqn.2003-10.com.lefthandnetworks:home:11:homevol1

Now that we can see the target, we can attach to it:

iscsiadm -m node --login

If you tail the /var/log/messages file, you should see the volume being discovered.

Feb 3 15:07:47 vmserver kernel: scsi4 : iSCSI Initiator over TCP/IP
Feb 3 15:07:48 vmserver kernel: scsi 4:0:0:0: Direct-Access LEFTHAND iSCSIDisk 7000 PQ: 0 ANSI: 5
Feb 3 15:07:48 vmserver kernel: sd 4:0:0:0: [sdc] 18874368 512-byte hardware sectors (9664 MB)
Feb 3 15:07:48 vmserver kernel: sd 4:0:0:0: [sdc] Write Protect is off
Feb 3 15:07:48 vmserver kernel: sd 4:0:0:0: [sdc] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
Feb 3 15:07:48 vmserver kernel: sd 4:0:0:0: [sdc] 18874368 512-byte hardware sectors (9664 MB)
Feb 3 15:07:48 vmserver kernel: sd 4:0:0:0: [sdc] Write Protect is off
Feb 3 15:07:48 vmserver kernel: sd 4:0:0:0: [sdc] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
Feb 3 15:07:48 vmserver kernel: sdc: unknown partition table
Feb 3 15:07:48 vmserver kernel: sd 4:0:0:0: [sdc] Attached SCSI disk
Feb 3 15:07:48 vmserver kernel: sd 4:0:0:0: Attached scsi generic sg4 type 0
Feb 3 15:07:48 vmserver iscsid: received iferror -38
Feb 3 15:07:48 vmserver iscsid: connection1:0 is operational now

I created a 9 GB volume on my LeftHand VSA. If we list all disks, we should now see it.

[root@vmserver ~]# fdisk -l

Disk /dev/sda: 40.0 GB, 40060403712 bytes
255 heads, 63 sectors/track, 4870 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x88508850

Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104391 83 Linux
/dev/sda2 14 268 2048287+ 82 Linux swap / Solaris
/dev/sda3 269 4870 36965565 83 Linux

Disk /dev/sdb: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000d3d03

Device Boot Start End Blocks Id System
/dev/sdb1 * 1 9729 78148161 83 Linux

Disk /dev/sdc: 9663 MB, 9663676416 bytes
64 heads, 32 sectors/track, 9216 cylinders
Units = cylinders of 2048 * 512 = 1048576 bytes
Disk identifier: 0x00000000

Disk /dev/sdc doesn't contain a valid partition table

/dev/sdc is my new disk. From now on we treat it like a local disk: we need to do the usual fdisk and format routine on it before we can use it.
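
The routine itself is nothing special. A minimal sketch, using the same /mnt/iscsi mount point that shows up later:

fdisk /dev/sdc              # create a single primary partition spanning the disk
mkfs.ext3 /dev/sdc1
mkdir -p /mnt/iscsi
mount -o noatime /dev/sdc1 /mnt/iscsi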

More Testing
I took the same 3 GB ISO file and moved it back and forth a few times to the newly created iSCSI volume. Transfer speeds to the volume were about 21 MB/s, not as good as NFS. Reading from the volume I was getting about 28-30 MB/s, still not as good as NFS, but remember my iSCSI target is running inside a virtual machine. Although I'd like it to be better, it will do for my purposes of having a test ESX setup of my own. I'm currently running the bonnie tests again against the iSCSI volume:

[root@vmserver iscsi]# bonnie -d /mnt/iscsi -s 2000
Bonnie 1.4: File '/mnt/iscsi/Bonnie.2921', size: 2097152000, volumes: 1
Writing with putc()... done: 27425 kB/s 75.2 %CPU
Rewriting... done: 11034 kB/s 6.1 %CPU
Writing intelligently... done: 19545 kB/s 7.2 %CPU
Reading with getc()... done: 25133 kB/s 75.7 %CPU
Reading intelligently... done: 36555 kB/s 13.3 %CPU
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
vmserv 1*2000 27425 75.2 19545 7.2 11034 6.1 25133 75.7 36555 13.3 338.2 1.4

Once all that is done, the volume needs to be mounted at boot. There are a few tricks to this because of how things start up on the system: normally in Linux all volumes are mounted before the network is started, which is a problem because we need the network up and running to get to our volume. This is another tidbit that I wish I had found in one place. As always, doing this stuff is not terribly hard, but hunting around for the information can be time consuming. The source of information I used was from here, towards the bottom at "Automatic Start and Volume Mounting."
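
On the Fedora side this boils down to making sure the iSCSI services start at boot and that the node is set to log in automatically. A minimal sketch, assuming the stock service names from the iscsi-initiator-utils package:

chkconfig iscsid on
chkconfig iscsi on

# tell open-iscsi to log into this node automatically at startup
iscsiadm -m node -T iqn.2003-10.com.lefthandnetworks:home:11:homevol1 \
         -p 192.168.1.20:3260 --op update -n node.startup -v automatic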

I could just go ahead and assume that /dev/sdc1 will always come up that way, but theoretically I could add another drive and the drive order might change. In other words, /dev/sdc1 today could be /dev/sde1 tomorrow after adding or removing a drive in the box. To address this we can mount the drive using a unique signature. The command udevinfo -q symlink -n /dev/sdc1 gives me the following output:

[root@vmserver array]# udevinfo -q symlink -n /dev/sdc1
disk/by-id/scsi-36000eb367106cef8000000000000000b-part1 disk/by-path/ip-192.168.1.20:3260-iscsi-iqn.2003-10.com.lefthandnetworks:home:11:homevol1-lun-0-part1

I can use either of these links in my /etc/fstab. Using one of these names assures that I'll be mounting the expected disk every time. We'll also use _netdev and noatime as options in the fstab; _netdev makes sure the network is started before this volume is mounted. Here is my /etc/fstab:

[root@vmserver array]# cat /etc/fstab
LABEL=/ / ext3 defaults,noatime 1 1
LABEL=/home /home ext3 defaults 1 2
LABEL=/boot /boot ext3 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=SWAP-sdc2 swap swap defaults 0 0
#iSCSI Volumes
/dev/disk/by-path/ip-192.168.1.20:3260-iscsi-iqn.2003-10.com.lefthandnetworks:home:11:homevol1-lun-0-part1 /mnt/iscsi ext3 _netdev,noatime 0 0

Notice the last line. I was able to mount /mnt/iscsi manually, and it also mounted automatically on a reboot.

[root@vmserver ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 35G 3.1G 30G 10% /
/dev/sdb1 74G 3.4G 67G 5% /home
/dev/sda1 99M 18M 77M 19% /boot
tmpfs 506M 0 506M 0% /dev/shm
/dev/sdc1 8.9G 3.4G 5.1G 40% /mnt/iscsi


That's pretty much it for my iSCSI array.

More iSCSI
Well, I thought that was going to be it for iSCSI. Tonight I thought I would give native Linux iSCSI a try. First I loaded up iscsitarget:

[root@homeserver rpms]# rpm -ivh http://rpm.livna.org/livna-release-8.rpm
Retrieving http://rpm.livna.org/livna-release-8.rpm
warning: /var/tmp/rpm-xfer.nwUz5q: Header V3 DSA signature: NOKEY, key ID a109b1ec
Preparing... ########################################### [100%]
1:livna-release ########################################### [100%]
[root@homeserver rpms]# yum install iscsitarget


Dependency Installed: iscsitarget.i386 1:0.4.15-9.svn142.lvn8 kmod-iscsitarget.i686 1:0.4.15-7.svn142.lvn8 kmod-iscsitarget-2.6.23.14-107.fc8.i686 1:0.4.15-7.svn142.lvn8
Complete!

The first step was to load up the Livna repository; then I was able to use yum to install everything, which ended up being three packages. I'll add on to this later. My new boards have arrived!
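
Before the new hardware distracts me, here's roughly the kind of /etc/ietd.conf I expect to start from with iscsitarget. I have not configured this yet, so the target IQN, backing device, CHAP credentials, and even the service name are placeholders:

# /etc/ietd.conf
Target iqn.2008-02.com.homeserver:array.vol1
        Lun 0 Path=/dev/md0,Type=fileio
        IncomingUser chapuser chappassword

# start the target daemon (service name per the livna package)
service iscsi-target start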

The New Servers are Here !!

It's been about a week now. I had to special order my boards to make sure ESX would function properly. I ended up getting two Asus P5M2-M server boards; I decided not to get the SAS model, which saved me over $100 per board. These things were not cheap. The nice thing about these boards is that they will take Core2Duo or Xeon processors. I was thankful for that because a Xeon would cost more than the board itself. They will also take quad cores, so I have that option later. It's a very flexible board: dual Broadcom gig NICs and SATA II hardware RAID 0 or 1. If I ever build a server for anybody, I'll be using these boards. I figured since I was dropping some money I should put them in some good cases, so last night I installed them into Antec NSK440 cases, which you can see here. Somehow I missed ordering hard drives, but I'll get those later today. I put them together last night, crossed my fingers, and booted them both up with the ESX 3.5 CD. Both booted flawlessly and attached to my install server via NFS. I could not get any further because at that point it was looking for hard drives, but it was a relief to see my work pay off. I've documented my hardware in a spreadsheet I've put here for reference. It will probably be changing for the iSCSI array; I've decided I want hardware RAID0 in the iSCSI array for performance. This week I installed an XP box on my old P4 VMware Server box, which is using the iSCSI array as external storage. The XP VM I installed works fine, so I expect ESX to do just as well.

I picked up two new 80 GB SATA II drives and installed them. ESX 3.5 loaded perfectly and is now on my network, albeit with cable running all over my floors :-) I've decided to put the new ESX boxes in a cabinet under my TV. I wanted to put them in my hall closet with my other servers, but I was afraid of the heat all four of those things would generate. I've converted my hall coat closet into my wiring closet; it's pretty much in the center of my house, so it makes sense, and I've been running both of my machines right out of there for about two years now. The temperature gauge says it's about 84 in there. That's getting up there, so I decided to put my ESX boxes under my TV. I cut out the panel in the back of the entertainment center and loaded the ESX boxes in there. They fit perfectly; the only problem is that I have to run cable, four runs to be exact. I already have two runs going in there, one for my TiVo and the other for our AirPort Express. I have not been up there in a while since I've gone wireless, but I think I'm going to have to.

Fun with Cabling

Ah, cabling, how do I hate thee. Let me count the ways. Anybody who has run cable in their house by themselves knows what I know: it really sucks!! Climbing into the dark end of a hot attic, lying across beams on your chest while drilling holes, and hoping the cable does not get snagged coming out of the box at the other end. All of these things happened this morning. At least it's done now and I can concentrate on getting ESX functional. I found out today that I had to add another gigabit switch to my network to accommodate all the new ports coming online, and even with the new switch I've had to abandon NIC bonding. I'm not sure how effective it is without a managed switch anyway; keeping it simple is key when doing these things. Time for a coffee break.

Pictures

I've taken some pictures of my machines and network now that I've finished cleaning it all up. You should be able to see them all here.

Sunday and another fun day with ESX and iSCSI

There is something about Sunday mornings that makes progress impossible for me. I spent about three hours trying to get the iSCSI initiators working with my LeftHand VSA. It was extremely frustrating reading document after document about how dead easy it is to get ESX to discover an iSCSI volume. Last week I spent about an hour having trouble; this week it was over three hours, but I finally got it and I've learned a lot in the process. My wife's Mac announces the time every hour, and it was almost like it was mocking me as the hours went past with almost zero progress. My breakthrough came when I turned on CHAP authentication. In an earlier post I found that I needed to set my initiator name to be the same as the target to get a discovery. That worked fine, but it was only for a single host. An ESX cluster means that two or more hosts need to be connected to the same target. So what happens when you connect two hosts using the same initiator name? They connected OK sometimes, sometimes not, and that's what almost drove me mad. While watching my VSA console I could see a single session flipping between two IPs. So basically all parties were arguing over who they were; they were all claiming the same identity! In frustration I shut down one ESX host and suddenly the other host connected up to the array with no problem. I thought I was being smart by trying to do everything at once, but I ended up stomping all over myself. I did set all of this up to learn, and I'm glad I did, but that does not make it any less frustrating when problems are happening. Here is the final procedure I used:

- Create the iSCSI volume on the array
- Set up CHAP for the iSCSI volume
- Make sure the target uses a unique name and password
- On each ESX host, configure iSCSI and set up CHAP to use those credentials
- Discover targets

In my case each ESX host had to reboot because I changed the IQN; you may or may not have to. Once I did all of that it worked, and I now have two happy ESX servers with iSCSI volumes attached and waiting for VMs. Right now I'm copying a large ISO from each ESX host into the array just to exercise it a bit. Using those two ISO files I am going to load an XP and a Fedora virtual machine on each ESX host.

Pink Floyd : Cluster One

Well, it's done. My cluster is up and running. My XP Ghost image won't boot because it uses an IDE disk, which is not available in ESX; I have not really looked into that yet. I was able to build a Fedora 8 box without an issue. I need to find that cpubusy script to make the failover work on its own.