Discussion:
[Gluster-users] Performance
Mohit Anchlia
2011-04-20 00:52:24 UTC
Permalink
I am getting miserable performance in a LAN setup with 6 servers in a
distributed volume with 2 replicas over 1GigE. I am using native gluster
clients (mount -t glusterfs server1:/vol /mnt) and I am only able to get
20MB/s. This is some of the output from sar:

Each server has 4 x 10K SAS drives in RAID0. This is a new setup and I
expected to get much higher performance. Can someone please help
with recommendations?

!sar
sar -B 1 100


05:44:38 PM pgpgin/s pgpgout/s fault/s majflt/s
05:44:39 PM 0.00 0.00 1413.00 0.00
05:44:40 PM 0.00 29896.00 29.00 0.00
05:44:41 PM 0.00 4510.89 1523.76 0.00
05:44:42 PM 0.00 16.16 20.20 0.00
05:44:43 PM 0.00 12.00 16.00 0.00
05:44:44 PM 0.00 102.97 15.84 0.00
05:44:45 PM 0.00 21100.00 14.00 0.00
05:44:46 PM 0.00 8092.00 19.00 0.00

sar -ru 1 100

05:46:48 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
05:46:49 PM 250696 98742512 99.75 6327752 87155568 101056220 284 0.00 0

05:46:49 PM CPU %user %nice %system %iowait %steal %idle
05:46:50 PM all 1.75 0.00 16.62 1.87 0.00 79.76

05:46:49 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
05:46:50 PM 257364 98735844 99.74 6328180 87149728 101056220 284 0.00 0

05:46:50 PM CPU %user %nice %system %iowait %steal %idle
05:46:51 PM all 0.00 0.00 0.54 4.12 0.00 95.34

05:46:50 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
05:46:51 PM 257388 98735820 99.74 6328184 87150080 101056220 284 0.00 0

05:46:51 PM CPU %user %nice %system %iowait %steal %idle
05:46:52 PM all 0.12 0.00 3.08 0.08 0.00 96.72

05:46:51 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
05:46:52 PM 257436 98735772 99.74 6328284 87151456 101056220 284 0.00 0

05:46:52 PM CPU %user %nice %system %iowait %steal %idle
05:46:53 PM all 0.17 0.00 1.62 0.04 0.00 98.17

05:46:52 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
05:46:53 PM 253468 98739740 99.74 6328368 87155004 101056220 284 0.00 0

05:46:53 PM CPU %user %nice %system %iowait %steal %idle
05:46:54 PM all 0.17 0.00 3.08 0.00 0.00 96.76

05:46:53 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
05:46:54 PM 252320 98740888 99.75 6328444 87156888 101056220 284 0.00 0

05:46:54 PM CPU %user %nice %system %iowait %steal %idle
05:46:55 PM all 0.00 0.00 0.00 0.79 0.00 99.21

05:46:54 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
05:46:55 PM 252240 98740968 99.75 6328448 87156884 101056220 284 0.00 0

05:46:55 PM CPU %user %nice %system %iowait %steal %idle
05:46:56 PM all 0.83 0.00 23.89 0.00 0.00 75.28

05:46:55 PM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
05:46:56 PM 247652 98745556 99.75 6328712 87159900 101056220 284 0.00 0


Joe Landman
2011-04-20 02:07:56 UTC
Permalink
Post by Mohit Anchlia
I am getting miserable performance in LAN setup with 6 servers with
distributed and 2 replicas with 1GigE. I am using native glsuter
clients (mount -t glusterfs server1:/vol /mnt) I am only able to get
Each server has 4 10K SAS drives RAID0. This is a new setup and I
expected to get much much higher performance. Can someone please help
with recommendations?
It seems like I just handled a case like this a few months ago ...

What does your IO workload actually look like? Much more interested in
iostat like output (though dstat also works very well for bandwidth
heavy loads).

Gluster isn't going to do well with small IO operations without serious
caching (NFS client). Despite the fact that these are RAID0 across 4x
10kRPM SAS drives, this is *not* a high performance IO system in most
senses of the definition. The design of the storage should be driven by
the application and anticipated workloads.

More to the point, what specifically are your goals in terms of
throughput/bandwidth ... what will your storage loads look like?

What RAID cards are you using (if any)? If software raid, could you
report output of

mdadm --detail /dev/MD

where /dev/MD is your MD raid device. Which SAS 10k disks are you
using? How are they connected to the machine if not through a RAID
card? Is the RAID0 a hardware or software RAID?
Post by Mohit Anchlia
!sar
sar -B 1 100
05:44:38 PM pgpgin/s pgpgout/s fault/s majflt/s
05:44:39 PM 0.00 0.00 1413.00 0.00
05:44:40 PM 0.00 29896.00 29.00 0.00
05:44:41 PM 0.00 4510.89 1523.76 0.00
05:44:42 PM 0.00 16.16 20.20 0.00
05:44:43 PM 0.00 12.00 16.00 0.00
05:44:44 PM 0.00 102.97 15.84 0.00
05:44:45 PM 0.00 21100.00 14.00 0.00
05:44:46 PM 0.00 8092.00 19.00 0.00
sar isn't too useful for figuring out what's going on in the IO channel.
iostat is much better. dstat, atop, and vmstat are all good
tools. If you want a great deal of data (it gathers everything of value),
use collectl with a 1 second interval and the right options.
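
For example (illustrative invocations, assuming the sysstat, dstat and
collectl packages are installed; all sample at 1 second intervals):

iostat -x -m 1
dstat -d -n 1
collectl -sD -i 1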
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 02:43:32 UTC
Permalink
It's HW RAID we are using. These are Dell C6100 servers with a RAID
controller. We expect around 100MB/sec. When I run "dd" I get 20MB/sec,
and since I have 6 servers I at least expected 3 X 20 MB/sec since it is
a replica of 2. We are in a subnet inside a lab with only us doing the
testing, so the network is not an issue for sure.

File distribution are as follows:

bytes     %
130000    19.762%
70000     30.101%
100000    20.165%
230000    20.016%
1100000   0.447%
430000    5%
2000000   0.39%
630000    4.47%


Joe Landman
2011-04-20 04:08:28 UTC
Permalink
Post by Mohit Anchlia
It's HW RAID we are using. These are Dell C6100 server with RAID
controller. We expect around 100MB/sec. When I run "dd" I get 20MB/sec
and since I have 6 servers I at least expected 3 X 20 MB/sec since it
replica of 2. We are in a subnet inside a LAB with only us doing the
testing so network is not a issue for sure.
bytes %
130000 19.762%
70000 30.101%
100000 20.165%
230000 20.016%
1100000 0.447%
430000 5%
2000000 039%
630000 4.47%
Hmmm ... 90% of your io sizes seem to be between 70k and 230k bytes.
This is ok. What does your dd command look like?

Are you running stripe as well as mirror in gluster (not that it matters
for performance)?

A single dd will be rate limited to the single mirror pair it is writing
to or reading from, unless you stripe. Not RAID0 stripe, but stripe in
gluster.

However ... before we get there ... on the raw volume (e.g. outside of
gluster) what does

dd if=/dev/sdX of=/dev/null bs=128k count=80k

e.g.

[***@jr5-lab ~]# dd if=/dev/sdg of=/dev/null bs=128k count=80k
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 14.7611 seconds, 727 MB/s


get you for each of the RAID0s? This will read ~11GB of data from the
RAID0. No writing. You can write a file as well (that would be very helpful):

dd if=/dev/zero of=/path/to/gluster/file/system bs=128k \
count=80k


e.g.

[***@jr5-lab ~]# dd if=/dev/zero of=/data1/big.file bs=128k \
count=80k
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 29.5108 seconds, 364 MB/s


I am assuming these are PERC RAID controllers. And assuming you
configured the RAID0 in hardware rather than software. What
chunk/stripe size did you use?
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 03:49:27 UTC
Permalink
I'll post the results of dd soon. Meanwhile, is it possible to check
the stripe size for HW RAID using some commands on Linux? Yes, it is a
PERC controller.

Joe Landman
2011-04-20 05:13:52 UTC
Permalink
Post by Mohit Anchlia
I'll post the results of dd soon. Meanwhile is it possible to check
the stripe size for HW raid using some commands on linux? Yes it is
PERC controller.
Assuming you have MegaCli installed

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL

will give us info on the virtual disks.

c.f. http://tools.rapidsoft.de/perc/perc-cheat-sheet.html
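
If the LD listing comes back empty, the adapter-level dump is sometimes more
talkative; something along these lines (same MegaCli path) reports stripe
sizes, cache memory and firmware per adapter:

/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL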
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 17:04:52 UTC
Permalink
Thanks! I installed this CLI and ran the command below, but it returns no
result, just a blank line.

When we created the RAID0 we chose the auto option.
Joe Landman
2011-04-20 17:11:16 UTC
Permalink
Post by Mohit Anchlia
Thanks! I installed this Cli and ran below command but it returns no
result just a blank line.
When we created RAID0 we chose auto option.
Did you do the RAID creation in hardware or in software?

what is the output of

cat /proc/mdstat
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 17:19:09 UTC
Permalink
It was done using the console, using hardware. Not software.

cat /proc/mdstat
Personalities :
unused devices: <none>


Joe Landman
2011-04-20 17:22:01 UTC
Permalink
Post by Mohit Anchlia
It was done using console.using hardware. Not software.
cat /proc/mdstat
unused devices:<none>
Hmmm ... MegaCLI isn't reporting any LD's ... and no software RAID.

How about a nice simple

lsscsi

so we can see what it thinks
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 17:35:32 UTC
Permalink
Should that command be there by default? I couldn't find lsscsi

Joe Landman
2011-04-20 17:39:27 UTC
Permalink
Post by Mohit Anchlia
Should that command be there by default? I couldn't find lsscsi
How about

mount

output?
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 17:42:45 UTC
Permalink
mount
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sdb1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sda1 on /data type ext3 (rw)
glusterfs#dsdb1:/stress-volume on /data/mnt-stress type fuse
(rw,allow_other,default_permissions,max_read=131072)


Joe Landman
2011-04-20 17:49:37 UTC
Permalink
Post by Mohit Anchlia
mount
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sdb1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sda1 on /data type ext3 (rw)
glusterfs#dsdb1:/stress-volume on /data/mnt-stress type fuse
(rw,allow_other,default_permissions,max_read=131072)
ok ...

so the gluster mount is at /data/mnt-stress, which sits under /data
(/dev/sda1, ext3).

Could you do this

dd if=/dev/zero of=/data/big.file bs=128k count=80k
echo 3 > /proc/sys/vm/drop_caches
dd of=/dev/null if=/data/big.file bs=128k

so we can see the write and then read performance using 128k blocks?

Also, since you are using the gluster native client, you don't get all
the nice NFS caching bits. Gluster native client is somewhat slower
than the NFS client.

So let's start with the write/read speed of the system before we deal
with the gluster side of things.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 18:01:01 UTC
Permalink
Is 128K the right block size given that my file sizes range from 70K to 2MB?

Please find

[***@dsdb1 ~]# dd if=/dev/zero of=/data/big.file bs=128k count=80k
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 14.751 seconds, 728 MB/s
[***@dsdb1 ~]# echo 3 > /proc/sys/vm/drop_caches
[***@dsdb1 ~]# dd of=/dev/null if=/data/big.file bs=128k
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 3.10485 seconds, 3.5 GB/s

Joe Landman
2011-04-20 18:05:44 UTC
Permalink
Post by Mohit Anchlia
Is 128K block size right size given that file sizes I have is from 70K - 2MB?
Please find
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 14.751 seconds, 728 MB/s
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 3.10485 seconds, 3.5 GB/s
Hmm ... this looks like it came from cache. 4 drive RAID0's aren't even
remotely this fast.

Add an oflag=direct to the first dd, and an iflag=direct to the second
dd so we can avoid the OS memory cache for the moment (looks like the
driver isn't respecting the drop caches command, or you had no space
between the 3 and the > sign).
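
That is, something along these lines (same file and mount point as before):

dd if=/dev/zero of=/data/big.file bs=128k count=80k oflag=direct
echo 3 > /proc/sys/vm/drop_caches
dd of=/dev/null if=/data/big.file bs=128k iflag=direct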
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 18:29:32 UTC
Permalink
Please find


[***@dsdb1 ~]# cat /proc/sys/vm/drop_caches
3
[***@dsdb1 ~]# dd if=/dev/zero of=/data/big.file bs=128k count=80k oflag=direct

81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 521.553 seconds, 20.6 MB/s
[***@dsdb1 ~]#
[***@dsdb1 ~]# dd if=/dev/zero of=/data/big.file bs=128k count=80k iflag=direct
dd: opening `/dev/zero': Invalid argument
[***@dsdb1 ~]# dd of=/dev/null if=/data/big.file bs=128k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 37.854 seconds, 284 MB/s
[***@dsdb1 ~]#


Joe Landman
2011-04-20 18:47:06 UTC
Permalink
Post by Mohit Anchlia
Please find
3
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 521.553 seconds, 20.6 MB/s
Suddenly this makes a great deal more sense.
Post by Mohit Anchlia
dd: opening `/dev/zero': Invalid argument
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 37.854 seconds, 284 MB/s
About what I expected.

Ok. Uncached OS writes get you to 20MB/s. Which is about what you are
seeing with the fuse mount and a dd. So I think we understand the write
side.

The read side is about where I expected (lower actually, but not by
enough that I am concerned).

You can try changing bs=2M count=6k on both to see the effect of larger
blocks. You should get some improvement.
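
That is, roughly:

dd if=/dev/zero of=/data/big.file bs=2M count=6k oflag=direct
dd of=/dev/null if=/data/big.file bs=2M count=6k iflag=direct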

I think we need to dig into the details of that RAID0 construction now.
This might be something better done offlist (unless everyone wants to
see the gory details of digging into the hardware side).

My current thought is that this is a hardware issue, and not a gluster
issue per se, but that there are possibilities for improving performance
on the gluster side of the equation.

Short version: PERC is not fast (never has been), and it is often a bad
choice for high performance. You are often better off building an MD
RAID using the software tools in Linux; it will be faster. Think of
PERC as an HBA with some modicum of built in RAID capability. You don't
really want to use that capability if possible, but you do want to use
the HBA.

Longer version: Likely a striping issue, or a caching issue (need to
see battery state, cache size, etc.), not to mention the slow chip. Are
the disk write caches off or on? (Guessing off, which is the right thing
to do for some workloads, but it does impact performance.) Also, the
RAID CPU in the PERC (it's a rebadged LSI) is very low performance in
general, and specifically not terribly good even at RAID0. These are
direct writes, skipping the OS cache. They will let you see how fast the
underlying hardware is, and whether it can handle the amount of data you
want to shove onto the disks.
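
If MegaCli can be coaxed into talking to that controller, the cache policy
and battery state can usually be pulled with something like (same path as
before):

/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -Cache -LALL -aALL
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL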

Here is my desktop:

***@metal:/local2/home/landman# dd if=/dev/zero of=/local2/big.file bs=128k count=80k oflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 64.7407 s, 166 MB/s

***@metal:/local2/home/landman# dd if=/dev/zero of=/local2/big.file bs=2M count=6k oflag=direct
6144+0 records in
6144+0 records out
12884901888 bytes (13 GB) copied, 86.0184 s, 150 MB/s



and a server in the lab

[***@jr5-1 ~]# dd if=/dev/zero of=/data/big.file bs=128k count=80k oflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 11.0948 seconds, 968 MB/s

[***@jr5-1 ~]# dd if=/dev/zero of=/data/big.file bs=2M count=6k oflag=direct
6144+0 records in
6144+0 records out
12884901888 bytes (13 GB) copied, 5.11935 seconds, 2.5 GB/s


Gluster will not be faster than the bare metal (silicon). It may hide
some of the issues with caching. But it is bounded by how fast you can
push to or pull bits from the media.

In an "optimal" config, the 4x SAS 10k RPM drives should be able to
sustain ~600 MB/s write. Reality will be less than this, guessing
250-400 MB/s in most cases. This is still pretty low in performance.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 18:54:18 UTC
Permalink
Thanks again! But what I don't understand is that I have 3 X 2 servers,
so I would expect 20 X 3 = 60 MB/s total at least. My load is getting
spread across 3 X 2 servers in a distributed replica. If I was using
just one gluster server I would understand, but with 6 it makes no
sense.

Joe Landman
2011-04-20 18:56:32 UTC
Permalink
Post by Mohit Anchlia
Thanks again! But what I don't understand is that I have 3 X 2 servers
so I would expect 20 X 3 = 60 MBPS total atleast. My load is getting
spread accross 3 X 2 servers in distributed replica. If I was using
just one gluster server I would understand but with 6 it makes no
sense..
The dd is going to only one server, unless you are doing 6 simultaneous
dd-s.
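
A rough way to check the aggregate from one client is to kick off several
dd's in parallel against the fuse mount (file names here are just
illustrative):

for i in $(seq 1 6); do
  dd if=/dev/zero of=/data/mnt-stress/ddtest.$i bs=128k count=20k &
done
wait

Each dd reports its own rate; the aggregate is the sum (or watch dstat on
the client while they run).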
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 18:58:40 UTC
Permalink
Correct. I was referring to the gluster performance test that I am running
using the gluster client, which spreads the load across all the servers.
Burnash, James
2011-04-20 19:01:48 UTC
Permalink
Gluster doesn't "spread the load across 6 servers", because the file is only being retrieved from one server (as Joe has stated). So you are limited to the throughput of that server. Even with the files replicated, only one copy is ever read at a time from a single client.

Hopefully that makes it clearer.

Nice work Joe - very helpful to the rest of us, I think.

James Burnash, Unix Engineering

Mohit Anchlia
2011-04-20 19:05:20 UTC
Permalink
Like I mentioned several times, my tests are running concurrent threads
and doing concurrent writes. So overall throughput per second should be
at least 20 X 3 = 60 MB/s.
Burnash, James
2011-04-20 19:08:29 UTC
Permalink
But each dd will only see the bandwidth from a single server. So 6 dd's run simultaneously will, in aggregate, see more than 20MB/s - but no single one will.

James Burnash, Unix Engineering

Joe Landman
2011-04-20 19:19:07 UTC
Permalink
Post by Mohit Anchlia
Like I mentioned several times my tests are running concurrent threads
and doing concurrent writes. So overall throughput per sec should be
atleast 20 X 3 = 60 MB/s.
Get a copy of fio installed (yum install fio), and use the following as
an input file to it. Call it sw_.fio

[sw]
rw=write
size=10g
directory=/data/mnt-stress
iodepth=32
direct=0
blocksize=512k
numjobs=12
nrfiles=1
ioengine=vsync
loops=1
group_reporting
create_on_open=1
create_serialize=0


run this as

fio sw_.fio

then use the following as sr_.fio

[sr]
rw=read
size=10g
directory=/data/mnt-stress
iodepth=32
direct=0
blocksize=512k
numjobs=12
nrfiles=1
ioengine=vsync
loops=1
group_reporting
create_on_open=1
create_serialize=0

run this as
echo 3 > /proc/sys/vm/drop_caches # note the space after "3"
fio sr_.fio

This will run 12 simultaneous IOs, and theoretically distribute across
most of your nodes (with some oversubscription). Please report back the
WRITE: and READ: portions.

Run status group 0 (all jobs):
  WRITE: io=122694MB, aggrb=2219.5MB/s, minb=2272.8MB/s, maxb=2272.8MB/s, mint=55281msec, maxt=55281msec

Run status group 0 (all jobs):
  READ: io=122694MB, aggrb=1231.4MB/s, minb=1260.9MB/s, maxb=1260.9MB/s, mint=99645msec, maxt=99645msec

fio is one of the best load generators out there, and I'd strongly urge
you to leverage it for your tests.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 19:26:34 UTC
Permalink
Thanks! Looks like at any rate 20MB/s is miserable for 4 X 10K SAS.
What are your recommendations in that regard? Should I try software
RAID0 instead?

How can I tell if it's the controller or the disks themselves that are
the problem?

Thanks for your help!

Joe Landman
2011-04-20 19:31:35 UTC
Permalink
Post by Mohit Anchlia
Thanks! Looks like at any rate 20MB/s is miserable for 4 X 10K SAS.
What are your recommendations on that regard? Should I try software
RAID0 instead?
How can I tell if it's the controller or disks itself is a problem?
We need to break one of the RAID0's and look at how it was constructed.
I'd suggest also seeing if you can rebuild the RAID0's as MD raids.
MD raid is usually (significantly) faster.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 19:31:54 UTC
Permalink
Can you suggest how to do that? I will then work on getting that done. Thanks!

Joe Landman
2011-04-20 19:35:13 UTC
Permalink
Post by Mohit Anchlia
Can you suggest how to do that? I will then work on getting that done. Thanks!
Not sure, as MegaCLI didn't work for you.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 19:41:40 UTC
Permalink
Is MD RAID software RAID? If that's the case, I can use the console to
destroy the RAID and bring the drives up as individual disks, then use
mdadm. Is that what you mean?

Joe Landman
2011-04-20 19:43:32 UTC
Permalink
Post by Mohit Anchlia
Is MD RAID software RAID? If that's the case I can use console to
destroy the RAID and bring them up as individual disks then use mdadm.
Is that what you mean?
Yes
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 19:43:46 UTC
Permalink
Thanks! Is there any recommended configuration you want me to use when
using mdadm?

I got this link:

http://tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.1


Joe Landman
2011-04-20 19:53:21 UTC
Permalink
Post by Mohit Anchlia
Thanks! Is there any recommended configuration you want me to use when
using mdadm?
http://tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.1
First things first: break the RAID0, and then let's measure performance
per disk, to make sure nothing else bad is going on.

dd if=/dev/zero of=/dev/DISK bs=128k count=80k oflag=direct
dd of=/dev/null if=/dev/DISK bs=128k count=80k iflag=direct

for /dev/DISK being one of the drives in your existing RAID0. Once we
know the raw performance, I'd suggest something like this

mdadm --create /dev/md0 --metadata=1.2 --level=0 --chunk=512 \
--raid-devices=4 /dev/DISK1 /dev/DISK2 \
/dev/DISK3 /dev/DISK4
mdadm --examine --scan | grep "md\/0" >> /etc/mdadm.conf

then

dd if=/dev/zero of=/dev/md0 bs=128k count=80k oflag=direct
dd of=/dev/null if=/dev/md0 bs=128k count=80k iflag=direct

and let's see how it behaves. If these are good, then

mkfs.xfs -l version=2 -d su=512k,sw=4,agcount=32 /dev/md0

(yeah, I know, gluster folk have a preference for ext* ... we generally
don't recommend ext* for anything other than OS drives ... you might
need to install xfsprogs and the xfs kernel module ... which kernel are
you using BTW?)
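
On a RHEL/CentOS 5-era box that usually means something like the line
below; the exact name of the XFS kernel-module package varies by distro
and kernel, so kmod-xfs is only a guess:

yum install xfsprogs kmod-xfs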

then

mount -o logbufs=4,logbsize=64k /dev/md0 /data
mkdir stress


dd if=/dev/zero of=/data/big.file bs=128k count=80k oflag=direct
dd of=/dev/null if=/data/big.file bs=128k count=80k iflag=direct

and see how it handles things.

When btrfs finally stabilizes enough to be used, it should be a
reasonable replacement for xfs, but this is likely to be a few years.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 20:02:34 UTC
Permalink
Thanks a lot for taking the time and effort. I will try raw performance
first, but that will only be going to one disk instead of 4. But I
think it definitely makes sense as the first step.

paul simpson
2011-04-20 21:43:32 UTC
Permalink
many thanks for sharing guys. an informative read indeed!

i've 4x dells - each running 12 drives on PERC 600. was disappointed to
hear they're so bad! we never got round to doing intensive tests in this
much depth. 12x2T WD RE4 (sata) is giving me ~600MB/s write on the bare
filesystem. joe, does that tally with your expectations for 12 SATA drives
running RAID6? (i'd put more faith in your gut reaction than our last
tests...) ;)

-p
Mohit Anchlia
2011-04-20 21:52:08 UTC
Permalink
Can you run that test with oflag=direct and see if that is what you get?
Joe Landman
2011-04-20 22:12:48 UTC
Permalink
Post by paul simpson
many thanks for sharing guys. an informative read indeed!
i've 4x dells - each running 12 drives on PERC 600. was disappointed to
hear they're so bad! we never got round to doing tests this in-depth.
12x2T WD RE4 (sata) is giving me ~600MB/s write on the bare
filesystem. joe, does that tally with your expectations for 12 SATA
drives running RAID6? (i'd put more faith in your gut reaction than our
last tests...) ;)
Hmmm ... I always put faith in the measurements ...

Ok, 600MB/s for 12 drives seems low, but they are WD drives (which is
another long subject for us).

This means you are getting about 60 MB/s per drive write on the bare
file system on drives that are at least (in theory) able to get (nearly)
double that.

This is in line with what I expect from these units, towards the higher
end of the range (was this direct or cached IO, BTW?). Most of our
customers never see more than about 300-450 MB/s out of their PERCs with
direct IO (actual performance measurement).
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 22:49:26 UTC
Permalink
Numbers look very disappointing. I destroyed the RAID0. Does that mean the
disks on all the servers are bad?

The disks appear to be "Vendor: FUJITSU  Model: MBD2300RC  Rev: D809".

fdisk -l

Disk /dev/sda: 300.0 GB, 300000000000 bytes
255 heads, 63 sectors/track, 36472 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sda doesn't contain a valid partition table

Disk /dev/sdb: 300.0 GB, 300000000000 bytes
255 heads, 63 sectors/track, 36472 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sdc: 300.0 GB, 300000000000 bytes
255 heads, 63 sectors/track, 36472 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdd: 300.0 GB, 300000000000 bytes
255 heads, 63 sectors/track, 36472 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdd doesn't contain a valid partition table

Disk /dev/sde: 299.4 GB, 299439751168 bytes
255 heads, 63 sectors/track, 36404 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sde1 * 1 13 104391 83 Linux
/dev/sde2 14 36404 292310707+ 8e Linux LVM
[***@dslg1 ~]# dd if=/dev/zero of=/dev/sda bs=128k count=80k oflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 572.117 seconds, 18.8 MB/s


On Wed, Apr 20, 2011 at 12:53 PM, Joe Landman
Post by Mohit Anchlia
Thanks! Is there any recommended configuration you want me to use when
using mdadm?
http://tldp.org/HOWTO/Software-RAID-HOWTO-5.html#ss5.1
First things first, break the RAID0, and then lets measure performance per
disk, to make sure nothing else bad is going on.
       dd if=/dev/zero of=/dev/DISK bs=128k count=80k oflag=direct
       dd of=/dev/null if=/dev/DISK bs=128k count=80k iflag=direct
for /dev/DISK being one of the drives in your existing RAID0.  Once we know
the raw performance, I'd suggest something like this
       mdadm --create /dev/md0 --metadata=1.2 --chunk=512 \
               --raid-devices=4 /dev/DISK1 /dev/DISK2     \
                                /dev/DISK3 /dev/DISK4
       mdadm --examine --scan | grep "md\/0" >> /etc/mdadm.conf
then
       dd if=/dev/zero of=/dev/md0 bs=128k count=80k oflag=direct
       dd of=/dev/null if=/dev/md0 bs=128k count=80k iflag=direct
and lets see how it behaves.  If these are good, then
       mkfs.xfs -l version=2 -d su=512k,sw=4,agcount=32 /dev/md0
(yeah, I know, gluster folk have a preference for ext* ... we generally
don't recommend ext* for anything other than OS drives ... you might need to
install xfsprogs and the xfs kernel module ... which kernel are you using
BTW?)
then
       mount -o logbufs=4,logbsize=64k /dev/md0 /data
       mkdir stress
       dd if=/dev/zero of=/data/big.file bs=128k count=80k oflag=direct
       dd of=/dev/null if=/data/big.file bs=128k count=80k iflag=direct
and see how it handles things.
When btrfs finally stabilizes enough to be used, it should be a reasonable
replacement for xfs, but this is likely to be a few years.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Joe Landman
2011-04-21 00:24:54 UTC
Permalink
Post by Mohit Anchlia
Numbers look very disappointing. I destroyed RAID0. Does it mean disks
on all the servers are bad?
[...]
Post by Mohit Anchlia
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 572.117 seconds, 18.8 MB/s
how about the read?
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 23:28:02 UTC
Permalink
dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 83.8293 seconds, 128 MB/s

Is it ok to use "sda" itself as the file in the "dd" test? Thanks!
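(A note on that: reading from the raw /dev/sdX device is harmless, but a dd write to it overwrites whatever is on the disk, partition table included, so it is only safe on drives you intend to wipe. Once a filesystem is in place, the usual write test goes against a file; the path below is just a placeholder:)

# non-destructive: read from the raw device
dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
# write test against a file on a mounted filesystem instead of the raw disk
dd if=/dev/zero of=/data/ddtest.bin bs=128k count=80k oflag=direct
rm -f /data/ddtest.bin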

On Wed, Apr 20, 2011 at 5:24 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
Numbers look very disappointing. I destroyed RAID0. Does it mean disks
on all the servers are bad?
[...]
Post by Mohit Anchlia
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 572.117 seconds, 18.8 MB/s
how about the read?
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Joe Landman
2011-04-21 00:45:32 UTC
Permalink
Post by Mohit Anchlia
dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 83.8293 seconds, 128 MB/s
Ok, this is closer to what I was expecting (really ~150 MB/s would make
more sense to me, but I can live with 128 MB/s).

The write speed is definitely problematic. I am wondering if write
cache is off, and other features are turned off in strange ways.

This is a 2 year old SATA disk

[***@smash ~]# dd if=/dev/zero of=/dev/sda2 bs=128k oflag=direct
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 20.8322 s, 103 MB/s

Write cache is enabled. Turning write cache off (might not be so
relevant for a RAID0),

[***@smash ~]# hdparm -W /dev/sda

/dev/sda:
write-caching = 1 (on)
[***@smash ~]# hdparm -W0 /dev/sda

/dev/sda:
setting drive write-caching to 0 (off)
write-caching = 0 (off)

[***@smash ~]# dd if=/dev/zero of=/dev/sda2 bs=128k oflag=direct
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 155.636 s, 13.8 MB/s

See if you can do an

hdparm -W1 /dev/sda

and see if it has any impact on the write speed. If you are using a
RAID0, safety isn't so much on your mind anyway, so you can see if you
can adjust your cache settings. If this doesn't work, you might need to
get to the console and tell it to allow caching.
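(A quick way to see where every drive currently stands, as a sketch with assumed device names, is a loop over the candidate disks:)

# report the write-cache setting for each data disk
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    echo "== $d =="
    hdparm -W "$d"               # ATA-style query; may not work on SAS drives
    sdparm -a "$d" | grep WCE    # SCSI caching mode page; works for SAS
done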
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-20 23:50:39 UTC
Permalink
I did that but it looks the same. I did get an error even though it
says write-caching is on.

[***@dslg1 ~]# hdparm -W1 /dev/sda

/dev/sda:
setting drive write-caching to 1 (on)
HDIO_DRIVE_CMD(setcache) failed: Invalid argument
[***@dslg1 ~]# hdparm /dev/sda

/dev/sda:
readonly = 0 (off)
readahead = 256 (on)
geometry = 36472/255/63, sectors = 585937500, start = 0
[***@dslg1 ~]# dd if=/dev/zero of=/dev/sda bs=128k count=1k oflag=direct
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 8.10005 seconds, 16.6 MB/s


On Wed, Apr 20, 2011 at 5:45 PM, Joe Landman
Post by Mohit Anchlia
dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 83.8293 seconds, 128 MB/s
Ok, this is closer to what I was expecting (really ~150 MB/s would make more
sense to me, but I can live with 128 MB/s).
The write speed is definitely problematic.  I am wondering if write cache is
off, and other features are turned off in strange ways.
This is a 2 year old SATA disk
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 20.8322 s, 103 MB/s
Write cache is enabled.  Turning write cache off (might not be so relevant
for a RAID0),
 write-caching =  1 (on)
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 155.636 s, 13.8 MB/s
See if you can do an
       hdparm -W1 /dev/sda
and see if it has any impact on the write speed.  If you are using a RAID0,
safety isn't so much on your mind anyway, so you can see if you can adjust
your cache settings.  If this doesn't work, you might need to get to the
console and tell it to allow caching.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Joe Landman
2011-04-21 00:53:51 UTC
Permalink
Post by Mohit Anchlia
I did that but it looks the same. I did get an error even though it
says write-caching is on.
setting drive write-caching to 1 (on)
HDIO_DRIVE_CMD(setcache) failed: Invalid argument
You might need sdparm

sdparm -a /dev/sda | grep WCE

With WCE on I see

[***@smash ~]# sdparm -a /dev/sda | grep WCE
WCE 1

and with it off, I see

[***@smash ~]# hdparm -W0 /dev/sda

/dev/sda:
setting drive write-caching to 0 (off)
write-caching = 0 (off)

[***@smash ~]# sdparm -a /dev/sda | grep WCE
WCE 0

You might need to change WCE using

sdparm --set=WCE -a /dev/sda

or similar ...
Post by Mohit Anchlia
readonly = 0 (off)
readahead = 256 (on)
geometry = 36472/255/63, sectors = 585937500, start = 0
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 8.10005 seconds, 16.6 MB/s
On Wed, Apr 20, 2011 at 5:45 PM, Joe Landman
Post by Mohit Anchlia
dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 83.8293 seconds, 128 MB/s
Ok, this is closer to what I was expecting (really ~150 MB/s would make more
sense to me, but I can live with 128 MB/s).
The write speed is definitely problematic. I am wondering if write cache is
off, and other features are turned off in strange ways.
This is a 2 year old SATA disk
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 20.8322 s, 103 MB/s
Write cache is enabled. Turning write cache off (might not be so relevant
for a RAID0),
write-caching = 1 (on)
setting drive write-caching to 0 (off)
write-caching = 0 (off)
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 155.636 s, 13.8 MB/s
See if you can do an
hdparm -W1 /dev/sda
and see if it has any impact on the write speed. If you are using a RAID0,
safety isn't so much on your mind anyway, so you can see if you can adjust
your cache settings. If this doesn't work, you might need to get to the
console and tell it to allow caching.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-21 01:16:09 UTC
Permalink
Yes indeed. As soon as I sent you the last email I realized that and set
the write-back option. Now I get 130MB/s, which is better, but still nowhere
close to the 600MB/s that is advertised or that others say one should see.

What are your recommendations about HW choice? What is more
preferable and better?

Another question: do I need to set WCE on all the disks before
creating the RAID0, or can I do that afterwards? I tried to set
WCE on the existing RAID0 but it fails with "change_mode_page: failed fetching
page: Caching (SBC)".

sdparm --set=WCE --save /dev/sda
/dev/sda: FUJITSU MBD2300RC D809

dd if=/dev/zero of=/dev/sda bs=128k count=80k oflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 82.8041 seconds, 130 MB/s

Thanks for your help as always.
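(WCE is a per-physical-disk setting, so, as a sketch with assumed device names, it would be applied to each member disk before the array is assembled; the aggregate RAID device generally does not expose that mode page, which may explain the failure above:)

# enable and persist the write cache on each member disk, then build the array on top
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    sdparm --set=WCE --save "$d"
done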

On Wed, Apr 20, 2011 at 5:53 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
I did that but it looks the same. I did get an error even though it
says write-caching is on.
 setting drive write-caching to 1 (on)
 HDIO_DRIVE_CMD(setcache) failed: Invalid argument
You might need sdparm
       sdparm -a /dev/sda | grep WCE
With WCE on I see
         WCE         1
and with it off, I see
        setting drive write-caching to 0 (off)
        write-caching =  0 (off)
         WCE         0
You might need to change WCE using
       sdparm --set=WCE -a /dev/sda
or similar ...
Post by Mohit Anchlia
 readonly     =  0 (off)
 readahead    = 256 (on)
 geometry     = 36472/255/63, sectors = 585937500, start = 0
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 8.10005 seconds, 16.6 MB/s
On Wed, Apr 20, 2011 at 5:45 PM, Joe Landman
Post by Mohit Anchlia
dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 83.8293 seconds, 128 MB/s
Ok, this is closer to what I was expecting (really ~150 MB/s would make more
sense to me, but I can live with 128 MB/s).
The write speed is definitely problematic.  I am wondering if write cache is
off, and other features are turned off in strange ways.
This is a 2 year old SATA disk
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 20.8322 s, 103 MB/s
Write cache is enabled.  Turning write cache off (might not be so relevant
for a RAID0),
 write-caching =  1 (on)
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 155.636 s, 13.8 MB/s
See if you can do an
       hdparm -W1 /dev/sda
and see if it has any impact on the write speed.  If you are using a RAID0,
safety isn't so much on your mind anyway, so you can see if you can adjust
your cache settings.  If this doesn't work, you might need to get to the
console and tell it to allow caching.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-21 17:29:52 UTC
Permalink
Hi Joe,

When you get a chance, can you please look at my mail? It would be helpful to
get your advice.
Post by Mohit Anchlia
Yes indeed. As soon as I sent you last email I realized that and set
the write back option. Now I get 130MB/s better but still nowhere
close to 600MB/s as advertised or what others say one should see.
What are your recommendations about HW choice? What is more
preferrable and better?
Another question do I need to set WCE on all the disks first before
creating RAID0? Or can I do that after creating RAID0? I tried to set
WCE on existing RAID0 but it fails "change_mode_page: failed fetching
page: Caching (SBC)".
 sdparm --set=WCE --save /dev/sda
   /dev/sda: FUJITSU   MBD2300RC         D809
dd if=/dev/zero of=/dev/sda bs=128k count=80k oflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 82.8041 seconds, 130 MB/s
Thanks for your help as always.
On Wed, Apr 20, 2011 at 5:53 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
I did that but it looks the same. I did get an error even though it
says write-caching is on.
 setting drive write-caching to 1 (on)
 HDIO_DRIVE_CMD(setcache) failed: Invalid argument
You might need sdparm
       sdparm -a /dev/sda | grep WCE
With WCE on I see
         WCE         1
and with it off, I see
        setting drive write-caching to 0 (off)
        write-caching =  0 (off)
         WCE         0
You might need to change WCE using
       sdparm --set=WCE -a /dev/sda
or similar ...
Post by Mohit Anchlia
 readonly     =  0 (off)
 readahead    = 256 (on)
 geometry     = 36472/255/63, sectors = 585937500, start = 0
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 8.10005 seconds, 16.6 MB/s
On Wed, Apr 20, 2011 at 5:45 PM, Joe Landman
Post by Mohit Anchlia
dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 83.8293 seconds, 128 MB/s
Ok, this is closer to what I was expecting (really ~150 MB/s would make more
sense to me, but I can live with 128 MB/s).
The write speed is definitely problematic.  I am wondering if write cache is
off, and other features are turned off in strange ways.
This is a 2 year old SATA disk
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 20.8322 s, 103 MB/s
Write cache is enabled.  Turning write cache off (might not be so relevant
for a RAID0),
 write-caching =  1 (on)
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 155.636 s, 13.8 MB/s
See if you can do an
       hdparm -W1 /dev/sda
and see if it has any impact on the write speed.  If you are using a RAID0,
safety isn't so much on your mind anyway, so you can see if you can adjust
your cache settings.  If this doesn't work, you might need to get to the
console and tell it to allow caching.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-22 00:49:02 UTC
Permalink
After a lot of digging today I finally figured out that it's not really
using a PERC controller but a Fusion-MPT one. Then it wasn't clear which
tool it supports. Finally I installed lsiutil and was able to change
the cache size.

[***@dsdb1 ~]# lspci|grep LSI
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
[***@dsdb1 ~]# dd if=/dev/zero of=/data/big.file bs=128k count=40k oflag=direct
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s

I compared this with the SW RAID (mdadm) that I created yesterday on one of
the servers, and there I get around 300MB/s. I will first test with what
we have before destroying it and testing with mdadm.
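(For the md comparison, the full recreate-and-test sequence looks roughly like the following; note that mdadm needs the RAID level spelled out, which the command as quoted earlier does not show, and the device names here are assumptions:)

# software RAID0 across the four data disks
mdadm --create /dev/md0 --level=0 --metadata=1.2 --chunk=512 \
      --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
mdadm --examine --scan | grep "md/0" >> /etc/mdadm.conf

# filesystem aligned to the 512k chunk, then a direct-IO streaming test
mkfs.xfs -l version=2 -d su=512k,sw=4,agcount=32 /dev/md0
mount -o logbufs=4,logbsize=64k /dev/md0 /data
dd if=/dev/zero of=/data/big.file bs=128k count=80k oflag=direct
dd of=/dev/null if=/data/big.file bs=128k count=80k iflag=direct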

Thanks for your help that led me down this path. Another question I had:
when creating an mdadm RAID, does it make sense to use multipathing?

On Thu, Apr 21, 2011 at 10:30 AM, Joe Landman
In the lab working on fixing a unit will be about an hour
Please pardon brevity and spelling errors, sent from my BlackBerry
-----Original Message-----
Date: Thu, 21 Apr 2011 10:29:52
Subject: Re: [Gluster-users] Performance
Hi Joe,
When you get chance can you please at my mail? It will be helpful to
get your advise.
Post by Mohit Anchlia
Yes indeed. As soon as I sent you last email I realized that and set
the write back option. Now I get 130MB/s better but still nowhere
close to 600MB/s as advertised or what others say one should see.
What are your recommendations about HW choice? What is more
preferrable and better?
Another question do I need to set WCE on all the disks first before
creating RAID0? Or can I do that after creating RAID0? I tried to set
WCE on existing RAID0 but it fails "change_mode_page: failed fetching
page: Caching (SBC)".
 sdparm --set=WCE --save /dev/sda
   /dev/sda: FUJITSU   MBD2300RC         D809
dd if=/dev/zero of=/dev/sda bs=128k count=80k oflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 82.8041 seconds, 130 MB/s
Thanks for your help as always.
On Wed, Apr 20, 2011 at 5:53 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
I did that but it looks the same. I did get an error even though it
says write-caching is on.
 setting drive write-caching to 1 (on)
 HDIO_DRIVE_CMD(setcache) failed: Invalid argument
You might need sdparm
       sdparm -a /dev/sda | grep WCE
With WCE on I see
         WCE         1
and with it off, I see
        setting drive write-caching to 0 (off)
        write-caching =  0 (off)
         WCE         0
You might need to change WCE using
       sdparm --set=WCE -a /dev/sda
or similar ...
Post by Mohit Anchlia
 readonly     =  0 (off)
 readahead    = 256 (on)
 geometry     = 36472/255/63, sectors = 585937500, start = 0
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 8.10005 seconds, 16.6 MB/s
On Wed, Apr 20, 2011 at 5:45 PM, Joe Landman
Post by Mohit Anchlia
dd of=/dev/null if=/dev/sda bs=128k count=80k iflag=direct
81920+0 records in
81920+0 records out
10737418240 bytes (11 GB) copied, 83.8293 seconds, 128 MB/s
Ok, this is closer to what I was expecting (really ~150 MB/s would make more
sense to me, but I can live with 128 MB/s).
The write speed is definitely problematic.  I am wondering if write cache is
off, and other features are turned off in strange ways.
This is a 2 year old SATA disk
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 20.8322 s, 103 MB/s
Write cache is enabled.  Turning write cache off (might not be so relevant
for a RAID0),
 write-caching =  1 (on)
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)
dd: writing `/dev/sda2': No space left on device
16379+0 records in
16378+0 records out
2146798080 bytes (2.1 GB) copied, 155.636 s, 13.8 MB/s
See if you can do an
       hdparm -W1 /dev/sda
and see if it has any impact on the write speed.  If you are using a RAID0,
safety isn't so much on your mind anyway, so you can see if you can adjust
your cache settings.  If this doesn't work, you might need to get to the
console and tell it to allow caching.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Joe Landman
2011-04-22 02:23:32 UTC
Permalink
Post by Mohit Anchlia
After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which
PERC is a rebadged LSI based on the 1068E chip.
Post by Mohit Anchlia
tool it supports. Finally I installed lsiutil and was able to change
the cache size.
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
This looks like PERC. These are roughly equivalent to the LSI 3081
series. These are not fast units. There is a variant of this that does
RAID6; it's usually available as a software update or plugin module
(button?) to this. I might be thinking of the 1078 chip though.

Regardless, these are fairly old designs.
Post by Mohit Anchlia
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s
I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.
So the software RAID is giving you 300 MB/s and the hardware 'RAID' is
giving you ~181 MB/s? Seems a pretty simple choice :)

BTW: The 300MB/s could also be a limitation of the PCIe channel
interconnect (or worse, if they hung the chip off a PCI-X bridge). The
motherboard vendors are generally loath to spend more than a few PCIe
lanes on handling SATA, networking, etc. So typically you wind up with
very low powered 'RAID' and 'SATA/SAS' on the motherboard, connected by
PCIe x2 or x4 at most (PCIe 1.x is roughly 250 MB/s per lane, so a narrow
link saturates quickly). A number of motherboards have NICs that are
served by a single PCIe x1 link.
Post by Mohit Anchlia
Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?
Well, for a shared backend over a fabric, I'd say possibly. For an
internal connected set, I'd say no. Given what you are doing with
Gluster, I'd say that the additional expense/pain of setting up a
multipath scenario probably isn't worth it.

Gluster lets you get many of these benefits at a higher level in the
stack. Which to a degree, and in some use cases, obviates the need for
multipathing at a lower level. I'd still suggest real RAID at the lower
level (RAID6, and sometimes RAID10 make the most sense) for the backing
store.
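(As a sketch of what that looks like with md on the four disks in each server, one or the other, with device names assumed and chunk sizes left to tuning:)

# RAID10: capacity of 2 disks, good small-write behaviour
mdadm --create /dev/md0 --level=10 --raid-devices=4 --chunk=512 \
      /dev/sda /dev/sdb /dev/sdc /dev/sdd

# RAID6: also capacity of 2 disks on 4 drives, survives any two failures,
# but pays a read-modify-write penalty on small writes
mdadm --create /dev/md0 --level=6 --raid-devices=4 --chunk=512 \
      /dev/sda /dev/sdb /dev/sdc /dev/sdd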
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-26 21:48:22 UTC
Permalink
I am not sure how valid this performance URL is:

http://www.gluster.com/community/documentation/index.php/Guide_to_Optimizing_GlusterFS

Does it make sense to separate out the journal and create the filesystem with mkfs -I 256?

Also, if I already have a file system on a different partition, can I
still use it to store the journal for another partition without corrupting
that file system?

On Thu, Apr 21, 2011 at 7:23 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which
PERC is a rebadged LSI based on the 1068E chip.
Post by Mohit Anchlia
tool it supports. Finally I installed lsiutil and was able to change
the cache size.
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
 This looks like PERC.  These are roughly equivalent to the LSI 3081 series.
 These are not fast units.  There is a variant of this that does RAID6, its
usually available as a software update or plugin module (button?) to this.
 I might be thinking of the 1078 chip though.
 Regardless, these are fairly old designs.
Post by Mohit Anchlia
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s
I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.
So the software RAID is giving you 300 MB/s and the hardware 'RAID' is
giving you ~181 MB/s?  Seems a pretty simple choice :)
BTW: The 300MB/s could also be a limitation of the PCIe channel interconnect
(or worse, if they hung the chip off a PCIx bridge).  The motherboard
vendors are generally loathe to put more than a few PCIe lanes for handling
SATA, Networking, etc.  So typically you wind up with very low powered
'RAID' and 'SATA/SAS' on the motherboard, connected by PCIe x2 or x4 at
most.  A number of motherboards have NICs that are served by a single PCIe
x1 link.
Post by Mohit Anchlia
Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?
Well, for a shared backend over a fabric, I'd say possibly.  For an internal
connected set, I'd say no.  Given what you are doing with Gluster, I'd say
that the additional expense/pain of setting up a multipath scenario probably
isn't worth it.
Gluster lets you get many of these benefits at a higher level in the stack.
 Which to a degree, and in some use cases, obviates the need for
multipathing at a lower level.  I'd still suggest real RAID at the lower
level (RAID6, and sometimes RAID10 make the most sense) for the backing
store.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Joe Landman
2011-04-26 23:31:26 UTC
Permalink
Post by Mohit Anchlia
I am not sure how valid this performance url is
http://www.gluster.com/community/documentation/index.php/Guide_to_Optimizing_GlusterFS
Does it make sense to separate out the journal and create mkfs -I 256?
Also, if I already have a file system on a different partition can I
still use it to store journal from other partition without corrupting
the file system?
Journals are small-write heavy. You really want a raw device for them;
you do not want file system caching underneath them.

A raw partition for an external journal is best. Also, understand that
ext* suffers badly under intense parallel loads. Keep that in mind as
you make your file system choice.
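(With xfs that maps onto an external log device, i.e. a small raw partition on a disk the data array does not use; the device names below are placeholders:)

# put the xfs log on its own small raw partition
mkfs.xfs -l logdev=/dev/sde3,size=128m -d su=512k,sw=4 /dev/md0
# the log device must also be named at mount time
mount -o logdev=/dev/sde3,logbufs=8,logbsize=256k /dev/md0 /data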
Post by Mohit Anchlia
On Thu, Apr 21, 2011 at 7:23 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which
PERC is a rebadged LSI based on the 1068E chip.
Post by Mohit Anchlia
tool it supports. Finally I installed lsiutil and was able to change
the cache size.
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
This looks like PERC. These are roughly equivalent to the LSI 3081 series.
These are not fast units. There is a variant of this that does RAID6, its
usually available as a software update or plugin module (button?) to this.
I might be thinking of the 1078 chip though.
Regardless, these are fairly old designs.
Post by Mohit Anchlia
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s
I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.
So the software RAID is giving you 300 MB/s and the hardware 'RAID' is
giving you ~181 MB/s? Seems a pretty simple choice :)
BTW: The 300MB/s could also be a limitation of the PCIe channel interconnect
(or worse, if they hung the chip off a PCIx bridge). The motherboard
vendors are generally loathe to put more than a few PCIe lanes for handling
SATA, Networking, etc. So typically you wind up with very low powered
'RAID' and 'SATA/SAS' on the motherboard, connected by PCIe x2 or x4 at
most. A number of motherboards have NICs that are served by a single PCIe
x1 link.
Post by Mohit Anchlia
Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?
Well, for a shared backend over a fabric, I'd say possibly. For an internal
connected set, I'd say no. Given what you are doing with Gluster, I'd say
that the additional expense/pain of setting up a multipath scenario probably
isn't worth it.
Gluster lets you get many of these benefits at a higher level in the stack.
Which to a degree, and in some use cases, obviates the need for
multipathing at a lower level. I'd still suggest real RAID at the lower
level (RAID6, and sometimes RAID10 make the most sense) for the backing
store.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-26 23:43:50 UTC
Permalink
In your experience, does it really help to have the journal on a different
disk? Just trying to see if it's worth the effort. Also, Gluster
recommends creating the filesystem with a larger inode size (mkfs -I 256).

As always, thanks for the suggestion.
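(The -I 256 in that guide is the ext* inode size rather than the block size; larger inodes leave room for GlusterFS's extended attributes. As a sketch, with a placeholder device and the xfs equivalent alongside:)

# ext4 with 256-byte inodes, as the guide suggests
mkfs.ext4 -I 256 /dev/md0
# xfs equivalent: 512-byte inodes give the xattrs room in-inode
mkfs.xfs -i size=512 /dev/md0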

On Tue, Apr 26, 2011 at 4:31 PM, Joe Landman
Post by Mohit Anchlia
I am not sure how valid this performance url is
http://www.gluster.com/community/documentation/index.php/Guide_to_Optimizing_GlusterFS
Does it make sense to separate out the journal and create mkfs -I 256?
Also, if I already have a file system on a different partition can I
still use it to store journal from other partition without corrupting
the file system?
Journals are small write heavy.  You really want a raw device for them.  You
do not want file system caching underneath them.
Raw partition for an external journal is best.  Also, understand that ext*
suffers badly under intense parallel loads.  Keep that in mind as you make
your file system choice.
Post by Mohit Anchlia
On Thu, Apr 21, 2011 at 7:23 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which
PERC is a rebadged LSI based on the 1068E chip.
Post by Mohit Anchlia
tool it supports. Finally I installed lsiutil and was able to change
the cache size.
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
 This looks like PERC.  These are roughly equivalent to the LSI 3081 series.
 These are not fast units.  There is a variant of this that does RAID6, its
usually available as a software update or plugin module (button?) to this.
 I might be thinking of the 1078 chip though.
 Regardless, these are fairly old designs.
Post by Mohit Anchlia
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s
I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.
So the software RAID is giving you 300 MB/s and the hardware 'RAID' is
giving you ~181 MB/s?  Seems a pretty simple choice :)
BTW: The 300MB/s could also be a limitation of the PCIe channel interconnect
(or worse, if they hung the chip off a PCIx bridge).  The motherboard
vendors are generally loathe to put more than a few PCIe lanes for handling
SATA, Networking, etc.  So typically you wind up with very low powered
'RAID' and 'SATA/SAS' on the motherboard, connected by PCIe x2 or x4 at
most.  A number of motherboards have NICs that are served by a single PCIe
x1 link.
Post by Mohit Anchlia
Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?
Well, for a shared backend over a fabric, I'd say possibly.  For an internal
connected set, I'd say no.  Given what you are doing with Gluster, I'd say
that the additional expense/pain of setting up a multipath scenario probably
isn't worth it.
Gluster lets you get many of these benefits at a higher level in the stack.
 Which to a degree, and in some use cases, obviates the need for
multipathing at a lower level.  I'd still suggest real RAID at the lower
level (RAID6, and sometimes RAID10 make the most sense) for the backing
store.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Mohit Anchlia
2011-04-27 18:46:49 UTC
Permalink
What are some good controller cards you would recommend for SAS
drives? Dell and Areca are what I see suggested most often online.
Post by Mohit Anchlia
In your experience does it really help having journal on different
disk? Just trying to see if it's worth the effort. Also, Gluster also
recommends creating mkfs with larger blocks mkfs -I 256
As always thanks for the suggestion.
On Tue, Apr 26, 2011 at 4:31 PM, Joe Landman
Post by Mohit Anchlia
I am not sure how valid this performance url is
http://www.gluster.com/community/documentation/index.php/Guide_to_Optimizing_GlusterFS
Does it make sense to separate out the journal and create mkfs -I 256?
Also, if I already have a file system on a different partition can I
still use it to store journal from other partition without corrupting
the file system?
Journals are small write heavy.  You really want a raw device for them.  You
do not want file system caching underneath them.
Raw partition for an external journal is best.  Also, understand that ext*
suffers badly under intense parallel loads.  Keep that in mind as you make
your file system choice.
Post by Mohit Anchlia
On Thu, Apr 21, 2011 at 7:23 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which
PERC is a rebadged LSI based on the 1068E chip.
Post by Mohit Anchlia
tool it supports. Finally I installed lsiutil and was able to change
the cache size.
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
 This looks like PERC.  These are roughly equivalent to the LSI 3081 series.
 These are not fast units.  There is a variant of this that does RAID6, its
usually available as a software update or plugin module (button?) to this.
 I might be thinking of the 1078 chip though.
 Regardless, these are fairly old designs.
Post by Mohit Anchlia
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s
I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.
So the software RAID is giving you 300 MB/s and the hardware 'RAID' is
giving you ~181 MB/s?  Seems a pretty simple choice :)
BTW: The 300MB/s could also be a limitation of the PCIe channel interconnect
(or worse, if they hung the chip off a PCIx bridge).  The motherboard
vendors are generally loathe to put more than a few PCIe lanes for handling
SATA, Networking, etc.  So typically you wind up with very low powered
'RAID' and 'SATA/SAS' on the motherboard, connected by PCIe x2 or x4 at
most.  A number of motherboards have NICs that are served by a single PCIe
x1 link.
Post by Mohit Anchlia
Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?
Well, for a shared backend over a fabric, I'd say possibly.  For an internal
connected set, I'd say no.  Given what you are doing with Gluster, I'd say
that the additional expense/pain of setting up a multipath scenario probably
isn't worth it.
Gluster lets you get many of these benefits at a higher level in the stack.
 Which to a degree, and in some use cases, obviates the need for
multipathing at a lower level.  I'd still suggest real RAID at the lower
level (RAID6, and sometimes RAID10 make the most sense) for the backing
store.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
Burnash, James
2011-04-27 18:59:20 UTC
Permalink
We use HP controllers here - p800, p812. They're pretty good - but I believe they're fairly pricey (my sources say $600-$800 for the p812, depending on the options for battery and cache).

I use these controllers on my Gluster backend storage servers. Then again, we're an HP shop.

James Burnash, Unix Engineering

-----Original Message-----
From: gluster-users-***@gluster.org [mailto:gluster-users-***@gluster.org] On Behalf Of Mohit Anchlia
Sent: Wednesday, April 27, 2011 2:47 PM
To: ***@scalableinformatics.com
Cc: gluster-***@gluster.org
Subject: Re: [Gluster-users] Performance

What are some of the good controller cards would you recommend for SAS
drives? Dell and Areca is what I am seeing most suggested online.
Post by Mohit Anchlia
In your experience does it really help having journal on different
disk? Just trying to see if it's worth the effort. Also, Gluster also
recommends creating mkfs with larger blocks mkfs -I 256
As always thanks for the suggestion.
On Tue, Apr 26, 2011 at 4:31 PM, Joe Landman
Post by Mohit Anchlia
I am not sure how valid this performance url is
http://www.gluster.com/community/documentation/index.php/Guide_to_Optimizing_GlusterFS
Does it make sense to separate out the journal and create mkfs -I 256?
Also, if I already have a file system on a different partition can I
still use it to store journal from other partition without corrupting
the file system?
Journals are small write heavy.  You really want a raw device for them.  You
do not want file system caching underneath them.
Raw partition for an external journal is best.  Also, understand that ext*
suffers badly under intense parallel loads.  Keep that in mind as you make
your file system choice.
Post by Mohit Anchlia
On Thu, Apr 21, 2011 at 7:23 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which
PERC is a rebadged LSI based on the 1068E chip.
Post by Mohit Anchlia
tool it supports. Finally I installed lsiutil and was able to change
the cache size.
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
 This looks like PERC.  These are roughly equivalent to the LSI 3081 series.
 These are not fast units.  There is a variant of this that does RAID6, its
usually available as a software update or plugin module (button?) to this.
 I might be thinking of the 1078 chip though.
 Regardless, these are fairly old designs.
Post by Mohit Anchlia
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s
I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.
So the software RAID is giving you 300 MB/s and the hardware 'RAID' is
giving you ~181 MB/s?  Seems a pretty simple choice :)
BTW: The 300MB/s could also be a limitation of the PCIe channel interconnect
(or worse, if they hung the chip off a PCIx bridge).  The motherboard
vendors are generally loathe to put more than a few PCIe lanes for handling
SATA, Networking, etc.  So typically you wind up with very low powered
'RAID' and 'SATA/SAS' on the motherboard, connected by PCIe x2 or x4 at
most.  A number of motherboards have NICs that are served by a single PCIe
x1 link.
Post by Mohit Anchlia
Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?
Well, for a shared backend over a fabric, I'd say possibly.  For an internal
connected set, I'd say no.  Given what you are doing with Gluster, I'd say
that the additional expense/pain of setting up a multipath scenario probably
isn't worth it.
Gluster lets you get many of these benefits at a higher level in the stack.
 Which to a degree, and in some use cases, obviates the need for
multipathing at a lower level.  I'd still suggest real RAID at the lower
level (RAID6, and sometimes RAID10 make the most sense) for the backing
store.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Gluster-users mailing list
Gluster-***@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users


Mohit Anchlia
2011-05-03 18:44:03 UTC
Permalink
Does anyone know if a new controller card can be swapped in without
re-installing the OS?
Post by Burnash, James
We use HP controllers here - p800, p812. They're pretty good - but I believe they're fairly pricey (my sources say $600-$800 for the p812, depending on the options for battery and cache.
I use these controllers on my Gluster backend storage servers. Then again, we're an HP shop.
James Burnash, Unix Engineering
-----Original Message-----
Sent: Wednesday, April 27, 2011 2:47 PM
Subject: Re: [Gluster-users] Performance
What are some of the good controller cards would you recommend for SAS
drives? Dell and Areca is what I am seeing most suggested online.
Post by Mohit Anchlia
In your experience does it really help having journal on different
disk? Just trying to see if it's worth the effort. Also, Gluster also
recommends creating mkfs with larger blocks mkfs -I 256
As always thanks for the suggestion.
On Tue, Apr 26, 2011 at 4:31 PM, Joe Landman
Post by Mohit Anchlia
I am not sure how valid this performance url is
http://www.gluster.com/community/documentation/index.php/Guide_to_Optimizing_GlusterFS
Does it make sense to separate out the journal and create mkfs -I 256?
Also, if I already have a file system on a different partition can I
still use it to store journal from other partition without corrupting
the file system?
Journals are small write heavy.  You really want a raw device for them.  You
do not want file system caching underneath them.
Raw partition for an external journal is best.  Also, understand that ext*
suffers badly under intense parallel loads.  Keep that in mind as you make
your file system choice.
Post by Mohit Anchlia
On Thu, Apr 21, 2011 at 7:23 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which
PERC is a rebadged LSI based on the 1068E chip.
Post by Mohit Anchlia
tool it supports. Finally I installed lsiutil and was able to change
the cache size.
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
 This looks like PERC.  These are roughly equivalent to the LSI 3081 series.
 These are not fast units.  There is a variant of this that does RAID6, its
usually available as a software update or plugin module (button?) to this.
 I might be thinking of the 1078 chip though.
 Regardless, these are fairly old designs.
Post by Mohit Anchlia
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s
I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.
So the software RAID is giving you 300 MB/s and the hardware 'RAID' is
giving you ~181 MB/s?  Seems a pretty simple choice :)
BTW: The 300MB/s could also be a limitation of the PCIe channel interconnect
(or worse, if they hung the chip off a PCIx bridge).  The motherboard
vendors are generally loathe to put more than a few PCIe lanes for handling
SATA, Networking, etc.  So typically you wind up with very low powered
'RAID' and 'SATA/SAS' on the motherboard, connected by PCIe x2 or x4 at
most.  A number of motherboards have NICs that are served by a single PCIe
x1 link.
Post by Mohit Anchlia
Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?
Well, for a shared backend over a fabric, I'd say possibly.  For an internal
connected set, I'd say no.  Given what you are doing with Gluster, I'd say
that the additional expense/pain of setting up a multipath scenario probably
isn't worth it.
Gluster lets you get many of these benefits at a higher level in the stack.
 Which to a degree, and in some use cases, obviates the need for
multipathing at a lower level.  I'd still suggest real RAID at the lower
level (RAID6, and sometimes RAID10 make the most sense) for the backing
store.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web  : http://scalableinformatics.com
      http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Gluster-users mailing list
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Pavan
2011-05-04 05:51:17 UTC
Permalink
Post by Mohit Anchlia
Does anyone know if new controller cards can be replaced without
re-installing OS?
If your root disk is on that controller card, you might hit issues with
device paths when you move the root disk onto a new card. You are better
off not doing that.
Otherwise, replacing cards on a server is a routine task. And if the
server is hot-plug capable, you don't even have to reboot the OS.
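(One way to take most of the device-path risk out of a card swap is to reference filesystems by UUID or label rather than by /dev/sdX name; a sketch, using the root-disk partitions shown in the earlier fdisk output:)

# list the filesystem UUIDs of the root-disk partitions
blkid /dev/sde1 /dev/sde2
# then use those in /etc/fstab (and the bootloader config) instead of device names, e.g.
#   UUID=<uuid-from-blkid>  /boot  ext3  defaults  1 2
# LVM volumes are found by scanning at boot, so they already survive a path change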

Pavan
Post by Mohit Anchlia
Post by Burnash, James
We use HP controllers here - p800, p812. They're pretty good - but I believe they're fairly pricey (my sources say $600-$800 for the p812, depending on the options for battery and cache.
I use these controllers on my Gluster backend storage servers. Then again, we're an HP shop.
James Burnash, Unix Engineering
-----Original Message-----
Sent: Wednesday, April 27, 2011 2:47 PM
Subject: Re: [Gluster-users] Performance
What are some of the good controller cards would you recommend for SAS
drives? Dell and Areca is what I am seeing most suggested online.
Post by Mohit Anchlia
In your experience does it really help having journal on different
disk? Just trying to see if it's worth the effort. Also, Gluster also
recommends creating mkfs with larger blocks mkfs -I 256
As always thanks for the suggestion.
On Tue, Apr 26, 2011 at 4:31 PM, Joe Landman
Post by Mohit Anchlia
I am not sure how valid this performance url is
http://www.gluster.com/community/documentation/index.php/Guide_to_Optimizing_GlusterFS
Does it make sense to separate out the journal and create mkfs -I 256?
Also, if I already have a file system on a different partition can I
still use it to store journal from other partition without corrupting
the file system?
Journals are small write heavy. You really want a raw device for them. You
do not want file system caching underneath them.
Raw partition for an external journal is best. Also, understand that ext*
suffers badly under intense parallel loads. Keep that in mind as you make
your file system choice.
Post by Mohit Anchlia
On Thu, Apr 21, 2011 at 7:23 PM, Joe Landman
Post by Joe Landman
Post by Mohit Anchlia
After lot of digging today finaly figured out that it's not really
using PERC controller but some Fusion MPT. Then it wasn't clear which
PERC is a rebadged LSI based on the 1068E chip.
Post by Mohit Anchlia
tool it supports. Finally I installed lsiutil and was able to change
the cache size.
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E
PCI-Express Fusion-MPT SAS (rev 08)
This looks like PERC. These are roughly equivalent to the LSI 3081 series.
These are not fast units. There is a variant of this that does RAID6, its
usually available as a software update or plugin module (button?) to this.
I might be thinking of the 1078 chip though.
Regardless, these are fairly old designs.
Post by Mohit Anchlia
oflag=direct
1024+0 records in
1024+0 records out
134217728 bytes (134 MB) copied, 0.742517 seconds, 181 MB/s
I compared this with SW RAID mdadm that I created yesterday on one of
the servers and I get around 300MB/s. I will test out first with what
we have before destroying and testing with mdadm.
So the software RAID is giving you 300 MB/s and the hardware 'RAID' is
giving you ~181 MB/s? Seems a pretty simple choice :)
BTW: The 300MB/s could also be a limitation of the PCIe channel interconnect
(or worse, if they hung the chip off a PCIx bridge). The motherboard
vendors are generally loathe to put more than a few PCIe lanes for handling
SATA, Networking, etc. So typically you wind up with very low powered
'RAID' and 'SATA/SAS' on the motherboard, connected by PCIe x2 or x4 at
most. A number of motherboards have NICs that are served by a single PCIe
x1 link.
Post by Mohit Anchlia
Thanks for your help that led me to this path. Another question I had
was when creating mdadm RAID does it make sense to use multipathing?
Well, for a shared backend over a fabric, I'd say possibly. For an internal
connected set, I'd say no. Given what you are doing with Gluster, I'd say
that the additional expense/pain of setting up a multipath scenario probably
isn't worth it.
Gluster lets you get many of these benefits at a higher level in the stack.
Which to a degree, and in some use cases, obviates the need for
multipathing at a lower level. I'd still suggest real RAID at the lower
level (RAID6, and sometimes RAID10 make the most sense) for the backing
store.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Gluster-users mailing list
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
Mohit Anchlia
2011-05-13 22:17:17 UTC
Permalink
I got a new card with 512MB cache, but the current setting is:

Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write
Cache if Bad BBU

Does it make sense to enable ReadAhead? I was just going to change the write policy.
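If the new card is an LSI MegaRAID-class controller managed with MegaCli (an
assumption based on the policy string above -- the exact tool and option
spelling vary by card and firmware), flipping the policies looks roughly like
this; write-back is only safe with a healthy, charged BBU:

# show the current cache policy for all logical drives on adapter 0
MegaCli -LDGetProp -Cache -LALL -aALL
# switch the write policy to write-back (with "No Write Cache if Bad BBU" it
# falls back to write-through automatically while the battery is bad)
MegaCli -LDSetProp WB -LALL -aALL
# enable adaptive read-ahead, if you decide it helps your workload
MegaCli -LDSetProp ADRA -LALL -aALL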


Jan Písačka
2011-05-18 07:55:08 UTC
Permalink
... Also, understand that ext* suffers badly under intense parallel
loads. Keep that in mind as you make your file system choice.
What sort of suffering can I expect? Is it relevant only to high-performance
RAID[56] systems, or also to single-hard-drive test cases? I made
a quick local write test with fio on a 1TB SATA disk and could not
see any problem. The configuration was identical for both the ext4 and xfs
filesystems:

Ext4 filesystem:

[ext4w]
rw=write
size=4g
directory=/mnt/export
iodepth=16
direct=0
blocksize=512k
numjobs=8
ioengine=vsync
create_on_open=1

WRITE: io=32708MB, aggrb=100023KB/s, minb=102424KB/s, maxb=102424KB/s,
mint=334850msec, maxt=334850msec

Disk stats (read/write):
sdb: ios=11/63707, merge=0/7971550, ticks=2347/46130116,
in_queue=46203840, util=99.82%


XFS filesystem:

[xfsw]
(all other parameters identical to the ext4 job above)


WRITE: io=32708MB, aggrb=88095KB/s, minb=90209KB/s, maxb=90209KB/s,
mint=380191msec, maxt=380191msec

Disk stats (read/write):
sdb: ios=12/71594, merge=0/2191, ticks=2326/54558130,
in_queue=54641400, util=99.92%


With on-disk write-caching = 0 (off):

ext4:
WRITE: io=32708MB, aggrb=34179KB/s, minb=35000KB/s, maxb=35000KB/s,
mint=979907msec, maxt=979907msec

Disk stats (read/write):
sdb: ios=20/164893, merge=0/16622549, ticks=1953/64535056,
in_queue=64536990, util=97.41%

xfs:
WRITE: io=32708MB, aggrb=42452KB/s, minb=43471KB/s, maxb=43471KB/s,
mint=788954msec, maxt=788954msec

Disk stats (read/write):
sdb: ios=4/71372, merge=0/2292, ticks=1236/114143050,
in_queue=114306843, util=99.89%



Is the load sufficient? The system has 8GB of memory. Thanks for any
response.

Jan

Joe Landman
2011-04-20 19:09:02 UTC
Permalink
Post by Mohit Anchlia
Correct. I was referring to the gluster performance test that I am running
using the gluster client, which spreads the load across all the servers.
Mohit Anchlia
2011-04-20 19:12:23 UTC
Permalink
I guess I was not clear enough :) I am running multiple concurrent
threads (in parallel) writing actual files, and the OVERALL combined
throughput (from all the threads) never goes above 20 MB/s. It's not just
one single dd running at a time.

gluster volume info all

Volume Name: stress-volume
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: dsdb1:/data/gluster
Brick2: dsdb2:/data/gluster
Brick3: dsdb3:/data/gluster
Brick4: dsdb4:/data/gluster
Brick5: dsdb5:/data/gluster
Brick6: dsdb6:/data/gluster


On Wed, Apr 20, 2011 at 12:09 PM, Joe Landman
Post by Mohit Anchlia
Correct. I was referring to the gluster performance test that I am running
using the gluster client, which spreads the load across all the servers.
Joe Landman
2011-04-20 19:27:39 UTC
Permalink
Post by Mohit Anchlia
I guess I was not clear enough :) I am running multiple concurrent
threads (in parallel) writing actual files, and the OVERALL combined
throughput (from all the threads) never goes above 20 MB/s. It's not just
one single dd running at a time.
gluster volume info all
Volume Name: stress-volume
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Brick1: dsdb1:/data/gluster
Brick2: dsdb2:/data/gluster
Brick3: dsdb3:/data/gluster
Brick4: dsdb4:/data/gluster
Brick5: dsdb5:/data/gluster
Brick6: dsdb6:/data/gluster
Ok ... your write bandwidth is an example of something called a zero-sum
game. No matter how many threads you throw at it, the sum of the per-thread
bandwidths will never exceed the system's maximum bandwidth.

In the simplest case, you have one write thread, which gets 20 MB/s.
With two simultaneous writers, each may get ~10 MB/s; four simultaneous
writers would each get about 5 MB/s.

Now add in the mirroring operation while doing the write. You have one
write and one mirror operation going. That's 10 MB/s per stream: 10 for
the write, and 10 for the mirror.

What I am getting at is that mirroring is not cheap at the bandwidth
level of the system. Given the low performance of this unit to begin
with, mirroring is going to be problematic in terms of the bandwidth it
consumes and thus leaves available for other processes.

Take the fio input decks I indicated in the previous post, and vary the
number of jobs from 1 to, say, 24. Please report what happens to the
aggregate write bandwidth (fio will report it). You only need a few
points, say 1, 4, 8, 12, 16, 20 and 24, to see the effect.

Assuming my guess is correct, the bandwidth won't stay constant; it will
drop rapidly as the write threads oversubscribe the disks. You will get two
physical writes for every file-system write, so at some point the aggregate
bandwidth falls off sharply.
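A simple way to run that sweep is a shell loop over numjobs, grepping out the
aggregate line fio prints. The job parameters and the test directory below are
only a placeholder sketch (small size so the sweep finishes quickly), not the
original input decks; point --directory at whichever mount you are testing:

# sweep the number of concurrent writers and record the aggregate write bandwidth
for j in 1 4 8 12 16 20 24; do
    fio --name=sweep --rw=write --size=1g --blocksize=512k \
        --ioengine=vsync --iodepth=16 --direct=0 \
        --directory=/mnt/testdir --numjobs=$j --group_reporting \
        | grep -E 'WRITE|aggrb' | sed "s/^/numjobs=$j /"
done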
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: ***@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615