Discussion:
[Gluster-users] disperse volume file to subvolume mapping
Serkan Çoban
2016-04-18 11:39:10 UTC
Hi, I have a problem where clients are using only 1/3 of the nodes in a
disperse volume for writing.
I am testing from 50 clients using 1 to 10 threads with file names like part-0-xxxx.
What I see is that clients only use 20 nodes for writing. How is the file
name to subvolume hashing done? Is this related to the file names being
similar?

My cluster is 3.7.10 with 60 nodes, each with 26 disks. The disperse volume
is 78 x (16+4). Only 26 out of 78 subvolumes are used during writes.
Xavier Hernandez
2016-04-19 10:05:16 UTC
Hi Serkan,

Moved to gluster-users since this doesn't belong on the devel list.
I am copying 10,000 files to the gluster volume using mapreduce on the
clients. Each map process takes one file at a time and copies it to the
gluster volume.
I assume that gluster is used to store the intermediate files before the
reduce phase.
My disperse volume consists of 78 subvolumes of 16+4 disks each. So if I
copy >78 files in parallel I expect each file to go to a different subvolume,
right?
If you only copy 78 files, most probably you will get some subvolumes
empty and some others with more than one or two files. It's not an exact
distribution, it's a statistically balanced distribution: over time and
with enough files, each brick will contain a number of files of the
same order of magnitude, but they won't have the *same* number of files.
In my tests with fio I can see every file go to a
different subvolume, but when I start the mapreduce process from the clients
only 78/3=26 subvolumes are used for writing files.
This means that this is caused by some peculiarity of the mapreduce.
I see that clearly from the network traffic. Mapreduce on the client side can
be run multi-threaded. I tested with 1, 5 and 10 threads on each client, but
every time only 26 subvolumes were used.
How can I debug the issue further?
You should look at which files are created in each brick, and how many,
while the process is running.
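For example, something like this run on each server should give a rough
per-brick file count (adjust the path to wherever your bricks are mounted;
just a quick sketch, not a definitive check):

for b in /bricks/*; do echo "$b: $(find "$b" -type f | wc -l)"; done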

Xavi
On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez wrote:
Hi Serkan,
Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same behavior.
50 clients are copying files named part-0-xxxx to gluster using mapreduce,
with one thread per server, and they are using only 20 servers out of
60. On the other hand, fio tests use all the servers. Is there anything I
can do to solve the issue?
Distribution of files to ec sets is done by DHT. In theory, if you create
many files, each ec set will receive the same number of files. However, when
the number of files is small enough, statistics can fail.
Not sure what you are doing exactly, but a mapreduce procedure generally
only creates a single output. In that case it makes sense that only one ec
set is used. If you want to use all ec sets for a single file, you should
enable sharding (I haven't tested that) or split the result into multiple
files.
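(For reference, enabling sharding should be something along the lines of the
command below; untested here, and I haven't verified how sharding behaves on
a disperse volume:)

gluster volume set <volname> features.shard on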
Xavi
Serkan Çoban
2016-04-19 13:16:40 UTC
Post by Xavier Hernandez
I assume that gluster is used to store the intermediate files before the reduce phase
Nope, gluster is the destination for the distcp command: hadoop distcp -m
50 http://nn1:8020/path/to/folder file:///mnt/gluster
This runs maps on the datanodes, which all have /mnt/gluster mounted.
Post by Xavier Hernandez
This means that this is caused by some peculiarity of the mapreduce.
Yes, but how can a client write 500 files to the gluster mount and have
those files written to only a subset of the subvolumes? I cannot use gluster
as a backup cluster if I cannot write with distcp.
Post by Xavier Hernandez
You should look at which files are created in each brick, and how many, while the process is running.
Files are only created on nodes 185..204, or 205..224, or 225..244. Only on
20 nodes in each test.
Xavier Hernandez
2016-04-20 06:34:35 UTC
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
I assume that gluster is used to store the intermediate files before the reduce phase
Nope, gluster is the destination for the distcp command: hadoop distcp -m
50 http://nn1:8020/path/to/folder file:///mnt/gluster
This runs maps on the datanodes, which all have /mnt/gluster mounted.
I don't know hadoop, so I'm of little help here. However, it seems that
-m 50 means executing 50 copies in parallel. This means that even if
the distribution worked fine, at most 50 (probably fewer) of the 78
ec sets would be used in parallel.
Post by Serkan Çoban
Post by Xavier Hernandez
This means that this is caused by some peculiarity of the mapreduce.
Yes, but how can a client write 500 files to the gluster mount and have
those files written to only a subset of the subvolumes? I cannot use gluster
as a backup cluster if I cannot write with distcp.
Were all 500 files created on only one of the 78 ec sets, with the
remaining 77 left empty?
Post by Serkan Çoban
Post by Xavier Hernandez
You should look at which files are created in each brick, and how many, while the process is running.
Files are only created on nodes 185..204, or 205..224, or 225..244. Only on
20 nodes in each test.
How many files were there in each brick?

Not sure if this can be related, but standard linux distributions have a
default limit of 1024 open file descriptors. With such a big volume and
a massive copy going on, maybe this limit is affecting something?
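A quick way to see the limit actually in effect (just a sanity check; the
<brick-pid> placeholder below is whatever pid your brick process has):

ulimit -n
cat /proc/<brick-pid>/limits | grep 'open files'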

Are there any error or warning messages in the mount or bricks logs ?

Xavi
Serkan Çoban
2016-04-20 12:13:20 UTC
Here are the steps I follow in detail and the relevant output from the bricks:

I am using the command below for volume creation:
gluster volume create v0 disperse 20 redundancy 4 \
1.1.1.{185..204}:/bricks/02 \
1.1.1.{205..224}:/bricks/02 \
1.1.1.{225..244}:/bricks/02 \
1.1.1.{185..204}:/bricks/03 \
1.1.1.{205..224}:/bricks/03 \
1.1.1.{225..244}:/bricks/03 \
1.1.1.{185..204}:/bricks/04 \
1.1.1.{205..224}:/bricks/04 \
1.1.1.{225..244}:/bricks/04 \
1.1.1.{185..204}:/bricks/05 \
1.1.1.{205..224}:/bricks/05 \
1.1.1.{225..244}:/bricks/05 \
1.1.1.{185..204}:/bricks/06 \
1.1.1.{205..224}:/bricks/06 \
1.1.1.{225..244}:/bricks/06 \
1.1.1.{185..204}:/bricks/07 \
1.1.1.{205..224}:/bricks/07 \
1.1.1.{225..244}:/bricks/07 \
1.1.1.{185..204}:/bricks/08 \
1.1.1.{205..224}:/bricks/08 \
1.1.1.{225..244}:/bricks/08 \
1.1.1.{185..204}:/bricks/09 \
1.1.1.{205..224}:/bricks/09 \
1.1.1.{225..244}:/bricks/09 \
1.1.1.{185..204}:/bricks/10 \
1.1.1.{205..224}:/bricks/10 \
1.1.1.{225..244}:/bricks/10 \
1.1.1.{185..204}:/bricks/11 \
1.1.1.{205..224}:/bricks/11 \
1.1.1.{225..244}:/bricks/11 \
1.1.1.{185..204}:/bricks/12 \
1.1.1.{205..224}:/bricks/12 \
1.1.1.{225..244}:/bricks/12 \
1.1.1.{185..204}:/bricks/13 \
1.1.1.{205..224}:/bricks/13 \
1.1.1.{225..244}:/bricks/13 \
1.1.1.{185..204}:/bricks/14 \
1.1.1.{205..224}:/bricks/14 \
1.1.1.{225..244}:/bricks/14 \
1.1.1.{185..204}:/bricks/15 \
1.1.1.{205..224}:/bricks/15 \
1.1.1.{225..244}:/bricks/15 \
1.1.1.{185..204}:/bricks/16 \
1.1.1.{205..224}:/bricks/16 \
1.1.1.{225..244}:/bricks/16 \
1.1.1.{185..204}:/bricks/17 \
1.1.1.{205..224}:/bricks/17 \
1.1.1.{225..244}:/bricks/17 \
1.1.1.{185..204}:/bricks/18 \
1.1.1.{205..224}:/bricks/18 \
1.1.1.{225..244}:/bricks/18 \
1.1.1.{185..204}:/bricks/19 \
1.1.1.{205..224}:/bricks/19 \
1.1.1.{225..244}:/bricks/19 \
1.1.1.{185..204}:/bricks/20 \
1.1.1.{205..224}:/bricks/20 \
1.1.1.{225..244}:/bricks/20 \
1.1.1.{185..204}:/bricks/21 \
1.1.1.{205..224}:/bricks/21 \
1.1.1.{225..244}:/bricks/21 \
1.1.1.{185..204}:/bricks/22 \
1.1.1.{205..224}:/bricks/22 \
1.1.1.{225..244}:/bricks/22 \
1.1.1.{185..204}:/bricks/23 \
1.1.1.{205..224}:/bricks/23 \
1.1.1.{225..244}:/bricks/23 \
1.1.1.{185..204}:/bricks/24 \
1.1.1.{205..224}:/bricks/24 \
1.1.1.{225..244}:/bricks/24 \
1.1.1.{185..204}:/bricks/25 \
1.1.1.{205..224}:/bricks/25 \
1.1.1.{225..244}:/bricks/25 \
1.1.1.{185..204}:/bricks/26 \
1.1.1.{205..224}:/bricks/26 \
1.1.1.{225..244}:/bricks/26 \
1.1.1.{185..204}:/bricks/27 \
1.1.1.{205..224}:/bricks/27 \
1.1.1.{225..244}:/bricks/27 force

Then I mount the volume on the 50 clients:
mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster

Then I make a directory from one of the clients and chmod it:
mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1

Then I start distcp on the clients. There are 1059 x 8.8GB files in one
folder, and they will be copied to /mnt/gluster/s1 with 100 parallel maps,
which means 2 copy jobs per client at the same time:
hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1

After the job finished, here is the status of the s1 directory on the bricks:
The s1 directory is present in all 1560 bricks.
The s1/teragen-10tb folder is present in all 1560 bricks.

Full listing of the files in the bricks:
https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0

You can ignore the .crc files in the brick output above; they are
checksum files.

As you can see, the part-m-xxxx files were written to only some bricks on
nodes 0205..0224. All of the bricks have some files, but those files have
zero size.

I increased the file descriptor limit to 65k, so that is not the issue.
Xavier Hernandez
2016-04-21 07:00:39 UTC
Hi Serkan,

I think the problem is in the temporary name that distcp gives to the
file while it's being copied, before renaming it to the real name. Do you
know what the structure of this name is?

DHT selects the subvolume (in this case the ec set) on which the file
will be stored based on the name of the file. This has a problem when a
file is being renamed, because this could change the subvolume where the
file should be found.

DHT has a feature to avoid incorrect file placements when executing
renames for the rsync case. What it does is check whether the file name
matches the following regular expression:

^\.(.+)\.[^.]+$

If a match is found, it only considers the part between parentheses to
calculate the destination subvolume.

This is useful for rsync because temporary file names are constructed in
the following way: suppose the original filename is 'test'. The
temporary filename while rsync is being executed is made by prepending a
dot and appending '.<random chars>': .test.712hd

As you can see, the original name and the part of the name between
parentheses that matches the regular expression are the same. This means
that, after renaming the temporary file to its original filename, both
files will be considered by DHT to belong to the same subvolume.
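Just to illustrate how the regular expression behaves (a quick manual check
with sed, not something DHT itself runs):

echo ".test.712hd" | sed -E 's/^\.(.+)\.[^.]+$/\1/'    # prints "test"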

In your case it's very probable that distcp uses a temporary name like
'.part.<number>'. In this case the portion of the name used to select
the subvolume is always 'part'. This would explain why all files go to
the same subvolume. Once the file is renamed to its final name, DHT
realizes that it should go to another subvolume. At this point it creates
a link file (one of those files with access rights = '---------T') in the
correct subvolume, but it doesn't move the data. As you can see, these
link files are better balanced.
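If you want to see them, something like this should list the link files on a
brick (brick path taken from your setup; they should also carry a
trusted.glusterfs.dht.linkto xattr):

find /bricks/02 -type f -perm 1000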

To solve this problem you have three options:

1. change the temporary filename used by distcp to correctly match the
regular expression. I'm not sure if this can be configured, but if this
is possible, this is the best option.

2. define the option 'extra-hash-regex' with an expression that matches
your temporary file names and returns the same name that the file will
finally have. Depending on the differences between the original and
temporary file names, this option could be useless.

3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name conversion, so the files will be evenly distributed. However, this
will cause a lot of files to be placed in incorrect subvolumes, creating
a lot of link files until a rebalance is executed.

Xavi
Serkan Çoban
2016-04-21 08:07:12 UTC
I think the problem is in the temporary name that distcp gives to the file while it's being copied, before renaming it to the real name. Do you know what the structure of this name is?
The distcp temporary file name format is
".distcp.tmp.attempt_1460381790773_0248_m_000001_0", and the same
temporary file name is used by one map process. For example, I see in the
logs that one map copies the files part-m-00031, part-m-00047 and part-m-00063
sequentially, and they all use the same temporary file name above. So no
original file name appears in the temporary file name.

I will check if we can modify the distcp behaviour, or whether we have to
write our own mapreduce procedures instead of using distcp.
2. define the option 'extra-hash-regex' with an expression that matches your temporary file names and returns the same name that the file will finally have. Depending on the differences between the original and temporary file names, this option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name conversion, so the files will be evenly distributed. However, this will cause a lot of files to be placed in incorrect subvolumes, creating a lot of link files until a rebalance is executed.
How can I set these options?



Xavier Hernandez
2016-04-21 08:24:58 UTC
Hi Serkan,
Post by Serkan Çoban
I think the problem is in the temporary name that distcp gives to the file while it's being copied, before renaming it to the real name. Do you know what the structure of this name is?
".distcp.tmp.attempt_1460381790773_0248_m_000001_0", and the same
temporary file name is used by one map process. For example, I see in the
logs that one map copies the files part-m-00031, part-m-00047 and part-m-00063
sequentially, and they all use the same temporary file name above. So no
original file name appears in the temporary file name.
This explains the problem. With the default options, DHT sends all files
to the subvolume that should store a file named 'distcp.tmp'.

With this temporary name format, little can be done.
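For illustration, applying the same regular expression by hand to that
temporary name (again just with sed):

echo ".distcp.tmp.attempt_1460381790773_0248_m_000001_0" | sed -E 's/^\.(.+)\.[^.]+$/\1/'
# prints "distcp.tmp", so every temporary file hashes to the same subvolume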
Post by Serkan Çoban
I will check if we can modify the distcp behaviour, or whether we have to
write our own mapreduce procedures instead of using distcp.
2. define the option 'extra-hash-regex' with an expression that matches your temporary file names and returns the same name that the file will finally have. Depending on the differences between the original and temporary file names, this option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name conversion, so the files will be evenly distributed. However, this will cause a lot of files to be placed in incorrect subvolumes, creating a lot of link files until a rebalance is executed.
How can I set these options?
You can set gluster options using:

gluster volume set <volname> <option> <value>

for example:

gluster volume set v0 rsync-hash-regex none

Xavi
Serkan Çoban
2016-04-21 10:39:41 UTC
I started a 'gluster v rebalance v0 start' command hoping that it would
equally redistribute the files across the 60 nodes, but it did not do that...
Why didn't it redistribute the files? Any thoughts?

On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
I think the problem is in the temporary name that distcp gives to the
file while it's being copied before renaming it to the real name. Do you
know what is the structure of this name ?
".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
temporary file name used by one map process. For example I see in the
logs that one map copies files part-m-00031,part-m-00047,part-m-00063
sequentially and they all use same temporary file name above. So no
original file name appears in temporary file name.
This explains the problem. With the default options, DHT sends all files to
the subvolume that should store a file named 'distcp.tmp'.
With this temporary name format, little can be done.
Post by Serkan Çoban
I will check if we can modify distcp behaviour, or we have to write
our mapreduce procedures instead of using distcp.
Post by Xavier Hernandez
2. define the option 'extra-hash-regex' to an expression that matches
your temporary file names and returns the same name that will finally have.
Depending on the differences between original and temporary file names, this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name conversion, so the files will be evenly distributed. However this will
cause a lot of files placed in incorrect subvolumes, creating a lot of link
files until a rebalance is executed.
How can I set these options?
gluster volume set <volname> <option> <value>
gluster volume set v0 rsync-hash-regex none
Xavi
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
I think the problem is in the temporary name that distcp gives to the file
while it's being copied before renaming it to the real name. Do you know
what is the structure of this name ?
DHT selects the subvolume (in this case the ec set) on which the file will
be stored based on the name of the file. This has a problem when a file is
being renamed, because this could change the subvolume where the file should
be found.
DHT has a feature to avoid incorrect file placements when executing renames
for the rsync case. What it does is to check if the file matches the
^\.(.+)\.[^.]+$
If a match is found, it only considers the part between parenthesis to
calculate the destination subvolume.
This is useful for rsync because temporary file names are constructed in the
following way: suppose the original filename is 'test'. The temporary
filename while rsync is being executed is made by prepending a dot and
appending '.<random chars>': .test.712hd
As you can see, the original name and the part of the name between parentheses that matches the regular expression are the same. This means that, after renaming the temporary file to its original filename, both names will be considered by DHT to belong to the same subvolume.
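As a quick illustration, a reasonably recent sed can show what that regular expression captures for the names mentioned in this thread (a rough sketch; the exact flag may differ between GNU and BSD sed):

echo ".test.712hd" | sed -E 's/^\.(.+)\.[^.]+$/\1/'
# -> test (the captured part equals the final file name)
echo ".distcp.tmp.attempt_1460381790773_0248_m_000001_0" | sed -E 's/^\.(.+)\.[^.]+$/\1/'
# -> distcp.tmp (the greedy capture stops at the last dot, so every distcp
#    temporary file hashes as the constant name 'distcp.tmp')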
In your case it's very probable that distcp uses a temporary name like
'.part.<number>'. In this case the portion of the name used to select the
subvolume is always 'part'. This would explain why all files go to the same
subvolume. Once the file is renamed to another name, DHT realizes that it
should go to another subvolume. At this point it creates a link file (one of those files with access rights '---------T') in the correct subvolume, but it doesn't move the data. As you can see, these link files are better balanced.
There are a few possible workarounds:
1. change the temporary filename used by distcp to correctly match the regular expression. I'm not sure if this can be configured, but if it is possible, this is the best option.
2. define the option 'extra-hash-regex' to an expression that matches your temporary file names and returns the same name that the file will finally have. Depending on the differences between the original and temporary file names, this option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name conversion, so the files will be evenly distributed. However, this will cause a lot of files to be placed in incorrect subvolumes, creating a lot of link files until a rebalance is executed.
Xavi
Post by Serkan Çoban
gluster volume create v0 disperse 20 redundancy 4 \
1.1.1.{185..204}:/bricks/02 \
1.1.1.{205..224}:/bricks/02 \
1.1.1.{225..244}:/bricks/02 \
1.1.1.{185..204}:/bricks/03 \
1.1.1.{205..224}:/bricks/03 \
1.1.1.{225..244}:/bricks/03 \
1.1.1.{185..204}:/bricks/04 \
1.1.1.{205..224}:/bricks/04 \
1.1.1.{225..244}:/bricks/04 \
1.1.1.{185..204}:/bricks/05 \
1.1.1.{205..224}:/bricks/05 \
1.1.1.{225..244}:/bricks/05 \
1.1.1.{185..204}:/bricks/06 \
1.1.1.{205..224}:/bricks/06 \
1.1.1.{225..244}:/bricks/06 \
1.1.1.{185..204}:/bricks/07 \
1.1.1.{205..224}:/bricks/07 \
1.1.1.{225..244}:/bricks/07 \
1.1.1.{185..204}:/bricks/08 \
1.1.1.{205..224}:/bricks/08 \
1.1.1.{225..244}:/bricks/08 \
1.1.1.{185..204}:/bricks/09 \
1.1.1.{205..224}:/bricks/09 \
1.1.1.{225..244}:/bricks/09 \
1.1.1.{185..204}:/bricks/10 \
1.1.1.{205..224}:/bricks/10 \
1.1.1.{225..244}:/bricks/10 \
1.1.1.{185..204}:/bricks/11 \
1.1.1.{205..224}:/bricks/11 \
1.1.1.{225..244}:/bricks/11 \
1.1.1.{185..204}:/bricks/12 \
1.1.1.{205..224}:/bricks/12 \
1.1.1.{225..244}:/bricks/12 \
1.1.1.{185..204}:/bricks/13 \
1.1.1.{205..224}:/bricks/13 \
1.1.1.{225..244}:/bricks/13 \
1.1.1.{185..204}:/bricks/14 \
1.1.1.{205..224}:/bricks/14 \
1.1.1.{225..244}:/bricks/14 \
1.1.1.{185..204}:/bricks/15 \
1.1.1.{205..224}:/bricks/15 \
1.1.1.{225..244}:/bricks/15 \
1.1.1.{185..204}:/bricks/16 \
1.1.1.{205..224}:/bricks/16 \
1.1.1.{225..244}:/bricks/16 \
1.1.1.{185..204}:/bricks/17 \
1.1.1.{205..224}:/bricks/17 \
1.1.1.{225..244}:/bricks/17 \
1.1.1.{185..204}:/bricks/18 \
1.1.1.{205..224}:/bricks/18 \
1.1.1.{225..244}:/bricks/18 \
1.1.1.{185..204}:/bricks/19 \
1.1.1.{205..224}:/bricks/19 \
1.1.1.{225..244}:/bricks/19 \
1.1.1.{185..204}:/bricks/20 \
1.1.1.{205..224}:/bricks/20 \
1.1.1.{225..244}:/bricks/20 \
1.1.1.{185..204}:/bricks/21 \
1.1.1.{205..224}:/bricks/21 \
1.1.1.{225..244}:/bricks/21 \
1.1.1.{185..204}:/bricks/22 \
1.1.1.{205..224}:/bricks/22 \
1.1.1.{225..244}:/bricks/22 \
1.1.1.{185..204}:/bricks/23 \
1.1.1.{205..224}:/bricks/23 \
1.1.1.{225..244}:/bricks/23 \
1.1.1.{185..204}:/bricks/24 \
1.1.1.{205..224}:/bricks/24 \
1.1.1.{225..244}:/bricks/24 \
1.1.1.{185..204}:/bricks/25 \
1.1.1.{205..224}:/bricks/25 \
1.1.1.{225..244}:/bricks/25 \
1.1.1.{185..204}:/bricks/26 \
1.1.1.{205..224}:/bricks/26 \
1.1.1.{225..244}:/bricks/26 \
1.1.1.{185..204}:/bricks/27 \
1.1.1.{205..224}:/bricks/27 \
1.1.1.{225..244}:/bricks/27 force
mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
then I make a directory from one of the clients and chmod it.
mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
Then I start distcp on the clients. There are 1059 x 8.8GB files in one folder, and they will be copied to /mnt/gluster/s1 with 100 parallel mappers, which means 2 copy jobs per client at the same time.
hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb
file:///mnt/gluster/s1
The s1 directory is present in all 1560 bricks.
The s1/teragen-10tb folder is present in all 1560 bricks.
https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
You can ignore the .crc files in the brick output above; they are checksum files...
As you can see, the part-m-xxxx files are written to only some bricks, on nodes 0205..0224.
All bricks have some files, but they have zero size.
I increased the file descriptor limit to 65k, so that is not the issue...
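To see the distribution per brick while the copy is running, something like the following could be run on each storage node (a rough sketch; brick paths taken from the volume definition above):

# count the part files that have landed on each local brick so far
for b in /bricks/*/s1/teragen-10tb; do
    echo "$b: $(find "$b" -maxdepth 1 -type f -name 'part-m-*' | wc -l)"
done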
On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
I assume that gluster is used to store the intermediate files before
the reduce phase
Nope, gluster is the destination for the distcp command: hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
This runs maps on the datanodes, all of which have /mnt/gluster mounted.
I don't know hadoop, so I'm of little help here. However, it seems that '-m 50' means to execute 50 copies in parallel. This means that even if the distribution worked fine, at most 50 (most probably fewer) of the 78 ec sets would be used in parallel.
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
This means that this is caused by some peculiarity of the mapreduce.
Yes, but how can a client write 500 files to the gluster mount and have those files written to only a subset of the subvolumes? I cannot use gluster as a backup cluster if I cannot write with distcp.
Were all 500 files created on only one of the 78 ec sets, with the remaining 77 left empty?
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
You should look which files are created in each brick and how many
while the process is running.
Files were only created on nodes 185..204, or 205..224, or 225..244. Only on 20 nodes in each test.
How many files were there in each brick?
Not sure if this can be related, but standard Linux distributions have a default limit of 1024 open file descriptors. With such a big volume and a massive copy going on, maybe this limit is affecting something?
Are there any error or warning messages in the mount or brick logs?
Xavi
Xavier Hernandez
2016-04-21 10:55:40 UTC
Permalink
Hi Serkan,
Post by Serkan Çoban
I started a 'gluster v rebalance v0 start' command hoping that it would redistribute the files evenly across all 60 nodes, but it did not do that... Why didn't it redistribute the files? Any thoughts?
Has the rebalance operation finished successfully? Has it skipped any files?

After a successful rebalance all files with attributes '---------T'
should have disappeared.
Serkan Çoban
2016-04-21 12:23:00 UTC
Permalink
Has the rebalance operation finished successfully? Has it skipped any files?
Yes, according to 'gluster v rebalance status' it completed without any errors.
The rebalance status report looks like this:

Node         Rebalanced files    Size      Scanned    Failures    Skipped
1.1.1.185    158                 29GB      1720       0           314
1.1.1.205    93                  46.5GB    761        0           95
1.1.1.225    74                  37GB      779        0           94

All other hosts have 0 values.

I double checked, and the files with the '---------T' attribute are still there; maybe some of them were deleted, but I still see them in the bricks...
I am also concerned that the part files were not distributed to all 60 nodes. Shouldn't the rebalance do that?
Xavier Hernandez
2016-04-21 12:34:26 UTC
Permalink
Can you try a 'gluster volume rebalance v0 start force' ?
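A minimal sketch of that suggestion, together with watching its progress and log (the log path is the usual default location and may differ on your setup):

gluster volume rebalance v0 start force
gluster volume rebalance v0 status
# the rebalance daemon logs to a per-volume file, typically:
tail -f /var/log/glusterfs/v0-rebalance.log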
Serkan Çoban
2016-04-21 13:19:36 UTC
Permalink
Same result. I also checked the rebalance.log file; it has no reference to the part files either...
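One way to double-check a single part file is to look for it on every node and see where the real (non-zero-size) copy lives versus any '---------T' link file, for example (a sketch; it assumes passwordless ssh to the storage nodes, with the host range and brick paths as in the volume definition):

f=s1/teragen-10tb/part-m-00031   # one of the files mentioned above
for h in 1.1.1.{185..244}; do
    ssh "$h" "ls -l /bricks/*/$f 2>/dev/null"
done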
Post by Xavier Hernandez
Can you try a 'gluster volume rebalance v0 start force' ?
Post by Serkan Çoban
Has the rebalance operation finished successfully ? has it skipped any files ?
Yes according to gluster v rebalance status it is completed without any errors.
Node Rebalanced files size Scanned
failures skipped
1.1.1.185 158 29GB 1720
0 314
1.1.1.205 93 46.5GB 761
0 95
1.1.1.225 74 37GB 779
0 94
All other hosts has 0 values.
I double check that files with '---------T' attributes are there,
maybe some of them deleted but I still see them in bricks...
I am also concerned why part files not distributed to all 60 nodes?
Rebalance should do that?
Hi Serkan,
Post by Serkan Çoban
I started a gluster v rebalance v0 start command hoping that it will
equally redistribute files across 60 nodes but it did not do that...
why it did not redistribute files? any thoughts?
Has the rebalance operation finished successfully ? has it skipped any files
?
After a successful rebalance all files with attributes '---------T' should
have disappeared.
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
I think the problem is in the temporary name that distcp gives to the
file while it's being copied before renaming it to the real name. Do you
know what is the structure of this name ?
".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
temporary file name used by one map process. For example I see in the
logs that one map copies files part-m-00031,part-m-00047,part-m-00063
sequentially and they all use same temporary file name above. So no
original file name appears in temporary file name.
This explains the problem. With the default options, DHT sends all files
to
the subvolume that should store a file named 'distcp.tmp'.
With this temporary name format, little can be done.
Post by Serkan Çoban
I will check if we can modify distcp behaviour, or we have to write
our mapreduce procedures instead of using distcp.
Post by Xavier Hernandez
2. define the option 'extra-hash-regex' to an expression that matches
your temporary file names and returns the same name that will finally have.
Depending on the differences between original and temporary file
names,
this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name conversion, so the files will be evenly distributed. However this
will
cause a lot of files placed in incorrect subvolumes, creating a lot
of
link
files until a rebalance is executed.
How can I set these options?
gluster volume set <volname> <option> <value>
gluster volume set v0 rsync-hash-regex none
Xavi
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
I think the problem is in the temporary name that distcp gives to the file
while it's being copied before renaming it to the real name. Do you know
what is the structure of this name ?
DHT selects the subvolume (in this case the ec set) on which the file will
be stored based on the name of the file. This has a problem when a
file
is
being renamed, because this could change the subvolume where the file should
be found.
DHT has a feature to avoid incorrect file placements when executing renames
for the rsync case. What it does is to check if the file matches the
^\.(.+)\.[^.]+$
If a match is found, it only considers the part between parenthesis to
calculate the destination subvolume.
This is useful for rsync because temporary file names are constructed in
the
following way: suppose the original filename is 'test'. The temporary
filename while rsync is being executed is made by prepending a dot and
appending '.<random chars>': .test.712hd
As you can see, the original name and the part of the name between
parenthesis that matches the regular expression are the same. This causes
that, after renaming the temporary file to its original filename,
both
files
will be considered to belong to the same subvolume by DHT.
In your case it's very probable that distcp uses a temporary name like
'.part.<number>'. In this case the portion of the name used to select the
subvolume is always 'part'. This would explain why all files go to
the
same
subvolume. Once the file is renamed to another name, DHT realizes
that
it
should go to another subvolume. At this point it creates a link file (those
files with access rights = '---------T') in the correct subvolume but it
doesn't move it. As you can see, this kind of files are better balanced.
1. change the temporary filename used by distcp to correctly match the
regular expression. I'm not sure if this can be configured, but if this
is
possible, this is the best option.
2. define the option 'extra-hash-regex' to an expression that matches your
temporary file names and returns the same name that will finally have.
Depending on the differences between original and temporary file
names,
this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name
conversion, so the files will be evenly distributed. However this
will
cause
a lot of files placed in incorrect subvolumes, creating a lot of link files
until a rebalance is executed.
Xavi
Xavier Hernandez
2016-04-22 06:15:50 UTC
Permalink
When you execute a rebalance 'force' the skipped column should be 0 for
all nodes and all '---------T' files must have disappeared. Otherwise
something failed. Is this true in your case ?
Serkan Çoban
2016-04-22 06:24:17 UTC
Permalink
Not only the skipped column but all columns are 0 in the rebalance status
output. It seems rebalance does not do anything. All '---------T'
files are still there. Anyway, we wrote our own mapreduce tool and it is
copying files to gluster right now, utilizing all 60 nodes as
expected. I will delete the distcp folder and continue if you don't need
any further log/debug files to examine the issue.

Thanks for the help,
Serkan
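For reference, a quick way to confirm that a leftover zero-byte '---------T'
entry really is a DHT link file (a sketch; the brick and file names are just
examples taken from this thread) is to read its linkto xattr directly on the
brick, as root:

getfattr -n trusted.glusterfs.dht.linkto -e text /bricks/02/s1/teragen-10tb/part-m-00031

A genuine link file reports the DHT subvolume where the data actually lives;
a plain empty file simply reports that the attribute is not set.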
Xavier Hernandez
2016-04-22 06:43:37 UTC
Permalink
Even the number of scanned files is 0 ?

This seems to be an issue with DHT. I'm not an expert on this area. Not sure
if the regular expression pattern that some files still match could
interfere with the rebalance.

Anyway, if you have found a solution for your use case, it's ok for me.

Best regards,

Xavi
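The effect of that pattern can be checked by hand: running the default rsync
regular expression quoted earlier in the thread over the distcp temporary name
shows which part of the name DHT would hash (a rough sketch using sed):

echo '.distcp.tmp.attempt_1460381790773_0248_m_000001_0' | sed -E 's/^\.(.+)\.[^.]+$/\1/'

This prints 'distcp.tmp', which is why every temporary file hashed to the same
subvolume until it was renamed to its final part-m-xxxx name.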
Serkan Çoban
2016-04-22 07:19:47 UTC
Permalink
Scanned files are 1112, and only on the node where the rebalance command was
run; all other fields are 0 on every node.
If the issue is happening because of the temporary file names, we will make
sure not to use temporary files while writing to gluster.
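A rough way to confirm the even spread once temporary names are avoided
(a sketch; the 'disttest' prefix and paths are only examples based on this
thread). From a client:

for i in $(seq 1 200); do touch /mnt/gluster/s1/disttest-$i; done

Then on each storage node:

ls -d /bricks/*/s1/disttest-* 2>/dev/null | wc -l

With unique final names the per-node counts should all be in the same order of
magnitude, matching what the fio tests and the custom copy tool already show.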
Post by Xavier Hernandez
Even the number of scanned files is 0 ?
This seems an issue with DHT. I'm not an expert on this area. Not sure if
the regular expression pattern that some files still match could cause an
interference with rebalance.
Anyway, if you have found a solution for your use case, it's ok for me.
Best regards,
Xavi
Post by Serkan Çoban
Not only skipped column but all columns are 0 in rebalance status
command. It seems rebalance does not to anything. All '---------T'
files are there. Anyway we wrote our custom mapreduce tool and it is
copying files right now to gluster and it is utilizing all 60 nodes as
expected. I will delete distcp folder and continue if you don't need
any further log/debug files to examine the issue.
Thanks for help,
Serkan
When you execute a rebalance 'force' the skipped column should be 0 for all
nodes and all '---------T' files must have disappeared. Otherwise something
failed. Is this true in your case ?
Post by Serkan Çoban
Same result. Also checked the rebalance.log file, it has also no
reference to part files...
On Thu, Apr 21, 2016 at 3:34 PM, Xavier Hernandez
Post by Xavier Hernandez
Can you try a 'gluster volume rebalance v0 start force' ?
Post by Serkan Çoban
Post by Xavier Hernandez
Has the rebalance operation finished successfully ? has it skipped any
files ?
Yes according to gluster v rebalance status it is completed without
any
errors.
Node Rebalanced files size Scanned
failures skipped
1.1.1.185 158 29GB 1720
0 314
1.1.1.205 93 46.5GB 761
0 95
1.1.1.225 74 37GB 779
0 94
All other hosts has 0 values.
I double check that files with '---------T' attributes are there,
maybe some of them deleted but I still see them in bricks...
I am also concerned why part files not distributed to all 60 nodes?
Rebalance should do that?
On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
I started a gluster v rebalance v0 start command hoping that it will
equally redistribute files across 60 nodes but it did not do that...
why it did not redistribute files? any thoughts?
Has the rebalance operation finished successfully ? has it skipped any
files
?
After a successful rebalance all files with attributes '---------T' should
have disappeared.
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
I think the problem is in the temporary name that distcp gives to the
file while it's being copied before renaming it to the real name.
Do
you
know what is the structure of this name ?
".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
temporary file name used by one map process. For example I see in the
logs that one map copies files
part-m-00031,part-m-00047,part-m-00063
sequentially and they all use same temporary file name above. So no
original file name appears in temporary file name.
This explains the problem. With the default options, DHT sends all files
to
the subvolume that should store a file named 'distcp.tmp'.
With this temporary name format, little can be done.
Post by Serkan Çoban
I will check if we can modify distcp behaviour, or we have to write
our mapreduce procedures instead of using distcp.
Post by Xavier Hernandez
2. define the option 'extra-hash-regex' to an expression that matches
your temporary file names and returns the same name that will finally
have.
Depending on the differences between original and temporary file
names,
this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name conversion, so the files will be evenly distributed. However this
will
cause a lot of files placed in incorrect subvolumes, creating a lot
of
link
files until a rebalance is executed.
How can I set these options?
gluster volume set <volname> <option> <value>
gluster volume set v0 rsync-hash-regex none
Xavi
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
I think the problem is in the temporary name that distcp gives to the
file
while it's being copied before renaming it to the real name. Do
you
know
what is the structure of this name ?
DHT selects the subvolume (in this case the ec set) on which the file
will
be stored based on the name of the file. This has a problem when a
file
is
being renamed, because this could change the subvolume where the
file
should
be found.
DHT has a feature to avoid incorrect file placements when executing
renames
for the rsync case. What it does is to check if the file matches the
^\.(.+)\.[^.]+$
If a match is found, it only considers the part between
parenthesis
to
calculate the destination subvolume.
This is useful for rsync because temporary file names are
constructed
in
the
following way: suppose the original filename is 'test'. The temporary
filename while rsync is being executed is made by prepending a
dot
and
appending '.<random chars>': .test.712hd
As you can see, the original name and the part of the name between
parenthesis that matches the regular expression are the same. This
causes
that, after renaming the temporary file to its original filename,
both
files
will be considered to belong to the same subvolume by DHT.
In your case it's very probable that distcp uses a temporary name like
'.part.<number>'. In this case the portion of the name used to select
the
subvolume is always 'part'. This would explain why all files go to
the
same
subvolume. Once the file is renamed to another name, DHT realizes
that
it
should go to another subvolume. At this point it creates a link
file
(those
files with access rights = '---------T') in the correct subvolume
but
it
doesn't move it. As you can see, this kind of files are better
balanced.
1. change the temporary filename used by distcp to correctly
match
the
regular expression. I'm not sure if this can be configured, but
if
this
is
possible, this is the best option.
2. define the option 'extra-hash-regex' to an expression that matches
your
temporary file names and returns the same name that will finally have.
Depending on the differences between original and temporary file
names,
this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name
conversion, so the files will be evenly distributed. However this
will
cause
a lot of files placed in incorrect subvolumes, creating a lot of
link
files
until a rebalance is executed.
Xavi
Post by Serkan Çoban
Here is the steps that I do in detail and relevant output from
gluster volume create v0 disperse 20 redundancy 4 \
1.1.1.{185..204}:/bricks/02 \
1.1.1.{205..224}:/bricks/02 \
1.1.1.{225..244}:/bricks/02 \
1.1.1.{185..204}:/bricks/03 \
1.1.1.{205..224}:/bricks/03 \
1.1.1.{225..244}:/bricks/03 \
1.1.1.{185..204}:/bricks/04 \
1.1.1.{205..224}:/bricks/04 \
1.1.1.{225..244}:/bricks/04 \
1.1.1.{185..204}:/bricks/05 \
1.1.1.{205..224}:/bricks/05 \
1.1.1.{225..244}:/bricks/05 \
1.1.1.{185..204}:/bricks/06 \
1.1.1.{205..224}:/bricks/06 \
1.1.1.{225..244}:/bricks/06 \
1.1.1.{185..204}:/bricks/07 \
1.1.1.{205..224}:/bricks/07 \
1.1.1.{225..244}:/bricks/07 \
1.1.1.{185..204}:/bricks/08 \
1.1.1.{205..224}:/bricks/08 \
1.1.1.{225..244}:/bricks/08 \
1.1.1.{185..204}:/bricks/09 \
1.1.1.{205..224}:/bricks/09 \
1.1.1.{225..244}:/bricks/09 \
1.1.1.{185..204}:/bricks/10 \
1.1.1.{205..224}:/bricks/10 \
1.1.1.{225..244}:/bricks/10 \
1.1.1.{185..204}:/bricks/11 \
1.1.1.{205..224}:/bricks/11 \
1.1.1.{225..244}:/bricks/11 \
1.1.1.{185..204}:/bricks/12 \
1.1.1.{205..224}:/bricks/12 \
1.1.1.{225..244}:/bricks/12 \
1.1.1.{185..204}:/bricks/13 \
1.1.1.{205..224}:/bricks/13 \
1.1.1.{225..244}:/bricks/13 \
1.1.1.{185..204}:/bricks/14 \
1.1.1.{205..224}:/bricks/14 \
1.1.1.{225..244}:/bricks/14 \
1.1.1.{185..204}:/bricks/15 \
1.1.1.{205..224}:/bricks/15 \
1.1.1.{225..244}:/bricks/15 \
1.1.1.{185..204}:/bricks/16 \
1.1.1.{205..224}:/bricks/16 \
1.1.1.{225..244}:/bricks/16 \
1.1.1.{185..204}:/bricks/17 \
1.1.1.{205..224}:/bricks/17 \
1.1.1.{225..244}:/bricks/17 \
1.1.1.{185..204}:/bricks/18 \
1.1.1.{205..224}:/bricks/18 \
1.1.1.{225..244}:/bricks/18 \
1.1.1.{185..204}:/bricks/19 \
1.1.1.{205..224}:/bricks/19 \
1.1.1.{225..244}:/bricks/19 \
1.1.1.{185..204}:/bricks/20 \
1.1.1.{205..224}:/bricks/20 \
1.1.1.{225..244}:/bricks/20 \
1.1.1.{185..204}:/bricks/21 \
1.1.1.{205..224}:/bricks/21 \
1.1.1.{225..244}:/bricks/21 \
1.1.1.{185..204}:/bricks/22 \
1.1.1.{205..224}:/bricks/22 \
1.1.1.{225..244}:/bricks/22 \
1.1.1.{185..204}:/bricks/23 \
1.1.1.{205..224}:/bricks/23 \
1.1.1.{225..244}:/bricks/23 \
1.1.1.{185..204}:/bricks/24 \
1.1.1.{205..224}:/bricks/24 \
1.1.1.{225..244}:/bricks/24 \
1.1.1.{185..204}:/bricks/25 \
1.1.1.{205..224}:/bricks/25 \
1.1.1.{225..244}:/bricks/25 \
1.1.1.{185..204}:/bricks/26 \
1.1.1.{205..224}:/bricks/26 \
1.1.1.{225..244}:/bricks/26 \
1.1.1.{185..204}:/bricks/27 \
1.1.1.{205..224}:/bricks/27 \
1.1.1.{225..244}:/bricks/27 force
mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
then I make a directory from one of the clients and chmod it.
mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
then I start distcp on clients, there are 1059X8.8GB files in one
folder
and
they will be copied to /mnt/gluster/s1 with 100 parallel which
means
2
copy jobs per client at same time.
hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb
file:///mnt/gluster/s1
s1 directory is present in all 1560 brick.
s1/teragen-10tb folder is present in all 1560 brick.
https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
You can ignore the .crc files in the brick output above, they are
checksum files...
As you can see part-m-xxxx files written only some bricks in nodes
0205..0224
All bricks have some files but they have zero size.
I increase file descriptors to 65k so it is not the issue...
On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
I assume that gluster is used to store the intermediate files before
the reduce phase.
Nope, gluster is the destination for the distcp command: hadoop distcp
-m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
This runs maps on the datanodes, all of which have /mnt/gluster mounted.
I don't know hadoop, so I'm of little help here. However, it seems that
-m 50 means to execute 50 copies in parallel. This means that even if
the distribution worked fine, at most 50 (most probably fewer) of the
78 ec sets would be used in parallel.
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
This means that this is caused by some peculiarity of the mapreduce.
Yes, but how can a client write 500 files to the gluster mount and have
those files written to only a subset of the subvolumes? I cannot use
gluster as a backup cluster if I cannot write with distcp.
Were all 500 files created on only one of the 78 ec sets, with the
remaining 77 left empty?
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
You should look at which files are created in each brick, and how many,
while the process is running.
Files are only created on nodes 185..204 or 205..224 or 225..244. Only
on 20 nodes in each test.
How many files were there in each brick?
Not sure if this can be related, but standard Linux distributions have
a default limit of 1024 open file descriptors. With such a big volume
and a massive copy going on, maybe this limit is affecting something?
Are there any error or warning messages in the mount or brick logs?
Xavi
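As a quick follow-up to the descriptor-limit question above, a minimal
sketch (assuming passwordless ssh to the storage nodes, and the standard
glusterfs client / glusterfsd brick process names) to confirm that the
raised 65k limit is in effect for the running processes rather than only
for the login shell:

ulimit -n                                            # limit of the current shell on a client
grep 'open files' /proc/$(pgrep -o glusterfs)/limits # limit of the fuse mount process (oldest match)
for node in 1.1.1.{185..244}; do
  ssh "$node" 'for p in $(pgrep glusterfsd); do grep -H "open files" /proc/$p/limits; done'
done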