Discussion:
[Gluster-users] disperse volume file to subvolume mapping
Serkan Çoban
2016-04-18 11:39:10 UTC
Hi, I have a problem where clients are using only 1/3 of the nodes in a
disperse volume for writing.
I am testing from 50 clients using 1 to 10 threads with file names like part-0-xxxx.
What I see is that clients only use 20 nodes for writing. How is the file
name to subvolume hashing done? Is this related to the file names being
similar?

My cluster is 3.7.10 with 60 nodes, each with 26 disks. The disperse volume
is 78 x (16+4). Only 26 out of 78 subvolumes are used during writes.
Xavier Hernandez
2016-04-19 10:05:16 UTC
Hi Serkan,

Moved to gluster-users since this doesn't belong on the devel list.
I am copying 10,000 files to the gluster volume using mapreduce on the
clients. Each map process takes one file at a time and copies it to the
gluster volume.
I assume that gluster is used to store the intermediate files before the
reduce phase.
My disperse volume consists of 78 subvolumes of 16+4 disks each. So if I
copy >78 files in parallel I expect each file to go to a different subvolume,
right?
If you only copy 78 files, most probably you will get some subvolumes
empty and some others with more than one or two files. It's not an exact
distribution, it's a statistically balanced distribution: over time and
with enough files, each brick will contain a number of files of the
same order of magnitude, but they won't have the *same* number of files.
In my tests with fio I can see every file go to a
different subvolume, but when I start the mapreduce process from the clients
only 78/3=26 subvolumes are used for writing files.
This means that this is caused by some peculiarity of the mapreduce.
I see that clearly from the network traffic. Mapreduce on the client side can
be run multi-threaded. I tested with 1, 5 and 10 threads on each client, but
every time only 26 subvolumes were used.
How can I debug the issue further?
You should look at which files are created in each brick, and how many,
while the process is running.
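For example, something like this run on each server should give a rough
per-brick file count (adjust the path to wherever your bricks are mounted;
just a quick sketch, not a definitive check):

for b in /bricks/*; do echo "$b: $(find "$b" -type f | wc -l)"; done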

Xavi
On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez wrote:
Hi Serkan,
Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same behavior.
50 clients are copying files named part-0-xxxx to gluster using mapreduce,
with one thread per server, and they are using only 20 servers out of
60. On the other hand, fio tests use all the servers. Is there anything I
can do to solve the issue?
Distribution of files to ec sets is done by DHT. In theory, if you create
many files, each ec set will receive the same number of files. However, when
the number of files is small enough, statistics can fail.
Not sure what you are doing exactly, but a mapreduce procedure generally
only creates a single output. In that case it makes sense that only one ec
set is used. If you want to use all ec sets for a single file, you should
enable sharding (I haven't tested that) or split the result into multiple
files.
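(For reference, enabling sharding should be something along the lines of the
command below; untested here, and I haven't verified how sharding behaves on
a disperse volume:)

gluster volume set <volname> features.shard on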
Xavi
Serkan Çoban
2016-04-19 13:16:40 UTC
Post by Xavier Hernandez
I assume that gluster is used to store the intermediate files before the reduce phase
Nope, gluster is the destination for the distcp command: hadoop distcp -m
50 http://nn1:8020/path/to/folder file:///mnt/gluster
This runs maps on the datanodes, which all have /mnt/gluster mounted.
Post by Xavier Hernandez
This means that this is caused by some peculiarity of the mapreduce.
Yes, but how can a client write 500 files to the gluster mount and have
those files written to only a subset of the subvolumes? I cannot use gluster
as a backup cluster if I cannot write with distcp.
Post by Xavier Hernandez
You should look at which files are created in each brick, and how many, while the process is running.
Files are only created on nodes 185..204, or 205..224, or 225..244. Only on
20 nodes in each test.
Xavier Hernandez
2016-04-20 06:34:35 UTC
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
I assume that gluster is used to store the intermediate files before the reduce phase
Nope, gluster is the destination for the distcp command: hadoop distcp -m
50 http://nn1:8020/path/to/folder file:///mnt/gluster
This runs maps on the datanodes, which all have /mnt/gluster mounted.
I don't know hadoop, so I'm of little help here. However, it seems that
-m 50 means executing 50 copies in parallel. This means that even if
the distribution worked fine, at most 50 (probably fewer) of the 78
ec sets would be used in parallel.
Post by Serkan Çoban
Post by Xavier Hernandez
This means that this is caused by some peculiarity of the mapreduce.
Yes, but how can a client write 500 files to the gluster mount and have
those files written to only a subset of the subvolumes? I cannot use gluster
as a backup cluster if I cannot write with distcp.
Were all 500 files created on only one of the 78 ec sets, with the
remaining 77 left empty?
Post by Serkan Çoban
Post by Xavier Hernandez
You should look at which files are created in each brick, and how many, while the process is running.
Files are only created on nodes 185..204, or 205..224, or 225..244. Only on
20 nodes in each test.
How many files were there in each brick?

Not sure if this can be related, but standard linux distributions have a
default limit of 1024 open file descriptors. With such a big volume and
a massive copy going on, maybe this limit is affecting something?
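A quick way to see the limit actually in effect (just a sanity check; the
<brick-pid> placeholder below is whatever pid your brick process has):

ulimit -n
cat /proc/<brick-pid>/limits | grep 'open files'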

Are there any error or warning messages in the mount or bricks logs ?

Xavi
Serkan Çoban
2016-04-20 12:13:20 UTC
Here are the steps I follow in detail and the relevant output from the bricks:

I am using the command below for volume creation:
gluster volume create v0 disperse 20 redundancy 4 \
1.1.1.{185..204}:/bricks/02 \
1.1.1.{205..224}:/bricks/02 \
1.1.1.{225..244}:/bricks/02 \
1.1.1.{185..204}:/bricks/03 \
1.1.1.{205..224}:/bricks/03 \
1.1.1.{225..244}:/bricks/03 \
1.1.1.{185..204}:/bricks/04 \
1.1.1.{205..224}:/bricks/04 \
1.1.1.{225..244}:/bricks/04 \
1.1.1.{185..204}:/bricks/05 \
1.1.1.{205..224}:/bricks/05 \
1.1.1.{225..244}:/bricks/05 \
1.1.1.{185..204}:/bricks/06 \
1.1.1.{205..224}:/bricks/06 \
1.1.1.{225..244}:/bricks/06 \
1.1.1.{185..204}:/bricks/07 \
1.1.1.{205..224}:/bricks/07 \
1.1.1.{225..244}:/bricks/07 \
1.1.1.{185..204}:/bricks/08 \
1.1.1.{205..224}:/bricks/08 \
1.1.1.{225..244}:/bricks/08 \
1.1.1.{185..204}:/bricks/09 \
1.1.1.{205..224}:/bricks/09 \
1.1.1.{225..244}:/bricks/09 \
1.1.1.{185..204}:/bricks/10 \
1.1.1.{205..224}:/bricks/10 \
1.1.1.{225..244}:/bricks/10 \
1.1.1.{185..204}:/bricks/11 \
1.1.1.{205..224}:/bricks/11 \
1.1.1.{225..244}:/bricks/11 \
1.1.1.{185..204}:/bricks/12 \
1.1.1.{205..224}:/bricks/12 \
1.1.1.{225..244}:/bricks/12 \
1.1.1.{185..204}:/bricks/13 \
1.1.1.{205..224}:/bricks/13 \
1.1.1.{225..244}:/bricks/13 \
1.1.1.{185..204}:/bricks/14 \
1.1.1.{205..224}:/bricks/14 \
1.1.1.{225..244}:/bricks/14 \
1.1.1.{185..204}:/bricks/15 \
1.1.1.{205..224}:/bricks/15 \
1.1.1.{225..244}:/bricks/15 \
1.1.1.{185..204}:/bricks/16 \
1.1.1.{205..224}:/bricks/16 \
1.1.1.{225..244}:/bricks/16 \
1.1.1.{185..204}:/bricks/17 \
1.1.1.{205..224}:/bricks/17 \
1.1.1.{225..244}:/bricks/17 \
1.1.1.{185..204}:/bricks/18 \
1.1.1.{205..224}:/bricks/18 \
1.1.1.{225..244}:/bricks/18 \
1.1.1.{185..204}:/bricks/19 \
1.1.1.{205..224}:/bricks/19 \
1.1.1.{225..244}:/bricks/19 \
1.1.1.{185..204}:/bricks/20 \
1.1.1.{205..224}:/bricks/20 \
1.1.1.{225..244}:/bricks/20 \
1.1.1.{185..204}:/bricks/21 \
1.1.1.{205..224}:/bricks/21 \
1.1.1.{225..244}:/bricks/21 \
1.1.1.{185..204}:/bricks/22 \
1.1.1.{205..224}:/bricks/22 \
1.1.1.{225..244}:/bricks/22 \
1.1.1.{185..204}:/bricks/23 \
1.1.1.{205..224}:/bricks/23 \
1.1.1.{225..244}:/bricks/23 \
1.1.1.{185..204}:/bricks/24 \
1.1.1.{205..224}:/bricks/24 \
1.1.1.{225..244}:/bricks/24 \
1.1.1.{185..204}:/bricks/25 \
1.1.1.{205..224}:/bricks/25 \
1.1.1.{225..244}:/bricks/25 \
1.1.1.{185..204}:/bricks/26 \
1.1.1.{205..224}:/bricks/26 \
1.1.1.{225..244}:/bricks/26 \
1.1.1.{185..204}:/bricks/27 \
1.1.1.{205..224}:/bricks/27 \
1.1.1.{225..244}:/bricks/27 force

Then I mount the volume on the 50 clients:
mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster

Then I make a directory from one of the clients and chmod it:
mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1

Then I start distcp on the clients. There are 1059 x 8.8GB files in one
folder, and they will be copied to /mnt/gluster/s1 with 100 parallel maps,
which means 2 copy jobs per client at the same time:
hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1

After the job finished, here is the status of the s1 directory on the bricks:
The s1 directory is present in all 1560 bricks.
The s1/teragen-10tb folder is present in all 1560 bricks.

Full listing of the files in the bricks:
https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0

You can ignore the .crc files in the brick output above; they are
checksum files.

As you can see, the part-m-xxxx files were written to only some bricks on
nodes 0205..0224. All of the bricks have some files, but those files have
zero size.

I increased the file descriptor limit to 65k, so that is not the issue.
Xavier Hernandez
2016-04-21 07:00:39 UTC
Hi Serkan,

I think the problem is in the temporary name that distcp gives to the
file while it's being copied, before renaming it to the real name. Do you
know what the structure of this name is?

DHT selects the subvolume (in this case the ec set) on which the file
will be stored based on the name of the file. This has a problem when a
file is being renamed, because this could change the subvolume where the
file should be found.

DHT has a feature to avoid incorrect file placements when executing
renames for the rsync case. What it does is check whether the file name
matches the following regular expression:

^\.(.+)\.[^.]+$

If a match is found, it only considers the part between parentheses to
calculate the destination subvolume.

This is useful for rsync because temporary file names are constructed in
the following way: suppose the original filename is 'test'. The
temporary filename while rsync is being executed is made by prepending a
dot and appending '.<random chars>': .test.712hd

As you can see, the original name and the part of the name between
parentheses that matches the regular expression are the same. This means
that, after renaming the temporary file to its original filename, both
files will be considered by DHT to belong to the same subvolume.
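Just to illustrate how the regular expression behaves (a quick manual check
with sed, not something DHT itself runs):

echo ".test.712hd" | sed -E 's/^\.(.+)\.[^.]+$/\1/'    # prints "test"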

In your case it's very probable that distcp uses a temporary name like
'.part.<number>'. In this case the portion of the name used to select
the subvolume is always 'part'. This would explain why all files go to
the same subvolume. Once the file is renamed to its final name, DHT
realizes that it should go to another subvolume. At this point it creates
a link file (one of those files with access rights = '---------T') in the
correct subvolume, but it doesn't move the data. As you can see, these
link files are better balanced.
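If you want to see them, something like this should list the link files on a
brick (brick path taken from your setup; they should also carry a
trusted.glusterfs.dht.linkto xattr):

find /bricks/02 -type f -perm 1000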

To solve this problem you have three options:

1. change the temporary filename used by distcp to correctly match the
regular expression. I'm not sure if this can be configured, but if this
is possible, this is the best option.

2. define the option 'extra-hash-regex' with an expression that matches
your temporary file names and returns the same name that the file will
finally have. Depending on the differences between the original and
temporary file names, this option could be useless.

3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name conversion, so the files will be evenly distributed. However, this
will cause a lot of files to be placed in incorrect subvolumes, creating
a lot of link files until a rebalance is executed.

Xavi
Serkan Çoban
2016-04-21 08:07:12 UTC
I think the problem is in the temporary name that distcp gives to the file while it's being copied, before renaming it to the real name. Do you know what the structure of this name is?
The distcp temporary file name format is
".distcp.tmp.attempt_1460381790773_0248_m_000001_0", and the same
temporary file name is used by one map process. For example, I see in the
logs that one map copies the files part-m-00031, part-m-00047 and part-m-00063
sequentially, and they all use the same temporary file name above. So no
original file name appears in the temporary file name.

I will check if we can modify the distcp behaviour, or whether we have to
write our own mapreduce procedures instead of using distcp.
2. define the option 'extra-hash-regex' with an expression that matches your temporary file names and returns the same name that the file will finally have. Depending on the differences between the original and temporary file names, this option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name conversion, so the files will be evenly distributed. However, this will cause a lot of files to be placed in incorrect subvolumes, creating a lot of link files until a rebalance is executed.
How can I set these options?



Xavier Hernandez
2016-04-21 08:24:58 UTC
Hi Serkan,
Post by Serkan Çoban
I think the problem is in the temporary name that distcp gives to the file while it's being copied, before renaming it to the real name. Do you know what the structure of this name is?
".distcp.tmp.attempt_1460381790773_0248_m_000001_0", and the same
temporary file name is used by one map process. For example, I see in the
logs that one map copies the files part-m-00031, part-m-00047 and part-m-00063
sequentially, and they all use the same temporary file name above. So no
original file name appears in the temporary file name.
This explains the problem. With the default options, DHT sends all files
to the subvolume that should store a file named 'distcp.tmp'.

With this temporary name format, little can be done.
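For illustration, applying the same regular expression by hand to that
temporary name (again just with sed):

echo ".distcp.tmp.attempt_1460381790773_0248_m_000001_0" | sed -E 's/^\.(.+)\.[^.]+$/\1/'
# prints "distcp.tmp", so every temporary file hashes to the same subvolume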
Post by Serkan Çoban
I will check if we can modify the distcp behaviour, or whether we have to
write our own mapreduce procedures instead of using distcp.
2. define the option 'extra-hash-regex' with an expression that matches your temporary file names and returns the same name that the file will finally have. Depending on the differences between the original and temporary file names, this option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name conversion, so the files will be evenly distributed. However, this will cause a lot of files to be placed in incorrect subvolumes, creating a lot of link files until a rebalance is executed.
How can I set these options?
You can set gluster options using:

gluster volume set <volname> <option> <value>

for example:

gluster volume set v0 rsync-hash-regex none

Xavi
Serkan Çoban
2016-04-21 10:39:41 UTC
I started a 'gluster v rebalance v0 start' command hoping that it would
equally redistribute the files across the 60 nodes, but it did not do that...
Why didn't it redistribute the files? Any thoughts?

On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
I think the problem is in the temporary name that distcp gives to the
file while it's being copied before renaming it to the real name. Do you
know what is the structure of this name ?
".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
temporary file name used by one map process. For example I see in the
logs that one map copies files part-m-00031,part-m-00047,part-m-00063
sequentially and they all use same temporary file name above. So no
original file name appears in temporary file name.
This explains the problem. With the default options, DHT sends all files to
the subvolume that should store a file named 'distcp.tmp'.
With this temporary name format, little can be done.
Post by Serkan Çoban
I will check if we can modify distcp behaviour, or we have to write
our mapreduce procedures instead of using distcp.
Post by Xavier Hernandez
2. define the option 'extra-hash-regex' to an expression that matches
your temporary file names and returns the same name that will finally have.
Depending on the differences between original and temporary file names, this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name conversion, so the files will be evenly distributed. However this will
cause a lot of files placed in incorrect subvolumes, creating a lot of link
files until a rebalance is executed.
How can I set these options?
gluster volume set <volname> <option> <value>
gluster volume set v0 rsync-hash-regex none
Xavi
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
I think the problem is in the temporary name that distcp gives to the file
while it's being copied before renaming it to the real name. Do you know
what is the structure of this name ?
DHT selects the subvolume (in this case the ec set) on which the file will
be stored based on the name of the file. This has a problem when a file is
being renamed, because this could change the subvolume where the file should
be found.
DHT has a feature to avoid incorrect file placements when executing renames
for the rsync case. What it does is to check if the file matches the
^\.(.+)\.[^.]+$
If a match is found, it only considers the part between parenthesis to
calculate the destination subvolume.
This is useful for rsync because temporary file names are constructed in the
following way: suppose the original filename is 'test'. The temporary
filename while rsync is being executed is made by prepending a dot and
appending '.<random chars>': .test.712hd
As you can see, the original name and the part of the name between parentheses that matches the regular expression are the same. This means that, after renaming the temporary file to its original filename, both names will be considered by DHT to belong to the same subvolume.
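As a quick illustration, a reasonably recent sed can show what that regular expression captures for the names mentioned in this thread (a rough sketch; the exact flag may differ between GNU and BSD sed):

echo ".test.712hd" | sed -E 's/^\.(.+)\.[^.]+$/\1/'
# -> test (the captured part equals the final file name)
echo ".distcp.tmp.attempt_1460381790773_0248_m_000001_0" | sed -E 's/^\.(.+)\.[^.]+$/\1/'
# -> distcp.tmp (the greedy capture stops at the last dot, so every distcp
#    temporary file hashes as the constant name 'distcp.tmp')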
In your case it's very probable that distcp uses a temporary name like
'.part.<number>'. In this case the portion of the name used to select the
subvolume is always 'part'. This would explain why all files go to the same
subvolume. Once the file is renamed to another name, DHT realizes that it
should go to another subvolume. At this point it creates a link file (one of those files with access rights '---------T') in the correct subvolume, but it doesn't move the data. As you can see, these link files are better balanced.
There are a few possible workarounds:
1. change the temporary filename used by distcp to correctly match the regular expression. I'm not sure if this can be configured, but if it is possible, this is the best option.
2. define the option 'extra-hash-regex' to an expression that matches your temporary file names and returns the same name that the file will finally have. Depending on the differences between the original and temporary file names, this option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name conversion, so the files will be evenly distributed. However, this will cause a lot of files to be placed in incorrect subvolumes, creating a lot of link files until a rebalance is executed.
Xavi
Post by Serkan Çoban
gluster volume create v0 disperse 20 redundancy 4 \
1.1.1.{185..204}:/bricks/02 \
1.1.1.{205..224}:/bricks/02 \
1.1.1.{225..244}:/bricks/02 \
1.1.1.{185..204}:/bricks/03 \
1.1.1.{205..224}:/bricks/03 \
1.1.1.{225..244}:/bricks/03 \
1.1.1.{185..204}:/bricks/04 \
1.1.1.{205..224}:/bricks/04 \
1.1.1.{225..244}:/bricks/04 \
1.1.1.{185..204}:/bricks/05 \
1.1.1.{205..224}:/bricks/05 \
1.1.1.{225..244}:/bricks/05 \
1.1.1.{185..204}:/bricks/06 \
1.1.1.{205..224}:/bricks/06 \
1.1.1.{225..244}:/bricks/06 \
1.1.1.{185..204}:/bricks/07 \
1.1.1.{205..224}:/bricks/07 \
1.1.1.{225..244}:/bricks/07 \
1.1.1.{185..204}:/bricks/08 \
1.1.1.{205..224}:/bricks/08 \
1.1.1.{225..244}:/bricks/08 \
1.1.1.{185..204}:/bricks/09 \
1.1.1.{205..224}:/bricks/09 \
1.1.1.{225..244}:/bricks/09 \
1.1.1.{185..204}:/bricks/10 \
1.1.1.{205..224}:/bricks/10 \
1.1.1.{225..244}:/bricks/10 \
1.1.1.{185..204}:/bricks/11 \
1.1.1.{205..224}:/bricks/11 \
1.1.1.{225..244}:/bricks/11 \
1.1.1.{185..204}:/bricks/12 \
1.1.1.{205..224}:/bricks/12 \
1.1.1.{225..244}:/bricks/12 \
1.1.1.{185..204}:/bricks/13 \
1.1.1.{205..224}:/bricks/13 \
1.1.1.{225..244}:/bricks/13 \
1.1.1.{185..204}:/bricks/14 \
1.1.1.{205..224}:/bricks/14 \
1.1.1.{225..244}:/bricks/14 \
1.1.1.{185..204}:/bricks/15 \
1.1.1.{205..224}:/bricks/15 \
1.1.1.{225..244}:/bricks/15 \
1.1.1.{185..204}:/bricks/16 \
1.1.1.{205..224}:/bricks/16 \
1.1.1.{225..244}:/bricks/16 \
1.1.1.{185..204}:/bricks/17 \
1.1.1.{205..224}:/bricks/17 \
1.1.1.{225..244}:/bricks/17 \
1.1.1.{185..204}:/bricks/18 \
1.1.1.{205..224}:/bricks/18 \
1.1.1.{225..244}:/bricks/18 \
1.1.1.{185..204}:/bricks/19 \
1.1.1.{205..224}:/bricks/19 \
1.1.1.{225..244}:/bricks/19 \
1.1.1.{185..204}:/bricks/20 \
1.1.1.{205..224}:/bricks/20 \
1.1.1.{225..244}:/bricks/20 \
1.1.1.{185..204}:/bricks/21 \
1.1.1.{205..224}:/bricks/21 \
1.1.1.{225..244}:/bricks/21 \
1.1.1.{185..204}:/bricks/22 \
1.1.1.{205..224}:/bricks/22 \
1.1.1.{225..244}:/bricks/22 \
1.1.1.{185..204}:/bricks/23 \
1.1.1.{205..224}:/bricks/23 \
1.1.1.{225..244}:/bricks/23 \
1.1.1.{185..204}:/bricks/24 \
1.1.1.{205..224}:/bricks/24 \
1.1.1.{225..244}:/bricks/24 \
1.1.1.{185..204}:/bricks/25 \
1.1.1.{205..224}:/bricks/25 \
1.1.1.{225..244}:/bricks/25 \
1.1.1.{185..204}:/bricks/26 \
1.1.1.{205..224}:/bricks/26 \
1.1.1.{225..244}:/bricks/26 \
1.1.1.{185..204}:/bricks/27 \
1.1.1.{205..224}:/bricks/27 \
1.1.1.{225..244}:/bricks/27 force
mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
then I make a directory from one of the clients and chmod it.
mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
Then I start distcp on the clients. There are 1059 x 8.8GB files in one folder, and they will be copied to /mnt/gluster/s1 with 100 parallel mappers, which means 2 copy jobs per client at the same time.
hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb
file:///mnt/gluster/s1
The s1 directory is present in all 1560 bricks.
The s1/teragen-10tb folder is present in all 1560 bricks.
https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
You can ignore the .crc files in the brick output above; they are checksum files...
As you can see, the part-m-xxxx files are written to only some bricks, on nodes 0205..0224.
All bricks have some files, but they have zero size.
I increased the file descriptor limit to 65k, so that is not the issue...
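To see the distribution per brick while the copy is running, something like the following could be run on each storage node (a rough sketch; brick paths taken from the volume definition above):

# count the part files that have landed on each local brick so far
for b in /bricks/*/s1/teragen-10tb; do
    echo "$b: $(find "$b" -maxdepth 1 -type f -name 'part-m-*' | wc -l)"
done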
On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
I assume that gluster is used to store the intermediate files before
the reduce phase
Nope, gluster is the destination for the distcp command: hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
This runs maps on the datanodes, all of which have /mnt/gluster mounted.
I don't know hadoop, so I'm of little help here. However, it seems that '-m 50' means to execute 50 copies in parallel. This means that even if the distribution worked fine, at most 50 (most probably fewer) of the 78 ec sets would be used in parallel.
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
This means that this is caused by some peculiarity of the mapreduce.
Yes, but how can a client write 500 files to the gluster mount and have those files written to only a subset of the subvolumes? I cannot use gluster as a backup cluster if I cannot write with distcp.
Were all 500 files created on only one of the 78 ec sets, with the remaining 77 left empty?
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
You should look which files are created in each brick and how many
while the process is running.
Files were only created on nodes 185..204, or 205..224, or 225..244. Only on 20 nodes in each test.
How many files were there in each brick?
Not sure if this can be related, but standard Linux distributions have a default limit of 1024 open file descriptors. With such a big volume and a massive copy going on, maybe this limit is affecting something?
Are there any error or warning messages in the mount or brick logs?
Xavi
Xavier Hernandez
2016-04-21 10:55:40 UTC
Permalink
Hi Serkan,
Post by Serkan Çoban
I started a 'gluster v rebalance v0 start' command hoping that it would redistribute the files evenly across all 60 nodes, but it did not do that... Why didn't it redistribute the files? Any thoughts?
Has the rebalance operation finished successfully? Has it skipped any files?

After a successful rebalance all files with attributes '---------T'
should have disappeared.
Serkan Çoban
2016-04-21 12:23:00 UTC
Permalink
Has the rebalance operation finished successfully? Has it skipped any files?
Yes, according to 'gluster v rebalance status' it completed without any errors.
The rebalance status report looks like this:

Node         Rebalanced files    Size      Scanned    Failures    Skipped
1.1.1.185    158                 29GB      1720       0           314
1.1.1.205    93                  46.5GB    761        0           95
1.1.1.225    74                  37GB      779        0           94

All other hosts have 0 values.

I double checked, and the files with the '---------T' attribute are still there; maybe some of them were deleted, but I still see them in the bricks...
I am also concerned that the part files were not distributed to all 60 nodes. Shouldn't the rebalance do that?
Xavier Hernandez
2016-04-21 12:34:26 UTC
Permalink
Can you try a 'gluster volume rebalance v0 start force' ?
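A minimal sketch of that suggestion, together with watching its progress and log (the log path is the usual default location and may differ on your setup):

gluster volume rebalance v0 start force
gluster volume rebalance v0 status
# the rebalance daemon logs to a per-volume file, typically:
tail -f /var/log/glusterfs/v0-rebalance.log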
Serkan Çoban
2016-04-21 13:19:36 UTC
Permalink
Same result. I also checked the rebalance.log file; it has no reference to the part files either...
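One way to double-check a single part file is to look for it on every node and see where the real (non-zero-size) copy lives versus any '---------T' link file, for example (a sketch; it assumes passwordless ssh to the storage nodes, with the host range and brick paths as in the volume definition):

f=s1/teragen-10tb/part-m-00031   # one of the files mentioned above
for h in 1.1.1.{185..244}; do
    ssh "$h" "ls -l /bricks/*/$f 2>/dev/null"
done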
Post by Xavier Hernandez
Can you try a 'gluster volume rebalance v0 start force' ?
Post by Serkan Çoban
Has the rebalance operation finished successfully ? has it skipped any files ?
Yes according to gluster v rebalance status it is completed without any errors.
Node Rebalanced files size Scanned
failures skipped
1.1.1.185 158 29GB 1720
0 314
1.1.1.205 93 46.5GB 761
0 95
1.1.1.225 74 37GB 779
0 94
All other hosts has 0 values.
I double check that files with '---------T' attributes are there,
maybe some of them deleted but I still see them in bricks...
I am also concerned why part files not distributed to all 60 nodes?
Rebalance should do that?
Hi Serkan,
Post by Serkan Çoban
I started a gluster v rebalance v0 start command hoping that it will
equally redistribute files across 60 nodes but it did not do that...
why it did not redistribute files? any thoughts?
Has the rebalance operation finished successfully ? has it skipped any files
?
After a successful rebalance all files with attributes '---------T' should
have disappeared.
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
I think the problem is in the temporary name that distcp gives to the
file while it's being copied before renaming it to the real name. Do you
know what is the structure of this name ?
".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
temporary file name used by one map process. For example I see in the
logs that one map copies files part-m-00031,part-m-00047,part-m-00063
sequentially and they all use same temporary file name above. So no
original file name appears in temporary file name.
This explains the problem. With the default options, DHT sends all files
to
the subvolume that should store a file named 'distcp.tmp'.
With this temporary name format, little can be done.
Post by Serkan Çoban
I will check if we can modify distcp behaviour, or we have to write
our mapreduce procedures instead of using distcp.
Post by Xavier Hernandez
2. define the option 'extra-hash-regex' to an expression that matches
your temporary file names and returns the same name that will finally have.
Depending on the differences between original and temporary file
names,
this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name conversion, so the files will be evenly distributed. However this
will
cause a lot of files placed in incorrect subvolumes, creating a lot
of
link
files until a rebalance is executed.
How can I set these options?
gluster volume set <volname> <option> <value>
gluster volume set v0 rsync-hash-regex none
Xavi
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
I think the problem is in the temporary name that distcp gives to the file
while it's being copied before renaming it to the real name. Do you know
what is the structure of this name ?
DHT selects the subvolume (in this case the ec set) on which the file will
be stored based on the name of the file. This has a problem when a
file
is
being renamed, because this could change the subvolume where the file should
be found.
DHT has a feature to avoid incorrect file placements when executing renames
for the rsync case. What it does is to check if the file matches the
^\.(.+)\.[^.]+$
If a match is found, it only considers the part between parenthesis to
calculate the destination subvolume.
This is useful for rsync because temporary file names are constructed in
the
following way: suppose the original filename is 'test'. The temporary
filename while rsync is being executed is made by prepending a dot and
appending '.<random chars>': .test.712hd
As you can see, the original name and the part of the name between
parenthesis that matches the regular expression are the same. This causes
that, after renaming the temporary file to its original filename,
both
files
will be considered to belong to the same subvolume by DHT.
In your case it's very probable that distcp uses a temporary name like
'.part.<number>'. In this case the portion of the name used to select the
subvolume is always 'part'. This would explain why all files go to
the
same
subvolume. Once the file is renamed to another name, DHT realizes
that
it
should go to another subvolume. At this point it creates a link file (those
files with access rights = '---------T') in the correct subvolume but it
doesn't move it. As you can see, this kind of files are better balanced.
1. change the temporary filename used by distcp to correctly match the
regular expression. I'm not sure if this can be configured, but if this
is
possible, this is the best option.
2. define the option 'extra-hash-regex' to an expression that matches your
temporary file names and returns the same name that will finally have.
Depending on the differences between original and temporary file
names,
this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the name
conversion, so the files will be evenly distributed. However this
will
cause
a lot of files placed in incorrect subvolumes, creating a lot of link files
until a rebalance is executed.
Xavi
Xavier Hernandez
2016-04-22 06:15:50 UTC
Permalink
When you execute a rebalance 'force' the skipped column should be 0 for
all nodes and all '---------T' files must have disappeared. Otherwise
something failed. Is this true in your case ?
Serkan Çoban
2016-04-22 06:24:17 UTC
Permalink
Not only the skipped column but all columns are 0 in the rebalance status
output. It seems rebalance does not do anything. All '---------T'
files are still there. Anyway, we wrote our own mapreduce tool and it is
copying files to gluster right now, utilizing all 60 nodes as
expected. I will delete the distcp folder and continue if you don't need
any further log/debug files to examine the issue.

Thanks for the help,
Serkan
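For reference, a quick way to confirm that a leftover zero-byte '---------T'
entry really is a DHT link file (a sketch; the brick and file names are just
examples taken from this thread) is to read its linkto xattr directly on the
brick, as root:

getfattr -n trusted.glusterfs.dht.linkto -e text /bricks/02/s1/teragen-10tb/part-m-00031

A genuine link file reports the DHT subvolume where the data actually lives;
a plain empty file simply reports that the attribute is not set.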
Xavier Hernandez
2016-04-22 06:43:37 UTC
Permalink
Even the number of scanned files is 0 ?

This seems to be an issue with DHT. I'm not an expert on this area. Not sure
if the regular expression pattern that some files still match could
interfere with the rebalance.

Anyway, if you have found a solution for your use case, it's ok for me.

Best regards,

Xavi
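The effect of that pattern can be checked by hand: running the default rsync
regular expression quoted earlier in the thread over the distcp temporary name
shows which part of the name DHT would hash (a rough sketch using sed):

echo '.distcp.tmp.attempt_1460381790773_0248_m_000001_0' | sed -E 's/^\.(.+)\.[^.]+$/\1/'

This prints 'distcp.tmp', which is why every temporary file hashed to the same
subvolume until it was renamed to its final part-m-xxxx name.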
Serkan Çoban
2016-04-22 07:19:47 UTC
Permalink
Scanned files are 1112, and only on the node where the rebalance command was
run; all other fields are 0 on every node.
If the issue is happening because of the temporary file names, we will make
sure not to use temporary files while writing to gluster.
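A rough way to confirm the even spread once temporary names are avoided
(a sketch; the 'disttest' prefix and paths are only examples based on this
thread). From a client:

for i in $(seq 1 200); do touch /mnt/gluster/s1/disttest-$i; done

Then on each storage node:

ls -d /bricks/*/s1/disttest-* 2>/dev/null | wc -l

With unique final names the per-node counts should all be in the same order of
magnitude, matching what the fio tests and the custom copy tool already show.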
Post by Xavier Hernandez
Even the number of scanned files is 0 ?
This seems an issue with DHT. I'm not an expert on this area. Not sure if
the regular expression pattern that some files still match could cause an
interference with rebalance.
Anyway, if you have found a solution for your use case, it's ok for me.
Best regards,
Xavi
Post by Serkan Çoban
Not only skipped column but all columns are 0 in rebalance status
command. It seems rebalance does not to anything. All '---------T'
files are there. Anyway we wrote our custom mapreduce tool and it is
copying files right now to gluster and it is utilizing all 60 nodes as
expected. I will delete distcp folder and continue if you don't need
any further log/debug files to examine the issue.
Thanks for help,
Serkan
When you execute a rebalance 'force' the skipped column should be 0 for all
nodes and all '---------T' files must have disappeared. Otherwise something
failed. Is this true in your case ?
Post by Serkan Çoban
Same result. Also checked the rebalance.log file, it has also no
reference to part files...
On Thu, Apr 21, 2016 at 3:34 PM, Xavier Hernandez
Post by Xavier Hernandez
Can you try a 'gluster volume rebalance v0 start force' ?
Post by Serkan Çoban
Post by Xavier Hernandez
Has the rebalance operation finished successfully ? has it skipped any
files ?
Yes according to gluster v rebalance status it is completed without
any
errors.
Node Rebalanced files size Scanned
failures skipped
1.1.1.185 158 29GB 1720
0 314
1.1.1.205 93 46.5GB 761
0 95
1.1.1.225 74 37GB 779
0 94
All other hosts has 0 values.
I double check that files with '---------T' attributes are there,
maybe some of them deleted but I still see them in bricks...
I am also concerned why part files not distributed to all 60 nodes?
Rebalance should do that?
On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
I started a gluster v rebalance v0 start command hoping that it will
equally redistribute files across 60 nodes but it did not do that...
why it did not redistribute files? any thoughts?
Has the rebalance operation finished successfully ? has it skipped any
files
?
After a successful rebalance all files with attributes '---------T' should
have disappeared.
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
I think the problem is in the temporary name that distcp gives to the
file while it's being copied before renaming it to the real name.
Do
you
know what is the structure of this name ?
".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
temporary file name used by one map process. For example I see in the
logs that one map copies files
part-m-00031,part-m-00047,part-m-00063
sequentially and they all use same temporary file name above. So no
original file name appears in temporary file name.
This explains the problem. With the default options, DHT sends all files
to
the subvolume that should store a file named 'distcp.tmp'.
With this temporary name format, little can be done.
Post by Serkan Çoban
I will check if we can modify distcp behaviour, or we have to write
our mapreduce procedures instead of using distcp.
Post by Xavier Hernandez
2. define the option 'extra-hash-regex' to an expression that matches
your temporary file names and returns the same name that will finally
have.
Depending on the differences between original and temporary file
names,
this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name conversion, so the files will be evenly distributed. However this
will
cause a lot of files placed in incorrect subvolumes, creating a lot
of
link
files until a rebalance is executed.
How can I set these options?
gluster volume set <volname> <option> <value>
gluster volume set v0 rsync-hash-regex none
Xavi
Post by Serkan Çoban
On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
I think the problem is in the temporary name that distcp gives to the
file
while it's being copied before renaming it to the real name. Do
you
know
what is the structure of this name ?
DHT selects the subvolume (in this case the ec set) on which the file
will
be stored based on the name of the file. This has a problem when a
file
is
being renamed, because this could change the subvolume where the
file
should
be found.
DHT has a feature to avoid incorrect file placements when executing
renames
for the rsync case. What it does is to check if the file matches the
^\.(.+)\.[^.]+$
If a match is found, it only considers the part between
parenthesis
to
calculate the destination subvolume.
This is useful for rsync because temporary file names are
constructed
in
the
following way: suppose the original filename is 'test'. The temporary
filename while rsync is being executed is made by prepending a
dot
and
appending '.<random chars>': .test.712hd
As you can see, the original name and the part of the name between
parenthesis that matches the regular expression are the same. This
causes
that, after renaming the temporary file to its original filename,
both
files
will be considered to belong to the same subvolume by DHT.
In your case it's very probable that distcp uses a temporary name like
'.part.<number>'. In this case the portion of the name used to select
the
subvolume is always 'part'. This would explain why all files go to
the
same
subvolume. Once the file is renamed to another name, DHT realizes
that
it
should go to another subvolume. At this point it creates a link
file
(those
files with access rights = '---------T') in the correct subvolume
but
it
doesn't move it. As you can see, this kind of files are better
balanced.
1. change the temporary filename used by distcp to correctly
match
the
regular expression. I'm not sure if this can be configured, but
if
this
is
possible, this is the best option.
2. define the option 'extra-hash-regex' to an expression that matches
your
temporary file names and returns the same name that will finally have.
Depending on the differences between original and temporary file
names,
this
option could be useless.
3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
name
conversion, so the files will be evenly distributed. However this
will
cause
a lot of files placed in incorrect subvolumes, creating a lot of
link
files
until a rebalance is executed.
Xavi
Post by Serkan Çoban
Here is the steps that I do in detail and relevant output from
gluster volume create v0 disperse 20 redundancy 4 \
1.1.1.{185..204}:/bricks/02 \
1.1.1.{205..224}:/bricks/02 \
1.1.1.{225..244}:/bricks/02 \
1.1.1.{185..204}:/bricks/03 \
1.1.1.{205..224}:/bricks/03 \
1.1.1.{225..244}:/bricks/03 \
1.1.1.{185..204}:/bricks/04 \
1.1.1.{205..224}:/bricks/04 \
1.1.1.{225..244}:/bricks/04 \
1.1.1.{185..204}:/bricks/05 \
1.1.1.{205..224}:/bricks/05 \
1.1.1.{225..244}:/bricks/05 \
1.1.1.{185..204}:/bricks/06 \
1.1.1.{205..224}:/bricks/06 \
1.1.1.{225..244}:/bricks/06 \
1.1.1.{185..204}:/bricks/07 \
1.1.1.{205..224}:/bricks/07 \
1.1.1.{225..244}:/bricks/07 \
1.1.1.{185..204}:/bricks/08 \
1.1.1.{205..224}:/bricks/08 \
1.1.1.{225..244}:/bricks/08 \
1.1.1.{185..204}:/bricks/09 \
1.1.1.{205..224}:/bricks/09 \
1.1.1.{225..244}:/bricks/09 \
1.1.1.{185..204}:/bricks/10 \
1.1.1.{205..224}:/bricks/10 \
1.1.1.{225..244}:/bricks/10 \
1.1.1.{185..204}:/bricks/11 \
1.1.1.{205..224}:/bricks/11 \
1.1.1.{225..244}:/bricks/11 \
1.1.1.{185..204}:/bricks/12 \
1.1.1.{205..224}:/bricks/12 \
1.1.1.{225..244}:/bricks/12 \
1.1.1.{185..204}:/bricks/13 \
1.1.1.{205..224}:/bricks/13 \
1.1.1.{225..244}:/bricks/13 \
1.1.1.{185..204}:/bricks/14 \
1.1.1.{205..224}:/bricks/14 \
1.1.1.{225..244}:/bricks/14 \
1.1.1.{185..204}:/bricks/15 \
1.1.1.{205..224}:/bricks/15 \
1.1.1.{225..244}:/bricks/15 \
1.1.1.{185..204}:/bricks/16 \
1.1.1.{205..224}:/bricks/16 \
1.1.1.{225..244}:/bricks/16 \
1.1.1.{185..204}:/bricks/17 \
1.1.1.{205..224}:/bricks/17 \
1.1.1.{225..244}:/bricks/17 \
1.1.1.{185..204}:/bricks/18 \
1.1.1.{205..224}:/bricks/18 \
1.1.1.{225..244}:/bricks/18 \
1.1.1.{185..204}:/bricks/19 \
1.1.1.{205..224}:/bricks/19 \
1.1.1.{225..244}:/bricks/19 \
1.1.1.{185..204}:/bricks/20 \
1.1.1.{205..224}:/bricks/20 \
1.1.1.{225..244}:/bricks/20 \
1.1.1.{185..204}:/bricks/21 \
1.1.1.{205..224}:/bricks/21 \
1.1.1.{225..244}:/bricks/21 \
1.1.1.{185..204}:/bricks/22 \
1.1.1.{205..224}:/bricks/22 \
1.1.1.{225..244}:/bricks/22 \
1.1.1.{185..204}:/bricks/23 \
1.1.1.{205..224}:/bricks/23 \
1.1.1.{225..244}:/bricks/23 \
1.1.1.{185..204}:/bricks/24 \
1.1.1.{205..224}:/bricks/24 \
1.1.1.{225..244}:/bricks/24 \
1.1.1.{185..204}:/bricks/25 \
1.1.1.{205..224}:/bricks/25 \
1.1.1.{225..244}:/bricks/25 \
1.1.1.{185..204}:/bricks/26 \
1.1.1.{205..224}:/bricks/26 \
1.1.1.{225..244}:/bricks/26 \
1.1.1.{185..204}:/bricks/27 \
1.1.1.{205..224}:/bricks/27 \
1.1.1.{225..244}:/bricks/27 force
mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
then I make a directory from one of the clients and chmod it.
mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
then I start distcp on clients, there are 1059X8.8GB files in one
folder
and
they will be copied to /mnt/gluster/s1 with 100 parallel which
means
2
copy jobs per client at same time.
hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb
file:///mnt/gluster/s1
s1 directory is present in all 1560 brick.
s1/teragen-10tb folder is present in all 1560 brick.
https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
You can ignore the .crc files in the brick output above, they are
checksum files...
As you can see part-m-xxxx files written only some bricks in nodes
0205..0224
All bricks have some files but they have zero size.
I increase file descriptors to 65k so it is not the issue...
On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
Post by Xavier Hernandez
Hi Serkan,
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
I assume that gluster is used to store the intermediate files before
the reduce phase.
Nope, gluster is the destination for the distcp command: hadoop distcp
-m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
This runs maps on the datanodes, all of which have /mnt/gluster mounted.
I don't know hadoop, so I'm of little help here. However, it seems that
-m 50 means to execute 50 copies in parallel. This means that even if
the distribution worked fine, at most 50 (most probably fewer) of the
78 ec sets would be used in parallel.
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
This means that this is caused by some peculiarity of the mapreduce.
Yes, but how can a client write 500 files to the gluster mount and have
those files written to only a subset of the subvolumes? I cannot use
gluster as a backup cluster if I cannot write with distcp.
Were all 500 files created on only one of the 78 ec sets, with the
remaining 77 left empty?
Post by Serkan Çoban
Post by Xavier Hernandez
Post by Xavier Hernandez
You should look at which files are created in each brick, and how many,
while the process is running.
Files are only created on nodes 185..204 or 205..224 or 225..244. Only
on 20 nodes in each test.
How many files were there in each brick?
Not sure if this can be related, but standard Linux distributions have
a default limit of 1024 open file descriptors. With such a big volume
and a massive copy going on, maybe this limit is affecting something?
Are there any error or warning messages in the mount or brick logs?
Xavi
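As a quick follow-up to the descriptor-limit question above, a minimal
sketch (assuming passwordless ssh to the storage nodes, and the standard
glusterfs client / glusterfsd brick process names) to confirm that the
raised 65k limit is in effect for the running processes rather than only
for the login shell:

ulimit -n                                            # limit of the current shell on a client
grep 'open files' /proc/$(pgrep -o glusterfs)/limits # limit of the fuse mount process (oldest match)
for node in 1.1.1.{185..244}; do
  ssh "$node" 'for p in $(pgrep glusterfsd); do grep -H "open files" /proc/$p/limits; done'
done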