[Mesa-dev] [PATCH 00/15] radv: Support for VK_AMD_shader_ballot
Connor Abbott
2017-08-08 01:32:26 UTC
From: Connor Abbott <***@gmail.com>

This series implements VK_AMD_shader_ballot for radv. This extension
builds on VK_EXT_shader_subgroup_ballot and VK_EXT_shader_subgroup_vote
by adding a number of reductions across a subgroup (or wavefront in AMD
terminology). Previously, shaders had to use shared memory to compute,
say, the average across all threads in a workgroup, or the minimum and
maximum values across a workgroup. But that requires a lot of accesses
to LDS memory, which is (relatively) slow. This extension allows the
shader to do part of the reduction directly in registers, as long as it
stays within a single wavefront, which reduces the required LDS
traffic. It also adds a few AMD-specific instructions,
like mbcnt. To get an idea of what exactly is in the extension, and what
inclusive scan, exclusive scan, etc. mean, you can look at the GL
extension which exposes mostly the same things [1].
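
If the terminology is unfamiliar, here's a tiny reference model (plain C,
hand-written for this cover letter, not driver code) of what an iadd
reduce, inclusive scan and exclusive scan produce for a hypothetical
4-wide group:

  #include <stdio.h>

  int main(void)
  {
     const int in[4] = {1, 2, 3, 4};
     int reduce = 0, incl[4], excl[4], sum = 0;

     for (int i = 0; i < 4; i++)
        reduce += in[i];   /* every invocation sees the full sum: 10 */

     for (int i = 0; i < 4; i++) {
        excl[i] = sum;     /* sum of invocations strictly before i: 0 1 3 6 */
        sum += in[i];
        incl[i] = sum;     /* sum up to and including i: 1 3 6 10 */
     }

     for (int i = 0; i < 4; i++)
        printf("inv %d: reduce=%d inclusive=%d exclusive=%d\n",
               i, reduce, incl[i], excl[i]);
     return 0;
  }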

Why should you care? It turns out that with this extension enabled, plus
a few other AMD-specific extensions that are mostly trivial, DOOM will
take a different path that uses shaders that were tuned specifically for
AMD hardware. I haven't actually tested DOOM yet, since a few more
things need to be wired up, but it's a lot less work than this extension
and I'm sure Dave or Bas will do it for me when they get around to it
:).

It uses a few new features of the AMDGPU LLVM backend that I just
landed, as well as one more small change that still needs review:
https://reviews.llvm.org/D34718, so it's going to require LLVM 6.0. It
also uses the DPP modifier, which was only added on VI, since that was
easier than using ds_swizzle (which is available on all GCN cards). It
should be possible to implement support for older cards using
ds_swizzle, but I haven't gotten to it yet. A note to those reviewing:
it might be helpful to look at the LLVM changes that this series uses,
in particular:

https://reviews.llvm.org/rL310087
https://reviews.llvm.org/rL310088
https://reviews.llvm.org/D34718

in order to get the complete picture.

This series depends on my previous series [2] to implement
VK_EXT_shader_subgroup_vote and VK_EXT_shader_subgroup_ballot, if
nothing else in order to be able to test the implementation. I think
DOOM also uses the latter two extensions. I've also based this on my series
adding cross-thread semantics to NIR [3], which Jason needs to review,
since I was hoping that would land first, although with a little effort
it should be possible to land this first (it would require changing
PATCH 01 a little). The whole thing is available at:

git://people.freedesktop.org/~cwabbott0/mesa radv-amd-shader-ballot

and the LLVM branch that I've been using to test, with the one patch
added is at:

https://github.com/cwabbott0/llvm.git dpp-intrinsics-v4

I've got some Crucible tests for exercising the various parts
of the implementation, although I didn't bother to test all the possible
combinations of reductions, since they didn't really require any special
code to implement anyways. I'll try and get that cleaned up and sent out
soon. Maybe I should just push the tests?

Finally, I'm leaving Valve soon (this week) to go back to school, and I
suspect that I won't have too much time to work on this afterwards, so
someone else will probably have to pick it up. I've been working on this
for most of the summer, since it turned out to be a way more complicated
beast to implement than I thought. It's required changes across the
entire stack, from spirv-to-nir all the way down to register allocation
in the LLVM backend. Thankfully, though, most of the tricky LLVM
changes have landed (thanks Nicolai for reviewing!) and what's left is a
lot more straightforward. I should still be around to answer questions,
though. Whew!

[1] https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt
[2] https://lists.freedesktop.org/archives/mesa-dev/2017-August/164903.html
[3] https://lists.freedesktop.org/archives/mesa-dev/2017-August/164898.html

Connor Abbott (15):
nir: define intrinsics needed for AMD_shader_ballot
spirv: import AMD extensions header
spirv: add plumbing for SPV_AMD_shader_ballot and Groups
nir: rename and generalize nir_lower_read_invocation_to_scalar
nir: scalarize AMD_shader_ballot intrinsics
radv: call nir_lower_cross_thread_to_scalar()
nir: add a lowering pass for some cross-workgroup intrinsics
radv: use nir_lower_group_reduce()
ac: move ac_to_integer() and ac_to_float() to ac_llvm_build.c
ac: remove bitcast_to_float()
ac: fix ac_get_type_size() for doubles
ac: add support for SPV_AMD_shader_ballot
ac/nir: add support for SPV_AMD_shader_ballot
radv: enable VK_AMD_shader_ballot
ac/nir: fix saturate emission

src/amd/common/ac_llvm_build.c | 783 ++++++++++++++++++++-
src/amd/common/ac_llvm_build.h | 120 ++++
src/amd/common/ac_nir_to_llvm.c | 300 ++++----
src/amd/vulkan/radv_device.c | 15 +
src/amd/vulkan/radv_pipeline.c | 6 +
src/compiler/Makefile.sources | 4 +-
src/compiler/nir/nir.h | 11 +-
src/compiler/nir/nir_intrinsics.h | 124 +++-
...scalar.c => nir_lower_cross_thread_to_scalar.c} | 63 +-
src/compiler/nir/nir_lower_group_reduce.c | 179 +++++
src/compiler/nir/nir_print.c | 1 +
src/compiler/spirv/GLSL.ext.AMD.h | 93 +++
src/compiler/spirv/nir_spirv.h | 2 +
src/compiler/spirv/spirv_to_nir.c | 32 +-
src/compiler/spirv/vtn_amd.c | 281 ++++++++
src/compiler/spirv/vtn_private.h | 9 +
src/intel/compiler/brw_nir.c | 2 +-
17 files changed, 1846 insertions(+), 179 deletions(-)
rename src/compiler/nir/{nir_lower_read_invocation_to_scalar.c => nir_lower_cross_thread_to_scalar.c} (56%)
create mode 100644 src/compiler/nir/nir_lower_group_reduce.c
create mode 100644 src/compiler/spirv/GLSL.ext.AMD.h
create mode 100644 src/compiler/spirv/vtn_amd.c
--
2.9.4
Connor Abbott
2017-08-08 01:32:27 UTC
From: Connor Abbott <***@gmail.com>

---
src/compiler/nir/nir.h | 7 +++
src/compiler/nir/nir_intrinsics.h | 124 +++++++++++++++++++++++++++++++++++++-
src/compiler/nir/nir_print.c | 1 +
3 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
index 24934f0..4b5d78e 100644
--- a/src/compiler/nir/nir.h
+++ b/src/compiler/nir/nir.h
@@ -1076,6 +1076,12 @@ typedef enum {
*/
NIR_INTRINSIC_INTERP_MODE = 9,

+ /*
+ * Constant mask, invocation, swizzle, etc. used for a subgroup operation.
+ * The precise meaning depends on the intrinsic.
+ */
+ NIR_INTRINSIC_SUBGROUP_DATA = 10,
+
NIR_INTRINSIC_NUM_INDEX_FLAGS,

} nir_intrinsic_index_flag;
@@ -1144,6 +1150,7 @@ INTRINSIC_IDX_ACCESSORS(desc_set, DESC_SET, unsigned)
INTRINSIC_IDX_ACCESSORS(binding, BINDING, unsigned)
INTRINSIC_IDX_ACCESSORS(component, COMPONENT, unsigned)
INTRINSIC_IDX_ACCESSORS(interp_mode, INTERP_MODE, unsigned)
+INTRINSIC_IDX_ACCESSORS(subgroup_data, SUBGROUP_DATA, unsigned)

/**
* \group texture information
diff --git a/src/compiler/nir/nir_intrinsics.h b/src/compiler/nir/nir_intrinsics.h
index 72c4296..32f52aa 100644
--- a/src/compiler/nir/nir_intrinsics.h
+++ b/src/compiler/nir/nir_intrinsics.h
@@ -93,12 +93,24 @@ INTRINSIC(get_buffer_size, 1, ARR(1), true, 1, 0, 0, xx, xx, xx,
* the comment for NIR_INTRINSIC_CONVERGENT in nir.h for details.
*/
#define CONVERGENT(name, num_srcs, src0_components, src1_components, \
- dest_components) \
- INTRINSIC(name, num_srcs, ARR(src0_components, src1_components), \
+ src2_components, dest_components) \
+ INTRINSIC(name, num_srcs, ARR(src0_components, src1_components, \
+ src2_components), \
true, dest_components, 0, 0, xx, xx, xx, \
NIR_INTRINSIC_CAN_REORDER | NIR_INTRINSIC_CAN_ELIMINATE | \
NIR_INTRINSIC_CONVERGENT)

+/*
+ * Similar to CONVERGENT, except optimizations can assume that the intrinsic is
+ * only called in uniform control flow.
+ */
+#define UNIFORM_CONTROL(name, num_srcs, src0_components, src1_components, \
+ dest_components) \
+ INTRINSIC(name, num_srcs, ARR(src0_components, src1_components), \
+ true, dest_components, 0, 0, xx, xx, xx, \
+ NIR_INTRINSIC_CAN_REORDER | NIR_INTRINSIC_CAN_ELIMINATE | \
+ NIR_INTRINSIC_UNIFORM_CONTROL)
+
BARRIER(barrier)
BARRIER(discard)

@@ -126,7 +138,7 @@ INTRINSIC(shader_clock, 0, ARR(0), true, 2, 0, 0, xx, xx, xx, NIR_INTRINSIC_CAN_
* GLSL functions from ARB_shader_ballot.
*/
CROSS_THREAD(ballot, 1, 1, 0, 1)
-CONVERGENT(read_invocation, 2, 0, 1, 0)
+CONVERGENT(read_invocation, 2, 0, 1, 0, 0)
CROSS_THREAD(read_first_invocation, 1, 0, 0, 0)

/*
@@ -148,6 +160,105 @@ CROSS_THREAD(vote_any, 1, 1, 0, 1)
CROSS_THREAD(vote_all, 1, 1, 0, 1)
CROSS_THREAD(vote_eq, 1, 1, 0, 1)

+/* AMD_shader_ballot intrinsics */
+
+/*
+ * This is like CROSS_THREAD for instructions that communicate across an entire
+ * workgroup, instead of just the subgroup.
+ */
+
+#define CROSS_SUBGROUP(name, num_srcs, src0_components, src1_components, \
+ dest_components) \
+ INTRINSIC(name, num_srcs, ARR(src0_components, src1_components), \
+ true, dest_components, 0, 0, xx, xx, xx, \
+ NIR_INTRINSIC_CAN_ELIMINATE)
+
+#define CROSS_SUBGROUP_UNIFORM(name, num_srcs, src0_components, \
+ src1_components, dest_components) \
+ INTRINSIC(name, num_srcs, ARR(src0_components, src1_components), \
+ true, dest_components, 0, 0, xx, xx, xx, \
+ NIR_INTRINSIC_CAN_ELIMINATE | NIR_INTRINSIC_UNIFORM_CONTROL)
+
+#define REDUCE_OP(name, postfix) \
+ UNIFORM_CONTROL(subgroup_##name##postfix, 1, 0, 0, 0) \
+ CROSS_SUBGROUP(subgroup_##name##postfix##_nonuniform, 1, 0, 0, 0) \
+ CROSS_SUBGROUP_UNIFORM(group_##name##postfix, 1, 0, 0, 0) \
+ CROSS_SUBGROUP(group_##name##postfix##_nonuniform, 1, 0, 0, 0) \
+
+#define GROUP_SUBGROUP_REDUCE(name) \
+ REDUCE_OP(name, ) \
+ REDUCE_OP(name, _inclusive_scan) \
+ REDUCE_OP(name, _exclusive_scan)
+
+GROUP_SUBGROUP_REDUCE(iadd)
+GROUP_SUBGROUP_REDUCE(fadd)
+GROUP_SUBGROUP_REDUCE(fmin)
+GROUP_SUBGROUP_REDUCE(umin)
+GROUP_SUBGROUP_REDUCE(imin)
+GROUP_SUBGROUP_REDUCE(fmax)
+GROUP_SUBGROUP_REDUCE(umax)
+GROUP_SUBGROUP_REDUCE(imax)
+
+#undef GROUP_SUBGROUP_REDUCE
+#undef REDUCE_OP
+
+/* Analogous to vote_any and vote_all, but works across an entire workgroup.
+ * Also, this version can only be called in uniform control flow.
+ */
+CROSS_SUBGROUP_UNIFORM(group_any, 1, 1, 0, 1)
+CROSS_SUBGROUP_UNIFORM(group_all, 1, 1, 0, 1)
+
+/* Similarly, this is analogous to read_invocation but works across an entire
+ * workgroup and can only be called in uniform control flow. Unlike
+ * read_invocation, the second argument is a 3-dimensional local ID instead of
+ * a simple linear index.
+ */
+CROSS_SUBGROUP_UNIFORM(group_broadcast, 2, 1, 3, 1)
+
+#define CROSS_THREAD_WITH_DATA(name, src_components, dest_components) \
+ INTRINSIC(name, 1, ARR(src_components), \
+ true, dest_components, 0, 1, SUBGROUP_DATA, xx, xx, \
+ NIR_INTRINSIC_CAN_REORDER | NIR_INTRINSIC_CAN_ELIMINATE | \
+ NIR_INTRINSIC_CROSS_THREAD)
+
+/* The index is interpreted as a swizzle value, with bits [2*i:2*i+1]
+ * giving the lane within the quad that should be stored in the i'th lane of
+ * the quad. Inactive lanes return 0. That is, the computation is:
+ *
+ * for (i = 0; i < SubgroupSize; i += 4) {
+ * for (j = 0; j < 4; j++) {
+ * out[i + j] = is_active[i + swizzle[2*j:2*j+1]] ? in[i + swizzle[2*j:2*j+1]] : 0;
+ * }
+ * }
+ */
+CROSS_THREAD_WITH_DATA(quad_swizzle_amd, 0, 0)
+
+/*
+ * Implements AMD's swizzle-masked intrinsic. The mask is interpreted as 3
+ * 5-bit integers. Implements the following, taken from AMD_shader_ballot:
+ *
+ * for (i = 0; i < SubgroupSize; i++) {
+ * j = (((i & 0x1f) & data[0:4]) | data[5:9]) ^ data[10:14];
+ * j |= i & 0x20;
+ * out[i] = is_active[j] ? in[j] : 0;
+ */
+CROSS_THREAD_WITH_DATA(masked_swizzle_amd, 0, 0)
+
+/*
+ * The opposite of read_invocation - return the first argument, except for the
+ * second argument in the invocation given by the third argument. The second
+ * and third arguments must be dynamically uniform within the subgroup.
+ */
+CONVERGENT(write_invocation, 3, 0, 0, 1, 0)
+
+/*
+ * Implements the AMD mbcnt instruction. Returns:
+ *
+ * bitCount(gl_SubgroupLtMask & src0)
+ */
+INTRINSIC(mbcnt_amd, 1, ARR(1), true, 1, 0, 0, xx, xx, xx,
+ NIR_INTRINSIC_CAN_ELIMINATE | NIR_INTRINSIC_CAN_REORDER)
+
/**
* Basic Geometry Shader intrinsics.
*
@@ -387,6 +498,13 @@ SYSTEM_VALUE(blend_const_color_a_float, 1, 0, xx, xx, xx)
SYSTEM_VALUE(blend_const_color_rgba8888_unorm, 1, 0, xx, xx, xx)
SYSTEM_VALUE(blend_const_color_aaaa8888_unorm, 1, 0, xx, xx, xx)

+#define CROSS_THREAD_UNIFORM(name, dest_components, src_components) \
+ INTRINSIC(name, 1, ARR(src_components), true, dest_components, 0, 0, \
+ xx, xx, xx, \
+ NIR_INTRINSIC_CAN_ELIMINATE | NIR_INTRINSIC_CAN_REORDER | \
+ NIR_INTRINSIC_CROSS_THREAD | NIR_INTRINSIC_CONVERGENT)
+
+
/**
* Barycentric coordinate intrinsics.
*
diff --git a/src/compiler/nir/nir_print.c b/src/compiler/nir/nir_print.c
index f4811fe..b9cb686 100644
--- a/src/compiler/nir/nir_print.c
+++ b/src/compiler/nir/nir_print.c
@@ -595,6 +595,7 @@ print_intrinsic_instr(nir_intrinsic_instr *instr, print_state *state)
[NIR_INTRINSIC_BINDING] = "binding",
[NIR_INTRINSIC_COMPONENT] = "component",
[NIR_INTRINSIC_INTERP_MODE] = "interp_mode",
+ [NIR_INTRINSIC_SUBGROUP_DATA] = "subgroup_data",
};
for (unsigned idx = 1; idx < NIR_INTRINSIC_NUM_INDEX_FLAGS; idx++) {
if (!info->index_map[idx])
--
2.9.4
Connor Abbott
2017-08-08 01:32:28 UTC
From: Connor Abbott <***@gmail.com>

There's no Khronos-supplied C header, so I hacked up the C++ one in
glslang to be C-compatible. See
https://github.com/KhronosGroup/SPIRV-Headers/issues/36.
---
src/compiler/spirv/GLSL.ext.AMD.h | 93 +++++++++++++++++++++++++++++++++++++++
1 file changed, 93 insertions(+)
create mode 100644 src/compiler/spirv/GLSL.ext.AMD.h

diff --git a/src/compiler/spirv/GLSL.ext.AMD.h b/src/compiler/spirv/GLSL.ext.AMD.h
new file mode 100644
index 0000000..af28f21
--- /dev/null
+++ b/src/compiler/spirv/GLSL.ext.AMD.h
@@ -0,0 +1,93 @@
+/*
+** Copyright (c) 2014-2016 The Khronos Group Inc.
+**
+** Permission is hereby granted, free of charge, to any person obtaining a copy
+** of this software and/or associated documentation files (the "Materials"),
+** to deal in the Materials without restriction, including without limitation
+** the rights to use, copy, modify, merge, publish, distribute, sublicense,
+** and/or sell copies of the Materials, and to permit persons to whom the
+** Materials are furnished to do so, subject to the following conditions:
+**
+** The above copyright notice and this permission notice shall be included in
+** all copies or substantial portions of the Materials.
+**
+** MODIFICATIONS TO THIS FILE MAY MEAN IT NO LONGER ACCURATELY REFLECTS KHRONOS
+** STANDARDS. THE UNMODIFIED, NORMATIVE VERSIONS OF KHRONOS SPECIFICATIONS AND
+** HEADER INFORMATION ARE LOCATED AT https://www.khronos.org/registry/
+**
+** THE MATERIALS ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+** OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+** FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+** THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+** LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+** FROM,OUT OF OR IN CONNECTION WITH THE MATERIALS OR THE USE OR OTHER DEALINGS
+** IN THE MATERIALS.
+*/
+
+#ifndef GLSLextAMD_H
+#define GLSLextAMD_H
+
+static const int GLSLextAMDVersion = 100;
+static const int GLSLextAMDRevision = 2;
+
+// SPV_AMD_shader_ballot
+static const char* const E_SPV_AMD_shader_ballot = "SPV_AMD_shader_ballot";
+
+enum ShaderBallotAMD {
+ ShaderBallotBadAMD = 0, // Don't use
+
+ SwizzleInvocationsAMD = 1,
+ SwizzleInvocationsMaskedAMD = 2,
+ WriteInvocationAMD = 3,
+ MbcntAMD = 4,
+
+ ShaderBallotCountAMD
+};
+
+// SPV_AMD_shader_trinary_minmax
+static const char* const E_SPV_AMD_shader_trinary_minmax = "SPV_AMD_shader_trinary_minmax";
+
+enum ShaderTrinaryMinMaxAMD {
+ ShaderTrinaryMinMaxBadAMD = 0, // Don't use
+
+ FMin3AMD = 1,
+ UMin3AMD = 2,
+ SMin3AMD = 3,
+ FMax3AMD = 4,
+ UMax3AMD = 5,
+ SMax3AMD = 6,
+ FMid3AMD = 7,
+ UMid3AMD = 8,
+ SMid3AMD = 9,
+
+ ShaderTrinaryMinMaxCountAMD
+};
+
+// SPV_AMD_shader_explicit_vertex_parameter
+static const char* const E_SPV_AMD_shader_explicit_vertex_parameter = "SPV_AMD_shader_explicit_vertex_parameter";
+
+enum ShaderExplicitVertexParameterAMD {
+ ShaderExplicitVertexParameterBadAMD = 0, // Don't use
+
+ InterpolateAtVertexAMD = 1,
+
+ ShaderExplicitVertexParameterCountAMD
+};
+
+// SPV_AMD_gcn_shader
+static const char* const E_SPV_AMD_gcn_shader = "SPV_AMD_gcn_shader";
+
+enum GcnShaderAMD {
+ GcnShaderBadAMD = 0, // Don't use
+
+ CubeFaceIndexAMD = 1,
+ CubeFaceCoordAMD = 2,
+ TimeAMD = 3,
+
+ GcnShaderCountAMD
+};
+
+// SPV_AMD_gpu_shader_half_float
+static const char* const E_SPV_AMD_gpu_shader_half_float = "SPV_AMD_gpu_shader_half_float";
+
+#endif // #ifndef GLSLextAMD_H
--
2.9.4
Connor Abbott
2017-08-08 01:32:29 UTC
From: Connor Abbott <***@gmail.com>

glslang enables the Group capability in SPIR-V when compiling shaders
with GL_AMD_shader_ballot, and uses OpGroupIAdd etc. to implement some of
its functions, so it seems that the Group capability is implicitly
enabled by the Vulkan extension. I've added
support for both sets of instructions at the same time since they're
fairly closely related, and handled with the same macro. For the same
reason, I've put the implementation of the OpGroup* instructions in
vtn_amd.c, even though they're not part of an AMD extension.
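
As a side note, once this lands, a driver opts in by flipping the new
flags in struct nir_spirv_supported_extensions. A rough sketch of what
the consumer side might look like (illustrative only; the actual radv
hookup happens in the later "radv: enable VK_AMD_shader_ballot" patch):

  const struct nir_spirv_supported_extensions supported = {
     .shader_ballot = true,
     .shader_group_vote = true,
     .amd_shader_ballot = true,
     .groups = true,
  };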
---
src/compiler/Makefile.sources | 1 +
src/compiler/spirv/nir_spirv.h | 2 +
src/compiler/spirv/spirv_to_nir.c | 32 ++++-
src/compiler/spirv/vtn_amd.c | 281 ++++++++++++++++++++++++++++++++++++++
src/compiler/spirv/vtn_private.h | 9 ++
5 files changed, 324 insertions(+), 1 deletion(-)
create mode 100644 src/compiler/spirv/vtn_amd.c

diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
index a56a710..091b228 100644
--- a/src/compiler/Makefile.sources
+++ b/src/compiler/Makefile.sources
@@ -291,6 +291,7 @@ SPIRV_FILES = \
spirv/spirv_info.h \
spirv/spirv_to_nir.c \
spirv/vtn_alu.c \
+ spirv/vtn_amd.c \
spirv/vtn_cfg.c \
spirv/vtn_glsl450.c \
spirv/vtn_private.h \
diff --git a/src/compiler/spirv/nir_spirv.h b/src/compiler/spirv/nir_spirv.h
index 9d90a4d..3dc5542 100644
--- a/src/compiler/spirv/nir_spirv.h
+++ b/src/compiler/spirv/nir_spirv.h
@@ -54,6 +54,8 @@ struct nir_spirv_supported_extensions {
bool variable_pointers;
bool shader_ballot;
bool shader_group_vote;
+ bool amd_shader_ballot;
+ bool groups;
};

nir_function *spirv_to_nir(const uint32_t *words, size_t word_count,
diff --git a/src/compiler/spirv/spirv_to_nir.c b/src/compiler/spirv/spirv_to_nir.c
index b396840..82087d1 100644
--- a/src/compiler/spirv/spirv_to_nir.c
+++ b/src/compiler/spirv/spirv_to_nir.c
@@ -261,6 +261,11 @@ vtn_handle_extension(struct vtn_builder *b, SpvOp opcode,
struct vtn_value *val = vtn_push_value(b, w[1], vtn_value_type_extension);
if (strcmp((const char *)&w[2], "GLSL.std.450") == 0) {
val->ext_handler = vtn_handle_glsl450_instruction;
+ } else if (strcmp((const char *)&w[2], "SPV_AMD_shader_ballot") == 0) {
+ if (!(b->ext && b->ext->amd_shader_ballot)) {
+ vtn_warn("Unsupported extension SPV_AMD_shader_ballot");
+ }
+ val->ext_handler = vtn_handle_amd_ballot_ext;
} else {
assert(!"Unsupported extension");
}
@@ -2814,7 +2819,6 @@ vtn_handle_preamble_instruction(struct vtn_builder *b, SpvOp opcode,
case SpvCapabilityImageReadWrite:
case SpvCapabilityImageMipmap:
case SpvCapabilityPipes:
- case SpvCapabilityGroups:
case SpvCapabilityDeviceEnqueue:
case SpvCapabilityLiteralSampler:
case SpvCapabilityGenericPointer:
@@ -2859,6 +2863,10 @@ vtn_handle_preamble_instruction(struct vtn_builder *b, SpvOp opcode,
spv_check_supported(shader_group_vote, cap);
break;

+ case SpvCapabilityGroups:
+ spv_check_supported(groups, cap);
+ break;
+
default:
unreachable("Unhandled capability");
}
@@ -3386,6 +3394,28 @@ vtn_handle_body_instruction(struct vtn_builder *b, SpvOp opcode,
vtn_handle_subgroup(b, opcode, w, count);
break;

+ case SpvOpGroupAny:
+ case SpvOpGroupAll:
+ case SpvOpGroupBroadcast:
+ case SpvOpGroupIAdd:
+ case SpvOpGroupFAdd:
+ case SpvOpGroupFMin:
+ case SpvOpGroupUMin:
+ case SpvOpGroupSMin:
+ case SpvOpGroupFMax:
+ case SpvOpGroupUMax:
+ case SpvOpGroupSMax:
+ case SpvOpGroupIAddNonUniformAMD:
+ case SpvOpGroupFAddNonUniformAMD:
+ case SpvOpGroupFMinNonUniformAMD:
+ case SpvOpGroupUMinNonUniformAMD:
+ case SpvOpGroupSMinNonUniformAMD:
+ case SpvOpGroupFMaxNonUniformAMD:
+ case SpvOpGroupUMaxNonUniformAMD:
+ case SpvOpGroupSMaxNonUniformAMD:
+ vtn_handle_group(b, opcode, w, count);
+ break;
+
default:
unreachable("Unhandled opcode");
}
diff --git a/src/compiler/spirv/vtn_amd.c b/src/compiler/spirv/vtn_amd.c
new file mode 100644
index 0000000..3cc5eec
--- /dev/null
+++ b/src/compiler/spirv/vtn_amd.c
@@ -0,0 +1,281 @@
+/*
+ * Copyright © 2017 Valve Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "vtn_private.h"
+#include "GLSL.ext.AMD.h"
+
+void
+vtn_handle_group(struct vtn_builder *b, SpvOp opcode,
+ const uint32_t *w, unsigned count)
+{
+ SpvScope scope =
+ vtn_value(b, w[3], vtn_value_type_constant)->constant->values[0].u32[0];
+ nir_intrinsic_op op;
+ switch (opcode) {
+ case SpvOpGroupAll:
+ switch (scope) {
+ case SpvScopeSubgroup:
+ op = nir_intrinsic_vote_all;
+ break;
+ case SpvScopeWorkgroup:
+ op = nir_intrinsic_group_all;
+ break;
+ default:
+ unreachable("bad scope");
+ }
+ break;
+ case SpvOpGroupAny:
+ switch (scope) {
+ case SpvScopeSubgroup:
+ op = nir_intrinsic_vote_any;
+ break;
+ case SpvScopeWorkgroup:
+ op = nir_intrinsic_group_any;
+ break;
+ default:
+ unreachable("bad scope");
+ }
+ break;
+ case SpvOpGroupBroadcast:
+ switch (scope) {
+ case SpvScopeSubgroup:
+ op = nir_intrinsic_read_invocation;
+ break;
+ case SpvScopeWorkgroup:
+ op = nir_intrinsic_group_broadcast;
+ break;
+ default:
+ unreachable("bad scope");
+ }
+ break;
+
+#define OP(spv, nir) \
+ case SpvOpGroup##spv##NonUniformAMD: \
+ switch (scope) { \
+ case SpvScopeSubgroup: \
+ switch ((SpvGroupOperation) w[4]) { \
+ case SpvGroupOperationReduce: \
+ op = nir_intrinsic_subgroup_##nir##_nonuniform; \
+ break; \
+ case SpvGroupOperationInclusiveScan: \
+ op = nir_intrinsic_subgroup_##nir##_inclusive_scan_nonuniform; \
+ break; \
+ case SpvGroupOperationExclusiveScan: \
+ op = nir_intrinsic_subgroup_##nir##_exclusive_scan_nonuniform; \
+ break; \
+ default: \
+ unreachable("unhandled group operation"); \
+ } \
+ break; \
+ case SpvScopeWorkgroup: \
+ switch ((SpvGroupOperation) w[4]) { \
+ case SpvGroupOperationReduce: \
+ op = nir_intrinsic_group_##nir##_nonuniform; \
+ break; \
+ case SpvGroupOperationInclusiveScan: \
+ op = nir_intrinsic_group_##nir##_inclusive_scan_nonuniform; \
+ break; \
+ case SpvGroupOperationExclusiveScan: \
+ op = nir_intrinsic_group_##nir##_exclusive_scan_nonuniform; \
+ break; \
+ default: \
+ unreachable("unhandled group operation"); \
+ } \
+ break; \
+ default: \
+ unreachable("bad scope for AMD_shader_ballot"); \
+ } \
+ break; \
+ case SpvOpGroup##spv: \
+ switch (scope) { \
+ case SpvScopeSubgroup: \
+ switch ((SpvGroupOperation) w[4]) { \
+ case SpvGroupOperationReduce: \
+ op = nir_intrinsic_subgroup_##nir; \
+ break; \
+ case SpvGroupOperationInclusiveScan: \
+ op = nir_intrinsic_subgroup_##nir##_inclusive_scan; \
+ break; \
+ case SpvGroupOperationExclusiveScan: \
+ op = nir_intrinsic_subgroup_##nir##_exclusive_scan; \
+ break; \
+ default: \
+ unreachable("unhandled group operation"); \
+ } \
+ break; \
+ case SpvScopeWorkgroup: \
+ switch ((SpvGroupOperation) w[4]) { \
+ case SpvGroupOperationReduce: \
+ op = nir_intrinsic_group_##nir; \
+ break; \
+ case SpvGroupOperationInclusiveScan: \
+ op = nir_intrinsic_group_##nir##_inclusive_scan; \
+ break; \
+ case SpvGroupOperationExclusiveScan: \
+ op = nir_intrinsic_group_##nir##_exclusive_scan; \
+ break; \
+ default: \
+ unreachable("unhandled group operation"); \
+ } \
+ break; \
+ default: \
+ unreachable("bad scope for group reduction"); \
+ } \
+ break;
+
+ OP(IAdd, iadd)
+ OP(FAdd, fadd)
+ OP(FMin, fmin)
+ OP(UMin, umin)
+ OP(SMin, imin)
+ OP(FMax, fmax)
+ OP(UMax, umax)
+ OP(SMax, imax)
+
+ default:
+ unreachable("bad opcode for AMD_shader_ballot");
+ }
+
+ nir_intrinsic_instr *intrin =
+ nir_intrinsic_instr_create(b->shader, op);
+
+ const uint32_t value =
+ (opcode == SpvOpGroupAll ||
+ opcode == SpvOpGroupAny ||
+ opcode == SpvOpGroupBroadcast) ? w[4] : w[5];
+ intrin->src[0] = nir_src_for_ssa(vtn_ssa_value(b, value)->def);
+
+ if (opcode == SpvOpGroupBroadcast) {
+ nir_ssa_def *id = vtn_ssa_value(b, w[5])->def;
+ if (scope == SpvScopeWorkgroup) {
+ /* From the SPIR-V 1.2 spec, OpGroupBroadcast:
+ *
+ * "LocalId must be an integer datatype. It can be a scalar, or a
+ * vector with 2 components or a vector with 3 components."
+ *
+ * Pad it with trailing 0's to make it always 3-dimensional, to match
+ * the definition of nir_intrinsic_group_broadcast.
+ */
+ nir_ssa_def *srcs[3];
+ for (unsigned i = 0; i < 3; i++) {
+ if (i >= id->num_components)
+ srcs[i] = nir_imm_int(&b->nb, 0);
+ else
+ srcs[i] = nir_channel(&b->nb, id, i);
+ }
+ id = nir_vec(&b->nb, srcs, 3);
+ }
+ intrin->src[1] = nir_src_for_ssa(id);
+ }
+
+ intrin->num_components = intrin->src[0].ssa->num_components;
+ nir_ssa_dest_init(&intrin->instr, &intrin->dest,
+ intrin->num_components,
+ intrin->src[0].ssa->bit_size,
+ NULL);
+ nir_builder_instr_insert(&b->nb, &intrin->instr);
+
+ nir_ssa_def *result = &intrin->dest.ssa;
+ struct vtn_value *val = vtn_push_value(b, w[2], vtn_value_type_ssa);
+ const struct glsl_type *result_type =
+ vtn_value(b, w[1], vtn_value_type_type)->type->type;
+ val->ssa = vtn_create_ssa_value(b, result_type);
+ val->ssa->def = result;
+}
+
+bool
+vtn_handle_amd_ballot_ext(struct vtn_builder *b, uint32_t opcode,
+ const uint32_t *w, unsigned count)
+{
+ unsigned num_srcs;
+ nir_intrinsic_op op;
+
+ switch ((enum ShaderBallotAMD) opcode) {
+ case SwizzleInvocationsAMD: {
+ op = nir_intrinsic_quad_swizzle_amd;
+ num_srcs = 1;
+ break;
+ }
+ case SwizzleInvocationsMaskedAMD: {
+ op = nir_intrinsic_masked_swizzle_amd;
+ num_srcs = 1;
+ break;
+ }
+ case WriteInvocationAMD:
+ op = nir_intrinsic_write_invocation;
+ num_srcs = 3;
+ break;
+ case MbcntAMD:
+ op = nir_intrinsic_mbcnt_amd;
+ num_srcs = 1;
+ break;
+ default:
+ unreachable("unknown AMD_shader_ballot opcode");
+ }
+
+ nir_intrinsic_instr *intrin =
+ nir_intrinsic_instr_create(b->shader, op);
+
+ for (unsigned i = 0; i < num_srcs; i++)
+ intrin->src[i] = nir_src_for_ssa(vtn_ssa_value(b, w[5 + i])->def);
+
+ switch ((enum ShaderBallotAMD) opcode) {
+ case SwizzleInvocationsAMD: {
+ nir_constant *offset = vtn_value(b, w[6], vtn_value_type_constant)->constant;
+ unsigned subgroup_data = 0;
+ for (unsigned i = 0; i < 4; i++)
+ subgroup_data |= offset->values[0].u32[i] << (2 * i);
+ nir_intrinsic_set_subgroup_data(intrin, subgroup_data);
+ break;
+ }
+ case SwizzleInvocationsMaskedAMD: {
+ nir_constant *mask = vtn_value(b, w[6], vtn_value_type_constant)->constant;
+ unsigned subgroup_data = 0;
+ for (unsigned i = 0; i < 3; i++)
+ subgroup_data |= mask->values[0].u32[i] << (5 * i);
+ nir_intrinsic_set_subgroup_data(intrin, subgroup_data);
+ break;
+ }
+ default:
+ break;
+ }
+
+ intrin->num_components = intrin->src[0].ssa->num_components;
+ nir_ssa_dest_init(&intrin->instr, &intrin->dest,
+ intrin->num_components,
+ (enum ShaderBallotAMD) opcode == MbcntAMD ? 32 :
+ intrin->src[0].ssa->bit_size,
+ NULL);
+ nir_builder_instr_insert(&b->nb, &intrin->instr);
+
+ nir_ssa_def *result = &intrin->dest.ssa;
+ struct vtn_value *val = vtn_push_value(b, w[2], vtn_value_type_ssa);
+ const struct glsl_type *result_type =
+ vtn_value(b, w[1], vtn_value_type_type)->type->type;
+ val->ssa = vtn_create_ssa_value(b, result_type);
+ val->ssa->def = result;
+
+ return true;
+}
+
diff --git a/src/compiler/spirv/vtn_private.h b/src/compiler/spirv/vtn_private.h
index 8458462..ab85e3c 100644
--- a/src/compiler/spirv/vtn_private.h
+++ b/src/compiler/spirv/vtn_private.h
@@ -630,6 +630,15 @@ void vtn_handle_alu(struct vtn_builder *b, SpvOp opcode,
bool vtn_handle_glsl450_instruction(struct vtn_builder *b, uint32_t ext_opcode,
const uint32_t *words, unsigned count);

+/* Handle instructions under the Group capability, including ones added by
+ * SPV_AMD_shader_ballot for non-uniform reductions.
+ */
+void vtn_handle_group(struct vtn_builder *b, SpvOp opcode,
+ const uint32_t *words, unsigned count);
+
+bool vtn_handle_amd_ballot_ext(struct vtn_builder *b, uint32_t ext_opcode,
+ const uint32_t *words, unsigned count);
+
static inline uint64_t
vtn_u64_literal(const uint32_t *w)
{
--
2.9.4
Connor Abbott
2017-08-08 01:32:30 UTC
From: Connor Abbott <***@gmail.com>

We'll want to scalarize other intrinsics in a similar manner for
AMD_shader_ballot, and possibly other extensions in the future. This
patch reworks the pass to use the intrinsic's info to detect whether we
need to copy the source or scalarize it, similarly to how ALU operations
are handled, and makes sure we copy indices if there are any. We don't
yet handle any other intrinsics, though, so this shouldn't have
any functional effect yet.
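
To illustrate the intended behaviour (hand-written NIR-style pseudocode,
not actual nir_print output), a vector read_invocation like

  vec4 ssa_2 = intrinsic read_invocation (ssa_0, ssa_1) ()

is split into one scalar intrinsic per channel, where sources marked as
per-component in the intrinsic info (src_components == 0) get the
corresponding channel and everything else (other sources, const indices)
is copied as-is:

  ssa_2.x = intrinsic read_invocation (ssa_0.x, ssa_1) ()
  ssa_2.y = intrinsic read_invocation (ssa_0.y, ssa_1) ()
  ssa_2.z = intrinsic read_invocation (ssa_0.z, ssa_1) ()
  ssa_2.w = intrinsic read_invocation (ssa_0.w, ssa_1) ()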
---
src/compiler/Makefile.sources | 2 +-
src/compiler/nir/nir.h | 2 +-
...scalar.c => nir_lower_cross_thread_to_scalar.c} | 39 ++++++++++++++--------
src/intel/compiler/brw_nir.c | 2 +-
4 files changed, 28 insertions(+), 17 deletions(-)
rename src/compiler/nir/{nir_lower_read_invocation_to_scalar.c => nir_lower_cross_thread_to_scalar.c} (70%)

diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
index 091b228..734424a 100644
--- a/src/compiler/Makefile.sources
+++ b/src/compiler/Makefile.sources
@@ -213,6 +213,7 @@ NIR_FILES = \
nir/nir_lower_clip.c \
nir/nir_lower_clip_cull_distance_arrays.c \
nir/nir_lower_constant_initializers.c \
+ nir/nir_lower_cross_thread_to_scalar.c \
nir/nir_lower_double_ops.c \
nir/nir_lower_drawpixels.c \
nir/nir_lower_global_vars_to_local.c \
@@ -229,7 +230,6 @@ NIR_FILES = \
nir/nir_lower_passthrough_edgeflags.c \
nir/nir_lower_patch_vertices.c \
nir/nir_lower_phis_to_scalar.c \
- nir/nir_lower_read_invocation_to_scalar.c \
nir/nir_lower_regs_to_ssa.c \
nir/nir_lower_returns.c \
nir/nir_lower_samplers.c \
diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
index 4b5d78e..2836cd1 100644
--- a/src/compiler/nir/nir.h
+++ b/src/compiler/nir/nir.h
@@ -2569,7 +2569,7 @@ bool nir_move_vec_src_uses_to_dest(nir_shader *shader);
bool nir_lower_vec_to_movs(nir_shader *shader);
bool nir_lower_alu_to_scalar(nir_shader *shader);
bool nir_lower_load_const_to_scalar(nir_shader *shader);
-bool nir_lower_read_invocation_to_scalar(nir_shader *shader);
+bool nir_lower_cross_thread_to_scalar(nir_shader *shader);
bool nir_lower_phis_to_scalar(nir_shader *shader);
void nir_lower_io_to_scalar(nir_shader *shader, nir_variable_mode mask);

diff --git a/src/compiler/nir/nir_lower_read_invocation_to_scalar.c b/src/compiler/nir/nir_lower_cross_thread_to_scalar.c
similarity index 70%
rename from src/compiler/nir/nir_lower_read_invocation_to_scalar.c
rename to src/compiler/nir/nir_lower_cross_thread_to_scalar.c
index 69e7c0a..56feba6 100644
--- a/src/compiler/nir/nir_lower_read_invocation_to_scalar.c
+++ b/src/compiler/nir/nir_lower_cross_thread_to_scalar.c
@@ -24,18 +24,20 @@
#include "nir.h"
#include "nir_builder.h"

-/** @file nir_lower_read_invocation_to_scalar.c
+/** @file nir_lower_cross_thread_to_scalar.c
*
- * Replaces nir_intrinsic_read_invocation/nir_intrinsic_read_first_invocation
- * operations with num_components != 1 with individual per-channel operations.
+ * Replaces certain cross-thread intrinsics with num_components != 1 with
+ * individual per-channel operations. So far, the operations supported are:
+ *
+ * - read_invocation
+ * - read_first_invocation
*/

static void
-lower_read_invocation_to_scalar(nir_builder *b, nir_intrinsic_instr *intrin)
+lower_to_scalar(nir_builder *b, nir_intrinsic_instr *intrin)
{
b->cursor = nir_before_instr(&intrin->instr);

- nir_ssa_def *value = nir_ssa_for_src(b, intrin->src[0], intrin->num_components);
nir_ssa_def *reads[4];

for (unsigned i = 0; i < intrin->num_components; i++) {
@@ -44,12 +46,21 @@ lower_read_invocation_to_scalar(nir_builder *b, nir_intrinsic_instr *intrin)
nir_ssa_dest_init(&chan_intrin->instr, &chan_intrin->dest,
1, intrin->dest.ssa.bit_size, NULL);
chan_intrin->num_components = 1;
+ const nir_intrinsic_info *info = &nir_intrinsic_infos[intrin->intrinsic];
+
+ for (unsigned src = 0; src < info->num_srcs; src++) {
+ if (info->src_components[src] != 0) {
+ nir_src_copy(&chan_intrin->src[src], &intrin->src[src],
+ chan_intrin);
+ } else {
+ nir_ssa_def *value = nir_ssa_for_src(b, intrin->src[src],
+ intrin->num_components);
+ chan_intrin->src[src] = nir_src_for_ssa(nir_channel(b, value, i));
+ }
+ }

- /* value */
- chan_intrin->src[0] = nir_src_for_ssa(nir_channel(b, value, i));
- /* invocation */
- if (intrin->intrinsic == nir_intrinsic_read_invocation)
- nir_src_copy(&chan_intrin->src[1], &intrin->src[1], chan_intrin);
+ for (unsigned idx = 0; idx < info->num_indices; idx++)
+ chan_intrin->const_index[idx] = intrin->const_index[idx];

nir_builder_instr_insert(b, &chan_intrin->instr);

@@ -63,7 +74,7 @@ lower_read_invocation_to_scalar(nir_builder *b, nir_intrinsic_instr *intrin)
}

static bool
-nir_lower_read_invocation_to_scalar_impl(nir_function_impl *impl)
+nir_lower_cross_thread_to_scalar_impl(nir_function_impl *impl)
{
bool progress = false;
nir_builder b;
@@ -82,7 +93,7 @@ nir_lower_read_invocation_to_scalar_impl(nir_function_impl *impl)
switch (intrin->intrinsic) {
case nir_intrinsic_read_invocation:
case nir_intrinsic_read_first_invocation:
- lower_read_invocation_to_scalar(&b, intrin);
+ lower_to_scalar(&b, intrin);
progress = true;
break;
default:
@@ -99,13 +110,13 @@ nir_lower_read_invocation_to_scalar_impl(nir_function_impl *impl)
}

bool
-nir_lower_read_invocation_to_scalar(nir_shader *shader)
+nir_lower_cross_thread_to_scalar(nir_shader *shader)
{
bool progress = false;

nir_foreach_function(function, shader) {
if (function->impl)
- progress |= nir_lower_read_invocation_to_scalar_impl(function->impl);
+ progress |= nir_lower_cross_thread_to_scalar_impl(function->impl);
}

return progress;
diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
index ce21c01..80b8822 100644
--- a/src/intel/compiler/brw_nir.c
+++ b/src/intel/compiler/brw_nir.c
@@ -620,7 +620,7 @@ brw_preprocess_nir(const struct brw_compiler *compiler, nir_shader *nir)

OPT(nir_lower_tex, &tex_options);
OPT(nir_normalize_cubemap_coords);
- OPT(nir_lower_read_invocation_to_scalar);
+ OPT(nir_lower_cross_thread_to_scalar);

OPT(nir_lower_global_vars_to_local);
--
2.9.4
Connor Abbott
2017-08-08 01:32:31 UTC
From: Connor Abbott <***@gmail.com>

Scalarizing the group/subgroup reduction ops isn't strictly necessary
for SPIR-V since they're supposed to be scalars anyways, but it will be
useful if we want to wire up the GL version of the extension, so we
might as well wire them up too.
---
.../nir/nir_lower_cross_thread_to_scalar.c | 24 ++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/src/compiler/nir/nir_lower_cross_thread_to_scalar.c b/src/compiler/nir/nir_lower_cross_thread_to_scalar.c
index 56feba6..a4a82ab 100644
--- a/src/compiler/nir/nir_lower_cross_thread_to_scalar.c
+++ b/src/compiler/nir/nir_lower_cross_thread_to_scalar.c
@@ -93,6 +93,30 @@ nir_lower_cross_thread_to_scalar_impl(nir_function_impl *impl)
switch (intrin->intrinsic) {
case nir_intrinsic_read_invocation:
case nir_intrinsic_read_first_invocation:
+ case nir_intrinsic_write_invocation:
+ case nir_intrinsic_quad_swizzle_amd:
+ case nir_intrinsic_masked_swizzle_amd:
+#define GROUP(name) \
+ case nir_intrinsic_group_##name: \
+ case nir_intrinsic_group_##name##_nonuniform: \
+ case nir_intrinsic_subgroup_##name: \
+ case nir_intrinsic_subgroup_##name##_nonuniform: \
+ case nir_intrinsic_group_##name##_inclusive_scan: \
+ case nir_intrinsic_group_##name##_inclusive_scan_nonuniform: \
+ case nir_intrinsic_subgroup_##name##_inclusive_scan: \
+ case nir_intrinsic_subgroup_##name##_inclusive_scan_nonuniform: \
+ case nir_intrinsic_group_##name##_exclusive_scan: \
+ case nir_intrinsic_group_##name##_exclusive_scan_nonuniform: \
+ case nir_intrinsic_subgroup_##name##_exclusive_scan: \
+ case nir_intrinsic_subgroup_##name##_exclusive_scan_nonuniform:
+GROUP(fadd)
+GROUP(iadd)
+GROUP(fmin)
+GROUP(imin)
+GROUP(umin)
+GROUP(fmax)
+GROUP(imax)
+GROUP(umax)
lower_to_scalar(&b, intrin);
progress = true;
break;
--
2.9.4
Connor Abbott
2017-08-08 01:32:32 UTC
From: Connor Abbott <***@gmail.com>

We didn't need this for read_invocation, since the SPIR-V version of
shader_ballot explicitly says that ReadInvocation and
ReadFirstInvocation can only be used with scalars, but this isn't true
for some of the SPV_AMD_shader_ballot intrinsics.
---
src/amd/vulkan/radv_pipeline.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/src/amd/vulkan/radv_pipeline.c b/src/amd/vulkan/radv_pipeline.c
index fd86519..3ed0df7 100644
--- a/src/amd/vulkan/radv_pipeline.c
+++ b/src/amd/vulkan/radv_pipeline.c
@@ -155,6 +155,7 @@ radv_optimize_nir(struct nir_shader *shader)
NIR_PASS_V(shader, nir_lower_vars_to_ssa);
NIR_PASS_V(shader, nir_lower_64bit_pack);
NIR_PASS_V(shader, nir_lower_alu_to_scalar);
+ NIR_PASS_V(shader, nir_lower_cross_thread_to_scalar);
NIR_PASS_V(shader, nir_lower_phis_to_scalar);
NIR_PASS_V(shader, nir_opt_intrinsics);
--
2.9.4
Connor Abbott
2017-08-08 01:32:33 UTC
From: Connor Abbott <***@gmail.com>

Notably this doesn't include the workgroup reduce intrinsics, since
those are tricky and the right implementation strategy might differ
across different HW. However, it implements all the other ancillary
operations needed for the SPIR-V Groups capability, so the driver doesn't
have to handle them. We might want to move the workgroup reduction
(reduce, inclusive scan, and exclusive scan) operations here eventually
though.
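
For intuition, the group_any lowering ends up roughly equivalent to the
following compute-shader sketch (hand-written GLSL-style pseudocode,
assuming anyInvocationARB() as the per-subgroup vote; the pass of course
builds the same thing directly in NIR):

  shared bool any_temp;

  bool group_any(bool value)
  {
     any_temp = false;
     barrier();
     if (anyInvocationARB(value))  /* subgroup vote, stays in registers */
        any_temp = true;
     barrier();
     return any_temp;
  }

group_all is the same idea with the condition inverted, and
group_broadcast stores the source to a shared temporary from the
invocation whose gl_LocalInvocationID matches the requested ID and has
everyone read it back after a barrier.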
---
src/compiler/Makefile.sources | 1 +
src/compiler/nir/nir.h | 2 +
src/compiler/nir/nir_lower_group_reduce.c | 179 ++++++++++++++++++++++++++++++
3 files changed, 182 insertions(+)
create mode 100644 src/compiler/nir/nir_lower_group_reduce.c

diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
index 734424a..70229ec 100644
--- a/src/compiler/Makefile.sources
+++ b/src/compiler/Makefile.sources
@@ -217,6 +217,7 @@ NIR_FILES = \
nir/nir_lower_double_ops.c \
nir/nir_lower_drawpixels.c \
nir/nir_lower_global_vars_to_local.c \
+ nir/nir_lower_group_reduce.c \
nir/nir_lower_gs_intrinsics.c \
nir/nir_lower_load_const_to_scalar.c \
nir/nir_lower_locals_to_regs.c \
diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
index 2836cd1..1a2cc61 100644
--- a/src/compiler/nir/nir.h
+++ b/src/compiler/nir/nir.h
@@ -2743,6 +2743,8 @@ typedef enum {
bool nir_lower_doubles(nir_shader *shader, nir_lower_doubles_options options);
bool nir_lower_64bit_pack(nir_shader *shader);

+bool nir_lower_group_reduce(nir_shader *shader);
+
bool nir_normalize_cubemap_coords(nir_shader *shader);

void nir_live_ssa_defs_impl(nir_function_impl *impl);
diff --git a/src/compiler/nir/nir_lower_group_reduce.c b/src/compiler/nir/nir_lower_group_reduce.c
new file mode 100644
index 0000000..4182f21
--- /dev/null
+++ b/src/compiler/nir/nir_lower_group_reduce.c
@@ -0,0 +1,179 @@
+/*
+ * Copyright © 2017 Valve Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "nir.h"
+#include "nir_builder.h"
+
+/*
+ * Implement these workgroup operations using operations on shared variables:
+ *
+ * - group_broadcast
+ * - group_any
+ * - group_all
+ */
+
+static nir_ssa_def *
+build_subgroup_any(nir_builder *b, nir_ssa_def *src)
+{
+ nir_intrinsic_instr *instr = nir_intrinsic_instr_create(b->shader,
+ nir_intrinsic_vote_any);
+ nir_ssa_dest_init(&instr->instr, &instr->dest, 1, 32, NULL);
+ instr->src[0] = nir_src_for_ssa(src);
+ nir_builder_instr_insert(b, &instr->instr);
+ return &instr->dest.ssa;
+}
+
+static void
+build_barrier(nir_builder *b)
+{
+ nir_intrinsic_instr *intrin =
+ nir_intrinsic_instr_create(b->shader, nir_intrinsic_barrier);
+ nir_builder_instr_insert(b, &intrin->instr);
+}
+
+/* TODO share this between different instructions */
+
+static nir_variable *
+alloc_shared_temp(nir_shader *shader, unsigned components,
+ unsigned bit_size)
+{
+ enum glsl_base_type base_type;
+ switch (bit_size) {
+ case 32:
+ base_type = GLSL_TYPE_UINT;
+ break;
+ case 64:
+ base_type = GLSL_TYPE_UINT64;
+ break;
+ default:
+ unreachable("bad bit size");
+ }
+
+ const struct glsl_type *type;
+ if (components == 1)
+ type = glsl_scalar_type(base_type);
+ else
+ type = glsl_vector_type(base_type, components);
+ return nir_variable_create(shader, nir_var_shared, type, "shared_temp");
+}
+
+static nir_ssa_def *
+build_group_any(nir_builder *b, nir_ssa_def *src)
+{
+ assert(src->num_components == 1);
+ nir_variable *temp = nir_variable_create(b->shader, nir_var_shared,
+ glsl_bool_type(), "any_temp");
+
+ nir_store_var(b, temp, nir_imm_int(b, NIR_FALSE), 1);
+ build_barrier(b);
+ nir_push_if(b, build_subgroup_any(b, src));
+ nir_store_var(b, temp, nir_imm_int(b, NIR_TRUE), 1);
+ nir_pop_if(b, NULL);
+ build_barrier(b);
+ return nir_load_var(b, temp);
+}
+
+static nir_ssa_def *
+build_group_all(nir_builder *b, nir_ssa_def *src)
+{
+ assert(src->num_components == 1);
+ nir_variable *temp = nir_variable_create(b->shader, nir_var_shared,
+ glsl_bool_type(), "all_temp");
+
+ nir_store_var(b, temp, nir_imm_int(b, NIR_TRUE), 1);
+ build_barrier(b);
+ nir_push_if(b, build_subgroup_any(b, nir_inot(b, src)));
+ nir_store_var(b, temp, nir_imm_int(b, NIR_FALSE), 1);
+ nir_pop_if(b, NULL);
+ build_barrier(b);
+ return nir_load_var(b, temp);
+}
+
+static nir_ssa_def *
+build_group_broadcast(nir_builder *b, nir_ssa_def *src, nir_ssa_def *id)
+{
+ nir_variable *temp = alloc_shared_temp(b->shader, src->num_components,
+ src->bit_size);
+
+ nir_push_if(b, nir_ball_iequal3(b, id, nir_load_local_invocation_id(b)));
+ nir_store_var(b, temp, src, (1 << src->num_components) - 1);
+ nir_pop_if(b, NULL);
+ build_barrier(b);
+ return nir_load_var(b, temp);
+}
+
+static bool
+lower_group_reduce_impl(nir_function_impl *impl,
+ const struct shader_info *info)
+{
+ nir_builder b;
+ nir_builder_init(&b, impl);
+ bool progress = false;
+
+ nir_foreach_block_safe(block, impl) {
+ nir_foreach_instr_safe(instr, block) {
+ if (instr->type != nir_instr_type_intrinsic)
+ continue;
+
+ nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
+ nir_ssa_def *replacement = NULL;
+ b.cursor = nir_before_instr(instr);
+
+ switch (intrin->intrinsic) {
+ case nir_intrinsic_group_any:
+ replacement = build_group_any(&b, intrin->src[0].ssa);
+ break;
+ case nir_intrinsic_group_all:
+ replacement = build_group_all(&b, intrin->src[0].ssa);
+ break;
+ case nir_intrinsic_group_broadcast:
+ replacement = build_group_broadcast(&b, intrin->src[0].ssa,
+ intrin->src[1].ssa);
+ break;
+ default:
+ continue;
+ }
+
+ assert(replacement);
+ nir_ssa_def_rewrite_uses(&intrin->dest.ssa,
+ nir_src_for_ssa(replacement));
+ progress = true;
+ }
+ }
+
+ return progress;
+}
+
+bool
+nir_lower_group_reduce(nir_shader *shader)
+{
+ bool progress = false;
+
+ nir_foreach_function(function, shader) {
+ if (function->impl)
+ progress |= lower_group_reduce_impl(function->impl, &shader->info);
+ }
+
+ return progress;
+}
--
2.9.4
Connor Abbott
2017-08-08 01:32:34 UTC
From: Connor Abbott <***@gmail.com>

---
src/amd/vulkan/radv_pipeline.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/src/amd/vulkan/radv_pipeline.c b/src/amd/vulkan/radv_pipeline.c
index 3ed0df7..4aecb81 100644
--- a/src/amd/vulkan/radv_pipeline.c
+++ b/src/amd/vulkan/radv_pipeline.c
@@ -295,6 +295,7 @@ radv_shader_compile_to_nir(struct radv_device *device,
nir_lower_var_copies(nir);
nir_lower_global_vars_to_local(nir);
nir_remove_dead_variables(nir, nir_var_local);
+ nir_lower_group_reduce(nir);
radv_optimize_nir(nir);

if (dump)
--
2.9.4
Connor Abbott
2017-08-08 01:32:35 UTC
From: Connor Abbott <***@gmail.com>

We'll need to use ac_to_integer() for other stuff in ac_llvm_build.c.
---
src/amd/common/ac_llvm_build.c | 61 +++++++++++
src/amd/common/ac_llvm_build.h | 5 +
src/amd/common/ac_nir_to_llvm.c | 224 +++++++++++++++-------------------------
3 files changed, 150 insertions(+), 140 deletions(-)

diff --git a/src/amd/common/ac_llvm_build.c b/src/amd/common/ac_llvm_build.c
index 7b024aa..2fdb3e8 100644
--- a/src/amd/common/ac_llvm_build.c
+++ b/src/amd/common/ac_llvm_build.c
@@ -113,6 +113,67 @@ ac_get_type_size(LLVMTypeRef type)
}
}

+static LLVMTypeRef to_integer_type_scalar(struct ac_llvm_context *ctx, LLVMTypeRef t)
+{
+ if (t == ctx->f16 || t == ctx->i16)
+ return ctx->i16;
+ else if (t == ctx->f32 || t == ctx->i32)
+ return ctx->i32;
+ else if (t == ctx->f64 || t == ctx->i64)
+ return ctx->i64;
+ else
+ unreachable("Unhandled integer size");
+}
+
+LLVMTypeRef
+ac_to_integer_type(struct ac_llvm_context *ctx, LLVMTypeRef t)
+{
+ if (LLVMGetTypeKind(t) == LLVMVectorTypeKind) {
+ LLVMTypeRef elem_type = LLVMGetElementType(t);
+ return LLVMVectorType(to_integer_type_scalar(ctx, elem_type),
+ LLVMGetVectorSize(t));
+ }
+ return to_integer_type_scalar(ctx, t);
+}
+
+LLVMValueRef
+ac_to_integer(struct ac_llvm_context *ctx, LLVMValueRef v)
+{
+ LLVMTypeRef type = LLVMTypeOf(v);
+ return LLVMBuildBitCast(ctx->builder, v, ac_to_integer_type(ctx, type), "");
+}
+
+static LLVMTypeRef to_float_type_scalar(struct ac_llvm_context *ctx, LLVMTypeRef t)
+{
+ if (t == ctx->i16 || t == ctx->f16)
+ return ctx->f16;
+ else if (t == ctx->i32 || t == ctx->f32)
+ return ctx->f32;
+ else if (t == ctx->i64 || t == ctx->f64)
+ return ctx->f64;
+ else
+ unreachable("Unhandled float size");
+}
+
+LLVMTypeRef
+ac_to_float_type(struct ac_llvm_context *ctx, LLVMTypeRef t)
+{
+ if (LLVMGetTypeKind(t) == LLVMVectorTypeKind) {
+ LLVMTypeRef elem_type = LLVMGetElementType(t);
+ return LLVMVectorType(to_float_type_scalar(ctx, elem_type),
+ LLVMGetVectorSize(t));
+ }
+ return to_float_type_scalar(ctx, t);
+}
+
+LLVMValueRef
+ac_to_float(struct ac_llvm_context *ctx, LLVMValueRef v)
+{
+ LLVMTypeRef type = LLVMTypeOf(v);
+ return LLVMBuildBitCast(ctx->builder, v, ac_to_float_type(ctx, type), "");
+}
+
+
LLVMValueRef
ac_build_intrinsic(struct ac_llvm_context *ctx, const char *name,
LLVMTypeRef return_type, LLVMValueRef *params,
diff --git a/src/amd/common/ac_llvm_build.h b/src/amd/common/ac_llvm_build.h
index 13655f1..1d9850b 100644
--- a/src/amd/common/ac_llvm_build.h
+++ b/src/amd/common/ac_llvm_build.h
@@ -68,6 +68,11 @@ ac_llvm_context_init(struct ac_llvm_context *ctx, LLVMContextRef context);

unsigned ac_get_type_size(LLVMTypeRef type);

+LLVMTypeRef ac_to_integer_type(struct ac_llvm_context *ctx, LLVMTypeRef t);
+LLVMValueRef ac_to_integer(struct ac_llvm_context *ctx, LLVMValueRef v);
+LLVMTypeRef ac_to_float_type(struct ac_llvm_context *ctx, LLVMTypeRef t);
+LLVMValueRef ac_to_float(struct ac_llvm_context *ctx, LLVMValueRef v);
+
LLVMValueRef
ac_build_intrinsic(struct ac_llvm_context *ctx, const char *name,
LLVMTypeRef return_type, LLVMValueRef *params,
diff --git a/src/amd/common/ac_nir_to_llvm.c b/src/amd/common/ac_nir_to_llvm.c
index f01081c..b39b873 100644
--- a/src/amd/common/ac_nir_to_llvm.c
+++ b/src/amd/common/ac_nir_to_llvm.c
@@ -399,62 +399,6 @@ static LLVMTypeRef const_array(LLVMTypeRef elem_type, int num_elements)
CONST_ADDR_SPACE);
}

-static LLVMTypeRef to_integer_type_scalar(struct ac_llvm_context *ctx, LLVMTypeRef t)
-{
- if (t == ctx->f16 || t == ctx->i16)
- return ctx->i16;
- else if (t == ctx->f32 || t == ctx->i32)
- return ctx->i32;
- else if (t == ctx->f64 || t == ctx->i64)
- return ctx->i64;
- else
- unreachable("Unhandled integer size");
-}
-
-static LLVMTypeRef to_integer_type(struct ac_llvm_context *ctx, LLVMTypeRef t)
-{
- if (LLVMGetTypeKind(t) == LLVMVectorTypeKind) {
- LLVMTypeRef elem_type = LLVMGetElementType(t);
- return LLVMVectorType(to_integer_type_scalar(ctx, elem_type),
- LLVMGetVectorSize(t));
- }
- return to_integer_type_scalar(ctx, t);
-}
-
-static LLVMValueRef to_integer(struct ac_llvm_context *ctx, LLVMValueRef v)
-{
- LLVMTypeRef type = LLVMTypeOf(v);
- return LLVMBuildBitCast(ctx->builder, v, to_integer_type(ctx, type), "");
-}
-
-static LLVMTypeRef to_float_type_scalar(struct ac_llvm_context *ctx, LLVMTypeRef t)
-{
- if (t == ctx->i16 || t == ctx->f16)
- return ctx->f16;
- else if (t == ctx->i32 || t == ctx->f32)
- return ctx->f32;
- else if (t == ctx->i64 || t == ctx->f64)
- return ctx->f64;
- else
- unreachable("Unhandled float size");
-}
-
-static LLVMTypeRef to_float_type(struct ac_llvm_context *ctx, LLVMTypeRef t)
-{
- if (LLVMGetTypeKind(t) == LLVMVectorTypeKind) {
- LLVMTypeRef elem_type = LLVMGetElementType(t);
- return LLVMVectorType(to_float_type_scalar(ctx, elem_type),
- LLVMGetVectorSize(t));
- }
- return to_float_type_scalar(ctx, t);
-}
-
-static LLVMValueRef to_float(struct ac_llvm_context *ctx, LLVMValueRef v)
-{
- LLVMTypeRef type = LLVMTypeOf(v);
- return LLVMBuildBitCast(ctx->builder, v, to_float_type(ctx, type), "");
-}
-
static int get_elem_bits(struct ac_llvm_context *ctx, LLVMTypeRef type)
{
if (LLVMGetTypeKind(type) == LLVMVectorTypeKind)
@@ -1103,8 +1047,8 @@ static LLVMValueRef emit_float_cmp(struct ac_llvm_context *ctx,
LLVMValueRef src1)
{
LLVMValueRef result;
- src0 = to_float(ctx, src0);
- src1 = to_float(ctx, src1);
+ src0 = ac_to_float(ctx, src0);
+ src1 = ac_to_float(ctx, src1);
result = LLVMBuildFCmp(ctx->builder, pred, src0, src1, "");
return LLVMBuildSelect(ctx->builder, result,
LLVMConstInt(ctx->i32, 0xFFFFFFFF, false),
@@ -1118,7 +1062,7 @@ static LLVMValueRef emit_intrin_1f_param(struct ac_llvm_context *ctx,
{
char name[64];
LLVMValueRef params[] = {
- to_float(ctx, src0),
+ ac_to_float(ctx, src0),
};

MAYBE_UNUSED const int length = snprintf(name, sizeof(name), "%s.f%d", intrin,
@@ -1134,8 +1078,8 @@ static LLVMValueRef emit_intrin_2f_param(struct ac_llvm_context *ctx,
{
char name[64];
LLVMValueRef params[] = {
- to_float(ctx, src0),
- to_float(ctx, src1),
+ ac_to_float(ctx, src0),
+ ac_to_float(ctx, src1),
};

MAYBE_UNUSED const int length = snprintf(name, sizeof(name), "%s.f%d", intrin,
@@ -1151,9 +1095,9 @@ static LLVMValueRef emit_intrin_3f_param(struct ac_llvm_context *ctx,
{
char name[64];
LLVMValueRef params[] = {
- to_float(ctx, src0),
- to_float(ctx, src1),
- to_float(ctx, src2),
+ ac_to_float(ctx, src0),
+ ac_to_float(ctx, src1),
+ ac_to_float(ctx, src2),
};

MAYBE_UNUSED const int length = snprintf(name, sizeof(name), "%s.f%d", intrin,
@@ -1255,7 +1199,7 @@ static LLVMValueRef emit_ffract(struct ac_llvm_context *ctx,
LLVMValueRef src0)
{
const char *intr = "llvm.floor.f32";
- LLVMValueRef fsrc0 = to_float(ctx, src0);
+ LLVMValueRef fsrc0 = ac_to_float(ctx, src0);
LLVMValueRef params[] = {
fsrc0,
};
@@ -1293,7 +1237,7 @@ static LLVMValueRef emit_b2f(struct ac_llvm_context *ctx,
static LLVMValueRef emit_f2b(struct ac_llvm_context *ctx,
LLVMValueRef src0)
{
- src0 = to_float(ctx, src0);
+ src0 = ac_to_float(ctx, src0);
return LLVMBuildSExt(ctx->builder,
LLVMBuildFCmp(ctx->builder, LLVMRealUNE, src0, ctx->f32_0, ""),
ctx->i32, "");
@@ -1319,7 +1263,7 @@ static LLVMValueRef emit_f2f16(struct nir_to_llvm_context *ctx,
LLVMValueRef result;
LLVMValueRef cond;

- src0 = to_float(&ctx->ac, src0);
+ src0 = ac_to_float(&ctx->ac, src0);
result = LLVMBuildFPTrunc(ctx->builder, src0, ctx->f16, "");

if (ctx->options->chip_class >= VI) {
@@ -1429,7 +1373,7 @@ static LLVMValueRef emit_pack_half_2x16(struct ac_llvm_context *ctx,
int i;
LLVMValueRef comp[2];

- src0 = to_float(ctx, src0);
+ src0 = ac_to_float(ctx, src0);
comp[0] = LLVMBuildExtractElement(ctx->builder, src0, ctx->i32_0, "");
comp[1] = LLVMBuildExtractElement(ctx->builder, src0, ctx->i32_1, "");
for (i = 0; i < 2; i++) {
@@ -1550,7 +1494,7 @@ static void visit_alu(struct ac_nir_context *ctx, const nir_alu_instr *instr)
result = src[0];
break;
case nir_op_fneg:
- src[0] = to_float(&ctx->ac, src[0]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
result = LLVMBuildFNeg(ctx->ac.builder, src[0], "");
break;
case nir_op_ineg:
@@ -1563,13 +1507,13 @@ static void visit_alu(struct ac_nir_context *ctx, const nir_alu_instr *instr)
result = LLVMBuildAdd(ctx->ac.builder, src[0], src[1], "");
break;
case nir_op_fadd:
- src[0] = to_float(&ctx->ac, src[0]);
- src[1] = to_float(&ctx->ac, src[1]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
+ src[1] = ac_to_float(&ctx->ac, src[1]);
result = LLVMBuildFAdd(ctx->ac.builder, src[0], src[1], "");
break;
case nir_op_fsub:
- src[0] = to_float(&ctx->ac, src[0]);
- src[1] = to_float(&ctx->ac, src[1]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
+ src[1] = ac_to_float(&ctx->ac, src[1]);
result = LLVMBuildFSub(ctx->ac.builder, src[0], src[1], "");
break;
case nir_op_isub:
@@ -1585,17 +1529,17 @@ static void visit_alu(struct ac_nir_context *ctx, const nir_alu_instr *instr)
result = LLVMBuildURem(ctx->ac.builder, src[0], src[1], "");
break;
case nir_op_fmod:
- src[0] = to_float(&ctx->ac, src[0]);
- src[1] = to_float(&ctx->ac, src[1]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
+ src[1] = ac_to_float(&ctx->ac, src[1]);
result = ac_build_fdiv(&ctx->ac, src[0], src[1]);
result = emit_intrin_1f_param(&ctx->ac, "llvm.floor",
- to_float_type(&ctx->ac, def_type), result);
+ ac_to_float_type(&ctx->ac, def_type), result);
result = LLVMBuildFMul(ctx->ac.builder, src[1] , result, "");
result = LLVMBuildFSub(ctx->ac.builder, src[0], result, "");
break;
case nir_op_frem:
- src[0] = to_float(&ctx->ac, src[0]);
- src[1] = to_float(&ctx->ac, src[1]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
+ src[1] = ac_to_float(&ctx->ac, src[1]);
result = LLVMBuildFRem(ctx->ac.builder, src[0], src[1], "");
break;
case nir_op_irem:
@@ -1608,17 +1552,17 @@ static void visit_alu(struct ac_nir_context *ctx, const nir_alu_instr *instr)
result = LLVMBuildUDiv(ctx->ac.builder, src[0], src[1], "");
break;
case nir_op_fmul:
- src[0] = to_float(&ctx->ac, src[0]);
- src[1] = to_float(&ctx->ac, src[1]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
+ src[1] = ac_to_float(&ctx->ac, src[1]);
result = LLVMBuildFMul(ctx->ac.builder, src[0], src[1], "");
break;
case nir_op_fdiv:
- src[0] = to_float(&ctx->ac, src[0]);
- src[1] = to_float(&ctx->ac, src[1]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
+ src[1] = ac_to_float(&ctx->ac, src[1]);
result = ac_build_fdiv(&ctx->ac, src[0], src[1]);
break;
case nir_op_frcp:
- src[0] = to_float(&ctx->ac, src[0]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
result = ac_build_fdiv(&ctx->ac, ctx->ac.f32_1, src[0]);
break;
case nir_op_iand:
@@ -1680,7 +1624,7 @@ static void visit_alu(struct ac_nir_context *ctx, const nir_alu_instr *instr)
break;
case nir_op_fabs:
result = emit_intrin_1f_param(&ctx->ac, "llvm.fabs",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
break;
case nir_op_iabs:
result = emit_iabs(&ctx->ac, src[0]);
@@ -1701,76 +1645,76 @@ static void visit_alu(struct ac_nir_context *ctx, const nir_alu_instr *instr)
result = emit_isign(&ctx->ac, src[0]);
break;
case nir_op_fsign:
- src[0] = to_float(&ctx->ac, src[0]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
result = emit_fsign(&ctx->ac, src[0]);
break;
case nir_op_ffloor:
result = emit_intrin_1f_param(&ctx->ac, "llvm.floor",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
break;
case nir_op_ftrunc:
result = emit_intrin_1f_param(&ctx->ac, "llvm.trunc",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
break;
case nir_op_fceil:
result = emit_intrin_1f_param(&ctx->ac, "llvm.ceil",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
break;
case nir_op_fround_even:
result = emit_intrin_1f_param(&ctx->ac, "llvm.rint",
- to_float_type(&ctx->ac, def_type),src[0]);
+ ac_to_float_type(&ctx->ac, def_type),src[0]);
break;
case nir_op_ffract:
result = emit_ffract(&ctx->ac, src[0]);
break;
case nir_op_fsin:
result = emit_intrin_1f_param(&ctx->ac, "llvm.sin",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
break;
case nir_op_fcos:
result = emit_intrin_1f_param(&ctx->ac, "llvm.cos",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
break;
case nir_op_fsqrt:
result = emit_intrin_1f_param(&ctx->ac, "llvm.sqrt",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
break;
case nir_op_fexp2:
result = emit_intrin_1f_param(&ctx->ac, "llvm.exp2",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
break;
case nir_op_flog2:
result = emit_intrin_1f_param(&ctx->ac, "llvm.log2",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
break;
case nir_op_frsq:
result = emit_intrin_1f_param(&ctx->ac, "llvm.sqrt",
- to_float_type(&ctx->ac, def_type), src[0]);
+ ac_to_float_type(&ctx->ac, def_type), src[0]);
result = ac_build_fdiv(&ctx->ac, ctx->ac.f32_1, result);
break;
case nir_op_fpow:
result = emit_intrin_2f_param(&ctx->ac, "llvm.pow",
- to_float_type(&ctx->ac, def_type), src[0], src[1]);
+ ac_to_float_type(&ctx->ac, def_type), src[0], src[1]);
break;
case nir_op_fmax:
result = emit_intrin_2f_param(&ctx->ac, "llvm.maxnum",
- to_float_type(&ctx->ac, def_type), src[0], src[1]);
+ ac_to_float_type(&ctx->ac, def_type), src[0], src[1]);
if (instr->dest.dest.ssa.bit_size == 32)
result = emit_intrin_1f_param(&ctx->ac, "llvm.canonicalize",
- to_float_type(&ctx->ac, def_type),
+ ac_to_float_type(&ctx->ac, def_type),
result);
break;
case nir_op_fmin:
result = emit_intrin_2f_param(&ctx->ac, "llvm.minnum",
- to_float_type(&ctx->ac, def_type), src[0], src[1]);
+ ac_to_float_type(&ctx->ac, def_type), src[0], src[1]);
if (instr->dest.dest.ssa.bit_size == 32)
result = emit_intrin_1f_param(&ctx->ac, "llvm.canonicalize",
- to_float_type(&ctx->ac, def_type),
+ ac_to_float_type(&ctx->ac, def_type),
result);
break;
case nir_op_ffma:
result = emit_intrin_3f_param(&ctx->ac, "llvm.fma",
- to_float_type(&ctx->ac, def_type), src[0], src[1], src[2]);
+ ac_to_float_type(&ctx->ac, def_type), src[0], src[1], src[2]);
break;
case nir_op_ibitfield_extract:
result = emit_bitfield_extract(&ctx->ac, true, src);
@@ -1791,32 +1735,32 @@ static void visit_alu(struct ac_nir_context *ctx, const nir_alu_instr *instr)
case nir_op_vec3:
case nir_op_vec4:
for (unsigned i = 0; i < nir_op_infos[instr->op].num_inputs; i++)
- src[i] = to_integer(&ctx->ac, src[i]);
+ src[i] = ac_to_integer(&ctx->ac, src[i]);
result = ac_build_gather_values(&ctx->ac, src, num_components);
break;
case nir_op_f2i32:
case nir_op_f2i64:
- src[0] = to_float(&ctx->ac, src[0]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
result = LLVMBuildFPToSI(ctx->ac.builder, src[0], def_type, "");
break;
case nir_op_f2u32:
case nir_op_f2u64:
- src[0] = to_float(&ctx->ac, src[0]);
+ src[0] = ac_to_float(&ctx->ac, src[0]);
result = LLVMBuildFPToUI(ctx->ac.builder, src[0], def_type, "");
break;
case nir_op_i2f32:
case nir_op_i2f64:
- result = LLVMBuildSIToFP(ctx->ac.builder, src[0], to_float_type(&ctx->ac, def_type), "");
+ result = LLVMBuildSIToFP(ctx->ac.builder, src[0], ac_to_float_type(&ctx->ac, def_type), "");
break;
case nir_op_u2f32:
case nir_op_u2f64:
- result = LLVMBuildUIToFP(ctx->ac.builder, src[0], to_float_type(&ctx->ac, def_type), "");
+ result = LLVMBuildUIToFP(ctx->ac.builder, src[0], ac_to_float_type(&ctx->ac, def_type), "");
break;
case nir_op_f2f64:
- result = LLVMBuildFPExt(ctx->ac.builder, src[0], to_float_type(&ctx->ac, def_type), "");
+ result = LLVMBuildFPExt(ctx->ac.builder, src[0], ac_to_float_type(&ctx->ac, def_type), "");
break;
case nir_op_f2f32:
- result = LLVMBuildFPTrunc(ctx->ac.builder, src[0], to_float_type(&ctx->ac, def_type), "");
+ result = LLVMBuildFPTrunc(ctx->ac.builder, src[0], ac_to_float_type(&ctx->ac, def_type), "");
break;
case nir_op_u2u32:
case nir_op_u2u64:
@@ -1925,7 +1869,7 @@ static void visit_alu(struct ac_nir_context *ctx, const nir_alu_instr *instr)

if (result) {
assert(instr->dest.dest.is_ssa);
- result = to_integer(&ctx->ac, result);
+ result = ac_to_integer(&ctx->ac, result);
_mesa_hash_table_insert(ctx->defs, &instr->dest.dest.ssa,
result);
}
@@ -2270,7 +2214,7 @@ static void visit_store_ssbo(struct ac_nir_context *ctx,
if (components_32bit > 1)
data_type = LLVMVectorType(ctx->ac.f32, components_32bit);

- base_data = to_float(&ctx->ac, src_data);
+ base_data = ac_to_float(&ctx->ac, src_data);
base_data = trim_vector(&ctx->ac, base_data, instr->num_components);
base_data = LLVMBuildBitCast(ctx->ac.builder, base_data,
data_type, "");
@@ -3065,7 +3009,7 @@ visit_store_var(struct ac_nir_context *ctx,
{
LLVMValueRef temp_ptr, value;
int idx = instr->variables[0]->var->data.driver_location;
- LLVMValueRef src = to_float(&ctx->ac, get_src(ctx, instr->src[0]));
+ LLVMValueRef src = ac_to_float(&ctx->ac, get_src(ctx, instr->src[0]));
int writemask = instr->const_index[0];
LLVMValueRef indir_index;
unsigned const_index;
@@ -3250,7 +3194,7 @@ static LLVMValueRef adjust_sample_index_using_fmask(struct ac_llvm_context *ctx,

res = ac_build_image_opcode(ctx, &args);

- res = to_integer(ctx, res);
+ res = ac_to_integer(ctx, res);
LLVMValueRef four = LLVMConstInt(ctx->i32, 4, false);
LLVMValueRef F = LLVMConstInt(ctx->i32, 0xf, false);

@@ -3395,7 +3339,7 @@ static LLVMValueRef visit_image_load(struct ac_nir_context *ctx,
params, 5, 0);

res = trim_vector(&ctx->ac, res, instr->dest.ssa.num_components);
- res = to_integer(&ctx->ac, res);
+ res = ac_to_integer(&ctx->ac, res);
} else {
bool is_da = glsl_sampler_type_is_array(type) ||
glsl_get_sampler_dim(type) == GLSL_SAMPLER_DIM_CUBE;
@@ -3428,7 +3372,7 @@ static LLVMValueRef visit_image_load(struct ac_nir_context *ctx,
res = ac_build_intrinsic(&ctx->ac, intrinsic_name, ctx->ac.v4f32,
params, 7, AC_FUNC_ATTR_READONLY);
}
- return to_integer(&ctx->ac, res);
+ return ac_to_integer(&ctx->ac, res);
}

static void visit_image_store(struct ac_nir_context *ctx,
@@ -3446,7 +3390,7 @@ static void visit_image_store(struct ac_nir_context *ctx,
glc = i1true;

if (glsl_get_sampler_dim(type) == GLSL_SAMPLER_DIM_BUF) {
- params[0] = to_float(&ctx->ac, get_src(ctx, instr->src[2])); /* data */
+ params[0] = ac_to_float(&ctx->ac, get_src(ctx, instr->src[2])); /* data */
params[1] = get_sampler_desc(ctx, instr->variables[0], AC_DESC_BUFFER, true, true);
params[2] = LLVMBuildExtractElement(ctx->ac.builder, get_src(ctx, instr->src[0]),
ctx->ac.i32_0, ""); /* vindex */
@@ -3461,7 +3405,7 @@ static void visit_image_store(struct ac_nir_context *ctx,
LLVMValueRef da = is_da ? i1true : i1false;
LLVMValueRef slc = i1false;

- params[0] = to_float(&ctx->ac, get_src(ctx, instr->src[2]));
+ params[0] = ac_to_float(&ctx->ac, get_src(ctx, instr->src[2]));
params[1] = get_image_coords(ctx, instr); /* coords */
params[2] = get_sampler_desc(ctx, instr->variables[0], AC_DESC_IMAGE, true, true);
params[3] = LLVMConstInt(ctx->ac.i32, 15, false); /* dmask */
@@ -3711,7 +3655,7 @@ static LLVMValueRef visit_var_atomic(struct nir_to_llvm_context *ctx,
return NULL;
}

- result = LLVMBuildAtomicRMW(ctx->builder, op, ptr, to_integer(&ctx->ac, src),
+ result = LLVMBuildAtomicRMW(ctx->builder, op, ptr, ac_to_integer(&ctx->ac, src),
LLVMAtomicOrderingSequentiallyConsistent,
false);
}
@@ -3798,8 +3742,8 @@ static LLVMValueRef visit_interp(struct nir_to_llvm_context *ctx,
}

if (instr->intrinsic == nir_intrinsic_interp_var_at_offset) {
- src_c0 = to_float(&ctx->ac, LLVMBuildExtractElement(ctx->builder, src0, ctx->i32zero, ""));
- src_c1 = to_float(&ctx->ac, LLVMBuildExtractElement(ctx->builder, src0, ctx->i32one, ""));
+ src_c0 = ac_to_float(&ctx->ac, LLVMBuildExtractElement(ctx->builder, src0, ctx->i32zero, ""));
+ src_c1 = ac_to_float(&ctx->ac, LLVMBuildExtractElement(ctx->builder, src0, ctx->i32one, ""));
} else if (instr->intrinsic == nir_intrinsic_interp_var_at_sample) {
LLVMValueRef sample_position;
LLVMValueRef halfval = LLVMConstReal(ctx->f32, 0.5f);
@@ -4430,9 +4374,9 @@ static void tex_fetch_ptrs(struct ac_nir_context *ctx,
static LLVMValueRef apply_round_slice(struct ac_llvm_context *ctx,
LLVMValueRef coord)
{
- coord = to_float(ctx, coord);
+ coord = ac_to_float(ctx, coord);
coord = ac_build_intrinsic(ctx, "llvm.rint.f32", ctx->f32, &coord, 1, 0);
- coord = to_integer(ctx, coord);
+ coord = ac_to_integer(ctx, coord);
return coord;
}

@@ -4558,8 +4502,8 @@ static void visit_tex(struct ac_nir_context *ctx, nir_tex_instr *instr)

/* Pack depth comparison value */
if (instr->is_shadow && comparator) {
- LLVMValueRef z = to_float(&ctx->ac,
- llvm_extract_elem(&ctx->ac, comparator, 0));
+ LLVMValueRef z = ac_to_float(&ctx->ac,
+ llvm_extract_elem(&ctx->ac, comparator, 0));

/* TC-compatible HTILE promotes Z16 and Z24 to Z32_FLOAT,
* so the depth comparison value isn't clamped for Z16 and
@@ -4591,8 +4535,8 @@ static void visit_tex(struct ac_nir_context *ctx, nir_tex_instr *instr)
}

for (unsigned i = 0; i < num_deriv_comp; i++) {
- derivs[i] = to_float(&ctx->ac, llvm_extract_elem(&ctx->ac, ddx, i));
- derivs[num_deriv_comp + i] = to_float(&ctx->ac, llvm_extract_elem(&ctx->ac, ddy, i));
+ derivs[i] = ac_to_float(&ctx->ac, llvm_extract_elem(&ctx->ac, ddx, i));
+ derivs[num_deriv_comp + i] = ac_to_float(&ctx->ac, llvm_extract_elem(&ctx->ac, ddy, i));
}
}

@@ -4600,7 +4544,7 @@ static void visit_tex(struct ac_nir_context *ctx, nir_tex_instr *instr)
if (instr->is_array && instr->op != nir_texop_lod)
coords[3] = apply_round_slice(&ctx->ac, coords[3]);
for (chan = 0; chan < instr->coord_components; chan++)
- coords[chan] = to_float(&ctx->ac, coords[chan]);
+ coords[chan] = ac_to_float(&ctx->ac, coords[chan]);
if (instr->coord_components == 3)
coords[3] = LLVMGetUndef(ctx->ac.f32);
ac_prepare_cube_coords(&ctx->ac,
@@ -4735,7 +4679,7 @@ static void visit_tex(struct ac_nir_context *ctx, nir_tex_instr *instr)
write_result:
if (result) {
assert(instr->dest.is_ssa);
- result = to_integer(&ctx->ac, result);
+ result = ac_to_integer(&ctx->ac, result);
_mesa_hash_table_insert(ctx->defs, &instr->dest.ssa, result);
}
}
@@ -4964,7 +4908,7 @@ handle_vs_input_decl(struct nir_to_llvm_context *ctx,
for (unsigned chan = 0; chan < 4; chan++) {
LLVMValueRef llvm_chan = LLVMConstInt(ctx->i32, chan, false);
ctx->inputs[radeon_llvm_reg_index_soa(idx, chan)] =
- to_integer(&ctx->ac, LLVMBuildExtractElement(ctx->builder,
+ ac_to_integer(&ctx->ac, LLVMBuildExtractElement(ctx->builder,
input, llvm_chan, ""));
}
}
@@ -5320,7 +5264,7 @@ setup_shared(struct ac_nir_context *ctx,
static LLVMValueRef
emit_float_saturate(struct ac_llvm_context *ctx, LLVMValueRef v, float lo, float hi)
{
- v = to_float(ctx, v);
+ v = ac_to_float(ctx, v);
v = emit_intrin_2f_param(ctx, "llvm.maxnum.f32", ctx->f32, v, LLVMConstReal(ctx->f32, lo));
return emit_intrin_2f_param(ctx, "llvm.minnum.f32", ctx->f32, v, LLVMConstReal(ctx->f32, hi));
}
@@ -5454,7 +5398,7 @@ si_llvm_init_export_args(struct nir_to_llvm_context *ctx,
LLVMValueRef max_alpha = !is_int10 ? max_rgb : LLVMConstInt(ctx->i32, 3, 0);

for (unsigned chan = 0; chan < 4; chan++) {
- val[chan] = to_integer(&ctx->ac, values[chan]);
+ val[chan] = ac_to_integer(&ctx->ac, values[chan]);
val[chan] = emit_minmax_int(&ctx->ac, LLVMIntULT, val[chan], chan == 3 ? max_alpha : max_rgb);
}

@@ -5474,7 +5418,7 @@ si_llvm_init_export_args(struct nir_to_llvm_context *ctx,

/* Clamp. */
for (unsigned chan = 0; chan < 4; chan++) {
- val[chan] = to_integer(&ctx->ac, values[chan]);
+ val[chan] = ac_to_integer(&ctx->ac, values[chan]);
val[chan] = emit_minmax_int(&ctx->ac, LLVMIntSLT, val[chan], chan == 3 ? max_alpha : max_rgb);
val[chan] = emit_minmax_int(&ctx->ac, LLVMIntSGT, val[chan], chan == 3 ? min_alpha : min_rgb);
}
@@ -5494,7 +5438,7 @@ si_llvm_init_export_args(struct nir_to_llvm_context *ctx,
memcpy(&args->out[0], values, sizeof(values[0]) * 4);

for (unsigned i = 0; i < 4; ++i)
- args->out[i] = to_float(&ctx->ac, args->out[i]);
+ args->out[i] = ac_to_float(&ctx->ac, args->out[i]);
}

static void
@@ -5521,7 +5465,7 @@ handle_vs_outputs_post(struct nir_to_llvm_context *ctx,

i = VARYING_SLOT_CLIP_DIST0;
for (j = 0; j < ctx->num_output_clips + ctx->num_output_culls; j++)
- slots[j] = to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
+ slots[j] = ac_to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
ctx->nir->outputs[radeon_llvm_reg_index_soa(i, j)], ""));

for (i = ctx->num_output_clips + ctx->num_output_culls; i < 8; i++)
@@ -5611,8 +5555,8 @@ handle_vs_outputs_post(struct nir_to_llvm_context *ctx,
continue;

for (unsigned j = 0; j < 4; j++)
- values[j] = to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
- ctx->nir->outputs[radeon_llvm_reg_index_soa(i, j)], ""));
+ values[j] = ac_to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
+ ctx->nir->outputs[radeon_llvm_reg_index_soa(i, j)], ""));

if (i == VARYING_SLOT_LAYER) {
target = V_008DFC_SQ_EXP_PARAM + param_count;
@@ -6040,20 +5984,20 @@ handle_fs_outputs_post(struct nir_to_llvm_context *ctx)

if (i == FRAG_RESULT_DEPTH) {
ctx->shader_info->fs.writes_z = true;
- depth = to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
+ depth = ac_to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
ctx->nir->outputs[radeon_llvm_reg_index_soa(i, 0)], ""));
} else if (i == FRAG_RESULT_STENCIL) {
ctx->shader_info->fs.writes_stencil = true;
- stencil = to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
+ stencil = ac_to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
ctx->nir->outputs[radeon_llvm_reg_index_soa(i, 0)], ""));
} else if (i == FRAG_RESULT_SAMPLE_MASK) {
ctx->shader_info->fs.writes_sample_mask = true;
- samplemask = to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
+ samplemask = ac_to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
ctx->nir->outputs[radeon_llvm_reg_index_soa(i, 0)], ""));
} else {
bool last = false;
for (unsigned j = 0; j < 4; j++)
- values[j] = to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
+ values[j] = ac_to_float(&ctx->ac, LLVMBuildLoad(ctx->builder,
ctx->nir->outputs[radeon_llvm_reg_index_soa(i, j)], ""));

if (!ctx->shader_info->fs.writes_z && !ctx->shader_info->fs.writes_stencil && !ctx->shader_info->fs.writes_sample_mask)
@@ -6597,7 +6541,7 @@ ac_gs_copy_shader_emit(struct nir_to_llvm_context *ctx)
AC_FUNC_ATTR_LEGACY);

LLVMBuildStore(ctx->builder,
- to_float(&ctx->ac, value), ctx->nir->outputs[radeon_llvm_reg_index_soa(i, j)]);
+ ac_to_float(&ctx->ac, value), ctx->nir->outputs[radeon_llvm_reg_index_soa(i, j)]);
}
idx += slot_inc;
}
--
2.9.4
Connor Abbott
2017-08-08 01:32:37 UTC
Permalink
From: Connor Abbott <***@gmail.com>

---
src/amd/common/ac_llvm_build.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/src/amd/common/ac_llvm_build.c b/src/amd/common/ac_llvm_build.c
index a10ec51..d4b48d1 100644
--- a/src/amd/common/ac_llvm_build.c
+++ b/src/amd/common/ac_llvm_build.c
@@ -99,6 +99,7 @@ ac_get_type_size(LLVMTypeRef type)
return LLVMGetIntTypeWidth(type) / 8;
case LLVMFloatTypeKind:
return 4;
+ case LLVMDoubleTypeKind:
case LLVMPointerTypeKind:
return 8;
case LLVMVectorTypeKind:
--
2.9.4
Connor Abbott
2017-08-08 01:32:40 UTC
Permalink
From: Connor Abbott <***@gmail.com>

---
src/amd/vulkan/radv_device.c | 15 +++++++++++++++
src/amd/vulkan/radv_pipeline.c | 4 ++++
2 files changed, 19 insertions(+)

diff --git a/src/amd/vulkan/radv_device.c b/src/amd/vulkan/radv_device.c
index 21f2437..2255ff8 100644
--- a/src/amd/vulkan/radv_device.c
+++ b/src/amd/vulkan/radv_device.c
@@ -188,6 +188,12 @@ static const VkExtensionProperties ext_sema_device_extensions[] = {
},
};

+static const VkExtensionProperties amd_shader_ballot_extension = {
+ .extensionName = VK_AMD_SHADER_BALLOT_EXTENSION_NAME,
+ .specVersion = 1,
+};
+
+
static VkResult
radv_extensions_register(struct radv_instance *instance,
struct radv_extensions *extensions,
@@ -346,6 +352,15 @@ radv_physical_device_init(struct radv_physical_device *device,
goto fail;
}

+ if (device->rad_info.chip_class >= VI && HAVE_LLVM >= 0x600) {
+ result = radv_extensions_register(instance,
+ &device->extensions,
+ &amd_shader_ballot_extension,
+ 1);
+ if (result != VK_SUCCESS)
+ goto fail;
+ }
+
fprintf(stderr, "WARNING: radv is not a conformant vulkan implementation, testing use only.\n");
device->name = get_chip_name(device->rad_info.family);

diff --git a/src/amd/vulkan/radv_pipeline.c b/src/amd/vulkan/radv_pipeline.c
index 4aecb81..1b9a2b9 100644
--- a/src/amd/vulkan/radv_pipeline.c
+++ b/src/amd/vulkan/radv_pipeline.c
@@ -237,6 +237,10 @@ radv_shader_compile_to_nir(struct radv_device *device,
.variable_pointers = true,
.shader_ballot = true,
.shader_group_vote = true,
+#if HAVE_LLVM >= 0x600
+ .amd_shader_ballot = true,
+ .groups = true,
+#endif
};
entry_point = spirv_to_nir(spirv, module->size / 4,
spec_entries, num_spec_entries,
--
2.9.4
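
As an aside (not part of the series): a quick illustrative check that a test or application could use to confirm the device now advertises the extension before relying on it. This is only a sketch; phys_dev is assumed to be an already-acquired VkPhysicalDevice.

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <vulkan/vulkan.h>

static bool
has_amd_shader_ballot(VkPhysicalDevice phys_dev)
{
	uint32_t count = 0;
	bool found = false;

	/* First call queries the extension count, second call fills them in. */
	vkEnumerateDeviceExtensionProperties(phys_dev, NULL, &count, NULL);
	if (!count)
		return false;

	VkExtensionProperties *props = calloc(count, sizeof(*props));
	if (!props)
		return false;

	vkEnumerateDeviceExtensionProperties(phys_dev, NULL, &count, props);
	for (uint32_t i = 0; i < count; i++) {
		if (!strcmp(props[i].extensionName,
			    VK_AMD_SHADER_BALLOT_EXTENSION_NAME)) {
			found = true;
			break;
		}
	}

	free(props);
	return found;
}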
Connor Abbott
2017-08-08 01:32:38 UTC
Permalink
From: Connor Abbott <***@gmail.com>

Using the new WWM and DPP intrinsics introduced in LLVM 6.0, this adds
everything needed to implement SPV_AMD_shader_ballot, including the
Groups capability, to ac_llvm_build.c. That way, it can be shared by a
potential GL_AMD_shader_ballot implementation in the future. Currently,
the implementation only uses the DPP instructions available on VI+, so
SI and CI won't be able to use the extension, but it should be possible
(albeit a little tricky) to add SI/CI support via ds_swizzle.
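
For reference, here is a minimal usage sketch (not part of the patch) of how the new helpers are meant to be called from the NIR-to-LLVM code. The names value, local_invocation_index and max_workgroup_size stand in for whatever the caller already has, the identity value follows the ones used later in the series, and ac_llvm_build.h is assumed to be included:

static LLVMValueRef
example_workgroup_fadd(struct ac_llvm_context *ac, LLVMValueRef value,
		       LLVMValueRef local_invocation_index,
		       unsigned max_workgroup_size)
{
	/* 0.0 is the identity for fadd, so inactive lanes can't contribute. */
	LLVMValueRef identity = LLVMConstReal(LLVMTypeOf(value), 0.0);

	/* Which wavefront this thread belongs to; the shift by 6 assumes a
	 * 64-thread wavefront.
	 */
	LLVMValueRef wavefront_id =
		LLVMBuildLShr(ac->builder, local_invocation_index,
			      LLVMConstInt(ac->i32, 6, 0), "");

	/* Reduce within each wavefront (set_inactive + DPP + WWM), then
	 * combine the per-wavefront partial results through LDS.  For a
	 * result that only needs to span a single wavefront,
	 * ac_build_subgroup_reduce(ac, value, ac_reduce_fadd, identity)
	 * can be called directly instead.
	 */
	return ac_build_group_reduce(ac, value, ac_reduce_fadd, identity,
				     max_workgroup_size, wavefront_id);
}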
---
src/amd/common/ac_llvm_build.c | 703 +++++++++++++++++++++++++++++++++++++++++
src/amd/common/ac_llvm_build.h | 115 +++++++
2 files changed, 818 insertions(+)

diff --git a/src/amd/common/ac_llvm_build.c b/src/amd/common/ac_llvm_build.c
index d4b48d1..c75bf00 100644
--- a/src/amd/common/ac_llvm_build.c
+++ b/src/amd/common/ac_llvm_build.c
@@ -347,6 +347,709 @@ ac_build_vote_eq(struct ac_llvm_context *ctx, LLVMValueRef value)
return LLVMBuildOr(ctx->builder, all, none, "");
}

+LLVMValueRef ac_reduce_iadd(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs)
+{
+ return LLVMBuildAdd(ctx->builder, lhs, rhs, "");
+}
+
+LLVMValueRef ac_reduce_fadd(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs)
+{
+ return LLVMBuildFAdd(ctx->builder, lhs, rhs, "");
+}
+
+LLVMValueRef ac_reduce_fmin(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs)
+{
+ char name[32], type[8];
+ ac_build_type_name_for_intr(LLVMTypeOf(lhs), type, sizeof(type));
+ snprintf(name, sizeof(name), "llvm.minnum.%s", type);
+ return ac_build_intrinsic(ctx, name, LLVMTypeOf(lhs),
+ (LLVMValueRef []) { lhs, rhs }, 2,
+ AC_FUNC_ATTR_NOUNWIND | AC_FUNC_ATTR_READNONE);
+}
+
+LLVMValueRef ac_reduce_fmax(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs)
+{
+ char name[32], type[8];
+ ac_build_type_name_for_intr(LLVMTypeOf(lhs), type, sizeof(type));
+ snprintf(name, sizeof(name), "llvm.maxnum.%s", type);
+ return ac_build_intrinsic(ctx, name, LLVMTypeOf(lhs),
+ (LLVMValueRef []) { lhs, rhs }, 2,
+ AC_FUNC_ATTR_NOUNWIND | AC_FUNC_ATTR_READNONE);
+}
+
+LLVMValueRef ac_reduce_imin(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs)
+{
+ return LLVMBuildSelect(ctx->builder,
+ LLVMBuildICmp(ctx->builder, LLVMIntSLT,
+ lhs, rhs, ""),
+ lhs, rhs, "");
+}
+
+LLVMValueRef ac_reduce_imax(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs)
+{
+ return LLVMBuildSelect(ctx->builder,
+ LLVMBuildICmp(ctx->builder, LLVMIntSGT,
+ lhs, rhs, ""),
+ lhs, rhs, "");
+}
+
+LLVMValueRef ac_reduce_umin(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs)
+{
+ return LLVMBuildSelect(ctx->builder,
+ LLVMBuildICmp(ctx->builder, LLVMIntULT,
+ lhs, rhs, ""),
+ lhs, rhs, "");
+}
+
+LLVMValueRef ac_reduce_umax(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs)
+{
+ return LLVMBuildSelect(ctx->builder,
+ LLVMBuildICmp(ctx->builder, LLVMIntUGT,
+ lhs, rhs, ""),
+ lhs, rhs, "");
+}
+
+enum dpp_ctrl {
+ _dpp_quad_perm = 0x000,
+ _dpp_row_sl = 0x100,
+ _dpp_row_sr = 0x110,
+ _dpp_row_rr = 0x120,
+ dpp_wf_sl1 = 0x130,
+ dpp_wf_rl1 = 0x134,
+ dpp_wf_sr1 = 0x138,
+ dpp_wf_rr1 = 0x13C,
+ dpp_row_mirror = 0x140,
+ dpp_row_half_mirror = 0x141,
+ dpp_row_bcast15 = 0x142,
+ dpp_row_bcast31 = 0x143
+};
+
+static inline enum dpp_ctrl
+dpp_quad_perm(unsigned lane0, unsigned lane1, unsigned lane2, unsigned lane3)
+{
+ assert(lane0 < 4 && lane1 < 4 && lane2 < 4 && lane3 < 4);
+ return _dpp_quad_perm | lane0 | (lane1 << 2) | (lane2 << 4) | (lane3 << 6);
+}
+
+static inline enum dpp_ctrl
+dpp_row_sl(unsigned amount)
+{
+ assert(amount > 0 && amount < 16);
+ return _dpp_row_sl | amount;
+}
+
+static inline enum dpp_ctrl
+dpp_row_sr(unsigned amount)
+{
+ assert(amount > 0 && amount < 16);
+ return _dpp_row_sr | amount;
+}
+
+static LLVMValueRef
+_ac_build_dpp(struct ac_llvm_context *ctx, LLVMValueRef old, LLVMValueRef src,
+ enum dpp_ctrl dpp_ctrl, unsigned row_mask, unsigned bank_mask,
+ bool bound_ctrl)
+{
+ return ac_build_intrinsic(ctx, "llvm.amdgcn.update.dpp.i32",
+ LLVMTypeOf(old), (LLVMValueRef[]) {
+ old, src,
+ LLVMConstInt(ctx->i32, dpp_ctrl, 0),
+ LLVMConstInt(ctx->i32, row_mask, 0),
+ LLVMConstInt(ctx->i32, bank_mask, 0),
+ LLVMConstInt(ctx->i1, bound_ctrl, 0) },
+ 6, AC_FUNC_ATTR_NOUNWIND | AC_FUNC_ATTR_READNONE |
+ AC_FUNC_ATTR_CONVERGENT);
+}
+
+static LLVMValueRef
+ac_build_dpp(struct ac_llvm_context *ctx, LLVMValueRef old, LLVMValueRef src,
+ enum dpp_ctrl dpp_ctrl, unsigned row_mask, unsigned bank_mask,
+ bool bound_ctrl)
+{
+ LLVMTypeRef src_type = LLVMTypeOf(src);
+ src = ac_to_integer(ctx, src);
+ old = ac_to_integer(ctx, old);
+ unsigned bits = LLVMGetIntTypeWidth(LLVMTypeOf(src));
+ LLVMValueRef ret;
+ if (bits == 32) {
+ ret = _ac_build_dpp(ctx, old, src, dpp_ctrl, row_mask,
+ bank_mask, bound_ctrl);
+ } else {
+ assert(bits % 32 == 0);
+ LLVMTypeRef vec_type = LLVMVectorType(ctx->i32, bits / 32);
+ LLVMValueRef src_vector =
+ LLVMBuildBitCast(ctx->builder, src, vec_type, "");
+ LLVMValueRef old_vector =
+ LLVMBuildBitCast(ctx->builder, old, vec_type, "");
+ ret = LLVMGetUndef(vec_type);
+ for (unsigned i = 0; i < bits / 32; i++) {
+ src = LLVMBuildExtractElement(ctx->builder, src_vector,
+ LLVMConstInt(ctx->i32, i,
+ 0), "");
+ old = LLVMBuildExtractElement(ctx->builder, old_vector,
+ LLVMConstInt(ctx->i32, i,
+ 0), "");
+ LLVMValueRef ret_comp = _ac_build_dpp(ctx, old, src,
+ dpp_ctrl,
+ row_mask,
+ bank_mask,
+ bound_ctrl);
+ ret = LLVMBuildInsertElement(ctx->builder, ret,
+ ret_comp,
+ LLVMConstInt(ctx->i32, i,
+ 0), "");
+ }
+ }
+ return LLVMBuildBitCast(ctx->builder, ret, src_type, "");
+}
+
+static LLVMValueRef
+_ac_build_readlane(struct ac_llvm_context *ctx, LLVMValueRef src,
+ LLVMValueRef lane)
+{
+ return ac_build_intrinsic(ctx, "llvm.amdgcn.readlane",
+ LLVMTypeOf(src), (LLVMValueRef []) {
+ src, lane },
+ 2, AC_FUNC_ATTR_NOUNWIND |
+ AC_FUNC_ATTR_READNONE |
+ AC_FUNC_ATTR_CONVERGENT);
+}
+
+static LLVMValueRef
+ac_build_readlane(struct ac_llvm_context *ctx, LLVMValueRef src,
+ LLVMValueRef lane)
+{
+ LLVMTypeRef src_type = LLVMTypeOf(src);
+ src = ac_to_integer(ctx, src);
+ unsigned bits = LLVMGetIntTypeWidth(LLVMTypeOf(src));
+ LLVMValueRef ret;
+ if (bits == 32) {
+ ret = _ac_build_readlane(ctx, src, lane);
+ } else {
+ assert(bits % 32 == 0);
+ LLVMTypeRef vec_type = LLVMVectorType(ctx->i32, bits / 32);
+ LLVMValueRef src_vector =
+ LLVMBuildBitCast(ctx->builder, src, vec_type, "");
+ ret = LLVMGetUndef(vec_type);
+ for (unsigned i = 0; i < bits / 32; i++) {
+ src = LLVMBuildExtractElement(ctx->builder, src_vector,
+ LLVMConstInt(ctx->i32, i,
+ 0), "");
+ LLVMValueRef ret_comp = _ac_build_readlane(ctx, src,
+ lane);
+ ret = LLVMBuildInsertElement(ctx->builder, ret,
+ ret_comp,
+ LLVMConstInt(ctx->i32, i,
+ 0), "");
+ }
+ }
+ return LLVMBuildBitCast(ctx->builder, ret, src_type, "");
+}
+
+static LLVMValueRef
+_ac_build_ds_swizzle(struct ac_llvm_context *ctx, LLVMValueRef src,
+ unsigned mask)
+{
+ return ac_build_intrinsic(ctx, "llvm.amdgcn.ds.swizzle",
+ LLVMTypeOf(src), (LLVMValueRef []) {
+ src, LLVMConstInt(ctx->i32, mask, 0) },
+ 2, AC_FUNC_ATTR_NOUNWIND |
+ AC_FUNC_ATTR_READNONE |
+ AC_FUNC_ATTR_CONVERGENT);
+}
+
+static LLVMValueRef
+ac_build_ds_swizzle(struct ac_llvm_context *ctx, LLVMValueRef src,
+ unsigned mask)
+{
+ LLVMTypeRef src_type = LLVMTypeOf(src);
+ src = ac_to_integer(ctx, src);
+ unsigned bits = LLVMGetIntTypeWidth(LLVMTypeOf(src));
+ LLVMValueRef ret;
+ if (bits == 32) {
+ ret = _ac_build_ds_swizzle(ctx, src, mask);
+ } else {
+ assert(bits % 32 == 0);
+ LLVMTypeRef vec_type = LLVMVectorType(ctx->i32, bits / 32);
+ LLVMValueRef src_vector =
+ LLVMBuildBitCast(ctx->builder, src, vec_type, "");
+ ret = LLVMGetUndef(vec_type);
+ for (unsigned i = 0; i < bits / 32; i++) {
+ src = LLVMBuildExtractElement(ctx->builder, src_vector,
+ LLVMConstInt(ctx->i32, i,
+ 0), "");
+ LLVMValueRef ret_comp = _ac_build_ds_swizzle(ctx, src,
+ mask);
+ ret = LLVMBuildInsertElement(ctx->builder, ret,
+ ret_comp,
+ LLVMConstInt(ctx->i32, i,
+ 0), "");
+ }
+ }
+ return LLVMBuildBitCast(ctx->builder, ret, src_type, "");
+}
+
+static LLVMValueRef
+ac_build_set_inactive(struct ac_llvm_context *ctx, LLVMValueRef src,
+ LLVMValueRef inactive)
+{
+ char name[32], type[8];
+ LLVMTypeRef src_type = LLVMTypeOf(src);
+ src = ac_to_integer(ctx, src);
+ inactive = ac_to_integer(ctx, inactive);
+ ac_build_type_name_for_intr(LLVMTypeOf(src), type, sizeof(type));
+ snprintf(name, sizeof(name), "llvm.amdgcn.set.inactive.%s", type);
+ LLVMValueRef ret =
+ ac_build_intrinsic(ctx, name,
+ LLVMTypeOf(src), (LLVMValueRef []) {
+ src, inactive }, 2,
+ AC_FUNC_ATTR_NOUNWIND | AC_FUNC_ATTR_READNONE |
+ AC_FUNC_ATTR_CONVERGENT);
+ return LLVMBuildBitCast(ctx->builder, ret, src_type, "");
+}
+
+static LLVMValueRef
+ac_build_wwm(struct ac_llvm_context *ctx, LLVMValueRef src)
+{
+ char name[32], type[8];
+ ac_build_type_name_for_intr(LLVMTypeOf(src), type, sizeof(type));
+ snprintf(name, sizeof(name), "llvm.amdgcn.wwm.%s", type);
+ return ac_build_intrinsic(ctx, name, LLVMTypeOf(src),
+ (LLVMValueRef []) { src }, 1,
+ AC_FUNC_ATTR_NOUNWIND | AC_FUNC_ATTR_READNONE);
+}
+
+LLVMValueRef
+ac_build_subgroup_inclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef src,
+ ac_reduce_op reduce,
+ LLVMValueRef identity)
+{
+ /* See http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
+ *
+ * Note that each dpp/reduce pair is supposed to be compiled down to
+ * one instruction by LLVM, at least for 32-bit values.
+ *
+ * TODO: use @llvm.amdgcn.ds.swizzle on SI and CI
+ */
+ LLVMValueRef value = src;
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(1), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(2), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(3), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(4), 0xf, 0xe, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(8), 0xf, 0xc, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast15, 0xa, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast31, 0xc, 0xf, false));
+ return value;
+}
+
+LLVMValueRef
+ac_build_subgroup_inclusive_scan_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity)
+{
+ ac_build_optimization_barrier(ctx, &value);
+ value = ac_build_set_inactive(ctx, value, identity);
+ value = ac_build_subgroup_inclusive_scan(ctx, value, reduce, identity);
+ return ac_build_wwm(ctx, value);
+}
+
+
+LLVMValueRef
+ac_build_subgroup_reduce(struct ac_llvm_context *ctx, LLVMValueRef value,
+ ac_reduce_op reduce, LLVMValueRef identity)
+{
+
+ value = ac_build_set_inactive(ctx, value, identity);
+ value = ac_build_subgroup_inclusive_scan(ctx, value, reduce, identity);
+ value = ac_build_readlane(ctx, value, LLVMConstInt(ctx->i32, 63, 0));
+ return ac_build_wwm(ctx, value);
+}
+
+LLVMValueRef
+ac_build_subgroup_reduce_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity)
+{
+ ac_build_optimization_barrier(ctx, &value);
+ return ac_build_subgroup_reduce(ctx, value, reduce, identity);
+}
+
+LLVMValueRef
+ac_build_subgroup_exclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity)
+{
+ value = ac_build_dpp(ctx, identity, value, dpp_wf_sr1, 0xf, 0xf, false);
+ return ac_build_subgroup_inclusive_scan(ctx, value, reduce, identity);
+}
+
+LLVMValueRef
+ac_build_subgroup_exclusive_scan_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity)
+{
+ ac_build_optimization_barrier(ctx, &value);
+ value = ac_build_set_inactive(ctx, value, identity);
+ value = ac_build_subgroup_exclusive_scan(ctx, value, reduce, identity);
+ return ac_build_wwm(ctx, value);
+}
+
+LLVMValueRef
+ac_build_swizzle_quad(struct ac_llvm_context *ctx, LLVMValueRef src,
+ unsigned swizzle_mask)
+{
+ ac_build_optimization_barrier(ctx, &src);
+ /* TODO: use @llvm.amdgcn.ds.swizzle on SI and CI */
+ return ac_build_dpp(ctx, LLVMGetUndef(LLVMTypeOf(src)), src,
+ dpp_quad_perm(swizzle_mask & 0x3,
+ (swizzle_mask >> 2) & 0x3,
+ (swizzle_mask >> 4) & 0x3,
+ (swizzle_mask >> 6) & 0x3),
+ 0xf, 0xf, /*bound_ctrl:0*/ true);
+}
+
+LLVMValueRef
+ac_build_swizzle_masked(struct ac_llvm_context *ctx, LLVMValueRef src,
+ unsigned swizzle_mask)
+{
+ ac_build_optimization_barrier(ctx, &src);
+ /* TODO: For some special mask values, we could use DPP instead on VI+.
+ * We might be able to use DPP entirely, but it would be a little
+ * tricky.
+ */
+ return ac_build_ds_swizzle(ctx, src, swizzle_mask);
+}
+
+LLVMValueRef
+ac_build_writelane(struct ac_llvm_context *ctx, LLVMValueRef src,
+ LLVMValueRef write, LLVMValueRef lane)
+{
+ /* TODO: Use the actual instruction when LLVM adds an intrinsic for it.
+ */
+ LLVMValueRef pred = LLVMBuildICmp(ctx->builder, LLVMIntEQ, lane,
+ ac_get_thread_id(ctx), "");
+ return LLVMBuildSelect(ctx->builder, pred, write, src, "");
+}
+
+LLVMValueRef
+ac_build_mbcnt(struct ac_llvm_context *ctx, LLVMValueRef mask)
+{
+ LLVMValueRef mask_vec = LLVMBuildBitCast(ctx->builder, mask,
+ LLVMVectorType(ctx->i32, 2),
+ "");
+ LLVMValueRef mask_lo = LLVMBuildExtractElement(ctx->builder, mask_vec,
+ ctx->i32_0, "");
+ LLVMValueRef mask_hi = LLVMBuildExtractElement(ctx->builder, mask_vec,
+ ctx->i32_1, "");
+ LLVMValueRef val =
+ ac_build_intrinsic(ctx, "llvm.amdgcn.mbcnt.lo", ctx->i32,
+ (LLVMValueRef []) { mask_lo, ctx->i32_0 },
+ 2, AC_FUNC_ATTR_READNONE);
+ val = ac_build_intrinsic(ctx, "llvm.amdgcn.mbcnt.hi", ctx->i32,
+ (LLVMValueRef []) { mask_hi, val },
+ 2, AC_FUNC_ATTR_READNONE);
+ return val;
+}
+
+/* return true for exactly one thread in the subgroup/wavefront */
+
+static LLVMValueRef
+ac_build_subgroup_elect(struct ac_llvm_context *ctx)
+{
+ LLVMValueRef active_set = ac_build_ballot(ctx, ctx->i32_1);
+ /* mbcnt(EXEC) returns the number of active threads with ID less than
+ * ours, so the lowest thread will return 0.
+ */
+ LLVMValueRef active_tid = ac_build_mbcnt(ctx, active_set);
+ return LLVMBuildICmp(ctx->builder, LLVMIntEQ, active_tid, ctx->i32_0,
+ "");
+}
+
+static LLVMValueRef
+ac_build_subgroup_elect_uniform(struct ac_llvm_context *ctx)
+{
+ return LLVMBuildICmp(ctx->builder, LLVMIntEQ, ac_get_thread_id(ctx),
+ ctx->i32_0, "");
+}
+
+#define LOCAL_ADDR_SPACE 3
+
+static LLVMValueRef
+get_shared_temp(struct ac_llvm_context *ctx,
+ LLVMTypeRef type,
+ unsigned max_workgroup_size)
+{
+ /* TODO only make one variable and share it */
+ return LLVMAddGlobalInAddressSpace(
+ ctx->module,
+ LLVMArrayType(type, DIV_ROUND_UP(max_workgroup_size, 64)),
+ "reduce_temp", LOCAL_ADDR_SPACE);
+}
+
+/* given an array of values, emit code to reduce them to a single value using a
+ * given operator. Note that this isn't cross-thread at all; it's just normal
+ * LLVM code.
+ */
+static LLVMValueRef
+reduce_array(struct ac_llvm_context *ctx, LLVMValueRef array,
+ ac_reduce_op reduce)
+{
+ unsigned size = LLVMGetArrayLength(LLVMTypeOf(array));
+ assert(size > 0);
+ if (size == 1)
+ return LLVMBuildExtractValue(ctx->builder, array, 0, "");
+
+ LLVMTypeRef elem_type = LLVMGetElementType(LLVMTypeOf(array));
+
+ unsigned left_size = size / 2;
+ LLVMValueRef left = LLVMGetUndef(LLVMArrayType(elem_type, left_size));
+ for (unsigned i = 0; i < left_size; i++) {
+ LLVMValueRef val = LLVMBuildExtractValue(ctx->builder, array,
+ i, "");
+ left = LLVMBuildInsertValue(ctx->builder, left, val, i, "");
+ }
+ left = reduce_array(ctx, left, reduce);
+
+ unsigned right_size = size - left_size;
+ LLVMValueRef right = LLVMGetUndef(LLVMArrayType(elem_type, right_size));
+ for (unsigned i = 0; i < right_size; i++) {
+ LLVMValueRef val = LLVMBuildExtractValue(ctx->builder, array,
+ i + left_size, "");
+ right = LLVMBuildInsertValue(ctx->builder, right, val, i, "");
+ }
+ right = reduce_array(ctx, right, reduce);
+
+ return reduce(ctx, left, right);
+}
+
+static LLVMValueRef
+_ac_build_group_reduce(struct ac_llvm_context *ctx,
+ LLVMValueRef value, ac_reduce_op reduce,
+ LLVMValueRef identity, bool exclusive_scan,
+ bool uniform,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id)
+{
+ if (max_workgroup_size <= 64) {
+ if (exclusive_scan)
+ return identity;
+ else
+ return value;
+ }
+
+ /* Allocate some temporary storage, one value for each wavefront. */
+ LLVMValueRef shared = get_shared_temp(ctx, LLVMTypeOf(value),
+ max_workgroup_size);
+
+ LLVMValueRef func =
+ LLVMGetBasicBlockParent(LLVMGetInsertBlock(ctx->builder));
+ LLVMBasicBlockRef if_block =
+ LLVMAppendBasicBlockInContext(ctx->context, func, "");
+ LLVMBasicBlockRef merge_block =
+ LLVMAppendBasicBlockInContext(ctx->context, func, "");
+
+ /* gather the subgroup-reduced values from each buffer into shared */
+
+ LLVMBuildCondBr(ctx->builder,
+ (uniform ? ac_build_subgroup_elect_uniform :
+ ac_build_subgroup_elect)(ctx),
+ if_block, merge_block);
+ /* if (subgroup_elect()) */
+ {
+ LLVMPositionBuilderAtEnd(ctx->builder, if_block);
+ LLVMValueRef ptr = ac_build_gep0(ctx, shared, wavefront_id);
+ LLVMBuildStore(ctx->builder, value, ptr);
+ LLVMBuildBr(ctx->builder, merge_block);
+ }
+
+ LLVMPositionBuilderAtEnd(ctx->builder, merge_block);
+
+ ac_build_intrinsic(ctx, "llvm.amdgcn.s.barrier", ctx->voidt, NULL, 0,
+ AC_FUNC_ATTR_CONVERGENT);
+
+ /* For each wavefront, load every other wavefront's values from the
+ * previous stage.
+ */
+ LLVMValueRef array = LLVMBuildLoad(ctx->builder, shared, "");
+
+ if (exclusive_scan) {
+ /* mask out values from wavefronts greater than or equal to
+ * ours, to implement exclusive scan
+ */
+ for (unsigned i = 0; 64 * i < max_workgroup_size; i++) {
+ LLVMValueRef wf_value =
+ LLVMBuildExtractValue(ctx->builder, array, i,
+ "");
+ LLVMValueRef pred =
+ LLVMBuildICmp(ctx->builder, LLVMIntULT,
+ LLVMConstInt(ctx->i32, i, 0),
+ wavefront_id,
+ "");
+ wf_value = LLVMBuildSelect(ctx->builder, pred,
+ wf_value, identity, "");
+ array = LLVMBuildInsertValue(ctx->builder, array,
+ wf_value, i, "");
+ }
+ }
+
+ /* finally, manually reduce the values from each wavefront without any
+ * cross-thread tricks.
+ */
+ return reduce_array(ctx, array, reduce);
+}
+
+LLVMValueRef
+ac_build_group_reduce(struct ac_llvm_context *ctx,
+ LLVMValueRef value, ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id)
+{
+ value = ac_build_subgroup_reduce(ctx, value, reduce, identity);
+ return _ac_build_group_reduce(ctx, value, reduce, identity, false,
+ true, max_workgroup_size, wavefront_id);
+}
+
+LLVMValueRef
+ac_build_group_reduce_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value, ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id)
+{
+ value = ac_build_subgroup_reduce_nonuniform(ctx, value, reduce,
+ identity);
+ return _ac_build_group_reduce(ctx, value, reduce, identity, false,
+ false, max_workgroup_size, wavefront_id);
+}
+
+LLVMValueRef
+ac_build_group_exclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef value, ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id)
+{
+ /* Do the exclusive scan per-wavefront, and at the same time calculate
+ * the fully-reduced value for doing the overall exclusive scan.
+ */
+ value = ac_build_set_inactive(ctx, value, identity);
+ value = ac_build_subgroup_inclusive_scan(ctx, value, reduce, identity);
+ LLVMValueRef reduced = ac_build_readlane(ctx, value,
+ LLVMConstInt(ctx->i32, 63,
+ 0));
+ value = ac_build_dpp(ctx, identity, value, dpp_wf_sr1, 0xf, 0xf,
+ false);
+ reduced = ac_build_wwm(ctx, reduced);
+ value = ac_build_wwm(ctx, value);
+ reduced = _ac_build_group_reduce(ctx, reduced, reduce, identity, true,
+ true, max_workgroup_size,
+ wavefront_id);
+ return reduce(ctx, value, reduced);
+}
+
+LLVMValueRef
+ac_build_group_exclusive_scan_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id)
+{
+ ac_build_optimization_barrier(ctx, &value);
+ /* Do the exclusive scan per-wavefront, and at the same time calculate
+ * the fully-reduced value for doing the overall exclusive scan.
+ */
+ value = ac_build_set_inactive(ctx, value, identity);
+ value = ac_build_subgroup_inclusive_scan(ctx, value, reduce, identity);
+ LLVMValueRef reduced = ac_build_readlane(ctx, value,
+ LLVMConstInt(ctx->i32, 63,
+ 0));
+ value = ac_build_dpp(ctx, identity, value, dpp_wf_sr1, 0xf, 0xf,
+ false);
+ reduced = ac_build_wwm(ctx, reduced);
+ value = ac_build_wwm(ctx, value);
+ reduced = _ac_build_group_reduce(ctx, reduced, reduce, identity, true,
+ false, max_workgroup_size,
+ wavefront_id);
+ return reduce(ctx, value, reduced);
+}
+
+LLVMValueRef
+ac_build_group_inclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef value, ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id)
+{
+ /* Do the inclusive scan per-wavefront, and at the same time calculate
+ * the fully-reduced value for doing the overall exclusive scan.
+ */
+ value = ac_build_set_inactive(ctx, value, identity);
+ value = ac_build_subgroup_inclusive_scan(ctx, value, reduce, identity);
+ LLVMValueRef reduced = ac_build_readlane(ctx, value,
+ LLVMConstInt(ctx->i32, 63,
+ 0));
+ reduced = ac_build_wwm(ctx, reduced);
+ value = ac_build_wwm(ctx, value);
+ reduced = _ac_build_group_reduce(ctx, reduced, reduce, identity, true,
+ true, max_workgroup_size,
+ wavefront_id);
+ return reduce(ctx, value, reduced);
+}
+
+LLVMValueRef
+ac_build_group_inclusive_scan_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id)
+{
+ ac_build_optimization_barrier(ctx, &value);
+ /* Do the inclusive scan per-wavefront, and at the same time calculate
+ * the fully-reduced value for doing the overall exclusive scan.
+ */
+ value = ac_build_set_inactive(ctx, value, identity);
+ value = ac_build_subgroup_inclusive_scan(ctx, value, reduce, identity);
+ LLVMValueRef reduced = ac_build_readlane(ctx, value,
+ LLVMConstInt(ctx->i32, 63,
+ 0));
+ reduced = ac_build_wwm(ctx, reduced);
+ value = ac_build_wwm(ctx, value);
+ reduced = _ac_build_group_reduce(ctx, reduced, reduce, identity, true,
+ false, max_workgroup_size,
+ wavefront_id);
+ return reduce(ctx, value, reduced);
+}
+
LLVMValueRef
ac_build_gather_values_extended(struct ac_llvm_context *ctx,
LLVMValueRef *values,
diff --git a/src/amd/common/ac_llvm_build.h b/src/amd/common/ac_llvm_build.h
index 1d9850b..463f3a9 100644
--- a/src/amd/common/ac_llvm_build.h
+++ b/src/amd/common/ac_llvm_build.h
@@ -84,6 +84,19 @@ void ac_build_optimization_barrier(struct ac_llvm_context *ctx,
LLVMValueRef *pvgpr);


+LLVMValueRef
+ac_build_swizzle_quad(struct ac_llvm_context *ctx, LLVMValueRef src,
+ unsigned swizzle_mask);
+
+LLVMValueRef
+ac_build_swizzle_masked(struct ac_llvm_context *ctx, LLVMValueRef src,
+ unsigned swizzle_mask);
+
+LLVMValueRef ac_build_writelane(struct ac_llvm_context *ctx, LLVMValueRef src,
+ LLVMValueRef write, LLVMValueRef lane);
+
+LLVMValueRef ac_build_mbcnt(struct ac_llvm_context *ctx, LLVMValueRef mask);
+
LLVMValueRef ac_build_ballot(struct ac_llvm_context *ctx, LLVMValueRef value);

LLVMValueRef ac_build_vote_all(struct ac_llvm_context *ctx, LLVMValueRef value);
@@ -92,6 +105,108 @@ LLVMValueRef ac_build_vote_any(struct ac_llvm_context *ctx, LLVMValueRef value);

LLVMValueRef ac_build_vote_eq(struct ac_llvm_context *ctx, LLVMValueRef value);

+typedef LLVMValueRef (*ac_reduce_op)(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_reduce_iadd(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_reduce_fadd(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_reduce_fmin(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_reduce_fmax(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_reduce_imax(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_reduce_umax(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_reduce_fmin(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_reduce_imin(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_reduce_umin(struct ac_llvm_context *ctx, LLVMValueRef lhs,
+ LLVMValueRef rhs);
+
+LLVMValueRef ac_build_subgroup_reduce(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity);
+
+LLVMValueRef ac_build_subgroup_inclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity);
+
+LLVMValueRef ac_build_subgroup_exclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity);
+
+LLVMValueRef ac_build_subgroup_reduce_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity);
+
+LLVMValueRef ac_build_subgroup_inclusive_scan_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity);
+
+LLVMValueRef ac_build_subgroup_exclusive_scan_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity);
+
+LLVMValueRef ac_build_group_reduce(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id);
+
+LLVMValueRef ac_build_group_inclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id);
+
+LLVMValueRef ac_build_group_exclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id);
+
+LLVMValueRef ac_build_group_reduce_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id);
+
+LLVMValueRef ac_build_group_inclusive_scan_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id);
+
+LLVMValueRef ac_build_group_exclusive_scan_nonuniform(struct ac_llvm_context *ctx,
+ LLVMValueRef value,
+ ac_reduce_op reduce,
+ LLVMValueRef identity,
+ unsigned max_workgroup_size,
+ LLVMValueRef wavefront_id);
+
LLVMValueRef
ac_build_gather_values_extended(struct ac_llvm_context *ctx,
LLVMValueRef *values,
--
2.9.4
Dave Airlie
2017-10-31 06:08:28 UTC
Permalink
Post by Connor Abbott
+LLVMValueRef
+ac_build_subgroup_inclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef src,
+ ac_reduce_op reduce,
+ LLVMValueRef identity)
+{
+ /* See http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
+ *
+ * Note that each dpp/reduce pair is supposed to be compiled down to
+ * one instruction by LLVM, at least for 32-bit values.
+ *
+ */
+ LLVMValueRef value = src;
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(1), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(2), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(3), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(4), 0xf, 0xe, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(8), 0xf, 0xc, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast15, 0xa, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast31, 0xc, 0xf, false));
BTW, I dumped some shaders from DOOM on -pro; it looks like it ended up with:

1, 0xf, 0xf,
2, 0xf, 0xf,
4, 0xf, 0xf
8, 0xf, 0xf
bcast15 0xa, 0xf
bcast31 0xc, 0xf

It also seems to apply these directly to instructions, like:

/*000000002b80*/ s_nop 0x0
/*000000002b84*/ v_min_u32 v83, v83, v83 row_shr:1 bank_mask:15 row_mask:15
/*000000002b8c*/ s_nop 0x1
/*000000002b90*/ v_min_u32 v83, v83, v83 row_shr:2 bank_mask:15 row_mask:15
/*000000002b98*/ s_nop 0x1
/*000000002b9c*/ v_min_u32 v83, v83, v83 row_shr:4 bank_mask:15 row_mask:15
/*000000002ba4*/ s_nop 0x1
/*000000002ba8*/ v_min_u32 v83, v83, v83 row_shr:8 bank_mask:15 row_mask:15
/*000000002bb0*/ s_nop 0x1
/*000000002bb4*/ v_min_u32 v83, v83, v83 row_bcast15 bank_mask:15 row_mask:10
/*000000002bbc*/ s_nop 0x1
/*000000002bc0*/ v_min_u32 v83, v83, v83 row_bcast31 bank_mask:15 row_mask:12

I think the instruction combining is probably an LLVM job, but I wonder
if the different row_shr sequence etc. is what we should use as well.

Dave.
Connor Abbott
2017-10-31 15:36:18 UTC
Permalink
Post by Dave Airlie
Post by Connor Abbott
+LLVMValueRef
+ac_build_subgroup_inclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef src,
+ ac_reduce_op reduce,
+ LLVMValueRef identity)
+{
+ /* See http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
+ *
+ * Note that each dpp/reduce pair is supposed to be compiled down to
+ * one instruction by LLVM, at least for 32-bit values.
+ *
+ */
+ LLVMValueRef value = src;
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(1), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(2), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(3), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(4), 0xf, 0xe, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(8), 0xf, 0xc, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast15, 0xa, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast31, 0xc, 0xf, false));
btw I dumped some shaders from doom on pro,
it looked like it ended up with
1, 0xf, 0xf,
2, 0xf, 0xf,
4, 0xf, 0xf
8, 0xf, 0xf
bcast15 0xa, 0xf
bcast31 0xc, 0xf
It also seems to apply these direct to instructions like
/*000000002b80*/ s_nop 0x0
/*000000002b84*/ v_min_u32 v83, v83, v83 row_shr:1 bank_mask:15
row_mask:15
/*000000002b8c*/ s_nop 0x1
/*000000002b90*/ v_min_u32 v83, v83, v83 row_shr:2 bank_mask:15
row_mask:15
/*000000002b98*/ s_nop 0x1
/*000000002b9c*/ v_min_u32 v83, v83, v83 row_shr:4 bank_mask:15
row_mask:15
/*000000002ba4*/ s_nop 0x1
/*000000002ba8*/ v_min_u32 v83, v83, v83 row_shr:8 bank_mask:15
row_mask:15
/*000000002bb0*/ s_nop 0x1
/*000000002bb4*/ v_min_u32 v83, v83, v83 row_bcast15
bank_mask:15 row_mask:10
/*000000002bbc*/ s_nop 0x1
/*000000002bc0*/ v_min_u32 v83, v83, v83 row_bcast31
bank_mask:15 row_mask:12
I think the instruction combining is probably an llvm job, but I
wonder if the different row_shr
etc is what we should use as well.
Yeah, LLVM should be combining the move and min -- hence the comment
here -- but it isn't yet. That shouldn't be too hard to do once we get
it working. Also, I've seen that way of doing it before, and IIRC it's
one instruction slower than the sequence in the blog post I cited: even
though it uses one less instruction, there's an extra two-cycle stall
between the first two instructions, since v83 is both the destination
of the first instruction and the DPP source of the second (hence the
s_nop 0x1). So once we combine instructions, this should be better than
what -pro does :)
Post by Dave Airlie
Dave.
Nicolai Hähnle
2017-11-02 16:10:22 UTC
Permalink
Post by Connor Abbott
Post by Dave Airlie
Post by Connor Abbott
+LLVMValueRef
+ac_build_subgroup_inclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef src,
+ ac_reduce_op reduce,
+ LLVMValueRef identity)
+{
+ /* See http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
+ *
+ * Note that each dpp/reduce pair is supposed to be compiled down to
+ * one instruction by LLVM, at least for 32-bit values.
+ *
+ */
+ LLVMValueRef value = src;
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(1), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(2), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(3), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(4), 0xf, 0xe, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(8), 0xf, 0xc, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast15, 0xa, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast31, 0xc, 0xf, false));
btw I dumped some shaders from doom on pro,
it looked like it ended up with
1, 0xf, 0xf,
2, 0xf, 0xf,
4, 0xf, 0xf
8, 0xf, 0xf
bcast15 0xa, 0xf
bcast31 0xc, 0xf
It also seems to apply these direct to instructions like
/*000000002b80*/ s_nop 0x0
/*000000002b84*/ v_min_u32 v83, v83, v83 row_shr:1 bank_mask:15
row_mask:15
/*000000002b8c*/ s_nop 0x1
/*000000002b90*/ v_min_u32 v83, v83, v83 row_shr:2 bank_mask:15
row_mask:15
/*000000002b98*/ s_nop 0x1
/*000000002b9c*/ v_min_u32 v83, v83, v83 row_shr:4 bank_mask:15
row_mask:15
/*000000002ba4*/ s_nop 0x1
/*000000002ba8*/ v_min_u32 v83, v83, v83 row_shr:8 bank_mask:15
row_mask:15
/*000000002bb0*/ s_nop 0x1
/*000000002bb4*/ v_min_u32 v83, v83, v83 row_bcast15
bank_mask:15 row_mask:10
/*000000002bbc*/ s_nop 0x1
/*000000002bc0*/ v_min_u32 v83, v83, v83 row_bcast31
bank_mask:15 row_mask:12
I think the instruction combining is probably an llvm job, but I
wonder if the different row_shr
etc is what we should use as well.
Yeah, LLVM should be combining the move and min -- hence the comment
here -- but it isn't yet. That shouldn't be too hard to do once we get
it working. Also, I've seen that way of doing it before, and IIRC it's
one instruction slower than the sequence in the blog post I cited,
since even though there's one less instruction, there's an extra
two-cycle stall between the first two instructions since v83 is the
destination of the first instruction and DPP source of the second
(hence the s_nop 0x1). So once we combine instructions this should be
better than what -pro does :)
Agreed, though even more ideally, LLVM would be able to fill those gaps
with other instructions ;)

Anyway, the combining of instructions is really the important task.

Cheers,
Nicolai
Post by Connor Abbott
Post by Dave Airlie
Dave.
--
Learn what the world really is,
But never forget what it ought to be.
Connor Abbott
2017-11-02 17:24:10 UTC
Permalink
Post by Connor Abbott
Post by Dave Airlie
Post by Connor Abbott
+LLVMValueRef
+ac_build_subgroup_inclusive_scan(struct ac_llvm_context *ctx,
+ LLVMValueRef src,
+ ac_reduce_op reduce,
+ LLVMValueRef identity)
+{
+ /* See
http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
+ *
+ * Note that each dpp/reduce pair is supposed to be compiled down to
+ * one instruction by LLVM, at least for 32-bit values.
+ *
+ */
+ LLVMValueRef value = src;
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(1), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(2), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, src,
+ dpp_row_sr(3), 0xf, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(4), 0xf, 0xe, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_sr(8), 0xf, 0xc, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast15, 0xa, 0xf, false));
+ value = reduce(ctx, value,
+ ac_build_dpp(ctx, identity, value,
+ dpp_row_bcast31, 0xc, 0xf, false));
btw I dumped some shaders from doom on pro,
it looked like it ended up with
1, 0xf, 0xf,
2, 0xf, 0xf,
4, 0xf, 0xf
8, 0xf, 0xf
bcast15 0xa, 0xf
bcast31 0xc, 0xf
It also seems to apply these direct to instructions like
/*000000002b80*/ s_nop 0x0
/*000000002b84*/ v_min_u32 v83, v83, v83 row_shr:1 bank_mask:15
row_mask:15
/*000000002b8c*/ s_nop 0x1
/*000000002b90*/ v_min_u32 v83, v83, v83 row_shr:2 bank_mask:15
row_mask:15
/*000000002b98*/ s_nop 0x1
/*000000002b9c*/ v_min_u32 v83, v83, v83 row_shr:4 bank_mask:15
row_mask:15
/*000000002ba4*/ s_nop 0x1
/*000000002ba8*/ v_min_u32 v83, v83, v83 row_shr:8 bank_mask:15
row_mask:15
/*000000002bb0*/ s_nop 0x1
/*000000002bb4*/ v_min_u32 v83, v83, v83 row_bcast15
bank_mask:15 row_mask:10
/*000000002bbc*/ s_nop 0x1
/*000000002bc0*/ v_min_u32 v83, v83, v83 row_bcast31
bank_mask:15 row_mask:12
I think the instruction combining is probably an llvm job, but I
wonder if the different row_shr
etc is what we should use as well.
Yeah, LLVM should be combining the move and min -- hence the comment
here -- but it isn't yet. That shouldn't be too hard to do once we get
it working. Also, I've seen that way of doing it before, and IIRC it's
one instruction slower than the sequence in the blog post I cited,
since even though there's one less instruction, there's an extra
two-cycle stall between the first two instructions since v83 is the
destination of the first instruction and DPP source of the second
(hence the s_nop 0x1). So once we combine instructions this should be
better than what -pro does :)
Agreed, though even more ideally, LLVM would be able to fill those gaps with
other instructions ;)
Well, that isn't really possible when the sequence is in WWM and
everything else isn't. We could fill the slot with a scalar
instruction, but I think LLVM is currently overly conservative and
treats instructions writing EXEC as barriers even though it doesn't
need to.
Anyway, the combining of instructions is really the important task.
Agreed. Although I think getting it working first is even more important :)
Cheers,
Nicolai
Post by Connor Abbott
Post by Dave Airlie
Dave.
--
Learn what the world is really like,
but never forget what it ought to be.
Connor Abbott
2017-08-08 01:32:39 UTC
Permalink
From: Connor Abbott <***@gmail.com>

---
src/amd/common/ac_nir_to_llvm.c | 72 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)

diff --git a/src/amd/common/ac_nir_to_llvm.c b/src/amd/common/ac_nir_to_llvm.c
index b39b873..bafe4d3 100644
--- a/src/amd/common/ac_nir_to_llvm.c
+++ b/src/amd/common/ac_nir_to_llvm.c
@@ -4142,6 +4142,78 @@ static void visit_intrinsic(struct ac_nir_context *ctx,
get_src(ctx, instr->src[0])),
ctx->ac.i32, "");
break;
+
+#define __REDUCE(op, identity, type, nir_suffix, reduce_type) \
+ case nir_intrinsic_subgroup_##op##nir_suffix: { \
+ LLVMValueRef src = ac_to_##type(&ctx->ac, \
+ get_src(ctx, instr->src[0])); \
+ LLVMTypeRef src_type = LLVMTypeOf(src); \
+ result = ac_build_subgroup_##reduce_type( \
+ &ctx->ac, src, \
+ ac_reduce_##op, identity); \
+ break; \
+ } \
+ case nir_intrinsic_group_##op##nir_suffix: { \
+ LLVMValueRef src = ac_to_##type(&ctx->ac, \
+ get_src(ctx, instr->src[0])); \
+ LLVMTypeRef src_type = LLVMTypeOf(src); \
+ LLVMValueRef wavefront_id = \
+ LLVMBuildLShr(ctx->ac.builder, \
+ visit_load_local_invocation_index(ctx->nctx), \
+ LLVMConstInt(ctx->ac.i32, 6, 0), ""); \
+ result = ac_build_group_##reduce_type( \
+ &ctx->ac, src, \
+ ac_reduce_##op, identity, \
+ ctx->nctx->max_workgroup_size, \
+ wavefront_id); \
+ break; \
+ } \
+
+#define REDUCE(op, identity, type) \
+ __REDUCE(op, identity, type, , reduce) \
+ __REDUCE(op, identity, type, _inclusive_scan, inclusive_scan) \
+ __REDUCE(op, identity, type, _exclusive_scan, exclusive_scan) \
+ __REDUCE(op, identity, type, _nonuniform, reduce_nonuniform) \
+ __REDUCE(op, identity, type, _inclusive_scan_nonuniform, \
+ inclusive_scan_nonuniform) \
+ __REDUCE(op, identity, type, _exclusive_scan_nonuniform, \
+ exclusive_scan_nonuniform) \
+
+ REDUCE(fadd, LLVMConstReal(src_type, 0), float)
+ REDUCE(iadd, LLVMConstInt(src_type, 0, 0), integer)
+ REDUCE(fmin, LLVMConstReal(src_type, INFINITY), float)
+ REDUCE(imin, LLVMConstInt(src_type, LLVMGetIntTypeWidth(src_type) == 64
+ ? INT64_MAX : INT32_MAX, 0), integer)
+ REDUCE(umin, LLVMConstInt(src_type, LLVMGetIntTypeWidth(src_type) == 64
+ ? UINT64_MAX : UINT32_MAX, 0), integer)
+ REDUCE(fmax, LLVMConstReal(src_type, -INFINITY), float)
+ REDUCE(imax, LLVMConstInt(src_type, LLVMGetIntTypeWidth(src_type) == 64
+ ? INT64_MIN : INT32_MIN, 0), integer)
+ REDUCE(umax, LLVMConstInt(src_type, 0, 0), integer)
+
+ case nir_intrinsic_quad_swizzle_amd:
+ result = ac_build_swizzle_quad(&ctx->ac,
+ get_src(ctx, instr->src[0]),
+ instr->const_index[0]);
+ break;
+
+ case nir_intrinsic_masked_swizzle_amd:
+ result = ac_build_swizzle_masked(&ctx->ac,
+ get_src(ctx, instr->src[0]),
+ instr->const_index[0]);
+ break;
+
+ case nir_intrinsic_write_invocation:
+ result = ac_build_writelane(&ctx->ac,
+ get_src(ctx, instr->src[0]),
+ get_src(ctx, instr->src[1]),
+ get_src(ctx, instr->src[2]));
+ break;
+
+ case nir_intrinsic_mbcnt_amd:
+ result = ac_build_mbcnt(&ctx->ac, get_src(ctx, instr->src[0]));
+ break;
+
default:
fprintf(stderr, "Unknown intrinsic: ");
nir_print_instr(&instr->instr, stderr);
--
2.9.4
Connor Abbott
2017-08-08 01:32:36 UTC
Permalink
From: Connor Abbott <***@gmail.com>

ac_to_float() does a superset of what bitcast_to_float() does.
---
src/amd/common/ac_llvm_build.c | 18 ++----------------
1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/src/amd/common/ac_llvm_build.c b/src/amd/common/ac_llvm_build.c
index 2fdb3e8..a10ec51 100644
--- a/src/amd/common/ac_llvm_build.c
+++ b/src/amd/common/ac_llvm_build.c
@@ -211,20 +211,6 @@ ac_build_intrinsic(struct ac_llvm_context *ctx, const char *name,
return call;
}

-static LLVMValueRef bitcast_to_float(struct ac_llvm_context *ctx,
- LLVMValueRef value)
-{
- LLVMTypeRef type = LLVMTypeOf(value);
- LLVMTypeRef new_type;
-
- if (LLVMGetTypeKind(type) == LLVMVectorTypeKind)
- new_type = LLVMVectorType(ctx->f32, LLVMGetVectorSize(type));
- else
- new_type = ctx->f32;
-
- return LLVMBuildBitCast(ctx->builder, value, new_type, "");
-}
-
/**
* Given the i32 or vNi32 \p type, generate the textual name (e.g. for use with
* intrinsic names).
@@ -760,7 +746,7 @@ ac_build_buffer_store_dword(struct ac_llvm_context *ctx,
offset = LLVMBuildAdd(ctx->builder, offset, voffset, "");

LLVMValueRef args[] = {
- bitcast_to_float(ctx, vdata),
+ ac_to_float(ctx, vdata),
LLVMBuildBitCast(ctx->builder, rsrc, ctx->v4i32, ""),
LLVMConstInt(ctx->i32, 0, 0),
offset,
@@ -1217,7 +1203,7 @@ LLVMValueRef ac_build_image_opcode(struct ac_llvm_context *ctx,
a->opcode == ac_image_get_lod;

if (sample)
- args[num_args++] = bitcast_to_float(ctx, a->addr);
+ args[num_args++] = ac_to_float(ctx, a->addr);
else
args[num_args++] = a->addr;
--
2.9.4
Connor Abbott
2017-08-08 01:45:18 UTC
Permalink
From: Connor Abbott <***@gmail.com>

The .f32 was already getting added by emit_intrin_2f_param(). Noticed
when enabling LLVM module verification.
---
src/amd/common/ac_nir_to_llvm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/amd/common/ac_nir_to_llvm.c b/src/amd/common/ac_nir_to_llvm.c
index bafe4d3..46e15c9 100644
--- a/src/amd/common/ac_nir_to_llvm.c
+++ b/src/amd/common/ac_nir_to_llvm.c
@@ -5337,8 +5337,8 @@ static LLVMValueRef
emit_float_saturate(struct ac_llvm_context *ctx, LLVMValueRef v, float lo, float hi)
{
v = ac_to_float(ctx, v);
- v = emit_intrin_2f_param(ctx, "llvm.maxnum.f32", ctx->f32, v, LLVMConstReal(ctx->f32, lo));
- return emit_intrin_2f_param(ctx, "llvm.minnum.f32", ctx->f32, v, LLVMConstReal(ctx->f32, hi));
+ v = emit_intrin_2f_param(ctx, "llvm.maxnum", ctx->f32, v, LLVMConstReal(ctx->f32, lo));
+ return emit_intrin_2f_param(ctx, "llvm.minnum", ctx->f32, v, LLVMConstReal(ctx->f32, hi));
}
--
2.9.4
Bas Nieuwenhuizen
2017-08-08 07:04:20 UTC
Permalink
This patch is

Reviewed-by: Bas Nieuwenhuizen
Post by Connor Abbott
The .f32 was already getting added by emit_intrin_2f_param(). Noticed
when enabling LLVM module verification.
---
src/amd/common/ac_nir_to_llvm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/amd/common/ac_nir_to_llvm.c b/src/amd/common/ac_nir_to_llvm.c
index bafe4d3..46e15c9 100644
--- a/src/amd/common/ac_nir_to_llvm.c
+++ b/src/amd/common/ac_nir_to_llvm.c
@@ -5337,8 +5337,8 @@ static LLVMValueRef
emit_float_saturate(struct ac_llvm_context *ctx, LLVMValueRef v, float lo, float hi)
{
v = ac_to_float(ctx, v);
- v = emit_intrin_2f_param(ctx, "llvm.maxnum.f32", ctx->f32, v, LLVMConstReal(ctx->f32, lo));
- return emit_intrin_2f_param(ctx, "llvm.minnum.f32", ctx->f32, v, LLVMConstReal(ctx->f32, hi));
+ v = emit_intrin_2f_param(ctx, "llvm.maxnum", ctx->f32, v, LLVMConstReal(ctx->f32, lo));
+ return emit_intrin_2f_param(ctx, "llvm.minnum", ctx->f32, v, LLVMConstReal(ctx->f32, hi));
}
--
2.9.4
Marek Olšák
2017-08-11 18:31:05 UTC
Permalink
For patches 9, 10, 11, 15:

Reviewed-by: Marek Olšák <***@amd.com>

Marek
Post by Connor Abbott
The .f32 was already getting added by emit_intrin_2f_param(). Noticed
when enabling LLVM module verification.
---
src/amd/common/ac_nir_to_llvm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/amd/common/ac_nir_to_llvm.c b/src/amd/common/ac_nir_to_llvm.c
index bafe4d3..46e15c9 100644
--- a/src/amd/common/ac_nir_to_llvm.c
+++ b/src/amd/common/ac_nir_to_llvm.c
@@ -5337,8 +5337,8 @@ static LLVMValueRef
emit_float_saturate(struct ac_llvm_context *ctx, LLVMValueRef v, float lo, float hi)
{
v = ac_to_float(ctx, v);
- v = emit_intrin_2f_param(ctx, "llvm.maxnum.f32", ctx->f32, v, LLVMConstReal(ctx->f32, lo));
- return emit_intrin_2f_param(ctx, "llvm.minnum.f32", ctx->f32, v, LLVMConstReal(ctx->f32, hi));
+ v = emit_intrin_2f_param(ctx, "llvm.maxnum", ctx->f32, v, LLVMConstReal(ctx->f32, lo));
+ return emit_intrin_2f_param(ctx, "llvm.minnum", ctx->f32, v, LLVMConstReal(ctx->f32, hi));
}
--
2.9.4
Connor Abbott
2017-08-08 19:18:17 UTC
Permalink
Post by Connor Abbott
This series implements VK_AMD_shader_ballot for radv. This extension
builds on VK_EXT_shader_subgroup_ballot and VK_EXT_shader_subgroup_vote
by adding a number of reductions across a subgroup (or wavefront in AMD
terminology). Previously, shaders had to use shared memory to compute,
say, the average across all threads in a workgroup, or the minimum and
maximum values across a workgroup. But that requires a lot of accesses
to LDS memory, which is (relatively) slow. This extension allows the
shader to do part of the reduction directly in registers, as long as it
stays within a single wavefront, reducing the amount of traffic to the
LDS that has to happen. It also adds a few AMD-specific instructions,
like mbcnt. To get an idea of what exactly is in the extension, and what
inclusive scan, exclusive scan, etc. mean, you can look at the GL
extension which exposes mostly the same things [1].
Why should you care? It turns out that with this extension enabled, plus
a few other AMD-specific extensions that are mostly trivial, DOOM will
take a different path that uses shaders that were tuned specifically for
AMD hardware. I haven't actually tested DOOM yet, since a few more
things need to be wired up, but it's a lot less work than this extension
and I'm sure Dave or Bas will be do it for me when they get around to it
:).
It uses a few new features of the AMDGPU LLVM backend that I just
https://reviews.llvm.org/D34718, so it's going to require LLVM 6.0. It
also uses the DPP modifier that was only added on VI since that was
easier than using ds_swizzle (which is available on all GCN cards). It
should be possible to implement support for older cards using
it might be helpful to look at the LLVM changes that this series uses,
https://reviews.llvm.org/rL310087
https://reviews.llvm.org/rL310088
https://reviews.llvm.org/D34718
in order to get the complete picture.
I've just pushed the last LLVM change required as
https://reviews.llvm.org/rL310399, so this series should now work with
upstream LLVM master.
Post by Connor Abbott
This series depends on my previous series [2] to implement
VK_EXT_shader_subgroup_vote and VK_EXT_shader_subgroup_ballot, if
nothing else in order to be able to test the implementation. I think
DOOM also uses the latter two extensions. I've also based on my series
adding cross-thread semantics to NIR [3], which Jason needs to review,
since I was hoping that would land first, although with a little effort
it should be possible to land this first (it would require changing
git://people.freedesktop.org/~cwabbott0/mesa radv-amd-shader-ballot
and the LLVM branch that I've been using to test, with the one patch
https://github.com/cwabbott0/llvm.git dpp-intrinsics-v4
I've also force-pushed all three Mesa branches (nir-divergence-v4,
radv-shader-ballot-v4, and radv-amd-shader-ballot) with trivial
rebasing after pushing the last patch in this series. I've also pushed
my Crucible tests to

git://people.freedesktop.org/~cwabbott0/crucible amd-shader-ballot

although I haven't yet cleaned things up. At least it'll be useful for
making sure this code still works.
Post by Connor Abbott
I've got some Crucible tests for exercising the various different parts
of the implementation, although I didn't bother to test all the possible
combinations of reductions, since they didn't really require any special
code to implement anyways. I'll try and get that cleaned up and sent out
soon. Maybe I should just push the tests?
Finally, I'm leaving Valve soon (this week) to go back to school, and I
suspect that I won't have too much time to work on this afterwards, so
someone else will probably have to pick it up. I've been working on this
for most of the summer, since it turned out to be a way more complicated
beast to implement than I thought. It's required changes across the
entire stack, from spirv-to-nir all the way down to register allocation
in the LLVM backend. Thankfully, though, most of the tricky LLVM
changes have landed (thanks Nicolai for reviewing!) and what's left is a
lot more straightforward. I should still be around to answer questions,
though. Whew!
[1] https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt
[2] https://lists.freedesktop.org/archives/mesa-dev/2017-August/164903.html
[3] https://lists.freedesktop.org/archives/mesa-dev/2017-August/164898.html
nir: define intrinsics needed for AMD_shader_ballot
spirv: import AMD extensions header
spirv: add plumbing for SPV_AMD_shader_ballot and Groups
nir: rename and generalize nir_lower_read_invocation_to_scalar
nir: scalarize AMD_shader_ballot intrinsics
radv: call nir_lower_cross_thread_to_scalar()
nir: add a lowering pass for some cross-workgroup intrinsics
radv: use nir_lower_group_reduce()
ac: move ac_to_integer() and ac_to_float() to ac_llvm_build.c
ac: remove bitcast_to_float()
ac: fix ac_get_type_size() for doubles
ac: add support for SPV_AMD_shader_ballot
ac/nir: add support for SPV_AMD_shader_ballot
radv: enable VK_AMD_shader_ballot
ac/nir: fix saturate emission
src/amd/common/ac_llvm_build.c | 783 ++++++++++++++++++++-
src/amd/common/ac_llvm_build.h | 120 ++++
src/amd/common/ac_nir_to_llvm.c | 300 ++++----
src/amd/vulkan/radv_device.c | 15 +
src/amd/vulkan/radv_pipeline.c | 6 +
src/compiler/Makefile.sources | 4 +-
src/compiler/nir/nir.h | 11 +-
src/compiler/nir/nir_intrinsics.h | 124 +++-
...scalar.c => nir_lower_cross_thread_to_scalar.c} | 63 +-
src/compiler/nir/nir_lower_group_reduce.c | 179 +++++
src/compiler/nir/nir_print.c | 1 +
src/compiler/spirv/GLSL.ext.AMD.h | 93 +++
src/compiler/spirv/nir_spirv.h | 2 +
src/compiler/spirv/spirv_to_nir.c | 32 +-
src/compiler/spirv/vtn_amd.c | 281 ++++++++
src/compiler/spirv/vtn_private.h | 9 +
src/intel/compiler/brw_nir.c | 2 +-
17 files changed, 1846 insertions(+), 179 deletions(-)
rename src/compiler/nir/{nir_lower_read_invocation_to_scalar.c => nir_lower_cross_thread_to_scalar.c} (56%)
create mode 100644 src/compiler/nir/nir_lower_group_reduce.c
create mode 100644 src/compiler/spirv/GLSL.ext.AMD.h
create mode 100644 src/compiler/spirv/vtn_amd.c
--
2.9.4