Discussion:
[libav-devel] [PATCH 00/12] Various patches for RV40
Christophe GISQUET
2012-04-17 18:44:40 UTC
These patches are getting old, and so am I, so I prefer sending them while
I can still remember why I did some of the things I did in them.

The most controversial ones will probably be around loop filtering. I
suspect that forcing inlining of it only works once the edge/non-edge cases
have been split.

Christophe GISQUET (7):
rv40dsp x86: MMX/MMX2/3DNow/SSE2/SSSE3 implementations of MC
rv34dsp x86: implement MMX2 inverse transform
rv34dsp: factorize a multiplication in the noround inverse transform
rv34dsp: rearrange parameters to loop filter strength functions.
rv40dsp x86: start implementing loop filtering
rv40dsp x86: distinguish non-edge/edge versions.
rv40dsp: always inline inloop filter

Christophe Gisquet (5):
rv40: don't always do the full prev_type search
rv40: cosmetics
rv40: change a logical test into a bitwise one.
rv40: perform bitwise checks in loop filter
rv40dsp: defer pixel access and computations until they are needed

libavcodec/rv34.c | 19 +-
libavcodec/rv34dsp.c | 18 +-
libavcodec/rv34dsp.h | 5 +-
libavcodec/rv40.c | 66 +++---
libavcodec/rv40dsp.c | 78 ++++---
libavcodec/x86/dsputil_mmx.c | 16 ++
libavcodec/x86/dsputil_mmx.h | 5 +
libavcodec/x86/rv34dsp.asm | 77 +++++++
libavcodec/x86/rv34dsp_init.c | 2 +
libavcodec/x86/rv40dsp.asm | 502 ++++++++++++++++++++++++++++++++++++++++-
libavcodec/x86/rv40dsp_init.c | 150 ++++++++++++
11 files changed, 848 insertions(+), 90 deletions(-)
--
1.7.7.msysgit.0
Christophe GISQUET
2012-04-17 18:44:41 UTC
Code mostly inspired by vp8's MC, however:
- its MMX2 horizontal filter is worse because it can't take advantage of
the coefficient redundancy
- that same coefficient redundancy allows better code for non-SSSE3 versions

Benchmark (rounded to the nearest ten):
         V8x8   H8x8   2D8x8   V16x16   H16x16   2D16x16
C         445    358     985     1785     1559      3280
MMX*      219    271     478      714      929      1443
SSE2      131    158     294      425      515       892
SSSE3     120    122     248      387      390       763

The end result is an overall speedup of around 15% for the SSSE3 version
(measured on 6 sequences); the loop filter functions now take around 55% of
decoding time, while the luma MC dsp functions are around 6%, the chroma
ones 1.3% and biweight around 2.3%.
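
About that coefficient redundancy: for every RV40 qpel position, the outer
taps of the 6-tap filter come in mirrored pairs (e.g. 1, -5, 52, 20, -5, 1),
so mirrored samples can be summed before a single multiply. The non-SSSE3
horizontal loop below exploits exactly this:

    paddw    m0, m5          ; src[-2] + src[+3], which share a coefficient
    paddw    m1, m4          ; src[-1] + src[+2], likewise
    pmullw   m0, COEFF05     ; one multiply per mirrored pair
    pmullw   m1, COEFF14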
---
libavcodec/x86/dsputil_mmx.c | 16 ++
libavcodec/x86/dsputil_mmx.h | 5 +
libavcodec/x86/rv40dsp.asm | 303 ++++++++++++++++++++++++++++++++++++++++-
libavcodec/x86/rv40dsp_init.c | 137 +++++++++++++++++++
4 files changed, 458 insertions(+), 3 deletions(-)

diff --git a/libavcodec/x86/dsputil_mmx.c b/libavcodec/x86/dsputil_mmx.c
index 192c5c3..70b9b0b 100644
--- a/libavcodec/x86/dsputil_mmx.c
+++ b/libavcodec/x86/dsputil_mmx.c
@@ -1791,6 +1791,22 @@ QPEL_2TAP(avg_, 16, 3dnow)
QPEL_2TAP(put_, 8, 3dnow)
QPEL_2TAP(avg_, 8, 3dnow)

+void ff_put_rv40_qpel8_mc33_mmx(uint8_t *dst, uint8_t *src, int stride)
+{
+ put_pixels8_xy2_mmx(dst, src, stride, 8);
+}
+void ff_put_rv40_qpel16_mc33_mmx(uint8_t *dst, uint8_t *src, int stride)
+{
+ put_pixels16_xy2_mmx(dst, src, stride, 16);
+}
+void ff_avg_rv40_qpel8_mc33_mmx(uint8_t *dst, uint8_t *src, int stride)
+{
+ avg_pixels8_xy2_mmx(dst, src, stride, 8);
+}
+void ff_avg_rv40_qpel16_mc33_mmx(uint8_t *dst, uint8_t *src, int stride)
+{
+ avg_pixels16_xy2_mmx(dst, src, stride, 16);
+}

#if HAVE_YASM
typedef void emu_edge_core_func(uint8_t *buf, const uint8_t *src,
diff --git a/libavcodec/x86/dsputil_mmx.h b/libavcodec/x86/dsputil_mmx.h
index 097739c..37f4581 100644
--- a/libavcodec/x86/dsputil_mmx.h
+++ b/libavcodec/x86/dsputil_mmx.h
@@ -199,6 +199,11 @@ void ff_avg_cavs_qpel16_mc00_mmx2(uint8_t *dst, uint8_t *src, int stride);
void ff_put_vc1_mspel_mc00_mmx(uint8_t *dst, const uint8_t *src, int stride, int rnd);
void ff_avg_vc1_mspel_mc00_mmx2(uint8_t *dst, const uint8_t *src, int stride, int rnd);

+void ff_put_rv40_qpel8_mc33_mmx(uint8_t *block, uint8_t *pixels, int line_size);
+void ff_put_rv40_qpel16_mc33_mmx(uint8_t *block, uint8_t *pixels, int line_size);
+void ff_avg_rv40_qpel8_mc33_mmx(uint8_t *block, uint8_t *pixels, int line_size);
+void ff_avg_rv40_qpel16_mc33_mmx(uint8_t *block, uint8_t *pixels, int line_size);
+
void ff_mmx_idct(DCTELEM *block);
void ff_mmxext_idct(DCTELEM *block);

diff --git a/libavcodec/x86/rv40dsp.asm b/libavcodec/x86/rv40dsp.asm
index 721d3df..cac78bb 100644
--- a/libavcodec/x86/rv40dsp.asm
+++ b/libavcodec/x86/rv40dsp.asm
@@ -25,11 +25,308 @@
SECTION_RODATA

align 16
-shift_round: times 8 dw 1 << (16 - 6)
-cextern pw_16
+shift5_round: times 8 dw 1 << (16 - 6) ; pw_1024
+shift6_round: times 8 dw 1 << (16 - 7) ; pw_512
+
+sixtap_filter_hb_m: times 8 db 1, -5
+ times 8 db 52, 20
+ ; multiplied by 2 to have the same shift
+ times 8 db 2, -10
+ times 8 db 40, 40
+ ; back to normal
+ times 8 db 1, -5
+ times 8 db 20, 52
+
+sixtap_filter_v_m: times 8 dw 1
+ times 8 dw -5
+ times 8 dw 52
+ times 8 dw 20
+ ; multiplied by 2 to have the same shift
+ times 8 dw 2
+ times 8 dw -10
+ times 8 dw 40
+ times 8 dw 40
+ ; back to normal
+ times 8 dw 1
+ times 8 dw -5
+ times 8 dw 20
+ times 8 dw 52
+
+%ifdef PIC
+%define sixtap_filter_hw picregq
+%define sixtap_filter_hb picregq
+%define sixtap_filter_v picregq
+%define npicregs 1
+%else
+%define sixtap_filter_hw sixtap_filter_hw_m
+%define sixtap_filter_hb sixtap_filter_hb_m
+%define sixtap_filter_v sixtap_filter_v_m
+%define npicregs 0
+%endif
+
+filter_h6_shuf1: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
+filter_h6_shuf2: db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10
+filter_h6_shuf3: db 5, 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11
+
+cextern pw_32
+cextern pw_16

SECTION .text

+;-----------------------------------------------------------------------------
+; subpel MC functions:
+;
+; void put_rv40_qpel_[h|v]_<opt>(uint8_t *dst, int deststride,
+; uint8_t *src, int srcstride,
+; int height, int m);
+;-----------------------------------------------------------------------------
+%macro STORE 3
+%ifidn %3, avg
+ movh %2, [dstq]
+%endif
+ packuswb %1, %1
+%ifidn %3, avg
+%if cpuflag(3dnow)
+ pavgusb %1, %2
+%else
+ pavgb %1, %2
+%endif
+%endif
+ movh [dstq], %1
+%endmacro
+
+%macro FILTER_V 1
+cglobal %1_rv40_qpel_v, 6,6+npicregs,12, dst, dststride, src, srcstride, height, my, picreg
+%ifdef PIC
+ lea picregq, [sixtap_filter_v_m]
+%endif
+ pxor m7, m7
+ lea myd, [sixtap_filter_v+myq]
+
+ ; read 5 lines
+ sub srcq, srcstrideq
+ sub srcq, srcstrideq
+ movh m0, [srcq]
+ movh m1, [srcq+srcstrideq]
+ movh m2, [srcq+srcstrideq*2]
+ lea srcq, [srcq+srcstrideq*2]
+ add srcq, srcstrideq
+ movh m3, [srcq]
+ movh m4, [srcq+srcstrideq]
+ punpcklbw m0, m7
+ punpcklbw m1, m7
+ punpcklbw m2, m7
+ punpcklbw m3, m7
+ punpcklbw m4, m7
+
+%ifdef m8
+ mova m8, [myq+ 0]
+ mova m9, [myq+16]
+ mova m10, [myq+32]
+ mova m11, [myq+48]
+%define COEFF05 m8
+%define COEFF14 m9
+%define COEFF2 m10
+%define COEFF3 m11
+%else
+%define COEFF05 [myq+ 0]
+%define COEFF14 [myq+16]
+%define COEFF2 [myq+32]
+%define COEFF3 [myq+48]
+%endif
+.nextrow:
+ mova m6, m1
+ movh m5, [srcq+2*srcstrideq] ; read new row
+ paddw m6, m4
+ punpcklbw m5, m7
+ pmullw m6, COEFF14
+ paddw m0, m5
+ pmullw m0, COEFF05
+ paddw m6, m0
+ mova m0, m1
+ paddw m6, [pw_32]
+ mova m1, m2
+ pmullw m2, COEFF2
+ paddw m6, m2
+ mova m2, m3
+ pmullw m3, COEFF3
+ paddw m6, m3
+
+ ; round/clip/store
+ mova m3, m4
+ psraw m6, 6
+ mova m4, m5
+ STORE m6, m5, %1
+
+ ; go to next line
+ add dstq, dststrideq
+ add srcq, srcstrideq
+ dec heightd ; next row
+ jg .nextrow
+ REP_RET
+%endmacro
+
+%macro FILTER_H 1
+cglobal %1_rv40_qpel_h, 6, 6+npicregs, 12, dst, dststride, src, srcstride, height, mx, picreg
+%ifdef PIC
+ lea picregq, [sixtap_filter_v_m]
+%endif
+ pxor m7, m7
+ lea mxd, [sixtap_filter_v+mxq]
+ mova m6, [pw_32]
+%ifdef m8
+ mova m8, [mxq+ 0]
+ mova m9, [mxq+16]
+ mova m10, [mxq+32]
+ mova m11, [mxq+48]
+%define COEFF05 m8
+%define COEFF14 m9
+%define COEFF2 m10
+%define COEFF3 m11
+%else
+%define COEFF05 [mxq+ 0]
+%define COEFF14 [mxq+16]
+%define COEFF2 [mxq+32]
+%define COEFF3 [mxq+48]
+%endif
+.nextrow:
+ movq m0, [srcq-2]
+ movq m5, [srcq+3]
+ movq m1, [srcq-1]
+ movq m4, [srcq+2]
+ punpcklbw m0, m7
+ punpcklbw m5, m7
+ punpcklbw m1, m7
+ punpcklbw m4, m7
+ movq m2, [srcq-0]
+ movq m3, [srcq+1]
+ paddw m0, m5
+ paddw m1, m4
+ punpcklbw m2, m7
+ punpcklbw m3, m7
+ pmullw m0, COEFF05
+ pmullw m1, COEFF14
+ pmullw m2, COEFF2
+ pmullw m3, COEFF3
+ paddw m0, m1
+ paddw m2, m3
+ paddw m0, m6
+ paddw m0, m2
+ psraw m0, 6
+ STORE m0, m1, %1
+
+ ; go to next line
+ add dstq, dststrideq
+ add srcq, srcstrideq
+ dec heightd ; next row
+ jg .nextrow
+ REP_RET
+%endmacro
+
+%if ARCH_X86_32
+INIT_MMX mmx
+FILTER_V put
+FILTER_H put
+
+INIT_MMX mmx2
+FILTER_V avg
+FILTER_H avg
+
+INIT_MMX 3dnow
+FILTER_V avg
+FILTER_H avg
+%endif
+
+INIT_XMM sse2
+FILTER_H put
+FILTER_H avg
+FILTER_V put
+FILTER_V avg
+
+%macro FILTER_SSSE3 1
+cglobal %1_rv40_qpel_v, 6,6+npicregs,8, dst, dststride, src, srcstride, height, my, picreg
+%ifdef PIC
+ lea picregq, [sixtap_filter_hb_m]
+%endif
+
+ ; read 5 lines
+ sub srcq, srcstrideq
+ lea myd, [myq + sixtap_filter_hb]
+ sub srcq, srcstrideq
+ movh m0, [srcq]
+ movh m1, [srcq+srcstrideq]
+ movh m2, [srcq+srcstrideq*2]
+ lea srcq, [srcq+srcstrideq*2]
+ add srcq, srcstrideq
+ mova m5, [myq]
+ movh m3, [srcq]
+ movh m4, [srcq+srcstrideq]
+ lea srcq, [srcq+2*srcstrideq]
+
+.nextrow:
+ mova m6, m2
+ punpcklbw m0, m1
+ punpcklbw m6, m3
+ pmaddubsw m0, m5
+ pmaddubsw m6, [myq+16]
+ movh m7, [srcq] ; read new row
+ paddw m6, m0
+ mova m0, m1
+ mova m1, m2
+ mova m2, m3
+ mova m3, m4
+ mova m4, m7
+ punpcklbw m7, m3
+ pmaddubsw m7, m5
+ paddw m6, m7
+ pmulhrsw m6, [shift6_round]
+ STORE m6, m7, %1
+
+ ; go to next line
+ add dstq, dststrideq
+ add srcq, srcstrideq
+ dec heightd ; next row
+ jg .nextrow
+ REP_RET
+
+cglobal %1_rv40_qpel_h, 6,6+npicregs,8, dst, dststride, src, srcstride, height, mx, picreg
+%ifdef PIC
+ lea picregq, [sixtap_filter_hb_m]
+%endif
+ mova m3, [filter_h6_shuf2]
+ mova m4, [filter_h6_shuf3]
+ lea mxd, [mxq+sixtap_filter_hb]
+ mova m5, [mxq] ; set up 6tap filter in bytes
+ mova m6, [mxq+16]
+ mova m7, [filter_h6_shuf1]
+
+.nextrow:
+ movu m0, [srcq-2]
+ mova m1, m0
+ mova m2, m0
+ pshufb m0, m7
+ pshufb m1, m3
+ pshufb m2, m4
+ pmaddubsw m0, m5
+ pmaddubsw m1, m6
+ pmaddubsw m2, m5
+ paddw m0, m1
+ paddw m0, m2
+ pmulhrsw m0, [shift6_round]
+ STORE m0, m1, %1
+
+ ; go to next line
+ add dstq, dststrideq
+ add srcq, srcstrideq
+ dec heightd ; next row
+ jg .nextrow
+ REP_RET
+%endmacro
+
+INIT_XMM ssse3
+FILTER_SSSE3 put
+FILTER_SSSE3 avg
+
; %1=5bits weights?, %2=dst %3=src1 %4=src3 %5=stride if sse2
%macro RV40_WCORE 4-5
movh m4, [%3 + r6 + 0]
@@ -143,7 +440,7 @@ SECTION .text
%macro RV40_WEIGHT 3
cglobal rv40_weight_func_%1_%2, 6, 7, 8
%if cpuflag(ssse3)
- mova m1, [shift_round]
+ mova m1, [shift5_round]
%else
mova m1, [pw_16]
%endif
diff --git a/libavcodec/x86/rv40dsp_init.c b/libavcodec/x86/rv40dsp_init.c
index df468aa..644ca77 100644
--- a/libavcodec/x86/rv40dsp_init.c
+++ b/libavcodec/x86/rv40dsp_init.c
@@ -22,8 +22,11 @@
/**
* @file
* RV40 decoder motion compensation functions x86-optimised
+ * 2,0 and 0,2 have h264 equivalents.
+ * 3,3 is bugged in the rv40 format and maps to _xy2 version
*/

+#include "libavcodec/x86/dsputil_mmx.h"
#include "libavcodec/rv34dsp.h"

void ff_put_rv40_chroma_mc8_mmx (uint8_t *dst, uint8_t *src,
@@ -53,6 +56,123 @@ DECLARE_WEIGHT(mmx)
DECLARE_WEIGHT(sse2)
DECLARE_WEIGHT(ssse3)

+/** @{ */
+/**
+ * Define one qpel function
+ * LOOPSIZE must be already set to the number of pixels processed per iteration
+ * in the inner loop of the called functions.
+ * COFF(x) must be already defined so as to provide the offset into any array
+ * of coeffs used by the called function for the qpel position x
+ */
+#define QPEL_FUNC_DECL(OP, SIZE, PH, PV, OPT) \
+static void OP ## rv40_qpel ##SIZE ##_mc ##PH ##PV ##OPT(uint8_t *dst, uint8_t *src, int stride) \
+{ \
+ int i; \
+ if (PH && PV) { \
+ DECLARE_ALIGNED(16, uint8_t, tmp)[SIZE * (SIZE + 5)]; \
+ uint8_t *tmpptr = tmp + SIZE * 2; \
+ src -= stride * 2; \
+ \
+ for (i=0; i<SIZE; i += LOOPSIZE) \
+ ff_put_rv40_qpel_h ##OPT(tmp+i, SIZE, src+i, stride, SIZE+5, HCOFF(PH));\
+ for (i=0; i<SIZE; i += LOOPSIZE) \
+ ff_ ##OP ##rv40_qpel_v ##OPT(dst + i, stride, tmpptr + i, SIZE, SIZE, VCOFF(PV));\
+ } else if (PV) { \
+ for (i=0; i<SIZE; i += LOOPSIZE) \
+ ff_ ##OP ##rv40_qpel_v ## OPT(dst + i, stride, src + i, stride, SIZE, VCOFF(PV));\
+ } else { \
+ for (i=0; i<SIZE; i += LOOPSIZE) \
+ ff_ ##OP ##rv40_qpel_h ## OPT(dst + i, stride, src + i, stride, SIZE, HCOFF(PH));\
+ } \
+};
+
+/** Declare functions for sizes 8 and 16 and given operations and qpel position */
+#define QPEL_FUNCS_DECL(OP, PH, PV, OPT) \
+QPEL_FUNC_DECL(OP, 8, PH, PV, OPT) \
+QPEL_FUNC_DECL(OP, 16, PH, PV, OPT)
+
+/** Declare all functions for all sizes and qpel positions */
+#define QPEL_MC_DECL(OP, OPT) \
+void ff_ ##OP ##rv40_qpel_h ##OPT(uint8_t *dst, ptrdiff_t dstStride, \
+ const uint8_t *src, ptrdiff_t srcStride, \
+ int len, int m); \
+void ff_ ##OP ##rv40_qpel_v ##OPT(uint8_t *dst, ptrdiff_t dstStride, \
+ const uint8_t *src, ptrdiff_t srcStride, \
+ int len, int m); \
+QPEL_FUNCS_DECL(OP, 0, 1, OPT) \
+QPEL_FUNCS_DECL(OP, 0, 3, OPT) \
+QPEL_FUNCS_DECL(OP, 1, 0, OPT) \
+QPEL_FUNCS_DECL(OP, 1, 1, OPT) \
+QPEL_FUNCS_DECL(OP, 1, 2, OPT) \
+QPEL_FUNCS_DECL(OP, 1, 3, OPT) \
+QPEL_FUNCS_DECL(OP, 2, 1, OPT) \
+QPEL_FUNCS_DECL(OP, 2, 2, OPT) \
+QPEL_FUNCS_DECL(OP, 2, 3, OPT) \
+QPEL_FUNCS_DECL(OP, 3, 0, OPT) \
+QPEL_FUNCS_DECL(OP, 3, 1, OPT) \
+QPEL_FUNCS_DECL(OP, 3, 2, OPT)
+/** @} */
+
+#define LOOPSIZE 8
+#define HCOFF(x) (32*(x-1))
+#define VCOFF(x) (32*(x-1))
+QPEL_MC_DECL(put_, _ssse3)
+QPEL_MC_DECL(avg_, _ssse3)
+
+#undef LOOPSIZE
+#undef HCOFF
+#undef VCOFF
+#define LOOPSIZE 8
+#define HCOFF(x) (64*(x-1))
+#define VCOFF(x) (64*(x-1))
+QPEL_MC_DECL(put_, _sse2)
+QPEL_MC_DECL(avg_, _sse2)
+
+#if ARCH_X86_32
+#undef LOOPSIZE
+#undef HCOFF
+#undef VCOFF
+#define LOOPSIZE 4
+#define HCOFF(x) (64*(x-1))
+#define VCOFF(x) (64*(x-1))
+
+QPEL_MC_DECL(put_, _mmx)
+
+#define ff_put_rv40_qpel_h_mmx2 ff_put_rv40_qpel_h_mmx
+#define ff_put_rv40_qpel_v_mmx2 ff_put_rv40_qpel_v_mmx
+QPEL_MC_DECL(avg_, _mmx2)
+
+#define ff_put_rv40_qpel_h_3dnow ff_put_rv40_qpel_h_mmx
+#define ff_put_rv40_qpel_v_3dnow ff_put_rv40_qpel_v_mmx
+QPEL_MC_DECL(avg_, _3dnow)
+#endif
+
+/** @{ */
+/** Set one function */
+#define QPEL_FUNC_SET(OP, SIZE, PH, PV, OPT) \
+c-> OP ## pixels_tab[2-SIZE/8][4*PV+PH] = OP ## rv40_qpel ##SIZE ## _mc ##PH ##PV ##OPT;
+
+/** Set functions put and avg for sizes 8 and 16 and a given qpel position */
+#define QPEL_FUNCS_SET(OP, PH, PV, OPT) \
+QPEL_FUNC_SET(OP, 8, PH, PV, OPT) \
+QPEL_FUNC_SET(OP, 16, PH, PV, OPT)
+
+/** Set all functions for all sizes and qpel positions */
+#define QPEL_MC_SET(OP, OPT) \
+QPEL_FUNCS_SET (OP, 0, 1, OPT) \
+QPEL_FUNCS_SET (OP, 0, 3, OPT) \
+QPEL_FUNCS_SET (OP, 1, 0, OPT) \
+QPEL_FUNCS_SET (OP, 1, 1, OPT) \
+QPEL_FUNCS_SET (OP, 1, 2, OPT) \
+QPEL_FUNCS_SET (OP, 1, 3, OPT) \
+QPEL_FUNCS_SET (OP, 2, 1, OPT) \
+QPEL_FUNCS_SET (OP, 2, 2, OPT) \
+QPEL_FUNCS_SET (OP, 2, 3, OPT) \
+QPEL_FUNCS_SET (OP, 3, 0, OPT) \
+QPEL_FUNCS_SET (OP, 3, 1, OPT) \
+QPEL_FUNCS_SET (OP, 3, 2, OPT)
+/** @} */
+
void ff_rv40dsp_init_x86(RV34DSPContext *c, DSPContext *dsp)
{
#if HAVE_YASM
@@ -65,25 +185,42 @@ void ff_rv40dsp_init_x86(RV34DSPContext *c, DSPContext *dsp)
c->rv40_weight_pixels_tab[0][1] = ff_rv40_weight_func_rnd_8_mmx;
c->rv40_weight_pixels_tab[1][0] = ff_rv40_weight_func_nornd_16_mmx;
c->rv40_weight_pixels_tab[1][1] = ff_rv40_weight_func_nornd_8_mmx;
+ c->put_pixels_tab[0][15] = ff_put_rv40_qpel16_mc33_mmx;
+ c->put_pixels_tab[1][15] = ff_put_rv40_qpel8_mc33_mmx;
+ c->avg_pixels_tab[0][15] = ff_avg_rv40_qpel16_mc33_mmx;
+ c->avg_pixels_tab[1][15] = ff_avg_rv40_qpel8_mc33_mmx;
+#if ARCH_X86_32
+ QPEL_MC_SET(put_, _mmx)
+#endif
}
if (mm_flags & AV_CPU_FLAG_MMX2) {
c->avg_chroma_pixels_tab[0] = ff_avg_rv40_chroma_mc8_mmx2;
c->avg_chroma_pixels_tab[1] = ff_avg_rv40_chroma_mc4_mmx2;
+#if ARCH_X86_32
+ QPEL_MC_SET(avg_, _mmx2)
+#endif
} else if (mm_flags & AV_CPU_FLAG_3DNOW) {
c->avg_chroma_pixels_tab[0] = ff_avg_rv40_chroma_mc8_3dnow;
c->avg_chroma_pixels_tab[1] = ff_avg_rv40_chroma_mc4_3dnow;
+#if ARCH_X86_32
+ QPEL_MC_SET(avg_, _3dnow)
+#endif
}
if (mm_flags & AV_CPU_FLAG_SSE2) {
c->rv40_weight_pixels_tab[0][0] = ff_rv40_weight_func_rnd_16_sse2;
c->rv40_weight_pixels_tab[0][1] = ff_rv40_weight_func_rnd_8_sse2;
c->rv40_weight_pixels_tab[1][0] = ff_rv40_weight_func_nornd_16_sse2;
c->rv40_weight_pixels_tab[1][1] = ff_rv40_weight_func_nornd_8_sse2;
+ QPEL_MC_SET(put_, _sse2)
+ QPEL_MC_SET(avg_, _sse2)
}
if (mm_flags & AV_CPU_FLAG_SSSE3) {
c->rv40_weight_pixels_tab[0][0] = ff_rv40_weight_func_rnd_16_ssse3;
c->rv40_weight_pixels_tab[0][1] = ff_rv40_weight_func_rnd_8_ssse3;
c->rv40_weight_pixels_tab[1][0] = ff_rv40_weight_func_nornd_16_ssse3;
c->rv40_weight_pixels_tab[1][1] = ff_rv40_weight_func_nornd_8_ssse3;
+ QPEL_MC_SET(put_, _ssse3)
+ QPEL_MC_SET(avg_, _ssse3)
}
#endif
}
--
1.7.7.msysgit.0
Ronald S. Bultje
2012-04-17 20:37:34 UTC
Hi,

On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
<***@gmail.com> wrote:
[..]
+    lea      myd, [sixtap_filter_v+myq]
lea myq, ...
+    mova      m6, m1
+    movh      m5, [srcq+2*srcstrideq]      ; read new row
+    paddw     m6, m4
Can we use 3-arg stuff here to prepare for AVX functions? I.e. paddw m6, m1, m4.
+    mova      m1, m2
+    pmullw    m2, COEFF2
[..]
+    mova      m2, m3
+    pmullw    m3, COEFF3
Same here, and other places.
+    lea      myd, [myq + sixtap_filter_hb]
lea myq, ... Same in a few more places below.

Ronald
Christophe Gisquet
2012-04-18 10:53:25 UTC
Post by Ronald S. Bultje
+    lea      myd, [sixtap_filter_v+myq]
lea myq, ...
Also as an answer to Jason (the actual reply is pending completion of
the updated patch): I did my homework here as best I could rather than
just cargo-culting vp8's code. I did try to use add instead of lea,
but this code has a trick to it. myd holds the correct value, but the
high bits of myq may contain garbage, at least on win64, even though
they should be 0. That's the classical problem requiring
sign-extension. Here the values are always positive, so that one lea
instruction does both the add and the garbage handling in one step.

That's my explanation of this trick used by vp8's mc code. By just
doing an add, you get garbage in myq and crashes.

The other solution is to have the last argument be ptrdiff_t, but I'd
imagine this to be a loss.
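
To illustrate with a sketch (not the actual patch code; assume
sixtap_filter_v is an absolute address fitting in 32 bits):

    ; buggy: a 64-bit add drags the garbage high bits into the pointer
    add   myq, sixtap_filter_v
    ; safe: the effective address is truncated to 32 bits and zero-extended
    ; into myq, so any garbage above bit 31 is discarded
    lea   myd, [sixtap_filter_v+myq]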
Post by Ronald S. Bultje
+    mova      m6, m1
+    movh      m5, [srcq+2*srcstrideq]      ; read new row
+    paddw     m6, m4
Can we use 3-arg stuff here to prepare for AVX functions? I.e. paddw m6, m1, m4.
Probably. I'm not used to either AVX or that syntax, and I don't
really intend to validate the AVX code. I could, using obe2, but that
has proven to be too much trouble.
--
Christophe
Loren Merritt
2012-04-19 01:30:22 UTC
Post by Christophe Gisquet
Post by Ronald S. Bultje
+ lea myd, [sixtap_filter_v+myq]
lea myq, ...
Also as an answer to Jason (the actual reply is pending completion of
the updated patch): I did my homework here as best I could rather than
just cargo-culting vp8's code. I did try to use add instead of lea,
but this code has a trick to it. myd holds the correct value, but the
high bits of myq may contain garbage, at least on win64, even though
they should be 0. That's the classical problem requiring
sign-extension. Here the values are always positive, so that one lea
instruction does both the add and the garbage handling in one step.
That's my explanation of this trick used by vp8's mc code. By just
doing an add, you get garbage in myq and crashes.
The other solution is to have the last argument be ptrdiff_t, but I'd
imagine this to be a loss.
32bit add also zeros the high half.

--Loren Merritt
Christophe Gisquet
2012-04-19 06:50:21 UTC
Post by Loren Merritt
32bit add also zeros the high half.
And because of PIC, this becomes:
%ifdef PIC
add %1d, picregd
%else
add %1d, %2
%endif

And by the way, what's the advantage of add over lea here? Besides the
potentially 1 byte shorter encoding, I have just checked this is not a
win in latency.
--
Christophe
Jason Garrett-Glaser
2012-04-19 17:05:34 UTC
On Wed, Apr 18, 2012 at 11:50 PM, Christophe Gisquet
Post by Christophe GISQUET
Post by Loren Merritt
32bit add also zeros the high half.
%ifdef PIC
   add      %1d, picregd
%else
   add      %1d, %2
%endif
And by the way, what's the advantage of add over lea here? Besides the
potentially 1 byte shorter encoding, I have just checked this is not a
win in latency.
All else equal, add is better. On most CPUs, lea is either 1/1 or
1/0.5, while add is 1/0.33. 3-argument lea is 3/1 on some CPUs.

Jason
Christophe Gisquet
2012-04-19 21:26:04 UTC
Hi,
All else equal, add is better.  On most CPUs, lea is either 1/1 or
1/0.5, while add is 1/0.33.  3-argument lea is 3/1 on some CPUs.
OK, I guess I misread the latter as 1/throughput in some document.

New patch attached to use only add, and address other comments. As
reported in previous mails, I didn't get any improvement by loading
some more vectors in xmm8+.

Christophe
Ronald S. Bultje
2012-04-28 17:47:07 UTC
Hi,

On Thu, Apr 19, 2012 at 2:26 PM, Christophe Gisquet
Post by Christophe Gisquet
All else equal, add is better.  On most CPUs, lea is either 1/1 or
1/0.5, while add is 1/0.33.  3-argument lea is 3/1 on some CPUs.
OK, I guess I misread the latter as 1/throughput in some document.
New patch attached to use only add, and address other comments. As
reported in previous mails, I didn't get any improvement by loading
some more vectors in xmm8+.
Patch looks good, but crashes on Mac64 when running e.g.
fate-real-rv40 (Mac32 works fine):

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000000533970
0x0000000100533e8a in ff_put_rv40_qpel_h_ssse3 ()
(gdb) bt
#0 0x0000000100533e8a in ff_put_rv40_qpel_h_ssse3 ()
#1 0x00000001003e96c2 in rv34_mc (r=0x10187e8a8,
block_type=1606408880, xoff=1606408880, yoff=1606408880, mv_off=8,
width=1606408880, height=1, dir=0, thirdpel=0, weighted=0,
qpel_mc=0x10105d330, chroma_mc=0x10105d730) at rv34.c:763
(gdb) disass
Dump of assembler code for function ff_put_rv40_qpel_h_ssse3:
0x0000000100533e70 <ff_put_rv40_qpel_h_ssse3+0>: lea
-0x507(%rip),%rax # 0x100533970 <sixtap_filter_hb_m>
0x0000000100533e77 <ff_put_rv40_qpel_h_ssse3+7>: movdqa
-0x3df(%rip),%xmm3 # 0x100533aa0 <filter_h6_shuf2>
0x0000000100533e7f <ff_put_rv40_qpel_h_ssse3+15>: movdqa
-0x3d7(%rip),%xmm4 # 0x100533ab0 <filter_h6_shuf3>
0x0000000100533e87 <ff_put_rv40_qpel_h_ssse3+23>: add %eax,%r9d
0x0000000100533e8a <ff_put_rv40_qpel_h_ssse3+26>: movdqa (%r9),%xmm5
0x0000000100533e8f <ff_put_rv40_qpel_h_ssse3+31>: movdqa 0x10(%r9),%xmm6
0x0000000100533e95 <ff_put_rv40_qpel_h_ssse3+37>: movdqa
-0x40d(%rip),%xmm7 # 0x100533a90 <filter_h6_shuf1>

(gdb) info registers
rax 0x100533970 4300421488
rbx 0x101059800 4312111104
rcx 0x240 576
rdx 0x10183a687 4320372359
rsi 0x240 576
rdi 0x10187e8a8 4320651432
rbp 0xc5 0xc5
rsp 0x7fff5fbfdaa8 0x7fff5fbfdaa8
r8 0x8 8
r9 0x533970 5454192
r10 0x140 320
r11 0x0 0
r12 0x1018a5e54 4320812628
r13 0x10187e8a8 4320651432
r14 0x10183a687 4320372359
r15 0x10189aa54 4320766548
rip 0x100533e8a 0x100533e8a <ff_put_rv40_qpel_h_ssse3+26>
eflags 0x10202 66050
cs 0x27 39
ss 0x0 0
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0

Ronald
Ronald S. Bultje
2012-04-28 17:57:58 UTC
Hi,
Post by Ronald S. Bultje
On Thu, Apr 19, 2012 at 2:26 PM, Christophe Gisquet
Post by Christophe Gisquet
All else equal, add is better.  On most CPUs, lea is either 1/1 or
1/0.5, while add is 1/0.33.  3-argument lea is 3/1 on some CPUs.
OK, I guess I misread the latter as 1/throughput in some document.
New patch attached to use only add, and address other comments. As
reported in previous mails, I didn't get any improvement by loading
some more vectors in xmm8+.
Patch looks good, but crashes on Mac64 when running e.g.
+%macro LOAD 2
+%ifdef PIC
+ add %1d, picregd
+%else

This clears the upper 4 bytes on x86-64, thus creating an invalid
pointer. I can't quickly figure out how to fix this. movsxd+add %1q,
picregq makes it work, as a proof of principle of where the bug is.
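
As a sketch, the proof-of-principle fix amounts to (untested in this
exact form):

%ifdef PIC
    movsxd   %1q, %1d        ; extend the 32-bit offset to a clean 64 bits
    add      %1q, picregq    ; full-width add keeps the high pointer bits
%else
    add      %1d, %2
%endif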

Ronald
Christophe Gisquet
2012-05-10 14:06:00 UTC
Hi,
Post by Ronald S. Bultje
Patch looks good, but crashes on Mac64 when running e.g.
+%macro LOAD  2
+%ifdef PIC
+   add      %1d, picregd
+%else
This clears the upper 4 bytes on x86-64, thus creating an invalid
pointer. I can't quickly figure out how to fix this. movsxd+add %1q,
picregq makes it work, as a proof of principle of where the bug is.
For some reason (I'm not knowledgeable enough about its ABI), win64
looked fine, with the picregq address fitting in 32 bits when debugging,
so a 32-bit add looked fine.

But, to be careful, since only win64 can have garbage in %1, I
special-cased it, and now always use the full register width in the new
attached patch. Tested on win32 and win64.
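
For reference, the new macro should look something like this (a sketch;
the attached patch, which is not reproduced here, is authoritative):

%macro LOAD 2
%if WIN64
    movsxd   %1q, %1d        ; only win64 may leave garbage in the high bits
%endif
%ifdef PIC
    add      %1q, picregq    ; always a full-width add
%else
    add      %1q, %2
%endif
%endmacro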

Christophe
Ronald S. Bultje
2012-05-10 16:20:37 UTC
Hi,

On Thu, May 10, 2012 at 7:06 AM, Christophe Gisquet
Post by Christophe Gisquet
Hi,
Post by Ronald S. Bultje
Patch looks good, but crashes on Mac64 when running e.g.
+%macro LOAD  2
+%ifdef PIC
+   add      %1d, picregd
+%else
This clears the upper 4 bytes on x86-64, thus creating an invalid
pointer. I can't quickly figure out how to fix this. movsxd+add %1q,
picregq makes it work, as a proof of principle of where the bug is.
For some reason (I'm not knowledgeable enough about its ABI), win64
looked fine, with the picregq address fitting in 32 bits when debugging,
so a 32-bit add looked fine.
But, to be careful, since only win64 can have garbage in %1, I
special-cased it, and now always use the full register width in the new
attached patch. Tested on win32 and win64.
This works for me. Thanks, patch LGTM.

Ronald
Diego Biurrun
2012-05-10 16:43:15 UTC
Post by Ronald S. Bultje
On Thu, May 10, 2012 at 7:06 AM, Christophe Gisquet
Post by Christophe Gisquet
Post by Ronald S. Bultje
Patch looks good, but crashes on Mac64 when running e.g.
+%macro LOAD  2
+%ifdef PIC
+   add      %1d, picregd
+%else
This clears the upper 4 bytes on x86-64, thus creating an invalid
pointer. I can't quickly figure out how to fix this. movsxd+add %1q,
picregq makes it work, as a proof of principle of where the bug is.
For some reason (I'm not knowledgeable enough about its ABI), win64
looked fine, with the picregq address fitting in 32 bits when debugging,
so a 32-bit add looked fine.
But, to be careful, since only win64 can have garbage in %1, I
special-cased it, and now always use the full register width in the new
attached patch. Tested on win32 and win64.
This works for me. Thanks, patch LGTM.
Prettified and queued, will push with the next batch after FATE passes.

Diego
Jason Garrett-Glaser
2012-04-17 22:00:13 UTC
Post by Christophe GISQUET
+%macro FILTER_V 1
+cglobal %1_rv40_qpel_v, 6,6+npicregs,12, dst, dststride, src, srcstride, height, my, picreg
+%ifdef PIC
+    lea  picregq, [sixtap_filter_v_m]
+%endif
+    pxor      m7, m7
+    lea      myd, [sixtap_filter_v+myq]
Can't this be add, at least on 32-bit?
Post by Christophe GISQUET
+    ; read 5 lines
+    sub     srcq, srcstrideq
+    sub     srcq, srcstrideq
+    movh      m0, [srcq]
+    movh      m1, [srcq+srcstrideq]
+    movh      m2, [srcq+srcstrideq*2]
+    lea     srcq, [srcq+srcstrideq*2]
+    add     srcq, srcstrideq
This add can be postponed until after the following movhs. You don't
want pointer dependency chains going on here.
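
Something along these lines (sketch):

    lea     srcq, [srcq+srcstrideq*2]
    movh    m3, [srcq+srcstrideq]      ; load relative to the old pointer
    movh    m4, [srcq+srcstrideq*2]
    add     srcq, srcstrideq           ; the update no longer gates the loads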
Post by Christophe GISQUET
+    paddw     m0, m1
+    paddw     m2, m3
+    paddw     m0, m6
+    paddw     m0, m2
m6 should be added first because there is no dependency for m6.
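
I.e. (sketch) start with the operand that is ready:

    paddw     m0, m6    ; m6 holds the rounder, no pmullw dependency
    paddw     m2, m3
    paddw     m0, m1
    paddw     m0, m2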
Post by Christophe GISQUET
+    pmulhrsw  m0, [shift6_round]
Put this in a register on x86_64.

Jason
Christophe Gisquet
2012-04-18 11:15:48 UTC
Post by Jason Garrett-Glaser
+    lea      myd, [sixtap_filter_v+myq]
Can't this be add, at least on 32-bit?
I don't think so, see my reply to Ronald.
Post by Jason Garrett-Glaser
This add can be postponed until after the following movhs.  You don't
want pointer dependency chains going on here.
I did test this (and revalidated it). It is 1-2 cycles slower, which I
attribute to the different address patterns. This can be debated,
though, because I would also think the effect you mention should be
stronger.

There's the inverse occurrence later on with a [myq+16] address:
adding 16 to myq and then using [myq] proved to be slower by 1 cycle.
Post by Jason Garrett-Glaser
+    paddw     m0, m1
+    paddw     m2, m3
+    paddw     m0, m6
+    paddw     m0, m2
m6 should be added first because there is no dependency for m6.
It didn't make a difference on a Core i5 and a Penryn, but that might
depend on pmullw latency, so, agreed.
Post by Jason Garrett-Glaser
+    pmulhrsw  m0, [shift6_round]
Put this in a register on x86_64.
I tested extensively when and when not to do this for various elements
(filter coeffs or rounders), and the current state represents the best
I could come up with. For the detail here (which I just revalidated on
ssse3 functions), doing this in the horizontal case led to a constant
loss (around 4 cycles), and no difference for the vertical case. I
attribute this to the need for backing up xmm8 to the stack, which
offsets the gain of accessing data that ends up being already fetched
for the following iterations.

So, I can do it, since it is debatable, but reality on a Penryn and a
Core i5 disagrees convincingly.

Seeing all the particular choices I made, I was wondering whether to
add them as comments, as commit messages, or just leave them to this
discussion.
--
Christophe
Ronald S. Bultje
2012-04-18 18:18:13 UTC
Hi,

On Wed, Apr 18, 2012 at 4:15 AM, Christophe Gisquet
Post by Christophe Gisquet
Post by Jason Garrett-Glaser
+    lea      myd, [sixtap_filter_v+myq]
Can't this be add, at least on 32-bit?
I don't think so, see my reply to Ronald.
Post by Jason Garrett-Glaser
This add can be postponed until after the following movhs.  You don't
want pointer dependency chains going on here.
I did test this (and revalidated it). It is 1-2 cycles slower, which I
attribute to the different address patterns. This can be debated,
though, because I would also think the effect you mention should be
stronger.
adding 16 to myq and then using [myq] proved to be slower by 1 cycle.
Post by Jason Garrett-Glaser
+    paddw     m0, m1
+    paddw     m2, m3
+    paddw     m0, m6
+    paddw     m0, m2
m6 should be added first because there is no dependency for m6.
It didn't make a difference on a Core i5 and a Penryn, but that might
depend on pmullw latency, so, agreed.
Post by Jason Garrett-Glaser
+    pmulhrsw  m0, [shift6_round]
Put this in a register on x86_64.
I tested extensively when and when not to do this for various elements
(filter coeffs or rounders), and the current state represents the best
I could come up with. For the detail here (which I just revalidated on
ssse3 functions), doing this in the horizontal case led to a constant
loss (around 4 cycles), and no difference for the vertical case. I
attribute this to the need for backing up xmm8 to the stack, which
offsets the gain of accessing data that ends up being already fetched
for the following iterations.
That's Win64-specific, the Mac/Unix 64 builds could still benefit, no?

Ronald
Christophe Gisquet
2012-04-18 19:46:08 UTC
Post by Ronald S. Bultje
That's Win64-specific, the Mac/Unix 64 builds could still benefit, no?
So you recommend something like:
%if WIN64
lea myd, [sixtap_filter_v+myq]
%else
add myq, sixtap_filter_v
%endif

Or better (not tested):
%macro LOAD_COEFF_TAB 2
%if WIN64
lea %1d, [%2+%1q]
%else
add %1q, %2
%endif
%endmacro
--
Christophe
Ronald S. Bultje
2012-04-18 21:56:33 UTC
Hi,

On Wed, Apr 18, 2012 at 12:46 PM, Christophe Gisquet
Post by Christophe GISQUET
Post by Ronald S. Bultje
That's Win64-specific, the Mac/Unix 64 builds could still benefit, no?
%if WIN64
lea      myd, [sixtap_filter_v+myq]
%else
add myq, sixtap_filter_v
%endif
%macro LOAD_COEFF_TAB 2
%if WIN64
lea      %1d, [%2+%1q]
%else
add %1q, %2
%endif
%endmacro
I actually meant the comment about pushing xmm8 to the stack.

Ronald
Christophe Gisquet
2012-04-19 12:06:36 UTC
Hi,
Post by Ronald S. Bultje
I actually meant the comment about pushing xmm8 to the stack.
Tested that on linux x86_64 for sse2 (vertical) and ssse3
(horizontal): at best, no difference. I kind of remember testing that,
and that's why the horizontal sse2 does load pw_32 into m6.

Christophe
Christophe GISQUET
2012-04-17 18:44:42 UTC
141 cycles down to 56.
---
libavcodec/x86/rv34dsp.asm | 77 +++++++++++++++++++++++++++++++++++++++++
libavcodec/x86/rv34dsp_init.c | 2 +
2 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/libavcodec/x86/rv34dsp.asm b/libavcodec/x86/rv34dsp.asm
index 2d2f6e1..3d08500 100644
--- a/libavcodec/x86/rv34dsp.asm
+++ b/libavcodec/x86/rv34dsp.asm
@@ -22,6 +22,16 @@
%include "x86inc.asm"
%include "x86util.asm"

+SECTION_RODATA
+pw_row_coeffs: times 4 dw 13
+ times 4 dw 17
+ times 4 dw 7
+pd_512: times 2 dd 0x200
+pw_col_coeffs: dw 13, 13, 13, -13
+ dw 17, 7, 7, -17
+ dw 13, -13, 13, 13
+ dw -7, 17, -17, -7
+
SECTION .text

%macro IDCT_DC_NOROUND 1
@@ -88,6 +98,73 @@ cglobal rv34_idct_dc_add_mmx, 3, 3
movh [r2+r1], m5
RET

+; Load coeffs and perform row transform
+; Output: coeffs in mm[0467], rounder in mm5
+%macro ROW_TRANSFORM 1
+ pxor mm7, mm7
+ movq mm0, [%1+ 0*8]
+ movq mm1, [%1+ 1*8]
+ movq mm2, [%1+ 2*8]
+ movq mm3, [%1+ 3*8]
+ movq [%1+ 0*8], mm7
+ movq [%1+ 1*8], mm7
+ movq [%1+ 2*8], mm7
+ movq [%1+ 3*8], mm7
+ movq mm4, mm0
+ paddsw mm0, mm2 ; b0 + b2
+ psubsw mm4, mm2 ; b0 - b2
+ pmullw mm0, [pw_row_coeffs+ 0] ; *13 = z0
+ pmullw mm4, [pw_row_coeffs+ 0] ; *13 = z1
+ movq mm5, mm1
+ pmullw mm1, [pw_row_coeffs+ 8] ; b1*17
+ pmullw mm5, [pw_row_coeffs+16] ; b1* 7
+ movq mm7, mm3
+ pmullw mm3, [pw_row_coeffs+ 8] ; b3*17
+ pmullw mm7, [pw_row_coeffs+16] ; b3* 7
+ paddsw mm1, mm7 ; z3 = b1*17 + b3* 7
+ psubsw mm5, mm3 ; z2 = b1* 7 - b3*17
+ movq mm7, mm0
+ movq mm6, mm4
+ paddsw mm0, mm1 ; z0 + z3
+ psubsw mm7, mm1 ; z0 - z3
+ paddsw mm4, mm5 ; z1 + z2
+ psubsw mm6, mm5 ; z1 - z2
+ movq mm5, [pd_512] ; 0x200
+%endmacro
+
+; ff_rv34_idct_add_mmx2(uint8_t *dst, ptrdiff_t stride, DCTELEM *block);
+%macro COL_TRANSFORM 4
+ pshufw mm3, %2, 0xDD ; col. 1,3,1,3
+ pshufw %2, %2, 0x88 ; col. 0,2,0,2
+ pmaddwd %2, %3 ; 13*c0+13*c2 | 13*c0-13*c2 = z0 | z1
+ pmaddwd mm3, %4 ; 17*c1+ 7*c3 | 7*c1-17*c3 = z3 | z2
+ paddd %2, mm5
+ pshufw mm1, %2, 01001110b ; z1 | z0
+ pshufw mm2, mm3, 01001110b ; z2 | z3
+ paddd %2, mm3 ; z0+z3 | z1+z2
+ psubd mm1, mm2 ; z1-z2 | z0-z3
+ movd mm3, %1
+ psrad %2, 10
+ pxor mm2, mm2
+ psrad mm1, 10
+ punpcklbw mm3, mm2
+ packssdw %2, mm1
+ paddw %2, mm3
+ packuswb %2, %2
+ movd %1, %2
+%endmacro
+INIT_MMX mmx2
+cglobal rv34_idct_add, 3,3,0, d, s, b
+ ROW_TRANSFORM bq, 1
+ COL_TRANSFORM [dq], mm0, [pw_col_coeffs+ 0], [pw_col_coeffs+ 8]
+ movq mm0, [pw_col_coeffs+ 0]
+ COL_TRANSFORM [dq+sq], mm4, mm0, [pw_col_coeffs+ 8]
+ movq mm4, [pw_col_coeffs+ 8]
+ lea dq, [dq + 2*sq]
+ COL_TRANSFORM [dq], mm6, mm0, mm4
+ COL_TRANSFORM [dq+sq], mm7, mm0, mm4
+ ret
+
; ff_rv34_idct_dc_add_sse4(uint8_t *dst, int stride, int dc);
INIT_XMM
cglobal rv34_idct_dc_add_sse4, 3, 3, 6
diff --git a/libavcodec/x86/rv34dsp_init.c b/libavcodec/x86/rv34dsp_init.c
index 3883125..d91818c 100644
--- a/libavcodec/x86/rv34dsp_init.c
+++ b/libavcodec/x86/rv34dsp_init.c
@@ -28,6 +28,7 @@ void ff_rv34_idct_dc_mmx2(DCTELEM *block);
void ff_rv34_idct_dc_noround_mmx2(DCTELEM *block);
void ff_rv34_idct_dc_add_mmx(uint8_t *dst, ptrdiff_t stride, int dc);
void ff_rv34_idct_dc_add_sse4(uint8_t *dst, ptrdiff_t stride, int dc);
+void ff_rv34_idct_add_mmx2(uint8_t *dst, ptrdiff_t stride, DCTELEM *block);

av_cold void ff_rv34dsp_init_x86(RV34DSPContext* c, DSPContext *dsp)
{
@@ -38,6 +39,7 @@ av_cold void ff_rv34dsp_init_x86(RV34DSPContext* c, DSPContext *dsp)
c->rv34_idct_dc_add = ff_rv34_idct_dc_add_mmx;
if (mm_flags & AV_CPU_FLAG_MMX2) {
c->rv34_inv_transform_dc = ff_rv34_idct_dc_noround_mmx2;
+ c->rv34_idct_add = ff_rv34_idct_add_mmx2;
}
if (mm_flags & AV_CPU_FLAG_SSE4)
c->rv34_idct_dc_add = ff_rv34_idct_dc_add_sse4;
--
1.7.7.msysgit.0
Christophe Gisquet
2012-04-17 18:53:43 UTC
Post by Christophe GISQUET
141 cycles down to 56.
Part of the patch series, so it was resent unintentionally. Ronald was
having a look at it already.

Christophe
Ronald S. Bultje
2012-04-17 23:20:59 UTC
Hi,

On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
Post by Christophe GISQUET
141 cycles down to 56.
---
 libavcodec/x86/rv34dsp.asm    |   77 +++++++++++++++++++++++++++++++++++++++++
 libavcodec/x86/rv34dsp_init.c |    2 +
 2 files changed, 79 insertions(+), 0 deletions(-)
I have some ideas but lack time to actually try them properly; they
weren't directly faster and thus probably need a little more work, so
let's put this in. I'll mark a mental todo to look at this later (if ever).

Shorter: OK.

Ronald
Jason Garrett-Glaser
2012-04-18 00:21:07 UTC
+    movq        mm0, [%1+ 0*8]
+    movq        mm1, [%1+ 1*8]
+    movq        mm2, [%1+ 2*8]
+    movq        mm3, [%1+ 3*8]
+    movq  [%1+ 0*8], mm7
+    movq  [%1+ 1*8], mm7
+    movq  [%1+ 2*8], mm7
+    movq  [%1+ 3*8], mm7
s/movq/mova/g to match typical coding style.
+    movq        mm4, mm0
+    paddsw      mm0, mm2                ; b0 + b2
+    psubsw      mm4, mm2                ; b0 - b2
+    pmullw      mm0, [pw_row_coeffs+ 0] ; *13 = z0
+    pmullw      mm4, [pw_row_coeffs+ 0] ; *13 = z1
+    movq        mm5, mm1
+    pmullw      mm1, [pw_row_coeffs+ 8] ; b1*17
+    pmullw      mm5, [pw_row_coeffs+16] ; b1* 7
+    movq        mm7, mm3
+    pmullw      mm3, [pw_row_coeffs+ 8] ; b3*17
+    pmullw      mm7, [pw_row_coeffs+16] ; b3* 7
We have some extra regs here, maybe use them temporarily to store the
coeffs so that you avoid the redundant loads?

Jason
Christophe Gisquet
2012-04-18 07:03:39 UTC
Post by Jason Garrett-Glaser
+    movq        mm0, [%1+ 0*8]
+    movq        mm1, [%1+ 1*8]
+    movq        mm2, [%1+ 2*8]
+    movq        mm3, [%1+ 3*8]
+    movq  [%1+ 0*8], mm7
+    movq  [%1+ 1*8], mm7
+    movq  [%1+ 2*8], mm7
+    movq  [%1+ 3*8], mm7
s/movq/mova/g to match typical coding style.
OK. I thought explicitly using mmx regs would favor using movq over
mova, but ok.
Post by Jason Garrett-Glaser
+    movq        mm4, mm0
+    paddsw      mm0, mm2                ; b0 + b2
+    psubsw      mm4, mm2                ; b0 - b2
+    pmullw      mm0, [pw_row_coeffs+ 0] ; *13 = z0
+    pmullw      mm4, [pw_row_coeffs+ 0] ; *13 = z1
+    movq        mm5, mm1
+    pmullw      mm1, [pw_row_coeffs+ 8] ; b1*17
+    pmullw      mm5, [pw_row_coeffs+16] ; b1* 7
+    movq        mm7, mm3
+    pmullw      mm3, [pw_row_coeffs+ 8] ; b3*17
+    pmullw      mm7, [pw_row_coeffs+16] ; b3* 7
We have some extra regs here, maybe use them temporarily to store the
coeffs so that you avoid the redundant loads?
I had already tried this, but not thoroughly. In fact, it is
worthwhile only once (for a 1 cycle gain on a Core i5), as the other
coefficients are probably already fetched once the first one is loaded
(keeping them in registers benchmarked as a 1-2 cycle slowdown). Done
in the attached patch.
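
For the record, the one worthwhile reuse looks something like this (a
sketch; which coefficient gets the register is my guess, the attached
patch is authoritative):

    movq        mm6, [pw_row_coeffs+ 8] ; load the *17 coefficients once
    pmullw      mm1, mm6                ; b1*17
    pmullw      mm3, mm6                ; b3*17, reusing mm6 instead of
                                        ; reloading from memory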
--
Christophe
Jason Garrett-Glaser
2012-04-18 10:25:37 UTC
On Wed, Apr 18, 2012 at 12:03 AM, Christophe Gisquet
Post by Christophe Gisquet
Post by Jason Garrett-Glaser
+    movq        mm0, [%1+ 0*8]
+    movq        mm1, [%1+ 1*8]
+    movq        mm2, [%1+ 2*8]
+    movq        mm3, [%1+ 3*8]
+    movq  [%1+ 0*8], mm7
+    movq  [%1+ 1*8], mm7
+    movq  [%1+ 2*8], mm7
+    movq  [%1+ 3*8], mm7
s/movq/mova/g to match typical coding style.
OK. I thought explicitly using mmx regs would favor using movq over
mova, but ok.
Oh, that's okay then, I missed that. Better to use m than mm in new
code, but either is okay, it doesn't really matter.

Jason
Ronald S. Bultje
2012-04-28 18:00:01 UTC
Hi,
Post by Jason Garrett-Glaser
On Wed, Apr 18, 2012 at 12:03 AM, Christophe Gisquet
Post by Christophe Gisquet
Post by Jason Garrett-Glaser
+    movq        mm0, [%1+ 0*8]
+    movq        mm1, [%1+ 1*8]
+    movq        mm2, [%1+ 2*8]
+    movq        mm3, [%1+ 3*8]
+    movq  [%1+ 0*8], mm7
+    movq  [%1+ 1*8], mm7
+    movq  [%1+ 2*8], mm7
+    movq  [%1+ 3*8], mm7
s/movq/mova/g to match typical coding style.
OK. I thought explicitly using mmx regs would favor using movq over
mova, but ok.
Oh, that's okay then, I missed that.  Better to use m than mm in new
code, but either is okay, it doesn't really matter.
Pushed.

Ronald
Christophe GISQUET
2012-04-17 18:44:43 UTC
---
libavcodec/rv34dsp.c | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/libavcodec/rv34dsp.c b/libavcodec/rv34dsp.c
index 4145c4d..1ddcea4 100644
--- a/libavcodec/rv34dsp.c
+++ b/libavcodec/rv34dsp.c
@@ -88,15 +88,15 @@ static void rv34_inv_transform_noround_c(DCTELEM *block){
rv34_row_transform(temp, block);

for(i = 0; i < 4; i++){
- const int z0 = 13*(temp[4*0+i] + temp[4*2+i]);
- const int z1 = 13*(temp[4*0+i] - temp[4*2+i]);
- const int z2 = 7* temp[4*1+i] - 17*temp[4*3+i];
- const int z3 = 17* temp[4*1+i] + 7*temp[4*3+i];
-
- block[i*4+0] = ((z0 + z3) * 3) >> 11;
- block[i*4+1] = ((z1 + z2) * 3) >> 11;
- block[i*4+2] = ((z1 - z2) * 3) >> 11;
- block[i*4+3] = ((z0 - z3) * 3) >> 11;
+ const int z0 = 39*(temp[4*0+i] + temp[4*2+i]);
+ const int z1 = 39*(temp[4*0+i] - temp[4*2+i]);
+ const int z2 = 21* temp[4*1+i] - 51*temp[4*3+i];
+ const int z3 = 51* temp[4*1+i] + 21*temp[4*3+i];
+
+ block[i*4+0] = (z0 + z3) >> 11;
+ block[i*4+1] = (z1 + z2) >> 11;
+ block[i*4+2] = (z1 - z2) >> 11;
+ block[i*4+3] = (z0 - z3) >> 11;
}
}
--
1.7.7.msysgit.0
Ronald S. Bultje
2012-04-17 23:19:27 UTC
Hi,

On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
---
 libavcodec/rv34dsp.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)
OK.

Ronald
Ronald S. Bultje
2012-04-28 18:16:42 UTC
Hi,
Post by Ronald S. Bultje
On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
---
 libavcodec/rv34dsp.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)
OK.
Pushed.

Ronald
Christophe GISQUET
2012-04-17 18:44:44 UTC
This should allow simpler code for x86 SIMD implementations.
---
libavcodec/rv34dsp.h | 3 +--
libavcodec/rv40.c | 14 +++++++-------
libavcodec/rv40dsp.c | 25 +++++++++++--------------
3 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/libavcodec/rv34dsp.h b/libavcodec/rv34dsp.h
index 58da59f..bdc189b 100644
--- a/libavcodec/rv34dsp.h
+++ b/libavcodec/rv34dsp.h
@@ -50,8 +50,7 @@ typedef void (*rv40_strong_loop_filter_func)(uint8_t *src, ptrdiff_t stride,
int dmode, int chroma);

typedef int (*rv40_loop_filter_strength_func)(uint8_t *src, ptrdiff_t stride,
- int beta, int beta2, int edge,
- int *p1, int *q1);
+ int32_t *betas, int edge, int32_t *p1q1);

typedef struct RV34DSPContext {
qpel_mc_func put_pixels_tab[4][16];
diff --git a/libavcodec/rv40.c b/libavcodec/rv40.c
index b65a200..d29698c 100644
--- a/libavcodec/rv40.c
+++ b/libavcodec/rv40.c
@@ -300,23 +300,23 @@ static void rv40_adaptive_loop_filter(RV34DSPContext *rdsp,
int alpha, int beta, int beta2,
int chroma, int edge, int dir)
{
- int filter_p1, filter_q1;
+ int32_t betas[2] = { beta*4, beta2 };
+ int32_t p1q1[2];
int strong;
int lims;

- strong = rdsp->rv40_loop_filter_strength[dir](src, stride, beta, beta2,
- edge, &filter_p1, &filter_q1);
+ strong = rdsp->rv40_loop_filter_strength[dir](src, stride, betas, edge, p1q1);

- lims = filter_p1 + filter_q1 + ((lim_q1 + lim_p1) >> 1) + 1;
+ lims = p1q1[0] + p1q1[1] + ((lim_q1 + lim_p1) >> 1) + 1;

if (strong) {
rdsp->rv40_strong_loop_filter[dir](src, stride, alpha,
lims, dmode, chroma);
- } else if (filter_p1 & filter_q1) {
+ } else if (p1q1[0] & p1q1[1]) {
rdsp->rv40_weak_loop_filter[dir](src, stride, 1, 1, alpha, beta,
lims, lim_q1, lim_p1);
- } else if (filter_p1 | filter_q1) {
- rdsp->rv40_weak_loop_filter[dir](src, stride, filter_p1, filter_q1,
+ } else if (p1q1[0] | p1q1[1]) {
+ rdsp->rv40_weak_loop_filter[dir](src, stride, p1q1[0], p1q1[1],
alpha, beta, lims >> 1, lim_q1 >> 1,
lim_p1 >> 1);
}
diff --git a/libavcodec/rv40dsp.c b/libavcodec/rv40dsp.c
index 19a18d3..9166671 100644
--- a/libavcodec/rv40dsp.c
+++ b/libavcodec/rv40dsp.c
@@ -468,9 +468,8 @@ static void rv40_v_strong_loop_filter(uint8_t *src, const ptrdiff_t stride,

static av_always_inline int rv40_loop_filter_strength(uint8_t *src,
int step, ptrdiff_t stride,
- int beta, int beta2,
- int edge,
- int *p1, int *q1)
+ int32_t *betas, int edge,
+ int32_t *p1q1)
{
int sum_p1p0 = 0, sum_q1q0 = 0, sum_p1p2 = 0, sum_q1q2 = 0;
int strong0 = 0, strong1 = 0;
@@ -482,10 +481,10 @@ static av_always_inline int rv40_loop_filter_strength(uint8_t *src,
sum_q1q0 += ptr[ 1*step] - ptr[ 0*step];
}

- *p1 = FFABS(sum_p1p0) < (beta << 2);
- *q1 = FFABS(sum_q1q0) < (beta << 2);
+ p1q1[0] = FFABS(sum_p1p0) < betas[0];
+ p1q1[1] = FFABS(sum_q1q0) < betas[0];

- if(!*p1 && !*q1)
+ if(!(p1q1[0]|p1q1[1]))
return 0;

if (!edge)
@@ -496,24 +495,22 @@ static av_always_inline int rv40_loop_filter_strength(uint8_t *src,
sum_q1q2 += ptr[ 1*step] - ptr[ 2*step];
}

- strong0 = *p1 && (FFABS(sum_p1p2) < beta2);
- strong1 = *q1 && (FFABS(sum_q1q2) < beta2);
+ strong0 = p1q1[0] && (FFABS(sum_p1p2) < betas[1]);
+ strong1 = p1q1[1] && (FFABS(sum_q1q2) < betas[1]);

return strong0 && strong1;
}

static int rv40_h_loop_filter_strength(uint8_t *src, ptrdiff_t stride,
- int beta, int beta2, int edge,
- int *p1, int *q1)
+ int32_t *betas, int edge, int32_t *p1q1)
{
- return rv40_loop_filter_strength(src, stride, 1, beta, beta2, edge, p1, q1);
+ return rv40_loop_filter_strength(src, stride, 1, betas, edge, p1q1);
}

static int rv40_v_loop_filter_strength(uint8_t *src, ptrdiff_t stride,
- int beta, int beta2, int edge,
- int *p1, int *q1)
+ int32_t *betas, int edge, int32_t *p1q1)
{
- return rv40_loop_filter_strength(src, 1, stride, beta, beta2, edge, p1, q1);
+ return rv40_loop_filter_strength(src, 1, stride, betas, edge, p1q1);
}

av_cold void ff_rv40dsp_init(RV34DSPContext *c, DSPContext* dsp) {
--
1.7.7.msysgit.0
Måns Rullgård
2012-04-17 18:49:11 UTC
Post by Christophe GISQUET
This should allow simpler code for x86 SIMD implementations.
---
libavcodec/rv34dsp.h | 3 +--
libavcodec/rv40.c | 14 +++++++-------
libavcodec/rv40dsp.c | 25 +++++++++++--------------
3 files changed, 19 insertions(+), 23 deletions(-)
This breaks the NEON asm.
--
Måns Rullgård
***@mansr.com
Christophe Gisquet
2012-04-17 18:55:38 UTC
Post by Måns Rullgård
This breaks the NEON asm.
Indeed, that's why I tagged it as "controversial" in the first mail,
and why I split it out of my original implementation.

I don't know if it can be a win for the neon implementation.

Christophe
Måns Rullgård
2012-04-17 21:11:45 UTC
Post by Christophe Gisquet
Post by Måns Rullgård
This breaks the NEON asm.
Indeed, that's why I tagged it as "controversial" in the first mail,
and why I split it out of my original implementation.
I don't know if it can be a win for the neon implementation.
Probably not much difference, but it will need to be updated.
I don't expect you to make the change. I do, however, expect any
patches that break existing asm to be clearly marked so people can spot
them and take action.
--
Måns Rullgård
***@mansr.com
Ronald S. Bultje
2012-04-28 18:03:50 UTC
Hi,
Post by Måns Rullgård
Post by Christophe Gisquet
Post by Måns Rullgård
This breaks the NEON asm.
Indeed, that's why I tagged it as "controversial" in the first mail,
and why I split it out of my original implementation.
I don't know if it can be a win for the neon implementation.
Probably not much difference, but it will need to be updated.
I don't expect you to make the change.  I do, however, expect any
patches that break existing asm to be clearly marked so people can spot
them and take action.
Progress on this for the neon part?

Ronald
Christophe GISQUET
2012-04-17 18:44:45 UTC
v: 66/57 -> 33/31 cycles
h: 60/47 -> 43/35 cycles
---
libavcodec/x86/rv40dsp.asm | 165 +++++++++++++++++++++++++++++++++++++++++
libavcodec/x86/rv40dsp_init.c | 7 ++
2 files changed, 172 insertions(+), 0 deletions(-)

diff --git a/libavcodec/x86/rv40dsp.asm b/libavcodec/x86/rv40dsp.asm
index cac78bb..4c580da 100644
--- a/libavcodec/x86/rv40dsp.asm
+++ b/libavcodec/x86/rv40dsp.asm
@@ -68,12 +68,177 @@ filter_h6_shuf1: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
filter_h6_shuf2: db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10
filter_h6_shuf3: db 5, 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11

+pw_mask: dw 0xFFFF, 0x0000, 0xFFFF, 0xFFFF
+
cextern pw_32
cextern pw_16

SECTION .text

;-----------------------------------------------------------------------------
+; loop filter strength functions:
+; rv40_loop_filter_strength(uint8_t *src, ptrdiff_t stride,
+; int32_t* betas, int edge, int32_t *p1q1);
+; x86_32: all params on stack
+; win64: rcx,rdx,r8,r9, stack
+; rest: rdi,rsi,rdx,rcx,r8,r9,...
+;-----------------------------------------------------------------------------
+INIT_MMX mmx2
+cglobal rv40_v_loop_filter_strength, 2,2,0, src, stride
+ pxor mm0, mm0
+ movq mm5, [pw_mask]
+ movd mm1, [srcq+0*strideq-3]
+ movd mm2, [srcq+0*strideq+1]
+ punpcklbw mm1, mm0
+ punpcklbw mm2, mm0
+ movd mm3, [srcq+1*strideq-3]
+ movd mm4, [srcq+1*strideq+1]
+ lea srcq, [srcq+2*strideq]
+ punpcklbw mm3, mm0
+ punpcklbw mm4, mm0
+ paddw mm1, mm3
+ paddw mm2, mm4
+ movd mm3, [srcq+0*strideq-3]
+ movd mm4, [srcq+0*strideq+1]
+ punpcklbw mm3, mm0
+ punpcklbw mm4, mm0
+ paddw mm1, mm3
+ paddw mm2, mm4
+ movd mm3, [srcq+1*strideq-3]
+ movd mm4, [srcq+1*strideq+1]
+ punpcklbw mm3, mm0
+ punpcklbw mm4, mm0
+ paddw mm1, mm3 ; p[-3] p[-2] | p[-1] p[ 0]
+ paddw mm2, mm4 ; p[ 1] p[ 2]
+
+ ; Now check thresholds
+ movq mm3, mm1 ; p[-3] p[-2] | p[-1] p[ 0]
+ punpckldq mm1, mm2 ; p[-3] p[-2] | p[ 1] p[ 2]
+ pand mm3, mm5 ; p[-3] 0 | p[-1] p[ 0]
+ pshufw mm1, mm1, 10011001b ; p[-2] p[ 1] | p[-2] p[ 1]
+ pandn mm5, mm2 ; 0 p[ 2]
+ movq mm2, mm1
+ por mm5, mm3 ; p[-3] p[ 2] | p[-1] p[ 0]
+%if ARCH_X86_32
+ mov srcq, [esp + stack_offset + 3*4]
+%define BETAS srcq
+%else
+%if WIN64
+%define BETAS r8
+%else
+%define BETAS rdx
+%endif
+%endif
+ psubusw mm1, mm5 ; p1p2 q1q2 | p1p0 q1q0
+ movq mm3, [BETAS] ; 0 4b | 0 b2
+ psubusw mm5, mm2 ; -p1p2 -q1q2 | -p1p0 -q1q0
+ pshufw mm3, mm3, 1010b ; b2 b2 | 4b 4b
+ por mm5, mm1 ; |p1p2|q1q2| | |p1p0|q1q0|
+ pcmpgtw mm3, mm5 ; p12<b2 q12<b2 | p10<4b q10<4b
+%if ARCH_X86_32
+ mov srcq, [esp + stack_offset + 4*4]
+ mov strideq, [esp + stack_offset + 5*4]
+%define EDGE srcq
+%define P1Q1 strideq
+%else
+%if WIN64
+%define EDGE r9d
+ mov strideq, [rsp + stack_offset + 5*8]
+%define P1Q1 strideq
+%else
+%define EDGE rcx
+%define P1Q1 r8
+%endif ; WIN64
+%endif
+ movq mm2, mm3
+ pshufw mm3, mm3, 11111010b
+ cmp EDGE, 0
+ psrld mm3, 31
+ movq [P1Q1], mm3
+ jz .ret0
+ pshufw mm2, mm2, 0101b
+ pand mm2, mm3
+ pshufw mm0, mm2, 10b
+ pand mm0, mm2
+ movd rax, mm0
+ REP_RET
+.ret0:
+ xor rax, rax
+ REP_RET
+
+cglobal rv40_h_loop_filter_strength, 3,3,0, src, stride, betas
+ pxor mm0, mm0
+ movd mm1, [srcq+0*strideq] ; p[0]
+ movd mm2, [srcq+1*strideq] ; p[1]
+ neg strideq
+ psadbw mm1, mm0
+ add srcq, strideq
+ psadbw mm2, mm0
+ movd mm3, [srcq+0*strideq] ; p[-1]
+ movd mm4, [srcq+1*strideq] ; p[-2]
+ psadbw mm3, mm0
+ psadbw mm4, mm0
+ punpcklwd mm3, mm1 ; p[-1] p[ 0]
+ punpcklwd mm4, mm2 ; p[-2] p[ 1]
+ movq mm5, mm4
+ movd mm1, [betasq] ; 0 4b
+ movd mm6, [betasq+4]
+ psubusw mm5, mm3
+ psubusw mm3, mm4
+ punpcklwd mm1, mm1 ; 4b 4b
+%if ARCH_X86_32
+ mov betasq, [esp + stack_offset + 4*4]
+%define EDGE betasq
+%else
+%if WIN64
+%define EDGE r9d
+%else
+%define EDGE rcx
+%endif ; WIN64
+%endif
+ por mm5, mm3
+ pcmpgtw mm1, mm5
+ cmp EDGE, 0
+ psrlw mm1, 15
+%if ARCH_X86_32
+ mov betasq, [esp + stack_offset + 5*4]
+%define P1Q1 betasq
+%else
+%if WIN64
+ mov betasq, [rsp + stack_offset + 5*8]
+%define P1Q1 betasq
+%else
+%define P1Q1 r8
+%endif ; WIN64
+%endif
+ pshufw mm3, mm1, 10011000b
+ movq [P1Q1], mm3
+ jz .ret0
+
+ movd mm2, [srcq+2*strideq] ; p[-3]
+ neg strideq
+ psadbw mm2, mm0
+ lea srcq, [srcq+2*strideq]
+ movd mm3, [srcq+strideq] ; p[ 2]
+ movq mm5, mm4
+ psadbw mm3, mm0
+ punpcklwd mm2, mm3 ; p[-3] p[2]
+ punpcklwd mm6, mm6
+ psubusw mm5, mm2
+ psubusw mm2, mm4
+ por mm2, mm5
+ pcmpgtw mm6, mm2
+ pand mm6, mm1
+ pshufw mm1, mm6, 1001b
+ pand mm1, mm6
+ movd eax, mm1
+ REP_RET
+.ret0:
+ xor rax, rax
+ REP_RET
+
+
+;-----------------------------------------------------------------------------
; subpel MC functions:
;
+; void put_rv40_qpel_[h|v]_<opt>(uint8_t *dst, int deststride,
diff --git a/libavcodec/x86/rv40dsp_init.c b/libavcodec/x86/rv40dsp_init.c
index 644ca77..7c0de62 100644
--- a/libavcodec/x86/rv40dsp_init.c
+++ b/libavcodec/x86/rv40dsp_init.c
@@ -173,6 +173,11 @@ QPEL_FUNCS_SET (OP, 3, 1, OPT) \
QPEL_FUNCS_SET (OP, 3, 2, OPT)
/** @} */

+int ff_rv40_v_loop_filter_strength_mmx2(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int edge, int32_t *p1q1);
+int ff_rv40_h_loop_filter_strength_mmx2(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int edge, int32_t *p1q1);
+
void ff_rv40dsp_init_x86(RV34DSPContext *c, DSPContext *dsp)
{
#if HAVE_YASM
@@ -196,6 +201,8 @@ void ff_rv40dsp_init_x86(RV34DSPContext *c, DSPContext *dsp)
if (mm_flags & AV_CPU_FLAG_MMX2) {
c->avg_chroma_pixels_tab[0] = ff_avg_rv40_chroma_mc8_mmx2;
c->avg_chroma_pixels_tab[1] = ff_avg_rv40_chroma_mc4_mmx2;
+ c->rv40_loop_filter_strength[0] = ff_rv40_h_loop_filter_strength_mmx2;
+ c->rv40_loop_filter_strength[1] = ff_rv40_v_loop_filter_strength_mmx2;
#if ARCH_X86_32
QPEL_MC_SET(avg_, _mmx2)
#endif
--
1.7.7.msysgit.0
Ronald S. Bultje
2012-04-17 23:29:02 UTC
Hi,

On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
<***@gmail.com> wrote:

The title should probably mention "loop filter _strength_".
Post by Christophe GISQUET
+cglobal rv40_v_loop_filter_strength, 2,2,0, src, stride
[..]
Post by Christophe GISQUET
+%if ARCH_X86_32
+    mov        srcq, [esp + stack_offset + 3*4]
+%define BETAS  srcq
+%else
+%if WIN64
+%define BETAS  r8
+%else
+%define BETAS  rdx
+%endif
+%endif
[..]
Post by Christophe GISQUET
+    movq        mm3, [BETAS]       ;    0    4b  |    0    b2
cglobal functionname, 2, 2, 0, src, stride, betas
..
%if ARCH_X86_32
mov srcq, betasm
%define betasq srcq
%endif
..
movq mm3, [betasq]

Is slightly easier to understand, imo.
Post by Christophe GISQUET
+%if ARCH_X86_32
+    mov        srcq, [esp + stack_offset + 4*4]
+    mov     strideq, [esp + stack_offset + 5*4]
+%define EDGE   srcq
+%define P1Q1   strideq
+%else
+%if WIN64
+%define EDGE    r9d
+    mov     strideq, [rsp + stack_offset + 5*8]
+%define P1Q1   strideq
+%else
+%define EDGE    rcx
+%define P1Q1    r8
+%endif ; WIN64
+%endif
Same here. Just define them as cglobal arguments so you can access them named.
Post by Christophe GISQUET
+cglobal rv40_h_loop_filter_strength, 3,3,0, src, stride, betas
[..]
Post by Christophe GISQUET
+%if ARCH_X86_32
+    mov      betasq, [esp + stack_offset + 4*4]
+%define EDGE   betasq
+%else
+%if WIN64
+%define EDGE    r9d
+%else
+%define EDGE    rcx
+%endif ; WIN64
+%endif
[..]
Post by Christophe GISQUET
+%if ARCH_X86_32
+    mov      betasq, [esp + stack_offset + 5*4]
+%define P1Q1    betasq
+%else
+%if WIN64
+    mov      betasq, [rsp + stack_offset + 5*8]
+%define P1Q1    betasq
+%else
+%define P1Q1    r8
+%endif ; WIN64
+%endif
Same here.

Ronald
Christophe Gisquet
2012-04-18 12:38:18 UTC
Hi,
Post by Ronald S. Bultje
Same here. Just define them as cglobal arguments so you can access them named.
Completely agree. I had gathered that this was possible, for
instance from the information in x86inc.asm, yet I had not figured out
how it was done. That should help immensely and avoid bug-prone code
such as this.

Christophe
Christophe Gisquet
2012-04-18 19:33:57 UTC
Post by Ronald S. Bultje
Same here. Just define them as cglobal arguments so you can access them named.
Here's an updated patch following these recommendations.

Note: it may be superseded by the patches next in the series, cf.
"rv40dsp: distinguish non-edge/edge versions".

Christophe
Christophe Gisquet
2012-05-11 15:42:36 UTC
Post by Christophe Gisquet
Here's an updated patch following these recommendations.
Note: it may be superseded by the patches next in the series, cf.
"rv40dsp: distinguish non-edge/edge versions".
Ping.

Also, Ronald, could you give your opinion on the topic raised in
"[PATCH 06/12] rv40dsp x86: distinguish non-edge/edge versions." ?

Mans seems OK with the interface change to distinguish edge cases. The
advantage is minor, so maybe you have another issue to point out, so we
can decide what to do (e.g. accept only the first dsp patch and drop the
second one, or collapse the two into a single one).
--
Christophe
Ronald S. Bultje
2012-05-11 16:00:16 UTC
Permalink
Hi,

On Fri, May 11, 2012 at 8:42 AM, Christophe Gisquet
Post by Christophe Gisquet
Post by Christophe Gisquet
Here's an updated patch following these recommendations.
Note: it may be superseded by the patches next in the series, cf.
"rv40dsp: distinguish non-edge/edge versions".
Ping.
Also, Ronald, could you give your opinion on the topic raised in
"[PATCH 06/12] rv40dsp x86: distinguish non-edge/edge versions." ?
Mans seems OK with the interface change to distinguish edge cases. The
advantage is minor, so maybe you have another issue to point out, so we
can decide what to do (e.g. accept only the first dsp patch and drop the
second one, or collapse the two into a single one).
Will look sometime this weekend.

Ronald
Christophe Gisquet
2012-05-11 16:13:27 UTC
Permalink
Post by Ronald S. Bultje
Will look sometime this weekend.
OK, I did think it was too much work for a quick reply, but I
preferred pinging instead of risking letting it slip.

Christophe
Ronald S. Bultje
2012-05-15 21:05:55 UTC
Permalink
Hi,

On Fri, May 11, 2012 at 9:13 AM, Christophe Gisquet
Post by Christophe Gisquet
Post by Ronald S. Bultje
Will look sometime this weekend.
OK, I did think it was too much work for a quick reply, but I
preferred pinging instead of risking letting it slip.
So, I looked both here and at the "split edge vs. non-edge" patch. I
think I like the split approach; it gives slightly better performance.
I'm wondering (just a suggestion, it's OK to say no) if it's possible
to merge the assembly for the edge/non-edge V loop filter functions.
Christophe GISQUET
2012-04-17 18:44:46 UTC
Permalink
             C           MMX
         32    64    32    64
h:       60    47    43    35
v:       66    57    33    31

h_ne:    45    36    27    24
h_e:     81    64    35    31
v_ne:    42    33    28    27
v_e:     74    53    41    32
---
libavcodec/rv34dsp.h | 4 +-
libavcodec/rv40.c | 2 +-
libavcodec/rv40dsp.c | 34 +++++++++----
libavcodec/x86/rv40dsp.asm | 110 +++++++++++++++++++++++++++--------------
libavcodec/x86/rv40dsp_init.c | 18 +++++--
5 files changed, 111 insertions(+), 57 deletions(-)

diff --git a/libavcodec/rv34dsp.h b/libavcodec/rv34dsp.h
index bdc189b..87b0e78 100644
--- a/libavcodec/rv34dsp.h
+++ b/libavcodec/rv34dsp.h
@@ -50,7 +50,7 @@ typedef void (*rv40_strong_loop_filter_func)(uint8_t *src, ptrdiff_t stride,
int dmode, int chroma);

typedef int (*rv40_loop_filter_strength_func)(uint8_t *src, ptrdiff_t stride,
- int32_t *betas, int edge, int32_t *p1q1);
+ int32_t *betas, int32_t *p1q1);

typedef struct RV34DSPContext {
qpel_mc_func put_pixels_tab[4][16];
@@ -69,7 +69,7 @@ typedef struct RV34DSPContext {
rv34_idct_dc_add_func rv34_idct_dc_add;
rv40_weak_loop_filter_func rv40_weak_loop_filter[2];
rv40_strong_loop_filter_func rv40_strong_loop_filter[2];
- rv40_loop_filter_strength_func rv40_loop_filter_strength[2];
+ rv40_loop_filter_strength_func rv40_loop_filter_strength[2][2];
} RV34DSPContext;

void ff_rv30dsp_init(RV34DSPContext *c, DSPContext* dsp);
diff --git a/libavcodec/rv40.c b/libavcodec/rv40.c
index d29698c..e140acb 100644
--- a/libavcodec/rv40.c
+++ b/libavcodec/rv40.c
@@ -305,7 +305,7 @@ static void rv40_adaptive_loop_filter(RV34DSPContext *rdsp,
int strong;
int lims;

- strong = rdsp->rv40_loop_filter_strength[dir](src, stride, betas, edge, p1q1);
+ strong = rdsp->rv40_loop_filter_strength[dir][edge](src, stride, betas, p1q1);

lims = p1q1[0] + p1q1[1] + ((lim_q1 + lim_p1) >> 1) + 1;

diff --git a/libavcodec/rv40dsp.c b/libavcodec/rv40dsp.c
index 9166671..e2a44e8 100644
--- a/libavcodec/rv40dsp.c
+++ b/libavcodec/rv40dsp.c
@@ -484,10 +484,10 @@ static av_always_inline int rv40_loop_filter_strength(uint8_t *src,
p1q1[0] = FFABS(sum_p1p0) < betas[0];
p1q1[1] = FFABS(sum_q1q0) < betas[0];

- if(!(p1q1[0]|p1q1[1]))
+ if (!edge)
return 0;

- if (!edge)
+ if(!(p1q1[0]|p1q1[1]))
return 0;

for (i = 0, ptr = src; i < 4; i++, ptr += stride) {
@@ -501,16 +501,28 @@ static av_always_inline int rv40_loop_filter_strength(uint8_t *src,
return strong0 && strong1;
}

-static int rv40_h_loop_filter_strength(uint8_t *src, ptrdiff_t stride,
- int32_t *betas, int edge, int32_t *p1q1)
+static int rv40_h_loop_filter_strength_ne(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int32_t *p1q1)
+{
+ return rv40_loop_filter_strength(src, stride, 1, betas, 0, p1q1);
+}
+
+static int rv40_v_loop_filter_strength_ne(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int32_t *p1q1)
+{
+ return rv40_loop_filter_strength(src, 1, stride, betas, 0, p1q1);
+}
+
+static int rv40_h_loop_filter_strength_e(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int32_t *p1q1)
{
- return rv40_loop_filter_strength(src, stride, 1, betas, edge, p1q1);
+ return rv40_loop_filter_strength(src, stride, 1, betas, 1, p1q1);
}

-static int rv40_v_loop_filter_strength(uint8_t *src, ptrdiff_t stride,
- int32_t *betas, int edge, int32_t *p1q1)
+static int rv40_v_loop_filter_strength_e(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int32_t *p1q1)
{
- return rv40_loop_filter_strength(src, 1, stride, betas, edge, p1q1);
+ return rv40_loop_filter_strength(src, 1, stride, betas, 1, p1q1);
}

av_cold void ff_rv40dsp_init(RV34DSPContext *c, DSPContext* dsp) {
@@ -596,8 +608,10 @@ av_cold void ff_rv40dsp_init(RV34DSPContext *c, DSPContext* dsp) {
c->rv40_weak_loop_filter[1] = rv40_v_weak_loop_filter;
c->rv40_strong_loop_filter[0] = rv40_h_strong_loop_filter;
c->rv40_strong_loop_filter[1] = rv40_v_strong_loop_filter;
- c->rv40_loop_filter_strength[0] = rv40_h_loop_filter_strength;
- c->rv40_loop_filter_strength[1] = rv40_v_loop_filter_strength;
+ c->rv40_loop_filter_strength[0][0] = rv40_h_loop_filter_strength_ne;
+ c->rv40_loop_filter_strength[1][0] = rv40_v_loop_filter_strength_ne;
+ c->rv40_loop_filter_strength[0][1] = rv40_h_loop_filter_strength_e;
+ c->rv40_loop_filter_strength[1][1] = rv40_v_loop_filter_strength_e;

if (HAVE_MMX)
ff_rv40dsp_init_x86(c, dsp);
diff --git a/libavcodec/x86/rv40dsp.asm b/libavcodec/x86/rv40dsp.asm
index 4c580da..2598037 100644
--- a/libavcodec/x86/rv40dsp.asm
+++ b/libavcodec/x86/rv40dsp.asm
@@ -78,13 +78,13 @@ SECTION .text
;-----------------------------------------------------------------------------
; loop filter strength functions:
; rv40_loop_filter_strength(uint8_t *src, ptrdiff_t stride,
-; int32_t* betas, int edge, int32_t *p1q1);
+; int32_t* betas, int32_t *p1q1);
; x86_32: all params on stack
-; win64: rcx,rdx,r8,r9, stack
-; rest: rdi,rsi,rdx,rcx,r8,r9,...
+; win64: rcx,rdx,r8 ,r9 , stack
+; rest: rdi,rsi,rdx,rcx,r8,r9,...
;-----------------------------------------------------------------------------
INIT_MMX mmx2
-cglobal rv40_v_loop_filter_strength, 2,2,0, src, stride
+cglobal rv40_v_loop_filter_strength_e, 2,2,0, src, stride
pxor mm0, mm0
movq mm5, [pw_mask]
movd mm1, [srcq+0*strideq-3]
@@ -137,36 +137,79 @@ cglobal rv40_v_loop_filter_strength, 2,2,0, src, stride
pcmpgtw mm3, mm5 ; p12<b2 q12<b2 | p10<4b q10<4b
%if ARCH_X86_32
mov srcq, [esp + stack_offset + 4*4]
- mov strideq, [esp + stack_offset + 5*4]
-%define EDGE srcq
-%define P1Q1 strideq
+%define P1Q1 srcq
%else
%if WIN64
-%define EDGE r9d
- mov strideq, [rsp + stack_offset + 5*8]
-%define P1Q1 strideq
+%define P1Q1 r9
%else
-%define EDGE rcx
-%define P1Q1 r8
+%define P1Q1 rcx
%endif ; WIN64
%endif
movq mm2, mm3
pshufw mm3, mm3, 11111010b
- cmp EDGE, 0
psrld mm3, 31
- movq [P1Q1], mm3
- jz .ret0
pshufw mm2, mm2, 0101b
+ movq [P1Q1], mm3
pand mm2, mm3
pshufw mm0, mm2, 10b
pand mm0, mm2
movd rax, mm0
REP_RET
-.ret0:
+
+cglobal rv40_v_loop_filter_strength_ne, 2,2,0, src, stride
+ pxor mm0, mm0
+ movd mm1, [srcq+0*strideq-2] ; p[-2] p[-1] | p[ 0] p[ 1]
+ movd mm2, [srcq+1*strideq-2]
+ lea srcq, [srcq+2*strideq]
+ punpcklbw mm1, mm0
+ punpcklbw mm2, mm0
+ movd mm3, [srcq+0*strideq-2]
+ movd mm4, [srcq+1*strideq-2]
+ punpcklbw mm3, mm0
+ punpcklbw mm4, mm0
+ paddw mm1, mm3
+ paddw mm2, mm4
+ paddw mm1, mm2
+
+ ; Now check thresholds
+%if ARCH_X86_32
+ mov srcq, [esp + stack_offset + 3*4]
+%define BETAS srcq
+%else
+%if WIN64
+%define BETAS r8
+%else
+%define BETAS rdx
+%endif ; WIN64
+%endif
+ pshufw mm3, mm1, 11110000b ; p[-2] p[-2] | p[ 1] p[ 1]
+ pshufw mm2, mm1, 10100101b ; p[-1] p[-1] | p[ 0] p[ 0]
+ movq mm4, mm3
+ movd mm0, [BETAS] ; 0 b2
+ psubusw mm3, mm2
+ psubusw mm2, mm4
+ pshufw mm0, mm0, 0
+ por mm2, mm3
+
+ pcmpgtw mm0, mm2 ; p12<b2 q12<b2 | p10<4b q10<4b
+%if ARCH_X86_32
+ mov strideq, [esp + stack_offset + 4*4]
+%define P1Q1 strideq
+%else
+%if WIN64
+%define P1Q1 r9
+%else
+%define P1Q1 rcx
+%endif ; WIN64
+%endif
+ psrld mm0, 31
+ movq [P1Q1], mm0
xor rax, rax
REP_RET

-cglobal rv40_h_loop_filter_strength, 3,3,0, src, stride, betas
+%macro H_LOOP_FILTER 1
+cglobal rv40_h_loop_filter_strength_%1, 3,3,0, src, stride, betas
pxor mm0, mm0
movd mm1, [srcq+0*strideq] ; p[0]
movd mm2, [srcq+1*strideq] ; p[1]
@@ -182,39 +225,28 @@ cglobal rv40_h_loop_filter_strength, 3,3,0, src, stride, betas
punpcklwd mm4, mm2 ; p[-2] p[ 1]
movq mm5, mm4
movd mm1, [betasq] ; 0 4b
+%ifidn %1, e
movd mm6, [betasq+4]
+%endif
psubusw mm5, mm3
psubusw mm3, mm4
punpcklwd mm1, mm1 ; 4b 4b
-%if ARCH_X86_32
- mov betasq, [esp + stack_offset + 4*4]
-%define EDGE betasq
-%else
-%if WIN64
-%define EDGE r9d
-%else
-%define EDGE rcx
-%endif ; WIN64
-%endif
por mm5, mm3
pcmpgtw mm1, mm5
- cmp EDGE, 0
- psrlw mm1, 15
%if ARCH_X86_32
- mov betasq, [esp + stack_offset + 5*4]
+ mov betasq, [esp + stack_offset + 4*4]
%define P1Q1 betasq
%else
%if WIN64
- mov betasq, [rsp + stack_offset + 5*8]
-%define P1Q1 betasq
+%define P1Q1 r9
%else
-%define P1Q1 r8
+%define P1Q1 rcx
%endif ; WIN64
%endif
+ psrlw mm1, 15
pshufw mm3, mm1, 10011000b
movq [P1Q1], mm3
- jz .ret0
-
+%ifidn %1, e
movd mm2, [srcq+2*strideq] ; p[-3]
neg strideq
psadbw mm2, mm0
@@ -232,11 +264,13 @@ cglobal rv40_h_loop_filter_strength, 3,3,0, src, stride, betas
pshufw mm1, mm6, 1001b
pand mm1, mm6
movd eax, mm1
- REP_RET
-.ret0:
+%else
xor rax, rax
+%endif
REP_RET
-
+%endmacro
+H_LOOP_FILTER ne
+H_LOOP_FILTER e

;-----------------------------------------------------------------------------
; subpel MC functions:
diff --git a/libavcodec/x86/rv40dsp_init.c b/libavcodec/x86/rv40dsp_init.c
index 7c0de62..6e56e90 100644
--- a/libavcodec/x86/rv40dsp_init.c
+++ b/libavcodec/x86/rv40dsp_init.c
@@ -173,10 +173,14 @@ QPEL_FUNCS_SET (OP, 3, 1, OPT) \
QPEL_FUNCS_SET (OP, 3, 2, OPT)
/** @} */

-int ff_rv40_v_loop_filter_strength_mmx2(uint8_t *src, ptrdiff_t stride,
- int32_t *betas, int edge, int32_t *p1q1);
-int ff_rv40_h_loop_filter_strength_mmx2(uint8_t *src, ptrdiff_t stride,
- int32_t *betas, int edge, int32_t *p1q1);
+int ff_rv40_v_loop_filter_strength_ne_mmx2(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int32_t *p1q1);
+int ff_rv40_h_loop_filter_strength_ne_mmx2(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int32_t *p1q1);
+int ff_rv40_v_loop_filter_strength_e_mmx2(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int32_t *p1q1);
+int ff_rv40_h_loop_filter_strength_e_mmx2(uint8_t *src, ptrdiff_t stride,
+ int32_t *betas, int32_t *p1q1);

void ff_rv40dsp_init_x86(RV34DSPContext *c, DSPContext *dsp)
{
@@ -201,8 +205,10 @@ void ff_rv40dsp_init_x86(RV34DSPContext *c, DSPContext *dsp)
if (mm_flags & AV_CPU_FLAG_MMX2) {
c->avg_chroma_pixels_tab[0] = ff_avg_rv40_chroma_mc8_mmx2;
c->avg_chroma_pixels_tab[1] = ff_avg_rv40_chroma_mc4_mmx2;
- c->rv40_loop_filter_strength[0] = ff_rv40_h_loop_filter_strength_mmx2;
- c->rv40_loop_filter_strength[1] = ff_rv40_v_loop_filter_strength_mmx2;
+ c->rv40_loop_filter_strength[0][0] = ff_rv40_h_loop_filter_strength_ne_mmx2;
+ c->rv40_loop_filter_strength[1][0] = ff_rv40_v_loop_filter_strength_ne_mmx2;
+ c->rv40_loop_filter_strength[0][1] = ff_rv40_h_loop_filter_strength_e_mmx2;
+ c->rv40_loop_filter_strength[1][1] = ff_rv40_v_loop_filter_strength_e_mmx2;
#if ARCH_X86_32
QPEL_MC_SET(avg_, _mmx2)
#endif
--
1.7.7.msysgit.0
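
To summarize the interface change above: the edge flag moves from a
function argument to a second index into the dsp table, so each variant
can drop the runtime test. A minimal standalone sketch of the resulting
dispatch in C (types and struct simplified, not the actual libavcodec
declarations):

/* dir selects horizontal/vertical, edge selects the non-edge/edge
 * specialization; the callee no longer receives `edge` at runtime. */
#include <stddef.h>
#include <stdint.h>

typedef int (*strength_fn)(uint8_t *src, ptrdiff_t stride,
                           int32_t *betas, int32_t *p1q1);

typedef struct {
    strength_fn rv40_loop_filter_strength[2][2]; /* [dir][edge] */
} DSPSketch;

static int call_strength(DSPSketch *dsp, uint8_t *src, ptrdiff_t stride,
                         int32_t *betas, int32_t *p1q1, int dir, int edge)
{
    /* was: dsp->rv40_loop_filter_strength[dir](src, stride, betas,
     *                                          edge, p1q1); */
    return dsp->rv40_loop_filter_strength[dir][edge](src, stride, betas,
                                                     p1q1);
}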
Christophe Gisquet
2012-04-17 18:56:39 UTC
Permalink
This will also break the NEON asm.
Måns Rullgård
2012-04-17 21:38:03 UTC
Permalink
Post by Christophe Gisquet
This will also break the NEON asm.
Which means the x86 tag is incorrect.
--
Måns Rullgård
***@mansr.com
Christophe Gisquet
2012-04-18 19:30:09 UTC
Permalink
Post by Måns Rullgård
Post by Christophe Gisquet
This will also break the NEON asm.
Which means the x86 tag is incorrect.
So how should I go about it? I have split the C part from the x86 part
in the attached patch, but should I deactivate the neon code too?

Also, this patch depends on patch "rearrange parameters to loop filter
strength functions", and the final patch would depend on the consensus
reached, and may have assembly code folded in.

Christophe
Måns Rullgård
2012-04-18 20:30:14 UTC
Permalink
Post by Christophe Gisquet
Post by Måns Rullgård
Post by Christophe Gisquet
This will also break the NEON asm.
Which means the x86 tag is incorrect.
So how should I go about it? I have split the C part from the x86 part
in the attached patch, but should I deactivate the neon code too?
If you need to change functions that have asm versions, you must make
sure the relevant people are made aware so they can update the code.
The patch should not be pushed until the existing asm has been fixed or
said developers have approved of temporarily disabling the functions.

In this particular case, I'd ask that you hammer out the x86 asm first,
then I'll update the NEON code.
--
Måns Rullgård
***@mansr.com
Ronald S. Bultje
2012-04-28 18:05:34 UTC
Permalink
Hi,
Post by Måns Rullgård
Post by Christophe Gisquet
Post by Måns Rullgård
Post by Christophe Gisquet
This will also break the NEON asm.
Which means the x86 tag is incorrect.
So how should I go about it? I have split the C part from the x86 part
in the attached patch, but should I deactivate the neon code too?
If you need to change functions that have asm versions, you must make
sure the relevant people are made aware so they can update the code.
The patch should not be pushed until the existing asm has been fixed or
said developers have approved of temporarily disabling the functions.
In this particular case, I'd ask that you hammer out the x86 asm first,
then I'll update the NEON code.
Progress on neon asm part?

Ronald
Christophe Gisquet
2012-05-10 13:49:20 UTC
Permalink
Hi,
Post by Ronald S. Bultje
Progress on neon asm part?
I think the intent was also to discuss some of the choices I've made.
There are two implementations: one distinguishing edge cases (it seemed
to help, especially with inlining of the loop filter function), the other
not.

Both also modified the prototypes, indeed requiring changes to any dsp
implementation such as neon. Neither Mans nor any neon expert stated
an opinion.

Christophe
Måns Rullgård
2012-05-10 14:15:05 UTC
Permalink
Post by Christophe Gisquet
Hi,
Post by Ronald S. Bultje
Progress on neon asm part?
I think the intent was also to discuss some of the choices I've made.
There are two implementations: one distinguishing edge cases (it seemed
to help, especially with inlining of the loop filter function), the other
not.
Both also modified the prototypes, indeed requiring changes to any dsp
implementation such as neon. Neither Mans nor any neon expert stated
an opinion.
The new interface should not be a problem. Have you guys settled the
x86 stuff so the interface is stable?
--
Måns Rullgård
***@mansr.com
Christophe Gisquet
2012-05-10 14:31:51 UTC
Permalink
Hi,
Post by Christophe Gisquet
Both also modified the prototypes, requiring indeed changes to any dsp
implementation such as neon. Neither Mans nor any neon expert stated
an opinion.
The new interface should not be a problem.  Have you guys settled the
x86 stuff so the interface is stable?
The only discussion I recall having with Ronald on this topic was
"[PATCH 05/12] rv40dsp x86: start implementing loop filtering". It was
mostly about a cleaner way to write x86 asm code, but not related to
the interface. I have addressed that as far as I know, but let us wait
for Ronald to confirm I did, and whether he has an opinion on either of
the two proposed interfaces (i.e. edge-aware vs. not).

Christophe
Christophe Gisquet
2012-04-18 19:31:25 UTC
Permalink
2012/4/17 Christophe GISQUET <***@gmail.com>:
[SNIP]

Here's a new patch for the hunks of the old patch related to x86.

Christophe
Christophe GISQUET
2012-04-17 18:44:47 UTC
Permalink
3% speedup on x86-64.
---
libavcodec/rv40.c | 11 ++++++-----
1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/libavcodec/rv40.c b/libavcodec/rv40.c
index e140acb..3466d60 100644
--- a/libavcodec/rv40.c
+++ b/libavcodec/rv40.c
@@ -294,11 +294,12 @@ enum RV40BlockPos{
static const int neighbour_offs_x[4] = { 0, 0, -1, 0 };
static const int neighbour_offs_y[4] = { 0, -1, 0, 1 };

-static void rv40_adaptive_loop_filter(RV34DSPContext *rdsp,
- uint8_t *src, int stride, int dmode,
- int lim_q1, int lim_p1,
- int alpha, int beta, int beta2,
- int chroma, int edge, int dir)
+static av_always_inline void
+rv40_adaptive_loop_filter(RV34DSPContext *rdsp,
+ uint8_t *src, int stride, int dmode,
+ int lim_q1, int lim_p1,
+ int alpha, int beta, int beta2,
+ int chroma, int edge, int dir)
{
int32_t betas[2] = { beta*4, beta2 };
int32_t p1q1[2];
--
1.7.7.msysgit.0
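
The gain is the usual force-inline effect: once
rv40_adaptive_loop_filter is expanded into each call site, parameters
such as chroma, edge and dir become compile-time constants there and the
branches on them fold away. A contrived standalone illustration
(hypothetical helper, using the GCC attribute that libavutil's
av_always_inline expands to):

/* With the helper force-inlined, `edge` is a literal at each call
 * site, so the compiler specializes the body and removes the test. */
#define forced_inline inline __attribute__((always_inline))

static forced_inline int filter_px(int p0, int q0, int edge)
{
    if (edge)                       /* folded when `edge` is constant */
        return (p0 + q0 + 1) >> 1;
    return p0;
}

int two_sites(int p0, int q0)
{
    return filter_px(p0, q0, 0) + filter_px(p0, q0, 1);
}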
Christophe Gisquet
2012-04-17 19:11:04 UTC
Permalink
Post by Christophe GISQUET
3% speedup on x86-64.
This may depend on the previous patches, and the host, so a test on
another architecture such as arm is probably needed at some point.
Ronald S. Bultje
2012-04-17 23:30:08 UTC
Permalink
Hi,

On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
Post by Christophe GISQUET
3% speedup on x86-64.
---
 libavcodec/rv40.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)
I'm fine with this.

Ronald
Christophe GISQUET
2012-04-17 18:44:48 UTC
Permalink
From: Christophe Gisquet <***@gmail.com>

120->100 cycles.
---
libavcodec/rv40.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/libavcodec/rv40.c b/libavcodec/rv40.c
index 3466d60..7e26bb6 100644
--- a/libavcodec/rv40.c
+++ b/libavcodec/rv40.c
@@ -228,8 +228,6 @@ static int rv40_decode_mb_info(RV34DecContext *r)
int q, i;
int prev_type = 0;
int mb_pos = s->mb_x + s->mb_y * s->mb_stride;
- int blocks[RV34_MB_TYPES] = {0};
- int count = 0;

if(!r->s.mb_skip_run)
r->s.mb_skip_run = svq3_get_ue_golomb(gb) + 1;
@@ -237,22 +235,27 @@ static int rv40_decode_mb_info(RV34DecContext *r)
if(--r->s.mb_skip_run)
return RV34_MB_SKIP;

- if(r->avail_cache[6-1])
- blocks[r->mb_type[mb_pos - 1]]++;
if(r->avail_cache[6-4]){
+ int blocks[RV34_MB_TYPES] = {0};
+ int count = 0;
+ if(r->avail_cache[6-1])
+ blocks[r->mb_type[mb_pos - 1]]++;
blocks[r->mb_type[mb_pos - s->mb_stride]]++;
if(r->avail_cache[6-2])
blocks[r->mb_type[mb_pos - s->mb_stride + 1]]++;
if(r->avail_cache[6-5])
blocks[r->mb_type[mb_pos - s->mb_stride - 1]]++;
- }
-
- for(i = 0; i < RV34_MB_TYPES; i++){
- if(blocks[i] > count){
- count = blocks[i];
- prev_type = i;
+ for(i = 0; i < RV34_MB_TYPES; i++){
+ if(blocks[i] > count){
+ count = blocks[i];
+ prev_type = i;
+ if(count>1)
+ break;
+ }
}
- }
+ } else if (r->avail_cache[6-1])
+ prev_type = r->mb_type[mb_pos - 1];
+
if(s->pict_type == AV_PICTURE_TYPE_P){
prev_type = block_num_to_ptype_vlc_num[prev_type];
q = get_vlc2(gb, ptype_vlc[prev_type].table, PTYPE_VLC_BITS, 1);
--
1.7.7.msysgit.0
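
The early break is sound because at most four neighbours are counted:
once some type reaches a count of 2, no later type can end up strictly
higher, and on a tie the original loop also kept the first maximum. A
standalone sketch of that argument (hypothetical function name, NB_TYPES
standing in for RV34_MB_TYPES):

/* Plurality vote over at most 4 neighbour types, with the early exit
 * from the patch: a count of 2 out of 4 cannot be strictly beaten. */
enum { NB_TYPES = 12 };            /* stand-in for RV34_MB_TYPES */

static int plurality_type(const int *types, int n /* n <= 4 */)
{
    int blocks[NB_TYPES] = { 0 };
    int count = 0, prev_type = 0, i;

    for (i = 0; i < n; i++)
        blocks[types[i]]++;

    for (i = 0; i < NB_TYPES; i++) {
        if (blocks[i] > count) {
            count     = blocks[i];
            prev_type = i;
            if (count > 1)         /* nothing later can exceed 2 of 4 */
                break;
        }
    }
    return prev_type;
}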
Ronald S. Bultje
2012-04-17 23:33:09 UTC
Permalink
On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
Post by Christophe GISQUET
120->100 cycles.
---
 libavcodec/rv40.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)
LGTM.

Ronald
Ronald S. Bultje
2012-04-28 18:06:59 UTC
Permalink
Hi,
Post by Ronald S. Bultje
On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
Post by Christophe GISQUET
120->100 cycles.
---
 libavcodec/rv40.c |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)
LGTM.
Pushed.

Ronald
Christophe GISQUET
2012-04-17 18:44:49 UTC
Permalink
From: Christophe Gisquet <***@gmail.com>

is_block2 was always 0, so just remove it, and change the code accordingly.
---
libavcodec/rv34.c | 17 ++++++-----------
1 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
index 1247569..4c546c2 100644
--- a/libavcodec/rv34.c
+++ b/libavcodec/rv34.c
@@ -260,20 +260,15 @@ static inline void decode_subblock1(DCTELEM *dst, int code, GetBitContext *gb, V
decode_coeff(dst, coeff, 3, gb, vlc, q);
}

-static inline void decode_subblock3(DCTELEM *dst, int code, const int is_block2, GetBitContext *gb, VLC *vlc,
+static inline void decode_subblock3(DCTELEM *dst, int code, GetBitContext *gb, VLC *vlc,
int q_dc, int q_ac1, int q_ac2)
{
int flags = modulo_three_table[code];

- decode_coeff( dst+0*4+0, (flags >> 6) , 3, gb, vlc, q_dc);
- if(is_block2){
- decode_coeff(dst+1*4+0, (flags >> 4) & 3, 2, gb, vlc, q_ac1);
- decode_coeff(dst+0*4+1, (flags >> 2) & 3, 2, gb, vlc, q_ac1);
- }else{
- decode_coeff(dst+0*4+1, (flags >> 4) & 3, 2, gb, vlc, q_ac1);
- decode_coeff(dst+1*4+0, (flags >> 2) & 3, 2, gb, vlc, q_ac1);
- }
- decode_coeff( dst+1*4+1, (flags >> 0) & 3, 2, gb, vlc, q_ac2);
+ decode_coeff(dst+0*4+0, (flags >> 6) , 3, gb, vlc, q_dc);
+ decode_coeff(dst+0*4+1, (flags >> 4) & 3, 2, gb, vlc, q_ac1);
+ decode_coeff(dst+1*4+0, (flags >> 2) & 3, 2, gb, vlc, q_ac1);
+ decode_coeff(dst+1*4+1, (flags >> 0) & 3, 2, gb, vlc, q_ac2);
}

/**
@@ -298,7 +293,7 @@ static inline int rv34_decode_block(DCTELEM *dst, GetBitContext *gb, RV34VLC *rv
code >>= 3;

if (modulo_three_table[code] & 0x3F) {
- decode_subblock3(dst, code, 0, gb, &rvlc->coefficient, q_dc, q_ac1, q_ac2);
+ decode_subblock3(dst, code, gb, &rvlc->coefficient, q_dc, q_ac1, q_ac2);
} else {
decode_subblock1(dst, code, gb, &rvlc->coefficient, q_dc);
if (!pattern)
--
1.7.7.msysgit.0
Måns Rullgård
2012-04-17 19:52:34 UTC
Permalink
Post by Christophe GISQUET
is_block2 was always 0, so just remove it, and change the code accordingly.
---
libavcodec/rv34.c | 17 ++++++-----------
1 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
index 1247569..4c546c2 100644
--- a/libavcodec/rv34.c
+++ b/libavcodec/rv34.c
@@ -260,20 +260,15 @@ static inline void decode_subblock1(DCTELEM *dst, int code, GetBitContext *gb, V
decode_coeff(dst, coeff, 3, gb, vlc, q);
}
-static inline void decode_subblock3(DCTELEM *dst, int code, const int is_block2, GetBitContext *gb, VLC *vlc,
+static inline void decode_subblock3(DCTELEM *dst, int code, GetBitContext *gb, VLC *vlc,
int q_dc, int q_ac1, int q_ac2)
{
int flags = modulo_three_table[code];
- decode_coeff( dst+0*4+0, (flags >> 6) , 3, gb, vlc, q_dc);
- if(is_block2){
- decode_coeff(dst+1*4+0, (flags >> 4) & 3, 2, gb, vlc, q_ac1);
- decode_coeff(dst+0*4+1, (flags >> 2) & 3, 2, gb, vlc, q_ac1);
- }else{
- decode_coeff(dst+0*4+1, (flags >> 4) & 3, 2, gb, vlc, q_ac1);
- decode_coeff(dst+1*4+0, (flags >> 2) & 3, 2, gb, vlc, q_ac1);
- }
- decode_coeff( dst+1*4+1, (flags >> 0) & 3, 2, gb, vlc, q_ac2);
+ decode_coeff(dst+0*4+0, (flags >> 6) , 3, gb, vlc, q_dc);
+ decode_coeff(dst+0*4+1, (flags >> 4) & 3, 2, gb, vlc, q_ac1);
+ decode_coeff(dst+1*4+0, (flags >> 2) & 3, 2, gb, vlc, q_ac1);
+ decode_coeff(dst+1*4+1, (flags >> 0) & 3, 2, gb, vlc, q_ac2);
}
/**
@@ -298,7 +293,7 @@ static inline int rv34_decode_block(DCTELEM *dst, GetBitContext *gb, RV34VLC *rv
code >>= 3;
if (modulo_three_table[code] & 0x3F) {
- decode_subblock3(dst, code, 0, gb, &rvlc->coefficient, q_dc, q_ac1, q_ac2);
+ decode_subblock3(dst, code, gb, &rvlc->coefficient, q_dc, q_ac1, q_ac2);
} else {
decode_subblock1(dst, code, gb, &rvlc->coefficient, q_dc);
if (!pattern)
--
The change looks OK, but it is more than "cosmetics."
--
Måns Rullgård
***@mansr.com
Christophe Gisquet
2012-04-19 21:21:44 UTC
Permalink
Post by Måns Rullgård
The change looks OK, but it is more than "cosmetics."
OK. Patch with new title "rv34: remove constant parameter" attached.
Ronald S. Bultje
2012-04-28 18:08:31 UTC
Permalink
Hi,

On Thu, Apr 19, 2012 at 2:21 PM, Christophe Gisquet
Post by Christophe Gisquet
Post by Måns Rullgård
The change looks OK, but it is more than "cosmetics."
OK. Patch with new title "rv34: remove constant parameter" attached.
Pushed.

Ronald
Christophe GISQUET
2012-04-17 18:44:50 UTC
Permalink
From: Christophe Gisquet <***@gmail.com>

---
libavcodec/rv34.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
index 4c546c2..a3d3fcd 100644
--- a/libavcodec/rv34.c
+++ b/libavcodec/rv34.c
@@ -313,7 +313,7 @@ static inline int rv34_decode_block(DCTELEM *dst, GetBitContext *gb, RV34VLC *rv
code = get_vlc2(gb, rvlc->third_pattern[sc].table, 9, 2);
decode_subblock(dst + 4*2+2, code, 0, gb, &rvlc->coefficient, q_ac2);
}
- return has_ac || pattern;
+ return has_ac | pattern;
}

/**
--
1.7.7.msysgit.0
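
Both operands here are already-computed integers with no side effects,
and the caller only tests the return value against zero, so the bitwise
form is equivalent while avoiding the conditional branch that
short-circuit evaluation may imply. A contrived comparison (hypothetical
names):

/* `||` must short-circuit and yield exactly 0 or 1, which often costs
 * a branch; `|` just ORs the flags, and any nonzero result is still
 * "true" for a caller that only compares against zero. */
static int or_branchy(int has_ac, int pattern)
{
    return has_ac || pattern;
}

static int or_branchless(int has_ac, int pattern)
{
    return has_ac | pattern;
}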
Måns Rullgård
2012-04-17 19:54:43 UTC
Permalink
Post by Christophe GISQUET
---
libavcodec/rv34.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/libavcodec/rv34.c b/libavcodec/rv34.c
index 4c546c2..a3d3fcd 100644
--- a/libavcodec/rv34.c
+++ b/libavcodec/rv34.c
@@ -313,7 +313,7 @@ static inline int rv34_decode_block(DCTELEM *dst, GetBitContext *gb, RV34VLC *rv
code = get_vlc2(gb, rvlc->third_pattern[sc].table, 9, 2);
decode_subblock(dst + 4*2+2, code, 0, gb, &rvlc->coefficient, q_ac2);
}
- return has_ac || pattern;
+ return has_ac | pattern;
}
/**
--
Is this function not actually inlined? I would not expect this to make
a difference where it is.
--
Måns Rullgård
***@mansr.com
Christophe Gisquet
2012-04-19 21:16:58 UTC
Permalink
Hi,
Is this function not actually inlined?  I would not expect this to make
a difference where it is.
Good question: it is not inlined, and if it is forced to with
av_always_inline, that's an 8KB increase in object size, and a 2% speed
loss here (core i5).
I imagine the same, if not more, for arm. Seeing this, it may make
sense to remove that inline keyword and add a comment mentioning this.

Now, without the code being inlined, the disassembly shows 2 fewer
instructions for x86_32 and 1 fewer for x86_64. Not something that will
matter in the end.

Christophe
Måns Rullgård
2012-04-19 21:26:21 UTC
Permalink
Post by Christophe Gisquet
Hi,
Is this function not actually inlined?  I would not expect this to make
a difference where it is.
Good question: it is not inlined, and if it is forced to with
av_always_inline, that's an 8KB increase in object size, and a 2% speed
loss here (core i5).
I imagine the same, if not more, for arm. Seeing this, it may make
sense to remove that inline keyword and add a comment mentioning this.
Patch OK then.
--
Måns Rullgård
***@mansr.com
Ronald S. Bultje
2012-04-28 18:11:58 UTC
Permalink
Hi,
Post by Måns Rullgård
Post by Christophe Gisquet
Is this function not actually inlined?  I would not expect this to make
a difference where it is.
Good question: it is not inlined, and if it is forced to with
av_always_inline, that's an 8KB increase in object size, and a 2% speed
loss here (core i5).
I imagine the same, if not more, for arm. Seeing this, it may make
sense to remove that inline keyword and add a comment mentioning this.
Patch OK then.
Patch pushed, and removal of inline keyword submitted as new patch.
Not sure who to assign as author, I can --author=Christophe since he
actually measured it.

Ronald
Christophe GISQUET
2012-04-17 18:44:51 UTC
Permalink
From: Christophe Gisquet <***@gmail.com>

Down from 95 kcycles to 93 (including all called functions).
---
libavcodec/rv40.c | 16 ++++++++--------
1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/libavcodec/rv40.c b/libavcodec/rv40.c
index 7e26bb6..1e94118 100644
--- a/libavcodec/rv40.c
+++ b/libavcodec/rv40.c
@@ -432,7 +432,7 @@ static void rv40_loop_filter(RV34DecContext *r, int row)
y_v_deblock &= ~MASK_Y_LEFT_COL;
if(!row)
y_h_deblock &= ~MASK_Y_TOP_ROW;
- if(row == s->mb_height - 1 || (mb_strong[POS_CUR] || mb_strong[POS_BOTTOM]))
+ if(row == s->mb_height - 1 || (mb_strong[POS_CUR] | mb_strong[POS_BOTTOM]))
y_h_deblock &= ~(MASK_Y_TOP_ROW << 16);
/* Calculating chroma patterns is similar and easier since there is
* no motion vector pattern for them.
@@ -449,7 +449,7 @@ static void rv40_loop_filter(RV34DecContext *r, int row)
c_v_deblock[i] &= ~MASK_C_LEFT_COL;
if(!row)
c_h_deblock[i] &= ~MASK_C_TOP_ROW;
- if(row == s->mb_height - 1 || mb_strong[POS_CUR] || mb_strong[POS_BOTTOM])
+ if(row == s->mb_height - 1 || (mb_strong[POS_CUR] | mb_strong[POS_BOTTOM]))
c_h_deblock[i] &= ~(MASK_C_TOP_ROW << 4);
}

@@ -470,7 +470,7 @@ static void rv40_loop_filter(RV34DecContext *r, int row)
0, 0, 0);
}
// filter left block edge in ordinary mode (with low filtering strength)
- if(y_v_deblock & (MASK_CUR << ij) && (i || !(mb_strong[POS_CUR] || mb_strong[POS_LEFT]))){
+ if(y_v_deblock & (MASK_CUR << ij) && (i || !(mb_strong[POS_CUR] | mb_strong[POS_LEFT]))){
if(!i)
clip_left = mvmasks[POS_LEFT] & (MASK_RIGHT << j) ? clip[POS_LEFT] : 0;
else
@@ -481,14 +481,14 @@ static void rv40_loop_filter(RV34DecContext *r, int row)
alpha, beta, betaY, 0, 0, 1);
}
// filter top edge of the current macroblock when filtering strength is high
- if(!j && y_h_deblock & (MASK_CUR << i) && (mb_strong[POS_CUR] || mb_strong[POS_TOP])){
+ if(!j && y_h_deblock & (MASK_CUR << i) && (mb_strong[POS_CUR] | mb_strong[POS_TOP])){
rv40_adaptive_loop_filter(&r->rdsp, Y, s->linesize, dither,
clip_cur,
mvmasks[POS_TOP] & (MASK_TOP << i) ? clip[POS_TOP] : 0,
alpha, beta, betaY, 0, 1, 0);
}
// filter left block edge in edge mode (with high filtering strength)
- if(y_v_deblock & (MASK_CUR << ij) && !i && (mb_strong[POS_CUR] || mb_strong[POS_LEFT])){
+ if(y_v_deblock & (MASK_CUR << ij) && !i && (mb_strong[POS_CUR] | mb_strong[POS_LEFT])){
clip_left = mvmasks[POS_LEFT] & (MASK_RIGHT << j) ? clip[POS_LEFT] : 0;
rv40_adaptive_loop_filter(&r->rdsp, Y, s->linesize, dither,
clip_cur,
@@ -510,7 +510,7 @@ static void rv40_loop_filter(RV34DecContext *r, int row)
clip_cur,
alpha, beta, betaC, 1, 0, 0);
}
- if((c_v_deblock[k] & (MASK_CUR << ij)) && (i || !(mb_strong[POS_CUR] || mb_strong[POS_LEFT]))){
+ if((c_v_deblock[k] & (MASK_CUR << ij)) && (i || !(mb_strong[POS_CUR] | mb_strong[POS_LEFT]))){
if(!i)
clip_left = uvcbp[POS_LEFT][k] & (MASK_CUR << (2*j+1)) ? clip[POS_LEFT] : 0;
else
@@ -520,14 +520,14 @@ static void rv40_loop_filter(RV34DecContext *r, int row)
clip_left,
alpha, beta, betaC, 1, 0, 1);
}
- if(!j && c_h_deblock[k] & (MASK_CUR << ij) && (mb_strong[POS_CUR] || mb_strong[POS_TOP])){
+ if(!j && c_h_deblock[k] & (MASK_CUR << ij) && (mb_strong[POS_CUR] | mb_strong[POS_TOP])){
int clip_top = uvcbp[POS_TOP][k] & (MASK_CUR << (ij+2)) ? clip[POS_TOP] : 0;
rv40_adaptive_loop_filter(&r->rdsp, C, s->uvlinesize, i*8,
clip_cur,
clip_top,
alpha, beta, betaC, 1, 1, 0);
}
- if(c_v_deblock[k] & (MASK_CUR << ij) && !i && (mb_strong[POS_CUR] || mb_strong[POS_LEFT])){
+ if(c_v_deblock[k] & (MASK_CUR << ij) && !i && (mb_strong[POS_CUR] | mb_strong[POS_LEFT])){
clip_left = uvcbp[POS_LEFT][k] & (MASK_CUR << (2*j+1)) ? clip[POS_LEFT] : 0;
rv40_adaptive_loop_filter(&r->rdsp, C, s->uvlinesize, j*8,
clip_cur,
--
1.7.7.msysgit.0
Ronald S. Bultje
2012-04-17 23:34:50 UTC
Permalink
Hi,

On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
Post by Christophe GISQUET
Down from 95 kcycles to 93 (including all called functions).
---
 libavcodec/rv40.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)
OK.

Ronald
Ronald S. Bultje
2012-04-28 18:13:17 UTC
Permalink
Hi,
Post by Ronald S. Bultje
On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
Post by Christophe GISQUET
Down from 95 kcycles to 93 (including all called functions).
---
 libavcodec/rv40.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)
OK.
Pushed.

Ronald
Christophe GISQUET
2012-04-17 18:44:52 UTC
Permalink
From: Christophe Gisquet <***@gmail.com>

Depending on the input data, these computations can prove to be superfluous.
For x86-64:
- both top and bottom pixels to filter: 197->189 cycles
- one of them: 136->132 cycles
which results in a 1.5% speedup overall.
---
libavcodec/rv40dsp.c | 29 ++++++++++++++++-------------
1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/libavcodec/rv40dsp.c b/libavcodec/rv40dsp.c
index e2a44e8..e464af7 100644
--- a/libavcodec/rv40dsp.c
+++ b/libavcodec/rv40dsp.c
@@ -341,11 +341,6 @@ static av_always_inline void rv40_weak_loop_filter(uint8_t *src,
int i, t, u, diff;

for (i = 0; i < 4; i++, src += stride) {
- int diff_p1p0 = src[-2*step] - src[-1*step];
- int diff_q1q0 = src[ 1*step] - src[ 0*step];
- int diff_p1p2 = src[-2*step] - src[-3*step];
- int diff_q1q2 = src[ 1*step] - src[ 2*step];
-
t = src[0*step] - src[-1*step];
if (!t)
continue;
@@ -359,18 +354,26 @@ static av_always_inline void rv40_weak_loop_filter(uint8_t *src,
t += src[-2*step] - src[1*step];

diff = CLIP_SYMM((t + 4) >> 3, lim_p0q0);
- src[-1*step] = cm[src[-1*step] + diff];
- src[ 0*step] = cm[src[ 0*step] - diff];

- if (filter_p1 && FFABS(diff_p1p2) <= beta) {
- t = (diff_p1p0 + diff_p1p2 - diff) >> 1;
- src[-2*step] = cm[src[-2*step] - CLIP_SYMM(t, lim_p1)];
+ if (filter_p1) {
+ int diff_p1p2 = src[-2*step] - src[-3*step];
+ if (FFABS(diff_p1p2) <= beta) {
+ int diff_p1p0 = src[-2*step] - src[-1*step];
+ t = (diff_p1p0 + diff_p1p2 - diff) >> 1;
+ src[-2*step] = cm[src[-2*step] - CLIP_SYMM(t, lim_p1)];
+ }
}

- if (filter_q1 && FFABS(diff_q1q2) <= beta) {
- t = (diff_q1q0 + diff_q1q2 + diff) >> 1;
- src[ 1*step] = cm[src[ 1*step] - CLIP_SYMM(t, lim_q1)];
+ if (filter_q1) {
+ int diff_q1q2 = src[ 1*step] - src[ 2*step];
+ if (FFABS(diff_q1q2) <= beta) {
+ int diff_q1q0 = src[ 1*step] - src[ 0*step];
+ t = (diff_q1q0 + diff_q1q2 + diff) >> 1;
+ src[ 1*step] = cm[src[ 1*step] - CLIP_SYMM(t, lim_q1)];
+ }
}
+ src[-1*step] = cm[src[-1*step] + diff];
+ src[ 0*step] = cm[src[ 0*step] - diff];
}
}
--
1.7.7.msysgit.0
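
The transformation is plain computation sinking: reads that feed only
the filter_p1 (or filter_q1) branch move inside that branch, so rows
that skip it also skip the loads and subtractions. A contrived
standalone sketch (hypothetical names and update, not the dsp code):

/* "Defer until needed": diff_p1p2 used to be computed for every row;
 * here it is only computed when the filter_p1 branch is taken. */
static void filter_row(unsigned char *src, int step, int filter_p1,
                       int beta)
{
    if (filter_p1) {
        int diff_p1p2 = src[-2 * step] - src[-3 * step];
        if (diff_p1p2 <= beta && diff_p1p2 >= -beta)
            src[-2 * step] -= diff_p1p2 >> 1;  /* placeholder update */
    }
}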
Måns Rullgård
2012-04-17 20:53:29 UTC
Permalink
Post by Christophe GISQUET
Depending on the input data, these computations can prove to be superfluous.
- both top and bottom pixels to filter: 197->189 cycles
- one of them: 136->132 cycles
which results in a 1.5% speedup overall.
---
libavcodec/rv40dsp.c | 29 ++++++++++++++++-------------
1 files changed, 16 insertions(+), 13 deletions(-)
diff --git a/libavcodec/rv40dsp.c b/libavcodec/rv40dsp.c
index e2a44e8..e464af7 100644
--- a/libavcodec/rv40dsp.c
+++ b/libavcodec/rv40dsp.c
@@ -341,11 +341,6 @@ static av_always_inline void rv40_weak_loop_filter(uint8_t *src,
Isn't there asm for this function?
--
Måns Rullgård
***@mansr.com
Ronald S. Bultje
2012-04-17 23:36:00 UTC
Permalink
Hi,
Post by Måns Rullgård
Post by Christophe GISQUET
Depending on the input data, these computations can prove to be superfluous.
- both top and bottom pixels to filter: 197->189 cycles
- one of them: 136->132 cycles
which results in a 1.5% speedup overall.
---
 libavcodec/rv40dsp.c |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)
diff --git a/libavcodec/rv40dsp.c b/libavcodec/rv40dsp.c
index e2a44e8..e464af7 100644
--- a/libavcodec/rv40dsp.c
+++ b/libavcodec/rv40dsp.c
@@ -341,11 +341,6 @@ static av_always_inline void rv40_weak_loop_filter(uint8_t *src,
Isn't there asm for this function?
Not currently.

Ronald
Ronald S. Bultje
2012-04-17 23:37:10 UTC
Permalink
Hi,

On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
Post by Christophe GISQUET
Depending on the input data, these computations can prove to be superfluous.
- both top and bottom pixels to filter: 197->189 cycles
- one of them: 136->132 cycles
which results in a 1.5% speedup overall.
---
 libavcodec/rv40dsp.c |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)
Change itself looks good. As Mans said, this looks like another simd
target (but that's unrelated to this patch going in).

Ronald
Måns Rullgård
2012-04-17 23:41:27 UTC
Permalink
Post by Ronald S. Bultje
Hi,
On Tue, Apr 17, 2012 at 11:44 AM, Christophe GISQUET
Post by Christophe GISQUET
Depending on the input data, these computations can prove to be superfluous.
- both top and bottom pixels to filter: 197->189 cycles
- one of them: 136->132 cycles
which results in a 1.5% speedup overall.
---
 libavcodec/rv40dsp.c |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)
Change itself looks good. As Mans said, this looks like another simd
target (but that's unrelated to this patch going in).
The code is much easier to understand for the purpose of writing simd
the way it is.
--
Måns Rullgård
***@mansr.com
Christophe Gisquet
2012-04-18 08:02:29 UTC
Permalink
Hi,
Post by Ronald S. Bultje
Change itself looks good. As Mans said, this looks like another simd
target (but that's unrelated to this patch going in).
Yes, and with 23% of the runtime, they would be good candidates. But
this is so messy for a leisure activity that I am not looking forward
to it, and in fact I will not submit anything for it.

I don't really mind dropping the patch; nobody seems to have ever
needed optimizations for rv40, so this is not the 1% that will matter.

Christophe
Vladimir Pantelic
2012-04-18 10:04:17 UTC
Permalink
Post by Christophe Gisquet
Hi,
Post by Ronald S. Bultje
Change itself looks good. As Mans said, this looks like another simd
target (but that's unrelated to this patch going in).
Yes, and with 23% of the runtime, they would be good candidates. But
this is so messy for a leisure activity that I am not looking forward
to it, and in fact I will not submit anything for it.
I don't really mind dropping the patch; nobody seems to have ever
needed optimizations for rv40, so this is not the 1% that will matter.
making RV fast on x86 is probably not of much interest, but on
ARM there are a billion people who might like it :)
Christophe Gisquet
2012-04-18 10:07:04 UTC
Permalink
Post by Vladimir Pantelic
Post by Christophe Gisquet
I don't really mind dropping the patch; nobody seems to have ever
needed optimizations for rv40, so this is not the 1% that will matter.
making RV fast on x86 is probably not of much interest, but on
ARM there are a billion people who might like it :)
Yes, I did forget the x86 part, all the more since there is already
NEON code for those functions.
--
Christophe
Diego Biurrun
2012-04-18 10:15:15 UTC
Permalink
Post by Christophe Gisquet
I don't really mind dropping the patch; nobody seems to have ever
needed optimizations for rv40, so this is not the 1% that will matter.
You'd be surprised - RV40 is somehow very popular in China. There
are companies sponsoring ARM optimizations for it. x86 not so much,
I'm afraid...

Diego
Ronald S. Bultje
2012-04-28 18:15:24 UTC
Permalink
Hi,

On Wed, Apr 18, 2012 at 1:02 AM, Christophe Gisquet
Post by Christophe Gisquet
Post by Ronald S. Bultje
Change itself looks good. As Mans said, this looks like another simd
target (but that's unrelated to this patch going in).
Yes, and with 23% of the runtime, they would be good candidates. But
this is so messy for a leisure activity that I am not looking forward
to it, and in fact I will not submit anything for it.
I don't really mind dropping the patch; nobody seems to have ever
needed optimizations for rv40, so this is not the 1% that will matter.
I don't really have an opinion, but I feel the general consensus is to
drop the patch and defer the optimizations to future x86 simd? Is
that correct?

Ronald
Christophe Gisquet
2012-05-10 14:07:26 UTC
Permalink
Post by Ronald S. Bultje
I don't really have an opinion, but I feel the general consensus is to
drop the patch and defer the optimizations to future x86 simd? Is
that correct?
Fine for the dropping, though I don't plan to write simd for the
weak/strong filters, only the strength functions.

Christophe
Ronald S. Bultje
2012-05-10 16:16:47 UTC
Permalink
Hi,

On Thu, May 10, 2012 at 7:07 AM, Christophe Gisquet
Post by Christophe Gisquet
Post by Ronald S. Bultje
I don't really have an opinion, but I feel the general consensus is to
drop the patch and defer the optimizations to future x86 simd? Is
that correct?
Fine for the dropping, though I don't plan to write simd for the
weak/strong filters, only the strength functions.
That sounds fine. We can look at that in the future.

Ronald
Christophe Gisquet
2012-04-17 23:27:51 UTC
Permalink
Post by Christophe GISQUET
These patches are getting old, and so am I, so I prefer sending them while
I still can remember why I did some of things I did in them.
The last rebase was fatal to updating the patch set, as I am affected
by this (newly created) bug:
https://bugzilla.libav.org/show_bug.cgi?id=272
which prevents me from testing changes.

Christophe
Måns Rullgård
2012-04-17 23:33:30 UTC
Permalink
Post by Christophe Gisquet
Post by Christophe GISQUET
These patches are getting old, and so am I, so I prefer sending them while
I still can remember why I did some of things I did in them.
The last rebase was fatal to updating the patch set, as I am affected
https://bugzilla.libav.org/show_bug.cgi?id=272
which prevents me from testing changes.
Configure your git to not mangle line endings.
--
Måns Rullgård
***@mansr.com
Christophe Gisquet
2012-04-18 06:19:51 UTC
Permalink
Hi,
Post by Måns Rullgård
Configure your git to not mangle line endings.
Thanks, doing as the documentation says fixed the issue. I also closed
the related bug.
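
For the record, the setting in question is presumably core.autocrlf,
which the msysgit installer tends to enable; turning it off stops git
from rewriting line endings on checkout/commit:

git config --global core.autocrlf false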

Christophe
Diego Biurrun
2012-04-18 06:44:27 UTC
Permalink
Post by Christophe Gisquet
Post by Måns Rullgård
Configure your git to not mangle line endings.
Thanks, doing as the documentation says fixed the issue. I also closed
the related bug.
FAQ entry very much welcome - this is likely to bite more people in the
future.

Diego
Diego Biurrun
2012-04-18 06:46:04 UTC
Permalink
What is it with the different spellings of your name? IIUC uppercasing
the last name is some sort of French custom, but it looks rather strange
in a more global setting. Please stick to the standard capitalization.

Diego
Christophe Gisquet
2012-04-18 07:00:06 UTC
Permalink
What is it with the different spellings of your name?  IIUC uppercasing
the last name is some sort of French custom, but it looks rather strange
in a more global setting.  Please stick to the standard capitalization.
I have several workplaces, and it seems one git isn't configured like
the others.

The full capitalization is because in some places people have trouble
telling the last name from the first name (thus calling me Mr Christophe).
Not always a French thing (though we tend to capitalize names in
various official writings).

So what you want me to do is use "Gisquet" everywhere? I'll have
to look up how to change the author in the corresponding commits then.
--
Christophe
Diego Biurrun
2012-04-18 07:12:50 UTC
Permalink
Post by Christophe Gisquet
What is it with the different spellings of your name?  IIUC uppercasing
the last name is some sort of French custom, but it looks rather strange
in a more global setting.  Please stick to the standard capitalization.
I have several workplaces, and it seems one git isn't configured like
the others.
So what you want me to do is use "Gisquet" everywhere?
Yes.
Post by Christophe Gisquet
I'll have
to look up how to change the author in the corresponding commits then.
git commit --amend --author="Christophe Gisquet <***@gmail.com>"
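
(For a run of local, unpushed commits, the same amend can be repeated
inside an interactive rebase -- a sketch, assuming <base-commit> is the
commit just before the first one to fix:)

git rebase -i <base-commit>    # mark the affected commits as "edit"
git commit --amend --author="Christophe Gisquet <***@gmail.com>"
git rebase --continue          # repeat amend/continue for each commit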

Diego