Discussion:
[Xen-devel] [PATCH v6 18/18] Add a command line parameter for VT-d posted-interrupts
Feng Wu
2015-08-25 01:57:57 UTC
Enable VT-d Posted-Interrupts and add a command line
parameter for it.

Signed-off-by: Feng Wu <***@intel.com>
Reviewed-by: Kevin Tian <***@intel.com>
---
v6:
- Change the default value to 'false' in xen-command-line.markdown

docs/misc/xen-command-line.markdown | 9 ++++++++-
xen/drivers/passthrough/iommu.c | 3 +++
2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index a2e427c..ecaf221 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -855,7 +855,7 @@ debug hypervisor only).
Default: `new` unless directed-EOI is supported
### iommu
-> `= List of [ <boolean> | force | required | intremap | qinval | snoop | sharept | dom0-passthrough | dom0-strict | amd-iommu-perdev-intremap | workaround_bios_bug | igfx | verbose | debug ]`
+> `= List of [ <boolean> | force | required | intremap | intpost | qinval | snoop | sharept | dom0-passthrough | dom0-strict | amd-iommu-perdev-intremap | workaround_bios_bug | igfx | verbose | debug ]`
@@ -882,6 +882,13 @@ debug hypervisor only).
Control the use of interrupt remapping (DMA remapping will always be enabled
if IOMMU functionality is enabled).
+> `intpost`
+
+> Default: `false`
+
+>> Control the use of interrupt posting, which depends on the availability of
+>> interrupt remapping.
+
`qinval` (VT-d)
Default: `true`
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 36d5cc0..8d03076 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -38,6 +38,7 @@ static void iommu_dump_p2m_table(unsigned char key);
* no-snoop Disable VT-d Snoop Control
* no-qinval Disable VT-d Queued Invalidation
* no-intremap Disable VT-d Interrupt Remapping
+ * no-intpost Disable VT-d Interrupt posting
*/
custom_param("iommu", parse_iommu_param);
bool_t __initdata iommu_enable = 1;
@@ -105,6 +106,8 @@ static void __init parse_iommu_param(char *s)
iommu_qinval = val;
else if ( !strcmp(s, "intremap") )
iommu_intremap = val;
+ else if ( !strcmp(s, "intpost") )
+ iommu_intpost = val;
else if ( !strcmp(s, "debug") )
{
iommu_debug = val;
--
2.1.0
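[Aside] For anyone wanting to try this out: the new option is just another "iommu=" sub-option, so (assuming an otherwise ordinary Xen boot line; the surrounding arguments are not part of the patch) enabling it would look something like:

    iommu=intremap,intpost

With the 'false' default documented above, posting stays off unless requested explicitly, and it only takes effect when interrupt remapping is available.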
Feng Wu
2015-08-25 01:57:46 UTC
This patch initializes the VT-d Posted-interrupt Descriptor.

CC: Kevin Tian <***@intel.com>
CC: Keir Fraser <***@xen.org>
CC: Jan Beulich <***@suse.com>
CC: Andrew Cooper <***@citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
Acked-by: Kevin Tian <***@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <***@oracle.com>
---
v3:
- Move pi_desc_init() to xen/arch/x86/hvm/vmx/vmcs.c
- Remove the 'inline' flag of pi_desc_init()

xen/arch/x86/hvm/vmx/vmcs.c | 18 ++++++++++++++++++
xen/include/asm-x86/hvm/vmx/vmx.h | 2 ++
2 files changed, 20 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index a0a97e7..28c553f 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -39,6 +39,7 @@
#include <asm/flushtlb.h>
#include <asm/shadow.h>
#include <asm/tboot.h>
+#include <asm/apic.h>

static bool_t __read_mostly opt_vpid_enabled = 1;
boolean_param("vpid", opt_vpid_enabled);
@@ -951,6 +952,20 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
virtual_vmcs_exit(vvmcs);
}

+static void pi_desc_init(struct vcpu *v)
+{
+ uint32_t dest;
+
+ v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
+
+ dest = cpu_physical_id(v->processor);
+
+ if ( x2apic_enabled )
+ v->arch.hvm_vmx.pi_desc.ndst = dest;
+ else
+ v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+}
+
static int construct_vmcs(struct vcpu *v)
{
struct domain *d = v->domain;
@@ -1089,6 +1104,9 @@ static int construct_vmcs(struct vcpu *v)

if ( cpu_has_vmx_posted_intr_processing )
{
+ if ( iommu_intpost )
+ pi_desc_init(v);
+
__vmwrite(PI_DESC_ADDR, virt_to_maddr(&v->arch.hvm_vmx.pi_desc));
__vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector);
}
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index acd4aec..03c529c 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -88,6 +88,8 @@ typedef enum {
#define EPT_EMT_WB 6
#define EPT_EMT_RSV2 7

+#define PI_xAPIC_NDST_MASK 0xFF00
+
void vmx_asm_vmexit_handler(struct cpu_user_regs);
void vmx_asm_do_vmentry(void);
void vmx_intr_assist(void);
--
2.1.0
Jan Beulich
2015-09-04 14:47:12 UTC
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -39,6 +39,7 @@
#include <asm/flushtlb.h>
#include <asm/shadow.h>
#include <asm/tboot.h>
+#include <asm/apic.h>
I can't seem to spot what this is needed for.
Post by Feng Wu
@@ -951,6 +952,20 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
virtual_vmcs_exit(vvmcs);
}
+static void pi_desc_init(struct vcpu *v)
+{
+ uint32_t dest;
+
+ v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
+
+ dest = cpu_physical_id(v->processor);
+
+ if ( x2apic_enabled )
+ v->arch.hvm_vmx.pi_desc.ndst = dest;
+ else
+ v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+}
+
static int construct_vmcs(struct vcpu *v)
{
struct domain *d = v->domain;
@@ -1089,6 +1104,9 @@ static int construct_vmcs(struct vcpu *v)
if ( cpu_has_vmx_posted_intr_processing )
{
+ if ( iommu_intpost )
+ pi_desc_init(v);
If this is going to remain the only call site of the function, I'd suggest
expanding it here. This is because calling the function from elsewhere
is, at least at the first glance, unsafe: It doesn't update pi_desc
atomically, which is in (only apparent?) conflict with the atomic
modifications helpers added in an earlier patch.

If further call sites will get added by later patches, clarifying in a
comment why the non-atomic updates are okay would seem
necessary; alternatively change them to become atomic.

Jan
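[Aside] On the atomicity point: in this series the NV and NDST fields live inside the same 64-bit 'control' word that later patches update with a cmpxchg loop, so an atomic variant of this update is straightforward. A minimal sketch, modelled on that loop (the helper name is made up; at construct_vmcs() time nothing else can yet reference the descriptor, which is presumably why plain writes are tolerable there):

    static void pi_desc_write_atomically(struct vcpu *v, uint32_t dest)
    {
        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
        struct pi_desc old, new;

        do {
            /* Snapshot the whole control word, then rewrite NV/NDST. */
            old.control = new.control = pi_desc->control;
            new.nv = posted_intr_vector;
            new.ndst = x2apic_enabled ? dest
                                      : MASK_INSR(dest, PI_xAPIC_NDST_MASK);
        } while ( cmpxchg(&pi_desc->control, old.control, new.control) !=
                  old.control );
    }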
Wu, Feng
2015-09-06 02:22:05 UTC
-----Original Message-----
Sent: Friday, September 04, 2015 10:47 PM
To: Wu, Feng
Subject: Re: [PATCH v6 07/18] vmx: Initialize VT-d Posted-Interrupts Descriptor
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -39,6 +39,7 @@
#include <asm/flushtlb.h>
#include <asm/shadow.h>
#include <asm/tboot.h>
+#include <asm/apic.h>
I can't seem to spot what this is needed for.
It is for 'x2apic_enabled'.
Post by Feng Wu
@@ -1089,6 +1104,9 @@ static int construct_vmcs(struct vcpu *v)
if ( cpu_has_vmx_posted_intr_processing )
{
+ if ( iommu_intpost )
+ pi_desc_init(v);
If this is going to remain the only call site of the function, I'd suggest
expanding it here. This is because calling the function from elsewhere
is, at least at the first glance, unsafe: It doesn't update pi_desc
atomically, which is in (only apparent?) conflict with the atomic
modifications helpers added in an earlier patch.
If further call sites will get added by later patches, clarifying in a
comment why the non-atomic updates are okay would seem
necessary; alternatively change them to become atomic.
I will add some comments to this function.

Thanks,
Feng
Jan
Jan Beulich
2015-09-07 10:49:45 UTC
Post by Wu, Feng
Sent: Friday, September 04, 2015 10:47 PM
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -39,6 +39,7 @@
#include <asm/flushtlb.h>
#include <asm/shadow.h>
#include <asm/tboot.h>
+#include <asm/apic.h>
I can't seem to spot what this is needed for.
It is for 'x2apic_enabled'.
Well, not nice to include the header here, but what do we do...

Jan
Feng Wu
2015-08-25 01:57:49 UTC
Extend struct iremap_entry according to VT-d Posted-Interrupts Spec.

CC: Yang Zhang <***@intel.com>
CC: Kevin Tian <***@intel.com>
Signed-off-by: Feng Wu <***@intel.com>
Acked-by: Kevin Tian <***@intel.com>
---
v4:
- res_4 is not a bitfield; correct it.
- Expose 'im' to remapped irte as well.

v3:
- Use u32 instead of u64 to define the bitfields in 'struct iremap_entry'
- Limit using bitfield if possible

xen/drivers/passthrough/vtd/intremap.c | 92 +++++++++++++++++-----------------
xen/drivers/passthrough/vtd/iommu.h | 43 ++++++++++------
xen/drivers/passthrough/vtd/utils.c | 8 +--
3 files changed, 80 insertions(+), 63 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 987bbe9..e9fffa6 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -122,9 +122,9 @@ static u16 hpetid_to_bdf(unsigned int hpet_id)
static void set_ire_sid(struct iremap_entry *ire,
unsigned int svt, unsigned int sq, unsigned int sid)
{
- ire->hi.svt = svt;
- ire->hi.sq = sq;
- ire->hi.sid = sid;
+ ire->remap.svt = svt;
+ ire->remap.sq = sq;
+ ire->remap.sid = sid;
}

static void set_ioapic_source_id(int apic_id, struct iremap_entry *ire)
@@ -219,7 +219,7 @@ static unsigned int alloc_remap_entry(struct iommu *iommu, unsigned int nr)
else
p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];

- if ( p->lo_val || p->hi_val ) /* not a free entry */
+ if ( p->lo || p->hi ) /* not a free entry */
found = 0;
else if ( ++found == nr )
break;
@@ -253,7 +253,7 @@ static int remap_entry_to_ioapic_rte(
GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
iremap_entries, iremap_entry);

- if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+ if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
{
dprintk(XENLOG_ERR VTDPREFIX,
"%s: index (%d) get an empty entry!\n",
@@ -263,13 +263,13 @@ static int remap_entry_to_ioapic_rte(
return -EFAULT;
}

- old_rte->vector = iremap_entry->lo.vector;
- old_rte->delivery_mode = iremap_entry->lo.dlm;
- old_rte->dest_mode = iremap_entry->lo.dm;
- old_rte->trigger = iremap_entry->lo.tm;
+ old_rte->vector = iremap_entry->remap.vector;
+ old_rte->delivery_mode = iremap_entry->remap.dlm;
+ old_rte->dest_mode = iremap_entry->remap.dm;
+ old_rte->trigger = iremap_entry->remap.tm;
old_rte->__reserved_2 = 0;
old_rte->dest.logical.__reserved_1 = 0;
- old_rte->dest.logical.logical_dest = iremap_entry->lo.dst >> 8;
+ old_rte->dest.logical.logical_dest = iremap_entry->remap.dst >> 8;

unmap_vtd_domain_page(iremap_entries);
spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
@@ -317,27 +317,28 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
if ( rte_upper )
{
if ( x2apic_enabled )
- new_ire.lo.dst = value;
+ new_ire.remap.dst = value;
else
- new_ire.lo.dst = (value >> 24) << 8;
+ new_ire.remap.dst = (value >> 24) << 8;
}
else
{
*(((u32 *)&new_rte) + 0) = value;
- new_ire.lo.fpd = 0;
- new_ire.lo.dm = new_rte.dest_mode;
- new_ire.lo.tm = new_rte.trigger;
- new_ire.lo.dlm = new_rte.delivery_mode;
+ new_ire.remap.fpd = 0;
+ new_ire.remap.dm = new_rte.dest_mode;
+ new_ire.remap.tm = new_rte.trigger;
+ new_ire.remap.dlm = new_rte.delivery_mode;
/* Hardware require RH = 1 for LPR delivery mode */
- new_ire.lo.rh = (new_ire.lo.dlm == dest_LowestPrio);
- new_ire.lo.avail = 0;
- new_ire.lo.res_1 = 0;
- new_ire.lo.vector = new_rte.vector;
- new_ire.lo.res_2 = 0;
+ new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+ new_ire.remap.avail = 0;
+ new_ire.remap.res_1 = 0;
+ new_ire.remap.vector = new_rte.vector;
+ new_ire.remap.res_2 = 0;

set_ioapic_source_id(IO_APIC_ID(apic), &new_ire);
- new_ire.hi.res_1 = 0;
- new_ire.lo.p = 1; /* finally, set present bit */
+ new_ire.remap.res_3 = 0;
+ new_ire.remap.res_4 = 0;
+ new_ire.remap.p = 1; /* finally, set present bit */

/* now construct new ioapic rte entry */
remap_rte->vector = new_rte.vector;
@@ -510,7 +511,7 @@ static int remap_entry_to_msi_msg(
GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
iremap_entries, iremap_entry);

- if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+ if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
{
dprintk(XENLOG_ERR VTDPREFIX,
"%s: index (%d) get an empty entry!\n",
@@ -523,25 +524,25 @@ static int remap_entry_to_msi_msg(
msg->address_hi = MSI_ADDR_BASE_HI;
msg->address_lo =
MSI_ADDR_BASE_LO |
- ((iremap_entry->lo.dm == 0) ?
+ ((iremap_entry->remap.dm == 0) ?
MSI_ADDR_DESTMODE_PHYS:
MSI_ADDR_DESTMODE_LOGIC) |
- ((iremap_entry->lo.dlm != dest_LowestPrio) ?
+ ((iremap_entry->remap.dlm != dest_LowestPrio) ?
MSI_ADDR_REDIRECTION_CPU:
MSI_ADDR_REDIRECTION_LOWPRI);
if ( x2apic_enabled )
- msg->dest32 = iremap_entry->lo.dst;
+ msg->dest32 = iremap_entry->remap.dst;
else
- msg->dest32 = (iremap_entry->lo.dst >> 8) & 0xff;
+ msg->dest32 = (iremap_entry->remap.dst >> 8) & 0xff;
msg->address_lo |= MSI_ADDR_DEST_ID(msg->dest32);

msg->data =
MSI_DATA_TRIGGER_EDGE |
MSI_DATA_LEVEL_ASSERT |
- ((iremap_entry->lo.dlm != dest_LowestPrio) ?
+ ((iremap_entry->remap.dlm != dest_LowestPrio) ?
MSI_DATA_DELIVERY_FIXED:
MSI_DATA_DELIVERY_LOWPRI) |
- iremap_entry->lo.vector;
+ iremap_entry->remap.vector;

unmap_vtd_domain_page(iremap_entries);
spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
@@ -600,29 +601,30 @@ static int msi_msg_to_remap_entry(
memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));

/* Set interrupt remapping table entry */
- new_ire.lo.fpd = 0;
- new_ire.lo.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
- new_ire.lo.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
- new_ire.lo.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
+ new_ire.remap.fpd = 0;
+ new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
+ new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+ new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
/* Hardware require RH = 1 for LPR delivery mode */
- new_ire.lo.rh = (new_ire.lo.dlm == dest_LowestPrio);
- new_ire.lo.avail = 0;
- new_ire.lo.res_1 = 0;
- new_ire.lo.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
- MSI_DATA_VECTOR_MASK;
- new_ire.lo.res_2 = 0;
+ new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+ new_ire.remap.avail = 0;
+ new_ire.remap.res_1 = 0;
+ new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
+ MSI_DATA_VECTOR_MASK;
+ new_ire.remap.res_2 = 0;
if ( x2apic_enabled )
- new_ire.lo.dst = msg->dest32;
+ new_ire.remap.dst = msg->dest32;
else
- new_ire.lo.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
- & 0xff) << 8;
+ new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
+ & 0xff) << 8;

if ( pdev )
set_msi_source_id(pdev, &new_ire);
else
set_hpet_source_id(msi_desc->hpet_id, &new_ire);
- new_ire.hi.res_1 = 0;
- new_ire.lo.p = 1; /* finally, set present bit */
+ new_ire.remap.res_3 = 0;
+ new_ire.remap.res_4 = 0;
+ new_ire.remap.p = 1; /* finally, set present bit */

/* now construct new MSI/MSI-X rte entry */
remap_rte = (struct msi_msg_remap_entry *)msg;
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 22abefe..6fca430 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -281,29 +281,44 @@ struct dma_pte {
/* interrupt remap entry */
struct iremap_entry {
union {
- u64 lo_val;
+ struct { u64 lo, hi; };
struct {
- u64 p : 1,
+ u16 p : 1,
fpd : 1,
dm : 1,
rh : 1,
tm : 1,
dlm : 3,
avail : 4,
- res_1 : 4,
- vector : 8,
- res_2 : 8,
- dst : 32;
- }lo;
- };
- union {
- u64 hi_val;
+ res_1 : 3,
+ im : 1;
+ u8 vector;
+ u8 res_2;
+ u32 dst;
+ u16 sid;
+ u16 sq : 2,
+ svt : 2,
+ res_3 : 12;
+ u32 res_4;
+ } remap;
struct {
- u64 sid : 16,
- sq : 2,
+ u16 p : 1,
+ fpd : 1,
+ res_1 : 6,
+ avail : 4,
+ res_2 : 2,
+ urg : 1,
+ im : 1;
+ u8 vector;
+ u8 res_3;
+ u32 res_4 : 6,
+ pda_l : 26;
+ u16 sid;
+ u16 sq : 2,
svt : 2,
- res_1 : 44;
- }hi;
+ res_5 : 12;
+ u32 pda_h;
+ } post;
};
};

diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index 162b764..9d556da 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -230,13 +230,13 @@ static void dump_iommu_info(unsigned char key)
else
p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];

- if ( !p->lo.p )
+ if ( !p->remap.p )
continue;
printk(" %04x: %x %x %04x %08x %02x %x %x %x %x %x"
" %x %x\n", i,
- p->hi.svt, p->hi.sq, p->hi.sid, p->lo.dst, p->lo.vector,
- p->lo.avail, p->lo.dlm, p->lo.tm, p->lo.rh, p->lo.dm,
- p->lo.fpd, p->lo.p);
+ p->remap.svt, p->remap.sq, p->remap.sid, p->remap.dst,
+ p->remap.vector, p->remap.avail, p->remap.dlm, p->remap.tm,
+ p->remap.rh, p->remap.dm, p->remap.fpd, p->remap.p);
print_cnt++;
}
if ( iremap_entries )
--
2.1.0
Feng Wu
2015-08-25 01:57:51 UTC
Move some APIC related macros to apicdef.h, so they can be used
outside of vlapic.c.

Signed-off-by: Feng Wu <***@intel.com>
---
v6:
- Newly introduced.

xen/arch/x86/hvm/vlapic.c | 5 -----
xen/include/asm-x86/apicdef.h | 4 ++++
2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index b893b40..9b7c871 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -65,11 +65,6 @@ static const unsigned int vlapic_lvt_mask[VLAPIC_LVT_NUM] =
LVT_MASK
};

-/* Following could belong in apicdef.h */
-#define APIC_SHORT_MASK 0xc0000
-#define APIC_DEST_NOSHORT 0x0
-#define APIC_DEST_MASK 0x800
-
#define vlapic_lvt_vector(vlapic, lvt_type) \
(vlapic_get_reg(vlapic, lvt_type) & APIC_VECTOR_MASK)

diff --git a/xen/include/asm-x86/apicdef.h b/xen/include/asm-x86/apicdef.h
index 6069fce..6d1fd94 100644
--- a/xen/include/asm-x86/apicdef.h
+++ b/xen/include/asm-x86/apicdef.h
@@ -124,6 +124,10 @@

#define MAX_IO_APICS 128

+#define APIC_SHORT_MASK 0xc0000
+#define APIC_DEST_NOSHORT 0x0
+#define APIC_DEST_MASK 0x800
+
/*
* the local APIC register structure, memory mapped. Not terribly well
* tested, but we might eventually use this one in the future - the
--
2.1.0
Jan Beulich
2015-09-04 15:15:00 UTC
Post by Feng Wu
--- a/xen/include/asm-x86/apicdef.h
+++ b/xen/include/asm-x86/apicdef.h
@@ -124,6 +124,10 @@
#define MAX_IO_APICS 128
+#define APIC_SHORT_MASK 0xc0000
+#define APIC_DEST_NOSHORT 0x0
+#define APIC_DEST_MASK 0x800
Moving them into this header is certainly the right thing to do, but
please put them where they belong inside the header (i.e. the first two
next to APIC_DEST_{SELF,ALLINC,ALLBUT} and the latter before
or after APIC_DEST_{LOGICAL,PHYSICAL}).

Jan
Feng Wu
2015-08-25 01:57:55 UTC
This patch adds the following arch hooks to the scheduler:
- vmx_pre_ctx_switch_pi():
It is called before context switch; we update the posted-interrupt
descriptor when the vCPU is preempted, goes to sleep, or is blocked.

- vmx_post_ctx_switch_pi():
It is called after context switch; we update the posted-interrupt
descriptor when the vCPU is about to run.

- arch_vcpu_wake_prepare():
It is called when waking up the vCPU; we update the posted-interrupt
descriptor when the vCPU is unblocked.

CC: Keir Fraser <***@xen.org>
CC: Jan Beulich <***@suse.com>
CC: Andrew Cooper <***@citrix.com>
CC: Kevin Tian <***@intel.com>
CC: George Dunlap <***@eu.citrix.com>
CC: Dario Faggioli <***@citrix.com>
Suggested-by: Dario Faggioli <***@citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
Reviewed-by: Dario Faggioli <***@citrix.com>
---
v6:
- Add two static inline functions for pi context switch
- Fix typos

v5:
- Rename arch_vcpu_wake to arch_vcpu_wake_prepare
- Make arch_vcpu_wake_prepare() inline for ARM
- Merge the ARM dummy hook into this patch
- Changes to some code comments
- Leave 'pi_ctxt_switch_from' and 'pi_ctxt_switch_to' NULL if
PI is disabled or the vCPU is not in HVM
- Coding style

v4:
- Newly added

xen/arch/x86/domain.c | 19 +++++
xen/arch/x86/hvm/vmx/vmx.c | 147 +++++++++++++++++++++++++++++++++++++
xen/common/schedule.c | 2 +
xen/include/asm-arm/domain.h | 2 +
xen/include/asm-x86/domain.h | 3 +
xen/include/asm-x86/hvm/hvm.h | 2 +
xen/include/asm-x86/hvm/vmx/vmcs.h | 8 ++
7 files changed, 183 insertions(+)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..443986e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1573,6 +1573,22 @@ static void __context_switch(void)
per_cpu(curr_vcpu, cpu) = n;
}

+static inline void pi_ctxt_switch_from(struct vcpu *prev)
+{
+ /*
+ * When switching from non-idle to idle, we only do a lazy context switch.
+ * However, in order for posted interrupt (if available and enabled) to
+ * work properly, we at least need to update the descriptors.
+ */
+ if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
+ prev->arch.pi_ctxt_switch_from(prev);
+}
+
+static inline void pi_ctxt_switch_to(struct vcpu *next)
+{
+ if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
+ next->arch.pi_ctxt_switch_to(next);
+}

void context_switch(struct vcpu *prev, struct vcpu *next)
{
@@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)

set_current(next);

+ pi_ctxt_switch_from(prev);
+
if ( (per_cpu(curr_vcpu, cpu) == next) ||
(is_idle_domain(nextd) && cpu_online(cpu)) )
{
+ pi_ctxt_switch_to(next);
local_irq_enable();
}
else
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 5167fae..889ede3 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -67,6 +67,8 @@ enum handler_return { HNDL_done, HNDL_unhandled, HNDL_exception_raised };

static void vmx_ctxt_switch_from(struct vcpu *v);
static void vmx_ctxt_switch_to(struct vcpu *v);
+static void vmx_pre_ctx_switch_pi(struct vcpu *v);
+static void vmx_post_ctx_switch_pi(struct vcpu *v);

static int vmx_alloc_vlapic_mapping(struct domain *d);
static void vmx_free_vlapic_mapping(struct domain *d);
@@ -117,10 +119,20 @@ static int vmx_vcpu_initialise(struct vcpu *v)
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);

+ v->arch.hvm_vmx.pi_block_cpu = -1;
+
+ spin_lock_init(&v->arch.hvm_vmx.pi_lock);
+
v->arch.schedule_tail = vmx_do_resume;
v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
v->arch.ctxt_switch_to = vmx_ctxt_switch_to;

+ if ( iommu_intpost && is_hvm_vcpu(v) )
+ {
+ v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
+ v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
+ }
+
if ( (rc = vmx_create_vmcs(v)) != 0 )
{
dprintk(XENLOG_WARNING,
@@ -718,6 +730,140 @@ static void vmx_fpu_leave(struct vcpu *v)
}
}

+void arch_vcpu_wake_prepare(struct vcpu *v)
+{
+ unsigned long gflags;
+
+ if ( !iommu_intpost || !is_hvm_vcpu(v) || !has_arch_pdevs(v->domain) )
+ return;
+
+ spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
+
+ if ( likely(vcpu_runnable(v)) ||
+ !test_bit(_VPF_blocked, &v->pause_flags) )
+ {
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ unsigned long flags;
+
+ /*
+ * We don't need to send notification event to a non-running
+ * vcpu, the interrupt information will be delivered to it before
+ * VM-ENTRY when the vcpu is scheduled to run next time.
+ */
+ pi_set_sn(pi_desc);
+
+ /*
+ * Set 'NV' field back to posted_intr_vector, so the
+ * Posted-Interrupts can be delivered to the vCPU by
+ * VT-d HW after it is scheduled to run.
+ */
+ write_atomic((uint8_t*)&pi_desc->nv, posted_intr_vector);
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+ list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+ spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+ }
+ }
+
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, gflags);
+}
+
+static void vmx_pre_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ struct pi_desc old, new;
+ unsigned long flags, gflags;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
+
+ if ( vcpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
+ {
+ /*
+ * The vCPU has been preempted or went to sleep. We don't need to send
+ * notification event to a non-running vcpu, the interrupt information
+ * will be delivered to it before VM-ENTRY when the vcpu is scheduled
+ * to run next time.
+ */
+ pi_set_sn(pi_desc);
+
+ }
+ else if ( test_bit(_VPF_blocked, &v->pause_flags) )
+ {
+ /*
+ * The vCPU is blocking, we need to add it to one of the per pCPU lists.
+ * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it for
+ * the per-CPU list, we also save it to posted-interrupt descriptor and
+ * make it as the destination of the wake-up notification event.
+ */
+ v->arch.hvm_vmx.pi_block_cpu = v->processor;
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+ list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
+ &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
+ spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+
+ do {
+ old.control = new.control = pi_desc->control;
+
+ /* Should not block the vCPU if an interrupt was posted for it. */
+ if ( pi_test_on(&old) )
+ {
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, gflags);
+ vcpu_unblock(v);
+ return;
+ }
+
+ /*
+ * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
+ * so when external interrupts from assigned deivces happen,
+ * wakeup notifiction event will go to
+ * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
+ * we can find the vCPU in the right list to wake up.
+ */
+ if ( x2apic_enabled )
+ new.ndst = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
+ else
+ new.ndst = MASK_INSR(cpu_physical_id(
+ v->arch.hvm_vmx.pi_block_cpu),
+ PI_xAPIC_NDST_MASK);
+ pi_clear_sn(&new);
+ new.nv = pi_wakeup_vector;
+ } while ( cmpxchg(&pi_desc->control, old.control, new.control)
+ != old.control );
+ }
+
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, gflags);
+}
+
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ if ( x2apic_enabled )
+ write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+ else
+ write_atomic(&pi_desc->ndst,
+ MASK_INSR(cpu_physical_id(v->processor),
+ PI_xAPIC_NDST_MASK));
+
+ pi_clear_sn(pi_desc);
+}
+
static void vmx_ctxt_switch_from(struct vcpu *v)
{
/*
@@ -756,6 +902,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)

vmx_restore_guest_msrs(v);
vmx_restore_dr(v);
+ vmx_post_ctx_switch_pi(v);
}


diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 3eefed7..bc49098 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -412,6 +412,8 @@ void vcpu_wake(struct vcpu *v)
unsigned long flags;
spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);

+ arch_vcpu_wake_prepare(v);
+
if ( likely(vcpu_runnable(v)) )
{
if ( v->runstate.state >= RUNSTATE_blocked )
diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
index 56aa208..cffe2c6 100644
--- a/xen/include/asm-arm/domain.h
+++ b/xen/include/asm-arm/domain.h
@@ -301,6 +301,8 @@ static inline register_t vcpuid_to_vaffinity(unsigned int vcpuid)
return vaff;
}

+static inline void arch_vcpu_wake_prepare(struct vcpu *v) {}
+
#endif /* __ASM_DOMAIN_H__ */

/*
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 0fce09e..979210a 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -481,6 +481,9 @@ struct arch_vcpu
void (*ctxt_switch_from) (struct vcpu *);
void (*ctxt_switch_to) (struct vcpu *);

+ void (*pi_ctxt_switch_from) (struct vcpu *);
+ void (*pi_ctxt_switch_to) (struct vcpu *);
+
struct vpmu_struct vpmu;

/* Virtual Machine Extensions */
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 3cac64f..95f5357 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -545,6 +545,8 @@ static inline bool_t hvm_altp2m_supported(void)
return hvm_funcs.altp2m_supported;
}

+void arch_vcpu_wake_prepare(struct vcpu *v);
+
#ifndef NDEBUG
/* Permit use of the Forced Emulation Prefix in HVM guests */
extern bool_t opt_hvm_fep;
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 9a986d0..209fb39 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -164,6 +164,14 @@ struct arch_vmx_struct {

struct list_head pi_blocked_vcpu_list;
struct list_head pi_vcpu_on_set_list;
+
+ /*
+ * Before vCPU is blocked, it is added to the global per-cpu list
+ * of 'pi_block_cpu', then VT-d engine can send wakeup notification
+ * event to 'pi_block_cpu' and wakeup the related vCPU.
+ */
+ int pi_block_cpu;
+ spinlock_t pi_lock;
};

int vmx_create_vmcs(struct vcpu *v);
--
2.1.0
Jan Beulich
2015-09-07 12:54:36 UTC
Post by Feng Wu
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1573,6 +1573,22 @@ static void __context_switch(void)
per_cpu(curr_vcpu, cpu) = n;
}
+static inline void pi_ctxt_switch_from(struct vcpu *prev)
+{
+ /*
+ * When switching from non-idle to idle, we only do a lazy context switch.
+ * However, in order for posted interrupt (if available and enabled) to
+ * work properly, we at least need to update the descriptors.
+ */
+ if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
+ prev->arch.pi_ctxt_switch_from(prev);
+}
+
+static inline void pi_ctxt_switch_to(struct vcpu *next)
+{
+ if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
+ next->arch.pi_ctxt_switch_to(next);
+}
void context_switch(struct vcpu *prev, struct vcpu *next)
{
@@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
set_current(next);
+ pi_ctxt_switch_from(prev);
+
if ( (per_cpu(curr_vcpu, cpu) == next) ||
(is_idle_domain(nextd) && cpu_online(cpu)) )
{
+ pi_ctxt_switch_to(next);
local_irq_enable();
This placement, if really intended that way, needs explanation (in a
comment) and perhaps even renaming of the involved symbols, as
looking at it from a general perspective it seems wrong (with
pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
want this only when switching back to what got switched out lazily
before, i.e. this would be not something to take place on an arbitrary
context switch). As to possible alternative names - maybe make the
hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
Post by Feng Wu
@@ -117,10 +119,20 @@ static int vmx_vcpu_initialise(struct vcpu *v)
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
+ v->arch.hvm_vmx.pi_block_cpu = -1;
+
+ spin_lock_init(&v->arch.hvm_vmx.pi_lock);
+
v->arch.schedule_tail = vmx_do_resume;
v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
+ if ( iommu_intpost && is_hvm_vcpu(v) )
+ {
+ v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
+ v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
+ }
Why conditional upon is_hvm_vcpu()?
Post by Feng Wu
@@ -718,6 +730,140 @@ static void vmx_fpu_leave(struct vcpu *v)
}
}
+void arch_vcpu_wake_prepare(struct vcpu *v)
A non-static function with this name should not live in vmx.c.
Post by Feng Wu
+{
+ unsigned long gflags;
Just "flags" please.
Post by Feng Wu
+ if ( !iommu_intpost || !is_hvm_vcpu(v) || !has_arch_pdevs(v->domain) )
DYM !has_hvm_container_vcpu()?
Post by Feng Wu
+ return;
+
+ spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
+
+ if ( likely(vcpu_runnable(v)) ||
+ !test_bit(_VPF_blocked, &v->pause_flags) )
_VPF_blocked set implies !vcpu_runnable(), i.e. afaict just the
latter check would suffice if this is really what you mean.
Post by Feng Wu
+ {
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ unsigned long flags;
You don't need this - you already know IRQs are off for the lock
acquire below.
Post by Feng Wu
+ /*
+ * We don't need to send notification event to a non-running
+ * vcpu, the interrupt information will be delivered to it before
+ * VM-ENTRY when the vcpu is scheduled to run next time.
+ */
+ pi_set_sn(pi_desc);
+
+ /*
+ * Set 'NV' field back to posted_intr_vector, so the
+ * Posted-Interrupts can be delivered to the vCPU by
+ * VT-d HW after it is scheduled to run.
+ */
+ write_atomic((uint8_t*)&pi_desc->nv, posted_intr_vector);
Bogus cast.
Post by Feng Wu
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+ list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
Shouldn't you set .pi_block_cpu back to -1 along with removing
the vCPU from the list? Which then ought to allow using just
list_del() here?
Post by Feng Wu
+static void vmx_pre_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ struct pi_desc old, new;
Please limit the scope of these two variables.
Post by Feng Wu
+ unsigned long flags, gflags;
See above.
Post by Feng Wu
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
+
+ if ( vcpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
See above.
Post by Feng Wu
+ {
+ /*
+ * The vCPU has been preempted or went to sleep. We don't need to send
+ * notification event to a non-running vcpu, the interrupt information
+ * will be delivered to it before VM-ENTRY when the vcpu is scheduled
+ * to run next time.
+ */
+ pi_set_sn(pi_desc);
+
+ }
+ else if ( test_bit(_VPF_blocked, &v->pause_flags) )
The condition here is redundant with the if() one above.
Post by Feng Wu
+ {
+ /*
+ * The vCPU is blocking, we need to add it to one of the per pCPU lists.
+ * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it for
+ * the per-CPU list, we also save it to posted-interrupt descriptor and
+ * make it as the destination of the wake-up notification event.
+ */
+ v->arch.hvm_vmx.pi_block_cpu = v->processor;
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+ list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
+ &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
+ spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+
+ do {
+ old.control = new.control = pi_desc->control;
+
+ /* Should not block the vCPU if an interrupt was posted for it. */
+ if ( pi_test_on(&old) )
+ {
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock, gflags);
+ vcpu_unblock(v);
Calling vcpu_unblock() in the middle of context_switch()? Why? And
is this safe? And if really needed to cover something not dealt with
elsewhere, wouldn't this need to happen _after_ having switched
the notification mechanism below?
Post by Feng Wu
+ return;
+ }
+
+ /*
+ * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
+ * so when external interrupts from assigned deivces happen,
+ * wakeup notifiction event will go to
+ * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
+ * we can find the vCPU in the right list to wake up.
+ */
+ if ( x2apic_enabled )
+ new.ndst = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
+ else
+ new.ndst = MASK_INSR(cpu_physical_id(
+ v->arch.hvm_vmx.pi_block_cpu),
+ PI_xAPIC_NDST_MASK);
Indentation is screwed up here. Perhaps it would help if you latched
cpu_phyiscal_id() into a local variable and used it on both branches?
Post by Feng Wu
+ pi_clear_sn(&new);
+ new.nv = pi_wakeup_vector;
+ } while ( cmpxchg(&pi_desc->control, old.control, new.control)
+ != old.control );
Operators belong on the earlier of the split up lines.
Post by Feng Wu
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ if ( x2apic_enabled )
+ write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+ else
+ write_atomic(&pi_desc->ndst,
+ MASK_INSR(cpu_physical_id(v->processor),
+ PI_xAPIC_NDST_MASK));
+
+ pi_clear_sn(pi_desc);
+}
So you alter where notifications go, but not via which vector? How
is the vCPU going to get removed from the blocked list then?
Post by Feng Wu
@@ -756,6 +902,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
vmx_restore_guest_msrs(v);
vmx_restore_dr(v);
+ vmx_post_ctx_switch_pi(v);
}
Ah, here is the other invocation! But no, this shouldn't be done this
way. This really belongs into vendor independent code, even if the
indirect call overhead is slightly higher.
Post by Feng Wu
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -164,6 +164,14 @@ struct arch_vmx_struct {
struct list_head pi_blocked_vcpu_list;
struct list_head pi_vcpu_on_set_list;
+
+ /*
+ * Before vCPU is blocked, it is added to the global per-cpu list
+ * of 'pi_block_cpu', then VT-d engine can send wakeup notification
+ * event to 'pi_block_cpu' and wakeup the related vCPU.
+ */
+ int pi_block_cpu;
I can see that using int is in line with storing -1 into the field. But
generally CPU numbers should be unsigned. Please make it so,
using NR_CPUS or UINT_MAX in place of the -1.

Jan
Wu, Feng
2015-09-09 08:56:57 UTC
-----Original Message-----
Sent: Monday, September 07, 2015 8:55 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: Re: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Feng Wu
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1573,6 +1573,22 @@ static void __context_switch(void)
per_cpu(curr_vcpu, cpu) = n;
}
+static inline void pi_ctxt_switch_from(struct vcpu *prev)
+{
+ /*
+ * When switching from non-idle to idle, we only do a lazy context
switch.
Post by Feng Wu
+ * However, in order for posted interrupt (if available and enabled) to
+ * work properly, we at least need to update the descriptors.
+ */
+ if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
+ prev->arch.pi_ctxt_switch_from(prev);
+}
+
+static inline void pi_ctxt_switch_to(struct vcpu *next)
+{
+ if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
+ next->arch.pi_ctxt_switch_to(next);
+}
void context_switch(struct vcpu *prev, struct vcpu *next)
{
@@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct
vcpu *next)
Post by Feng Wu
set_current(next);
+ pi_ctxt_switch_from(prev);
+
if ( (per_cpu(curr_vcpu, cpu) == next) ||
(is_idle_domain(nextd) && cpu_online(cpu)) )
{
+ pi_ctxt_switch_to(next);
local_irq_enable();
This placement, if really intended that way, needs explanation (in a
comment) and perhaps even renaming of the involved symbols, as
I think maybe removing the static wrapper function can make things
clearer. What is your opinion?
looking at it from a general perspective it seems wrong (with
pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
want this only when switching back to what got switched out lazily
before, i.e. this would be not something to take place on an arbitrary
context switch). As to possible alternative names - maybe make the
hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
Post by Feng Wu
@@ -117,10 +119,20 @@ static int vmx_vcpu_initialise(struct vcpu *v)
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
+ v->arch.hvm_vmx.pi_block_cpu = -1;
+
+ spin_lock_init(&v->arch.hvm_vmx.pi_lock);
+
v->arch.schedule_tail = vmx_do_resume;
v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
+ if ( iommu_intpost && is_hvm_vcpu(v) )
+ {
+ v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
+ v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
+ }
Why conditional upon is_hvm_vcpu()?
I only think we need to add these hooks for PV guests, right?
Or you mean I should use has_hvm_container_vcpu() instead?
Post by Feng Wu
@@ -718,6 +730,140 @@ static void vmx_fpu_leave(struct vcpu *v)
}
}
+void arch_vcpu_wake_prepare(struct vcpu *v)
A non-static function with this name should not live in vmx.c.
Post by Feng Wu
+{
+ unsigned long gflags;
Just "flags" please.
Post by Feng Wu
+ if ( !iommu_intpost || !is_hvm_vcpu(v)
|| !has_arch_pdevs(v->domain) )
DYM !has_hvm_container_vcpu()?
Post by Feng Wu
+ return;
+
+ spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
+
+ if ( likely(vcpu_runnable(v)) ||
+ !test_bit(_VPF_blocked, &v->pause_flags) )
_VPF_blocked set implies !vcpu_runnable(), i.e. afaict just the
latter check would suffice if this is really what you mean.
Post by Feng Wu
+ {
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ unsigned long flags;
You don't need this - you already know IRQs are off for the lock
acquire below.
Post by Feng Wu
+ /*
+ * We don't need to send notification event to a non-running
+ * vcpu, the interrupt information will be delivered to it before
+ * VM-ENTRY when the vcpu is scheduled to run next time.
+ */
+ pi_set_sn(pi_desc);
+
+ /*
+ * Set 'NV' field back to posted_intr_vector, so the
+ * Posted-Interrupts can be delivered to the vCPU by
+ * VT-d HW after it is scheduled to run.
+ */
+ write_atomic((uint8_t*)&pi_desc->nv, posted_intr_vector);
Bogus cast.
Post by Feng Wu
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu),
flags);
Post by Feng Wu
+ list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
Shouldn't you set .pi_block_cpu back to -1 along with removing
the vCPU from the list? Which then ought to allow using just
list_del() here?
Here is the story about using list_del_init() instead of list_del(), (the
same reason for using list_del_init() in pi_wakeup_interrupt() ).
If we use list_del(), that means we need to set .pi_block_cpu to
-1 after removing the vCPU from the block list, there are two
places where the vCPU is removed from the list:
- arch_vcpu_wake_prepare()
- pi_wakeup_interrupt()

That means we also need to set .pi_block_cpu to -1 in
pi_wakeup_interrupt(), which seems problematic to me, since
.pi_block_cpu is used to get the per-CPU spin lock, it is not safe
to change it in pi_wakeup_interrupt(), maybe someone is using
it at the same time.

And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
is only used when .pi_block_cpu is set to -1 at the initial time.

If you have any good suggestions about this, I will be all ears.
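[Aside] The list_del() vs list_del_init() distinction this argument rests on: list_del_init() leaves the removed node pointing at itself, so a second removal attempt degenerates into a no-op, while a plain list_del() leaves the node's pointers stale (or poisoned, in Linux-style list implementations such as Xen's), so removing it twice corrupts the list or faults. A self-contained toy sketch (not Xen code) of the behaviour:

    #include <stdio.h>

    struct list_head { struct list_head *prev, *next; };

    void list_init(struct list_head *h) { h->prev = h->next = h; }

    void __list_del(struct list_head *e)
    {
        e->prev->next = e->next;
        e->next->prev = e->prev;
    }

    /* Plain removal: the node keeps stale prev/next pointers. */
    void list_del(struct list_head *e) { __list_del(e); }

    /* Removal plus re-initialisation: the node becomes a one-element
     * list of its own, so deleting it again is harmless. */
    void list_del_init(struct list_head *e) { __list_del(e); list_init(e); }

    void list_add_tail(struct list_head *e, struct list_head *h)
    {
        e->prev = h->prev;
        e->next = h;
        h->prev->next = e;
        h->prev = e;
    }

    int main(void)
    {
        struct list_head blocked_list, vcpu_entry;

        list_init(&blocked_list);
        list_add_tail(&vcpu_entry, &blocked_list); /* vCPU blocks           */
        list_del_init(&vcpu_entry);                /* pi_wakeup_interrupt() */
        list_del_init(&vcpu_entry);                /* wake path: no-op      */

        printf("list still consistent: %d\n",
               blocked_list.next == &blocked_list);
        return 0;
    }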
Post by Feng Wu
+static void vmx_pre_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ struct pi_desc old, new;
Please limit the scope of these two variables.
Post by Feng Wu
+ unsigned long flags, gflags;
See above.
Post by Feng Wu
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ spin_lock_irqsave(&v->arch.hvm_vmx.pi_lock, gflags);
+
+ if ( vcpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
See above.
Post by Feng Wu
+ {
+ /*
+ * The vCPU has been preempted or went to sleep. We don't need
to send
Post by Feng Wu
+ * notification event to a non-running vcpu, the interrupt
information
Post by Feng Wu
+ * will be delivered to it before VM-ENTRY when the vcpu is
scheduled
Post by Feng Wu
+ * to run next time.
+ */
+ pi_set_sn(pi_desc);
+
+ }
+ else if ( test_bit(_VPF_blocked, &v->pause_flags) )
The condition here is redundant with the if() one above.
Post by Feng Wu
+ {
+ /*
+ * The vCPU is blocking, we need to add it to one of the per pCPU
lists.
Post by Feng Wu
+ * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and
use it for
Post by Feng Wu
+ * the per-CPU list, we also save it to posted-interrupt descriptor
and
Post by Feng Wu
+ * make it as the destination of the wake-up notification event.
+ */
+ v->arch.hvm_vmx.pi_block_cpu = v->processor;
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+ list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
+ &per_cpu(pi_blocked_vcpu,
v->arch.hvm_vmx.pi_block_cpu));
Post by Feng Wu
+ spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu), flags);
+
+ do {
+ old.control = new.control = pi_desc->control;
+
+ /* Should not block the vCPU if an interrupt was posted for it.
*/
Post by Feng Wu
+ if ( pi_test_on(&old) )
+ {
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
gflags);
Post by Feng Wu
+ vcpu_unblock(v);
Calling vcpu_unblock() in the middle of context_switch()? Why? And
is this safe?
I cannot see anything unsafe so far, can some scheduler maintainer
help to confirm it? Dario? George?
And if really needed to cover something not dealt with
elsewhere, wouldn't this need to happen _after_ having switched
the notification mechanism below?
If ON is set, we don't need to block the vCPU hence no need to change the
notification vector to the wakeup one, which is used when vCPU is blocked.
Post by Feng Wu
+ return;
+ }
+
+ /*
+ * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
+ * so when external interrupts from assigned deivces
happen,
Post by Feng Wu
+ * wakeup notifiction event will go to
+ * v->arch.hvm_vmx.pi_block_cpu, then in
pi_wakeup_interrupt()
Post by Feng Wu
+ * we can find the vCPU in the right list to wake up.
+ */
+ if ( x2apic_enabled )
+ new.ndst =
cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
Post by Feng Wu
+ else
+ new.ndst = MASK_INSR(cpu_physical_id(
+ v->arch.hvm_vmx.pi_block_cpu),
+ PI_xAPIC_NDST_MASK);
Indentation is screwed up here. Perhaps it would help if you latched
cpu_phyiscal_id() into a local variable and used it on both branches?
Post by Feng Wu
+ pi_clear_sn(&new);
+ new.nv = pi_wakeup_vector;
+ } while ( cmpxchg(&pi_desc->control, old.control, new.control)
+ != old.control );
Operators belong on the earlier of the split up lines.
Post by Feng Wu
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ if ( x2apic_enabled )
+ write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+ else
+ write_atomic(&pi_desc->ndst,
+ MASK_INSR(cpu_physical_id(v->processor),
+ PI_xAPIC_NDST_MASK));
+
+ pi_clear_sn(pi_desc);
+}
So you alter where notifications go, but not via which vector? How
is the vCPU going to get removed from the blocked list then?
vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
in that case, the vector has been set properly and it has been removed
from the block list if it was blocked before.
Post by Feng Wu
@@ -756,6 +902,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
vmx_restore_guest_msrs(v);
vmx_restore_dr(v);
+ vmx_post_ctx_switch_pi(v);
}
Ah, here is the other invocation! But no, this shouldn't be done this
way. This really belongs into vendor independent code, even if the
indirect call overhead is slightly higher.
Why do you say this is vendor independent? What is your suggestion?

Thanks,
Feng
Post by Feng Wu
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -164,6 +164,14 @@ struct arch_vmx_struct {
struct list_head pi_blocked_vcpu_list;
struct list_head pi_vcpu_on_set_list;
+
+ /*
+ * Before vCPU is blocked, it is added to the global per-cpu list
+ * of 'pi_block_cpu', then VT-d engine can send wakeup notification
+ * event to 'pi_block_cpu' and wakeup the related vCPU.
+ */
+ int pi_block_cpu;
I can see that using int is in line with storing -1 into the field. But
generally CPU numbers should be unsigned. Please make it so,
using NR_CPUS or UINT_MAX in place of the -1.
Jan
Jan Beulich
2015-09-09 10:26:40 UTC
Post by Wu, Feng
Sent: Monday, September 07, 2015 8:55 PM
Post by Feng Wu
@@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct
vcpu *next)
Post by Feng Wu
set_current(next);
+ pi_ctxt_switch_from(prev);
+
if ( (per_cpu(curr_vcpu, cpu) == next) ||
(is_idle_domain(nextd) && cpu_online(cpu)) )
{
+ pi_ctxt_switch_to(next);
local_irq_enable();
This placement, if really intended that way, needs explanation (in a
comment) and perhaps even renaming of the involved symbols, as
I think maybe removing the static wrapper function can make things
clearer. What is your opinion?
Considering that there's going to be a second use of the
switch-to variant, I think the helpers are okay to be kept (but
I wouldn't insist on that), just that they need to be named in
a away making clear what their purpose is.
Post by Wu, Feng
Post by Feng Wu
@@ -117,10 +119,20 @@ static int vmx_vcpu_initialise(struct vcpu *v)
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
+ v->arch.hvm_vmx.pi_block_cpu = -1;
+
+ spin_lock_init(&v->arch.hvm_vmx.pi_lock);
+
v->arch.schedule_tail = vmx_do_resume;
v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
+ if ( iommu_intpost && is_hvm_vcpu(v) )
+ {
+ v->arch.pi_ctxt_switch_from = vmx_pre_ctx_switch_pi;
+ v->arch.pi_ctxt_switch_to = vmx_post_ctx_switch_pi;
+ }
Why conditional upon is_hvm_vcpu()?
I only think we need to add these hooks for PV guests, right?
"... we don't need to ..."?
Post by Wu, Feng
Or you mean I should use has_hvm_container_vcpu() instead?
Exactly.
Post by Wu, Feng
Post by Feng Wu
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu),
flags);
Post by Feng Wu
+ list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
Shouldn't you set .pi_block_cpu back to -1 along with removing
the vCPU from the list? Which then ought to allow using just
list_del() here?
Here is the story about using list_del_init() instead of list_del(), (the
same reason for using list_del_init() in pi_wakeup_interrupt() ).
If we use list_del(), that means we need to set .pi_block_cpu to
-1 after removing the vCPU from the block list, there are two
- arch_vcpu_wake_prepare()
- pi_wakeup_interrupt()
That means we also need to set .pi_block_cpu to -1 in
pi_wakeup_interrupt(), which seems problematic to me, since
.pi_block_cpu is used to get the per-CPU spin lock, it is not safe
to change it in pi_wakeup_interrupt(), maybe someone is using
it at the same time.
And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
is only used when .pi_block_cpu is set to -1 at the initial time.
If you have any good suggestions about this, I will be all ears.
Latching pi_block_cpu into a local variable would take care of that
part of the problem. Of course you then may also need to check
that while waiting to acquire the lock pi_block_cpu didn't change.
And if that absolutely can't be made to work correctly, these
apparently strange uses of list_del_init() would each need a
justifying comment.
Post by Wu, Feng
Post by Feng Wu
+ do {
+ old.control = new.control = pi_desc->control;
+
+ /* Should not block the vCPU if an interrupt was posted for it.
*/
Post by Feng Wu
+ if ( pi_test_on(&old) )
+ {
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
gflags);
Post by Feng Wu
+ vcpu_unblock(v);
Calling vcpu_unblock() in the middle of context_switch()? Why? And
is this safe?
I cannot see anything unsafe so far, can some scheduler maintainer
help to confirm it? Dario? George?
But first and foremost you ought to answer the "why".
Post by Wu, Feng
And if really needed to cover something not dealt with
elsewhere, wouldn't this need to happen _after_ having switched
the notification mechanism below?
If ON is set, we don't need to block the vCPU hence no need to change the
notification vector to the wakeup one, which is used when vCPU is blocked.
Post by Feng Wu
+ return;
Oh, right, I somehow managed to ignore the "return" here.
Post by Wu, Feng
Post by Feng Wu
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ if ( x2apic_enabled )
+ write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+ else
+ write_atomic(&pi_desc->ndst,
+ MASK_INSR(cpu_physical_id(v->processor),
+ PI_xAPIC_NDST_MASK));
+
+ pi_clear_sn(pi_desc);
+}
So you alter where notifications go, but not via which vector? How
is the vCPU going to get removed from the blocked list then?
vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
in that case, the vector has been set properly and it has been removed
from the block list if it was blocked before.
So why do you two updates then (first [elsewhere] to set the vector
and then here to set the destination)?
Post by Wu, Feng
Post by Feng Wu
@@ -756,6 +902,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
vmx_restore_guest_msrs(v);
vmx_restore_dr(v);
+ vmx_post_ctx_switch_pi(v);
}
Ah, here is the other invocation! But no, this shouldn't be done this
way. This really belongs into vendor independent code, even if the
indirect call overhead is slightly higher.
Why do you say this is vendor independent? What is your suggestion?
This needs to become a second invocation of pi_ctxt_switch_to()
from the context switch code.
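[Aside] Concretely, that would mean context_switch() in xen/arch/x86/domain.c invoking the vendor-neutral hook on the full (non-lazy) switch path too, instead of vmx_ctxt_switch_to() calling vmx_post_ctx_switch_pi() directly. A rough sketch of the placement (an illustration of the review comment only, not tested code):

    if ( (per_cpu(curr_vcpu, cpu) == next) ||
         (is_idle_domain(nextd) && cpu_online(cpu)) )
    {
        pi_ctxt_switch_to(next);
        local_irq_enable();
    }
    else
    {
        __context_switch();
        /* ... existing non-lazy switch work ... */

        /* second invocation, replacing the call in vmx_ctxt_switch_to() */
        pi_ctxt_switch_to(next);
    }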

Jan
Wu, Feng
2015-09-10 02:07:04 UTC
-----Original Message-----
Sent: Wednesday, September 09, 2015 6:27 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
Sent: Monday, September 07, 2015 8:55 PM
Post by Feng Wu
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu),
flags);
Post by Feng Wu
+ list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
Shouldn't you set .pi_block_cpu back to -1 along with removing
the vCPU from the list? Which then ought to allow using just
list_del() here?
Here is the story about using list_del_init() instead of list_del(), (the
same reason for using list_del_init() in pi_wakeup_interrupt() ).
If we use list_del(), that means we need to set .pi_block_cpu to
-1 after removing the vCPU from the block list, there are two
- arch_vcpu_wake_prepare()
- pi_wakeup_interrupt()
That means we also need to set .pi_block_cpu to -1 in
pi_wakeup_interrupt(), which seems problematic to me, since
.pi_block_cpu is used to get the per-CPU spin lock, it is not safe
to change it in pi_wakeup_interrupt(), maybe someone is using
it at the same time.
And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
is only used when .pi_block_cpu is set to -1 at the initial time.
If you have any good suggestions about this, I will be all ears.
Latching pi_block_cpu into a local variable would take care of that
part of the problem. Of course you then may also need to check
that while waiting to acquire the lock pi_block_cpu didn't change.
Thanks for the suggestion! But I think it is not easy to "check
"that while waiting to acquire the lock pi_block_cpu didn't change".
Let's take the following pseudo code as an example:

int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;

......

spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);
list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);

If .pi_block_cpu is changed to -1 by someone else while list_del()
above is running, that means the vCPU has already been removed from
the list, and calling list_del() here again would be problematic. I
think there might be other corner cases like this. So I suggest adding
some comments for list_del_init() as you mentioned below.
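[Aside] A sketch of the "latch, lock, re-check" approach Jan suggests, applied to the removal path. This is not part of the patch, the helper name is made up, and it assumes every remover (including pi_wakeup_interrupt()) resets pi_block_cpu to -1 while still holding that CPU's lock, so that the re-check below is meaningful:

    static void pi_remove_vcpu_from_blocked_list(struct vcpu *v)
    {
        int cpu = v->arch.hvm_vmx.pi_block_cpu;    /* latch before locking */
        unsigned long flags;

        if ( cpu == -1 )
            return;

        spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, cpu), flags);

        /*
         * Re-check under the lock: another CPU may have removed this vCPU
         * (and reset pi_block_cpu) while we were waiting for the lock, in
         * which case there is nothing left to do.
         */
        if ( v->arch.hvm_vmx.pi_block_cpu == cpu )
        {
            list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
            v->arch.hvm_vmx.pi_block_cpu = -1;
        }

        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, cpu), flags);
    }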
And if that absolutely can't be made to work correctly, these
apparently strange uses of list_del_init() would each need a
justifying comment.
Post by Wu, Feng
Post by Feng Wu
+ do {
+ old.control = new.control = pi_desc->control;
+
+ /* Should not block the vCPU if an interrupt was posted for
it.
Post by Wu, Feng
*/
Post by Feng Wu
+ if ( pi_test_on(&old) )
+ {
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
gflags);
Post by Feng Wu
+ vcpu_unblock(v);
Calling vcpu_unblock() in the middle of context_switch()? Why? And
is this safe?
I cannot see anything unsafe so far, can some scheduler maintainer
help to confirm it? Dario? George?
But first and foremost you ought to answer the "why".
I think if you believe this solution is unsafe, you need to point out
where the problem is, not just ask "is this safe?"; I don't think that
is an effective comment.

Anyway, I went through the main path of the code, shown below, and I
really can't find anything unsafe here.

context_switch() -->
pi_ctxt_switch_from() -->
vmx_pre_ctx_switch_pi() -->
vcpu_unblock() -->
vcpu_wake() -->
SCHED_OP(VCPU2OP(v), wake, v) -->
csched_vcpu_wake() -->
__runq_insert()
__runq_tickle()

If you continue to think it is unsafe, pointing out the place will be welcome!
Post by Wu, Feng
And if really needed to cover something not dealt with
elsewhere, wouldn't this need to happen _after_ having switched
the notification mechanism below?
If ON is set, we don't need to block the vCPU hence no need to change the
notification vector to the wakeup one, which is used when vCPU is blocked.
Post by Feng Wu
+ return;
Oh, right, I somehow managed to ignore the "return" here.
Post by Wu, Feng
Post by Feng Wu
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ if ( x2apic_enabled )
+ write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+ else
+ write_atomic(&pi_desc->ndst,
+ MASK_INSR(cpu_physical_id(v->processor),
+ PI_xAPIC_NDST_MASK));
+
+ pi_clear_sn(pi_desc);
+}
So you alter where notifications go, but not via which vector? How
is the vCPU going to get removed from the blocked list then?
vmx_post_ctx_switch_pi() is called when the vCPU is about to run; at that
point the vector has already been set properly, and the vCPU has been removed
from the blocked list if it was blocked before.
So why do you do two updates then (first [elsewhere] to set the vector
and then here to set the destination)?
When the vCPU is unblocked and removed from the blocking list, we need to
change the vector back to the normal notification vector, since the wakeup
vector is only used while the vCPU is blocked and hence on the blocked list.
And this is the place where we know which pCPU the vCPU will run on, so we
update the destination field here.

Thanks,
Feng
Jan Beulich
2015-09-10 08:27:43 UTC
Permalink
Post by Wu, Feng
Sent: Wednesday, September 09, 2015 6:27 PM
Post by Wu, Feng
Sent: Monday, September 07, 2015 8:55 PM
Post by Feng Wu
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu),
flags);
Post by Feng Wu
+ list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
Shouldn't you set .pi_block_cpu back to -1 along with removing
the vCPU from the list? Which then ought to allow using just
list_del() here?
Here is the story about using list_del_init() instead of list_del(), (the
same reason for using list_del_init() in pi_wakeup_interrupt() ).
If we use list_del(), that means we need to set .pi_block_cpu to
-1 after removing the vCPU from the block list, there are two
- arch_vcpu_wake_prepare()
- pi_wakeup_interrupt()
That means we also need to set .pi_block_cpu to -1 in
pi_wakeup_interrupt(), which seems problematic to me, since
.pi_block_cpu is used to get the per-CPU spin lock, it is not safe
to change it in pi_wakeup_interrupt(), maybe someone is using
it at the same time.
And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
is only used when .pi_block_cpu is set to -1 at the initial time.
If you have any good suggestions about this, I will be all ears.
Latching pi_block_cpu into a local variable would take care of that
part of the problem. Of course you then may also need to check
that while waiting to acquire the lock pi_block_cpu didn't change.
Thanks for the suggestion! But I think it is not easy to "check
"that while waiting to acquire the lock pi_block_cpu didn't change".
int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
......
spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);
list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);
If .pi_block_cpu is changed to -1 by others during calling list_del()
above, that means the vCPU is removed by others, then calling
list_del() here again would be problematic. I think there might be
other corner cases for this. So I suggest adding some comments
for list_del_init() as you mentioned below.
That's why I said "check that while waiting to acquire the lock
pi_block_cpu didn't change." The code you present does not do
this.
Post by Wu, Feng
Post by Wu, Feng
Post by Feng Wu
+ do {
+ old.control = new.control = pi_desc->control;
+
+ /* Should not block the vCPU if an interrupt was posted for
it.
Post by Wu, Feng
*/
Post by Feng Wu
+ if ( pi_test_on(&old) )
+ {
+ spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
gflags);
Post by Feng Wu
+ vcpu_unblock(v);
Calling vcpu_unblock() in the middle of context_switch()? Why? And
is this safe?
I cannot see anything unsafe so far, can some scheduler maintainer
help to confirm it? Dario? George?
But first and foremost you ought to answer the "why".
I think if you think this solution is unsafe, you need point out where it is, not
just ask "is this safe ?", I don't think this is an effective comments.
It is my general understanding that people wanting code to be
included have to prove their code is okay, not reviewers to prove
the code is broken.
Post by Wu, Feng
Anyway, I go through the main path of the code as below, I really don't find
anything unsafe here.
context_switch() -->
pi_ctxt_switch_from() -->
vmx_pre_ctx_switch_pi() -->
vcpu_unblock() -->
vcpu_wake() -->
SCHED_OP(VCPU2OP(v), wake, v) -->
csched_vcpu_wake() -->
__runq_insert()
__runq_tickle()
If you continue to think it is unsafe, pointing out the place will be welcome!
Once again - I didn't say it's unsafe (and hence can't point at a
particular place). What bothers me is that vcpu_unblock() affects
scheduler state, and altering scheduler state from inside the
context switch path looks problematic at the first glance. I'd be
happy if I was told there is no such problem, but be aware that
this would have to include latent ones: Even if right now things
are safe, the fact that such a feedback path has the potential of becoming
an actual problem with a later scheduler change is already an
issue, as finding the problem is then likely going to be rather hard
(and I suspect it's not going to be you to debug this).
Post by Wu, Feng
Post by Wu, Feng
Post by Feng Wu
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ if ( x2apic_enabled )
+ write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+ else
+ write_atomic(&pi_desc->ndst,
+ MASK_INSR(cpu_physical_id(v->processor),
+ PI_xAPIC_NDST_MASK));
+
+ pi_clear_sn(pi_desc);
+}
So you alter where notifications go, but not via which vector? How
is the vCPU going to get removed from the blocked list then?
vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
in that case, the vector has been set properly and it has been removed
from the block list if it was blocked before.
So why do you two updates then (first [elsewhere] to set the vector
and then here to set the destination)?
When the vCPU is unblocked and then removed from the blocking list, we
need to change the vector to the normal notification vector, since the
wakeup vector is only used when the vCPU is blocked and hence in the
blocked list. And here is the place we can decide which pCPU the vCPU
will be scheduled on, so we change the destination field here.
Right, that's what I understood. But isn't the state things are in
between these two updates inconsistent? I.e. - where does the
notification go if one arrives in this window? Particularly - if it went
to the right place, why would this second update be needed at all?

Jan
Wu, Feng
2015-09-10 08:59:49 UTC
Permalink
-----Original Message-----
Sent: Thursday, September 10, 2015 4:28 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
Sent: Wednesday, September 09, 2015 6:27 PM
Post by Wu, Feng
Sent: Monday, September 07, 2015 8:55 PM
Post by Feng Wu
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu),
flags);
Post by Feng Wu
+ list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
Shouldn't you set .pi_block_cpu back to -1 along with removing
the vCPU from the list? Which then ought to allow using just
list_del() here?
Here is the story about using list_del_init() instead of list_del(), (the
same reason for using list_del_init() in pi_wakeup_interrupt() ).
If we use list_del(), that means we need to set .pi_block_cpu to
-1 after removing the vCPU from the block list, there are two
- arch_vcpu_wake_prepare()
- pi_wakeup_interrupt()
That means we also need to set .pi_block_cpu to -1 in
pi_wakeup_interrupt(), which seems problematic to me, since
.pi_block_cpu is used to get the per-CPU spin lock, it is not safe
to change it in pi_wakeup_interrupt(), maybe someone is using
it at the same time.
And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
is only used when .pi_block_cpu is set to -1 at the initial time.
If you have any good suggestions about this, I will be all ears.
Latching pi_block_cpu into a local variable would take care of that
part of the problem. Of course you then may also need to check
that while waiting to acquire the lock pi_block_cpu didn't change.
Thanks for the suggestion! But I think it is not easy to "check
"that while waiting to acquire the lock pi_block_cpu didn't change".
int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
......
spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);
list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);
If .pi_block_cpu is changed to -1 by others during calling list_del()
above, that means the vCPU is removed by others, then calling
list_del() here again would be problematic. I think there might be
other corner cases for this. So I suggest adding some comments
for list_del_init() as you mentioned below.
That's why I said "check that while waiting to acquire the lock
pi_block_cpu didn't change." The code you present does not do
this.
I didn't mean we need to implement it as the code above; I just listed it
here to show that it is hard to do.
First, how do we check, while waiting to acquire the lock, that .pi_block_cpu
didn't change?
Secondly, even if we can check it, what should we do if .pi_block_cpu
has changed after acquiring the lock, as I mentioned above?
Post by Wu, Feng
Post by Wu, Feng
Post by Feng Wu
+ do {
+ old.control = new.control = pi_desc->control;
+
+ /* Should not block the vCPU if an interrupt was posted
for
Post by Wu, Feng
it.
Post by Wu, Feng
*/
Post by Feng Wu
+ if ( pi_test_on(&old) )
+ {
+
spin_unlock_irqrestore(&v->arch.hvm_vmx.pi_lock,
Post by Wu, Feng
Post by Wu, Feng
gflags);
Post by Feng Wu
+ vcpu_unblock(v);
Calling vcpu_unblock() in the middle of context_switch()? Why? And
is this safe?
I cannot see anything unsafe so far, can some scheduler maintainer
help to confirm it? Dario? George?
But first and foremost you ought to answer the "why".
I think if you think this solution is unsafe, you need point out where it is, not
just ask "is this safe ?", I don't think this is an effective comments.
It is my general understanding that people wanting code to be
included have to prove their code is okay, not reviewers to prove
the code is broken.
Post by Wu, Feng
Anyway, I go through the main path of the code as below, I really don't find
anything unsafe here.
context_switch() -->
pi_ctxt_switch_from() -->
vmx_pre_ctx_switch_pi() -->
vcpu_unblock() -->
vcpu_wake() -->
SCHED_OP(VCPU2OP(v), wake, v) -->
csched_vcpu_wake() -->
__runq_insert()
__runq_tickle()
If you continue to think it is unsafe, pointing out the place will be welcome!
Once again - I didn't say it's unsafe (and hence can't point at a
particular place). What bothers me is that vcpu_unblock() affects
scheduler state, and altering scheduler state from inside the
context switch path looks problematic at the first glance. I'd be
happy if I was told there is no such problem, but be aware that
this would have to include latent ones: Even if right now things
are safe, having such feedback have the potential of becoming
an actual problem with a later scheduler change is already an
issue, as finding the problem is then likely going to be rather hard
(and I suspect it's not going to be you to debug this).
What I can say is that after investigation I haven't found anything bad
about this vcpu_unblock(), and I tested it in my experiments. That is why
I'd like to ask for ideas from the scheduler experts.
Post by Wu, Feng
Post by Wu, Feng
Post by Feng Wu
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ if ( x2apic_enabled )
+ write_atomic(&pi_desc->ndst,
cpu_physical_id(v->processor));
Post by Wu, Feng
Post by Wu, Feng
Post by Feng Wu
+ else
+ write_atomic(&pi_desc->ndst,
+ MASK_INSR(cpu_physical_id(v->processor),
+ PI_xAPIC_NDST_MASK));
+
+ pi_clear_sn(pi_desc);
+}
So you alter where notifications go, but not via which vector? How
is the vCPU going to get removed from the blocked list then?
vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
in that case, the vector has been set properly and it has been removed
from the block list if it was blocked before.
So why do you two updates then (first [elsewhere] to set the vector
and then here to set the destination)?
When the vCPU is unblocked and then removed from the blocking list, we
need to change the vector to the normal notification vector, since the
wakeup vector is only used when the vCPU is blocked and hence in the
blocked list. And here is the place we can decide which pCPU the vCPU
will be scheduled on, so we change the destination field here.
Right, that's what I understood. But isn't the state things are in
between these two updates inconsistent? I.e. - where does the
notification go if one arrives in this window?
Before arriving here, SN is set, so no notification event is sent and
hence there is no notification at all.

Particularly - if it went
to the right place, why would this second update be needed at all?
So you mean we should read NDST from the PI descriptor, compare it with
cpu_physical_id(v->processor), and update it only if they differ? I don't
think that check is needed, since it would take at least a read operation,
an if check, and then possibly a write operation anyway.

Thanks,
Feng
Jan
Jan Beulich
2015-09-10 09:26:29 UTC
Permalink
Post by Wu, Feng
Sent: Thursday, September 10, 2015 4:28 PM
Post by Wu, Feng
Sent: Wednesday, September 09, 2015 6:27 PM
Post by Wu, Feng
Sent: Monday, September 07, 2015 8:55 PM
Post by Feng Wu
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+ v->arch.hvm_vmx.pi_block_cpu),
flags);
Post by Feng Wu
+ list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
Shouldn't you set .pi_block_cpu back to -1 along with removing
the vCPU from the list? Which then ought to allow using just
list_del() here?
Here is the story about using list_del_init() instead of list_del(), (the
same reason for using list_del_init() in pi_wakeup_interrupt() ).
If we use list_del(), that means we need to set .pi_block_cpu to
-1 after removing the vCPU from the block list, there are two
- arch_vcpu_wake_prepare()
- pi_wakeup_interrupt()
That means we also need to set .pi_block_cpu to -1 in
pi_wakeup_interrupt(), which seems problematic to me, since
.pi_block_cpu is used to get the per-CPU spin lock, it is not safe
to change it in pi_wakeup_interrupt(), maybe someone is using
it at the same time.
And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
is only used when .pi_block_cpu is set to -1 at the initial time.
If you have any good suggestions about this, I will be all ears.
Latching pi_block_cpu into a local variable would take care of that
part of the problem. Of course you then may also need to check
that while waiting to acquire the lock pi_block_cpu didn't change.
Thanks for the suggestion! But I think it is not easy to "check
"that while waiting to acquire the lock pi_block_cpu didn't change".
int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
......
spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);
list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);
If .pi_block_cpu is changed to -1 by others during calling list_del()
above, that means the vCPU is removed by others, then calling
list_del() here again would be problematic. I think there might be
other corner cases for this. So I suggest adding some comments
for list_del_init() as you mentioned below.
That's why I said "check that while waiting to acquire the lock
pi_block_cpu didn't change." The code you present does not do
this.
I didn't see we need implement it as the above code, I just list it
here the show it is hard to do that.
First, how to check it while waiting to acquire the lock .pi_block_cpu
didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
Post by Wu, Feng
Secondly, even if we can check it, what should we do if .pi_block_cpu
is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)

restart:
local_pi_block_cpu = ...;
bail-if-invalid (e.g. -1 in current model)
spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
if(local_pi_block_cpu != actual_pi_block_cpu) {
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
goto restart;
}
list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
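
(For illustration, a minimal C sketch of this latch-and-recheck pattern,
assuming every assignment to .pi_block_cpu and every list manipulation
happens under the corresponding per-CPU lock; the helper name is
hypothetical and not part of the patch:)

static void pi_remove_vcpu_from_blocked_list(struct vcpu *v)
{
    unsigned long flags;
    int pi_block_cpu;

 retry:
    /* Latch the field; bail if the vCPU is not on any blocked list. */
    pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
    if ( pi_block_cpu == -1 )
        return;

    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu), flags);

    /* The field may have changed while we were waiting for the lock. */
    if ( pi_block_cpu != v->arch.hvm_vmx.pi_block_cpu )
    {
        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
                               flags);
        goto retry;
    }

    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
    v->arch.hvm_vmx.pi_block_cpu = -1;

    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
                           flags);
}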
Post by Wu, Feng
Post by Wu, Feng
Anyway, I go through the main path of the code as below, I really don't find
anything unsafe here.
context_switch() -->
pi_ctxt_switch_from() -->
vmx_pre_ctx_switch_pi() -->
vcpu_unblock() -->
vcpu_wake() -->
SCHED_OP(VCPU2OP(v), wake, v) -->
csched_vcpu_wake() -->
__runq_insert()
__runq_tickle()
If you continue to think it is unsafe, pointing out the place will be welcome!
Once again - I didn't say it's unsafe (and hence can't point at a
particular place). What bothers me is that vcpu_unblock() affects
scheduler state, and altering scheduler state from inside the
context switch path looks problematic at the first glance. I'd be
happy if I was told there is no such problem, but be aware that
this would have to include latent ones: Even if right now things
are safe, having such feedback have the potential of becoming
an actual problem with a later scheduler change is already an
issue, as finding the problem is then likely going to be rather hard
(and I suspect it's not going to be you to debug this).
What I can say is that after investigation, I don't find anything bad
for this vcpu_unblock(), I tested it in my experiment. So that is why
I'd like to ask some ideas from scheduler expects.
Agreed - I'm also awaiting their input.
Post by Wu, Feng
Post by Wu, Feng
Post by Wu, Feng
Post by Feng Wu
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+ struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+ if ( !has_arch_pdevs(v->domain) )
+ return;
+
+ if ( x2apic_enabled )
+ write_atomic(&pi_desc->ndst,
cpu_physical_id(v->processor));
Post by Wu, Feng
Post by Wu, Feng
Post by Feng Wu
+ else
+ write_atomic(&pi_desc->ndst,
+ MASK_INSR(cpu_physical_id(v->processor),
+ PI_xAPIC_NDST_MASK));
+
+ pi_clear_sn(pi_desc);
+}
So you alter where notifications go, but not via which vector? How
is the vCPU going to get removed from the blocked list then?
vmx_post_ctx_switch_pi() is called when the vCPU is about to run,
in that case, the vector has been set properly and it has been removed
from the block list if it was blocked before.
So why do you two updates then (first [elsewhere] to set the vector
and then here to set the destination)?
When the vCPU is unblocked and then removed from the blocking list, we
need to change the vector to the normal notification vector, since the
wakeup vector is only used when the vCPU is blocked and hence in the
blocked list. And here is the place we can decide which pCPU the vCPU
will be scheduled on, so we change the destination field here.
Right, that's what I understood. But isn't the state things are in
between these two updates inconsistent? I.e. - where does the
notification go if one arrives in this window?
Before arriving here, the SN is set, no need to send notification event and
hence no notification at all.
Ah, okay.

Jan
Wu, Feng
2015-09-10 09:41:35 UTC
Permalink
-----Original Message-----
Sent: Thursday, September 10, 2015 5:26 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
Sent: Thursday, September 10, 2015 4:28 PM
Post by Wu, Feng
Sent: Wednesday, September 09, 2015 6:27 PM
Post by Wu, Feng
Sent: Monday, September 07, 2015 8:55 PM
Post by Feng Wu
+
+ /*
+ * Delete the vCPU from the related block list
+ * if we are resuming from blocked state.
+ */
+ if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+ {
+ spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+
v->arch.hvm_vmx.pi_block_cpu),
Post by Wu, Feng
Post by Wu, Feng
Post by Wu, Feng
flags);
Post by Feng Wu
+
list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
Post by Wu, Feng
Post by Wu, Feng
Post by Wu, Feng
Shouldn't you set .pi_block_cpu back to -1 along with removing
the vCPU from the list? Which then ought to allow using just
list_del() here?
Here is the story about using list_del_init() instead of list_del(), (the
same reason for using list_del_init() in pi_wakeup_interrupt() ).
If we use list_del(), that means we need to set .pi_block_cpu to
-1 after removing the vCPU from the block list, there are two
- arch_vcpu_wake_prepare()
- pi_wakeup_interrupt()
That means we also need to set .pi_block_cpu to -1 in
pi_wakeup_interrupt(), which seems problematic to me, since
.pi_block_cpu is used to get the per-CPU spin lock, it is not safe
to change it in pi_wakeup_interrupt(), maybe someone is using
it at the same time.
And the check "if ( v->arch.hvm_vmx.pi_block_cpu != -1 )" here
is only used when .pi_block_cpu is set to -1 at the initial time.
If you have any good suggestions about this, I will be all ears.
Latching pi_block_cpu into a local variable would take care of that
part of the problem. Of course you then may also need to check
that while waiting to acquire the lock pi_block_cpu didn't change.
Thanks for the suggestion! But I think it is not easy to "check
"that while waiting to acquire the lock pi_block_cpu didn't change".
int local_pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
......
spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);
list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
local_pi_block_cpu), flags);
If .pi_block_cpu is changed to -1 by others during calling list_del()
above, that means the vCPU is removed by others, then calling
list_del() here again would be problematic. I think there might be
other corner cases for this. So I suggest adding some comments
for list_del_init() as you mentioned below.
That's why I said "check that while waiting to acquire the lock
pi_block_cpu didn't change." The code you present does not do
this.
I didn't see we need implement it as the above code, I just list it
here the show it is hard to do that.
First, how to check it while waiting to acquire the lock .pi_block_cpu
didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
Post by Wu, Feng
Secondly, even if we can check it, what should we do if .pi_block_cpu
is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)
local_pi_block_cpu = ...;
bail-if-invalid (e.g. -1 in current model)
spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
if(local_pi_block_cpu != actual_pi_block_cpu) {
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
goto restart;
}
Thanks a lot for showing me this pseudo code! My concern is that if
.pi_block_cpu is changed to -1 at this point, it doesn't work:
.pi_block_cpu being -1 here means the vCPU has been removed from
the blocking list by someone else, so we cannot delete it again via
list_del() here.

BTW, I cannot see any performance overhead for list_del_init()
compared to list_del().

list_del():
static inline void list_del(struct list_head *entry)
{
ASSERT(entry->next->prev == entry);
ASSERT(entry->prev->next == entry);
__list_del(entry->prev, entry->next);
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
}

list_del_init():
static inline void list_del_init(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
INIT_LIST_HEAD(entry);
}

Thanks,
Feng
list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
Jan Beulich
2015-09-10 10:01:06 UTC
Permalink
Post by Wu, Feng
Sent: Thursday, September 10, 2015 5:26 PM
Post by Wu, Feng
First, how to check it while waiting to acquire the lock .pi_block_cpu
didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
Post by Wu, Feng
Secondly, even if we can check it, what should we do if .pi_block_cpu
is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)
local_pi_block_cpu = ...;
bail-if-invalid (e.g. -1 in current model)
spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
if(local_pi_block_cpu != actual_pi_block_cpu) {
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
goto restart;
}
Thanks a lot for showing me this pseudo code! My concern is if
.pi_block_vcpu is changed to -1 at this point, it doesn't work.
.pi_block_vcpu being -1 here means the vCPU is remove from
the blocking list by others, then we cannot delete it again via
list_del() here.
Did you miss the "bail-if-invalid" above?
Post by Wu, Feng
BTW, I cannot see performance overhead for list_del_init()
compared to list_del().
static inline void list_del(struct list_head *entry)
{
ASSERT(entry->next->prev == entry);
ASSERT(entry->prev->next == entry);
__list_del(entry->prev, entry->next);
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
}
static inline void list_del_init(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
INIT_LIST_HEAD(entry);
}
Well, yes, both do two stores (I forgot about the poisoning), but
arguably the poisoning could become a debug-build-only thing. I.e.
it is an implementation detail that the number of stores currently
is the same. From an abstract perspective one should still prefer
list_del() when the re-init isn't really needed. And in the specific
case here asking you to use list_del() makes sure the code ends
up not even trying the deletion when not needed.

Jan
Wu, Feng
2015-09-10 12:34:53 UTC
Permalink
-----Original Message-----
Sent: Thursday, September 10, 2015 6:01 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
Sent: Thursday, September 10, 2015 5:26 PM
Post by Wu, Feng
First, how to check it while waiting to acquire the lock .pi_block_cpu
didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
Post by Wu, Feng
Secondly, even if we can check it, what should we do if .pi_block_cpu
is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)
local_pi_block_cpu = ...;
bail-if-invalid (e.g. -1 in current model)
spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
if(local_pi_block_cpu != actual_pi_block_cpu) {
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
goto restart;
}
Thanks a lot for showing me this pseudo code! My concern is if
.pi_block_vcpu is changed to -1 at this point, it doesn't work.
.pi_block_vcpu being -1 here means the vCPU is remove from
the blocking list by others, then we cannot delete it again via
list_del() here.
Did you miss the "bail-if-invalid" above?
I am sorry, am I missing something here? If .pi_block_cpu becomes
-1 at this point (after the above 'if' statement has completed with
local_pi_block_cpu == actual_pi_block_cpu), how can the "bail-if-invalid"
above help?

Thanks,
Feng
Post by Wu, Feng
BTW, I cannot see performance overhead for list_del_init()
compared to list_del().
static inline void list_del(struct list_head *entry)
{
ASSERT(entry->next->prev == entry);
ASSERT(entry->prev->next == entry);
__list_del(entry->prev, entry->next);
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
}
static inline void list_del_init(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
INIT_LIST_HEAD(entry);
}
Well, yes, both do two stores (I forgot about the poisoning), but
arguably the poisoning could become a debug-build-only thing. I.e.
it is an implementation detail that the number of stores currently
is the same. From an abstract perspective one should still prefer
list_del() when the re-init isn't really needed. And in the specific
case here asking you to use list_del() makes sure the code ends
up not even trying the deletion when not needed.
Jan
Jan Beulich
2015-09-10 12:44:30 UTC
Permalink
Post by Wu, Feng
-----Original Message-----
Sent: Thursday, September 10, 2015 6:01 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
Sent: Thursday, September 10, 2015 5:26 PM
Post by Wu, Feng
First, how to check it while waiting to acquire the lock .pi_block_cpu
didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
Post by Wu, Feng
Secondly, even if we can check it, what should we do if .pi_block_cpu
is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)
local_pi_block_cpu = ...;
bail-if-invalid (e.g. -1 in current model)
spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
if(local_pi_block_cpu != actual_pi_block_cpu) {
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
goto restart;
}
Thanks a lot for showing me this pseudo code! My concern is if
.pi_block_vcpu is changed to -1 at this point, it doesn't work.
.pi_block_vcpu being -1 here means the vCPU is remove from
the blocking list by others, then we cannot delete it again via
list_del() here.
Did you miss the "bail-if-invalid" above?
I am sorry, do I miss something here? If .pi_block_cpu becomes
-1 here (after the above 'if' statement is finished with
local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
above help?
The (obvious I thought) implication is that all assignments to
pi_block_cpu (along with all list manipulations) now need to happen
with the lock held.

Jan
Wu, Feng
2015-09-10 12:58:18 UTC
Permalink
-----Original Message-----
Sent: Thursday, September 10, 2015 8:45 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
-----Original Message-----
Sent: Thursday, September 10, 2015 6:01 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
posted
Post by Wu, Feng
interrupts
Post by Wu, Feng
Sent: Thursday, September 10, 2015 5:26 PM
Post by Wu, Feng
First, how to check it while waiting to acquire the lock .pi_block_cpu
didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
Post by Wu, Feng
Secondly, even if we can check it, what should we do if .pi_block_cpu
is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)
local_pi_block_cpu = ...;
bail-if-invalid (e.g. -1 in current model)
spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
if(local_pi_block_cpu != actual_pi_block_cpu) {
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
goto restart;
}
Thanks a lot for showing me this pseudo code! My concern is if
.pi_block_vcpu is changed to -1 at this point, it doesn't work.
.pi_block_vcpu being -1 here means the vCPU is remove from
the blocking list by others, then we cannot delete it again via
list_del() here.
Did you miss the "bail-if-invalid" above?
I am sorry, do I miss something here? If .pi_block_cpu becomes
-1 here (after the above 'if' statement is finished with
local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
above help?
The (obvious I thought) implication is that all assignments to
pi_block_cpu (along with all list manipulations) now need to happen
with the lock held.
If all assignments to pi_block_cpu happen with the lock held, I don't think
we need the above check; we can safely use .pi_block_cpu, right?

Thanks,
Feng
Jan
Jan Beulich
2015-09-10 13:15:22 UTC
Permalink
Post by Wu, Feng
-----Original Message-----
Sent: Thursday, September 10, 2015 8:45 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
-----Original Message-----
Sent: Thursday, September 10, 2015 6:01 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
posted
Post by Wu, Feng
interrupts
Post by Wu, Feng
Sent: Thursday, September 10, 2015 5:26 PM
Post by Wu, Feng
First, how to check it while waiting to acquire the lock .pi_block_cpu
didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
Post by Wu, Feng
Secondly, even if we can check it, what should we do if .pi_block_cpu
is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)
local_pi_block_cpu = ...;
bail-if-invalid (e.g. -1 in current model)
spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
if(local_pi_block_cpu != actual_pi_block_cpu) {
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu), flags);
goto restart;
}
Thanks a lot for showing me this pseudo code! My concern is if
.pi_block_vcpu is changed to -1 at this point, it doesn't work.
.pi_block_vcpu being -1 here means the vCPU is remove from
the blocking list by others, then we cannot delete it again via
list_del() here.
Did you miss the "bail-if-invalid" above?
I am sorry, do I miss something here? If .pi_block_cpu becomes
-1 here (after the above 'if' statement is finished with
local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
above help?
The (obvious I thought) implication is that all assignments to
pi_block_cpu (along with all list manipulations) now need to happen
with the lock held.
If all the assignment to pi_block_cpu is with one lock held, I don't think
we need to above checkout, we can safely use .pi_block_cpu, right?
No. In order to use it you need to make sure it's valid (or else
there's no valid lock to acquire). And once you have acquired the
lock, you have to check whether it changed / became invalid.
Plus, the up-front check allows you to avoid acquiring the lock in the
first place when the field is invalid anyway.

Jan
Wu, Feng
2015-09-10 13:27:55 UTC
Permalink
-----Original Message-----
Sent: Thursday, September 10, 2015 9:15 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
-----Original Message-----
Sent: Thursday, September 10, 2015 8:45 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
posted
Post by Wu, Feng
interrupts
Post by Wu, Feng
-----Original Message-----
Sent: Thursday, September 10, 2015 6:01 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
posted
Post by Wu, Feng
interrupts
Post by Wu, Feng
Sent: Thursday, September 10, 2015 5:26 PM
Post by Wu, Feng
First, how to check it while waiting to acquire the lock .pi_block_cpu
didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
Post by Wu, Feng
Secondly, even if we can check it, what should we do if .pi_block_cpu
is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)
local_pi_block_cpu = ...;
bail-if-invalid (e.g. -1 in current model)
spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
if(local_pi_block_cpu != actual_pi_block_cpu) {
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu),
flags);
Post by Wu, Feng
Post by Wu, Feng
Post by Wu, Feng
goto restart;
}
Thanks a lot for showing me this pseudo code! My concern is if
.pi_block_vcpu is changed to -1 at this point, it doesn't work.
.pi_block_vcpu being -1 here means the vCPU is remove from
the blocking list by others, then we cannot delete it again via
list_del() here.
Did you miss the "bail-if-invalid" above?
I am sorry, do I miss something here? If .pi_block_cpu becomes
-1 here (after the above 'if' statement is finished with
local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
above help?
The (obvious I thought) implication is that all assignments to
pi_block_cpu (along with all list manipulations) now need to happen
with the lock held.
If all the assignment to pi_block_cpu is with one lock held, I don't think
we need to above checkout, we can safely use .pi_block_cpu, right?
No. In order to use it you need to make sure it's valid (or else
there's no valid lock to acquire). And once you acquired the
lock, you have to check whether changed / became invalid.
If all assignments to .pi_block_cpu happen with the lock held,
how can it get changed / become invalid once I have acquired
the lock?

Thanks,
Feng
Plus the up front check allows to avoid acquiring the lock in the
first place when the field is invalid anyway.
Jan
Jan Beulich
2015-09-10 14:01:09 UTC
Permalink
Post by Wu, Feng
-----Original Message-----
Sent: Thursday, September 10, 2015 9:15 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
-----Original Message-----
Sent: Thursday, September 10, 2015 8:45 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
posted
Post by Wu, Feng
interrupts
Post by Wu, Feng
-----Original Message-----
Sent: Thursday, September 10, 2015 6:01 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d
posted
Post by Wu, Feng
interrupts
Post by Wu, Feng
Sent: Thursday, September 10, 2015 5:26 PM
Post by Wu, Feng
First, how to check it while waiting to acquire the lock .pi_block_cpu
didn't change?
Note the difference between "check while waiting" and "check that
while waiting": The former is indeed hard to implement, while the
latter is pretty straightforward (and we do so elsewhere).
Post by Wu, Feng
Secondly, even if we can check it, what should we do if .pi_block_cpu
is changed after acquiring the lock as I mentioned above?
Drop the lock and start over. I.e. (taking your pseudo code)
local_pi_block_cpu = ...;
bail-if-invalid (e.g. -1 in current model)
spin_lock_irqsave(&per_cpu(, local_pi_block_cpu), flags);
if(local_pi_block_cpu != actual_pi_block_cpu) {
spin_unlock_irqrestore(&per_cpu(,local_pi_block_cpu),
flags);
Post by Wu, Feng
Post by Wu, Feng
Post by Wu, Feng
goto restart;
}
Thanks a lot for showing me this pseudo code! My concern is if
.pi_block_vcpu is changed to -1 at this point, it doesn't work.
.pi_block_vcpu being -1 here means the vCPU is remove from
the blocking list by others, then we cannot delete it again via
list_del() here.
Did you miss the "bail-if-invalid" above?
I am sorry, do I miss something here? If .pi_block_cpu becomes
-1 here (after the above 'if' statement is finished with
local_pi_block_cpu == actual_pi_block_cpu ), how can "bail-if-invalid"
above help?
The (obvious I thought) implication is that all assignments to
pi_block_cpu (along with all list manipulations) now need to happen
with the lock held.
If all the assignment to pi_block_cpu is with one lock held, I don't think
we need to above checkout, we can safely use .pi_block_cpu, right?
No. In order to use it you need to make sure it's valid (or else
there's no valid lock to acquire). And once you acquired the
lock, you have to check whether changed / became invalid.
If all the assignment to .pi_block_cpu is with one lock held,
how can it get changed / become invalid then once I acquired
the lock?
Again - if pi_block_cpu is invalid, which lock do you want to acquire?

Jan
Wu, Feng
2015-09-16 08:56:23 UTC
Permalink
-----Original Message-----
Sent: Thursday, September 10, 2015 5:26 PM
To: Wu, Feng
Cc: Andrew Cooper; Dario Faggioli; George Dunlap; Tian, Kevin;
Subject: RE: [PATCH v6 16/18] vmx: Add some scheduler hooks for VT-d posted
interrupts
Post by Wu, Feng
Post by Jan Beulich
Post by Wu, Feng
Anyway, I go through the main path of the code as below, I really don't find
anything unsafe here.
context_switch() -->
pi_ctxt_switch_from() -->
vmx_pre_ctx_switch_pi() -->
vcpu_unblock() -->
vcpu_wake() -->
SCHED_OP(VCPU2OP(v), wake, v) -->
csched_vcpu_wake() -->
__runq_insert()
__runq_tickle()
If you continue to think it is unsafe, pointing out the place will be welcome!
Once again - I didn't say it's unsafe (and hence can't point at a
particular place). What bothers me is that vcpu_unblock() affects
scheduler state, and altering scheduler state from inside the
context switch path looks problematic at the first glance. I'd be
happy if I was told there is no such problem, but be aware that
this would have to include latent ones: Even if right now things
are safe, having such feedback have the potential of becoming
an actual problem with a later scheduler change is already an
issue, as finding the problem is then likely going to be rather hard
(and I suspect it's not going to be you to debug this).
What I can say is that after investigation, I don't find anything bad
for this vcpu_unblock(), I tested it in my experiment. So that is why
I'd like to ask some ideas from scheduler expects.
Agreed - I'm also awaiting their input.
Hi Dario and George, could you please give your thoughts on this?
Your input is very important to us!

Thanks,
Feng
George Dunlap
2015-09-16 17:08:08 UTC
Permalink
Post by Wu, Feng
Post by Jan Beulich
Post by Wu, Feng
Post by Jan Beulich
Post by Wu, Feng
Post by Jan Beulich
Calling vcpu_unblock() in the middle of context_switch()? Why? And
is this safe?
I cannot see anything unsafe so far, can some scheduler maintainer
help to confirm it? Dario? George?
But first and foremost you ought to answer the "why".
I think if you think this solution is unsafe, you need point out where it is, not
just ask "is this safe ?", I don't think this is an effective comments.
It is my general understanding that people wanting code to be
included have to prove their code is okay, not reviewers to prove
the code is broken.
Post by Wu, Feng
Anyway, I go through the main path of the code as below, I really don't find
anything unsafe here.
context_switch() -->
pi_ctxt_switch_from() -->
vmx_pre_ctx_switch_pi() -->
vcpu_unblock() -->
vcpu_wake() -->
SCHED_OP(VCPU2OP(v), wake, v) -->
csched_vcpu_wake() -->
__runq_insert()
__runq_tickle()
If you continue to think it is unsafe, pointing out the place will be welcome!
Once again - I didn't say it's unsafe (and hence can't point at a
particular place). What bothers me is that vcpu_unblock() affects
scheduler state, and altering scheduler state from inside the
context switch path looks problematic at the first glance. I'd be
happy if I was told there is no such problem, but be aware that
this would have to include latent ones: Even if right now things
are safe, having such feedback have the potential of becoming
an actual problem with a later scheduler change is already an
issue, as finding the problem is then likely going to be rather hard
(and I suspect it's not going to be you to debug this).
What I can say is that after investigation, I don't find anything bad
for this vcpu_unblock(), I tested it in my experiment. So that is why
I'd like to ask some ideas from scheduler expects.
You still didn't answer Jan's question as to why you chose to call it there.

As to why Jan is suspicious, the hypervisor code is incredibly
complicated, and even hypervisor maintainers are only mortals. :-) One
way we deal with the complication is by trying to restrict the ways
different code interacts, so that people reading, writing, and
maintaining the code can make simplifying assumptions to make it
easier to grasp / harder to make mistakes.

People reading or working on vcpu_wake() code expect it to be called
from interrupt contexts, and also expect it to be called from normal
hypercalls. They *don't* expect it to be called from in the middle of
a context switch, where "current", the registers, and all those things
are all up in the air. It's possible that the code *already*
implicitly assumes it's not called from context switch (i.e., that v
== current means certain state), and even if not, it's possible
someone will make that assumption in the future, causing a very
hard-to-detect bug.

Now I haven't looked at the code in detail yet; it may be that the
only way to make PI work is to make this call. If that's the case we
need to carefully examine assumptions, and carefully comment the code
to make sure people working on the scheduling code continue to think
carefully about their assumptions in the future. But if we can get
away without doing that, it will make things much easier.

But it's certainly reasonable to expect the maintainers to offer you
an alternate solution if your proposed solution is unacceptable. :-)
Let me take a look and see what I think.

-George
George Dunlap
2015-09-16 16:56:33 UTC
Permalink
Post by Jan Beulich
Post by Feng Wu
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1573,6 +1573,22 @@ static void __context_switch(void)
per_cpu(curr_vcpu, cpu) = n;
}
+static inline void pi_ctxt_switch_from(struct vcpu *prev)
+{
+ /*
+ * When switching from non-idle to idle, we only do a lazy context switch.
+ * However, in order for posted interrupt (if available and enabled) to
+ * work properly, we at least need to update the descriptors.
+ */
+ if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
+ prev->arch.pi_ctxt_switch_from(prev);
+}
+
+static inline void pi_ctxt_switch_to(struct vcpu *next)
+{
+ if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
+ next->arch.pi_ctxt_switch_to(next);
+}
void context_switch(struct vcpu *prev, struct vcpu *next)
{
@@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
set_current(next);
+ pi_ctxt_switch_from(prev);
+
if ( (per_cpu(curr_vcpu, cpu) == next) ||
(is_idle_domain(nextd) && cpu_online(cpu)) )
{
+ pi_ctxt_switch_to(next);
local_irq_enable();
This placement, if really intended that way, needs explanation (in a
comment) and perhaps even renaming of the involved symbols, as
looking at it from a general perspective it seems wrong (with
pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
want this only when switching back to what got switched out lazily
before, i.e. this would be not something to take place on an arbitrary
context switch). As to possible alternative names - maybe make the
hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
Why on earth is this more clear than what he had before?

In the first call, he's not "preparing" anything -- he's actually
switching the PI context out for prev. And in the second call, he's
not "cancelling" anything -- he's actually switching the PI context in
for next. The names you suggest are actively confusing, not helpful.

But before talking about how to make things more clear, one side
question -- do we need to actually call pi_ctxt_switch_to() in
__context_switch()?

The only other place __context_switch() is called is
from __sync_local_execstate(). But the only reason that needs to be
called is because sometimes we *don't* call __context_switch(), and so
there are things on the cpu that aren't copied into the vcpu struct.

That doesn't apply to the PI state -- for one, nothing is copied from
the processor; and for two, pi_ctxt_switch_from() is called
unconditionally anyway.

Would it make more sense to call pi_context_switch(prev, next) just
after "set_current"?

(Keeping in mind I totally may have missed something...)

-George
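
(For illustration, a rough sketch of what that placement could look like,
with a single call right after set_current(next) in context_switch()
replacing the separate pi_ctxt_switch_from()/pi_ctxt_switch_to() call
sites; the combined helper name pi_context_switch() is hypothetical:)

static inline void pi_context_switch(struct vcpu *prev, struct vcpu *next)
{
    /* Switch the PI context out for prev and in for next, in one place. */
    if ( !is_idle_vcpu(prev) && prev->arch.pi_ctxt_switch_from )
        prev->arch.pi_ctxt_switch_from(prev);

    if ( !is_idle_vcpu(next) && next->arch.pi_ctxt_switch_to )
        next->arch.pi_ctxt_switch_to(next);
}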
Wu, Feng
2015-09-17 06:15:57 UTC
Permalink
-----Original Message-----
Dunlap
Sent: Thursday, September 17, 2015 12:57 AM
To: Jan Beulich
Cc: Wu, Feng; Tian, Kevin; Keir Fraser; Andrew Cooper; Dario Faggioli;
Subject: Re: [Xen-devel] [PATCH v6 16/18] vmx: Add some scheduler hooks for
VT-d posted interrupts
Post by Jan Beulich
Post by Feng Wu
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1573,6 +1573,22 @@ static void __context_switch(void)
per_cpu(curr_vcpu, cpu) = n;
}
+static inline void pi_ctxt_switch_from(struct vcpu *prev)
+{
+ /*
+ * When switching from non-idle to idle, we only do a lazy context
switch.
Post by Jan Beulich
Post by Feng Wu
+ * However, in order for posted interrupt (if available and enabled) to
+ * work properly, we at least need to update the descriptors.
+ */
+ if ( prev->arch.pi_ctxt_switch_from && !is_idle_vcpu(prev) )
+ prev->arch.pi_ctxt_switch_from(prev);
+}
+
+static inline void pi_ctxt_switch_to(struct vcpu *next)
+{
+ if ( next->arch.pi_ctxt_switch_to && !is_idle_vcpu(next) )
+ next->arch.pi_ctxt_switch_to(next);
+}
void context_switch(struct vcpu *prev, struct vcpu *next)
{
@@ -1605,9 +1621,12 @@ void context_switch(struct vcpu *prev, struct
vcpu *next)
Post by Jan Beulich
Post by Feng Wu
set_current(next);
+ pi_ctxt_switch_from(prev);
+
if ( (per_cpu(curr_vcpu, cpu) == next) ||
(is_idle_domain(nextd) && cpu_online(cpu)) )
{
+ pi_ctxt_switch_to(next);
local_irq_enable();
This placement, if really intended that way, needs explanation (in a
comment) and perhaps even renaming of the involved symbols, as
looking at it from a general perspective it seems wrong (with
pi_ctxt_switch_to() excluding idle vCPU-s it effectively means you
want this only when switching back to what got switched out lazily
before, i.e. this would be not something to take place on an arbitrary
context switch). As to possible alternative names - maybe make the
hooks ctxt_switch_prepare() and ctxt_switch_cancel()?
Why on earth is this more clear than what he had before?
In the first call, he's not "preparing" anything -- he's actually
switching the PI context out for prev. And in the second call, he's
not "cancelling" anything -- he's actually switching the PI context in
for next. The names you suggest are actively confusing, not helpful.
But before talking about how to make things more clear, one side
question -- do we need to actually call pi_ctxt_switch_to() in
__context_switch()?
The only other place __context_switch() is called is
from__sync_local_execstate(). But the only reason that needs to be
called is because sometimes we *don't* call __context_switch(), and so
there are things on the cpu that aren't copied into the vcpu struct.
Thanks for the comments!

From my understanding, __sync_local_execstate() can only get called
in the following two cases:
#1) this_cpu(curr_vcpu) == current: in this case, __context_switch() is
not called.
#2) this_cpu(curr_vcpu) != current, and current == idle_vcpu: that means
we just switched lazily from a non-idle vCPU to the idle vCPU, so here we
need to call __context_switch() to copy state back into the original
vcpu struct.

Please correct me if the above understanding is wrong or incomplete.

I think calling pi_ctxt_switch_to() in __context_switch() is needed when
we are switching to a non-idle vCPU (we need to change the PI state of the
target vCPU), and the call is not needed when switching to the idle vCPU.
So if the above understanding is correct, I think your suggestion below
is really good; it makes things clearer.
That doesn't apply to the PI state -- for one, nothing is copied from
the processor; and for two, pi_ctxt_switch_from() is called
unconditionally anyway.
Would it make more sense to call pi_context_switch(prev, next) just
after "set_current"?
I think it is a good point.

Thanks,
Feng
(Keeping in mind I totally may have missed something...)
-George
Feng Wu
2015-08-25 01:57:52 UTC
Permalink
When guest changes its interrupt configuration (such as, vector, etc.)
for direct-assigned devices, we need to update the associated IRTE
with the new guest vector, so external interrupts from the assigned
devices can be injected to guests without VM-Exit.

For lowest-priority interrupts, we use the vector-hashing mechanism to find
the destination vCPU. This follows the hardware behavior, since modern
Intel CPUs use vector hashing to handle lowest-priority interrupts.

For multicast/broadcast interrupts, we cannot handle them via interrupt
posting, so interrupt remapping is still used.

CC: Jan Beulich <***@suse.com>
Signed-off-by: Feng Wu <***@intel.com>
---
v6:
- Use macro to replace plain numbers
- Correct the overflow error in a loop

v5:
- Make 'struct vcpu *vcpu' const

v4:
- Make some 'int' variables 'unsigned int' in pi_find_dest_vcpu()
- Make 'dest_id' uint32_t
- Rename 'size' to 'bitmap_array_size'
- find_next_bit() and find_first_bit() always return unsigned int,
so no need to check whether the return value is less than 0.
- Message error level XENLOG_G_WARNING -> XENLOG_G_INFO
- Remove useless warning message
- Create a seperate function vector_hashing_dest() to find the
- destination of lowest-priority interrupts.
- Change some comments

v3:
- Use bitmap to store the all the possible destination vCPUs of an
interrupt, then trying to find the right destination from the bitmap
- Typo and some small changes

xen/drivers/passthrough/io.c | 125 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 124 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index bda9374..8e36948 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -25,6 +25,7 @@
#include <asm/hvm/iommu.h>
#include <asm/hvm/support.h>
#include <xen/hvm/irq.h>
+#include <asm/io_apic.h>

static DEFINE_PER_CPU(struct list_head, dpci_list);

@@ -198,6 +199,109 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
xfree(dpci);
}

+/*
+ * This routine handles lowest-priority interrupts using vector-hashing
+ * mechanism. As an example, modern Intel CPUs use this method to handle
+ * lowest-priority interrupts.
+ *
+ * Here are the details of the vector-hashing mechanism:
+ * 1. For lowest-priority interrupts, store all the possible destination
+ * vCPUs in an array.
+ * 2. Use "gvec % max number of destination vCPUs" to find the right
+ * destination vCPU in the array for the lowest-priority interrupt.
+ */
+static struct vcpu *vector_hashing_dest(const struct domain *d,
+ uint32_t dest_id,
+ bool_t dest_mode,
+ uint8_t gvec)
+
+{
+ unsigned long *dest_vcpu_bitmap;
+ unsigned int dest_vcpus = 0, idx;
+ unsigned int bitmap_array_size = BITS_TO_LONGS(d->max_vcpus);
+ struct vcpu *v, *dest = NULL;
+ unsigned int i;
+
+ dest_vcpu_bitmap = xzalloc_array(unsigned long, bitmap_array_size);
+ if ( !dest_vcpu_bitmap )
+ {
+ dprintk(XENLOG_G_INFO,
+ "dom%d: failed to allocate memory\n", d->domain_id);
+ return NULL;
+ }
+
+ for_each_vcpu ( d, v )
+ {
+ if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, APIC_DEST_NOSHORT,
+ dest_id, dest_mode) )
+ continue;
+
+ __set_bit(v->vcpu_id, dest_vcpu_bitmap);
+ dest_vcpus++;
+ }
+
+ if ( dest_vcpus != 0 )
+ {
+ unsigned int mod = gvec % dest_vcpus;
+ idx = 0;
+
+ for ( i = 0; i <= mod; i++ )
+ {
+ idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus, idx) + 1;
+ BUG_ON(idx >= d->max_vcpus);
+ }
+ idx--;
+
+ dest = d->vcpu[idx];
+ }
+
+ xfree(dest_vcpu_bitmap);
+
+ return dest;
+}
+
+/*
+ * The purpose of this routine is to find the right destination vCPU for
+ * an interrupt which will be delivered by VT-d posted-interrupt. There
+ * are several cases as below:
+ *
+ * - For lowest-priority interrupts, use vector-hashing mechanism to find
+ * the destination.
+ * - Otherwise, for single destination interrupt, it is straightforward to
+ * find the destination vCPU and return true.
+ * - For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
+ * so return NULL.
+ */
+static struct vcpu *pi_find_dest_vcpu(const struct domain *d, uint32_t dest_id,
+ bool_t dest_mode, uint8_t delivery_mode,
+ uint8_t gvec)
+{
+ unsigned int dest_vcpus = 0;
+ struct vcpu *v, *dest = NULL;
+
+ if ( delivery_mode == dest_LowestPrio )
+ return vector_hashing_dest(d, dest_id, dest_mode, gvec);
+
+ for_each_vcpu ( d, v )
+ {
+ if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, APIC_DEST_NOSHORT,
+ dest_id, dest_mode) )
+ continue;
+
+ dest_vcpus++;
+ dest = v;
+ }
+
+ /*
+ * For fixed destination, we only handle single-destination
+ * interrupts.
+ */
+ if ( dest_vcpus == 1 )
+ return dest;
+
+ return NULL;
+}
+
int pt_irq_create_bind(
struct domain *d, xen_domctl_bind_pt_irq_t *pt_irq_bind)
{
@@ -256,7 +360,7 @@ int pt_irq_create_bind(
{
case PT_IRQ_TYPE_MSI:
{
- uint8_t dest, dest_mode;
+ uint8_t dest, dest_mode, delivery_mode;
int dest_vcpu_id;

if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
@@ -329,11 +433,30 @@ int pt_irq_create_bind(
/* Calculate dest_vcpu_id for MSI-type pirq migration. */
dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
+ delivery_mode = (pirq_dpci->gmsi.gflags >> GFLAGS_SHIFT_DELIV_MODE) &
+ VMSI_DELIV_MASK;
dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
spin_unlock(&d->event_lock);
if ( dest_vcpu_id >= 0 )
hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
+
+ /* Use interrupt posting if it is supported. */
+ if ( iommu_intpost )
+ {
+ const struct vcpu *vcpu = pi_find_dest_vcpu(d, dest, dest_mode,
+ delivery_mode, pirq_dpci->gmsi.gvec);
+
+ if ( vcpu )
+ {
+ rc = pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec );
+ if ( unlikely(rc) )
+ dprintk(XENLOG_G_INFO,
+ "%pv: failed to update PI IRTE, gvec:%02x\n",
+ vcpu, pirq_dpci->gmsi.gvec);
+ }
+ }
+
break;
}
--
2.1.0
Jan Beulich
2015-09-04 15:59:13 UTC
Permalink
First of all - an empty Cc list on a patch is suspicious.
Post by Feng Wu
@@ -198,6 +199,109 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
xfree(dpci);
}
+/*
+ * This routine handles lowest-priority interrupts using vector-hashing
+ * mechanism. As an example, modern Intel CPUs use this method to handle
+ * lowest-priority interrupts.
+ *
+ * 1. For lowest-priority interrupts, store all the possible destination
+ * vCPUs in an array.
+ * 2. Use "gvec % max number of destination vCPUs" to find the right
+ * destination vCPU in the array for the lowest-priority interrupt.
+ */
+static struct vcpu *vector_hashing_dest(const struct domain *d,
+ uint32_t dest_id,
+ bool_t dest_mode,
+ uint8_t gvec)
+
+{
+ unsigned long *dest_vcpu_bitmap;
+ unsigned int dest_vcpus = 0, idx;
+ unsigned int bitmap_array_size = BITS_TO_LONGS(d->max_vcpus);
+ struct vcpu *v, *dest = NULL;
+ unsigned int i;
+
+ dest_vcpu_bitmap = xzalloc_array(unsigned long, bitmap_array_size);
+ if ( !dest_vcpu_bitmap )
+ {
+ dprintk(XENLOG_G_INFO,
+ "dom%d: failed to allocate memory\n", d->domain_id);
This dprintk() won't really help much. Please remove it.
Post by Feng Wu
+ return NULL;
+ }
+
+ for_each_vcpu ( d, v )
+ {
+ if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, APIC_DEST_NOSHORT,
+ dest_id, dest_mode) )
+ continue;
+
+ __set_bit(v->vcpu_id, dest_vcpu_bitmap);
+ dest_vcpus++;
+ }
+
+ if ( dest_vcpus != 0 )
+ {
+ unsigned int mod = gvec % dest_vcpus;
+ idx = 0;
You don't use idx before here. Declare it here, which also avoids
me telling you that you placed the blank line wrongly.
Post by Feng Wu
+ for ( i = 0; i <= mod; i++ )
+ {
+ idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus, idx) + 1;
+ BUG_ON(idx >= d->max_vcpus);
+ }
+ idx--;
+
+ dest = d->vcpu[idx];
Just [idx - 1] please.
Post by Feng Wu
+static struct vcpu *pi_find_dest_vcpu(const struct domain *d, uint32_t dest_id,
+ bool_t dest_mode, uint8_t delivery_mode,
+ uint8_t gvec)
+{
+ unsigned int dest_vcpus = 0;
+ struct vcpu *v, *dest = NULL;
+
+ if ( delivery_mode == dest_LowestPrio )
+ return vector_hashing_dest(d, dest_id, dest_mode, gvec);
So at this point delivery_mode == dest_Fixed, right?
Post by Feng Wu
+ for_each_vcpu ( d, v )
+ {
+ if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, APIC_DEST_NOSHORT,
+ dest_id, dest_mode) )
+ continue;
+
+ dest_vcpus++;
+ dest = v;
+ }
+
+ /*
+ * For fixed destination, we only handle single-destination
+ * interrupts.
+ */
+ if ( dest_vcpus == 1 )
+ return dest;
Is it thus even possible for the if() condition to be false? If it isn't,
returning early from the loop would seem the better option. And
even if it is, I would question whether delivering the interrupt to
the first match isn't going to be better than dropping it.
Post by Feng Wu
@@ -329,11 +433,30 @@ int pt_irq_create_bind(
/* Calculate dest_vcpu_id for MSI-type pirq migration. */
dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
+ delivery_mode = (pirq_dpci->gmsi.gflags >> GFLAGS_SHIFT_DELIV_MODE) &
+ VMSI_DELIV_MASK;
In numbers (gflags >> 12) & 0x7000, which is likely not what you
want.
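Spelled out, as a rough sketch (the 12/0x7000 values are as implied above and
stand in for GFLAGS_SHIFT_DELIV_MODE / VMSI_DELIV_MASK, which are defined
relative to the unshifted gflags):

    /* As posted: shift first, then apply the unshifted mask - this ends up
     * looking at bits 26:24 of gflags rather than the delivery mode field. */
    delivery_mode = (pirq_dpci->gmsi.gflags >> GFLAGS_SHIFT_DELIV_MODE) &
                    VMSI_DELIV_MASK;

    /* Presumably intended: mask first, then shift down, extracting the
     * delivery mode field in bits 14:12 of gflags. */
    delivery_mode = (pirq_dpci->gmsi.gflags & VMSI_DELIV_MASK) >>
                    GFLAGS_SHIFT_DELIV_MODE;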
Post by Feng Wu
dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
spin_unlock(&d->event_lock);
if ( dest_vcpu_id >= 0 )
hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
+
+ /* Use interrupt posting if it is supported. */
+ if ( iommu_intpost )
+ {
+ const struct vcpu *vcpu = pi_find_dest_vcpu(d, dest, dest_mode,
+ delivery_mode, pirq_dpci->gmsi.gvec);
+
+ if ( vcpu )
+ {
+ rc = pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec );
+ if ( unlikely(rc) )
+ dprintk(XENLOG_G_INFO,
+ "%pv: failed to update PI IRTE, gvec:%02x\n",
+ vcpu, pirq_dpci->gmsi.gvec);
Even if only a debug build printk() - aren't you afraid that if this
ever triggers, it will trigger a lot? And hence be pretty useless?
Post by Feng Wu
+ }
(Not only) depending on the answer, I'd consider adding an "else
printk()" here.

Jan
Wu, Feng
2015-09-06 04:54:27 UTC
Permalink
-----Original Message-----
Sent: Friday, September 04, 2015 11:59 PM
To: Wu, Feng
Subject: Re: [PATCH v6 13/18] Update IRTE according to guest interrupt config
changes
First of all - an empty Cc list on a patch is suspicious.
I did Cc you for this patch. Why do you say "an empty Cc list"?
Post by Feng Wu
+static struct vcpu *pi_find_dest_vcpu(const struct domain *d, uint32_t
dest_id,
Post by Feng Wu
+ bool_t dest_mode, uint8_t
delivery_mode,
Post by Feng Wu
+ uint8_t gvec)
+{
+ unsigned int dest_vcpus = 0;
+ struct vcpu *v, *dest = NULL;
+
+ if ( delivery_mode == dest_LowestPrio )
+ return vector_hashing_dest(d, dest_id, dest_mode, gvec);
So at this point delivery_mode == dest_Fixed, right?
It won't be dest_LowestPrio here, so it might be proper to add
else if (delivery_mode == dest_Fixed) for the code below. So the
question is: can delivery modes other than lowest priority and fixed,
such as NMI, SMI, etc., hit here? Any ideas?
Post by Feng Wu
+ for_each_vcpu ( d, v )
+ {
+ if ( !vlapic_match_dest(vcpu_vlapic(v), NULL,
APIC_DEST_NOSHORT,
Post by Feng Wu
+ dest_id, dest_mode) )
+ continue;
+
+ dest_vcpus++;
+ dest = v;
+ }
+
+ /*
+ * For fixed destination, we only handle single-destination
+ * interrupts.
+ */
+ if ( dest_vcpus == 1 )
+ return dest;
Is it thus even possible for the if() condition to be false?
It can be false if the interrupt has multiple destinations with Fixed
delivery mode.
If it isn't,
returning early from the loop would seem the better option. And
even if it is, I would question whether delivering the interrupt to
the first match isn't going to be better than dropping it.
Here, if we deliver all the interrupts to the first match, only this
vCPU will receive all the interrupts, even though the irq affinity
shows it has multiple destinations. I don't think this is correct.
Furthermore, are there any performance issues if the interrupt frequency
is high and the matched vCPU cannot handle them in time? So here, I
just leave these kinds of interrupts to the legacy interrupt remapping
mechanism.
Post by Feng Wu
@@ -329,11 +433,30 @@ int pt_irq_create_bind(
/* Calculate dest_vcpu_id for MSI-type pirq migration. */
dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
+ delivery_mode = (pirq_dpci->gmsi.gflags >>
GFLAGS_SHIFT_DELIV_MODE) &
Post by Feng Wu
+ VMSI_DELIV_MASK;
In numbers (gflags >> 12) & 0x7000, which is likely not what you
want.
Post by Feng Wu
dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
spin_unlock(&d->event_lock);
if ( dest_vcpu_id >= 0 )
hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
+
+ /* Use interrupt posting if it is supported. */
+ if ( iommu_intpost )
+ {
+ const struct vcpu *vcpu = pi_find_dest_vcpu(d, dest,
dest_mode,
Post by Feng Wu
+ delivery_mode,
pirq_dpci->gmsi.gvec);
Post by Feng Wu
+
+ if ( vcpu )
+ {
+ rc = pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec );
+ if ( unlikely(rc) )
+ dprintk(XENLOG_G_INFO,
+ "%pv: failed to update PI IRTE,
gvec:%02x\n",
Post by Feng Wu
+ vcpu, pirq_dpci->gmsi.gvec);
Even if only a debug build printk() - aren't you afraid that if this
ever triggers, it will trigger a lot? And hence be pretty useless?
I think it reaches this debug printk rarely, but at least, when it really fails, it
can give people some hints about why we are using interrupt remapping instead
of interrupt posting for the external interrupts.
Post by Feng Wu
+ }
(Not only) depending on the answer, I'd consider adding an "else
printk()" here.
So do you mean something like this:

if ( vcpu )
{
rc = pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec );
if ( unlikely(rc) )
dprintk(XENLOG_G_INFO,
"%pv: failed to update PI IRTE, gvec:%02x, use interrupt remapping\n",
vcpu, pirq_dpci->gmsi.gvec);
else
dprintk(XENLOG_G_INFO,
"%pv: Succeed to update PI IRTE, gvec:%02x, use interrupt posting\n",
vcpu, pirq_dpci->gmsi.gvec);
}

Thanks,
Feng
Jan
Jan Beulich
2015-09-07 11:03:15 UTC
Permalink
Post by Wu, Feng
Sent: Friday, September 04, 2015 11:59 PM
First of all - an empty Cc list on a patch is suspicious.
I did Cc you for this patch. Why do you say "an empty Cc list"?
Oops, sorry - this really belongs to patch 12.
Post by Wu, Feng
Post by Feng Wu
+static struct vcpu *pi_find_dest_vcpu(const struct domain *d, uint32_t
dest_id,
Post by Feng Wu
+ bool_t dest_mode, uint8_t
delivery_mode,
Post by Feng Wu
+ uint8_t gvec)
+{
+ unsigned int dest_vcpus = 0;
+ struct vcpu *v, *dest = NULL;
+
+ if ( delivery_mode == dest_LowestPrio )
+ return vector_hashing_dest(d, dest_id, dest_mode, gvec);
So at this point delivery_mode == dest_Fixed, right?
It won't be dest_LowestPrio here, so it might be proper to add
else if (delivery_mode == dest_Fixed) for the code below. So the
question is: Can deliver modes other than lowest priority and fixed,
such as NMI, SMI, etc. hit here? Any ideas?
That's for you to validate. If it can't, add an ASSERT(). If it can, use
an if() as you suggest.
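For clarity, a rough sketch (not part of the posted patch) of what the if()
variant could look like, reusing only names already present in the hunk above;
the ASSERT() variant would simply replace the early return marked below:

static struct vcpu *pi_find_dest_vcpu(const struct domain *d, uint32_t dest_id,
                                      bool_t dest_mode, uint8_t delivery_mode,
                                      uint8_t gvec)
{
    unsigned int dest_vcpus = 0;
    struct vcpu *v, *dest = NULL;

    if ( delivery_mode == dest_LowestPrio )
        return vector_hashing_dest(d, dest_id, dest_mode, gvec);

    /* Option A (other modes impossible): ASSERT(delivery_mode == dest_Fixed);
     * Option B (they can happen): fall back to remapping for them. */
    if ( delivery_mode != dest_Fixed )
        return NULL;

    for_each_vcpu ( d, v )
    {
        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, APIC_DEST_NOSHORT,
                                dest_id, dest_mode) )
            continue;

        dest_vcpus++;
        dest = v;
    }

    /* Only single-destination fixed interrupts are posted. */
    return dest_vcpus == 1 ? dest : NULL;
}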
Post by Wu, Feng
Post by Feng Wu
+ for_each_vcpu ( d, v )
+ {
+ if ( !vlapic_match_dest(vcpu_vlapic(v), NULL,
APIC_DEST_NOSHORT,
Post by Feng Wu
+ dest_id, dest_mode) )
+ continue;
+
+ dest_vcpus++;
+ dest = v;
+ }
+
+ /*
+ * For fixed destination, we only handle single-destination
+ * interrupts.
+ */
+ if ( dest_vcpus == 1 )
+ return dest;
Is it thus even possible for the if() condition to be false?
It can be false if the interrupt has multiple destinations with Fixed
delivery mode.
If it isn't,
returning early from the loop would seem the better option. And
even if it is, I would question whether delivering the interrupt to
the first match isn't going to be better than dropping it.
Here, if we deliver all the interrupts to the first match, only this
vCPU will receive all the interrupts, even though the irq affinity
shows it has multiple destinations. I don't think this is correct.
Furthermore, are there any performance issues if the interrupt frequency
is high and the matched vCPU cannot handle them in time? So here, I
just leave these kinds of interrupts to the legacy interrupt remapping
mechanism.
Oh, okay, if they're not getting lost, that's fine then.
Post by Wu, Feng
Post by Feng Wu
dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
spin_unlock(&d->event_lock);
if ( dest_vcpu_id >= 0 )
hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
+
+ /* Use interrupt posting if it is supported. */
+ if ( iommu_intpost )
+ {
+ const struct vcpu *vcpu = pi_find_dest_vcpu(d, dest,
dest_mode,
Post by Feng Wu
+ delivery_mode,
pirq_dpci->gmsi.gvec);
Post by Feng Wu
+
+ if ( vcpu )
+ {
+ rc = pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec );
+ if ( unlikely(rc) )
+ dprintk(XENLOG_G_INFO,
+ "%pv: failed to update PI IRTE,
gvec:%02x\n",
Post by Feng Wu
+ vcpu, pirq_dpci->gmsi.gvec);
Even if only a debug build printk() - aren't you afraid that if this
ever triggers, it will trigger a lot? And hence be pretty useless?
I think it reaches this debug printk rarely, but at least, when it really fails, it
can give people some hints about why we are using interrupt remapping instead
of interrupt posting for the external interrupts.
I understand your motivation, but you don't really answer my
question. (And btw., if you really mean "rarely", then there's a bug
somewhere that you need to fix. It should _never_ trigger if
everything is working correctly.)
Post by Wu, Feng
Post by Feng Wu
+ }
(Not only) depending on the answer, I'd consider adding an "else
printk()" here.
if ( vcpu )
{
rc = pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec );
if ( unlikely(rc) )
dprintk(XENLOG_G_INFO,
"%pv: failed to update PI IRTE, gvec:%02x, use interrupt remapping\n",
vcpu, pirq_dpci->gmsi.gvec);
else
dprintk(XENLOG_G_INFO,
"%pv: Succeed to update PI IRTE, gvec:%02x, use interrupt posting\n",
vcpu, pirq_dpci->gmsi.gvec);
}
No. I intentionally placed my question _after_ the closing brace, i.e.
I suggested an "else" to the outer "if".

Jan
Wu, Feng
2015-09-08 04:47:00 UTC
Permalink
-----Original Message-----
Sent: Monday, September 07, 2015 7:03 PM
To: Wu, Feng
Subject: RE: [PATCH v6 13/18] Update IRTE according to guest interrupt config
changes
Post by Wu, Feng
Sent: Friday, September 04, 2015 11:59 PM
Post by Feng Wu
+ if ( vcpu )
+ {
+ rc = pi_update_irte( vcpu, info,
pirq_dpci->gmsi.gvec );
Post by Wu, Feng
Post by Feng Wu
+ if ( unlikely(rc) )
+ dprintk(XENLOG_G_INFO,
+ "%pv: failed to update PI IRTE,
gvec:%02x\n",
Post by Feng Wu
+ vcpu, pirq_dpci->gmsi.gvec);
Even if only a debug build printk() - aren't you afraid that if this
ever triggers, it will trigger a lot? And hence be pretty useless?
I think it reaches this debug printk rarely, but a least, when it is really failed, it
can give people some hints about why we are using interrupt remapping instead
of interrupt posing for the external interrupts.
I understand your motivation, but you don't really answer my
question. (And btw., if you really mean "rarely", then there's a bug
somewhere that you need to fix. It should _never_ trigger if
everything is working correctly.)
I mean pi_update_irte() can return failure theoretically, because there
are some pointer checks in it. You said it would trigger a lot if this ever
triggers. Do you mean for a specific pirq, it will always fail after the first
failure? So what is your suggestion here? Adding an ASSERT()?

Thanks,
Feng
Jan Beulich
2015-09-08 09:02:05 UTC
Permalink
Post by Feng Wu
-----Original Message-----
Sent: Monday, September 07, 2015 7:03 PM
To: Wu, Feng
Subject: RE: [PATCH v6 13/18] Update IRTE according to guest interrupt
config
changes
Post by Wu, Feng
Sent: Friday, September 04, 2015 11:59 PM
Post by Feng Wu
+ if ( vcpu )
+ {
+ rc = pi_update_irte( vcpu, info,
pirq_dpci->gmsi.gvec );
Post by Wu, Feng
Post by Feng Wu
+ if ( unlikely(rc) )
+ dprintk(XENLOG_G_INFO,
+ "%pv: failed to update PI IRTE,
gvec:%02x\n",
Post by Feng Wu
+ vcpu, pirq_dpci->gmsi.gvec);
Even if only a debug build printk() - aren't you afraid that if this
ever triggers, it will trigger a lot? And hence be pretty useless?
I think it reaches this debug printk rarely, but at least, when it really
fails, it
can give people some hints about why we are using interrupt remapping instead
of interrupt posting for the external interrupts.
I understand your motivation, but you don't really answer my
question. (And btw., if you really mean "rarely", then there's a bug
somewhere that you need to fix. It should _never_ trigger if
everything is working correctly.)
I mean pi_update_irte() can return failure theoretically, because there
are some pointer checks in it. You said it would trigger a lot if this ever
triggers.
I didn't say it would, I asked whether it might. That's on the basis
that often, once an error like this happens, more similar errors are
likely to occur subsequently. Whether that's the case here I can't
easily tell, hence I'm asking you (supposedly having a better
understanding of what can go wrong when and how often).
Post by Feng Wu
Do you mean for a specific pirq, it will always fail after the first
failure? So what is your suggestion here? Adding an ASSERT()?
An ASSERT() would make things worse (bringing down a debug
hypervisor). Limiting the amount of printing is what I asked to
consider.
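One illustrative way of limiting the output - not an existing Xen helper, just
a sketch around the hunk quoted above, with the cap value picked arbitrarily:

    if ( vcpu )
    {
        rc = pi_update_irte(vcpu, info, pirq_dpci->gmsi.gvec);
        if ( unlikely(rc) )
        {
            static unsigned int warned;  /* hypothetical cap, for illustration */

            if ( warned < 4 )
            {
                warned++;
                dprintk(XENLOG_G_INFO,
                        "%pv: failed to update PI IRTE, gvec:%02x\n",
                        vcpu, pirq_dpci->gmsi.gvec);
            }
        }
    }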

Jan
Feng Wu
2015-08-25 01:57:45 UTC
Permalink
This patch adds some helper functions to manipulate the
Posted-Interrupts Descriptor.

CC: Kevin Tian <***@intel.com>
CC: Keir Fraser <***@xen.org>
CC: Jan Beulich <***@suse.com>
CC: Andrew Cooper <***@citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <***@oracle.com>
---
v4:
- Newly added

xen/include/asm-x86/hvm/vmx/vmx.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 3fbfa44..acd4aec 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -101,6 +101,7 @@ void vmx_update_cpu_exec_control(struct vcpu *v);
void vmx_update_secondary_exec_control(struct vcpu *v);

#define POSTED_INTR_ON 0
+#define POSTED_INTR_SN 1
static inline int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
{
return test_and_set_bit(vector, pi_desc->pir);
@@ -121,11 +122,31 @@ static inline int pi_test_and_clear_on(struct pi_desc *pi_desc)
return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
}

+static inline int pi_test_on(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_ON, &pi_desc->control);
+}
+
static inline unsigned long pi_get_pir(struct pi_desc *pi_desc, int group)
{
return xchg(&pi_desc->pir[group], 0);
}

+static inline int pi_test_sn(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+ set_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+ clear_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
/*
* Exit Reasons
*/
--
2.1.0
Jan Beulich
2015-09-04 14:40:07 UTC
Permalink
Post by Feng Wu
@@ -121,11 +122,31 @@ static inline int pi_test_and_clear_on(struct pi_desc *pi_desc)
return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
}
+static inline int pi_test_on(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_ON, &pi_desc->control);
+}
For this and ...
Post by Feng Wu
+static inline int pi_test_sn(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_SN, &pi_desc->control);
+}
... this I wonder whether using the bitfield you defined in the
previous patch wouldn't allow the compiler more freedom in
how to carry this out.

Jan
Wu, Feng
2015-09-06 02:05:57 UTC
Permalink
-----Original Message-----
Sent: Friday, September 04, 2015 10:40 PM
To: Wu, Feng
Subject: Re: [PATCH v6 06/18] vmx: Add some helper functions for
Posted-Interrupts
Post by Feng Wu
@@ -121,11 +122,31 @@ static inline int pi_test_and_clear_on(struct
pi_desc *pi_desc)
Post by Feng Wu
return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
}
+static inline int pi_test_on(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_ON, &pi_desc->control);
+}
For this and ...
Post by Feng Wu
+static inline int pi_test_sn(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_SN, &pi_desc->control);
+}
... this I wonder whether using the bitfield you defined in the
previous patch wouldn't allow the compiler more freedom in
how to carry this out.
I am sorry, I don't quite understand it. Do you mean that the bitfield
defined in the previous patch is pointless, or that I should use the bitfield here?

Thanks,
Feng
Jan
Jan Beulich
2015-09-07 10:46:04 UTC
Permalink
Post by Wu, Feng
-----Original Message-----
Sent: Friday, September 04, 2015 10:40 PM
To: Wu, Feng
Subject: Re: [PATCH v6 06/18] vmx: Add some helper functions for
Posted-Interrupts
Post by Feng Wu
@@ -121,11 +122,31 @@ static inline int pi_test_and_clear_on(struct
pi_desc *pi_desc)
Post by Feng Wu
return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
}
+static inline int pi_test_on(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_ON, &pi_desc->control);
+}
For this and ...
Post by Feng Wu
+static inline int pi_test_sn(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_SN, &pi_desc->control);
+}
... this I wonder whether using the bitfield you defined in the
previous patch wouldn't allow the compiler more freedom in
how to carry this out.
I am sorry, I don't quite understand it. Do you mean: the bitfield
defined in previous patch is pointless, or using the bitfield here?
Using it here would seem preferable. (But please recall that I
questioned this two fold access model - partly using bitfields,
partly using bitops - earlier on, and ideally _all_ accesses to
a certain kind of data structure would follow a single, uniform
model.)

Jan
Wu, Feng
2015-09-08 02:39:14 UTC
Permalink
-----Original Message-----
Sent: Monday, September 07, 2015 6:46 PM
To: Wu, Feng
Subject: RE: [PATCH v6 06/18] vmx: Add some helper functions for
Posted-Interrupts
Post by Wu, Feng
-----Original Message-----
Sent: Friday, September 04, 2015 10:40 PM
To: Wu, Feng
Subject: Re: [PATCH v6 06/18] vmx: Add some helper functions for
Posted-Interrupts
Post by Feng Wu
@@ -121,11 +122,31 @@ static inline int pi_test_and_clear_on(struct
pi_desc *pi_desc)
Post by Feng Wu
return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
}
+static inline int pi_test_on(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_ON, &pi_desc->control);
+}
For this and ...
Post by Feng Wu
+static inline int pi_test_sn(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_SN, &pi_desc->control);
+}
... this I wonder whether using the bitfield you defined in the
previous patch wouldn't allow the compiler more freedom in
how to carry this out.
I am sorry, I don't quite understand it. Do you mean: the bitfield
defined in previous patch is pointless, or using the bitfield here?
Use it here would seem preferable. (But please recall that I
questioned this two fold access model - partly using bitfields,
partly using bitops - earlier on, and ideally _all_ accesses to
a certain kind of data structure would follow a single, uniform
model.)
Yes, I remember your suggestion about this. So now I only use
bitops everywhere; in that case, do you think I still need to use
the bitfield here?

Thanks,
Feng
Jan
Jan Beulich
2015-09-08 05:22:40 UTC
Permalink
Post by Wu, Feng
Sent: Monday, September 07, 2015 6:46 PM
Post by Wu, Feng
Sent: Friday, September 04, 2015 10:40 PM
Post by Feng Wu
@@ -121,11 +122,31 @@ static inline int pi_test_and_clear_on(struct
pi_desc *pi_desc)
Post by Feng Wu
return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
}
+static inline int pi_test_on(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_ON, &pi_desc->control);
+}
For this and ...
Post by Feng Wu
+static inline int pi_test_sn(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_SN, &pi_desc->control);
+}
... this I wonder whether using the bitfield you defined in the
previous patch wouldn't allow the compiler more freedom in
how to carry this out.
I am sorry, I don't quite understand it. Do you mean: the bitfield
defined in previous patch is pointless, or using the bitfield here?
Use it here would seem preferable. (But please recall that I
questioned this two fold access model - partly using bitfields,
partly using bitops - earlier on, and ideally _all_ accesses to
a certain kind of data structure would follow a single, uniform
model.)
Yes, I remember your suggestion about this. So now I only use
bitops everywhere, in that case, do you think I still need to use
bitfields here?
Then you understood only half of it: You now use the bitfield for
accessing multi-bit fields, and bitops for single-bit ones. While
certainly you can't use bitops for multi-bit fields, the corresponding
method would be masks and shifts.

Bottom line - if using a mixture, then use what allows the compiler to
generate most efficient code. And clarify for yourself whether the
modifying bitops really need to be locked ones.
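To make the trade-off concrete, a small sketch of the access styles being
discussed, based on the pi_desc layout quoted earlier in the series; the
function names are invented for illustration, and the shift/mask in (c)
assumes the field layout shown in that patch:

/* (a) bitops on the control word - what the series currently does; the
 *     modifying forms (set_bit() & co) are the locked ones referred to. */
static inline int pi_test_on_bitop(struct pi_desc *pi_desc)
{
    return test_bit(POSTED_INTR_ON, &pi_desc->control);
}

/* (b) the bitfield declared earlier in the series - a plain load, leaving
 *     the compiler free to pick the narrowest suitable access. */
static inline int pi_test_on_bitfield(const struct pi_desc *pi_desc)
{
    return pi_desc->on;
}

/* (c) for multi-bit fields the non-bitfield equivalent is mask+shift,
 *     e.g. reading the notification vector (bits 23:16 of 'control'). */
static inline uint8_t pi_get_nv_maskshift(const struct pi_desc *pi_desc)
{
    return (pi_desc->control >> 16) & 0xff;
}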

Jan
Feng Wu
2015-08-25 01:57:44 UTC
Permalink
Extend struct pi_desc according to VT-d Posted-Interrupts Spec.

CC: Kevin Tian <***@intel.com>
CC: Keir Fraser <***@xen.org>
CC: Jan Beulich <***@suse.com>
CC: Andrew Cooper <***@citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
Reviewed-by: Andrew Cooper <***@citrix.com>
Acked-by: Kevin Tian <***@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <***@oracle.com>
---
v3:
- Use u32 instead of u64 for the bitfield in 'struct pi_desc'

xen/include/asm-x86/hvm/vmx/vmcs.h | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index f1126d4..7e81752 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -80,8 +80,19 @@ struct vmx_domain {

struct pi_desc {
DECLARE_BITMAP(pir, NR_VECTORS);
- u32 control;
- u32 rsvd[7];
+ union {
+ struct
+ {
+ u16 on : 1, /* bit 256 - Outstanding Notification */
+ sn : 1, /* bit 257 - Suppress Notification */
+ rsvd_1 : 14; /* bit 271:258 - Reserved */
+ u8 nv; /* bit 279:272 - Notification Vector */
+ u8 rsvd_2; /* bit 287:280 - Reserved */
+ u32 ndst; /* bit 319:288 - Notification Destination */
+ };
+ u64 control;
+ };
+ u32 rsvd[6];
} __attribute__ ((aligned (64)));

#define ept_get_wl(ept) ((ept)->ept_wl)
--
2.1.0
Jan Beulich
2015-09-04 14:32:45 UTC
Permalink
Post by Feng Wu
Extend struct pi_desc according to VT-d Posted-Interrupts Spec.
---
- Use u32 instead of u64 for the bitfield in 'struct pi_desc'
xen/include/asm-x86/hvm/vmx/vmcs.h | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h
b/xen/include/asm-x86/hvm/vmx/vmcs.h
index f1126d4..7e81752 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -80,8 +80,19 @@ struct vmx_domain {
struct pi_desc {
DECLARE_BITMAP(pir, NR_VECTORS);
- u32 control;
- u32 rsvd[7];
+ union {
+ struct
+ {
The brace belongs on the previous line.
Post by Feng Wu
+ u16 on : 1, /* bit 256 - Outstanding Notification */
+ sn : 1, /* bit 257 - Suppress Notification */
+ rsvd_1 : 14; /* bit 271:258 - Reserved */
+ u8 nv; /* bit 279:272 - Notification Vector */
+ u8 rsvd_2; /* bit 287:280 - Reserved */
+ u32 ndst; /* bit 319:288 - Notification Destination */
Too little indentation.

Jan
Feng Wu
2015-08-25 01:57:40 UTC
Permalink
Add the design doc for VT-d PI.

CC: Kevin Tian <***@intel.com>
CC: Yang Zhang <***@intel.com>
CC: Jan Beulich <***@suse.com>
CC: Keir Fraser <***@xen.org>
CC: Andrew Cooper <***@citrix.com>
CC: George Dunlap <***@eu.citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
Reviewed-by: Kevin Tian <***@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <***@oracle.com>
---
v6:
- Better description in English

docs/misc/vtd-pi.txt | 332 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 332 insertions(+)
create mode 100644 docs/misc/vtd-pi.txt

diff --git a/docs/misc/vtd-pi.txt b/docs/misc/vtd-pi.txt
new file mode 100644
index 0000000..af5409a
--- /dev/null
+++ b/docs/misc/vtd-pi.txt
@@ -0,0 +1,332 @@
+Authors: Feng Wu <***@intel.com>
+
+VT-d Posted-interrupt (PI) design for XEN
+
+Background
+==========
+With the development of virtualization, there are more and more device
+assignment requirements. However, today when a VM is running with
+assigned devices (such as a NIC), external interrupt handling for the assigned
+devices always needs VMM intervention.
+
+VT-d Posted-interrupt is a more enhanced method to handle interrupts
+in the virtualization environment. Interrupt posting is the process by
+which an interrupt request is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex.
+
+With VT-d Posted-interrupt we can get the following advantages:
+- Direct delivery of external interrupts to running vCPUs without VMM
+intervention
+- Decrease the interrupt migration complexity. On vCPU migration, software
+can atomically co-migrate all interrupts targeting the migrating vCPU. For
+virtual machines with assigned devices, migrating a vCPU across pCPUs
+either incurs the overhead of forwarding interrupts in software (e.g. via VMM
+generated IPIs), or complexity to independently migrate each interrupt targeting
+the vCPU to the new pCPU. However, after enabling VT-d PI, the destination vCPU
+of an external interrupt from assigned devices is stored in the IRTE (i.e.
+Posted-interrupt Descriptor Address); when a vCPU is migrated to another pCPU,
+we set this new pCPU in the 'NDST' field of the Posted-interrupt descriptor, which
+makes the interrupt migration automatic.
+
+Here is what Xen currently does for external interrupts from assigned devices:
+
+When a VM is running and an external interrupt from an assigned device occurs
+for it, a VM-Exit happens, then:
+
+vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
+raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)
+
+softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()
+
+dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> vmsi_deliver() -->
+vmsi_inj_irq() --> vlapic_set_irq()
+
+vlapic_set_irq() does the following things:
+1. If CPU-side posted-interrupt is supported, call vmx_deliver_posted_intr() to deliver
+the virtual interrupt via posted-interrupt infrastructure.
+2. Else if CPU-side posted-interrupt is not supported, set the related vIRR in vLAPIC
+page and call vcpu_kick() to kick the related vCPU. Before VM-Entry, vmx_intr_assist()
+will help to inject the interrupt to guests.
+
+However, after VT-d PI is supported, when a guest is running in non-root mode and
+an external interrupt from an assigned device occurs for it, no VM-Exit is needed;
+the guest can handle this entirely in non-root mode, thus avoiding all of the above
+code flow.
+
+Posted-interrupt Introduction
+========================
+There are two components to the Posted-interrupt architecture:
+Processor Support and Root-Complex Support
+
+- Processor Support
+Posted-interrupt processing is a feature by which a processor processes
+the virtual interrupts by recording them as pending on the virtual-APIC
+page.
+
+Posted-interrupt processing is enabled by setting the process posted
+interrupts VM-execution control. The processing is performed in response
+to the arrival of an interrupt with the posted-interrupt notification vector.
+In response to such an interrupt, the processor processes virtual interrupts
+recorded in a data structure called a posted-interrupt descriptor.
+
+For more information about APICv and CPU-side Posted-interrupts, please refer
+to Chapter 29, Section 29.6 of the Intel SDM:
+http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
+
+- Root-Complex Support
+Interrupt posting is the process by which an interrupt request (from IOAPIC
+or MSI/MSIx capable sources) is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex. The interrupt
+request arriving at the root-complex carry the identity of the interrupt
+request source and a 'remapping-index'. The remapping-index is used to
+look-up an entry from the memory-resident interrupt-remap-table. Unlike
+interrupt-remapping, the interrupt-remap-table-entry for a posted-interrupt,
+specifies a virtual-vector and a pointer to the posted-interrupt descriptor.
+The virtual-vector specifies the vector of the interrupt to be recorded in
+the posted-interrupt descriptor. The posted-interrupt descriptor hosts storage
+for the virtual-vectors and contains the attributes of the notification event
+(interrupt) to be issued to the CPU complex to inform CPU/software about pending
+interrupts recorded in the posted-interrupt descriptor.
+
+For more information about VT-d PI, please refer to
+http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
+
+Important Definitions
+==================
+There are some changes to IRTE and posted-interrupt descriptor after
+VT-d PI is introduced:
+IRTE: Interrupt Remapping Table Entry
+Posted-interrupt Descriptor Address: the address of the posted-interrupt descriptor
+Virtual Vector: the guest vector of the interrupt
+URG: indicates if the interrupt is urgent
+
+Posted-interrupt descriptor:
+The Posted Interrupt Descriptor hosts the following fields:
+Posted Interrupt Request (PIR): Provide storage for posting (recording) interrupts (one bit
+per vector, for up to 256 vectors).
+
+Outstanding Notification (ON): Indicate if there is a notification event outstanding (not
+processed by processor or software) for this Posted Interrupt Descriptor. When this field is 0,
+hardware modifies it from 0 to 1 when generating a notification event, and the entity receiving
+the notification event (processor or software) resets it as part of posted interrupt processing.
+
+Suppress Notification (SN): Indicate if a notification event is to be suppressed (not
+generated) for non-urgent interrupt requests (interrupts processed through an IRTE with
+URG=0).
+
+Notification Vector (NV): Specify the vector for notification event (interrupt).
+
+Notification Destination (NDST): Specify the physical APIC-ID of the destination logical
+processor for the notification event.
+
+Design Overview
+==============
+In this design, we will cover the following items:
+1. Add a variable to control whether to enable VT-d posted-interrupts or not.
+2. VT-d PI feature detection.
+3. Extend posted-interrupt descriptor structure to cover VT-d PI specific items.
+4. Extend IRTE structure to support VT-d PI.
+5. Introduce a new global vector which is used for waking up the blocked vCPU.
+6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
+7. Update posted-interrupt descriptor during vCPU scheduling (when the state
+of the vCPU transitions among RUNSTATE_running / RUNSTATE_blocked /
+RUNSTATE_runnable / RUNSTATE_offline).
+8. How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).
+9. A new boot command line parameter for Xen, which lets the user control the VT-d PI feature.
+10. Multicast/broadcast and lowest priority interrupts consideration.
+
+
+Implementation details
+===================
+- New variable to control VT-d PI
+
+Like variable 'iommu_intremap' for interrupt remapping, it is very straightforward
+to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is set
+only when interrupt remapping and VT-d posted-interrupt are both enabled.
+
+- VT-d PI feature detection.
+Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt support.
+
+- Extend posted-interrupt descriptor structure to cover VT-d PI specific items.
+Here is the new structure for posted-interrupt descriptor:
+
+struct pi_desc {
+ DECLARE_BITMAP(pir, NR_VECTORS);
+ union {
+ struct
+ {
+ u16 on : 1, /* bit 256 - Outstanding Notification */
+ sn : 1, /* bit 257 - Suppress Notification */
+ rsvd_1 : 14; /* bit 271:258 - Reserved */
+ u8 nv; /* bit 279:272 - Notification Vector */
+ u8 rsvd_2; /* bit 287:280 - Reserved */
+ u32 ndst; /* bit 319:288 - Notification Destination */
+ };
+ u64 control;
+ };
+ u32 rsvd[6];
+} __attribute__ ((aligned (64)));
+
+- Extend IRTE structure to support VT-d PI.
+
+Here is the new structure for IRTE:
+/* interrupt remap entry */
+struct iremap_entry {
+ union {
+ struct { u64 lo, hi; };
+ struct {
+ u16 p : 1,
+ fpd : 1,
+ dm : 1,
+ rh : 1,
+ tm : 1,
+ dlm : 3,
+ avail : 4,
+ res_1 : 4;
+ u8 vector;
+ u8 res_2;
+ u32 dst;
+ u16 sid;
+ u16 sq : 2,
+ svt : 2,
+ res_3 : 12;
+ u32 res_4 : 32;
+ } remap;
+ struct {
+ u16 p : 1,
+ fpd : 1,
+ res_1 : 6,
+ avail : 4,
+ res_2 : 2,
+ urg : 1,
+ im : 1;
+ u8 vector;
+ u8 res_3;
+ u32 res_4 : 6,
+ pda_l : 26;
+ u16 sid;
+ u16 sq : 2,
+ svt : 2,
+ res_5 : 12;
+ u32 pda_h;
+ } post;
+ };
+};
+
+- Introduce a new global vector which is used to wake up the blocked vCPU.
+
+Currently, there is a global vector 'posted_intr_vector', which is used as the
+global notification vector for all vCPUs in the system. This vector is stored in
+the VMCS, and the CPU considers it a _special_ vector, using it to notify the related
+pCPU when an interrupt is recorded in the posted-interrupt descriptor.
+
+This existing global vector is a _special_ vector to the CPU, which handles it in a
+_special_ way compared to normal vectors; please refer to Section 29.6 in the Intel SDM
+http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
+for more information about how CPU handles it.
+
+With VT-d PI, the VT-d engine can issue a notification event when the
+assigned devices issue interrupts. We need to add a new global vector to
+wake up the blocked vCPU; please refer to a later section in this design for
+how to use this new global vector.
+
+- Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
+After VT-d PI is introduced, the format of IRTE is changed as follows:
+ Descriptor Address: the address of the posted-interrupt descriptor
+ Virtual Vector: the guest vector of the interrupt
+ URG: indicates if the interrupt is urgent
+ Other fields continue to have the same meaning
+
+'Descriptor Address' tells the destination vCPU of this interrupt, since
+each vCPU has a dedicated posted-interrupt descriptor.
+
+'Virtual Vector' tells the guest vector of the interrupt.
+
+When the guest changes the configuration of the interrupts, such as the
+CPU affinity or the vector, we need to update the associated IRTE accordingly.
+
+- Update posted-interrupt descriptor during vCPU scheduling
+
+The basic idea here is:
+1. When vCPU's state is RUNSTATE_running,
+ - Set 'NV' to 'posted_intr_vector'.
+ - Clear 'SN' to accept posted-interrupts.
+ - Set 'NDST' to the pCPU on which the vCPU will be running.
+2. When vCPU's state is RUNSTATE_blocked,
+ - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
+ related vCPU when posted-interrupt happens for it.
+ Please refer to the above section about the new global vector.
+ - Clear 'SN' to accept posted-interrupts
+3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
+ - Set 'SN' to suppress non-urgent interrupts
+ (Currently, we only support non-urgent interrupts)
+ When the vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
+ there is no need to accept posted-interrupt notification events,
+ since we don't change the behavior of the scheduler when the interrupt
+ occurs; we still need to wait for the next scheduling of the vCPU.
+ When external interrupts from assigned devices occur, the interrupts
+ are recorded in PIR, and will be synced to IRR before VM-Entry.
+ - Set 'NV' to 'posted_intr_vector'.
+
+- How to wakeup blocked vCPU when an interrupt is posted for it (wakeup notification handler).
+
+Here is the scenario for the usage of the new global vector:
+
+1. vCPU0 is running on pCPU0
+2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
+3. An external interrupt from an assigned device occurs for vCPU0; if we
+still use 'posted_intr_vector' as the notification vector for vCPU0, the
+notification event for vCPU0 (the event will go to pCPU1) will be consumed
+by vCPU1 incorrectly (remember this is a special vector to CPU). The worst
+case is that vCPU0 will never be woken up again since the wakeup event
+for it is always consumed by other vCPUs incorrectly. So we need to introduce
+another global vector, named 'pi_wakeup_vector', to wake up the blocked vCPU.
+
+After using 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the notification
+event using this new vector. Since this new vector is not a _special_ one to the CPU,
+it is just a normal vector. The CPU just receives a normal external interrupt,
+and we get control in the handler of this new vector. There the hypervisor
+can do something, such as waking up the blocked vCPU.
+
+Here is what we do for the blocked vCPU:
+1. Define a per-cpu list 'pi_blocked_vcpu', which stores the blocked
+vCPUs on the pCPU.
+2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
+into the per-cpu list belonging to the pCPU it was running on.
+3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
+
+In the handler of 'pi_wakeup_vector', we do:
+1. Get the physical CPU.
+2. Iterate the list 'pi_blocked_vcpu' of the current pCPU, if 'ON' is set,
+we unblock the associated vCPU.
+
+- New boot command line for Xen, which controls VT-d PI feature by user.
+
+Like 'intremap' for interrupt remapping, we add a new boot command line
+option 'intpost' for posted-interrupts.
+
+- Multicast/broadcast and lowest priority interrupts consideration.
+
+With VT-d PI, the destination vCPU information of an external interrupt
+from assigned devices is stored in the IRTE, which leads to the following
+design considerations:
+1. Multicast/broadcast interrupts cannot be posted.
+2. For lowest-priority interrupts, new Intel CPU/Chipset/root-complex
+(starting from Nehalem) ignore the TPR value, and instead support two other
+ways (configurable by BIOS) of handling lowest priority interrupts:
+ A) Round robin: In this method, the chipset simply delivers lowest priority
+interrupts in a round-robin manner across all the available logical CPUs. While
+this provides good load balancing, it was not always the best thing to do, as
+interrupts from the same device (like a NIC) end up running on all the CPUs,
+thrashing caches and taking locks. This led to the next scheme.
+ B) Vector hashing: In this method, hardware would apply a hash function
+on the vector value in the interrupt request, and use that hash to pick a logical
+CPU to route the lowest priority interrupt. This way, a given vector always goes
+to the same logical CPU, avoiding the thrashing problem above.
+
+So the gist of the above is that lowest priority interrupts have never been delivered as
+"lowest priority" in physical hardware.
+
+Vector hashing is used in this design.
--
2.1.0
Feng Wu
2015-08-25 01:57:53 UTC
Permalink
This patch includes the following aspects:
- Add a global vector to wake up the blocked vCPU
when an interrupt is being posted to it (This
part was sugguested by Yang Zhang <***@intel.com>).
- Adds a new per-vCPU tasklet to wakeup the blocked
vCPU. It can be used in the case vcpu_unblock
cannot be called directly.
- Define two per-cpu variables:
* pi_blocked_vcpu:
A list storing the vCPUs which were blocked on this pCPU.

* pi_blocked_vcpu_lock:
The spinlock to protect pi_blocked_vcpu.

CC: Kevin Tian <***@intel.com>
CC: Keir Fraser <***@xen.org>
CC: Jan Beulich <***@suse.com>
CC: Andrew Cooper <***@citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
---
v6:
- Fix some typos
- Ack the interrupt right after the spin_unlock in pi_wakeup_interrupt()

v4:
- Use local variables in pi_wakeup_interrupt()
- Remove vcpu from the blocked list when pi_desc.on==1; this
avoids kicking the vcpu multiple times.
- Remove tasklet

v3:
- This patch is generated by merging the following three patches in v2:
[RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
[RFC v2 10/15] vmx: Define two per-cpu variables
[RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
- rename 'vcpu_wakeup_tasklet' to 'pi_vcpu_wakeup_tasklet'
- Move the definition of 'pi_vcpu_wakeup_tasklet' to 'struct arch_vmx_struct'
- rename 'vcpu_wakeup_tasklet_handler' to 'pi_vcpu_wakeup_tasklet_handler'
- Make pi_wakeup_interrupt() static
- Rename 'blocked_vcpu_list' to 'pi_blocked_vcpu_list'
- move 'pi_blocked_vcpu_list' to 'struct arch_vmx_struct'
- Rename 'blocked_vcpu' to 'pi_blocked_vcpu'
- Rename 'blocked_vcpu_lock' to 'pi_blocked_vcpu_lock'

xen/arch/x86/hvm/vmx/vmcs.c | 3 ++
xen/arch/x86/hvm/vmx/vmx.c | 64 ++++++++++++++++++++++++++++++++++++++
xen/include/asm-x86/hvm/vmx/vmcs.h | 3 ++
xen/include/asm-x86/hvm/vmx/vmx.h | 5 +++
4 files changed, 75 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 28c553f..2dabf16 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -661,6 +661,9 @@ int vmx_cpu_up(void)
if ( cpu_has_vmx_vpid )
vpid_sync_all();

+ INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
+ spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
+
return 0;
}

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 2c1c770..9cde9a4 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -83,7 +83,15 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
static void vmx_invlpg_intercept(unsigned long vaddr);
static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);

+/*
+ * We maintain a per-CPU linked-list of vCPU, so in PI wakeup handler we
+ * can find which vCPU should be woken up.
+ */
+DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
+DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
+
uint8_t __read_mostly posted_intr_vector;
+uint8_t __read_mostly pi_wakeup_vector;

static int vmx_domain_initialise(struct domain *d)
{
@@ -106,6 +114,9 @@ static int vmx_vcpu_initialise(struct vcpu *v)

spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);

+ INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+ INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
+
v->arch.schedule_tail = vmx_do_resume;
v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
v->arch.ctxt_switch_to = vmx_ctxt_switch_to;
@@ -1976,6 +1987,54 @@ static struct hvm_function_table __initdata vmx_function_table = {
.altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
};

+/*
+ * Handle VT-d posted-interrupt when VCPU is blocked.
+ */
+static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
+{
+ struct arch_vmx_struct *vmx, *tmp;
+ struct vcpu *v;
+ spinlock_t *lock = &this_cpu(pi_blocked_vcpu_lock);
+ struct list_head *blocked_vcpus = &this_cpu(pi_blocked_vcpu);
+ LIST_HEAD(list);
+
+ spin_lock(lock);
+
+ /*
+ * XXX: The length of the list depends on how many vCPU is current
+ * blocked on this specific pCPU. This may hurt the interrupt latency
+ * if the list grows to too many entries.
+ */
+ list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocked_vcpu_list)
+ {
+ if ( pi_test_on(&vmx->pi_desc) )
+ {
+ list_del_init(&vmx->pi_blocked_vcpu_list);
+
+ /*
+ * We cannot call vcpu_unblock here, since it also needs
+ * 'pi_blocked_vcpu_lock', we store the vCPUs with ON
+ * set in another list and unblock them after we release
+ * 'pi_blocked_vcpu_lock'.
+ */
+ list_add_tail(&vmx->pi_vcpu_on_set_list, &list);
+ }
+ }
+
+ spin_unlock(lock);
+
+ ack_APIC_irq();
+
+ list_for_each_entry_safe(vmx, tmp, &list, pi_vcpu_on_set_list)
+ {
+ v = container_of(vmx, struct vcpu, arch.hvm_vmx);
+ list_del_init(&vmx->pi_vcpu_on_set_list);
+ vcpu_unblock(v);
+ }
+
+ this_cpu(irq_count)++;
+}
+
const struct hvm_function_table * __init start_vmx(void)
{
set_in_cr4(X86_CR4_VMXE);
@@ -2013,7 +2072,12 @@ const struct hvm_function_table * __init start_vmx(void)
}

if ( cpu_has_vmx_posted_intr_processing )
+ {
alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+
+ if ( iommu_intpost )
+ alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
+ }
else
{
vmx_function_table.deliver_posted_intr = NULL;
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 7e81752..9a986d0 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -161,6 +161,9 @@ struct arch_vmx_struct {
struct page_info *vmwrite_bitmap;

struct page_info *pml_pg;
+
+ struct list_head pi_blocked_vcpu_list;
+ struct list_head pi_vcpu_on_set_list;
};

int vmx_create_vmcs(struct vcpu *v);
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 03c529c..3fd66ba 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -28,6 +28,11 @@
#include <asm/hvm/trace.h>
#include <asm/hvm/vmx/vmcs.h>

+DECLARE_PER_CPU(struct list_head, pi_blocked_vcpu);
+DECLARE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
+
+extern uint8_t pi_wakeup_vector;
+
typedef union {
struct {
u64 r : 1, /* bit 0 - Read permission */
--
2.1.0
Jan Beulich
2015-09-07 11:47:35 UTC
Permalink
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -661,6 +661,9 @@ int vmx_cpu_up(void)
if ( cpu_has_vmx_vpid )
vpid_sync_all();
+ INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
+ spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
This initialization appears to be the sole reason why the respective
variables can't be static (in vmx.c). I'd really like to ask for this to
be adjusted, even if this means adding another call out of
vmx_cpu_up() into vmx.c.
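A possible shape of that adjustment - with the helper name invented here
purely for illustration - would be to keep the variables static in vmx.c and
export only an init hook for vmx_cpu_up() to call:

/* xen/arch/x86/hvm/vmx/vmx.c */
static DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
static DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);

void vmx_pi_per_cpu_init(unsigned int cpu)
{
    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
}

/* xen/arch/x86/hvm/vmx/vmcs.c, in vmx_cpu_up(), instead of the two
 * initialization lines above:
 *
 *     vmx_pi_per_cpu_init(cpu);
 */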
Post by Feng Wu
@@ -106,6 +114,9 @@ static int vmx_vcpu_initialise(struct vcpu *v)
spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);
+ INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+ INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
Are these really needed (these aren't list heads afaict)?
Post by Feng Wu
@@ -1976,6 +1987,54 @@ static struct hvm_function_table __initdata vmx_function_table = {
.altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
};
+/*
+ * Handle VT-d posted-interrupt when VCPU is blocked.
+ */
This is (and hence should be formatted as) a single line comment.
Post by Feng Wu
+static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
+{
+ struct arch_vmx_struct *vmx, *tmp;
+ struct vcpu *v;
+ spinlock_t *lock = &this_cpu(pi_blocked_vcpu_lock);
+ struct list_head *blocked_vcpus = &this_cpu(pi_blocked_vcpu);
At least on possibly performance relevant paths please avoid
multiple back-to-back uses of this_cpu() (use per_cpu() instead).
Post by Feng Wu
+ LIST_HEAD(list);
+
+ spin_lock(lock);
+
+ /*
+ * XXX: The length of the list depends on how many vCPU is current
+ * blocked on this specific pCPU. This may hurt the interrupt latency
+ * if the list grows to too many entries.
+ */
+ list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocked_vcpu_list)
+ {
+ if ( pi_test_on(&vmx->pi_desc) )
+ {
+ list_del_init(&vmx->pi_blocked_vcpu_list);
Why not just list_del() (also further down in the second loop)?
Post by Feng Wu
+
+ /*
+ * We cannot call vcpu_unblock here, since it also needs
+ * 'pi_blocked_vcpu_lock', we store the vCPUs with ON
+ * set in another list and unblock them after we release
+ * 'pi_blocked_vcpu_lock'.
+ */
At least at this point in time this doesn't appear to be correct: I
cannot see how, after this patch, vcpu_unblock() would end up
acquiring the lock. And without that (plus without anything ever
adding to pi_blocked_vcpu), this moving between two lists seems
rather odd; I think the splitting of the series into patches needs
to be re-thought unless (preferably) you can find a (reasonably
clean) way without using two lists in the first place.
Post by Feng Wu
+ list_add_tail(&vmx->pi_vcpu_on_set_list, &list);
+ }
+ }
+
+ spin_unlock(lock);
+
+ ack_APIC_irq();
So why here and not further up or further down? If this is "as
early as possible", a comment should say why it can't be moved
further up.
Post by Feng Wu
+ list_for_each_entry_safe(vmx, tmp, &list, pi_vcpu_on_set_list)
+ {
+ v = container_of(vmx, struct vcpu, arch.hvm_vmx);
+ list_del_init(&vmx->pi_vcpu_on_set_list);
+ vcpu_unblock(v);
+ }
+
+ this_cpu(irq_count)++;
This would probably better pair with ack_APIC_irq().

Jan
Wu, Feng
2015-09-08 08:50:15 UTC
Permalink
-----Original Message-----
Sent: Monday, September 07, 2015 7:48 PM
To: Wu, Feng
Subject: Re: [PATCH v6 14/18] vmx: posted-interrupt handling when vCPU is
blocked
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -661,6 +661,9 @@ int vmx_cpu_up(void)
if ( cpu_has_vmx_vpid )
vpid_sync_all();
+ INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
+ spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
This initialization appears to be the sole reason why the respective
variables can't be static (in vmx.c). I'd really like to ask for this to
be adjusted, even if this means adding another call out of
vmx_cpu_up() into vmx.c.
Post by Feng Wu
@@ -106,6 +114,9 @@ static int vmx_vcpu_initialise(struct vcpu *v)
spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);
+ INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+ INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_vcpu_on_set_list);
Are these really needed (these aren't list heads afaict)?
I can remove the initialization operations.
Post by Feng Wu
@@ -1976,6 +1987,54 @@ static struct hvm_function_table __initdata
vmx_function_table = {
Post by Feng Wu
.altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
};
+/*
+ * Handle VT-d posted-interrupt when VCPU is blocked.
+ */
This is (and hence should be formatted as) a single line comment.
Post by Feng Wu
+static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
+{
+ struct arch_vmx_struct *vmx, *tmp;
+ struct vcpu *v;
+ spinlock_t *lock = &this_cpu(pi_blocked_vcpu_lock);
+ struct list_head *blocked_vcpus = &this_cpu(pi_blocked_vcpu);
At least on possibly performance relevant paths please avoid
multiple back-to-back uses of this_cpu() (use per_cpu() instead).
Thanks for the suggestion; would you mind sharing more information
about why per_cpu() is preferable in this case? Thanks!
Post by Feng Wu
+ LIST_HEAD(list);
+
+ spin_lock(lock);
+
+ /*
+ * XXX: The length of the list depends on how many vCPU is current
+ * blocked on this specific pCPU. This may hurt the interrupt latency
+ * if the list grows to too many entries.
+ */
+ list_for_each_entry_safe(vmx, tmp, blocked_vcpus,
pi_blocked_vcpu_list)
Post by Feng Wu
+ {
+ if ( pi_test_on(&vmx->pi_desc) )
+ {
+ list_del_init(&vmx->pi_blocked_vcpu_list);
Why not just list_del() (also further down in the second loop)?
What is the bad effect for list_del_init() here?
Post by Feng Wu
+
+ /*
+ * We cannot call vcpu_unblock here, since it also needs
+ * 'pi_blocked_vcpu_lock', we store the vCPUs with ON
+ * set in another list and unblock them after we release
+ * 'pi_blocked_vcpu_lock'.
+ */
At least at this point in time this doesn't appear to be correct: I
cannot see how, after this patch, vcpu_unblock() would end up
acquiring the lock.
vcpu_unblock()
--> vcpu_wake()
--> arch_vcpu_wake_prepare()
--> spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
v->arch.hvm_vmx.pi_block_cpu), flags);
And without that (plus without anything ever
adding to pi_blocked_vcpu), this moving between two lists seems
rather odd; I think the splitting of the series into patches needs
to be re-thought unless (preferably) you can find a (reasonably
clean) way without using two lists in the first place.
Do you mean that if I use two lists here, I need to re-split this patch set?
I wonder why the two lists affect the patch-set splitting.

Thanks,
Feng
Post by Feng Wu
+ list_add_tail(&vmx->pi_vcpu_on_set_list, &list);
+ }
+ }
+
+ spin_unlock(lock);
+
+ ack_APIC_irq();
So why here and not further up or further down? If this is "as
early as possible", a comment should say why it can't be moved
further up.
Post by Feng Wu
+ list_for_each_entry_safe(vmx, tmp, &list, pi_vcpu_on_set_list)
+ {
+ v = container_of(vmx, struct vcpu, arch.hvm_vmx);
+ list_del_init(&vmx->pi_vcpu_on_set_list);
+ vcpu_unblock(v);
+ }
+
+ this_cpu(irq_count)++;
This would probably better pair with ack_APIC_irq().
Jan
Jan Beulich
2015-09-08 09:08:21 UTC
Permalink
Post by Wu, Feng
Sent: Monday, September 07, 2015 7:48 PM
Post by Feng Wu
+static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
+{
+ struct arch_vmx_struct *vmx, *tmp;
+ struct vcpu *v;
+ spinlock_t *lock = &this_cpu(pi_blocked_vcpu_lock);
+ struct list_head *blocked_vcpus = &this_cpu(pi_blocked_vcpu);
At least on possibly performance relevant paths please avoid
multiple back-to-back uses of this_cpu() (use per_cpu() instead).
Thanks for the suggestion. Would you mind sharing more information
about why per_cpu() is preferable in this case? Thanks!
The compiler can't fold the smp_processor_id() uses inside multiple
uses of this_cpu().
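For illustration, the pattern being asked for would look roughly like this (a sketch only, not code from the series; it just caches the CPU id once and reuses it):

    static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
    {
        /* Evaluate smp_processor_id() a single time ... */
        unsigned int cpu = smp_processor_id();
        /* ... and feed it to per_cpu(), instead of letting each this_cpu() redo the lookup. */
        spinlock_t *lock = &per_cpu(pi_blocked_vcpu_lock, cpu);
        struct list_head *blocked_vcpus = &per_cpu(pi_blocked_vcpu, cpu);

        spin_lock(lock);
        /* ... walk blocked_vcpus as in the patch ... */
        spin_unlock(lock);
    }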
Post by Wu, Feng
Post by Feng Wu
+ LIST_HEAD(list);
+
+ spin_lock(lock);
+
+ /*
+ * XXX: The length of the list depends on how many vCPU is current
+ * blocked on this specific pCPU. This may hurt the interrupt latency
+ * if the list grows to too many entries.
+ */
+ list_for_each_entry_safe(vmx, tmp, blocked_vcpus,
pi_blocked_vcpu_list)
Post by Feng Wu
+ {
+ if ( pi_test_on(&vmx->pi_desc) )
+ {
+ list_del_init(&vmx->pi_blocked_vcpu_list);
Why not just list_del() (also further down in the second loop)?
What is the downside of using list_del_init() here?
Double the writes on a performance critical path.
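In other words (a sketch only, assuming the usual Linux-style list helpers): list_del() merely unlinks the entry, while list_del_init() additionally re-points the entry at itself, writes that are wasted when the entry is not touched again before being re-linked:

    /* Unlink only; the extra self-pointing writes of list_del_init() are not needed here. */
    list_del(&vmx->pi_blocked_vcpu_list);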
Post by Wu, Feng
Post by Feng Wu
+
+ /*
+ * We cannot call vcpu_unblock here, since it also needs
+ * 'pi_blocked_vcpu_lock', we store the vCPUs with ON
+ * set in another list and unblock them after we release
+ * 'pi_blocked_vcpu_lock'.
+ */
At least at this point in time this doesn't appear to be correct: I
cannot see how, after this patch, vcpu_unblock() would end up
acquiring the lock.
vcpu_unblock()
--> vcpu_wake()
--> arch_vcpu_wake_prepare()
--> spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
v->arch.hvm_vmx.pi_block_cpu), flags);
No - arch_vcpu_wake_prepare() gets introduced only in patch 16.
Post by Wu, Feng
And without that (plus without anything ever
adding to pi_blocked_vcpu), this moving between two lists seems
rather odd; I think the splitting of the series into patches needs
to be re-thought unless (preferably) you can find a (reasonably
clean) way without using two lists in the first place.
Do you mean that if two lists are used here, I need to re-split this patch set?
I wonder why using two lists affects how the series is split.
Because (see above) at this point in the series it makes no sense to
use two lists, as there's no locking issue to take care of.

Jan
Wu, Feng
2015-09-08 09:14:27 UTC
Permalink
-----Original Message-----
Sent: Tuesday, September 08, 2015 5:08 PM
To: Wu, Feng
Subject: RE: [PATCH v6 14/18] vmx: posted-interrupt handling when vCPU is
blocked
Post by Wu, Feng
Post by Jan Beulich
And without that (plus without anything ever
adding to pi_blocked_vcpu), this moving between two lists seems
rather odd; I think the splitting of the series into patches needs
to be re-thought unless (preferably) you can find a (reasonably
clean) way without using two lists in the first place.
Do you mean that if two lists are used here, I need to re-split this patch set?
I wonder why using two lists affects how the series is split.
Because (see above) at this point in the series it makes no sense to
use two lists, as there's no locking issue to take care of.
Oh, you are right! I need to reconsider how to split the series. Thanks for
pointing this out.

Thanks,
Feng
Jan
Feng Wu
2015-08-25 01:57:48 UTC
Permalink
Remove pointless casts.

Signed-off-by: Feng Wu <***@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <***@oracle.com>
---
v5:
- Newly added.

xen/drivers/passthrough/vtd/utils.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index 44c4ef5..162b764 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -234,10 +234,9 @@ static void dump_iommu_info(unsigned char key)
continue;
printk(" %04x: %x %x %04x %08x %02x %x %x %x %x %x"
" %x %x\n", i,
- (u32)p->hi.svt, (u32)p->hi.sq, (u32)p->hi.sid,
- (u32)p->lo.dst, (u32)p->lo.vector, (u32)p->lo.avail,
- (u32)p->lo.dlm, (u32)p->lo.tm, (u32)p->lo.rh,
- (u32)p->lo.dm, (u32)p->lo.fpd, (u32)p->lo.p);
+ p->hi.svt, p->hi.sq, p->hi.sid, p->lo.dst, p->lo.vector,
+ p->lo.avail, p->lo.dlm, p->lo.tm, p->lo.rh, p->lo.dm,
+ p->lo.fpd, p->lo.p);
print_cnt++;
}
if ( iremap_entries )
@@ -281,11 +280,10 @@ static void dump_iommu_info(unsigned char key)

printk(" %02x: %04x %x %x %x %x %x %x"
" %x %02x\n", i,
- (u32)remap->index_0_14 | ((u32)remap->index_15 << 15),
- (u32)remap->format, (u32)remap->mask, (u32)remap->trigger,
- (u32)remap->irr, (u32)remap->polarity,
- (u32)remap->delivery_status, (u32)remap->delivery_mode,
- (u32)remap->vector);
+ remap->index_0_14 | ((u32)remap->index_15 << 15),
+ remap->format, remap->mask, remap->trigger, remap->irr,
+ remap->polarity, remap->delivery_status, remap->delivery_mode,
+ remap->vector);
}
}
}
--
2.1.0
Jan Beulich
2015-09-04 14:55:47 UTC
Permalink
Post by Feng Wu
Remove pointless casts.
Much appreciated, but ...
Post by Feng Wu
@@ -281,11 +280,10 @@ static void dump_iommu_info(unsigned char key)
printk(" %02x: %04x %x %x %x %x %x %x"
" %x %02x\n", i,
- (u32)remap->index_0_14 | ((u32)remap->index_15 << 15),
- (u32)remap->format, (u32)remap->mask, (u32)remap->trigger,
- (u32)remap->irr, (u32)remap->polarity,
- (u32)remap->delivery_status, (u32)remap->delivery_mode,
- (u32)remap->vector);
+ remap->index_0_14 | ((u32)remap->index_15 << 15),
... why are you retaining this one?

Jan
Feng Wu
2015-08-25 01:57:56 UTC
Permalink
Add the utility to dump the posted format IRTE.

CC: Yang Zhang <***@intel.com>
CC: Kevin Tian <***@intel.com>
Signed-off-by: Feng Wu <***@intel.com>
---
v6:
- Fix a typo

v4:
- Newly added

xen/drivers/passthrough/vtd/utils.c | 43 +++++++++++++++++++++++++++++++++++--
1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index 9d556da..0c7ce3f 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -203,6 +203,9 @@ static void dump_iommu_info(unsigned char key)
ecap_intr_remap(iommu->ecap) ? "" : "not ",
(status & DMA_GSTS_IRES) ? " and enabled" : "" );

+ printk(" Interrupt Posting: %ssupported.\n",
+ cap_intr_post(iommu->ecap) ? "" : "not ");
+
if ( status & DMA_GSTS_IRES )
{
/* Dump interrupt remapping table. */
@@ -213,6 +216,7 @@ static void dump_iommu_info(unsigned char key)

printk(" Interrupt remapping table (nr_entry=%#x. "
"Only dump P=1 entries here):\n", nr_entry);
+ printk ("Entries for remapped format:\n");
printk(" SVT SQ SID DST V AVL DLM TM RH DM "
"FPD P\n");
for ( i = 0; i < nr_entry; i++ )
@@ -220,7 +224,7 @@ static void dump_iommu_info(unsigned char key)
struct iremap_entry *p;
if ( i % (1 << IREMAP_ENTRY_ORDER) == 0 )
{
- /* This entry across page boundry */
+ /* This entry across page boundary. */
if ( iremap_entries )
unmap_vtd_domain_page(iremap_entries);

@@ -230,7 +234,7 @@ static void dump_iommu_info(unsigned char key)
else
p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];

- if ( !p->remap.p )
+ if ( !p->remap.p || p->remap.im )
continue;
printk(" %04x: %x %x %04x %08x %02x %x %x %x %x %x"
" %x %x\n", i,
@@ -239,6 +243,41 @@ static void dump_iommu_info(unsigned char key)
p->remap.rh, p->remap.dm, p->remap.fpd, p->remap.p);
print_cnt++;
}
+
+ if ( iremap_entries )
+ unmap_vtd_domain_page(iremap_entries);
+
+ iremap_entries = NULL;
+ printk ("\nEntries for posted format:\n");
+ printk(" SVT SQ SID PDA V URG AVL FPD P\n");
+ for ( i = 0; i < nr_entry; i++ )
+ {
+ struct iremap_entry *p;
+ if ( i % (1 << IREMAP_ENTRY_ORDER) == 0 )
+ {
+ /* This entry across page boundry */
+ if ( iremap_entries )
+ unmap_vtd_domain_page(iremap_entries);
+
+ GET_IREMAP_ENTRY(iremap_maddr, i,
+ iremap_entries, p);
+ }
+ else
+ p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
+
+ if ( !p->post.p || !p->post.im )
+ continue;
+
+ printk(" %04x: %x %x %04x %16lx %02x %x %x %x %x\n",
+ i,
+ p->post.svt, p->post.sq, p->post.sid,
+ ((u64)p->post.pda_h << 32) | (p->post.pda_l << 6),
+ p->post.vector, p->post.urg, p->post.avail, p->post.fpd,
+ p->post.p);
+
+ print_cnt++;
+ }
+
if ( iremap_entries )
unmap_vtd_domain_page(iremap_entries);
if ( iommu_ir_ctrl(iommu)->iremap_num != print_cnt )
--
2.1.0
Jan Beulich
2015-09-07 13:04:43 UTC
Permalink
Post by Feng Wu
@@ -220,7 +224,7 @@ static void dump_iommu_info(unsigned char key)
struct iremap_entry *p;
if ( i % (1 << IREMAP_ENTRY_ORDER) == 0 )
{
- /* This entry across page boundry */
+ /* This entry across page boundary. */
Either leave the comment alone, or fix its content along with its format.
Post by Feng Wu
@@ -230,7 +234,7 @@ static void dump_iommu_info(unsigned char key)
else
p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
- if ( !p->remap.p )
+ if ( !p->remap.p || p->remap.im )
continue;
printk(" %04x: %x %x %04x %08x %02x %x %x %x %x %x"
" %x %x\n", i,
@@ -239,6 +243,41 @@ static void dump_iommu_info(unsigned char key)
p->remap.rh, p->remap.dm, p->remap.fpd, p->remap.p);
print_cnt++;
}
+
+ if ( iremap_entries )
+ unmap_vtd_domain_page(iremap_entries);
+
+ iremap_entries = NULL;
+ printk ("\nEntries for posted format:\n");
+ printk(" SVT SQ SID PDA V URG AVL FPD P\n");
+ for ( i = 0; i < nr_entry; i++ )
+ {
+ struct iremap_entry *p;
+ if ( i % (1 << IREMAP_ENTRY_ORDER) == 0 )
+ {
+ /* This entry across page boundry */
+ if ( iremap_entries )
+ unmap_vtd_domain_page(iremap_entries);
+
+ GET_IREMAP_ENTRY(iremap_maddr, i,
+ iremap_entries, p);
+ }
+ else
+ p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
+
+ if ( !p->post.p || !p->post.im )
+ continue;
+
+ printk(" %04x: %x %x %04x %16lx %02x %x %x %x %x\n",
+ i,
+ p->post.svt, p->post.sq, p->post.sid,
+ ((u64)p->post.pda_h << 32) | (p->post.pda_l << 6),
+ p->post.vector, p->post.urg, p->post.avail, p->post.fpd,
+ p->post.p);
+
+ print_cnt++;
+ }
Hmm, this two stage approach seems to imply more overhead in
terms of execution time and in terms of analysis (of the output)
time. Can you see to make this a single loop with the output
formatted in a way such that
- one can immediately recognize the kind that each line represents,
- mixed output is still sufficiently aligned to retain overall readability?

Jan
Wu, Feng
2015-09-08 05:38:51 UTC
Permalink
-----Original Message-----
Sent: Monday, September 07, 2015 9:05 PM
To: Wu, Feng
Subject: Re: [Xen-devel] [PATCH v6 17/18] VT-d: Dump the posted format IRTE
Post by Feng Wu
@@ -220,7 +224,7 @@ static void dump_iommu_info(unsigned char key)
struct iremap_entry *p;
if ( i % (1 << IREMAP_ENTRY_ORDER) == 0 )
{
- /* This entry across page boundry */
+ /* This entry across page boundary. */
Either leave the comment alone, or fix its content along with its format.
Post by Feng Wu
@@ -230,7 +234,7 @@ static void dump_iommu_info(unsigned char key)
else
p = &iremap_entries[i % (1 <<
IREMAP_ENTRY_ORDER)];
Post by Feng Wu
- if ( !p->remap.p )
+ if ( !p->remap.p || p->remap.im )
continue;
printk(" %04x: %x %x %04x %08x %02x %x %x %x %x %x"
Post by Feng Wu
" %x %x\n", i,
@@ -239,6 +243,41 @@ static void dump_iommu_info(unsigned char key)
p->remap.rh, p->remap.dm, p->remap.fpd,
p->remap.p);
Post by Feng Wu
print_cnt++;
}
+
+ if ( iremap_entries )
+ unmap_vtd_domain_page(iremap_entries);
+
+ iremap_entries = NULL;
+ printk ("\nEntries for posted format:\n");
+ printk(" SVT SQ SID PDA V
URG AVL FPD P\n");
Post by Feng Wu
+ for ( i = 0; i < nr_entry; i++ )
+ {
+ struct iremap_entry *p;
+ if ( i % (1 << IREMAP_ENTRY_ORDER) == 0 )
+ {
+ /* This entry across page boundry */
+ if ( iremap_entries )
+ unmap_vtd_domain_page(iremap_entries);
+
+ GET_IREMAP_ENTRY(iremap_maddr, i,
+ iremap_entries, p);
+ }
+ else
+ p = &iremap_entries[i % (1 <<
IREMAP_ENTRY_ORDER)];
Post by Feng Wu
+
+ if ( !p->post.p || !p->post.im )
+ continue;
+
+
printk(" %04x: %x %x %04x %16lx %02x %x %x %x %x\n",
Post by Feng Wu
+ i,
+ p->post.svt, p->post.sq, p->post.sid,
+ ((u64)p->post.pda_h << 32) | (p->post.pda_l << 6),
+ p->post.vector, p->post.urg, p->post.avail,
p->post.fpd,
Post by Feng Wu
+ p->post.p);
+
+ print_cnt++;
+ }
Hmm, this two stage approach seems to imply more overhead in
terms of execution time and in terms of analysis (of the output)
time. Can you see to make this a single loop with the output
formatted in a way such that
- one can immediately recognize the kind that each line represents,
This is easy: I can add a prefix to each line, such as "posted" or "remapped".
- mixed output is still sufficiently aligned to retain overall readability?
I think this is a little hard: the posted and remapped formats have many
different fields, and the number of fields differs between the two, so
mixing them in one table would hurt readability considerably.

And considering this is just a debug function, is it really that bad to scan
the table twice here? The output is clear and readable this way.

Thanks,
Feng
Jan
Jan Beulich
2015-09-08 09:16:51 UTC
Permalink
Post by Wu, Feng
And consider this is just a debug function, is it that bad to add a
two stage search here? The output is clear and nice in this way.
The primary questions to ask are: How much longer will the function
take on a huge system with many guests when using two loops
compared with just one? And how much longer will it take someone
looking at the output to find the entry (s)he's after, possibly not
immediately realizing that its absence among dozens or hundreds of
lines of one kind of output means it could be in the other half, far
away in the log?

Jan
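For illustration, a single-loop variant along the lines discussed could look roughly like this (a sketch only, reusing the field and macro names from the patch above and assuming the P and IM bits can be tested through either view; this is not code that was posted in the series):

    for ( i = 0; i < nr_entry; i++ )
    {
        struct iremap_entry *p;

        if ( i % (1 << IREMAP_ENTRY_ORDER) == 0 )
        {
            /* This entry crosses a page boundary. */
            if ( iremap_entries )
                unmap_vtd_domain_page(iremap_entries);
            GET_IREMAP_ENTRY(iremap_maddr, i, iremap_entries, p);
        }
        else
            p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];

        if ( !p->remap.p )
            continue;

        if ( p->remap.im )
            printk("  posted:   %04x: %x %x %04x %16lx %02x %x %x %x %x\n",
                   i, p->post.svt, p->post.sq, p->post.sid,
                   ((u64)p->post.pda_h << 32) | (p->post.pda_l << 6),
                   p->post.vector, p->post.urg, p->post.avail, p->post.fpd,
                   p->post.p);
        else
            printk("  remapped: %04x: %x %x %04x %08x %02x %x %x %x %x %x %x %x\n",
                   i, p->remap.svt, p->remap.sq, p->remap.sid, p->remap.dst,
                   p->remap.vector, p->remap.avail, p->remap.dlm, p->remap.tm,
                   p->remap.rh, p->remap.dm, p->remap.fpd, p->remap.p);

        print_cnt++;
    }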
Feng Wu
2015-08-25 01:57:43 UTC
Permalink
VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

CC: Yang Zhang <***@intel.com>
CC: Kevin Tian <***@intel.com>
Signed-off-by: Feng Wu <***@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <***@oracle.com>
---
v5:
- Remove blank line

v4:
- Correct a logic error when setting iommu_intpost to 0

v3:
- Remove the "if no intremap then no intpost" logic in
intel_vtd_setup(), it is covered in the iommu_setup().
- Add "if no intremap then no intpost" logic in the end
of init_vtd_hw() which is called by vtd_resume().

So the logic exists in the following three places:
- parse_iommu_param()
- iommu_setup()
- init_vtd_hw()

xen/drivers/passthrough/vtd/iommu.c | 17 +++++++++++++++--
xen/drivers/passthrough/vtd/iommu.h | 1 +
2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 1dffc40..52c7cc9 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2079,6 +2079,9 @@ static int init_vtd_hw(void)
disable_intremap(drhd->iommu);
}

+ if ( !iommu_intremap )
+ iommu_intpost = 0;
+
/*
* Set root entries for each VT-d engine. After set root entry,
* must globally invalidate context cache, and then globally
@@ -2147,8 +2150,8 @@ int __init intel_vtd_setup(void)
}

/* We enable the following features only if they are supported by all VT-d
- * engines: Snoop Control, DMA passthrough, Queued Invalidation and
- * Interrupt Remapping.
+ * engines: Snoop Control, DMA passthrough, Queued Invalidation, Interrupt
+ * Remapping, and Posted Interrupt
*/
for_each_drhd_unit ( drhd )
{
@@ -2176,6 +2179,14 @@ int __init intel_vtd_setup(void)
if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
iommu_intremap = 0;

+ /*
+ * We cannot use posted interrupt if X86_FEATURE_CX16 is
+ * not supported, since we count on this feature to
+ * atomically update 16-byte IRTE in posted format.
+ */
+ if ( !iommu_intremap || !cap_intr_post(iommu->cap) || !cpu_has_cx16 )
+ iommu_intpost = 0;
+
if ( !vtd_ept_page_compatible(iommu) )
iommu_hap_pt_share = 0;

@@ -2201,6 +2212,7 @@ int __init intel_vtd_setup(void)
P(iommu_passthrough, "Dom0 DMA Passthrough");
P(iommu_qinval, "Queued Invalidation");
P(iommu_intremap, "Interrupt Remapping");
+ P(iommu_intpost, "Posted Interrupt");
P(iommu_hap_pt_share, "Shared EPT tables");
#undef P

@@ -2220,6 +2232,7 @@ int __init intel_vtd_setup(void)
iommu_passthrough = 0;
iommu_qinval = 0;
iommu_intremap = 0;
+ iommu_intpost = 0;
return ret;
}

diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index ac71ed1..22abefe 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -61,6 +61,7 @@
/*
* Decoding Capability Register
*/
+#define cap_intr_post(c) (((c) >> 59) & 1)
#define cap_read_drain(c) (((c) >> 55) & 1)
#define cap_write_drain(c) (((c) >> 54) & 1)
#define cap_max_amask_val(c) (((c) >> 48) & 0x3f)
--
2.1.0
Jan Beulich
2015-09-04 14:31:22 UTC
Permalink
Post by Feng Wu
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2079,6 +2079,9 @@ static int init_vtd_hw(void)
disable_intremap(drhd->iommu);
}
+ if ( !iommu_intremap )
+ iommu_intpost = 0;
Why is this being repeated here? iommu_setup() is already taking
care of this thanks to the earlier patch.
Post by Feng Wu
@@ -2176,6 +2179,14 @@ int __init intel_vtd_setup(void)
if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
iommu_intremap = 0;
+ /*
+ * We cannot use posted interrupt if X86_FEATURE_CX16 is
+ * not supported, since we count on this feature to
+ * atomically update 16-byte IRTE in posted format.
+ */
+ if ( !iommu_intremap || !cap_intr_post(iommu->cap) || !cpu_has_cx16 )
+ iommu_intpost = 0;
You should check iommu_intpost instead of iommu_intremap to
match the other checks above here.

Jan
Wu, Feng
2015-09-06 01:49:37 UTC
Permalink
-----Original Message-----
Sent: Friday, September 04, 2015 10:31 PM
To: Wu, Feng
Subject: Re: [Xen-devel] [PATCH v6 04/18] vt-d: VT-d Posted-Interrupts feature
detection
Post by Feng Wu
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2079,6 +2079,9 @@ static int init_vtd_hw(void)
disable_intremap(drhd->iommu);
}
+ if ( !iommu_intremap )
+ iommu_intpost = 0;
Why is this being repeated here? iommu_setup() is already taking
care of this thanks to the earlier patch.
Here is my original thought: we have two paths to get here, as below:

iommu_setup() -> ... -> init_vtd_hw()
vtd_resume() -> ... -> init_vtd_hw()

I added the logic here to cover the vtd_resume() case. However, after
thinking about it more, it seems I don't need to do this, because the
logic has already run before suspend, so the setting should still be in
place after resume, right?
Post by Feng Wu
@@ -2176,6 +2179,14 @@ int __init intel_vtd_setup(void)
if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
iommu_intremap = 0;
+ /*
+ * We cannot use posted interrupt if X86_FEATURE_CX16 is
+ * not supported, since we count on this feature to
+ * atomically update 16-byte IRTE in posted format.
+ */
+ if ( !iommu_intremap || !cap_intr_post(iommu->cap)
|| !cpu_has_cx16 )
Post by Feng Wu
+ iommu_intpost = 0;
You should check iommu_intpost instead of iommu_intremap to
match the other checks above here.
Here I followed your advice:
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg01261.html

I think the following code should be ok here though, if you like it, I can change back.

+ if ( !iommu_intpost || !cap_intr_post(iommu->cap)
+ || !cpu_has_cx16 )
+ iommu_intpost = 0;


Thanks,
Feng
Jan
Jan Beulich
2015-09-07 10:43:17 UTC
Permalink
Post by Wu, Feng
Sent: Friday, September 04, 2015 10:31 PM
Post by Feng Wu
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2079,6 +2079,9 @@ static int init_vtd_hw(void)
disable_intremap(drhd->iommu);
}
+ if ( !iommu_intremap )
+ iommu_intpost = 0;
Why is this being repeated here? iommu_setup() is already taking
care of this thanks to the earlier patch.
iommu_setup() -> ... -> init_vtd_hw()
vtd_resume() -> ... -> init_vtd_hw()
I added the logic here to cover the vtd_resume() case. However, after
thinking about it more, it seems I don't need to do this, because the
logic has already run before suspend, so the setting should still be in
place after resume, right?
Obviously any assignment to variables like this one should be benign
(as in not change the variable's value) during resume, or else it'll be
very likely for other stuff to get out of sync and hence broken.
Post by Wu, Feng
Post by Feng Wu
@@ -2176,6 +2179,14 @@ int __init intel_vtd_setup(void)
if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
iommu_intremap = 0;
+ /*
+ * We cannot use posted interrupt if X86_FEATURE_CX16 is
+ * not supported, since we count on this feature to
+ * atomically update 16-byte IRTE in posted format.
+ */
+ if ( !iommu_intremap || !cap_intr_post(iommu->cap)
|| !cpu_has_cx16 )
Post by Feng Wu
+ iommu_intpost = 0;
You should check iommu_intpost instead of iommu_intremap to
match the other checks above here.
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg01261.html
I think the following code should be ok here though, if you like it, I can change back.
+ if ( !iommu_intpost || !cap_intr_post(iommu->cap)
+ || !cpu_has_cx16 )
+ iommu_intpost = 0;
The first part is neither in line with the earlier adjustments (as _still_
seen in the patch context above), nor does it make much sense to set
a variable to zero when it already is zero.

Jan
Wu, Feng
2015-09-08 02:35:53 UTC
Permalink
-----Original Message-----
Sent: Monday, September 07, 2015 6:43 PM
To: Wu, Feng
Subject: RE: [Xen-devel] [PATCH v6 04/18] vt-d: VT-d Posted-Interrupts feature
detection
Post by Wu, Feng
Sent: Friday, September 04, 2015 10:31 PM
Post by Feng Wu
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2079,6 +2079,9 @@ static int init_vtd_hw(void)
disable_intremap(drhd->iommu);
}
@@ -2176,6 +2179,14 @@ int __init intel_vtd_setup(void)
if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
iommu_intremap = 0;
+ /*
+ * We cannot use posted interrupt if X86_FEATURE_CX16 is
+ * not supported, since we count on this feature to
+ * atomically update 16-byte IRTE in posted format.
+ */
+ if ( !iommu_intremap || !cap_intr_post(iommu->cap)
|| !cpu_has_cx16 )
Post by Feng Wu
+ iommu_intpost = 0;
You should check iommu_intpost instead of iommu_intremap to
match the other checks above here.
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg01261.html
I think the following code should be ok here though, if you like it, I can change back.
+ if ( !iommu_intpost || !cap_intr_post(iommu->cap)
+ || !cpu_has_cx16 )
+ iommu_intpost = 0;
The first part is neither in line with the earlier adjustments (as _still_
seen in the patch context above), nor does it make much sense to set
a variable to zero when it already is zero.
So, what about this?

+ if ( !cap_intr_post(iommu->cap) || !cpu_has_cx16 )
+ iommu_intpost = 0;

Thanks,
Feng
Jan
Jan Beulich
2015-09-08 05:18:08 UTC
Permalink
Post by Wu, Feng
Sent: Monday, September 07, 2015 6:43 PM
Post by Wu, Feng
Sent: Friday, September 04, 2015 10:31 PM
Post by Feng Wu
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2079,6 +2079,9 @@ static int init_vtd_hw(void)
disable_intremap(drhd->iommu);
}
@@ -2176,6 +2179,14 @@ int __init intel_vtd_setup(void)
if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
iommu_intremap = 0;
+ /*
+ * We cannot use posted interrupt if X86_FEATURE_CX16 is
+ * not supported, since we count on this feature to
+ * atomically update 16-byte IRTE in posted format.
+ */
+ if ( !iommu_intremap || !cap_intr_post(iommu->cap)
|| !cpu_has_cx16 )
Post by Feng Wu
+ iommu_intpost = 0;
You should check iommu_intpost instead of iommu_intremap to
match the other checks above here.
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg01261.html
I think the following code should be ok here though, if you like it, I can
change back.
+ if ( !iommu_intpost || !cap_intr_post(iommu->cap)
+ || !cpu_has_cx16 )
+ iommu_intpost = 0;
The first part is neither in line with the earlier adjustments (as _still_
seen in the patch context above), nor does it make much sense to set
a variable to zero when it already is zero.
So, what about this?
+ if ( !cap_intr_post(iommu->cap) || !cpu_has_cx16 )
+ iommu_intpost = 0;
Yes. While still not in sync with the other checks, this one is actually better than
those (which kind of pointlessly check the feature they want to disable).

Jan
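Put together, the detection logic that came out of this exchange would end up roughly as below (a sketch only; placement inside the for_each_drhd_unit() loop of intel_vtd_setup() is as in the patch):

    if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
        iommu_intremap = 0;

    /*
     * Posting additionally requires the capability bit and cmpxchg16b,
     * the latter for atomically updating the 16-byte IRTE.
     */
    if ( !cap_intr_post(iommu->cap) || !cpu_has_cx16 )
        iommu_intpost = 0;

The "no intremap, hence no intpost" dependency itself is enforced once in iommu_setup().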
Feng Wu
2015-08-25 01:57:41 UTC
Permalink
This patch adds cmpxchg16b support for x86-64, so software
can perform 128-bit atomic write/read.

CC: Keir Fraser <***@xen.org>
CC: Jan Beulich <***@suse.com>
CC: Andrew Cooper <***@citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
---
v6:
- Fix a typo

v5:
- Change back the parameters of __cmpxchg16b() to __uint128_t *
- Remove pointless cast for 'ptr'
- Remove pointless parentheses
- Use A constraint for the output

v4:
- Use pointer as the parameter of __cmpxchg16b().
- Use gcc's __uint128_t built-in type
- Make the parameters of __cmpxchg16b() void *

v3:
- Newly added.

xen/include/asm-x86/x86_64/system.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/xen/include/asm-x86/x86_64/system.h b/xen/include/asm-x86/x86_64/system.h
index 662813a..e4e959d 100644
--- a/xen/include/asm-x86/x86_64/system.h
+++ b/xen/include/asm-x86/x86_64/system.h
@@ -6,6 +6,34 @@
(unsigned long)(n),sizeof(*(ptr))))

/*
+ * Atomic 16 bytes compare and exchange. Compare OLD with MEM, if
+ * identical, store NEW in MEM. Return the initial value in MEM.
+ * Success is indicated by comparing RETURN with OLD.
+ *
+ * This function can only be called when cpu_has_cx16 is true.
+ */
+
+static always_inline __uint128_t __cmpxchg16b(
+ volatile void *ptr, __uint128_t *old, __uint128_t *new)
+{
+ __uint128_t prev;
+ uint64_t new_high = *new >> 64;
+ uint64_t new_low = (uint64_t)*new;
+
+ ASSERT(cpu_has_cx16);
+
+ asm volatile ( "lock; cmpxchg16b %3"
+ : "=A" (prev)
+ : "c" (new_high), "b" (new_low), "m" (*__xg(ptr)), "0" (*old)
+ : "memory" );
+
+ return prev;
+}
+
+#define cmpxchg16b(ptr,o,n) \
+ __cmpxchg16b((ptr), (__uint128_t *)(o), (__uint128_t *)(n))
+
+/*
* This function causes value _o to be changed to _n at location _p.
* If this access causes a fault then we return 1, otherwise we return 0.
* If no fault occurs then _o is updated to the value we saw at _p. If this
--
2.1.0
Jan Beulich
2015-09-04 14:22:52 UTC
Permalink
Post by Feng Wu
This patch adds cmpxchg16b support for x86-64, so software
can perform 128-bit atomic write/read.
---
- Fix a typo
- Change back the parameters of __cmpxchg16b() to __uint128_t *
- Remove pointless cast for 'ptr'
- Remove pointless parentheses
- Use A constraint for the output
There are a few comments I had on v4 which neither have been
addressed verbally nor have got incorporated into the patch. See
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg04694.html

Jan
Wu, Feng
2015-09-06 06:07:16 UTC
Permalink
-----Original Message-----
Sent: Friday, September 04, 2015 10:23 PM
To: Wu, Feng
Subject: Re: [PATCH v6 02/18] Add cmpxchg16b support for x86-64
Post by Feng Wu
This patch adds cmpxchg16b support for x86-64, so software
can perform 128-bit atomic write/read.
---
- Fix a typo
- Change back the parameters of __cmpxchg16b() to __uint128_t *
- Remove pointless cast for 'ptr'
- Remove pointless parentheses
- Use A constraint for the output
There are a few comments I had on v4 which neither have been
addressed verbally nor have got incorporated into the patch. See
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg04694.html
Do you mean I missed the following part in the above link?

" But I think now I recall that it may have been me asking for using
void pointers here: That way one can pass pointers to other types.
But then you would better cast to void * here, and have the inline
function have properly typed pointers. And it should go without
saying that if you cast (or implicitly convert to void), you should
validate that what your caller handed you is at least the size of a
__uint128_t (and for "ptr" perhaps also, as I think Andrew already
pointed out, that it is suitably aligned). "

If that is the case, what about the changes below?

#define cmpxchg16b(ptr,o,n) \
ASSERT((unsigned long)ptr & 0xF == 0); \
ASSERT(sizeof(*o) == sizeof(__uint128_t)) \
ASSERT(sizeof(*n) == sizeof(__uint128_t)) \
__cmpxchg16b((ptr), (void *)(o), (void *)(n))

Thanks,
Feng
Jan
Wu, Feng
2015-09-06 06:32:51 UTC
Permalink
-----Original Message-----
From: Wu, Feng
Sent: Sunday, September 06, 2015 2:07 PM
To: Jan Beulich
Subject: RE: [PATCH v6 02/18] Add cmpxchg16b support for x86-64
-----Original Message-----
Sent: Friday, September 04, 2015 10:23 PM
To: Wu, Feng
Subject: Re: [PATCH v6 02/18] Add cmpxchg16b support for x86-64
Post by Feng Wu
This patch adds cmpxchg16b support for x86-64, so software
can perform 128-bit atomic write/read.
---
- Fix a typo
- Change back the parameters of __cmpxchg16b() to __uint128_t *
- Remove pointless cast for 'ptr'
- Remove pointless parentheses
- Use A constraint for the output
There are a few comments I had on v4 which neither have been
addressed verbally nor have got incorporated into the patch. See
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg04694.html
Do you mean I missed the following part in the above link?
" But I think now I recall that it may have been me asking for using
void pointers here: That way one can pass pointers to other types.
But then you would better cast to void * here, and have the inline
function have properly typed pointers. And it should go without
saying that if you cast (or implicitly convert to void), you should
validate that what your caller handed you is at least the size of a
__uint128_t (and for "ptr" perhaps also, as I think Andrew already
pointed out, that it is suitably aligned). "
If that is the case, what about the changes below?
#define cmpxchg16b(ptr,o,n) \
    ASSERT((unsigned long)ptr & 0xF == 0); \
    ASSERT(sizeof(*o) == sizeof(__uint128_t)) \
    ASSERT(sizeof(*n) == sizeof(__uint128_t)) \
    __cmpxchg16b((ptr), (void *)(o), (void *)(n))
It seems there is a build error with this change: we cannot add statements
before __cmpxchg16b(), since the macro needs to return a value to the
caller. Any suggestion here?

Thanks,
Feng
Thanks,
Feng
Jan
Jan Beulich
2015-09-07 10:36:36 UTC
Permalink
Post by Wu, Feng
From: Wu, Feng
Sent: Sunday, September 06, 2015 2:07 PM
If that is the case, what about the changes below?
#define cmpxchg16b(ptr,o,n) \
    ASSERT((unsigned long)ptr & 0xF == 0); \
    ASSERT(sizeof(*o) == sizeof(__uint128_t)) \
    ASSERT(sizeof(*n) == sizeof(__uint128_t)) \
    __cmpxchg16b((ptr), (void *)(o), (void *)(n))
It seems there is a build error with this change: we cannot add statements
before __cmpxchg16b(), since the macro needs to return a value to the
caller. Any suggestion here?
You of course first need to make this proper C (missing semicolons)
and convert it to a construct that is just a single statement (if
necessary using gcc's compound expression extension).

Jan
Wu, Feng
2015-09-08 07:37:56 UTC
Permalink
-----Original Message-----
Sent: Monday, September 07, 2015 6:37 PM
To: Wu, Feng
Subject: RE: [PATCH v6 02/18] Add cmpxchg16b support for x86-64
Post by Wu, Feng
From: Wu, Feng
Sent: Sunday, September 06, 2015 2:07 PM
If that is the case, what about the changes below?
#define cmpxchg16b(ptr,o,n) \
    ASSERT((unsigned long)ptr & 0xF == 0); \
    ASSERT(sizeof(*o) == sizeof(__uint128_t)) \
    ASSERT(sizeof(*n) == sizeof(__uint128_t)) \
    __cmpxchg16b((ptr), (void *)(o), (void *)(n))
It seems there is a build error with this change: we cannot add statements
before __cmpxchg16b(), since the macro needs to return a value to the
caller. Any suggestion here?
You of course first need to make this proper C (missing semicolons)
and convert it to a construct that is just a single statement (if
necessary using gcc's compound expression extension).
Thanks for your suggestion. How about this?

#define cmpxchg16b(ptr,o,n) \
( ({ ASSERT(((unsigned long)ptr & 0xF) == 0); }), \
BUILD_BUG_ON(sizeof(*o) != sizeof(__uint128_t)), \
BUILD_BUG_ON(sizeof(*n) != sizeof(__uint128_t)), \
__cmpxchg16b((ptr), (__uint128_t *)(o), (__uint128_t *)(n)) )

Thanks,
Feng
Jan
Jan Beulich
2015-09-08 08:52:11 UTC
Permalink
Post by Wu, Feng
Thanks for your suggestion. How about this?
#define cmpxchg16b(ptr,o,n) \
( ({ ASSERT(((unsigned long)ptr & 0xF) == 0); }), \
BUILD_BUG_ON(sizeof(*o) != sizeof(__uint128_t)), \
BUILD_BUG_ON(sizeof(*n) != sizeof(__uint128_t)), \
__cmpxchg16b((ptr), (__uint128_t *)(o), (__uint128_t *)(n)) )
Yes (properly parenthesized and the ({ }) either removed or extended
to the whole expression).

Jan
Wu, Feng
2015-09-08 08:57:19 UTC
Permalink
-----Original Message-----
Sent: Tuesday, September 08, 2015 4:52 PM
To: Wu, Feng
Subject: RE: [PATCH v6 02/18] Add cmpxchg16b support for x86-64
Post by Wu, Feng
Thanks for your suggestion. How about this?
#define cmpxchg16b(ptr,o,n) \
    ( ({ ASSERT(((unsigned long)ptr & 0xF) == 0); }), \
      BUILD_BUG_ON(sizeof(*o) != sizeof(__uint128_t)), \
      BUILD_BUG_ON(sizeof(*n) != sizeof(__uint128_t)), \
      __cmpxchg16b((ptr), (__uint128_t *)(o), (__uint128_t *)(n)) )
Yes (properly parenthesized and the ({ }) either removed or extended
We need the ({ }) for the ASSERT statement, or we will get a build error.

Thanks,
Feng
to the whole expression).
Jan
Jan Beulich
2015-09-08 09:19:08 UTC
Permalink
Post by Wu, Feng
-----Original Message-----
Sent: Tuesday, September 08, 2015 4:52 PM
To: Wu, Feng
Subject: RE: [PATCH v6 02/18] Add cmpxchg16b support for x86-64
Post by Wu, Feng
Thanks for your suggestion. How about this?
#define cmpxchg16b(ptr,o,n) \
    ( ({ ASSERT(((unsigned long)ptr & 0xF) == 0); }), \
      BUILD_BUG_ON(sizeof(*o) != sizeof(__uint128_t)), \
      BUILD_BUG_ON(sizeof(*n) != sizeof(__uint128_t)), \
      __cmpxchg16b((ptr), (__uint128_t *)(o), (__uint128_t *)(n)) )
Yes (properly parenthesized and the ({ }) either removed or extended
We need the ({ }) for the ASSERT statement, or we will get a build error.
And note that I didn't unconditionally say drop it, I provided the
alternative of making it cover the entire expression (at once allowing
you to convert the comma expressions to more conventional individual
statements).

Jan
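The alternative Jan describes, read literally, would be something like the following (a sketch only, not wording taken from a posted patch): the whole expression lives inside one ({ }) block, so the checks become ordinary statements and the value of the final call is what the macro yields:

    #define cmpxchg16b(ptr, o, n) ({                                      \
        ASSERT(!((unsigned long)(ptr) & 0xf));                            \
        BUILD_BUG_ON(sizeof(*(o)) != sizeof(__uint128_t));                \
        BUILD_BUG_ON(sizeof(*(n)) != sizeof(__uint128_t));                \
        __cmpxchg16b(ptr, (__uint128_t *)(o), (__uint128_t *)(n));        \
    })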
Wu, Feng
2015-09-08 09:30:34 UTC
Permalink
-----Original Message-----
Sent: Tuesday, September 08, 2015 5:19 PM
To: Wu, Feng
Subject: RE: [PATCH v6 02/18] Add cmpxchg16b support for x86-64
Post by Wu, Feng
-----Original Message-----
Sent: Tuesday, September 08, 2015 4:52 PM
To: Wu, Feng
Subject: RE: [PATCH v6 02/18] Add cmpxchg16b support for x86-64
Post by Wu, Feng
Thanks for your suggestion. How about this?
#define cmpxchg16b(ptr,o,n) \
    ( ({ ASSERT(((unsigned long)ptr & 0xF) == 0); }), \
      BUILD_BUG_ON(sizeof(*o) != sizeof(__uint128_t)), \
      BUILD_BUG_ON(sizeof(*n) != sizeof(__uint128_t)), \
      __cmpxchg16b((ptr), (__uint128_t *)(o), (__uint128_t *)(n)) )
Yes (properly parenthesized and the ({ }) either removed or extended
We need the ({ }) for the ASSERT statement, or we will get a build error.
And note that I didn't unconditionally say drop it, I provided the
alternative of making it cover the entire expression (at once allowing
you to convert the comma expressions to more conventional individual
statements).
This is not a big issue. If you don't like the above solution, please be explicit
and just type it out in code; that saves effort for both of us. Thanks!

Thanks,
Feng
Jan
Jan Beulich
2015-09-07 10:34:41 UTC
Permalink
Post by Wu, Feng
Sent: Friday, September 04, 2015 10:23 PM
Post by Feng Wu
---
- Fix a typo
- Change back the parameters of __cmpxchg16b() to __uint128_t *
- Remove pointless cast for 'ptr'
- Remove pointless parentheses
- Use A constraint for the output
There are a few comments I had on v4 which neither have been
addressed verbally nor have got incorporated into the patch. See
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg04694.html
Do you mean I missed the following part in the above link?
" But I think now I recall that it may have been me asking for using
void pointers here: That way one can pass pointers to other types.
But then you would better cast to void * here, and have the inline
function have properly typed pointers. And it should go without
saying that if you cast (or implicitly convert to void), you should
validate that what your caller handed you is at least the size of a
__uint128_t (and for "ptr" perhaps also, as I think Andrew already
pointed out, that it is suitably aligned). "
If that is the case, what about the changes below?
#define cmpxchg16b(ptr,o,n) \
ASSERT((unsigned long)ptr & 0xF == 0); \
ASSERT(sizeof(*o) == sizeof(__uint128_t)) \
ASSERT(sizeof(*n) == sizeof(__uint128_t)) \
__cmpxchg16b((ptr), (void *)(o), (void *)(n))
Along those lines, but with BUILD_BUG_ON() and properly
parenthesized please.

Jan
Jan Beulich
2015-09-07 10:39:36 UTC
Permalink
Post by Wu, Feng
Sent: Friday, September 04, 2015 10:23 PM
Post by Feng Wu
---
- Fix a typo
- Change back the parameters of __cmpxchg16b() to __uint128_t *
- Remove pointless cast for 'ptr'
- Remove pointless parentheses
- Use A constraint for the output
There are a few comments I had on v4 which neither have been
addressed verbally nor have got incorporated into the patch. See
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg04694.html
Do you mean I missed the following part in the above link?
" But I think now I recall that it may have been me asking for using
void pointers here: That way one can pass pointers to other types.
But then you would better cast to void * here, and have the inline
function have properly typed pointers. And it should go without
saying that if you cast (or implicitly convert to void), you should
validate that what your caller handed you is at least the size of a
__uint128_t (and for "ptr" perhaps also, as I think Andrew already
pointed out, that it is suitably aligned). "
And for the record - I was referring to more than just that (in
particular the need for a memory clobber). Please, prior to asking
back, check what parts of the comments you actually did take care
of, and deal with _everything_ you didn't.

Jan
Jan Beulich
2015-09-04 15:12:06 UTC
Permalink
Post by Feng Wu
--- a/xen/include/asm-x86/x86_64/system.h
+++ b/xen/include/asm-x86/x86_64/system.h
@@ -6,6 +6,34 @@
(unsigned long)(n),sizeof(*(ptr))))
/*
+ * Atomic 16 bytes compare and exchange. Compare OLD with MEM, if
+ * identical, store NEW in MEM. Return the initial value in MEM.
+ * Success is indicated by comparing RETURN with OLD.
+ *
+ * This function can only be called when cpu_has_cx16 is true.
+ */
+
+static always_inline __uint128_t __cmpxchg16b(
+ volatile void *ptr, __uint128_t *old, __uint128_t *new)
Noted just now - the latter two should both be const.

Jan
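I.e. the prototype would become something like (sketch):

    static always_inline __uint128_t __cmpxchg16b(
        volatile void *ptr, const __uint128_t *old, const __uint128_t *new);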
Feng Wu
2015-08-25 01:57:42 UTC
Permalink
VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

This patch adds variable 'iommu_intpost' to control whether enable VT-d
posted-interrupt or not in the generic IOMMU code.

CC: Jan Beulich <***@suse.com>
CC: Kevin Tian <***@intel.com>
Signed-off-by: Feng Wu <***@intel.com>
Reviewed-by: Kevin Tian <***@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <***@oracle.com>
---
v5:
- Remove the "if no intremap then no intpost" logic in parse_iommu_param(), which
can be covered in iommu_setup()

v3:
- Remove pointless initializer for 'iommu_intpost'.
- Some adjustment for "if no intremap then no intpost" logic.
* For parse_iommu_param(), move it to the end of the function,
so we don't need to add the some logic when introduing the
new kernel parameter 'intpost' in later patch.
* Add this logic in iommu_setup() after iommu_hardware_setup()
is called.

xen/drivers/passthrough/iommu.c | 13 ++++++++++++-
xen/include/xen/iommu.h | 2 +-
2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index fc7831e..36d5cc0 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -51,6 +51,14 @@ bool_t __read_mostly iommu_passthrough;
bool_t __read_mostly iommu_snoop = 1;
bool_t __read_mostly iommu_qinval = 1;
bool_t __read_mostly iommu_intremap = 1;
+
+/*
+ * In the current implementation of VT-d posted interrupts, in some extreme
+ * cases, the per cpu list which saves the blocked vCPU will be very long,
+ * and this will affect the interrupt latency, so let this feature off by
+ * default until we find a good solution to resolve it.
+ */
+bool_t __read_mostly iommu_intpost;
bool_t __read_mostly iommu_hap_pt_share = 1;
bool_t __read_mostly iommu_debug;
bool_t __read_mostly amd_iommu_perdev_intremap = 1;
@@ -307,6 +315,9 @@ int __init iommu_setup(void)
panic("Couldn't enable %s and iommu=required/force",
!iommu_enabled ? "IOMMU" : "Interrupt Remapping");

+ if ( !iommu_intremap )
+ iommu_intpost = 0;
+
if ( !iommu_enabled )
{
iommu_snoop = 0;
@@ -374,7 +385,7 @@ void iommu_crash_shutdown(void)
const struct iommu_ops *ops = iommu_get_ops();
if ( iommu_enabled )
ops->crash_shutdown();
- iommu_enabled = iommu_intremap = 0;
+ iommu_enabled = iommu_intremap = iommu_intpost = 0;
}

int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8f3a20e..1f5d04a 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -30,7 +30,7 @@
extern bool_t iommu_enable, iommu_enabled;
extern bool_t force_iommu, iommu_verbose;
extern bool_t iommu_workaround_bios_bug, iommu_igfx, iommu_passthrough;
-extern bool_t iommu_snoop, iommu_qinval, iommu_intremap;
+extern bool_t iommu_snoop, iommu_qinval, iommu_intremap, iommu_intpost;
extern bool_t iommu_hap_pt_share;
extern bool_t iommu_debug;
extern bool_t amd_iommu_perdev_intremap;
--
2.1.0
Jan Beulich
2015-09-04 14:26:03 UTC
Permalink
Post by Feng Wu
VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.
This patch adds variable 'iommu_intpost' to control whether enable VT-d
posted-interrupt or not in the generic IOMMU code.
Acked-by: Jan Beulich <***@suse.com>
Feng Wu
2015-08-25 01:57:54 UTC
Permalink
When a vCPU is running in root mode and a notification event
has been injected to it, we need to set VCPU_KICK_SOFTIRQ for
the current CPU, so the pending interrupt in PIRR will be
synced to vIRR before VM-Exit in time.

CC: Kevin Tian <***@intel.com>
CC: Keir Fraser <***@xen.org>
CC: Jan Beulich <***@suse.com>
CC: Andrew Cooper <***@citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
Acked-by: Kevin Tian <***@intel.com>
---
v6:
- Ack the interrupt in the beginning of pi_notification_interrupt()

v4:
- Coding style.

v3:
- Make pi_notification_interrupt() static

xen/arch/x86/hvm/vmx/vmx.c | 48 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 9cde9a4..5167fae 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2035,6 +2035,52 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
this_cpu(irq_count)++;
}

+/* Handle VT-d posted-interrupt when VCPU is running. */
+static void pi_notification_interrupt(struct cpu_user_regs *regs)
+{
+ ack_APIC_irq();
+
+ /*
+ * We get here when a vCPU is running in root-mode (such as via hypercall,
+ * or any other reasons which can result in VM-Exit), and before vCPU is
+ * back to non-root, external interrupts from an assigned device happen
+ * and a notification event is delivered to this logical CPU.
+ *
+ * we need to set VCPU_KICK_SOFTIRQ for the current cpu, just like
+ * __vmx_deliver_posted_interrupt(). So the pending interrupt in PIRR will
+ * be synced to vIRR before VM-Exit in time.
+ *
+ * Please refer to the following code fragments from
+ * xen/arch/x86/hvm/vmx/entry.S:
+ *
+ * .Lvmx_do_vmentry
+ *
+ * ......
+ * point 1
+ *
+ * cmp %ecx,(%rdx,%rax,1)
+ * jnz .Lvmx_process_softirqs
+ *
+ * ......
+ *
+ * je .Lvmx_launch
+ *
+ * ......
+ *
+ * .Lvmx_process_softirqs:
+ * sti
+ * call do_softirq
+ * jmp .Lvmx_do_vmentry
+ *
+ * If VT-d engine issues a notification event at point 1 above, it cannot
+ * be delivered to the guest during this VM-entry without raising the
+ * softirq in this notification handler.
+ */
+ raise_softirq(VCPU_KICK_SOFTIRQ);
+
+ this_cpu(irq_count)++;
+}
+
const struct hvm_function_table * __init start_vmx(void)
{
set_in_cr4(X86_CR4_VMXE);
@@ -2073,7 +2119,7 @@ const struct hvm_function_table * __init start_vmx(void)

if ( cpu_has_vmx_posted_intr_processing )
{
- alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+ alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);

if ( iommu_intpost )
alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
--
2.1.0
Jan Beulich
2015-09-07 12:10:21 UTC
Permalink
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2035,6 +2035,52 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
this_cpu(irq_count)++;
}
+/* Handle VT-d posted-interrupt when VCPU is running. */
+static void pi_notification_interrupt(struct cpu_user_regs *regs)
+{
+ ack_APIC_irq();
+
+ /*
+ * We get here when a vCPU is running in root-mode (such as via hypercall,
+ * or any other reasons which can result in VM-Exit), and before vCPU is
+ * back to non-root, external interrupts from an assigned device happen
+ * and a notification event is delivered to this logical CPU.
+ *
+ * we need to set VCPU_KICK_SOFTIRQ for the current cpu, just like
+ * __vmx_deliver_posted_interrupt(). So the pending interrupt in PIRR will
+ * be synced to vIRR before VM-Exit in time.
+ *
+ * Please refer to the following code fragments from
+ *
+ * .Lvmx_do_vmentry
+ *
+ * ......
+ * point 1
+ *
+ * cmp %ecx,(%rdx,%rax,1)
I think retaining the "cli" right above this is quite critical for
understanding why code past this check isn't susceptible to the
issue anymore.
Post by Feng Wu
+ * jnz .Lvmx_process_softirqs
+ *
+ * ......
+ *
+ * je .Lvmx_launch
+ *
+ * ......
+ *
+ * sti
+ * call do_softirq
+ * jmp .Lvmx_do_vmentry
+ *
+ * If VT-d engine issues a notification event at point 1 above, it cannot
+ * be delivered to the guest during this VM-entry without raising the
+ * softirq in this notification handler.
+ */
+ raise_softirq(VCPU_KICK_SOFTIRQ);
+
+ this_cpu(irq_count)++;
+}
+
const struct hvm_function_table * __init start_vmx(void)
{
set_in_cr4(X86_CR4_VMXE);
@@ -2073,7 +2119,7 @@ const struct hvm_function_table * __init start_vmx(void)
if ( cpu_has_vmx_posted_intr_processing )
{
- alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+ alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);
if ( iommu_intpost )
alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
Considering that you do this setup independent of iommu_intpost,
won't this (namely, but not only) for the !iommu_intpost case result
in a whole lot of useless softirqs being raised? IOW - shouldn't this
setup be conditional, and shouldn't the handler also raise the
softirq only conditionally (to limit their amount as much as
possible)?

Yang, in this context: Why does __vmx_deliver_posted_interrupt()
not use cpu_raise_softirq(), instead kind of open coding it (see your
d7dafa375b ["VMX: Add posted interrupt supporting"])?

Jan
Zhang, Yang Z
2015-09-07 13:00:58 UTC
Permalink
Post by Feng Wu
Post by Feng Wu
+ * jnz .Lvmx_process_softirqs
+ *
+ * ......
+ *
+ * je .Lvmx_launch
+ *
+ * ......
+ *
+ * sti
+ * call do_softirq
+ * jmp .Lvmx_do_vmentry
+ *
+ * If VT-d engine issues a notification event at point 1 above, it cannot
+ * be delivered to the guest during this VM-entry without raising the
+ * softirq in this notification handler.
+ */
+ raise_softirq(VCPU_KICK_SOFTIRQ);
+
+ this_cpu(irq_count)++;
+}
+
const struct hvm_function_table * __init start_vmx(void)
{
set_in_cr4(X86_CR4_VMXE);
@@ -2073,7 +2119,7 @@ const struct hvm_function_table * __init start_vmx(void)
if ( cpu_has_vmx_posted_intr_processing )
{
- alloc_direct_apic_vector(&posted_intr_vector,
event_check_interrupt); +
alloc_direct_apic_vector(&posted_intr_vector,
pi_notification_interrupt);
if ( iommu_intpost )
alloc_direct_apic_vector(&pi_wakeup_vector,
pi_wakeup_interrupt);
Considering that you do this setup independent of iommu_intpost, won't
this (namely, but not only) for the !iommu_intpost case result in a whole
lot of useless softirqs being raised? IOW - shouldn't this setup be
conditional, and shouldn't the handler also raise the softirq only
conditionally (to limit their amount as much as possible)?
Yang, in this context: Why does __vmx_deliver_posted_interrupt()
not use cpu_raise_softirq(), instead kind of open coding it (see your
d7dafa375b ["VMX: Add posted interrupt supporting"])?
Sorry, I don't have the context. What do you mean by using cpu_raise_softirq() in __vmx_deliver_posted_interrupt()?

Best regards,
Yang
Jan Beulich
2015-09-07 13:12:35 UTC
Permalink
Post by Zhang, Yang Z
Post by Jan Beulich
Yang, in this context: Why does __vmx_deliver_posted_interrupt()
not use cpu_raise_softirq(), instead kind of open coding it (see your
d7dafa375b ["VMX: Add posted interrupt supporting"])?
Sorry, I don't have the context. What do you mean by using
cpu_raise_softirq() in __vmx_deliver_posted_interrupt()?
Why is the function not using that ready-to-use helper? Looking at
it ...
Post by Zhang, Yang Z
static void __vmx_deliver_posted_interrupt(struct vcpu *v)
{
bool_t running = v->is_running;
vcpu_unblock(v);
if ( running && (in_irq() || (v != current)) )
{
unsigned int cpu = v->processor;
if ( !test_and_set_bit(VCPU_KICK_SOFTIRQ, &softirq_pending(cpu))
... this line as well as ...
Post by Zhang, Yang Z
&& (cpu != smp_processor_id()) )
send_IPI_mask(cpumask_of(cpu), posted_intr_vector);
... this one ...
Post by Zhang, Yang Z
}
}
... pretty certainly don't belong in vmx.c, or the apparent open
coding of cpu_raise_softirq() would require a justifying comment.

Jan
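For reference, the generic helper being referred to does roughly the following (a simplified sketch of the usual semantics of Xen's cpu_raise_softirq(); the VMX code repeats this pattern but sends posted_intr_vector rather than the event-check IPI, so that a vCPU running in non-root mode has the interrupt posted directly by the CPU):

    void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
    {
        /* Set the pending bit; kick a remote CPU only if the bit was not already set. */
        if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
             (cpu != smp_processor_id()) )
            smp_send_event_check_cpu(cpu);
    }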
Zhang, Yang Z
2015-09-08 01:38:50 UTC
Permalink
Post by Jan Beulich
Post by Zhang, Yang Z
Post by Jan Beulich
Yang, in this context: Why does __vmx_deliver_posted_interrupt()
not use cpu_raise_softirq(), instead kind of open coding it (see your
d7dafa375b ["VMX: Add posted interrupt supporting"])?
Sorry, I am not in the context. What do you mean of using
cpu_raise_softirq() in __vmx_deliver_posted_interrupt()?
Why is the function not using that ready-to-use helper? Looking at
it ...
Post by Zhang, Yang Z
static void __vmx_deliver_posted_interrupt(struct vcpu *v)
{
bool_t running = v->is_running;
vcpu_unblock(v);
if ( running && (in_irq() || (v != current)) )
{
unsigned int cpu = v->processor;
if ( !test_and_set_bit(VCPU_KICK_SOFTIRQ, &softirq_pending(cpu))
... this line as well as ...
Post by Zhang, Yang Z
&& (cpu != smp_processor_id()) )
send_IPI_mask(cpumask_of(cpu), posted_intr_vector);
... this one ...
Post by Zhang, Yang Z
}
}
... pretty certainly don't belong in vmx.c, or the apparent open
coding of cpu_raise_softirq() would require a justifying comment.
I still don't see how to use cpu_raise_softirq(), since posted_intr_vector is not a softirq.

Best regards,
Yang
Jan Beulich
2015-09-08 08:57:40 UTC
Permalink
Post by Zhang, Yang Z
Post by Jan Beulich
Post by Zhang, Yang Z
Post by Jan Beulich
Yang, in this context: Why does __vmx_deliver_posted_interrupt()
not use cpu_raise_softirq(), instead kind of open coding it (see your
d7dafa375b ["VMX: Add posted interrupt supporting"])?
Sorry, I am not in the context. What do you mean of using
cpu_raise_softirq() in __vmx_deliver_posted_interrupt()?
Why is the function not using that ready to use helper? Looking at
it ...
Post by Zhang, Yang Z
static void __vmx_deliver_posted_interrupt(struct vcpu *v)
{
bool_t running = v->is_running;
vcpu_unblock(v);
if ( running && (in_irq() || (v != current)) )
{
unsigned int cpu = v->processor;
if ( !test_and_set_bit(VCPU_KICK_SOFTIRQ, &softirq_pending(cpu))
... this line as well as ...
Post by Zhang, Yang Z
&& (cpu != smp_processor_id()) )
send_IPI_mask(cpumask_of(cpu), posted_intr_vector);
... this one ...
Post by Zhang, Yang Z
}
}
... pretty certainly don't belong into vmx.c, or the apparent open
coding of cpu_raise_softirq() would require a justifying comment.
I still don't see how to use cpu_raise_softirq(), since
posted_intr_vector is not a softirq.
I realize that because of the special in-processor treatment of that
vector it may not be possible to eliminate this open coding. That's why
I said above "would require a justifying comment". After all, namely
with the handler for the vector being the same as for normal event
check interrupts, it is far from obvious why the open coding can't be
avoided.

Jan
Wu, Feng
2015-09-08 05:18:58 UTC
Permalink
-----Original Message-----
Sent: Monday, September 07, 2015 8:10 PM
To: Wu, Feng; Zhang, Yang Z
Subject: Re: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
const struct hvm_function_table * __init start_vmx(void)
{
set_in_cr4(X86_CR4_VMXE);
@@ -2073,7 +2119,7 @@ const struct hvm_function_table * __init
start_vmx(void)
Post by Feng Wu
if ( cpu_has_vmx_posted_intr_processing )
{
- alloc_direct_apic_vector(&posted_intr_vector,
event_check_interrupt);
Post by Feng Wu
+ alloc_direct_apic_vector(&posted_intr_vector,
pi_notification_interrupt);
Post by Feng Wu
if ( iommu_intpost )
alloc_direct_apic_vector(&pi_wakeup_vector,
pi_wakeup_interrupt);
Considering that you do this setup independent of iommu_intpost,
won't this (namely, but not only) for the !iommu_inpost case result
in a whole lot of useless softirqs to be raised?
I don't think lots of useless softirqs will be raised in the !iommu_intpost
case, since we can get to pi_notification_interrupt() only when someone
has called __vmx_deliver_posted_interrupt(), in which the VCPU_KICK_SOFTIRQ
bit was already set.
IOW - shouldn't this setup be conditional,
We cannot set up pi_notification_interrupt() conditionally, since it is
used in all cases, both non-VT-d-PI and VT-d PI.
and shouldn't the handler also only conditionally raise the softirq
(to as much as possible limit their amount)?
As mentioned above, no extra softirqs are raised. If you think it is
necessary, I can add a condition here as below, which I really think
is useless, but I am fine with adding it if you really think it is a must.

if ( iommu_intpost )
raise_softirq(VCPU_KICK_SOFTIRQ);

Thanks,
Feng
Jan Beulich
2015-09-08 09:13:39 UTC
Permalink
Post by Wu, Feng
-----Original Message-----
Sent: Monday, September 07, 2015 8:10 PM
To: Wu, Feng; Zhang, Yang Z
Subject: Re: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
const struct hvm_function_table * __init start_vmx(void)
{
set_in_cr4(X86_CR4_VMXE);
@@ -2073,7 +2119,7 @@ const struct hvm_function_table * __init
start_vmx(void)
Post by Feng Wu
if ( cpu_has_vmx_posted_intr_processing )
{
- alloc_direct_apic_vector(&posted_intr_vector,
event_check_interrupt);
Post by Feng Wu
+ alloc_direct_apic_vector(&posted_intr_vector,
pi_notification_interrupt);
Post by Feng Wu
if ( iommu_intpost )
alloc_direct_apic_vector(&pi_wakeup_vector,
pi_wakeup_interrupt);
Considering that you do this setup independent of iommu_intpost,
won't this (namely, but not only) for the !iommu_inpost case result
in a whole lot of useless softirqs to be raised?
I don't think lots of useless softirqs will be raised in the !iommu_intpost
case, since we can get to pi_notification_interrupt() only when someone
has called __vmx_deliver_posted_interrupt(), in which the VCPU_KICK_SOFTIRQ
bit was already set.
Now that you say it, this looks even more odd: Why would you need
to raise that softirq if the only way to come here is via the triggering
in __vmx_deliver_posted_interrupt() (which already raised that
softirq)? I suppose I must be missing something...
Post by Wu, Feng
IOW - shouldn't this setup be conditional,
We cannot setup pi_notification_interrupt() conditional, since it is
for all cases, non-vtd-pi and vt-d pi.
But you could continue to use event_check_interrupt in the
!iommu_intpost case.
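That suggestion would amount to something like the following in start_vmx() (a sketch only, not a posted patch):

    if ( cpu_has_vmx_posted_intr_processing )
    {
        if ( iommu_intpost )
        {
            alloc_direct_apic_vector(&posted_intr_vector,
                                     pi_notification_interrupt);
            alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
        }
        else
            alloc_direct_apic_vector(&posted_intr_vector,
                                     event_check_interrupt);
    }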
Post by Wu, Feng
and shouldn't the handler also only conditionally raise the softirq
(to as much as possible limit their amount)?
As mentioned above, there are not extra softirq raised. If you think
It is necessary, I can add a condition here as below, which I really think
is useless, but I am fine to add it if you really think it is a must.
if ( iommu_intpost )
raise_softirq(VCPU_KICK_SOFTIRQ);
See above - if this doesn't raise any extra softirqs, then why do
you call raise_softirq() in the first place?

Jan
Wu, Feng
2015-09-08 09:23:32 UTC
Permalink
-----Original Message-----
Sent: Tuesday, September 08, 2015 5:14 PM
To: Wu, Feng
Fraser
Subject: RE: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
Post by Wu, Feng
-----Original Message-----
Sent: Monday, September 07, 2015 8:10 PM
To: Wu, Feng; Zhang, Yang Z
Subject: Re: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
const struct hvm_function_table * __init start_vmx(void)
{
set_in_cr4(X86_CR4_VMXE);
@@ -2073,7 +2119,7 @@ const struct hvm_function_table * __init
start_vmx(void)
Post by Feng Wu
if ( cpu_has_vmx_posted_intr_processing )
{
- alloc_direct_apic_vector(&posted_intr_vector,
event_check_interrupt);
Post by Feng Wu
+ alloc_direct_apic_vector(&posted_intr_vector,
pi_notification_interrupt);
Post by Feng Wu
if ( iommu_intpost )
alloc_direct_apic_vector(&pi_wakeup_vector,
pi_wakeup_interrupt);
Considering that you do this setup independent of iommu_intpost,
won't this (namely, but not only) for the !iommu_intpost case result
in a whole lot of useless softirqs to be raised?
I don't think lots of useless softirqs will be raised in !iommu_intpost
case, since we can get pi_notification_interrupt() only when someone
called __vmx_deliver_posted_interrupt() in which, the VCPU_KICK_SOFTIRQ
bit was set.
Now that you say it, this looks even more odd: Why would you need
to raise that softirq if the only way to come here is via the triggering
in __vmx_deliver_posted_interrupt() (which already raised that
softirq)? I suppose I must be missing something...
Before VT-d PI, __vmx_deliver_posted_interrupt() was the only way
to trigger an interrupt with vector 'posted_intr_vector', but after
introducing VT-d PI, the VT-d hardware can issue an interrupt with
that vector as well. __vmx_deliver_posted_interrupt() sets the
softirq for the reason described in the comments of
pi_notification_interrupt(); I need to do the same thing when the
VT-d hardware issues the interrupt, so pi_notification_interrupt()
is the proper place to do it.

Thanks,
Feng
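[For reference, a minimal sketch of the notification handler this sub-thread
is about, assuming the helper names used elsewhere in the series; the real
patch body may differ in detail.]

/*
 * Sketch only: handler installed on posted_intr_vector.  Raising
 * VCPU_KICK_SOFTIRQ forces the softirq path on VM entry, so interrupts
 * posted by VT-d hardware while the vCPU is in root mode are still
 * evaluated before resuming the guest.
 */
static void pi_notification_interrupt(struct cpu_user_regs *regs)
{
    ack_APIC_irq();
    this_cpu(irq_count)++;
    raise_softirq(VCPU_KICK_SOFTIRQ);
}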
Jan Beulich
2015-09-08 09:31:10 UTC
Permalink
Post by Wu, Feng
-----Original Message-----
Sent: Tuesday, September 08, 2015 5:14 PM
To: Wu, Feng
Fraser
Subject: RE: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
Post by Wu, Feng
-----Original Message-----
Sent: Monday, September 07, 2015 8:10 PM
To: Wu, Feng; Zhang, Yang Z
Subject: Re: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
const struct hvm_function_table * __init start_vmx(void)
{
set_in_cr4(X86_CR4_VMXE);
@@ -2073,7 +2119,7 @@ const struct hvm_function_table * __init
start_vmx(void)
Post by Feng Wu
if ( cpu_has_vmx_posted_intr_processing )
{
- alloc_direct_apic_vector(&posted_intr_vector,
event_check_interrupt);
Post by Feng Wu
+ alloc_direct_apic_vector(&posted_intr_vector,
pi_notification_interrupt);
Post by Feng Wu
if ( iommu_intpost )
alloc_direct_apic_vector(&pi_wakeup_vector,
pi_wakeup_interrupt);
Considering that you do this setup independent of iommu_intpost,
won't this (namely, but not only) for the !iommu_intpost case result
in a whole lot of useless softirqs to be raised?
I don't think lots of useless softirqs will be raised in !iommu_intpost
case, since we can get pi_notification_interrupt() only when someone
called __vmx_deliver_posted_interrupt() in which, the VCPU_KICK_SOFTIRQ
bit was set.
Now that you say it, this looks even more odd: Why would you need
to raise that softirq if the only way to come here is via the triggering
in __vmx_deliver_posted_interrupt() (which already raised that
softirq)? I suppose I must be missing something...
Before VT-d PI, __vmx_deliver_posted_interrupt() is the only way
to trigger interrupt with vector ' posted_intr_vector ', but after
introducing VT-d PI, VT-d hardware can issue the interrupt with
that vector as well. In __vmx_deliver_posted_interrupt(), it set
the softirqs, the reason of which is described in the comments
of pi_notification_interrupt(), however, I need do the same thing
when VT-d hardware issue the interrupt, so pi_notification_interrupt()
is the proper place to do it.
But again - my main concern is about the !iommu_intpost case: What
good does the raising of the softirq there? (As a general remark -
please, when you add code to support a new feature, don't just
think about the case where the new feature is available in hardware,
but also about the case where it's not. While over time the set of
systems lacking the feature will likely decrease, initially it may be the
vast majority of systems Xen gets run on which would get penalized.)

Jan
Wu, Feng
2015-09-08 09:36:39 UTC
Permalink
-----Original Message-----
Sent: Tuesday, September 08, 2015 5:31 PM
To: Wu, Feng
Fraser
Subject: RE: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
Post by Wu, Feng
Post by Jan Beulich
Now that you say it, this looks even more odd: Why would you need
to raise that softirq if the only way to come here is via the triggering
in __vmx_deliver_posted_interrupt() (which already raised that
softirq)? I suppose I must be missing something...
Before VT-d PI, __vmx_deliver_posted_interrupt() is the only way
to trigger interrupt with vector ' posted_intr_vector ', but after
introducing VT-d PI, VT-d hardware can issue the interrupt with
that vector as well. In __vmx_deliver_posted_interrupt(), it set
the softirqs, the reason of which is described in the comments
of pi_notification_interrupt(), however, I need do the same thing
when VT-d hardware issue the interrupt, so pi_notification_interrupt()
is the proper place to do it.
But again - my main concern is about the !iommu_intpost case: What
good does the raising of the softirq there? (As a general remark -
please, when you add code to support a new feature, don't just
think about the case where the new feature is available in hardware,
but also about the case where it's not. While over time the set of
systems lacking the feature will likely decrease, initially it may be the
vast majority of systems Xen gets run on which would get penalized.)
No problem. Two solutions:
#1, Register pi_notification_interrupt when iommu_intpost and still
use event_check_interrupt in the !iommu_intpost case.
#1, Use pi_notification_interrupt() for both iommu_intpost and
!iommu_intpost, add the following check it:
if ( iommu_intpost )
raise_softirq(VCPU_KICK_SOFTIRQ);

Which one do you prefer?

Thanks,
Feng
Jan
Jan Beulich
2015-09-08 10:13:01 UTC
Permalink
Post by Wu, Feng
-----Original Message-----
Sent: Tuesday, September 08, 2015 5:31 PM
To: Wu, Feng
Fraser
Subject: RE: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
Post by Wu, Feng
Post by Jan Beulich
Now that you say it, this looks even more odd: Why would you need
to raise that softirq if the only way to come here is via the triggering
in __vmx_deliver_posted_interrupt() (which already raised that
softirq)? I suppose I must be missing something...
Before VT-d PI, __vmx_deliver_posted_interrupt() is the only way
to trigger interrupt with vector ' posted_intr_vector ', but after
introducing VT-d PI, VT-d hardware can issue the interrupt with
that vector as well. In __vmx_deliver_posted_interrupt(), it set
the softirqs, the reason of which is described in the comments
of pi_notification_interrupt(), however, I need do the same thing
when VT-d hardware issue the interrupt, so pi_notification_interrupt()
is the proper place to do it.
But again - my main concern is about the !iommu_intpost case: What
good does the raising of the softirq there? (As a general remark -
please, when you add code to support a new feature, don't just
think about the case where the new feature is available in hardware,
but also about the case where it's not. While over time the set of
systems lacking the feature will likely decrease, initially it may be the
vast majority of systems Xen gets run on which would get penalized.)
#1, Register pi_notification_interrupt when iommu_intpost and still
use event_check_interrupt in the !iommu_intpost case.
#1, Use pi_notification_interrupt() for both iommu_intpost and
if ( iommu_intpost )
raise_softirq(VCPU_KICK_SOFTIRQ);
Which one do you prefer?
#1 ;-) (the earlier one, to avoid you guessing)

Jan
Wu, Feng
2015-09-08 10:15:39 UTC
Permalink
-----Original Message-----
Sent: Tuesday, September 08, 2015 6:13 PM
To: Wu, Feng
Fraser
Subject: RE: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
Post by Wu, Feng
-----Original Message-----
Sent: Tuesday, September 08, 2015 5:31 PM
To: Wu, Feng
Keir
Post by Wu, Feng
Fraser
Subject: RE: [PATCH v6 15/18] vmx: Properly handle notification event when
vCPU is running
#1, Register pi_notification_interrupt when iommu_intpost and still
use event_check_interrupt in the !iommu_intpost case.
#1, Use pi_notification_interrupt() for both iommu_intpost and
if ( iommu_intpost )
raise_softirq(VCPU_KICK_SOFTIRQ);
Which one do you prefer?
#1 ;-) (the earlier one, to avoid you guessing)
Hmm, sorry for the typo. :)

Thanks,
Feng
Jan
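[Option #1 as preferred above would amount to roughly the following in
start_vmx(); a sketch only, keeping event_check_interrupt as the fallback
handler when VT-d posted interrupts are not in use.]

    /* Sketch of option #1: only install the softirq-raising handler when
     * VT-d posted interrupts are in use; otherwise keep the cheaper
     * pre-existing handler. */
    if ( cpu_has_vmx_posted_intr_processing )
    {
        if ( iommu_intpost )
        {
            alloc_direct_apic_vector(&posted_intr_vector,
                                     pi_notification_interrupt);
            alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
        }
        else
            alloc_direct_apic_vector(&posted_intr_vector,
                                     event_check_interrupt);
    }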
Feng Wu
2015-08-25 01:57:50 UTC
Permalink
This patch adds an API which is used to update the IRTE
for posted-interrupt when guest changes MSI/MSI-X information.

CC: Yang Zhang <***@intel.com>
CC: Kevin Tian <***@intel.com>
CC: Keir Fraser <***@xen.org>
CC: Jan Beulich <***@suse.com>
CC: Andrew Cooper <***@citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
Acked-by: Kevin Tian <***@intel.com>
---
v6:
- In some error cases, the desc->lock will be unlocked twice, fix it.
- Coding style fix.
- Add some comments.

v5:
- Make some function parameters const
- Call "spin_unlock_irq(&desc->lock);" a little eariler
- Add "ASSERT(spin_is_locked(&pcidevs_lock))"
- -EBADSLT -> -ENODEV, EBADSLT was removed in the latest Xen

v4:
- Don't inline setup_posted_irte()
- const struct pi_desc *pi_desc for setup_posted_irte()
- return -EINVAL when pirq_spin_lock_irq_desc() fails.
- Make some variables const
- Release irq desc lock earlier in pi_update_irte()
- Remove the pointless do-while() loop when doing cmpxchg16b()

v3:
- Remove "adding PDA_MASK()" when updating 'pda_l' and 'pda_h' for IRTE.
- Change the return type of pi_update_irte() to int.
- Remove some pointless printk message in pi_update_irte().
- Use structure assignment instead of memcpy() for irte copy.

xen/drivers/passthrough/vtd/intremap.c | 107 +++++++++++++++++++++++++++++++++
xen/drivers/passthrough/vtd/iommu.h | 6 ++
xen/include/asm-x86/iommu.h | 2 +
3 files changed, 115 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index e9fffa6..9bb5006 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -899,3 +899,110 @@ void iommu_disable_x2apic_IR(void)
for_each_drhd_unit ( drhd )
disable_qinval(drhd->iommu);
}
+
+static void setup_posted_irte(struct iremap_entry *new_ire,
+ const struct pi_desc *pi_desc, const uint8_t gvec)
+{
+ new_ire->post.urg = 0;
+ new_ire->post.vector = gvec;
+ new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
+ new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
+
+ new_ire->post.res_1 = 0;
+ new_ire->post.res_2 = 0;
+ new_ire->post.res_3 = 0;
+ new_ire->post.res_4 = 0;
+ new_ire->post.res_5 = 0;
+
+ new_ire->post.im = 1;
+}
+
+/*
+ * This function is used to update the IRTE for posted-interrupt
+ * when guest changes MSI/MSI-X information.
+ */
+int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
+ const uint8_t gvec)
+{
+ struct irq_desc *desc;
+ const struct msi_desc *msi_desc;
+ int remap_index;
+ int rc = 0;
+ const struct pci_dev *pci_dev;
+ const struct acpi_drhd_unit *drhd;
+ struct iommu *iommu;
+ struct ir_ctrl *ir_ctrl;
+ struct iremap_entry *iremap_entries = NULL, *p = NULL;
+ struct iremap_entry new_ire, old_ire;
+ const struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ __uint128_t ret;
+
+ desc = pirq_spin_lock_irq_desc(pirq, NULL);
+ if ( !desc )
+ return -EINVAL;
+
+ msi_desc = desc->msi_desc;
+ if ( !msi_desc )
+ {
+ rc = -ENODEV;
+ goto unlock_out;
+ }
+
+ pci_dev = msi_desc->dev;
+ if ( !pci_dev )
+ {
+ rc = -ENODEV;
+ goto unlock_out;
+ }
+
+ remap_index = msi_desc->remap_index;
+
+ spin_unlock_irq(&desc->lock);
+
+ ASSERT(spin_is_locked(&pcidevs_lock));
+
+ /*
+ * For performance concern, we will store the 'iommu' pointer in
+ * 'struct msi_desc' in some other place, so we don't need to waste
+ * time searching it here. I will fix this later.
+ */
+ drhd = acpi_find_matched_drhd_unit(pci_dev);
+ if ( !drhd )
+ return -ENODEV;
+
+ iommu = drhd->iommu;
+ ir_ctrl = iommu_ir_ctrl(iommu);
+ if ( !ir_ctrl )
+ return -ENODEV;
+
+ spin_lock_irq(&ir_ctrl->iremap_lock);
+
+ GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
+
+ old_ire = new_ire = *p;
+
+ /* Setup/Update interrupt remapping table entry. */
+ setup_posted_irte(&new_ire, pi_desc, gvec);
+ ret = cmpxchg16b(p, &old_ire, &new_ire);
+
+ /*
+ * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
+ * and the hardware cannot update the IRTE behind us, so the return value
+ * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
+ */
+ ASSERT(ret == *(__uint128_t *)&old_ire);
+
+ iommu_flush_cache_entry(p, sizeof(*p));
+ iommu_flush_iec_index(iommu, 0, remap_index);
+
+ unmap_vtd_domain_page(iremap_entries);
+
+ spin_unlock_irq(&ir_ctrl->iremap_lock);
+
+ return 0;
+
+ unlock_out:
+ spin_unlock_irq(&desc->lock);
+
+ return rc;
+}
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 6fca430..6f196ec 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -322,6 +322,12 @@ struct iremap_entry {
};
};

+/*
+ * The posted-interrupt descriptor address is 64 bits and 64-byte aligned;
+ * only the upper 26 bits of the least significant 32 bits are available.
+ */
+#define PDA_LOW_BIT 26
+
/* Max intr remapping table page order is 8, as max number of IRTEs is 64K */
#define IREMAP_PAGE_ORDER 8

diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index 29203d7..92f0900 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -31,6 +31,8 @@ int iommu_supports_eim(void);
int iommu_enable_x2apic_IR(void);
void iommu_disable_x2apic_IR(void);

+int pi_update_irte(const struct vcpu *v, const struct pirq *pirq, const uint8_t gvec);
+
#endif /* !__ARCH_X86_IOMMU_H__ */
/*
* Local variables:
--
2.1.0
Jan Beulich
2015-09-04 15:11:18 UTC
Permalink
Post by Feng Wu
This patch adds an API which is used to update the IRTE
for posted-interrupt when guest changes MSI/MSI-X information.
---
- In some error cases, the desc->lock will be unlocked twice, fix it.
Bug fixes invalidate prior acks.
Post by Feng Wu
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -899,3 +899,110 @@ void iommu_disable_x2apic_IR(void)
for_each_drhd_unit ( drhd )
disable_qinval(drhd->iommu);
}
+
+static void setup_posted_irte(struct iremap_entry *new_ire,
+ const struct pi_desc *pi_desc, const uint8_t gvec)
+{
+ new_ire->post.urg = 0;
+ new_ire->post.vector = gvec;
+ new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
+ new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
+
+ new_ire->post.res_1 = 0;
+ new_ire->post.res_2 = 0;
+ new_ire->post.res_3 = 0;
+ new_ire->post.res_4 = 0;
+ new_ire->post.res_5 = 0;
+
+ new_ire->post.im = 1;
+}
I think it would be better to just clear out the structure first. This
may then also allow for no longer naming all of the bitfield res_*
member, making it even more obvious that they're reserved. Or
wait - looking at the use below you seem to be updating the
descriptor. Why would you need to zero out reserved fields in
that case?
Post by Feng Wu
+int pi_update_irte(const struct vcpu *v, const struct pirq *pirq,
+ const uint8_t gvec)
+{
+ struct irq_desc *desc;
+ const struct msi_desc *msi_desc;
+ int remap_index;
+ int rc = 0;
+ const struct pci_dev *pci_dev;
+ const struct acpi_drhd_unit *drhd;
+ struct iommu *iommu;
+ struct ir_ctrl *ir_ctrl;
+ struct iremap_entry *iremap_entries = NULL, *p = NULL;
+ struct iremap_entry new_ire, old_ire;
+ const struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+ __uint128_t ret;
+
+ desc = pirq_spin_lock_irq_desc(pirq, NULL);
+ if ( !desc )
+ return -EINVAL;
+
+ msi_desc = desc->msi_desc;
+ if ( !msi_desc )
+ {
+ rc = -ENODEV;
+ goto unlock_out;
+ }
+
+ pci_dev = msi_desc->dev;
+ if ( !pci_dev )
+ {
+ rc = -ENODEV;
+ goto unlock_out;
+ }
+
+ remap_index = msi_desc->remap_index;
+
+ spin_unlock_irq(&desc->lock);
+
+ ASSERT(spin_is_locked(&pcidevs_lock));
+
+ /*
+ * For performance concern, we will store the 'iommu' pointer in
DYM "For performance reasons we should store ..."?
Post by Feng Wu
+ * 'struct msi_desc' in some other place, so we don't need to waste
+ * time searching it here. I will fix this later.
Instead of this last sentence I'd rather see the comment flagged
with FIXME or alike.
Post by Feng Wu
+ */
+ drhd = acpi_find_matched_drhd_unit(pci_dev);
+ if ( !drhd )
+ return -ENODEV;
+
+ iommu = drhd->iommu;
+ ir_ctrl = iommu_ir_ctrl(iommu);
+ if ( !ir_ctrl )
+ return -ENODEV;
+
+ spin_lock_irq(&ir_ctrl->iremap_lock);
+
+ GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
+
+ old_ire = new_ire = *p;
+
+ /* Setup/Update interrupt remapping table entry. */
+ setup_posted_irte(&new_ire, pi_desc, gvec);
+ ret = cmpxchg16b(p, &old_ire, &new_ire);
+
+ /*
+ * In the above, we use cmpxchg16 to atomically update the 128-bit IRTE,
+ * and the hardware cannot update the IRTE behind us, so the return value
+ * of cmpxchg16 should be the same as old_ire. This ASSERT validate it.
+ */
+ ASSERT(ret == *(__uint128_t *)&old_ire);
I continue to object to such casts. Make the union have a
__uint128_t variant, which you can then check here.

Jan
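[The last point could be addressed along these lines; the remap/post layouts
are elided and only the new raw view is shown, so treat this as a sketch
rather than the final layout.]

/* Sketch: add a raw 128-bit view to the IRTE union so the cmpxchg16b
 * result can be compared without casting through pointers. */
struct iremap_entry {
    union {
        __uint128_t val;                 /* raw view for atomic compare */
        struct { u64 lo, hi; };
        /* struct { ... } remap;  (remapped-format layout, unchanged) */
        /* struct { ... } post;   (posted-format layout, unchanged)   */
    };
};

/* In pi_update_irte() the assertion would then read:
 *     ASSERT(ret == old_ire.val);
 */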
Wu, Feng
2015-09-06 05:24:28 UTC
Permalink
-----Original Message-----
Sent: Friday, September 04, 2015 11:11 PM
To: Wu, Feng
Fraser
Subject: Re: [PATCH v6 11/18] vt-d: Add API to update IRTE when VT-d PI is used
Post by Feng Wu
This patch adds an API which is used to update the IRTE
for posted-interrupt when guest changes MSI/MSI-X information.
---
- In some error cases, the desc->lock will be unlocked twice, fix it.
Bug fixes invalidate prior acks.
Post by Feng Wu
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -899,3 +899,110 @@ void iommu_disable_x2apic_IR(void)
for_each_drhd_unit ( drhd )
disable_qinval(drhd->iommu);
}
+
+static void setup_posted_irte(struct iremap_entry *new_ire,
+ const struct pi_desc *pi_desc, const uint8_t gvec)
+{
+ new_ire->post.urg = 0;
+ new_ire->post.vector = gvec;
+ new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
+ new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
+
+ new_ire->post.res_1 = 0;
+ new_ire->post.res_2 = 0;
+ new_ire->post.res_3 = 0;
+ new_ire->post.res_4 = 0;
+ new_ire->post.res_5 = 0;
+
+ new_ire->post.im = 1;
+}
I think it would be better to just clear out the structure first. This
may then also allow for no longer naming all of the bitfield res_*
member, making it even more obvious that they're reserved. Or
wait - looking at the use below you seem to be updating the
descriptor. Why would you need to zero out reserved fields in
that case?
Here we first get the IRTE from hardware, which may be in
remapped format, and we still need some of the information in it
after updating it to posted format, such as fpd, sid, etc., so
I cannot clear it wholesale. Besides that, some fields which
are needed in remapped format are defined as reserved in posted
format, so I would need to clear the reserved fields one by one
when the entry read from hardware is in remapped format; here I
simply clear them in all cases. This is not a frequent operation,
so I think it is not a big deal.

Thanks,
Feng
Jan Beulich
2015-09-07 10:54:54 UTC
Permalink
Post by Wu, Feng
Sent: Friday, September 04, 2015 11:11 PM
Post by Feng Wu
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -899,3 +899,110 @@ void iommu_disable_x2apic_IR(void)
for_each_drhd_unit ( drhd )
disable_qinval(drhd->iommu);
}
+
+static void setup_posted_irte(struct iremap_entry *new_ire,
+ const struct pi_desc *pi_desc, const uint8_t gvec)
+{
+ new_ire->post.urg = 0;
+ new_ire->post.vector = gvec;
+ new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
+ new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
+
+ new_ire->post.res_1 = 0;
+ new_ire->post.res_2 = 0;
+ new_ire->post.res_3 = 0;
+ new_ire->post.res_4 = 0;
+ new_ire->post.res_5 = 0;
+
+ new_ire->post.im = 1;
+}
I think it would be better to just clear out the structure first. This
may then also allow for no longer naming all of the bitfield res_*
member, making it even more obvious that they're reserved. Or
wait - looking at the use below you seem to be updating the
descriptor. Why would you need to zero out reserved fields in
that case?
Here we first get the IRTE from hardware, which may be in
remapped format, we still need some of the information in it
after updating it to posted format, such as, fpd, sid, etc. So
I cannot clear it here. Besides that, there are some fields which
are needed in remapped format are defined as reserved in posted
format, I need to clear the reserved field one by one if we get a
remapped format from the hardware, here I just simply clear it
for all the cases and it is not a frequent operations, I think it is
not a big deal.
While perhaps indeed not a big deal performance wise, I would
suppose you at least partly agree that the code above looks
ugly. I'd much rather see made explicit which fields you re-use
than which (reserved) fields you clear. I.e. start out from a
zeroed structure, copy over all fields from the original which you
want to retain, and fill in any non-reserved fields you have to
give new values.

Jan
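[A sketch of the shape being asked for above; the retained field names
(fpd, sid, sq, svt, p) go beyond what the quoted diff shows and are
assumptions about the iremap_entry layout.]

static void setup_posted_irte(struct iremap_entry *new_ire,
                              const struct iremap_entry *old_ire,
                              const struct pi_desc *pi_desc, uint8_t gvec)
{
    /* Start from a zeroed entry so reserved bits cannot leak through. */
    memset(new_ire, 0, sizeof(*new_ire));

    /* Copy over the fields that keep their meaning in posted format
     * (assumed names; adjust to the actual iremap_entry layout). */
    new_ire->post.fpd = old_ire->remap.fpd;
    new_ire->post.sid = old_ire->remap.sid;
    new_ire->post.sq  = old_ire->remap.sq;
    new_ire->post.svt = old_ire->remap.svt;
    new_ire->post.p   = old_ire->remap.p;

    /* Fill in the posted-format specific fields. */
    new_ire->post.im     = 1;
    new_ire->post.vector = gvec;
    new_ire->post.pda_l  = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
    new_ire->post.pda_h  = virt_to_maddr(pi_desc) >> 32;
}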
Feng Wu
2015-08-25 01:57:47 UTC
Permalink
Currently we don't support urgent interrupts; all interrupts
are recognized as non-urgent, so we cannot send a
posted-interrupt when 'SN' is set.

CC: Kevin Tian <***@intel.com>
CC: Keir Fraser <***@xen.org>
CC: Jan Beulich <***@suse.com>
CC: Andrew Cooper <***@citrix.com>
Signed-off-by: Feng Wu <***@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <***@oracle.com>
---
v6:
- Add some comments

v5:
- keep the vcpu_kick() at the end of vmx_deliver_posted_intr()
- Keep the 'return' after calling __vmx_deliver_posted_interrupt()

v4:
- Coding style.
- V3 removes a vcpu_kick() from the eoi_exitmap_changed path
incorrectly, fix it.

v3:
- use cmpxchg to test SN/ON and set ON

xen/arch/x86/hvm/vmx/vmx.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c32d863..2c1c770 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1701,8 +1701,36 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
*/
pi_set_on(&v->arch.hvm_vmx.pi_desc);
}
- else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
+ else
{
+ struct pi_desc old, new, prev;
+
+ /* To skip over first check in the loop below. */
+ prev.control = 0;
+
+ do {
+ /*
+ * Currently, we don't support urgent interrupt, all
+ * interrupts are recognized as non-urgent interrupt,
+ * so we cannot send posted-interrupt when 'SN' is set.
+ * Besides that, if 'ON' is already set, we cannot set
+ * posted-interrupts as well.
+ */
+ if ( pi_test_sn(&prev) || pi_test_on(&prev) )
+ {
+ vcpu_kick(v);
+ return;
+ }
+
+ old.control = v->arch.hvm_vmx.pi_desc.control &
+ ~( 1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN );
+ new.control = v->arch.hvm_vmx.pi_desc.control |
+ 1 << POSTED_INTR_ON;
+
+ prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
+ old.control, new.control);
+ } while ( prev.control != old.control );
+
__vmx_deliver_posted_interrupt(v);
return;
}
--
2.1.0
Jan Beulich
2015-09-04 14:53:16 UTC
Permalink
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1701,8 +1701,36 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
*/
pi_set_on(&v->arch.hvm_vmx.pi_desc);
}
- else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
+ else
{
+ struct pi_desc old, new, prev;
+
+ /* To skip over first check in the loop below. */
+ prev.control = 0;
Why don't you just read the field instead of adding the comment?
Post by Feng Wu
+ do {
+ /*
+ * Currently, we don't support urgent interrupt, all
+ * interrupts are recognized as non-urgent interrupt,
+ * so we cannot send posted-interrupt when 'SN' is set.
+ * Besides that, if 'ON' is already set, we cannot set
+ * posted-interrupts as well.
+ */
+ if ( pi_test_sn(&prev) || pi_test_on(&prev) )
+ {
+ vcpu_kick(v);
+ return;
+ }
+
+ old.control = v->arch.hvm_vmx.pi_desc.control &
+ ~( 1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN );
+ new.control = v->arch.hvm_vmx.pi_desc.control |
+ 1 << POSTED_INTR_ON;
Missing parentheses around the shifts. In the former case there are
also stray blanks inside the parentheses.

Jan
Wu, Feng
2015-09-06 02:33:30 UTC
Permalink
-----Original Message-----
Sent: Friday, September 04, 2015 10:53 PM
To: Wu, Feng
Subject: Re: [PATCH v6 08/18] vmx: Suppress posting interrupts when 'SN' is set
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1701,8 +1701,36 @@ static void vmx_deliver_posted_intr(struct vcpu
*v, u8 vector)
Post by Feng Wu
*/
pi_set_on(&v->arch.hvm_vmx.pi_desc);
}
- else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
+ else
{
+ struct pi_desc old, new, prev;
+
+ /* To skip over first check in the loop below. */
+ prev.control = 0;
Why don't you just read the field instead of adding the comment?
What do you mean by "read the field"? Could you please elaborate it?

Thanks,
Feng
Jan Beulich
2015-09-07 10:51:20 UTC
Permalink
Post by Wu, Feng
Sent: Friday, September 04, 2015 10:53 PM
Post by Feng Wu
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1701,8 +1701,36 @@ static void vmx_deliver_posted_intr(struct vcpu
*v, u8 vector)
Post by Feng Wu
*/
pi_set_on(&v->arch.hvm_vmx.pi_desc);
}
- else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
+ else
{
+ struct pi_desc old, new, prev;
+
+ /* To skip over first check in the loop below. */
+ prev.control = 0;
Why don't you just read the field instead of adding the comment?
What do you mean by "read the field"? Could you please elaborate it?
prev.control = v->arch.hvm_vmx.pi_desc.control;

Jan
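[Putting both review comments together - read the live control word instead
of starting from zero, and parenthesise the shifts - the hunk would become
roughly the following sketch; surrounding code as in the patch.]

    struct pi_desc old, new, prev;

    prev.control = v->arch.hvm_vmx.pi_desc.control;

    do {
        /*
         * No urgent interrupts are supported yet, so nothing may be
         * posted while 'SN' is set; if 'ON' is already set there is
         * nothing left to do either.
         */
        if ( pi_test_sn(&prev) || pi_test_on(&prev) )
        {
            vcpu_kick(v);
            return;
        }

        old.control = v->arch.hvm_vmx.pi_desc.control &
                      ~((1 << POSTED_INTR_ON) | (1 << POSTED_INTR_SN));
        new.control = old.control | (1 << POSTED_INTR_ON);

        prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
                               old.control, new.control);
    } while ( prev.control != old.control );

    __vmx_deliver_posted_interrupt(v);
    return;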
Wu, Feng
2015-09-01 05:13:46 UTC
Permalink
Kindly ping ...

Since this series has really lasted for a long period, I'd like to close it as soon as possible.

Thanks,
Feng
-----Original Message-----
From: Wu, Feng
Sent: Tuesday, August 25, 2015 9:58 AM
Cc: Wu, Feng
Subject: [PATCH v6 00/18] Add VT-d Posted-Interrupts support
VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog
y/vt-directed-io-spec.html
VT-d Posted-interrupt (PI) design
Add cmpxchg16b support for x86-64
iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
vt-d: VT-d Posted-Interrupts feature detection
vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
vmx: Add some helper functions for Posted-Interrupts
vmx: Initialize VT-d Posted-Interrupts Descriptor
vmx: Suppress posting interrupts when 'SN' is set
VT-d: Remove pointless casts
vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
vt-d: Add API to update IRTE when VT-d PI is used
x86: move some APIC related macros to apicdef.h
Update IRTE according to guest interrupt config changes
vmx: posted-interrupt handling when vCPU is blocked
vmx: Properly handle notification event when vCPU is running
vmx: Add some scheduler hooks for VT-d posted interrupts
VT-d: Dump the posted format IRTE
Add a command line parameter for VT-d posted-interrupts
docs/misc/vtd-pi.txt | 332
+++++++++++++++++++++++++++++++++
docs/misc/xen-command-line.markdown | 9 +-
xen/arch/x86/domain.c | 19 ++
xen/arch/x86/hvm/vlapic.c | 5 -
xen/arch/x86/hvm/vmx/vmcs.c | 21 +++
xen/arch/x86/hvm/vmx/vmx.c | 289
+++++++++++++++++++++++++++-
xen/common/schedule.c | 2 +
xen/drivers/passthrough/io.c | 125 ++++++++++++-
xen/drivers/passthrough/iommu.c | 16 +-
xen/drivers/passthrough/vtd/intremap.c | 199 +++++++++++++++-----
xen/drivers/passthrough/vtd/iommu.c | 17 +-
xen/drivers/passthrough/vtd/iommu.h | 50 +++--
xen/drivers/passthrough/vtd/utils.c | 59 ++++--
xen/include/asm-arm/domain.h | 2 +
xen/include/asm-x86/apicdef.h | 4 +
xen/include/asm-x86/domain.h | 3 +
xen/include/asm-x86/hvm/hvm.h | 2 +
xen/include/asm-x86/hvm/vmx/vmcs.h | 26 ++-
xen/include/asm-x86/hvm/vmx/vmx.h | 28 +++
xen/include/asm-x86/iommu.h | 2 +
xen/include/asm-x86/x86_64/system.h | 28 +++
xen/include/xen/iommu.h | 2 +-
22 files changed, 1155 insertions(+), 85 deletions(-)
create mode 100644 docs/misc/vtd-pi.txt
--
2.1.0
Jan Beulich
2015-09-01 05:20:41 UTC
Permalink
Post by Wu, Feng
Kindly ping ...
Since this series really last for a long period, I'd like to close it as soon as possible.
Understood, but you know the tree didn't re-open yet? Nor did I get around to
look at the most recent version (apparently the same for others). Please be
patient, it's not being forgotten and not going to get lost.

Jan
Wu, Feng
2015-09-01 05:32:06 UTC
Permalink
-----Original Message-----
Sent: Tuesday, September 01, 2015 1:21 PM
To: Wu, Feng
Subject: Re: [Xen-devel] [PATCH v6 00/18] Add VT-d Posted-Interrupts support
Post by Wu, Feng
Kindly ping ...
Since this series really last for a long period, I'd like to close it as soon as
possible.
Understood, but you know the tree didn't re-open yet? Nor did I get around to
look at the most recent version (apparently the same for others). Please be
patient, it's not being forgotten and not going to get lost.
Thanks a lot for your reply. Good to know this series is not being forgotten. :)

Thanks,
Feng
Jan
Jan Beulich
2015-09-07 13:05:44 UTC
Permalink
Post by Feng Wu
Enable VT-d Posted-Interrupts and add a command line
parameter for it.
Acked-by: Jan Beulich <***@suse.com>