Discussion:
[boost] [compute] review
Gruenke,Matt
2014-12-28 09:54:22 UTC
> 1. What is your evaluation of the design?



I agree with other comments made about synchronization. The design should be more explicit about what's asynchronous, and ideally utilize only one mechanism for synchronization. One approach that would have made this clearer is if high-level operations were actually objects that got posted to the command_queue.



Regarding synchronization, I'm also a bit concerned about the performance impact of synchronizing on all copies to host memory. Overuse of synchronization can easily result in performance deterioration. On this point, I think it might be worth limiting the host memory usable with algorithms to containers that perform implicit synchronization upon actual use (or destruction) of results. Give users the choice between that or performing explicit copies to raw types.
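
To sketch the kind of container I have in mind (purely illustrative; nothing like this exists in Boost.Compute today, and the names are invented):

#include <cstddef>
#include <vector>

#include <boost/compute/event.hpp>

template <typename T>
class deferred_host_vector
{
public:
    explicit deferred_host_vector(std::size_t n)
        : data_(n), has_pending_(false) {}

    // recorded by the library after it enqueues an asynchronous
    // device-to-host read into data_
    void set_pending(const boost::compute::event& e)
    {
        pending_ = e;
        has_pending_ = true;
    }

    // synchronization happens here, on actual use of the results
    const T& operator[](std::size_t i) const
    {
        if (has_pending_) {
            pending_.wait();   // blocks until the transfer completes
            has_pending_ = false;
        }
        return data_[i];
    }

private:
    std::vector<T> data_;
    mutable boost::compute::event pending_;
    mutable bool has_pending_;
};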



Also, I agree with Thomas M that it'd be useful for operations to return events. That said, if you told me it doubled the overhead involved in dispatching operations, I'd suggest making it optional through overloads or via a mechanism in the command_queue itself.
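
For example, the overload route might look something like this (a hypothetical shape; use_event_t is an invented tag type, and my_command_queue stands in for whatever class would carry it):

#include <boost/compute/event.hpp>
#include <boost/compute/kernel.hpp>

// invented tag type, used only to select the event-returning overload
struct use_event_t {};
static const use_event_t use_event = {};

// hypothetical queue wrapper demonstrating the overload-based opt-in
class my_command_queue
{
public:
    // fire-and-forget: no event object is created or tracked
    void enqueue_task(const boost::compute::kernel& k);

    // opt-in: the caller pays the event overhead only when asking for it
    boost::compute::event enqueue_task(const boost::compute::kernel& k,
                                       use_event_t);
};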





> 2. What is your evaluation of the implementation?



It seems a bit heavy to be header-only.



I'd agree with Andrey Semashev, regarding thread safety, except I think it shouldn't be an option at all. That said, not all operations on all objects need be thread safe. I think the Boost.ASIO documentation provides a good example of how to express different thread safety limitations. Now, I haven't examined the use of thread-local storage, but do you feel there's a strong case to be made for the thread safety it's providing (especially, given that you seem to feel it's non-essential)?



I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.



I didn't notice any unit tests specifically intended to verify exception-safety. I'd want to know that had been thoroughly vetted, before I'd consider using Boost.Compute in production code.



Finally, based on the docs, it seems there's a bug in boost::compute::command_queue::enqueue_copy_image_to_buffer() and boost::compute::command_queue::enqueue_copy_buffer_to_image(). The region and origin parameters should be arrays, in the 2D and 3D cases. If this discrepancy is intentional, it should be noted. I haven't exhaustively reviewed the reference docs, so there might be other issues of this sort.
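
For reference, the underlying OpenCL C call takes the origin and region as 3-element arrays, which is what I'd expect the wrappers to mirror in the 2D/3D cases. A sketch of the raw call (queue, src_image, and dst_buffer are assumed to be valid handles):

#include <CL/cl.h>

void copy_image_region(cl_command_queue queue, cl_mem src_image,
                       cl_mem dst_buffer)
{
    size_t src_origin[3] = {0, 0, 0};   // x, y, z (z = 0 for a 2D image)
    size_t region[3]     = {64, 64, 1}; // width, height, depth (1 for 2D)

    clEnqueueCopyImageToBuffer(queue, src_image, dst_buffer,
                               src_origin, region,
                               0,               // dst_offset, in bytes
                               0, NULL, NULL);  // no wait list, no out-event
}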





> 3. What is your evaluation of the documentation?



It needs more and better-documented tutorials (and I'm including the "Advanced Topics", here).



Some of the reference pages could use more details, as well, especially concerning synchronization and exception usage.



Regarding exceptions, there should be discussion of specifically what is invalidated by which kinds of errors. If any errors are non-recoverable, this should also be called out and perhaps derived from a different or additional base class (not a bad use case for multiple inheritance, actually).
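
Something along these lines (a hypothetical sketch; none of these classes exist in Boost.Compute):

#include <stdexcept>

struct compute_error : std::runtime_error
{
    explicit compute_error(const char* what) : std::runtime_error(what) {}
};

// marker base: catching this means the context/device can't be reused
struct non_recoverable {};

struct device_lost_error : compute_error, non_recoverable
{
    device_lost_error() : compute_error("device lost") {}
};

A caller could then catch non_recoverable to tear everything down, while handling ordinary compute_errors by retrying.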



I agree with Mathias Gaunard that it would be helpful to discuss computational complexity, device memory requirements, and the various methods used to implement some of the algorithms. You could continue to omit these details on the simple algorithms, though.





> 4. What is your evaluation of the potential usefulness of the library?



This is my biggest misgiving, by far. In the very near future, I expect developers will opt for either SYCL (https://www.khronos.org/opencl/sycl) or Bolt (https://github.com/HSA-Libraries/Bolt). SYCL provides a modern, standard C++11 wrapper around OpenCL, with better concurrency control and support for integrating kernels inline. Bolt provides many of the same higher-level abstractions found in Boost.Compute, but with forthcoming support for HSA.



To have the kind of lasting relevance and broad applicability to which all Boost libraries should aspire, I think Boost.Compute should be architected to support multiple backends. Though OpenCL support is currently ascendant, it's far from universal and is already flagging on some platforms (Nvidia, not the least). And HSA provides a foundation on which alternatives are actively being built. Most importantly, there exist multitudes of multi-core and multiprocessor systems which lack OpenCL support. It would be eminently useful to support these with such backends as thread pool, OpenMP, etc. And backends could be added to support new technologies, as they mature.





> 5. Did you try to use the library? With what compiler? Did you have any problems?



I downloaded and inspected the sources and some examples. I didn't have any questions or concerns that would be addressed by experimentation.





> 6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?



Review of the documentation, some sources, some examples, and a refresher on SYCL & Bolt. I also read all of the reviews and replies sent thus far.





> 7. Are you knowledgeable about the problem domain?



I've been developing production code for parallel and heterogeneous computing platforms for the majority of the last 16 years. I've followed OpenCL since version 1.0 was finalized. I've also submitted bug reports and minor patches for Khronos' official C++ OpenCL interface.





> 8. Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.



Not in its current state. The primary issue is that it's too OpenCL-centric. It should support more backends, even if these are just compile-time options. This is crucial for both current and lasting relevance.



Beyond that, better concurrency control is crucial for optimum performance. If done well, it could even help further reduce usage errors.





Matt


Sebastian Schaetz
2014-12-28 12:26:23 UTC
Gruenke,Matt <mgruenke <at> Tycoint.com> writes:

> This is my biggest misgiving, by far. In the very near future, I expect
> developers will opt for either SYCL (https://www.khronos.org/opencl/sycl)
> or Bolt (https://github.com/HSA-Libraries/Bolt). SYCL provides a modern,
> standard C++11 wrapper around OpenCL, with better concurrency control and
> support for integrating kernels inline. Bolt provides many of the same
> higher-level abstractions found in Boost.Compute, but with forthcoming
> support for HSA.

Bolt relies on an extension to OpenCL called the "OpenCL Static C++ Kernel
Language Extension". Only AMD has bothered to implement it, to my knowledge.

C++ AMP is in my opinion a more promising proposal compared to SYCL.
Developers opt for C++ AMP today. But both SYCL and C++ AMP are higher level
tools and have the disadvantages of any higher level library compared to a
lower level library. In addition, they need a custom compiler or compiler
extensions. This increases the fragmentation of the accelerator co-processor
field further.

I think Boost.Compute does the right thing here. Identify the lowest common
denominator: OpenCL. Build a library on top of it that anyone can use on any
platform, provided a standard C++ compiler is available and the OpenCL
library is implemented. Build whatever fancy thing you want on top of that.

> To have the kind of lasting relevance and broad applicability to which all
Boost libraries should aspire, I
> think Boost.Compute should be architected to support multiple backends.
Though OpenCL support is
> currently ascendant, it's far from universal and is already flagging on
some platforms (Nvidia, not the
> least). And HSA provides a foundation on which alternatives are actively
being built. Most importantly,
> there exist multitudes of multi-core and multiprocessor systems which lack
OpenCL support. It would be
> eminently useful to support these with such backends as thread pool,
OpenMP, etc. And backends could be
> added to support new technologies, as they mature.

OpenCL is supposed to be the abstraction layer that does all that, remember?
That is, support multi-core, multi-processor and many-core vector
co-processors. Asking Boost.Compute to support threading and OpenMP is
asking it to do the job of OpenCL library implementers. To play the heretic
for the sake of argument: why stop at single nodes then? Why not add, on top
of the OpenMP/threading layer you ask Boost.Compute to support, an MPI
layer? I urge you to not open this can of worms.


Mathias Gaunard
2014-12-28 13:00:06 UTC
On 28/12/2014 13:26, Sebastian Schaetz wrote:

> C++ AMP is in my opinion a more promising proposal compared to SYCL.
> Developers opt for C++ AMP today. But both SYCL and C++ AMP are higher level
> tools and have the disadvantages of any higher level library compared to a
> lower level library. In addition, they need a custom compiler or compiler
> extensions. This increases the fragmentation of the accelerator co-processor
> field further.

C++AMP is Microsoft proprietary technology for Visual Studio, with all
the associated problems one might expect.

SYCL is an open standard designed to integrate with C++11 as well as
possible, and has an implementation within the LLVM framework.

It's pretty much DirectX vs OpenGL again.

Sebastian Schaetz
2014-12-28 15:21:27 UTC
Mathias Gaunard <mathias.gaunard <at> ens-lyon.org> writes:

> C++AMP is Microsoft proprietary technology for Visual Studio, with all
> the associated problems one might expect.
>
> SYCL is an open standard designed to integrate with C++11 as well as
> possible, and has an implementation within the LLVM framework.
>
> It's pretty much DirectX vs OpenGL again.

It is not. C++ AMP is an open standard [0] implemented not only by Microsoft
[1].

[0]
http://blogs.msdn.com/b/nativeconcurrency/archive/2013/12/12/c-amp-open-spec-v1-2-published.aspx
[1] https://bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home


Mathias Gaunard
2014-12-28 19:36:17 UTC
On 28/12/2014 16:21, Sebastian Schaetz wrote:
> Mathias Gaunard <mathias.gaunard <at> ens-lyon.org> writes:
>
>> C++AMP is Microsoft proprietary technology for Visual Studio, with all
>> the associated problems one might expect.
>>
>> SYCL is an open standard designed to integrate with C++11 as well as
>> possible, and has an implementation within the LLVM framework.
>>
>> It's pretty much DirectX vs OpenGL again.
>
> It is not. C++ AMP is an open standard [0]

It's not recognized as a standard by any standards body, though, and its
drafting didn't involve anyone but Microsoft.

It's pretty much just Microsoft publishing their specifications (which
their implementation probably doesn't even follow to the letter) so that
others can try to make interoperable technology.


> implemented not only by Microsoft [1].

This implementation is highly experimental for the moment, however.

Antony Polukhin
2014-12-28 16:36:01 UTC
2014-12-28 15:26 GMT+03:00 Sebastian Schaetz <***@gmail.com>:

> Gruenke,Matt <mgruenke <at> Tycoint.com> writes:
> > To have the kind of lasting relevance and broad applicability to which
> > all Boost libraries should aspire, I think Boost.Compute should be
> > architected to support multiple backends. Though OpenCL support is
> > currently ascendant, it's far from universal and is already flagging on
> > some platforms (Nvidia, not the least). And HSA provides a foundation
> > on which alternatives are actively being built. Most importantly, there
> > exist multitudes of multi-core and multiprocessor systems which lack
> > OpenCL support. It would be eminently useful to support these with such
> > backends as thread pool, OpenMP, etc. And backends could be added to
> > support new technologies, as they mature.
>
> OpenCL is supposed to be the abstraction layer that does all that,
> remember?
> That is, support multi-core, multi-processor and many-core vector
> co-processors. Asking Boost.Compute to support threading and OpenMP is
> asking it to do the job of OpenCL library implementers.
>

Totally agree.

Here's my 0.05 rubles (it's much less than 5 cents):

* OpenCL is the non-proprietary, open, widely used technology that is meant
to be the abstraction atop any computing device. That's a 100% right choice.
* CUDA is a proprietary technology tied to NVidia graphics hardware. That's
a very platform-dependent solution for Boost. Some CUDA code could possibly
be added to Compute, but only if it gives a significant performance boost,
and an OpenCL implementation must be provided for non-NVidia platforms.
* C++ AMP is not licensed under MIT, GPL, Boost or BSD. It's a "Microsoft
Community Promise", which makes me nervous and does not look like a perfect
license for an open standard
(http://en.wikipedia.org/wiki/Microsoft_Open_Specification_Promise#Scope_limitation).
C++ AMP is not as popular as OpenCL or CUDA. The only non-Windows
implementation is a C++ AMP compiler that outputs to OpenCL. I see no
reason to use this technology.

--
Best regards,
Antony Polukhin

Gruenke,Matt
2014-12-28 23:55:06 UTC
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Sebastian Schaetz
Sent: Sunday, December 28, 2014 7:26
To: ***@lists.boost.org
Subject: Re: [boost] [compute] review

> Gruenke,Matt writes:

> C++ AMP is in my opinion a more promising proposal compared to SYCL.
> Developers opt for C++ AMP today. But both SYCL and C++ AMP are
> higher level tools and have the disadvantages of any higher level
> library compared to a lower level library.

How is SYCL a higher-level tool? Have a look at the provisional spec:

https://www.khronos.org/registry/sycl


It has equivalents of everything you find in Boost.Compute, *except* for the higher-level functions. Moreover, they introduce the notion of command groups, and possibly other low level features.


> In addition, they need a custom compiler or compiler extensions.

[Addressed in previous message. If you have any evidence to support this, please reply to that thread.]


> OpenCL is supposed to be the abstraction layer that does all that, remember?

In fact, that has been the primary factor fueling my interest, over the years. But there are many systems that still don't support it. And it's only one solution to this problem. There will doubtlessly be others, possibly spurred on by the advent of HSA and other technologies yet to arrive on the scene.


> I urge you to not open this can of worms.

I didn't mean to imply that it *needed* to have a backend for XYZ. I am merely *suggesting* backends such as a threadpool or possibly OpenMP. My point was about the design - that it should facilitate the addition of backends, in order to address both existing and future systems where OpenCL support is absent or inefficient.

Again, the key point is that the design should accommodate different backends. Whether a given backend is developed depends on whether there's enough interest for someone to write and maintain it. And perhaps some backends will exist only as proprietary patches maintained in private repositories of users. The main contribution of Boost.Compute would then be the framework, interface, and high-level algorithms.
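
To make "facilitate the addition of backends" concrete, the seam could be as small as something like this (purely hypothetical; nothing like it exists in Boost.Compute today):

#include <cstddef>

// the minimal backend concept; the high-level algorithms would be written
// against this interface, with opencl_backend, thread_pool_backend,
// openmp_backend, etc. plugged in beneath it
class backend
{
public:
    virtual ~backend() {}

    virtual void* allocate(std::size_t bytes) = 0;
    virtual void deallocate(void* p) = 0;

    // for a thread-pool or OpenMP backend, "device" memory is host memory
    // and these copies degenerate to memcpy
    virtual void copy_to_device(void* dst, const void* src,
                                std::size_t bytes) = 0;
    virtual void copy_to_host(void* dst, const void* src,
                              std::size_t bytes) = 0;
};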


Matt


Sebastian Schaetz
2014-12-29 00:36:58 UTC
Gruenke,Matt <mgruenke <at> Tycoint.com> writes:

> > C++ AMP is in my opinion a more promising proposal compared to SYCL.
> > Developers opt for C++ AMP today. But both SYCL and C++ AMP are
> > higher level tools and have the disadvantages of any higher level
> > library compared to a lower level library.
>
> How is SYCL a higher-level tool? Have a look at the provisional spec:
>
> https://www.khronos.org/registry/sycl
>
> It has equivalents of everything you find in Boost.Compute, *except* for
> the higher-level functions. Moreover, they introduce the notion of
> command groups, and possibly other low level features.

I have, and I judge it to be higher-level than OpenCL (or Boost.Compute).
SYCL abstracts memory access through 'accessors', and I'm not sure that you
can issue explicit memory copies in SYCL. Both OpenCL and Boost.Compute have
explicit copy functions with synchronous or asynchronous semantics. This is
also one of my major points of critique of C++ AMP - it is unclear when data
is transferred between host and device.

> In fact, that has been the primary factor fueling my interest, over the
> years. But there are many systems that still don't support it. And it's
> only one solution to this problem.

Agreed.

> > I urge you to not open this can of worms.
>
> I didn't mean to imply that it *needed* to have a backend for XYZ. I am
> merely *suggesting* backends such as a threadpool or possibly OpenMP. My
> point was about the design - that it should facilitate the addition of
> backends, in order to address both existing and future systems where
> OpenCL support is absent or inefficient.
>
> Again, the key point is that the design should accommodate different
> backends. Whether a given backend is developed depends on whether
> there's enough interest for someone to write and maintain it. And
> perhaps some backends will exist only as proprietary patches maintained
> in private repositories of users. The main contribution of Boost.Compute
> would then be the framework, interface, and high-level algorithms.

Such multi-backend support would have to sit at the STL level and could not
go lower than that. Devices, command queues and contexts don't make any
sense when thinking of an OpenMP backend, let alone a compute::vector with
an associated compute::copy. The library you're asking for is very different
from the library proposed here.


Kyle Lutz
2014-12-29 01:58:08 UTC
On Sun, Dec 28, 2014 at 3:55 PM, Gruenke,Matt <***@tycoint.com> wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Sebastian Schaetz
> Sent: Sunday, December 28, 2014 7:26
> To: ***@lists.boost.org
> Subject: Re: [boost] [compute] review
>
>> I urge you to not open this can of worms.
>
> I didn't mean to imply that it *needed* to have a backend for XYZ. I am merely *suggesting* backends such as a threadpool or possibly OpenMP. My point was about the design - that it should facilitate the addition of backends, in order to address both existing and future systems where OpenCL support is absent or inefficient.
>
> Again, the key point is that the design should accommodate different backends. Whether a given backend is developed depends on whether there's enough interest for someone to write and maintain it. And perhaps some backends will exist only as proprietary patches maintained in private repositories of users. The main contribution of Boost.Compute would then be the framework, interface, and high-level algorithms.

While I agree that this would be useful, an API like this isn't the
goal of Boost.Compute. I think there is room for a higher-level
library to provide a more abstract parallel interface which could
potentially use Boost.Compute in addition to other available parallel
APIs (e.g. OpenMP, TBB, CUDA, MPI, etc.). In fact this is very much
what the C++ parallelism TS is attempting to provide.

-kyle

Francois Duranleau
2014-12-29 15:58:07 UTC
On Sun, Dec 28, 2014 at 8:58 PM, Kyle Lutz <***@gmail.com> wrote:
>
> On Sun, Dec 28, 2014 at 3:55 PM, Gruenke,Matt <***@tycoint.com> wrote:
> >
> > I didn't mean to imply that it *needed* to have a backend for XYZ. I am merely *suggesting* backends such as a threadpool or possibly OpenMP. My point was about the design - that it should facilitate the addition of backends, in order to address both existing and future systems where OpenCL support is absent or inefficient.
> >
> > Again, the key point is that the design should accommodate different backends. Whether a given backend is developed depends on whether there's enough interest for someone to write and maintain it. And perhaps some backends will exist only as proprietary patches maintained in private repositories of users. The main contribution of Boost.Compute would then be the framework, interface, and high-level algorithms.
>
> While I agree that this would be useful, an API like this isn't the
> goal of Boost.Compute. I think there is room for a higher-level
> library to provide a more abstract parallel interface which could
> potentially use Boost.Compute in addition to other available parallel
> APIs (e.g. OpenMP, TBB, CUDA, MPI, etc.). In fact this is very much
> what the C++ parallelism TS is attempting to provide.

I generally agree with Matt. If we intend Boost.Compute to be widely
available and useful on many platforms (Windows, Mac, and Linux are far
from being the entire world today; just think of iOS, where there is
no OpenCL), sticking strictly to OpenCL does not seem to me a good
solution. Thus, if adding the possibility of other backends is not part
of Boost.Compute's goal, then why not name it Boost.OpenCL?

--
François Duranleau

Kyle Lutz
2014-12-28 19:41:36 UTC
On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt <***@tycoint.com> wrote:
>> 1. What is your evaluation of the design?
>
>
>
> I agree with other comments made about synchronization. The design should be more explicit about what's asynchronous, and ideally utilize only one mechanism for synchronization. One approach that would have made this clearer is if high-level operations were actually objects that got posted to the command_queue.

Like I mentioned before, there is only one method for asynchrony in
Boost.Compute: the command queue abstraction provided by OpenCL.
Operations are enqueued to be executed on the compute device, and this
occurs asynchronously with respect to code executing on the host. The
exceptions to this rule are functions which interact directly with
host memory, which by default are blocking and offer explicit
"_async()" versions (in order to prevent potential race conditions on
host memory).
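
For example, with queue and buf assumed to be a valid command_queue and buffer:

#include <vector>

#include <boost/compute/core.hpp>

namespace compute = boost::compute;

void read_results(compute::command_queue& queue, const compute::buffer& buf)
{
    std::vector<float> host(1024);

    // blocking by default: safe to touch host as soon as this returns
    queue.enqueue_read_buffer(buf, 0, 1024 * sizeof(float), &host[0]);

    // explicit asynchronous version: returns an event; host must stay
    // alive and untouched until the event completes
    compute::event e =
        queue.enqueue_read_buffer_async(buf, 0, 1024 * sizeof(float),
                                        &host[0]);
    e.wait();
}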

And I am hesitant to make the command_queue interface more high-level.
But this sort of functionality could certainly be built on top of
current interfaces in Boost.Compute. See some of the previous reviews
for discussion about this.

> Regarding synchronization, I'm also a bit concerned about the performance impact of synchronizing on all copies to host memory. Overuse of synchronization can easily result in performance deterioration. On this point, I think it might be worth limiting the host memory usable with algorithms to containers that perform implicit synchronization upon actual use (or destruction) of results. Give users the choice between that or performing explicit copies to raw types.

To be clear, not all copies involving host memory are synchronized.
Boost.Compute allows both synchronous and asynchronous memory
transfers between the host and device.
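
For example, both forms are available for host-device transfers:

#include <vector>

#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/async/future.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/core.hpp>

namespace compute = boost::compute;

void transfer(compute::context& ctx, compute::command_queue& queue)
{
    std::vector<int> host(1 << 20, 7);
    compute::vector<int> device(host.size(), ctx);

    // synchronous host-to-device transfer
    compute::copy(host.begin(), host.end(), device.begin(), queue);

    // asynchronous transfer: returns a future instead of blocking
    compute::future<compute::vector<int>::iterator> f =
        compute::copy_async(host.begin(), host.end(), device.begin(), queue);
    f.wait();
}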

And having an implicitly synchronized container class could be built
on top of the Boost.Compute APIs. I haven't implemented it yet as I
personally feel that blocking operations should be done explicitly by
the user and not hidden away in an implicit API. However, I'd be happy
to review and integrate functionality like this into Boost.Compute if
someone was interested in developing it.

> Also, I agree with Thomas M that it'd be useful for operations to return events. That said, if you told me it doubled the overhead involved in dispatching operations, I'd suggest making it optional through overloads or via a mechanism in the command_queue itself.

All asynchronous operations in the command queue class do return
events. One of his comments was to also return events from the
synchronous methods for consistency and I am working on adding this.

>> 2. What is your evaluation of the implementation?
>
>
>
> It seems a bit heavy to be header-only.
>
>
>
> I'd agree with Andrey Semashev, regarding thread safety, except I think it shouldn't be an option at all. That said, not all operations on all objects need be thread safe. I think the Boost.ASIO documentation provides a good example of how to express different thread safety limitations. Now, I haven't examined the use of thread-local storage, but do you feel there's a strong case to be made for the thread safety it's providing (especially, given that you seem to feel it's non-essential)?

I absolutely do not feel it is non-essential. The reason for it
currently being a compile-time configurable option is that enabling it
on non-C++11 compilers would make the library non-header-only (it
would require the users to link to Boost.Thread).
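
Concretely, the configuration looks like this today (the macro spelling here follows the documentation):

// enables thread safety; with C++11 the library stays header-only, while
// on C++03 compilers this additionally requires linking Boost.Thread
#define BOOST_COMPUTE_THREAD_SAFE
#include <boost/compute.hpp>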

This has come up a few times in the review and I have been considering
making the library non-header-only (or at least optionally
non-header-only) which would obviate some of these issues. However,
this is a much larger and potentially non-backwards-compatible change
and I'd need to run it by the community and provide a migration plan
for current users.

> I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.

I'm not sure if there are any Boost conventions for/against this
(someone please speak up if there are). I chose the trailing
underscore for these types as I needed a consistent spelling for
representing the fundamental OpenCL types (e.g. "float" or "uint" or
"int4") and I couldn't just use the names without the trailing
underscore as they'd conflict with the C++ reserved keywords (e.g.
"float", "int"). Using a leading underscore (e.g. "_float4") looked
worse to me, so I used a trailing underscore. But I'd definitely be
open to hearing other ideas.
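
For illustration, the current names line up with the kernel types like so (a small usage sketch):

#include <boost/compute/types/fundamental.hpp>

namespace compute = boost::compute;

compute::float_  a = 1.0f;                  // OpenCL kernel type "float"
compute::uint_   b = 42;                    // OpenCL kernel type "uint"
compute::float4_ c(1.0f, 2.0f, 3.0f, 4.0f); // OpenCL kernel type "float4"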

> I didn't notice any unit tests specifically intended to verify exception-safety. I'd want to know that had been thoroughly vetted, before I'd consider using Boost.Compute in production code.

There are already some tests for the error/exception paths in
Boost.Compute (testing building invalid programs in "test_program.cpp"
is the first that comes to mind). I'll work on adding more. Please let
me know if there are any specific areas you'd like to see more heavily
tested.

> Finally, based on the docs, it seems there's a bug in boost::compute::command_queue::enqueue_copy_image_to_buffer() and boost::compute::command_queue::enqueue_copy_buffer_to_image(). The region and origin parameters should be arrays, in the 2D and 3D cases. If this discrepancy is intentional, it should be noted. I haven't exhaustively reviewed the reference docs, so there might be other issues of this sort.

Can you be more specific as to what you consider to be a "bug" here?
These APIs in the command_queue class are very "thin" wrappers on top
of the clEnqueue* functions from the OpenCL C API. If you could
provide a small test-case which demonstrates the issue and submit it
to the bug-tracker [1] that would be great.

>> 3. What is your evaluation of the documentation?
>
>
>
> It needs more and better-documented tutorials (and I'm including the "Advanced Topics", here).

Fully agree, I'll work on this.

> Some of the reference pages could use more details, as well, especially concerning synchronization and exception usage.

Fully agree. I've been working on improving this. And please let me
know if there are any specific classes/functions for which you'd like
to see more detailed documentation.

> Regarding exceptions, there should be discussion of specifically what is invalidated by which kinds of errors. If any errors are non-recoverable, this should also be called out and perhaps derived from a different or additional base class (not a bad use case for multiple inheritance, actually).

As far as I know, there are no non-recoverable errors or errors which
invalidate host objects thrown by Boost.Compute. This is due to the
fact that Boost.Compute mainly deals with executing operations on a
separate device (e.g. GPU) and exceptions there don't affect the
host's ability to continue execution.

> I agree with Mathias Gaunard that it would be helpful to discuss computational complexity, device memory requirements, and the various methods used to implement some of the algorithms. You could continue to omit these details on the simple algorithms, though.

I will definitely continue to improve the documentation and
specifically add more of these sorts of details for the algorithms.
Please let me know if there are additional areas where you would also
like to see more specific details.

>> 4. What is your evaluation of the potential usefulness of the library?
>
>
>
> This is my biggest misgiving, by far. In the very near future, I expect developers will opt for either SYCL (https://www.khronos.org/opencl/sycl) or Bolt (https://github.com/HSA-Libraries/Bolt). SYCL provides a modern, standard C++11 wrapper around OpenCL, with better concurrency control and support for integrating kernels inline. Bolt provides many of the same higher-level abstractions found in Boost.Compute, but with forthcoming support for HSA.
>
>
>
> To have the kind of lasting relevance and broad applicability to which all Boost libraries should aspire, I think Boost.Compute should be architected to support multiple backends. Though OpenCL support is currently ascendant, it's far from universal and is already flagging on some platforms (Nvidia, not the least). And HSA provides a foundation on which alternatives are actively being built. Most importantly, there exist multitudes of multi-core and multiprocessor systems which lack OpenCL support. It would be eminently useful to support these with such backends as thread pool, OpenMP, etc. And backends could be added to support new technologies, as they mature.
>
>
>
>> 5. Did you try to use the library? With what compiler? Did you have any problems?
>
>
>
> I downloaded and inspected the sources and some examples. I didn't have any questions or concerns that would be addressed by experimentation.
>
>
>
>
>
>> 6. How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
>
>
>
> Review of the documentation, some sources, some examples, and a refresher on SYCL & Bolt. I also read all of the reviews and replies sent thus far.
>
>
>
>
>
>> 7. Are you knowledgeable about the problem domain?
>
>
>
> I've been developing production code for parallel and heterogeneous computing platforms for the majority of the last 16 years. I've followed OpenCL since version 1.0 was finalized. I've also submitted bug reports and minor patches for Khronos' official C++ OpenCL interface.
>
>
>
>
>
>> 8. Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.
>
>
>
> Not in its current state. The primary issue is that it's too OpenCL-centric. It should support more backends, even if these are just compile-time options. This is crucial for both current and lasting relevance.

Boost.Compute is explicitly *not* designed to be an abstraction layer
over every parallel-programming interface currently available. It is
designed to target modern programmable GPUs, multi-core CPUs and some
of the more exotic accelerators (Xeon Phi, Parallella Epiphany, FPGAs,
etc.). In my opinion, OpenCL is the current best and most
widely-supported interface for this purpose and fills this role well.

Other libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC,
etc.) are all dependent on either a special compiler or special
compiler extensions. On the other hand, OpenCL is a library-level
solution which allows Boost.Compute to be portable to any platform
with a standard C++ compiler.

And as Boost.Compute and SYCL are both based on OpenCL, interoperating
between them should be fairly trivial. If a non-commercial SYCL
implementation is ever released, I will work on adding support for
its single-source compilation model to Boost.Compute.

I feel developing an all-encompassing parallel programming library is
outside the scope of Boost.Compute (and would require considerably
more resources to design, implement, and maintain). That said, I feel
that Boost.Compute is well-suited to provide a back-end to a more
generic/high-level parallel programming library (either something like
a potential Boost.Parallel library or the upcoming standard C++
Parallelism TS).

More concretely, and as Sebastian mentioned, OpenCL itself already
provides a parallel-programming abstraction with many different
back-ends (both vendor-supplied implementations and open-source
implementations). I personally don't see the benefit of adding yet
another abstraction layer between the user and the compute device. I
feel it would greatly increase the complexity of the library and
prevent us from spending more time on improving performance.

> Beyond that, better concurrency control is crucial for optimum performance. If done well, it could even help further reduce usage errors.

Boost.Compute provides low-level abstractions for concurrency between
the host and device. More high-level APIs and frameworks could
definitely be built on top of them. However, designing an optimal,
high-level API for concurrency is notoriously difficult and, in my
opinion, still an open problem.

Thanks for the review. Please let me know if I can explain anything further.

-kyle

[1] https://github.com/kylelutz/compute/issues

Gruenke,Matt
2014-12-28 23:33:13 UTC
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 14:42
To: ***@lists.boost.org List
Subject: Re: [boost] [compute] review

> On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:

> Other libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC, etc.)
> are all dependent on either a special compiler or special compiler extensions.

According to Khronos, that's incorrect.

The provisional specification includes the following features:

* Specifications for creating C++ template libraries and compilers
using the C++11 standard

* Easy to use, production grade API - built on-top of OpenCL and SPIR(tm)

* Compatible with standard CPU C++ compilers across multiple platforms,
^^^^^^^^^^ ^^^^ ^^^^^^^^ ^^^ ^^^ ^^^^^^^^^
as well as enabling new SYCL device compilers to target OpenCL devices

* Asynchronous, low-level access to OpenCL features for high performance
and low-latency - while retaining ease of use


Source: https://www.khronos.org/opencl/sycl


If it didn't depend on C++11, or if Boost dropped the C++03 compatibility requirement for new libraries, I'd even recommend using their C++ abstractions instead of Boost.Compute's custom core.


Matt


Andrey Semashev
2014-12-28 23:46:33 UTC
On Mon, Dec 29, 2014 at 2:33 AM, Gruenke,Matt <***@tycoint.com> wrote:
>
> If it didn't depend on C++11, or if Boost dropped the C++03 compatibility requirement for new libraries, ...

There is no such requirement.

Sebastian Schaetz
2014-12-29 01:11:16 UTC
Gruenke,Matt <mgruenke <at> Tycoint.com> writes:

>
>
> -----Original Message-----
> From: Boost [mailto:boost-bounces <at> lists.boost.org] On Behalf Of Kyle Lutz
> Sent: Sunday, December 28, 2014 14:42
> To: boost <at> lists.boost.org List
> Subject: Re: [boost] [compute] review
>
> > On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
>
> > Other libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC, etc.)
> > are all dependent on either a special compiler or special compiler
> > extensions.
>
> According to Khronos, that's incorrect.
>
> The provisional specification includes the following features:
>
> * Specifications for creating C++ template libraries and compilers
> using the C++11 standard
>
> * Easy to use, production grade API - built on-top of OpenCL and SPIR(tm)
>
> * Compatible with standard CPU C++ compilers across multiple platforms,
> ^^^^^^^^^^ ^^^^ ^^^^^^^^ ^^^ ^^^ ^^^^^^^^^
> as well as enabling new SYCL device compilers to target OpenCL devices
>
> * Asynchronous, low-level access to OpenCL features for high performance
> and low-latency - while retaining ease of use
>
> Source: https://www.khronos.org/opencl/sycl

Slides 14 and 15 of this slide deck [0] show SYCL device compilers. SYCL
host code compiles using a standard C++11 compiler; to compile the SYCL
kernel code, however, you need "either a special compiler or special
compiler extensions". Codeplay calls this the "Shared Source Programming
Model"; you can learn more about it from slide 22 onwards in this slide
deck [1]. It is the same thing CUDA has been doing since its initial
release.

But that is not unexpected at all. I don't think there is a technique to
generate kernel code from

command_group(myQueue, [&]()
{
    // Data accessors
    // [...]
    // Kernel
    parallel_for(count, kernel_functor([=](id<> item) {
        int i = item.get_global(0);
        r[i] = a[i] + b[i] + c[i];
    }));
});

within standard C++11. You need a second compiler or an extension.

Boost.Compute can do most things SYCL can do - except you don't need
anything but a regular C++ compiler and the OpenCL library.

[0] http://tinyurl.com/q7f5nm4
[1] http://tinyurl.com/nffyyon


Kyle Lutz
2014-12-29 01:11:51 UTC
On Sun, Dec 28, 2014 at 3:33 PM, Gruenke,Matt <***@tycoint.com> wrote:
>
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
> Sent: Sunday, December 28, 2014 14:42
> To: ***@lists.boost.org List
> Subject: Re: [boost] [compute] review
>
>> On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
>
>> Other libraries/frameworks (such as SYCL, Bolt, C++AMP, OpenACC, etc.)
>> are all dependent on either a special compiler or special compiler extensions.
>
> According to Khronos, that's incorrect.
>
> The provisional specification includes the following features:
>
> * Specifications for creating C++ template libraries and compilers
> using the C++11 standard
>
> * Easy to use, production grade API - built on-top of OpenCL and SPIR(tm)
>
> * Compatible with standard CPU C++ compilers across multiple platforms,
> ^^^^^^^^^^ ^^^^ ^^^^^^^^ ^^^ ^^^ ^^^^^^^^^
> as well as enabling new SYCL device compilers to target OpenCL devices
>
> * Asynchronous, low-level access to OpenCL features for high performance
> and low-latency - while retaining ease of use
>
>
> Source: https://www.khronos.org/opencl/sycl

Well, let me back up. Currently there is no publicly available
implementation of SYCL which can offload computation to a GPU. Any
possible implementation of SYCL with the ability to perform work on a
GPU involving custom kernel functions specified in C++ would require
non-standard (w.r.t. the C++ standard) support from the compiler. In
fact, I would be very happy to see any proof to the contrary, as it
would be very useful for developing Boost.Compute itself. So while you
are correct that the Khronos specification doesn't explicitly require
special compiler support, any practical implementation of SYCL would.

But to more generally address these comments which have come up in
review, the SYCL specification has yet to see a real-world
implementation and, as far as Boost.Compute is concerned, SYCL is
essentially vaporware. While I do see how the SYCL interface could be
used together with Boost.Compute, I cannot currently use it as the
basis for Boost.Compute as it does not exist in any practically
usable form. In order for me to seriously consider using SYCL for
Boost.Compute, I would like to see a publicly available implementation
which supports NVIDIA GPUs, AMD CPUs/GPUs and Intel CPUs/GPUs.

-kyle

Andrey Semashev
2014-12-28 23:38:54 UTC
On Sun, Dec 28, 2014 at 10:41 PM, Kyle Lutz <***@gmail.com> wrote:
> On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt <***@tycoint.com> wrote:
>>
>> I'd agree with Andrey Semashev, regarding thread safety, except I think it shouldn't be an option at all. That said, not all operations on all objects need be thread safe. I think the Boost.ASIO documentation provides a good example of how to express different thread safety limitations. Now, I haven't examined the use of thread-local storage, but do you feel there's a strong case to be made for the thread safety it's providing (especially, given that you seem to feel it's non-essential)?
>
> I absolutely do not feel it is non-essential. The reason for it
> currently being a compile-time configurable option is that enabling it
> on non-C++11 compilers would make the library non-header-only (it
> would require the users to link to Boost.Thread).

I'd say, in C++03 it is only nominally header-only. I consider the
multi-threaded mode the default (i.e. it is what I want in 99% of
use cases, and I think that's common), and in this mode users will have
to link with Boost.Thread anyway. If you truly want to pursue the
header-only property, I think you have to make it header-only in all
typical configs.

>> I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.
>
> I'm not sure if there are any Boost conventions for/against this
> (someone please speak up if there are). I chose the trailing
> underscore for these types as I needed a consistent spelling for
> representing the fundamental OpenCL types (e.g. "float" or "uint" or
> "int4") and I couldn't just use the names without the trailing
> underscore as they'd conflict with the C++ reserved keywords (e.g.
> "float", "int"). Using a leading underscore (e.g. "_float4") looked
> worse to me, so I used a trailing underscore. But I'd definitely be
> open to hearing other ideas.

There is precedent in the standard int types (e.g. uint32_t), but I
assume in your case int4_ means a vector of 4 ints, right? I'd say, a
better name would reflect the fact that this is a vector type and also
specify the element type. I don't know if OpenCL defines sizes of its
types - if it does, so should Boost.Compute. A few examples of what I
have in mind: v4int_t, v4uint32_t, uint32x4_t, floatx4_t. You may also
want to see how Boost.SIMD deals with vector types (packs, in its
terminology).

Andrey Semashev
2014-12-28 23:44:04 UTC
On Mon, Dec 29, 2014 at 1:31 AM, John Bytheway
<jbytheway+***@gmail.com> wrote:
> On 2014-12-28 14:41, Kyle Lutz wrote:
>> On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt <***@tycoint.com> wrote:
>>> I'm a bit concerned about the use of type names ending in '_', such as float4_. Is this consistent with Boost conventions? I've only seen that used in other Boost libraries to denote class member variables.
>>
>> I'm not sure if there are any Boost conventions for/against this
>> (someone please speak up if there are). I chose the trailing
>> underscore for these types as I needed a consistent spelling for
>> representing the fundamental OpenCL types (e.g. "float" or "uint" or
>> "int4") and I couldn't just use the names without the trailing
>> underscore as they'd conflict with the C++ reserved keywords (e.g.
>> "float", "int"). Using a leading underscore (e.g. "_float4") looked
>> worse to me, so I used a trailing underscore. But I'd definitely be
>> open to hearing other ideas.
>
> There is precedent for the trailing underscore convention. See for
> example
> <http://www.boost.org/doc/libs/1_57_0/libs/mpl/doc/refmanual/numeric.html>.

The trailing underscore syntax is typically used when a library
construct has semantics very similar to a language component which
uses a keyword. I.e. mpl::if_ is similar to the language if, and
mpl::int_ to the language int, except that mpl tools operate at
compile time. In this case, I think, the semantics are sufficiently
different (vector types instead of scalar) to warrant distinct and
descriptive names.

Gruenke,Matt
2014-12-29 00:19:57 UTC
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Andrey Semashev
Sent: Sunday, December 28, 2014 18:44
To: ***@lists.boost.org
Subject: Re: [boost] [compute] review

> On Mon, Dec 29, 2014 at 1:31 AM, John Bytheway <jbytheway+***@gmail.com> wrote:

> The trailing underscore syntax is typically used when a library construct has
> semantics very similar to a language component which uses a keyword.
> I.e. mpl::if_ is similar to the language if, and mpl::int_ to the language int,
> except that mpl tools operate at compile time. In this case, I think, the
> semantics are sufficiently different (vector types instead of scalar) to
> warrant distinct and descriptive names.

I appreciate what I assume Kyle was trying to do. These correspond 1:1 to the types used in OpenCL kernels, so I think he wanted the Boost.Compute typedefs to be as similar as possible.

I think this approach is justified, as their meaning will be obvious to anyone familiar with OpenCL. I only meant to raise the question of how they were mangled.


Matt


Gruenke,Matt
2014-12-29 00:46:34 UTC
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 14:42
To: ***@lists.boost.org List
Subject: Re: [boost] [compute] review

>On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:

>> I agree with other comments made about synchronization. The design should
>> be more explicit about what's asynchronous,

> Like I mentioned before, there is only one method for asynchrony in Boost.Compute,
> the command queue abstraction provided by OpenCL. Operations are enqueued to be
> executed on the compute device and this occurs asynchronously with respect to code
> executing on the host. The exceptions to this rule are functions which interact
> directly with host-memory which by default are blocking and offer explicit
> "_async()" versions (in order to prevent potential race-conditions on host-memory).

>> Regarding synchronization, I'm also a bit concerned about the performance impact
>> of synchronizing on all copies to host memory. Overuse of synchronization can
>> easily result in performance deterioration. On this point, I think it might be
>> worth limiting the host memory usable with algorithms to containers that perform
>> implicit synchronization upon actual use (or destruction) of results. Give users
>> the choice between that or performing explicit copies to raw types.

> To be clear, not all copies involving host memory are synchronized.
> Boost.Compute allows both synchronous and asynchronous memory transfers between the
> host and device.

My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?


>> Also, I agree with Thomas M that it'd be useful for operations to return events.

> All asynchronous operations in the command queue class do return events. One of his
> comments was to also return events from the synchronous methods for consistency and
> I am working on adding this.

Well, what I had in mind was events for higher-order operations, like boost::compute::transform().


Matt


Kyle Lutz
2014-12-29 01:35:37 UTC
On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt <***@tycoint.com> wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
> Sent: Sunday, December 28, 2014 14:42
> To: ***@lists.boost.org List
> Subject: Re: [boost] [compute] review
>
>>On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
>
>>> I agree with other comments made about synchronization. The design should
>>> be more explicit about what's asynchronous,
>
>> Like I mentioned before, there is only one method for asynchrony in Boost.Compute,
>> the command queue abstraction provided by OpenCL. Operations are enqueued to be
>> executed on the compute device and this occurs asynchronously with respect to code
>> executing on the host. The exceptions to this rule are functions which interact
>> directly with host-memory which by default are blocking and offer explicit
>> "_async()" versions (in order to prevent potential race-conditions on host-memory).
>
>>> Regarding synchronization, I'm also a bit concerned about the performance impact
>>> of synchronizing on all copies to host memory. Overuse of synchronization can
>>> easily result in performance deterioration. On this point, I think it might be
>>> worth limiting the host memory usable with algorithms to containers that perform
>>> implicit synchronization upon actual use (or destruction) of results. Give users
>>> the choice between that or performing explicit copies to raw types.
>
>> To be clear, not all copies involving host memory are synchronized.
>> Boost.Compute allows both synchronous and asynchronous memory transfers between the
>> host and device.
>
> My understanding, based on comments you've made to other reviewers, is that functions like boost::compute::transform() are asynchronous when the result is on the device, but block when the result is on the host. This is what I'm concerned about. Is it true?

Yes, this is correct. In general, algorithms like transform() are
asynchronous when the input/output ranges are both on the device and
synchronous when one of the ranges is on the host. I'll work on better
ways to allow asynchrony in the latter case. One of my current ideas
is to add asynchronous memory-mapping support to the mapped_view class
[1] which can then be used with any of the algorithms in an
asynchronous fashion.
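
For example, mapped_view can already be used with the algorithms like this; the asynchronous-mapping part is what's missing today:

#include <vector>

#include <boost/compute/algorithm/transform.hpp>
#include <boost/compute/container/mapped_view.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/functional/math.hpp>

namespace compute = boost::compute;

void square_roots(compute::context& ctx, compute::command_queue& queue)
{
    std::vector<float> host(1024, 2.0f);

    // wraps the host memory for device use, without an explicit copy
    compute::mapped_view<float> view(&host[0], host.size(), ctx);

    compute::transform(view.begin(), view.end(), view.begin(),
                       compute::sqrt<float>(), queue);

    queue.finish(); // synchronization is still explicit here today
}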

>>> Also, I agree with Thomas M that it'd be useful for operations to return events.
>
>> All asynchronous operations in the command queue class do return events. One of his
>> comments was to also return events from the synchronous methods for consistency and
>> I am working on adding this.
>
> Well, what I had in mind was events for higher-order operations, like boost::compute::transform().

Yes, I would also like to have higher-level support for chaining
together algorithms asynchronously. However, designing a generic and
generally useful API for this is a complex task and may take some time
(I've shied away from just adding an extra "_async()" function for
all of the algorithm APIs as I think it could be done
better/more extensibly). Any ideas/proposals for this would be great
to hear.

-kyle

[1] http://kylelutz.github.io/compute/boost/compute/mapped_view.html

Gruenke,Matt
2014-12-29 02:16:27 UTC
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 20:36
To: ***@lists.boost.org List
Subject: Re: [boost] Synchronization (RE: [compute] review)

> On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt wrote:

>> My understanding, based on comments you've made to other reviewers, is
>> that functions like boost::compute::transform() are asynchronous when
>> the result is on the device, but block when the result is on the host.
>> This is what I'm concerned about. Is it true?

> Yes this is correct. In general, algorithms like transform() are asynchronous
> when the input/output ranges are both on the device and synchronous when one
> of the ranges is on the host. I'll work on better ways to allow asynchrony in
> the latter case. One of my current ideas is to add asynchronous memory-mapping
> support to the mapped_view class [1] which can then be used with any of the
> algorithms in an asynchronous fashion.

Why block when only the source is on the host? Are you worried it might go out of scope?

If so, that's actually not a bad point. I was just pondering how to write exception-safe code using local host datastructures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after it's out of scope. Not the only way that comes to mind (nor the most efficient), but it does the job.


Matt


Kyle Lutz
2014-12-29 02:23:33 UTC
Permalink
On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt <***@tycoint.com> wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
> Sent: Sunday, December 28, 2014 20:36
> To: ***@lists.boost.org List
> Subject: Re: [boost] Synchronization (RE: [compute] review)
>
>> On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt wrote:
>
>>> My understanding, based on comments you've made to other reviewers, is
>>> that functions like boost::compute::transform() are asynchronous when
>>> the result is on the device, but block when the result is on the host.
>>> This is what I'm concerned about. Is it true?
>
>> Yes this is correct. In general, algorithms like transform() are asynchronous
>> when the input/output ranges are both on the device and synchronous when one
>> of the ranges is on the host. I'll work on better ways to allow asynchrony in
>> the latter case. One of my current ideas is to add asynchronous memory-mapping
>> support to the mapped_view class [1] which can then be used with any of the
>> algorithms in an asynchronous fashion.
>
> Why block when only the source is on the host? Are you worried it might go out of scope?
>
> If so, that's actually not a bad point. I was just pondering how to write exception-safe code using local host datastructures. I guess blocking on all operations involving them is a simple way to ensure nothing is read or written after it's out of scope. Not the only way that comes to mind (nor the most efficient), but it does the job.

Yes, that is one of the major motivations. Avoiding potential
race-conditions with host code accessing the memory at the same time
is another. I'd be very open to other solutions.

-kyle

Gruenke,Matt
2014-12-29 03:40:47 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 21:24
To: ***@lists.boost.org List
Subject: Re: [boost] Synchronization (RE: [compute] review)

>On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:

>> Why block when only the source is on the host? Are you worried it might go out of scope?
>
>> If so, that's actually not a bad point. I was just pondering how to write exception-safe
>> code using local host datastructures. I guess blocking on all operations involving them
>> is a simple way to ensure nothing is read or written after it's out of scope. Not the
>> only way that comes to mind (nor the most efficient), but it does the job.
>
> Yes, that is one of the major motivations. Avoiding potential race-conditions with host
> code accessing the memory at the same time is another. I'd be very open to other solutions.

I have some rough ideas, but they'd probably have a deeper impact on your API than you want, at this stage.

Instead, I'm thinking mostly about how to make exception-safe use of the async copy commands to/from host memory. I think async copies will quickly gain popularity with advanced users, and will probably be one of the top optimization tips.

I guess it'd be nice to have a scope guard that blocks on boost::compute::event.

std::vector<int> hv(16000000);
std::generate(hv.begin(), hv.end(), rand);
boost::compute::vector<int> dv(hv.size(), context);
boost::compute::guard wg(cq.enqueue_write_buffer_async(
    dv.get_buffer(), 0, hv.size() * sizeof(int), &hv.front()));

// do some device -> device operations

boost::compute::guard rg(cq.enqueue_read_buffer_async(
    dv.get_buffer(), 0, hv.size() * sizeof(int), &hv.front()));


[Short names used for the sake of email.]

boost::compute::guard would do nothing more than hold a reference to the event and call boost::compute::event::wait() upon destruction (if it contains a valid event at that point).

So, if an exception is thrown after wg is constructed, then the stack unwind blocks on its completion (or failure). Similarly, if the exception is thrown after rg, then the unwind blocks on both wg and rg.
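
A minimal sketch of the guard itself (names hypothetical, as above):

#include <boost/compute/event.hpp>

class guard
{
public:
    explicit guard(const boost::compute::event &e) : event_(e) {}

    // non-copyable: exactly one owner blocks on the event
    guard(const guard &) = delete;
    guard &operator=(const guard &) = delete;

    ~guard()
    {
        // block until the associated command has completed; a production
        // version would catch errors here rather than throw from a dtor
        try { event_.wait(); } catch (...) {}
    }

private:
    boost::compute::event event_;
};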

Obviously, this example is somewhat artificial. If the read were really the last operation in scope, then you could just use a synchronous copy. But, I think one strong use case for async copies is to free up device memory for subsequent operations. So, I might transfer something off, even if I'm not going to do anything with it, at that point.


Matt


Thomas M
2014-12-29 21:19:38 UTC
Permalink
On 29/12/2014 04:40, Gruenke,Matt wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
> Sent: Sunday, December 28, 2014 21:24
> To: ***@lists.boost.org List
> Subject: Re: [boost] Synchronization (RE: [compute] review)
>
>> On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
>
>>> Why block when only the source is on the host? Are you worried it might go out of scope?
>>
>>> If so, that's actually not a bad point. I was just pondering how to write exception-safe
>>> code using local host datastructures. I guess blocking on all operations involving them
>>> is a simple way to ensure nothing is read or written after it's out of scope. Not the
>>> only way that comes to mind (nor the most efficient), but it does the job.
>>
>> Yes, that is one of the major motivations. Avoiding potential race-conditions with host
>> code accessing the memory at the same time is another. I'd be very open to other solutions.

I find it truly confusing that an algorithm can run either synchronously
or asynchronously, without its signature clearly and loudly indicating
so. In template code (or in general) it can easily be more or less
unknown (or non-trivial to find out) whether the input/output ranges
refer to the host or the device, and thus whether the algorithm will
execute in synchronous or asynchronous mode -> and what that implies
for the rest of the code around the algorithm.

If I understood the behavior of transform correctly (and assuming that
for device_vector the input/output ranges count as device side [?]), am
I correct that the following can easily fail?:

compute::command_queue queue1(context, device);
compute::command_queue queue2(context, device);

compute::vector<float> device_vector(n, context);
// copy some data to device_vector

// use queue1
boost::compute::transform(device_vector.begin(), device_vector.end(),
                          device_vector.begin(),
                          compute::sqrt<float>(),
                          queue1);

// use queue2
compute::copy(device_vector.begin(), device_vector.end(),
              some_host_vector.begin(), queue2);

And currently the way to make this behave properly would be to force
queue1 to wait for completion of any enqueued job (note: it may be an
out-of-order queue!) after transform has been called?

One way could be to simply always treat algorithms as asynchronous at
the API level (even if internally they may run synchronously) and
always associate them with an event. Another is providing a synchronous
and an asynchronous overload. I'd certainly prefer to know whether it
runs synchronously or asynchronously just by looking at the transform
invocation itself.


With respect to exception safety, is there any proper behavior defined
by your library if transform has been enqueued to run in asynchronous
mode, but before it has completed device_vector goes out of scope (e.g.
due to an exception thrown in the host code following the transform)? Or
is it the user's responsibility to ensure that, whatever happens,
device_vector must live until the transform has completed?

>
> I have some rough ideas, but they'd probably have a deeper impact on your API than you want, at this stage.
>
> Instead, I'm thinking mostly about how to make exception-safe use of the async copy commands to/from host memory. I think async copies will quickly gain popularity with advanced users, and will probably be one of the top optimization tips.
> I guess it'd be nice to have a scope guard that blocks on boost::compute::event.

Here's another sketch, also considering the points above - though I
obviously don't know if it's doable given the implementation + other
design considerations I might miss, so apologies if it's nonsense.

If input/output ranges generally refer to iterators from the
boost::compute library, then:
-) an iterator can store the container (or other data structure it
belongs to, if any)
-) an algorithm can, via the iterators, "communicate" with the container(s)

For an input operation the data must be available, in unaltered form,
from the time of enqueuing the input operation until its completion.
So when transform (as an example) is launched it can inform the input
data container that before any subsequent modification of it occurs
(including destruction / setting new values through iterators) it must
wait until that input operation has completed - i.e. the first
modifying operation blocks until that has finished. Similarly for the
output range, except that there any read operation must also block
until the output data from the transform has been written to it. So:
-) no matter what causes the destruction of containers (e.g. regular
end-of-block reached, exception etc.) the lifetime of the
container/iterators extends until the asynchronous operation on it has
finished; thus exceptions thrown are implicitly handled.
-) to the user the code appears synchronous with respect to visible
behavior, but can run asynchronously in the background.

Obviously a full-fledged version is neither trivial nor cheap with
respect to performance (e.g. checking on any read/write to a container
whether it must block), let alone the threading aspects. But maybe just
parts of it are useful, e.g. deferring container destruction until no
OpenCL operation is enqueued to work on the container (-> handling
exceptions). I think there's a wide range of balances between what the
implementation does automagically and what constraints are placed on
the user to not do "stupid" things.

Thomas




Kyle Lutz
2014-12-29 21:51:10 UTC
Permalink
On Mon, Dec 29, 2014 at 1:19 PM, Thomas M <***@gmail.com> wrote:
> On 29/12/2014 04:40, Gruenke,Matt wrote:
>>
>> -----Original Message-----
>> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
>> Sent: Sunday, December 28, 2014 21:24
>> To: ***@lists.boost.org List
>> Subject: Re: [boost] Synchronization (RE: [compute] review)
>>
>>> On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
>>
>>
>>>> Why block when only the source is on the host? Are you worried it might
>>>> go out of scope?
>>>
>>>
>>>> If so, that's actually not a bad point. I was just pondering how to
>>>> write exception-safe
>>>> code using local host datastructures. I guess blocking on all
>>>> operations involving them
>>>> is a simple way to ensure nothing is read or written after it's out of
>>>> scope. Not the
>>>> only way that comes to mind (nor the most efficient), but it does the
>>>> job.
>>>
>>>
>>> Yes, that is one of the major motivations. Avoiding potential
>>> race-conditions with host
>>> code accessing the memory at the same time is another. I'd be very open
>>> to other solutions.
>
>
> I find it truly confusing that an algorithm can run either synchronously or
> asynchronously, without its signature clearly and loudly indicating so. In
> template code (or in general) it can easily be more or less unknown (or
> non-trivial to find out) whether the input/output ranges refer to the host
> or the device, and thus whether the algorithm will execute in synchronous
> or asynchronous mode -> and what that implies for the rest of the code
> around the algorithm.
>
> If I understood the behavior of transform correctly (and assuming that for
> device_vector the input/output ranges count as device side [?]), am I
> correct that the following can easily fail?:
>
> compute::command_queue queue1(context, device);
> compute::command_queue queue2(context, device);
>
> compute::vector<float> device_vector(n, context);
> // copy some data to device_vector
>
> // use queue1
> boost::compute::transform(device_vector.begin(), device_vector.end(),
> device_vector.begin(),
> compute::sqrt<float>(),
> queue1);
>
> // use queue2
> compute::copy(device_vector.begin(), device_vector.end(),
> some_host_vector.begin(), queue2);
>
> And currently the way to make this behave properly would be to force queue1
> to wait for completion of any enqueued job (note: it may be an out-of-order
> queue!) after transform has been called?

Well this is essentially equivalent to having two separate
host-threads both reading and writing from the same region of memory
at the same time; of course you need to synchronize them.

For this specific case you could just enqueue a barrier to ensure
queue2 doesn't begin its operation before queue1 completes:

// before calling copy() on queue2:
queue2.enqueue_barrier(queue1.enqueue_marker());

> One way could be to make algorithms simply always treated as asynchronous at
> API level (even if internally they may run synchronous) and get always
> associated with an event. Another is providing a synchronous and
> asynchronous overload. I'd certainly prefer to know if it runs synchronous
> or asynchronous just by looking at the transform invocation itself.

Well let me make this more clear: transform() always runs
asynchronously. The only algorithm you really have to worry about is
copy() as it is responsible for moving data between the host and
device and will do this synchronously. If you want an asynchronous
copy then use copy_async() which will return a future that can be used
to wait for the copy operation to complete.
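
For example (a sketch; hv is a host vector, dv a device vector, as in
the earlier snippets):

// the transfer runs in the background; copy_async() returns a future
auto f = boost::compute::copy_async(hv.begin(), hv.end(), dv.begin(), queue);

// ... enqueue further device-side work here ...

// block until the transfer has completed; only then may hv be touched
f.wait();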

> With respect to exception safety, is there any proper behavior defined by
> your library if transform has been enqueued to run in asynchronous mode, but
> before it has completed device_vector goes out of scope (e.g. due to an
> exception thrown in the host code following the transform)? Or is it the
> user's responsibility to ensure that, whatever happens, device_vector must
> live until the transform has completed?

The user must ensure that the memory being written to remains valid
until the operation completes. Simply imagine you are calling
std::transform() on a std::vector<> from a separate std::thread: you
must wait for that thread to complete its work before destroying the
memory it is writing to. Operations on the compute device can be
reasoned about in a similar manner.
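
In code, the host-only analogy looks like this ('op' standing in for
some callable):

std::vector<float> v(n);
std::thread t([&v, op] {
    // like an enqueued kernel writing to memory the host still owns
    std::transform(v.begin(), v.end(), v.begin(), op);
});

// 'v' must remain alive until the worker has finished
t.join(); // analogous to waiting on the operation's event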

>>
>> I have some rough ideas, but they'd probably have a deeper impact on your
>> API than you want, at this stage.
>>
>> Instead, I'm thinking mostly about how to make exception-safe use of the
>> async copy commands to/from host memory. I think async copies will quickly
>> gain popularity with advanced users, and will probably be one of the top
>> optimization tips.
>> I guess it'd be nice to have a scope guard that blocks on
>> boost::compute::event.
>
>
> Here's another sketch, also considering the points above - though I
> obviously don't know if it's doable given the implementation + other design
> considerations I might miss, so apologize if it's non-sense.
>
> If input/output ranges generally refer to iterators from the boost::compute
> library, then:
> -) an iterator can store the container (or other data structure it belongs
> to, if any)
> -) an algorithm can, via the iterators, "communicate" with the container(s)
>
> For an input operation the data must be available throughout & in unaltered
> manner from the time of enqueuing the input operation until its completion.
> So when transform (as example) is launched it can inform the input data
> container that before any subsequent modification of it to occur (including
> destruction / setting new values through iterators) it must wait until that
> input operation has completed - i.e. the first modifying operation blocks
> until that has finished. Similarly for the output range, just that for that
> also any read operation must block until the output data from the transform
> has been written to it. So:
> -) no matter what causes the destruction of containers (e.g. regularly
> end-of-block reached, exception etc.) the lifetime of the
> container/iterators extends until the asynchronous operation on it has
> finished; thus exceptions thrown are implicitly handled.
> -) to the user the code appears as synchronous with respect to visible
> behavior, but can run as asynchronous in the background.
>
> Obviously a full-fledged version is neither trivial nor cheap with respect
> to performance (e.g. checking any reads/writes to containers if it must
> block), let alone threading aspects. But maybe just parts of it are useful,
> e.g. deferring container destruction until no OpenCL operation is enqueued
> to work on the container (-> handling exceptions).
> I think there's a wide range for balances between what the implementation
> does automagically and what constraints are placed on the user to not do
> "stupid" things.

While these are interesting ideas, I feel like this sort of behavior
is more high-level/advanced than what the Boost.Compute algorithms are
meant to do. I have tried to mimic as closely as possible the
"iterators and algorithms" paradigm from the STL as I consider the
design quite elegant.

However, these sorts of techniques could definitely be implemented on
top of Boost.Compute. I would be very interested to see a
proof-of-concept demonstrating these ideas, would you be interested in
working on this?

-kyle

Thomas M
2014-12-29 22:58:42 UTC
Permalink
On 29/12/2014 22:51, Kyle Lutz wrote:
> On Mon, Dec 29, 2014 at 1:19 PM, Thomas M <***@gmail.com> wrote:
>> On 29/12/2014 04:40, Gruenke,Matt wrote:
>>>
>>> -----Original Message-----
>>> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
>>> Sent: Sunday, December 28, 2014 21:24
>>> To: ***@lists.boost.org List
>>> Subject: Re: [boost] Synchronization (RE: [compute] review)
>>>
>>>> On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
>>>
>>>
>>
>> If I understood the behavior of transform correctly (and assuming that for
>> device_vector the input/output ranges count as device side [?]), am I
>> correct that the following can easily fail?:
>>
>> compute::command_queue queue1(context, device);
>> compute::command_queue queue2(context, device);
>>
>> compute::vector<float> device_vector(n, context);
>> // copy some data to device_vector
>>
>> // use queue1
>> boost::compute::transform(device_vector.begin(), device_vector.end(),
>> device_vector.begin(),
>> compute::sqrt<float>(),
>> queue1);
>>
>> // use queue2
>> compute::copy(device_vector.begin(), device_vector.end(),
>> some_host_vector.begin(), queue2);
>>
>> And currently the way to make this behave properly would be to force queue1
>> to wait for completion of any enqueued job (note: it may be an out-of-order
>> queue!) after transform has been called?
>
> Well this is essentially equivalent to having two separate
> host-threads both reading and writing from the same region of memory
> at the same time, of course you need to synchronize them.
>
> For this specific case you could just enqueue a barrier to ensure
> queue2 doesn't begin its operation before queue1 completes:
>
> // before calling copy() on queue2:
> queue2.enqueue_barrier(queue1.enqueue_marker());

Sorry I haven't expressed myself well. Yes, surely I must synchronize
it; it's just that with the OpenCL API itself I can normally provide a
(pointer to a) cl_event when calling the clEnqueue... functions, which
can subsequently be used quite flexibly to coordinate other operations
(i.e. without going down to the command-queue level).

>
>> One way could be to make algorithms simply always treated as asynchronous at
>> API level (even if internally they may run synchronous) and get always
>> associated with an event. Another is providing a synchronous and
>> asynchronous overload. I'd certainly prefer to know if it runs synchronous
>> or asynchronous just by looking at the transform invocation itself.
>
> Well let me make this more clear: transform() always runs
> asynchronously. The only algorithm you really have to worry about is
> copy() as it is responsible for moving data between the host and
> device and will do this synchronously. If you want an asynchronous
> copy then use copy_async() which will return a future that can be used
> to wait for the copy operation to complete.

Now I am really confused :) In this thread I have read, quoting:

[Gruenke, Matt]:
"My understanding, based on comments you've made to other reviewers, is
that functions like boost::compute::transform() are asynchronous when
the result is on the device, but block when the result is on the host.
This is what I'm concerned about. Is it true?"

[Kyle Lutz]
"Yes this is correct. In general, algorithms like transform() are
asynchronous when the input/output ranges are both on the device and
synchronous when one of the ranges is on the host."

This made me believe that some iterators in your library turn
boost::compute::transform into a synchronous operation, and some into
an asynchronous one. So now I suppose that this is not the case?
In comparison to the OpenCL runtime execution model, can I consider
all your algorithms, except copy, as basically acting like
clEnqueueNDRangeKernel calls, that is always asynchronous?


>>
>> Here's another sketch, also considering the points above.
>
> While these are interesting ideas, I feel like this sort of behavior
> is more high-level/advanced than what the Boost.Compute algorithms are
> meant to do. I have tried to mimic as closely as possible the
> "iterators and algorithms" paradigm from the STL as I consider the
> design quite elegant.
>
> However, these sorts of techniques could definitely be implemented on
> top of Boost.Compute. I would be very interested to see a
> proof-of-concept demonstrating these ideas, would you be interested in
> working on this?

Interested yes, time is currently a problem though; I'd need to
familiarize myself much more deeply with your implementation.
At this stage my main concern is exception safety - how one could
relieve users in a simplistic manner from the need to manually take
care that objects do not go out of scope (due to an exception thrown)
while an OpenCL asynchronous operation still needs them. Note that
because your API can throw I consider exception effects to be of much
greater concern than with the (implicitly non-throwing) Khronos API;
and just enqueuing a couple of commands can easily make a proper,
manual cleanup non-trivial.

cheers,
Thomas



Kyle Lutz
2014-12-29 23:49:35 UTC
Permalink
On Mon, Dec 29, 2014 at 2:58 PM, Thomas M <***@gmail.com> wrote:
> On 29/12/2014 22:51, Kyle Lutz wrote:
>>
>> On Mon, Dec 29, 2014 at 1:19 PM, Thomas M <***@gmail.com> wrote:
>>>
>>> On 29/12/2014 04:40, Gruenke,Matt wrote:
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle
>>>> Lutz
>>>> Sent: Sunday, December 28, 2014 21:24
>>>> To: ***@lists.boost.org List
>>>> Subject: Re: [boost] Synchronization (RE: [compute] review)
>>>>
>>>>> On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt wrote:
>>>>
>>>>
>>>>
>>>
>>> If I understood the behavior of transform correctly (and assuming that
>>> for
>>> device_vector the input/output ranges count as device side [?]), am I
>>> correct that the following can easily fail?:
>>>
>>> compute::command_queue queue1(context, device);
>>> compute::command_queue queue2(context, device);
>>>
>>> compute::vector<float> device_vector(n, context);
>>> // copy some data to device_vector
>>>
>>> // use queue1
>>> boost::compute::transform(device_vector.begin(), device_vector.end(),
>>> device_vector.begin(),
>>> compute::sqrt<float>(),
>>> queue1);
>>>
>>> // use queue2
>>> compute::copy(device_vector.begin(), device_vector.end(),
>>> some_host_vector.begin(), queue2);
>>>
>>> And currently the way to make this behave properly would be to force
>>> queue1
>>> to wait for completion of any enqueued job (note: it may be an
>>> out-of-order
>>> queue!) after transform has been called?
>>
>>
>> Well this is essentially equivalent to having two separate
>> host-threads both reading and writing from the same region of memory
>> at the same time, of course you need to synchronize them.
>>
>> For this specific case you could just enqueue a barrier to ensure
>> queue2 doesn't begin its operation before queue1 completes:
>>
>> // before calling copy() on queue2:
>> queue2.enqueue_barrier(queue1.enqueue_marker());
>
>
> Sorry I haven't expressed myself well. Yes, surely I must synchronize it;
> it's just that with the OpenCL API itself I can normally provide a (pointer
> to a) cl_event when calling the clEnqueue... functions, which can
> subsequently be used quite flexibly to coordinate other operations (i.e.
> without going down to the command-queue level).

Yes, currently this functionality is not built into the algorithms
API. If you want to synchronize between multiple command queues you
must do so explicitly.

With the enqueue_marker() method it is quite easy to get an event
object at any point in time which can be used to synchronize with the
host or with other command queues.
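
For instance (same queues as above; the barrier-with-wait-list form
requires OpenCL 1.2):

// get an event marking everything enqueued on queue1 so far
boost::compute::event marker = queue1.enqueue_marker();

// gate queue2 on it ...
queue2.enqueue_barrier(boost::compute::wait_list(marker));

// ... or synchronize with the host directly
marker.wait();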

>>
>>> One way could be to make algorithms simply always treated as asynchronous
>>> at
>>> API level (even if internally they may run synchronous) and get always
>>> associated with an event. Another is providing a synchronous and
>>> asynchronous overload. I'd certainly prefer to know if it runs
>>> synchronous
>>> or asynchronous just by looking at the transform invocation itself.
>>
>>
>> Well let me make this more clear: transform() always runs
>> asynchronously. The only algorithm you really have to worry about is
>> copy() as it is responsible for moving data between the host and
>> device and will do this synchronously. If you want an asynchronous
>> copy then use copy_async() which will return a future that can be used
>> to wait for the copy operation to complete.
>
>
> Now I am really confused :) In this thread I have read, quoting:
>
> [Gruenke, Matt]:
> "My understanding, based on comments you've made to other reviewers, is that
> functions like boost::compute::transform() are asynchronous when the result
> is on the device, but block when the result is on the host. This is what I'm
> concerned about. Is it true?"
>
> [Kyle Lutz]
> "Yes this is correct. In general, algorithms like transform() are
> asynchronous when the input/output ranges are both on the device and
> synchronous when one of the ranges is on the host."
>
> This made me believe that some iterators in your library turn
> boost::compute::transform into a synchronous operation, and some into
> an asynchronous one. So now I suppose that this is not the case?
> In comparison to the OpenCL runtime execution model can I consider all your
> algorithms, except copy, basically acting like clEnqueueNDRangeKernel calls,
> that is always asynchronous?

Sorry, I should have prefaced that statement with "For algorithms
which accept both host and device iterators, ...". Currently there are
very few of those (copy() and sort(), and reduce() for its output, can
use host iterators). In general, to maximize efficiency, the user
should deal almost entirely with device iterators and only synchronize
with the host at the beginning to transfer input and at the end to
transfer back the output.

And you're correct, nearly all calls to Boost.Compute algorithms
result in creating a kernel, setting its arguments, and then calling
clEnqueueNDRangeKernel() to execute it asynchronously.
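
In other words, a simplified sketch of what an algorithm call does
internally:

// build a kernel from generated source and set its arguments ...
boost::compute::kernel k = ...; // e.g. via program::build_with_source()
k.set_arg(0, dv.get_buffer());

// ... then enqueue it; this call returns immediately
queue.enqueue_1d_range_kernel(k, 0, dv.size(), 0);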

>>>
>>> Here's another sketch, also considering the points above.
>>
>>
>> While these are interesting ideas, I feel like this is sort of
>> behavior is more high-level/advanced than what the Boost.Compute
>> algorithms are meant to do. I have tried to mimic as close as possible
>> the "iterators and algorithms" paradigm from the STL as I consider the
>> design quite elegant.
>>
>> However, these sorts of techniques could definitely be implemented on
>> top of Boost.Compute. I would be very interested to see a
>> proof-of-concept demonstrating these ideas, would you be interested in
>> working on this?
>
>
> Interested yes, time is currently a problem though; I'd need to familiarize
> myself much deeper with your implementation.
> At this stage my main concern is exception safety - how one could relieve
> users in a simplistic manner from the need to manually take care that
> objects do not go out of scope (due to an exception thrown) while an OpenCL
> asynchronous operation still needs them. Note that because your API can
> throw I consider exception effects to be of much greater concern than with
> the (implicitly non-throwing) Khronos API; and just enqueuing a couple of
> commands can make a proper, manual cleanup really easily non-trivial.

I have tried as much as possible to ensure that the library is
exception-safe (I surely hope you aren't assuming that I intentionally
made the library non-exception-safe). If you do find any bugs related
to exception handling please do submit them with a reproducible
test-case to the issue tracker [1] and I will get them fixed as soon
as possible. Also, in case you were unaware, you can always disable
all exceptions in Boost.Compute by defining BOOST_NO_EXCEPTIONS.

-kyle

[1] https://github.com/kylelutz/compute/issues

Thomas M
2014-12-30 08:31:37 UTC
Permalink
On 30/12/2014 00:49, Kyle Lutz wrote:
> On Mon, Dec 29, 2014 at 2:58 PM, Thomas M <***@gmail.com> wrote:
>> On 29/12/2014 22:51, Kyle Lutz wrote:
>>>
>>> On Mon, Dec 29, 2014 at 1:19 PM, Thomas M <***@gmail.com> wrote:
>>>>

>>
>>
>> Interested yes, time is currently a problem though; I'd need to familiarize
>> myself much deeper with your implementation.
>> At this stage my main concern is exception safety - how one could relieve
>> users in a simplistic manner from the need to manually take care that
>> objects do not go out of scope (due to an exception thrown) while an OpenCL
>> asynchronous operation still needs them. Note that because your API can
>> throw I consider exception effects to be of much greater concern than with
>> the (implicitly non-throwing) Khronos API; and just enqueuing a couple of
>> commands can make a proper, manual cleanup really easily non-trivial.
>
> I have tried as much as possible to ensure that the library is
> exception-safe (I surely hope you aren't assuming that I intentionally
> made the library non-exception-safe). If you do find any bugs related
> to exception handling please do submit them with a reproducible
> test-case to the issue tracker [1] and I will get them fixed as soon
> as possible.

No, I never assumed you made it intentionally unsafe. However you, as
implementer, probably know way more than users about when something
may throw (and how things then behave), and thus which exception
guarantees your library implicitly provides. Is there a docs section
which explicitly says under which circumstances something may throw
and how the subsequent behavior, e.g. of asynchronous operations
already launched, is defined?

I take your transform tutorial code as example:

// create a vector on the device
compute::vector<float> device_vector(host_vector.size(), context);

// transfer data from the host to the device
compute::copy(host_vector.begin(), host_vector.end(),
              device_vector.begin(), queue);

// calculate the square-root of each element in-place
compute::transform(device_vector.begin(),
                   device_vector.end(),
                   device_vector.begin(),
                   compute::sqrt<float>(),
                   queue);

// copy values back to the host
compute::copy(device_vector.begin(), device_vector.end(),
              host_vector.begin(), queue);

For neither the copy nor the transform algorithm does the doc
explicitly say that it doesn't throw anything (apologies if it says so
in another doc section), so for exception-safe code as a user I must
assume the worst, i.e. any of them can throw (for the moment assume
that I don't choose to disable throwing by #define). So I must ensure
that device_vector will be kept alive until all commands which were
successfully launched could finish; that appears to boil down to
something like a try around the copy-transform-copy, and a
queue.finish() (or similar) in the catch. If there is more than one
container / command queue etc. involved, the complexity of sorting
things out for proper cleanup can rise considerably; likewise when,
within the program hierarchy, the creation/lifetime management of a
container on the one hand and the command-queue ownership / invocation
of operations on the container on the other hand are separated (there
may simply be no natural central place which can control both the
container lifetime and also have access to the command queue(s)).
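
In code, the minimal manual pattern would be something like:

try {
    compute::copy(host_vector.begin(), host_vector.end(),
                  device_vector.begin(), queue);
    compute::transform(device_vector.begin(), device_vector.end(),
                       device_vector.begin(),
                       compute::sqrt<float>(), queue);
    compute::copy(device_vector.begin(), device_vector.end(),
                  host_vector.begin(), queue);
}
catch (...) {
    // drain the queue so that no still-running command touches
    // device_vector once stack unwinding destroys it
    queue.finish();
    throw;
}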

Note the aim is not to desperately get the results of the transform
somehow, just to ensure that the OpenCL runtime is not doing anything
disastrous in the asynchronous background, i.e. here accessing memory
that has already gone out of scope due to stack unwinding. If your
implementation already provides some exception-safety guarantees in
the example then these must become easily visible to users.

Thomas


Rob Stewart
2014-12-30 10:01:28 UTC
Permalink
On December 29, 2014 6:49:35 PM EST, Kyle Lutz <***@gmail.com> wrote:
>On Mon, Dec 29, 2014 at 2:58 PM, Thomas M <***@gmail.com> wrote:
>
>> Note that because your API can throw I consider exception effects to
>> be of much greater concern than with the (implicitly non-throwing)
>> Khronos API; and just enqueuing a couple of commands can make a
>> proper, manual cleanup really easily non-trivial.
>
>I have tried as much as possible to ensure that the library is
>exception-safe (I surely hope you aren't assuming that I intentionally
>made the library non-exception-safe). If you do find any bugs related
>to exception handling please do submit them with a reproducible
>test-case to the issue tracker [1] and I will get them fixed as soon
>as possible. Also, in case you were unaware, you can always disable
>all exceptions in Boost.Compute by defining BOOST_NO_EXCEPTIONS.

BOOST_NO_EXCEPTIONS invokes a global handler to note a failure. That means after every library call, the user must check a global value to determine if there was an error. Overloading with a Boost.System error_code localizes error checking. (Asio does this, for example.) I consider that a superior way to avoid exception overhead.

The question is whether exception overhead is a bottleneck in coding with this library's API. It is necessary to weigh the likelihood of errors against the costs of reporting and detecting them. Conditional logic to check for errors is not pipelining friendly, but throwing exceptions is terribly expensive.
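
A purely hypothetical sketch (Boost.Compute has no such overload today) of what an Asio-style overload would look like here:

// hypothetical signature, mirroring Asio's error_code overloads:
// event command_queue::enqueue_marker(boost::system::error_code &ec);

boost::system::error_code ec;
boost::compute::event e = queue.enqueue_marker(ec);
if (ec) {
    // handle the failure locally; no exception is thrown
}
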
--
Rob

(Sent from my portable computation engine)

Gruenke,Matt
2014-12-30 10:58:10 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Rob Stewart
Sent: Tuesday, December 30, 2014 5:01
To: ***@lists.boost.org
Subject: Re: [boost] Synchronization (RE: [compute] review)

> On December 29, 2014 6:49:35 PM EST, Kyle Lutz wrote:

>> Also, in case you were unaware, you can always disable all
>> exceptions in Boost.Compute by defining BOOST_NO_EXCEPTIONS.

> BOOST_NO_EXCEPTIONS invokes a global handler to note a failure. That means
> after every library call, the user must check a global value to determine
> if there was an error. Overloading with a Boost.System error_code localizes
> error checking. (Asio does this, for example.) I consider that a superior
> way to avoid exception overhead.

Either way, I don't see how you avoid checking whether an error occurred after each operation. And when you do detect an error, then you can certainly skip some work, but you're still stuck waiting for the non-failed operations to complete before you can destroy any of the datastructures they're using. And that would be much simpler with a good contingent of RAII-friendly constructs, like those we're discussing.

Therefore, I don't see the issue of exceptions vs. error codes (nor the mechanism of error code access) being relevant to the discussion of synchronization. The synchronization support needs to be there, no matter what. And it should be relatively unobtrusive and easy to use.


> The question is whether exception overhead is a bottleneck in coding with this
> library's API. It is necessary to weigh the likelihood of errors against the
> costs of reporting and detecting them. Conditional logic to check for errors
> is not pipelining friendly, but throwing exceptions is terribly expensive.

By comparison with network programming (since you cited Boost.ASIO), GPU/OpenCL errors should be far less common. Therefore, error handling needn't be optimized for performance. I think our only concern is successful recovery. Indeed, as OpenCL provides no cancelation mechanism, you cannot avoid waiting around for any partial results to be computed that haven't failed.

BTW, the types of operations generating these errors typically involve communication with a device or at least a couple syscalls. So, the kind of ALU pipeline effects of conditional statements that you mention aren't relevant, here.


Matt


Rob Stewart
2014-12-31 09:47:13 UTC
Permalink
On December 30, 2014 5:58:10 AM EST, "Gruenke,Matt" <***@Tycoint.com> wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Rob
> Stewart
>
> > BOOST_NO_EXCEPTIONS invokes a global handler to note a failure. That
> > means after every library call, the user must check a global value to
> > determine if there was an error. Overloading with a Boost.System
> > error_code localizes error checking. (Asio does this, for example.) I
> > consider that a superior way to avoid exception overhead.
>
> Either way, I don't see how you avoid checking whether an error
> occurred after each operation.

You don't, but one is local to the last function call and the other is a global value.

> And when you do detect an error, then
> you can certainly skip some work, but you're still stuck waiting for
> the non-failed operations to complete before you can destroy any of
> the datastructures they're using. And that would be much simpler with
> a good contingent of RAII-friendly constructs, like those we're
> discussing.

Okay

> Therefore, I don't see the issue of exceptions vs. error codes (nor
> the mechanism of error code access) being relevant to the discussion
> of synchronization.

People were looking for ways to avoid exceptions. Whether that was to support embedded environments without exception support or to avoid exception overhead, I don't know.

> The synchronization support needs to be there, no
> matter what. And it should be relatively unobtrusive and easy to use.

Sure

> > The question is whether exception overhead is a bottleneck in coding
> > with this library's API. It is necessary to weigh the likelihood of
> > errors against the costs of reporting and detecting them. Conditional
> > logic to check for errors is not pipelining friendly, but throwing
> > exceptions is terribly expensive.
>
> By comparison with network programming (since you cited Boost.ASIO),
> GPU/OpenCL errors should be far less common. Therefore, error
> handling needn't be optimized for performance.

What about embedded programming environments without exception support?

> BTW, the types of operations generating these errors typically involve
> communication with a device or at least a couple syscalls. So, the
> kind of ALU pipeline effects of conditional statements that you
> mention aren't relevant, here.

They still apply if the error checking is done on the host side.

___
Rob

(Sent from my portable computation engine)

Gruenke,Matt
2014-12-29 23:56:19 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
Sent: December 29, 2014 16:20
To: ***@lists.boost.org
Subject: Re: [boost] Synchronization (RE: [compute] review)

> If input/output ranges generally refer to iterators from the boost::compute library, then:
> -) an iterator can store the container (or other data structure it belongs to, if any)
> -) an algorithm can, via the iterators, "communicate" with the container(s)
>
> For an input operation the data must be available throughout & in unaltered manner from
> the time of enqueuing the input operation until its completion.

Essentially, what I think you want is to embed the equivalent of a shared_ptr to the underlying storage. Event callbacks could be used to decrement the reference count when the operation has completed. That way, even if the container object is deleted, the device memory sticks around so long as any pending operations reference it.
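
A sketch of that mechanism, using the raw clSetEventCallback() from
OpenCL 1.1 (types and names are illustrative only):

#include <memory>
#include <boost/compute/core.hpp>

namespace compute = boost::compute;

// runs when OpenCL reports completion of the command
void CL_CALLBACK release_ref(cl_event, cl_int, void *user_data)
{
    delete static_cast<std::shared_ptr<compute::buffer>*>(user_data);
}

void enqueue_guarded_read(compute::command_queue &cq,
                          std::shared_ptr<compute::buffer> storage,
                          size_t size, void *host_ptr)
{
    compute::event e =
        cq.enqueue_read_buffer_async(*storage, 0, size, host_ptr);

    // the extra shared_ptr copy keeps the buffer alive until completion
    clSetEventCallback(e.get(), CL_COMPLETE, release_ref,
                       new std::shared_ptr<compute::buffer>(storage));
}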

For host data structures, it'd be nice if they could be restricted to iterators of a special container - let's call it host_mem<>, for the sake of discussion. Unlike mapped_view<>, you could only copy into it, so that it owns the memory. Reads & destruction could block on any pending operations that have been enqueued with it as a destination.

If you don't want to pay the penalty of copying into a host_mem<> object, then just do an explicit command_queue::enqueue_write_buffer() or command_queue::enqueue_write_buffer_async(). Another benefit of requiring host_mem<> is that the host -> host copy is explicit.
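
A skeleton of what host_mem<> might look like (hypothetical, just to
fix ideas):

template<class T>
class host_mem
{
public:
    // copies the caller's data, so host_mem<> owns this memory
    host_mem(const T *data, size_t n) : data_(data, data + n) {}

    // destruction blocks until every pending operation has completed
    ~host_mem() { wait_pending(); }

    // reads block while enqueued writes into us are still in flight
    const T *data() const { wait_pending(); return &data_[0]; }

    // enqueue functions targeting us would record their events here
    void add_pending(const boost::compute::event &e)
    { pending_.push_back(e); }

private:
    void wait_pending() const
    {
        for (std::size_t i = 0; i < pending_.size(); ++i)
            pending_[i].wait();
        pending_.clear();
    }

    std::vector<T> data_;
    mutable std::vector<boost::compute::event> pending_;
};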


BTW, for command_queue-level operations, Kyle seems to agree with my proposal to have a sort of guard object that calls event::wait(), upon destruction. I actually wonder if it makes sense to call this a 'guarantee', since it doesn't quite behave like traditional guard objects. He's even accepted my offer to code this up.


Matt


Thomas M
2014-12-30 07:48:42 UTC
Permalink
On 30/12/2014 00:56, Gruenke,Matt wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
> Sent: December 29, 2014 16:20
> To: ***@lists.boost.org
> Subject: Re: [boost] Synchronization (RE: [compute] review)
>
>> If input/output ranges generally refer to iterators from the boost::compute library, then:
>> -) an iterator can store the container (or other data structure it belongs to, if any)
>> -) an algorithm can, via the iterators, "communicate" with the container(s)
>>
>> For an input operation the data must be available throughout & in unaltered manner from
>> the time of enqueuing the input operation until its completion.
>
> Essentially, what I think you want is to embed the equivalent of a shared_ptr to the underlying storage. Event callbacks could be used to decrement it, when the operation has completed. That way, even if the container object is deleted, the device memory sticks around so long as any pending operations reference it.

yes, I was thinking of a shared_ptr for the implementation part.

> BTW, for command_queue-level operations, Kyle seems to agree with my proposal to have a sort of guard object that calls event::wait(), upon destruction. I actually wonder if it makes sense to call this a 'guarantee', since it doesn't quite behave like traditional guard objects. He's even accepted my offer to code this up.

If you want to code that functionality up that's more than fine with me.
I am not happy though with the need to explicitly, i.e. manually,
associate guards with command-queue operations just for the sake of
exception-safety (I am referring only to that here). And out of the box
I cannot come up with a single example when a memory container is fine
to die while OpenCL operations are still running on it or enqueued to
run on it, so that's why I prefer a mechanism which does it always,
implicitly (unless maybe the user wants to take it over explicitly).

Your proposal looks like:

boost::compute::guard wg(cq.enqueue_write_buffer_async(dv, 0, hv.size(),
&hv.front()));

which appears to me quite lengthy, and more error-prone (forgetting
the guard), as a way to get exception safety.

Now assume that an iterator can store its underlying container; isn't
it possible that enqueue_write_buffer_async itself places some
guard-like structure into the container, and when the container's
destructor is invoked it calls wait() on all of them?

Thomas

Gruenke,Matt
2014-12-30 08:55:53 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
Sent: Tuesday, December 30, 2014 2:49
To: ***@lists.boost.org
Subject: Re: [boost] Synchronization (RE: [compute] review)


>> BTW, for command_queue-level operations, Kyle seems to agree with my proposal
>> to have a sort of guard object that calls event::wait(), upon destruction.
>> I actually wonder if it makes sense to call this a 'guarantee', since it
>> doesn't quite behave like traditional guard objects. He's even accepted my
>> offer to code this up.
>
> If you want to code that functionality up that's more than fine with me.
> I am not happy though with the need to explicitly, i.e. manually, associate
> guards with command-queue operations just for the sake of exception-safety
> (I am referring only to that here). And out of the box I cannot come up with
> a single example when a memory container is fine to die while OpenCL operations
> are still running on it or enqueued to run on it, so that's why I prefer a
>> mechanism which does it always, implicitly (unless maybe the user wants to
>> take it over explicitly).
>
> Your proposal looks like:
>
> boost::compute::guard wg(cq.enqueue_write_buffer_async(dv, 0, hv.size(), &hv.front()));
>
> which appears to me quite lengthy and more error-prone (forgetting the
> guard) to get exception safety.

Yes, except I'm now proposing to call it a 'guarantee', since it guarantees that the event will have completed within its lifetime.


> Now assume that an iterator can store its underlying container, isn't it possible
> that the enqueue_write_buffer_async itself places some guard-like structure into
> the container, and when the container's destructor is invoked it calls wait() on
> all of them?

Keep in mind that the guarantee is proposed to be used with command_queue operations, which accept only pointers and buffer objects - not iterators. I think it's not unreasonable if the low-level interface requires a bit more care to use, if that makes it more flexible and efficient.

For instance, what if the datastructures on which you're operating are protected from exceptions by some other means (i.e. tied to a scope outside the try block)? Are you still going to force the caller to pay the price of copying the data into a special container? To avoid such overheads, do they then have to drop down to the C interface, and forego all of the other benefits of Boost.Compute?

I agree with Kyle's approach of having a fairly direct and simple interface, at the low level, while trying to make the high-level interface easy & safe. I also agree with you that the high-level interface could be made both safer *and* more efficient, by restricting it to operate only on Boost.Compute containers with some additional facilities (i.e. refcounting in device containers and a guarantee in the host wrapper).

If he'd just buy into that, I think it'd give Boost.Compute all the safety, efficiency, flexibility, and ease of use that anyone needs. There are always going to be tradeoffs. I'd prefer that, rather than the library deciding how to strike the best balance, it give the users a set of good options and let them decide which combination best fits the bill.


Matt


Thomas M
2014-12-30 12:36:49 UTC
Permalink
On 30/12/2014 09:55, Gruenke,Matt wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
> Sent: Tuesday, December 30, 2014 2:49
> To: ***@lists.boost.org
> Subject: Re: [boost] Synchronization (RE: [compute] review)
>
>
>>> BTW, for command_queue-level operations, Kyle seems to agree with my proposal
>>> to have a sort of guard object that calls event::wait(), upon destruction.
>>> I actually wonder if it makes sense to call this a 'guarantee', since it
>>> doesn't quite behave like traditional guard objects. He's even accepted my
>>> offer to code this up.
>>
>> If you want to code that functionality up that's more than fine with me.
>> I am not happy though with the need to explicitly, i.e. manually, associate
>> guards with command-queue operations just for the sake of exception-safety
>> (I am referring only to that here). And out of the box I cannot come up with
>> a single example when a memory container is fine to die while OpenCL operations
>> are still running on it or enqueued to run on it, so that's why I prefer a
>> mechanism which does it always, implicitly (unless maybe the user wants to
>> take it over explicitly).
>>
>> Your proposal looks like:
>>
>> boost::compute::guard wg(cq.enqueue_write_buffer_async(dv, 0, hv.size(), &hv.front()));
>>
>> which appears to me quite lengthy and more error-prone (forgetting the
>> guard) to get exception safety.
>
> Yes, except I'm now proposing to call it a 'guarantee', since it guarantees that the event will have completed within its lifetime.
>
>
>> Now assume that an iterator can store its underlying container, isn't it possible
>> that the enqueue_write_buffer_async itself places some guard-like structure into
>> the container, and when the container's destructor is invoked it calls wait() on
>> all of them?
>
> Keep in mind that the guarantee is proposed to be used with command_queue operations, which accept only pointers and buffer objects - not iterators. I think it's not unreasonable if the low level interface requires a bit more care to use, if that makes it more flexible and efficient.

Ok I missed the non-iterator cases and see your point.

If you are going to implement such RAII guards here's a short wish-list
of features / guard classes:

a) make guards "transferable" across functions
b) a container of guards and/or a guard for a wait_list as a whole
c) a guard for a command-queue as a whole
[possibly guards for other classes as well]

a) + b) because something like this is really useful:

void foo()
{
    // setup all memory objects etc.

    some_guard guard;
    make_async_copy(..., guard);

    some_guard_container cont;
    // or cont is a wait_list on which a guard can be placed

    more_async_copies(..., cont);
    yet_more_async_copies(..., cont);
    even_more_async_copies(..., cont);
    cont.wait();

    some_guard guard2;
    make_async_copy(..., guard2);
}

The ...async_copy functions can thus launch asynchronous operations
but do not themselves wait on return; ownership of the initially
obtained guards gets passed on to foo. And a single guard can refer to
multiple events/asynchronous operations as a whole group.

With c) I have something like this in mind:

{
    some_cmd_queue_guard qu_guard(queue);

    queue.enqueue_write_buffer_async(...);
    transform(..., queue); // implicitly async
    queue.enqueue_read_buffer_async(...);

    // here automatic synchronization occurs
}

where the destructor of some_cmd_queue_guard calls queue.finish(), or
similar. So there is no need to place a separate guard on every single
operation: for regularly executing code automatic synchronization with
the host occurs at the end of the block, and in the case of an
exception automatic protection is given.

Thomas







Gruenke,Matt
2014-12-30 13:48:27 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
Sent: Tuesday, December 30, 2014 7:37
To: ***@lists.boost.org
Subject: Re: [boost] Synchronization (RE: [compute] review)

> If you are going to implement such RAII guards here's a short wish-list of features / guard classes:
>
> a) make guards "transferable" across functions

I agree they should be movable, but it makes no sense for them to be copyable.


> b) a container of guards and/or a guard for a wait_list as whole

Hmmm... I can see the benefits (convenience). I'd make it a different type, though.

I assume it should hold a reference to the list? Since the guarantee is designed to block when the wait_list goes out of scope, I think it's reasonable to assume its scope is a superset of the guarantee's.


> c) a guard for a command-queue as whole
> [possibly guards for other classes as well]

Why? Convenience?

Unless you're using it as a shorthand for waiting on individual events or wait_lists, there's no need. The command_queue is internally refcounted. When the refcount goes to zero, the destructor will block on all outstanding commands.


> a) + b) because something like this is really useful:

Um... how about this:

void foo()
{
    // setup all memory objects etc.

    wait_list wl;
    wait_list::guarantee wlg(wl);

    // send data to device
    wl.insert(cq.enqueue_write_buffer_async(devmem, 0, size, host_ptr));
    wl.insert(cq.enqueue_write_buffer_async(devmem2, 0, size, host_ptr2));

    // a kernel that reads devmem and devmem2 and writes to devmem
    wl.insert(cq.enqueue_task(kern, wl)); // Note: wl is copied by enqueue funcs

    // copy result back to host
    wl.insert(cq.enqueue_read_buffer_async(devmem, 0, size, host_ptr, wl));

    // wl.wait() would only be necessary if you wanted to access the results, here.


    // Enqueue an independent set of operations with another wait_list
    wait_list wl_b;
    wait_list::guarantee wlg_b(wl_b);

    // send data to device
    wl_b.insert(cq.enqueue_write_buffer_async(devmem_b, 0, size_b, host_ptr_b));

    // ...
}



> With c) I have something like this in mind:

What about this?

{
    command_queue cq(cntx, dev);
    command_queue::guarantee cqg(cq);

    cq.enqueue_write_buffer_async(devmem, 0, size, host_ptr);
    transform(..., cq); // implicitly async
    cq.enqueue_read_buffer_async(...);

    // here automatic synchronization occurs
}


It does presume that command_queues are local and tied to related batches of computations. Those assumptions won't always hold.


Matt


Peter Dimov
2014-12-30 13:56:56 UTC
Permalink
Thomas M wrote:

>> If you are going to implement such RAII guards here's a short wish-list
>> of features / guard classes:

I don't know much about the subject, but I did look (quickly) at SYCL, and
it appears to have taken lifetime issues into account, so maybe it would be
worth it to see how it solves the problem.


Gruenke,Matt
2014-12-30 14:24:49 UTC
Permalink
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Peter Dimov
> Sent: Tuesday, December 30, 2014 8:57
> To: ***@lists.boost.org
> Subject: Re: [boost] Synchronization (RE: [compute] review)
>
> Thomas M wrote:
>
>> If you are going to implement such RAII guards here's a short
>> wish-list of features / guard classes:
>
> I don't know much about the subject, but I did look (quickly) at SYCL, and it
> appears to have taken lifetime issues into account, so maybe it would
> be worth it to see how it solves the problem.

They seem to use an approach similar to the host_mem<> container I suggested (sans copying):

Memory objects can be constructed with or without attached host memory.
If no host memory is attached at the point of construction, then destruction
of that memory object is non-blocking. The user can optionally provide a
storage object which specifies the behaviour on construction and destruction
(see Storage section 2.7.1 below). If host memory is attached and the user
does not supply a storage object, then the default behaviour is followed,
which is that the destructor will block until any command groups operating
on the memory object have completed, then, if the contents of the memory
object is modified on a device those contents are copied back to host and
only then does the destructor return.


The reason I like copying is that, if the memory is owned by the container, you can implement future<>-like blocking when there are pending writes to it. Also, when copying into it, the data can be made contiguous. But I suppose copying might be a high price to pay for those benefits.
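
To sketch the future<>-like idea (host_mem<> is hypothetical, not a
library type; I'm assuming event::wait() and that a default-constructed
event wraps a null cl_event):

#include <cstddef>
#include <vector>
#include <boost/compute/event.hpp>

template <typename T>
class host_mem
{
public:
    explicit host_mem(std::size_t n) : m_storage(n) {}

    // record the event of the last device-to-host write into this storage
    void set_pending(const boost::compute::event& e) { m_pending = e; }

    // future<>-like: block until pending writes land, then expose the data
    T* data()
    {
        if (m_pending.get() != 0)  // non-null cl_event => write outstanding
            m_pending.wait();
        return &m_storage[0];      // owned, contiguous host storage
    }

private:
    std::vector<T> m_storage;
    boost::compute::event m_pending;
};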


BTW, I only intended guarantees as a low-level mechanism. It was always my belief that the high-level interface would best be served by a host_mem<> container and some form of protection (either blocking or refcounting) embedded in the device containers.


Matt


Vicente J. Botet Escriba
2014-12-30 14:08:21 UTC
Permalink
On 30/12/14 14:48, Gruenke,Matt wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
>
>> a) + b) because something like this is really useful:
>
> Um... how about this:
>
> [wait_list::guarantee example snipped; see Matt's post above]
>
Maybe you can follow the task_region design (See
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4088.pdf).
>> With c) I have something like this in mind:
> What about this?
>
> [command_queue::guarantee example snipped; see above]
>
> It does presume that command_queues are local and tied to related batches of computations. Those assumptions won't always hold.
The same here.

Best,
Vicente

Gruenke,Matt
2014-12-30 14:01:52 UTC
Permalink
Upon further consideration...

-----Original Message-----
From: Gruenke,Matt
Sent: Tuesday, December 30, 2014 8:49
To: ***@lists.boost.org
Subject: RE: [boost] Synchronization (RE: [compute] review)

> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
> Sent: Tuesday, December 30, 2014 7:37
> To: ***@lists.boost.org
> Subject: Re: [boost] Synchronization (RE: [compute] review)

>> c) a guard for a command-queue as a whole [possibly guards for other
>> classes as well]
>
> Why? Convenience?
>
> Unless you're using it as a shorthand for waiting on individual events or wait_lists,
> there's no need. The command_queue is internally refcounted. When the refcount goes
> to zero, the destructor will block on all outstanding commands.

I should've listened to myself more carefully. I think this means command_queue-level guarantees aren't necessary or useful, because you've either got:

1) A local command_queue, with a refcount of 1, in which case it will block upon destruction.

2) A copy of a nonlocal command_queue, in which case there may be unrelated commands in the queue.


It's up to Kyle if he wants it, but I'd skip the command_queue::guarantee, as it would likely lead to confusion, misuse, and generally sloppy code.


Matt


Thomas M
2014-12-30 15:33:08 UTC
Permalink
On 30/12/2014 15:01, Gruenke,Matt wrote:
> Upon further consideration...
>
>>> c) a guard for a command-queue as a whole [possibly guards for other
>>> classes as well]
>>
>> Why? Convenience?
>>
>> Unless you're using it as a shorthand for waiting on individual events or wait_lists,
>> there's no need. The command_queue is internally refcounted. When the refcount goes
>> to zero, the destructor will block on all outstanding commands.

To some degree it's convenience, that's correct. I find this both more
explicit and straightforward to code:

command_queue::guarantee cqg(cq);

cq.enqueue_write_buffer_async(...);
transform(..., cq);
cq.enqueue_read_buffer_async(...);
...

than

wait_list wl;
wait_list::guarantee wlg(wl);
wl.insert(cq.enqueue_write_buffer_async(...));
wl.insert(cq.enqueue_task(kern, wl));
wl.insert(cq.enqueue_read_buffer_async(...));

I don't see anything wrong with a compact form that also emphasizes the
grouped nature of all the following commands, if that's the situation.
I also don't really know what enqueue_task is supposed to look like
exactly, but either way the STL-ish similarity is lost.

>
> I should've listed to myself more carefully. I think this means command_queue-level guarantees aren't necessary or useful, because you've either got:
>
> 1) A local command_queue, with a refcount of 1, in which case it will block upon destruction.
>
> 2) A copy of a nonlocal command_queue, in which case there may be unrelated commands in the queue.

I cannot fully follow this. Let's presume the convenience aspect is
appreciated; then one reason to offer these guards is to make sure that
other objects, e.g. memory objects, still persist while an asynchronous
operation might use them:
std::vector<float> host_data(...);
cq.enqueue_write_buffer_async(..., &host_data.front());
// here run some algorithm(s)
cq.enqueue_read_buffer_async(..., &host_data.front());

If the command_queue is local then it must be constructed after
host_data; otherwise, at the time of the queue's destruction, i.e.
when it waits for the enqueued operations to finish, host_data is
already dead. To me an explicit guard that I can place flexibly into
any block/scope is far clearer (and safer) to handle than the
declaration ordering of "primary" objects.
As for the second point, I don't see how this alters the game. Every
scope must ensure that its relevant operations have completed before
the resources those operations require are released; if that means
commands enqueued earlier (outside) must also complete, then so be it.

I am not saying that such a command-queue guarantee is always the right
thing.

Thomas

Gruenke,Matt
2014-12-31 01:31:46 UTC
Permalink
Let's take another look at the big picture.

In its current form, the high-level interface of Boost.Compute has safe (if overkill) synchronization with host data structures. There's currently insufficient synchronization of device containers. Finally, the low-level interface lacks the RAII objects needed to facilitate exception-safe code.


In this thread, we've proposed:

1. Host memory containers, as a synchronization *optimization* of high-level operations involving host memory. This seems similar to the approach employed by SYCL (thanks to Peter Dimov).

2. You suggested device containers & their iterators could have the machinery to make them exception-safe. This could take the form of either refcounting, or synchronization.

3. RAII wrappers for events and wait_lists.


Remember, if Kyle simply adopts #2, then regardless of the status of #1, I believe the high-level interface will be safe. So, there should be no need to use #3 with high-level operations.

Considering the above, the existing synchronization behavior of command_queue, and the ordering dependency you mentioned (i.e. that all data structures would need to be in scope before the guarantee on the command_queue), I conclude that it's not necessary, and actually too error-prone, to synchronize at the command_queue level.

And regarding wait_lists, the reason I agreed with your suggestion to offer guarantees for them is that they're the primary mechanism for defining command execution order. Since I think most code using the low level interface will probably be using wait_lists anyhow, there's an undeniable convenience to offering guarantees at that level.


As far as I'm aware, Kyle has expressed casual interest in supporting #1 (but not forcing it, presumably because this would break the nice iterator interface that many people like). I don't know where he stands on #2. And he's agreed to seriously consider #3 (in the form of a patch I've offered to submit).

I think we should focus on #2. I'd really like to know his stance on it.


Matt


Kyle Lutz
2014-12-31 04:19:45 UTC
Permalink
On Tue, Dec 30, 2014 at 5:31 PM, Gruenke,Matt <***@tycoint.com> wrote:
> Let's take another look at the big picture.
>
> In this thread, we've proposed:
>
> 1. Host memory containers, as a synchronization *optimization* of high-level operations involving host memory.
>
> 2. You suggested device containers & their iterators could have the machinery to make them exception-safe. This could take the form of either refcounting, or synchronization.
>
> 3. RAII wrappers for events and wait_lists.
>
> [...]
>
> I think we should focus on #2. I'd really like to know his stance on it.

First off, I'd just like to thank you guys for bringing this issue up
and actually proposing solutions. Currently I'm leaning towards
solution #2 as well but I'll need some more time to think all of this
over. Also, solution #3 is the least-intrusive and could also suit
other use-cases so I think it would be worthwhile to pursue.

-kyle

Thomas M
2014-12-31 07:13:51 UTC
Permalink
On 31/12/2014 02:31, Gruenke,Matt wrote:
> Let's take another look at the big picture.
>
> [...]
>
> I think we should focus on #2. I'd really like to know his stance on it.

I don't consider the proposed ways to be mutually exclusive, or at least
not as mutually exclusive as I read your post to imply. To me it makes
most sense to offer users several options, at various abstraction levels,
so the choice can be made flexibly given the problem at hand.

With respect to the command_queue "guarantee" I stand by my opinion that
it is absolutely useful; I also don't see why it's more error-prone in
general (see below). FWIW, in my real-world code it's quite common to
execute a set of operations on an (in-order) command-queue as a batch
(copy input to device, run several kernels, copy output to host), and
yes, I then synchronize at the command-queue level at the end. Why
should I fiddle around with a number of individual events (paying
unnecessary OpenCL overhead to set them all up, plus littering my host
code) when all that matters is the execution of the set of operations
as a whole? Certainly not every problem is suited to this, but why make
things unnecessarily complicated where it applies?

With respect to error-proneness (e.g. regarding the lifetime of other
objects), how does a wait_list differ? All that matters is the point
at which the guarantee was created:

wait_list wl;
wait_list::guarantee wlg(wl);

wl.insert(cq.enqueue_write_buffer_async(...));
wl.insert(cq.enqueue_...);

if some object was created after wlg but used in a subsequent
asynchronous operation, then this blows up here just as much as it does
for a command-queue guarantee; in either case those event(s) must be
handled separately.

With respect to performance there can be two-fold advantages of
synchronizing at the command-queue level: firstly, as already mentioned,
not having to pay the OpenCL overhead for creating and waiting on a
number of individual events; and secondly, for an out-of-order command
queue, issuing explicit waits for all events, which must occur in
some order, may cause unnecessary rearrangement of the queue's
execution order (AFAIK it's the implementer's freedom whether it gives
preference to the just-issued wait or not).

I think there are very legitimate reasons why the plain OpenCL API
allows synchronization based on scalar events, arrays of events, or
other abstraction levels such as command queues. Why should C++-wrapped
guarantees cover some of these but exclude the others?

cheers,
Thomas


Gruenke,Matt
2014-12-31 08:15:10 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
Sent: Wednesday, December 31, 2014 2:14
To: ***@lists.boost.org
Subject: Re: [boost] Synchronization (RE: [compute] review)

> On 31/12/2014 02:31, Gruenke,Matt wrote:

> I don't consider the proposed ways to be mutually exclusive, or at least
> not as much as I interpret your post.

They're not, nor did I mean to imply they were. I think they're very much complementary. Ideally, we'd get all three. I consider #2 the highest priority, because writing exception-safe code without it would be very difficult and cumbersome. #3 is similarly valuable, if you're directly using the command_queue. #1 is "only" an optimization - not a correctness/safety issue.


> With respect to the command_queue "guarantee" I keep my opinion that I
> consider it absolutely useful; I also don't see why it's more error-prone
> in general (see below).

It's error-prone, because the guarantee must have a smaller scope than all of the containers it's intended to protect. You can't just instantiate new containers as you need them. That type of usage error isn't possible with event guarantees.

You're correct to point out that this applies equally to wait_list guarantees. Hence, I am withdrawing my support for them.


> With respect to performance there can be two-fold advantages of synchronizing
> at the command-queue level: firstly as already mentioned not having to pay the
> OpenCL overhead for creating and waiting on a number of individual events;

He's always creating the events. So, the only difference is whether you wait on them. Waiting should amount to a mutex-protected flag check (and a pthread_cond_wait(), if the event is incomplete). Locking and releasing mutexes is cheap, unless it actually blocks. In some tests I ran years ago, it was on the order of a hundred cycles on Linux, maybe less.
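
Schematically, waiting reduces to something like this (not any
particular implementation's code, just the pattern):

#include <pthread.h>

struct event_state
{
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            complete;
};

// what waiting on an event typically reduces to inside an implementation
void wait_on(event_state& e)
{
    pthread_mutex_lock(&e.mutex);
    while (!e.complete)                        // mutex-protected flag check
        pthread_cond_wait(&e.cond, &e.mutex);  // sleeps only if incomplete
    pthread_mutex_unlock(&e.mutex);
}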


> and secondly for an out-of-order command queue the issuing of explicit waits
> for all events, which must occur in some order, may cause unnecessary
> rearrangement/ordering of the queue's execution order (AFAIK it's the
> implementers freedom if it gives preference to the just-issued wait or not).

So, waiting on an event is going to change their scheduling? Call me a skeptic, but I doubt that. If it does happen, then a better approach would be to enqueue a marker and wait on that. Or just make an explicit call to command_queue::finish(), if you know there are no other commands in there.
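
e.g., with the raw API (a sketch; clEnqueueMarker is the OpenCL 1.1
entry point, renamed clEnqueueMarkerWithWaitList in 1.2):

cl_event marker;
clEnqueueMarker(queue.get(), &marker);  // marker completes after all prior commands
clWaitForEvents(1, &marker);
clReleaseEvent(marker);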

In any case, the issue is probably moot. First, it seems like Boost.Compute won't be supporting out-of-order command queues. Second, unless you're using async copies and avoiding algorithms that implicitly synchronize for other reasons (i.e. to return results by value), then you'll block on the results before you ever get near the end of scope (in the non-error case, that is).


I think our positions are fairly entrenched, at this point. But don't let me stop you from trying to convince Kyle. In the end, his is the only opinion that matters.


Matt


Thomas M
2014-12-31 09:04:49 UTC
Permalink
On 31/12/2014 09:15, Gruenke,Matt wrote:

>
> I think our positions are fairly entrenched, at this point. But don't let me stop you from trying to convince Kyle. In the end, his is the only opinion that matters.
>

Let me throw in a proposal to dissolve our entrenchment ;):

boost::compute::...::guarantee is a true guarantee and refers only to
events (or whatever can fulfill a true guarantee);

boost::compute::...::wait (... being e.g. an event, a wait_list, a
command_queue, all sorts of stuff) does basically the same, i.e. waits,
but the name makes clear it isn't a full guarantee. Thus it provides
convenience but doesn't prevent you from shooting yourself in the foot
if you are careless - sounds like perfect C++ philosophy ;).
I am very open to different names (wait, waiter, wait_lock, wait_on_me,
eaahhhh ... anything short + concise much appreciated).

best, Thomas


Gruenke,Matt
2014-12-31 10:00:00 UTC
Permalink
I just wanted to remind anyone following this that, if Kyle adopts #2 (embedding synchronization or refcounting in device memory containers), these waits/guards/guarantees/etc. wouldn't be of interest to the majority of users. So, I think it's a lot of debate over what's essentially a corner case. And event guarantees get the job done, while anything else is just easier to use (and misuse).

I hope you don't feel that's unfair, Thomas.


Matt


Thomas M
2014-12-31 11:51:05 UTC
Permalink
On 31/12/2014 11:00, Gruenke,Matt wrote:
> I just wanted to remind anyone following this that, if Kyle adopts #2 (embedding synchronization or refcounting in device memory containers), these waits/guards/guarantees/etc. wouldn't be of interest to the majority of users. So, I think it's a lot of debate over what's essentially a corner case. And event guarantees get the job done, while anything else is just easier to use (and misuse).

Let me put it this way: the library should, by itself (that is,
implicitly / hidden from the user), be as exception-safe as possible (I
think we agree on that). Whatever gets us closest to that is perfectly
fine with me. If I don't need to handle a single guarantee/wait etc.
myself, great. And as a side effect that wipes out probably the biggest
source of misuse (requiring the user to do something explicit in order
to be protected).

Currently I just cannot see, technically speaking, how this can be done
throughout, e.g. when the library interfaces with non-library objects
(like plain void* host memory) or with the raw OpenCL objects; hence
tools which help out here - now applied explicitly by the user - are
useful. Since there's a range of such interactions imaginable, with
different preconditions, the tools offered should likewise provide some
diversity. I am against forcing users onto a single but clumsy tool (an
all-powerful event-level guarantee) if the preconditions in a given
application can be much more relaxed and something higher-level does
the job as well.

cheers, Thomas

Thomas M
2014-12-31 17:11:12 UTC
Permalink
On 31/12/2014 12:51, Thomas M wrote:
> Since there's a range of such interactions imaginable, with
> different preconditions, equally the tools offered shall provide some
> diversity. I am against forcing users to a single but clumpsy tool (an
> all-powered event-level guarantee) if the preconditions in a given
> application can be much more relaxed and something higher-leveled does
> the job as well.

Matt, here's an example where, IMHO, a plain event-level guarantee as
the only tool won't do:

for (int i = 0; i < n; ++i)
{
    cq.enqueue_write_buffer_async(devmem[i], 0, size[i], host_ptr[i]);
}


If there's a true guarantee around the cq.enqueue_write_buffer_async:

for (int i = 0; i < n; ++i)
{
    boost::compute::guarantee guar(
        cq.enqueue_write_buffer_async(devmem[i], 0, size[i], host_ptr[i]));
}

then the asynchronous nature is totally lost (it blocks on every iteration).

If the guarantee can be "moved outside of the loop" to avoid the
blocking, something like:

boost::compute::guarantee_list guarList;
for (int i = 0; i < n; ++i)
{
    guarList.move_into(boost::compute::guarantee(
        cq.enqueue_write_buffer_async(devmem[i], 0, size[i], host_ptr[i])));
}

then it's not a guarantee in your sense any more, because I can
obviously also move it out to an improper scope, losing the protection.

The way to go is higher-level protection (e.g. waiting on a wait_list /
command_queue); the asynchronous nature of the copies is then kept, and
it is the user's responsibility to place the protection at an
"appropriate" scope.

Thomas



Gruenke,Matt
2014-12-31 19:32:37 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
Sent: December 31, 2014 12:11
To: ***@lists.boost.org
Subject: Re: [boost] Synchronization (RE: [compute] review)

> On 31/12/2014 12:51, Thomas M wrote:

> If the guarantee can be "moved outside of the loop" to avoid the blocking,
> something like:

It was always my intention that they be movable, settable, and default-constructible. That should allow an array (or scoped_array<>, if the size isn't known until runtime) to be defined outside the loop and initialized inside.
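
e.g. (a sketch, reusing the hypothetical movable guarantee from earlier
in this thread):

std::vector<guarantee> guards;
guards.reserve(n);
for (int i = 0; i < n; ++i)
{
    guards.emplace_back(
        cq.enqueue_write_buffer_async(devmem[i], 0, size[i], host_ptr[i]));
}
// all n transfers are complete once 'guards' leaves scope,
// whether normally or via an exception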


> then it's not a guarantee in your sense any more, because I can obviously also
> move it out to an improper scope, losing the protection.

It's still a guarantee - just not necessarily a guarantee of what you want. It guarantees that the event will complete within *its* lifetime. And yes, making it movable and settable opens the door to misuse, but the simplest and most natural way to use it is still the correct way.

I feel like this issue has been receiving a disproportionate amount of focus. Most users of Boost.Compute will probably never use one of these constructs. Also, no one has yet closed the door to wait_list- or command_queue- level equivalents. I'm simply not going to write them. And even if they aren't eventually added to the library, you can still define them in your own code.


Honestly, I'm much more concerned about #2. While we argue over levels of convenience, that looming issue of exception safety still casts a long, dark shadow over the high level interface of Boost.Compute and its users.


Matt


Thomas M
2014-12-31 21:14:38 UTC
Permalink
On 31/12/2014 20:32, Gruenke,Matt wrote:
> I feel like this issue has been receiving a disproportionate amount of focus. Most users of Boost.Compute will probably never use one of these constructs. Also, no one has yet closed the door to wait_list- or command_queue- level equivalents. I'm simply not going to write them. And even if they aren't eventually added to the library, you can still define them in your own code.
>
>
> Honestly, I'm much more concerned about #2. While we argue over levels of convenience, that looming issue of exception safety still casts a long, dark shadow over the high level interface of Boost.Compute and its users.
>

OK, let's settle this issue.
When you implement #2 (and yes, that's certainly a complex task), please
compile a thorough list of the cases where the library can't ensure
exception safety on its own and share it here; we can then take a new
look at the full-scale problem and think about an interface. Is that OK?

Thomas


Gruenke,Matt
2014-12-31 21:44:04 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
Sent: December 31, 2014 16:15
To: ***@lists.boost.org
Subject: Re: [boost] Synchronization (RE: [compute] review)

> When you implement #2,

I hadn't planned to do anything for #2. Kyle said he's thinking about it.

Since you had some ideas, maybe you could sketch them up in an email, or submit a patch. If you're afraid it might be a waste of time, email Kyle directly and ask if he's interested.


> please compile a thorough list of the cases where the library can't ensure
> exception safety on its own and share it here; we can then take a new look
> at the full-scale problem and think about an interface. Is that OK?

That sounds like a request for Kyle. You could always get the ball rolling, and maybe he'd help.


Matt


Rob Stewart
2014-12-31 10:10:55 UTC
Permalink
On December 31, 2014 3:15:10 AM EST, "Gruenke,Matt" <***@Tycoint.com> wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Thomas M
>
> > With respect to the command_queue "guarantee" I stand by my opinion
> > that it is absolutely useful; I also don't see why it's more
> > error-prone in general (see below).
>
> It's error-prone, because the guarantee must have a smaller scope than
> all of the containers it's intended to protect. You can't just
> instantiate new containers as you need them. That type of usage error
> isn't possible with event guarantees.
>
> You're correct to point out that this applies equally to wait_list
> guarantees. Hence, I am withdrawing my support for them.

What if the guarantees were part of the construct in use and only registered or lifetime-managed objects could be used with it? By that I mean that if you want synchronization, then you create a synchronizing command queue or wait list. If you want to use host memory with it, the API requires a host memory object that was previously registered with the queue/wait list, or a special host memory object with a ref count the queue/wait list can increment.

The point is to have an interface that enforces correct lifetime management rather than relying on the user to remember to do certain things in a certain order.

Another thought is to get the guarantee from the queue/wait list and then pass it to constructors and APIs that should be protected. That doesn't preclude using unprotected objects and calls, unless the queue/wait list can verify that its guarantee object was used to create the object handed to it, or was used in the API call invoked on it, whenever there is an extant guarantee object.

___
Rob

(Sent from my portable computation engine)

Thomas M
2014-12-29 09:21:24 UTC
Permalink
On 29/12/2014 03:23, Kyle Lutz wrote:
> On Sun, Dec 28, 2014 at 6:16 PM, Gruenke,Matt <***@tycoint.com> wrote:
>> -----Original Message-----
>> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
>> Sent: Sunday, December 28, 2014 20:36
>> To: ***@lists.boost.org List
>> Subject: Re: [boost] Synchronization (RE: [compute] review)
>>
>>> On Sun, Dec 28, 2014 at 4:46 PM, Gruenke,Matt wrote:
>>
>>>> My understanding, based on comments you've made to other reviewers, is
>>>> that functions like boost::compute::transform() are asynchronous when
>>>> the result is on the device, but block when the result is on the host.
>>>> This is what I'm concerned about. Is it true?
>>
>>> Yes this is correct. In general, algorithms like transform() are asynchronous
>>> when the input/output ranges are both on the device and synchronous when one
>>> of the ranges is on the host. I'll work on better ways to allow asynchrony in
>>> the latter case. One of my current ideas is add asynchronous memory-mapping
>>> support to the mapped_view class [1] which can then be used with any of the
>>> algorithms in an asynchronous fashion.

When you speak of input/output ranges on the host, to what kinds of
iterators do you refer to? Any input/output iterator kind (e.g. iterators
from a std:: container; I just tried a std::vector with
boost::compute::transform, and it didn't compile when provided as the
input range), or iterators that are part of your library?

Kyle Lutz
2014-12-29 16:56:14 UTC
Permalink
On Mon, Dec 29, 2014 at 1:21 AM, Thomas M <***@gmail.com> wrote:
> [...]
>
> When you speak of input/output ranges on the host, to what kinds of
> iterators do you refer to? Any input/output iterator kind, or iterators
> that are part of your library?

In general, all of the algorithms operate on Boost.Compute iterators
rather than STL iterators. The main exception to this rule is the
boost::compute::copy() algorithm which copies between ranges of STL
iterators on the host and Boost.Compute iterators on the device.
Another exception is the boost::compute::sort() algorithm which can
directly sort ranges of random-access iterators on the host (as long
as the data is contiguous, e.g. std::vector<T>). I am working to add
more direct support for host ranges to the other algorithms. Currently
the best way to use the Boost.Compute algorithms together with host
memory is with the mapped_view [1] class.
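
For anyone following along, usage looks roughly like this (a sketch
based on the mapped_view reference page; ctx and queue are assumed to
be an existing context and command_queue):

std::vector<float> host_data(1024, 2.0f);

// map the host memory into the device's address space (no copy)
boost::compute::mapped_view<float> view(&host_data[0], host_data.size(), ctx);

// run a device algorithm directly over the host-backed range
boost::compute::transform(view.begin(), view.end(), view.begin(),
                          boost::compute::sqrt<float>(), queue);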

-kyle

[1] http://kylelutz.github.io/compute/boost/compute/mapped_view.html

Thomas M
2014-12-29 19:53:57 UTC
Permalink
On 29/12/2014 17:56, Kyle Lutz wrote:
> [...]
>
> In general, all of the algorithms operate on Boost.Compute iterators
> rather than STL iterators. [...] Currently the best way to use the
> Boost.Compute algorithms together with host memory is with the
> mapped_view [1] class.

I would find it very helpful to forbid invalid iterators as arguments as
much as possible at compile time. For example I can use a
std::vector::iterator as output range argument:

std::vector<int> stdvec(100, 0);
boost::compute::transform(some_begin(), some_end(), stdvec.begin(), ...);

I guess such a use would be invalid; if transform accepts only iterators
from your library then I suppose such compile-time checks should be
possible for both input and output (including a targeted error message).

Thomas

Kyle Lutz
2014-12-29 21:16:54 UTC
Permalink
On Mon, Dec 29, 2014 at 11:53 AM, Thomas M <***@gmail.com> wrote:
> [...]
>
> I would find it very helpful to forbid invalid iterators as arguments as
> much as possible at compile time. For example I can use a
> std::vector::iterator as output range argument:
>
> std::vector<int> stdvec(100, 0);
> boost::compute::transform(some_begin(), some_end(), stdvec.begin(), ...);
>
> I guess such a use would be invalid; if transform accepts only iterators
> from your library then I suppose such compile-time checks should be possible
> for both input and output (including a targeted error message).

Fully agree. And Boost.Compute already has an internal
`is_device_iterator` trait which we could use to check for proper
device iterators. I've opened an issue for this [1].
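
Something along these lines at the top of each algorithm would do it (a
sketch; is_device_iterator is the internal trait mentioned above, and
its exact namespace/header is an implementation detail):

template<class InputIterator, class OutputIterator, class UnaryFunction>
inline OutputIterator
transform(InputIterator first, InputIterator last, OutputIterator result,
          UnaryFunction op, command_queue& queue)
{
    BOOST_STATIC_ASSERT_MSG(
        detail::is_device_iterator<InputIterator>::value,
        "transform() requires a device iterator for its input range");
    BOOST_STATIC_ASSERT_MSG(
        detail::is_device_iterator<OutputIterator>::value,
        "transform() requires a device iterator for its output range");

    // ... existing implementation ...
}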

-kyle

[1] https://github.com/kylelutz/compute/issues/392

Gruenke,Matt
2014-12-29 01:03:47 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 14:42
To: ***@lists.boost.org List
Subject: Re: [boost] [compute] review

> On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:

>> I didn't notice any unit tests specifically intended to verify exception-safety.
>> I'd want to know that had been thoroughly vetted, before I'd consider
>> using Boost.Compute in production code.

> There are already some tests which cover the error/exception paths in Boost.Compute
> (testing building invalid programs in "test_program.cpp" is the first that comes
> to mind). I'll work on adding more. Please let me know if there are any specific
> areas you'd like to see more heavily tested.

The only exception-related tests I saw seemed to be a small number that simply checked whether exceptions were being thrown when expected.

What I'm specifically concerned about is whether your containers and other wrappers are exception-safe. I don't want memory or resource leaks (or worse!) in my programs, when I handle recoverable exceptions and continue executing. I assume the bulk of your classes are simply RAII wrappers around the OpenCL C types, so they're probably fairly safe. I guess I'm more concerned about containers and anything with nontrivial state.

I don't know how much boost's unit test framework can help with this. Certainly, it can exercise a number of cases and report if there are any segfaults. But to check for resource leaks probably means also running them with a tool like valgrind, and examining the output. I'd enable file descriptor tracking, in hopes that it might detect some resource leaks down in vendor-specific code. Obviously, your library can only be as good as the underlying implementation, but the hope is that at least Boost.Compute doesn't add any leaks or errors of its own.
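
As a concrete example of the kind of test I mean (a sketch; the point is
the recovery check afterwards, with the whole suite then run under
valgrind --track-fds=yes):

BOOST_AUTO_TEST_CASE(recover_from_build_failure)
{
    boost::compute::context ctx = boost::compute::system::default_context();

    try {
        boost::compute::program p =
            boost::compute::program::create_with_source("not valid OpenCL!", ctx);
        p.build();  // expected to throw
        BOOST_ERROR("build() should have thrown");
    }
    catch (boost::compute::opencl_error&) {
        // recoverable error: fall through and keep using the context
    }

    // the context must still be fully usable after the failure
    boost::compute::buffer buf(ctx, 1024);
    BOOST_CHECK_EQUAL(buf.size(), std::size_t(1024));
}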


Matt


Emil Dotchevski
2014-12-29 01:07:56 UTC
Permalink
On Sun, Dec 28, 2014 at 5:03 PM, Gruenke,Matt <***@tycoint.com> wrote:

> -----Original Message-----
> I don't know how much boost's unit test framework can help with this.
> Certainly, it can exercise a number of cases and report if there are any
> segfaults. But to check for resource leaks probably means also running
> them with a tool like valgrind, and examining the output.


Obviously one could write tests to verify exception safety without using
special tools.

--
Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode

Kyle Lutz
2014-12-29 01:51:30 UTC
Permalink
On Sun, Dec 28, 2014 at 5:03 PM, Gruenke,Matt <***@tycoint.com> wrote:
> [...]
>
> What I'm specifically concerned about is whether your containers and other wrappers are exception-safe. [...] the hope is that at least Boost.Compute doesn't add any leaks or errors of its own.

I have attempted as best as possible to make the library exception
safe. I will work on improving the unit tests to explicitly check for
exception safety and proper recovery. I'd welcome anyone to look over
the code and, if possible, provide test-cases which show incorrect
behavior.

-kyle

Gruenke,Matt
2014-12-29 01:36:09 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
Sent: Sunday, December 28, 2014 14:42
To: ***@lists.boost.org List
Subject: Re: [boost] [compute] review

> On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:

>> Finally, based on the docs, it seems there's a bug in
>> boost::compute::command_queue::enqueue_copy_image_to_buffer() and
>> boost::compute::command_queue::enqueue_copy_buffer_to_image().
>> The region and origin parameters should be arrays, in the 2D and 3D cases.
>> If this discrepancy is intentional, it should be noted. I haven't
>> exhaustively reviewed the reference docs, so there might be other
>> issues of this sort.
>
> Can you be more specific as to what you consider to be a "bug" here?
> These APIs in the command_queue class are very "thin" wrappers on top of
> the clEnqueue* functions from the OpenCL C API. If you could provide a
> small test-case which demonstrates the issue and submit it to the
> bug-tracker [1] that would be great.

I can submit bugs, but this is really simple (and blatant). Now let's look at boost::compute::command_queue::enqueue_read_image(). It has the following overloads:

void enqueue_read_image(const image2d & image, const size_t origin,
const size_t region, size_t row_pitch,
void * host_ptr,
const wait_list & events = wait_list());

void enqueue_read_image(const image3d & image, const size_t origin,
const size_t region, size_t row_pitch,
size_t slice_pitch, void * host_ptr,
const wait_list & events = wait_list());

See how origin and region are size_t? Well, here's the OpenCL function they claim to wrap:

cl_int clEnqueueReadImage(cl_command_queue command_queue,
cl_mem image, cl_bool blocking_read,
const size_t *origin, const size_t *region,
size_t row_pitch, size_t slice_pitch,
void *ptr, cl_uint num_events_in_wait_list,
const cl_event *event_wait_list, cl_event *event );


See how origin and region are const size_t *? The reason is:

origin - Defines the (x, y, z) offset in pixels in the 1D, 2D, or 3D image, the (x, y) offset and the image index in the image array or the (x) offset and the image index in the 1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D image array object, origin[2] must be 0. If image is a 1D image array object, origin[1] describes the image index in the 1D image array. If image is a 2D image array object, origin[2] describes the image index in the 2D image array.

region - Defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image array object, region[2] must be 1.
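
To make the semantics concrete, a correct 2D read with the raw C API looks something like this (width, height, and host_ptr are placeholders):

size_t origin[3] = {0, 0, 0};              /* top-left pixel */
size_t region[3] = {width, height, 1};     /* region[2] must be 1 for 2D */

cl_int err = clEnqueueReadImage(queue, image, CL_TRUE /* blocking */,
                                origin, region,
                                0 /* row_pitch: computed by OpenCL */,
                                0 /* slice_pitch */,
                                host_ptr, 0, NULL, NULL);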


I have seen several such examples in boost::compute::command_queue, and I didn't even look very hard. But it would be good to sort out all of these problems before Boost.Compute is released as part of Boost, since fixing some may involve incompatible API changes. I guess you could add overloads in these cases, but it'd be nice if Boost.Compute didn't have cruft right from the start.


Matt



Kyle Lutz
2014-12-29 01:45:08 UTC
Permalink
On Sun, Dec 28, 2014 at 5:36 PM, Gruenke,Matt <***@tycoint.com> wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
> Sent: Sunday, December 28, 2014 14:42
> To: ***@lists.boost.org List
> Subject: Re: [boost] [compute] review
>
>> On Sun, Dec 28, 2014 at 1:54 AM, Gruenke,Matt wrote:
>
>>> Finally, based on the docs, it seems there's a bug in
>>> boost::compute::command_queue::enqueue_copy_image_to_buffer() and
>>> boost::compute::command_queue::enqueue_copy_buffer_to_image().
>>> The region and origin parameters should be arrays, in the 2D and 3D cases.
>>> If this discrepancy is intentional, it should be noted. I haven't
>>> exhaustively reviewed the reference docs, so there might be other
>>> issues of this sort.
>>
>> Can you be more specific as to what you consider to be a "bug" here?
>> These APIs in the command_queue class are very "thin" wrappers on top of
>> the clEnqueue* functions from the OpenCL C API. If you could provide a
>> small test-case which demonstrates the issue and submit it to the
>> bug-tracker [1] that would be great.
>
> I can submit bugs, but this is really simple (and blatant). Now let's look at boost::compute::command_queue::enqueue_read_image(). It has the following overloads:
>
> void enqueue_read_image(const image2d & image, const size_t origin,
>                         const size_t region, size_t row_pitch,
>                         void * host_ptr,
>                         const wait_list & events = wait_list());
>
> void enqueue_read_image(const image3d & image, const size_t origin,
>                         const size_t region, size_t row_pitch,
>                         size_t slice_pitch, void * host_ptr,
>                         const wait_list & events = wait_list());
>
> See how origin and region are size_t? Well, here's the OpenCL function they claim to wrap:
>
> cl_int clEnqueueReadImage(cl_command_queue command_queue,
>                           cl_mem image, cl_bool blocking_read,
>                           const size_t *origin, const size_t *region,
>                           size_t row_pitch, size_t slice_pitch,
>                           void *ptr, cl_uint num_events_in_wait_list,
>                           const cl_event *event_wait_list, cl_event *event);
>
>
> See how origin and region are const size_t *? The reason is:
>
> origin - Defines the (x, y, z) offset in pixels in the 1D, 2D, or 3D image, the (x, y) offset and the image index in the image array or the (x) offset and the image index in the 1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D image array object, origin[2] must be 0. If image is a 1D image array object, origin[1] describes the image index in the 1D image array. If image is a 2D image array object, origin[2] describes the image index in the 2D image array.

> region - Defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image array object, region[2] must be 1.
>
>
> I have seen several such examples in boost::compute::command_queue, and I didn't even look very hard. But it would be good to sort out all of these problems before Boost.Compute is released as part of Boost, since fixing some may involve incompatible API changes. I guess you could add overloads in these cases, but it'd be nice if Boost.Compute didn't have cruft right from the start.

Thanks for bringing this to my attention. This looks to be a bug in
generating the Boost documentation from the doxygen comments.

For example, the enqueue_read_image() method takes the origin argument
as "const size_t origin[2]", but the generated documentation lists this
as merely "const size_t origin". Until the documentation bug is fixed,
I'd invite you to look directly at the header file [1].
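
In other words, the discrepancy looks roughly like this (sketching the
2D overload; see the header for the authoritative declaration):

// what the header declares:
void enqueue_read_image(const image2d &image,
                        const size_t origin[2],
                        const size_t region[2],
                        size_t row_pitch,
                        void *host_ptr,
                        const wait_list &events = wait_list());

// what the doxygen->quickbook output renders, with the extents dropped:
void enqueue_read_image(const image2d &image,
                        const size_t origin,
                        const size_t region,
                        size_t row_pitch,
                        void *host_ptr,
                        const wait_list &events = wait_list());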

If anyone more knowledgeable about the doxygen->quickbook toolchain
could comment on this or knows of a workaround, that would be a great
help.

-kyle

[1] https://github.com/kylelutz/compute/blob/master/include/boost/compute/command_queue.hpp#L678

Vicente J. Botet Escriba
2014-12-30 15:15:57 UTC
Permalink
Hi,

I have no time to do a complete review. Anyway, here is my minimal review.


1. What is your evaluation of the design?

I would prefer that the algorithms take the queue as the first parameter
(which could be abstracted to an asynchronous executor).

The performance results suggest that there is no benefit in using the
library for containers with fewer than 10^4/10^5 elements.


Even if this is considered a low level library, the documentation should
show how a library writer could make use of the provided interface to
provide a higher-level one.

I would prefer that the library make use of std::future or boost::future
class instead of adding an additional one.
Did the author try to make use of the existing futures? If yes, which
problems were found?

I'm missing a range interface to most of the algorithms.

I suspect that as this is a C++03 library it doesn't take advantage of
some interesting C++11/C++14 features.

It is not clear which operations are synchronous, which ones could
block, ...

In order to get the native OpenCL handle, I suggest naming the function
get_native_handle instead of get.

2. What is your evaluation of the implementation?

I have not looked at the implementation.

3. What is your evaluation of the documentation?

The documentation lacks some core tutorials. I was unable to get an idea
of what a context is, what can be done with a command_queue, what a
kernel or a pipe is... Does it mean that the user must know OpenCL?

The reference documentation should follow the C++ standard way, including
pre-conditions, effects, synchronization, and throws clauses.

I would appreciate a description of the algorithms used and how the
parallelism is used. Is this completely managed by OpenCL, or does the
library do something else?

4. What is your evaluation of the potential usefulness of the library?

This is essentially an optimization library. It should be quite useful,
but the performance results don't prove it.

5. Did you try to use the library? With what compiler? Did you have any
problems?

No.

6. How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?

A glance.

7. Are you knowledgeable about the problem domain?

Yes and no. I don't know about OpenCL, but I'm aware of the current
parallel proposals.

8. Do you think the library should be accepted as a Boost library?

As I have not taken much time, I'm unable to say that this library
should not be included in Boost. So I vote to include it conditionally,
subject to:
* I would need performance results that prove the library is better
than using standard STL containers in most cases.
* The documentation should be completed with more tutorials and
better reference documentation.

Best,
Vicente

Kyle Lutz
2014-12-30 19:08:30 UTC
Permalink
On Tue, Dec 30, 2014 at 7:15 AM, Vicente J. Botet Escriba
<***@wanadoo.fr> wrote:
> Hi,
>
> I have no time to do a complete review. Anyway here it is my minimal review
>
>
> 1. What is your evaluation of the design?
>
> I would prefer that the algorithms take the queue as the first parameter
> (which could be abstracted to an asynchronous executor).

Any particular reason to prefer the first argument rather than the
last? Currently, by using the last argument to specify the command
queue, we can use a default argument of "system::default_queue()".
For the common case of only wanting to use the default compute device
on the system, all algorithms will work correctly and the user never
has to bother with finding a device, setting up a context, creating a
command queue and passing it to the algorithm function.
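
For example (a sketch; the vector lives in the default context):

boost::compute::vector<int> v(1000);        // allocated in the default context
boost::compute::sort(v.begin(), v.end());   // runs on system::default_queue()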

> The performance results suggest that there is no benefit in using the library
> for containers with fewer than 10^4/10^5 elements.

In general yes, GPUs/Accelerators are ill-suited for small problem
sizes (unless the computation performed on each element is quite
expensive). Sorting 1000 integers will nearly always be faster on the
host rather than the device.

> Even if this is considered a low level library, the documentation should
> show how a library writer could make use of the provided interface to
> provide a higher-level one.

I agree, there should be more documentation for this. Could you let me
know what sort of higher-level interface you'd like to see
demonstrated?

> I would prefer that the library make use of std::future or boost::future
> class instead of adding an additional one.
> Did the author try to make use of the existing futures? If yes, which
> problems were found?

The implementation of the boost::compute::future type is fairly
different from that of the std::/boost::future types. Essentially,
there is no operation akin to "promise.set_value()" for
boost::compute::future. Instead, it is a simple wrapper around an
OpenCL event object which can be used to wait for an operation on the
device to complete (via calling the blocking clWaitForEvents()
function from the OpenCL API). Please let me know if there are ways to
better integrate with boost::future.
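
For example (a sketch using copy_async(); ctx and queue are assumed to
have been set up already):

std::vector<float> host(16384);
boost::compute::vector<float> device(host.size(), ctx);

// returns a boost::compute::future wrapping the cl_event for the copy
auto f = boost::compute::copy_async(host.begin(), host.end(),
                                    device.begin(), queue);
// ... do other host-side work here ...
f.wait();   // blocks by calling clWaitForEvents() on the wrapped event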

> I'm missing a range interface to most of the algorithms.

The algorithms API is meant to mimic the STL algorithms API. Adding a
range-based interface is definitely possible, but is a non-trivial
amount of work. Would you be interested in contributing this?

> I suspect that as this is a C++03 library it doesn't take advantage of some
> interesting C++11/C++14 features.

It does take advantage of some C++11 features (e.g. variadic
functions) where possible. I would be very open to integrating more
C++11/14 features where appropriate.

> It is not clear which operations are synchronous, which ones could block,
> ...

Could you be more specific about what operations you are referring to?
I know there is already some documentation regarding this for the
command_queue class as well as for the copy()/copy_async() functions.
I'll work on adding more.

> In order to get the native OpenCL handle, I suggest naming the function
> get_native_handle instead of get.
>
> 2. What is your evaluation of the implementation?
>
> I have not looked at the implementation.
>
> 3. What is your evaluation of the documentation?
>
> The documentation lacks some core tutorials. I was unable to get an idea
> of what a context is, what can be done with a command_queue, what a
> kernel or a pipe is... Does it mean that the user must know OpenCL?

The low-level API is a wrapper around OpenCL and so having some
background knowledge there helps. I'll work on adding more
documentation for these classes and how they fit together.

> The reference documentation should follow the C++ standard way, including
> pre-conditions, effects, synchronization, and throws clauses.

Fully agree, I'll continue to work on this.

> I would appreciate a description of the algorithms used and how the parallelism
> is used. Is this completely managed by OpenCL, or does the library do something
> else?

I'm not sure if I fully understand what you're asking here but I'll
try to answer it.

OpenCL provides an API for executing code on compute devices (e.g.
GPUs), Boost.Compute provides algorithms which, when called, use the
OpenCL API to create programs, compile them for the device, and
execute them.
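
In outline, an algorithm call boils down to something like this (a
hand-written sketch of the flow, not code lifted from the library;
source, ctx, device_vector, and queue are assumed):

// build and compile a program containing the generated kernel source
boost::compute::program prog =
    boost::compute::program::create_with_source(source, ctx);
prog.build();

// fetch the kernel and bind its arguments to device memory
boost::compute::kernel k(prog, "my_kernel");
k.set_arg(0, device_vector.get_buffer());

// enqueue one work-item per element
queue.enqueue_1d_range_kernel(k, 0, device_vector.size(), 0);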

Let me know if I can better explain this.

> 4. What is your evaluation of the potential usefulness of the library?
>
> This is essentially an optimization library. It should be quite useful, but
> the performance results don't prove it.

Could you clarify what you mean by "the performance results don't
prove it"? Boost.Compute provides a number of benchmark programs which
can be used to measure and evaluate the performance of the library on
a particular system. Results from this are also published on the
"Performance" page [1] in the documentation.

> 5. Did you try to use the library? With what compiler? Did you have any
> problems?
>
> No.
>
> 6. How much effort did you put into your evaluation? A glance? A quick
> reading? In-depth study?
>
> A glance.
>
> 7. Are you knowledgeable about the problem domain?
>
> Yes and no. I don't know about OpenCL, but I'm aware of the current
> parallel proposals.
>
> 8. Do you think the library should be accepted as a Boost library?
>
> As I have not taken much time, I'm unable to say that this library should
> not be included in Boost. So I vote to include it conditionally, subject to:
> * I would need performance results that prove the library is better than
> using standard STL containers in most cases.

Boost.Compute will never be universally better than the STL, and it
was never intended to be. CPUs and GPUs make fundamentally different
trade-offs which lead to very different performance characteristics.
GPUs support performing computation on large amounts of data with very
high throughput (in some cases 10x-100x faster than a CPU). The
down-side to this is a larger overhead to get the GPU running as well
as the cost of transferring data to device memory. Boost.Compute and
the STL are different tools which each suit their own different
use-cases. I don't think Boost.Compute should be required to supplant
the STL in order to be an acceptable library.

> * The documentation should be completed with more tutorials and
> better reference documentation.

Fully agree, I'll definitely continue to work on improving the documentation.

Thanks for the review. Let me know if I can explain anything more clearly.

-kyle

[1] http://kylelutz.github.io/compute/boost_compute/performance.html

Gruenke,Matt
2014-12-31 19:41:13 UTC
Permalink
-----Original Message-----
From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
Sent: December 30, 2014 14:09
To: ***@lists.boost.org List
Subject: Re: [boost] [compute] review

>On Tue, Dec 30, 2014 at 7:15 AM, Vicente J. Botet Escriba wrote:

>> I would prefer that the algorithms take the queue as the first parameter
>> (which could be abstracted to an asynchronous executor).
>
> Any particular reason to prefer the first argument rather than the last?
> Currently, by using the last argument to specify the command queue, we
> can use a default argument of "system::default_queue()". For the common
> case of only wanting to use the default compute device on the system, all
> algorithms will work correctly and the user never has to bother with
> finding a device, setting up a context, creating a command queue and
> passing it to the algorithm function.

While that's certainly convenient, it does open the door to a subtle (but serious) error. In code not using the default queue, one or more function calls may inadvertently omit that parameter. This would result in operations being directed to different queues, introducing race conditions.

I would not give the queue parameter a default. It's not hard for users to explicitly call system::default_queue() at the top of their code, save the result, and pass it into their functions. They still wouldn't need to find devices, set up contexts, etc.
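
To illustrate the failure mode (a sketch; the names are just for the example):

boost::compute::command_queue my_queue(ctx, dev);

boost::compute::sort(v.begin(), v.end(), my_queue);

// Whoops -- the queue argument was forgotten here, so this silently runs
// on system::default_queue(), with no ordering relative to the sort above:
boost::compute::reduce(v.begin(), v.end(), &sum);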


Matt



Thomas M
2014-12-31 21:18:30 UTC
Permalink
On 31/12/2014 20:41, Gruenke,Matt wrote:
> -----Original Message-----
> From: Boost [mailto:boost-***@lists.boost.org] On Behalf Of Kyle Lutz
> Sent: December 30, 2014 14:09
> To: ***@lists.boost.org List
> Subject: Re: [boost] [compute] review
>
>> On Tue, Dec 30, 2014 at 7:15 AM, Vicente J. Botet Escriba wrote:
>
>>> I would prefer that the algorithms take the queue as the first parameter
>>> (which could be abstracted to an asynchronous executor).
>>
>> Any particular reason to prefer the first argument rather than the last?
>> Currently, by using the last argument to specify the command queue, we
>> can use a default argument of "system::default_queue()". For the common
>> case of only wanting to use the default compute device on the system, all
>> algorithms will work correctly and the user never has to bother with
>> finding a device, setting up a context, creating a command queue and
>> passing it to the algorithm function.
>
> While that's certainly convenient, it does open the door to a subtle (but serious) error. In code not using the default queue, one or more function calls may inadvertently omit that parameter. This would result in operations being directed to different queues, introducing race conditions.
>
> I would not give the queue parameter a default. It's not hard for users to explicitly call system::default_queue() at the top of their code, save the result, and pass it into their functions. They still wouldn't need to find devices, set up contexts, etc.

I fully second that; I see little gained by the default parameter, but
quite a few (unnecessary) risks introduced.

Thomas


Vicente J. Botet Escriba
2015-01-01 14:59:54 UTC
Permalink
On 30/12/14 20:08, Kyle Lutz wrote:
> On Tue, Dec 30, 2014 at 7:15 AM, Vicente J. Botet Escriba
> <***@wanadoo.fr> wrote:
>> Hi,
>>
>> I have no time to do a complete review. Anyway here it is my minimal review
>>
>>
>> 1. What is your evaluation of the design?
>>
>> I would prefer that the algorithms take the queue as the first parameter
>> (which could be abstracted to an asynchronous executor).
> Any particular reason to prefer the first argument rather than the
> last? Currently, by using the last argument to specify the command
> queue, we can use a default argument of "system::default_queue()".
> For the common case of only wanting to use the default compute device
> on the system, all algorithms will work correctly and the user never
> has to bother with finding a device, setting up a context, creating a
> command queue and passing it to the algorithm function.
One of the reasons is to allow for algorithms that take a variadic
number of arguments. I don't know if you have any in your library, but I
find this a uniform pattern.
Second, I consider that the user must be explicit about which queue
(device) is used, as AFAIK this is an optimization library.
>
>> The performance results suggest that there is no benefit in using the library
>> for containers with fewer than 10^4/10^5 elements.
> In general yes, GPUs/Accelerators are ill-suited for small problem
> sizes (unless the computation performed on each element is quite
> expensive). Sorting 1000 integers will nearly always be faster on the
> host rather than the device.
>
>> Even if this is considered a low level library, the documentation should
>> show how a library writer could make use of the provided interface to
>> provide a higher-level one.
> I agree, there should be more documentation for this. Could you let me
> know what sort of higher-level interface you'd like to see
> demonstrated?
That of the standard Parallel TS :), or a pipeline framework.
>
>> I would prefer that the library make use of std::future or boost::future
>> class instead of adding an additional one.
>> Did the author try to make use of the existing futures? If yes, which
>> problems were found?
> The implementation of the boost::compute::future type is fairly
> different than that of the std::/boost::future types. Essentially,
> there is no operation akin to "promise.set_value()" for
> boost::compute::future. Instead, it is a simple wrapper around an
> OpenCL event object which can be used to wait for an operation on the
> device to complete (via calling the blocking clWaitForEvents()
> function from the OpenCL API). Please let me know if there are ways to
> better integrate with boost::future.

I understand this doesn't seem to match the std future/promise
design. Note that the futures returned by std::async,
std::make_ready_future, future<>::then, when_all, and when_any don't
expose an associated promise either (it is hidden).
I don't know yet how this could be done, but I find it an important use
case to be solved.

I'm not against a multiplication of future types on principle, but
this means that we would need to duplicate the whole future interface
(then continuations, when_all, when_any, ...).


>
>> I'm missing a range interface to most of the algorithms.
> The algorithms API is meant to mimic the STL algorithms API. Adding a
> range-based interface is definitely possible, but is a non-trivial
> amount of work. Would you be interested in contributing this?
Agreed, this is a lot of work. No, I don't have time to contribute to this.
I was just wondering if a range interface couldn't make it possible to
compose the programs executed by the whole pipeline and make the cost of
copies less significant.
>
>> I suspect that as this is a C++03 library it doesn't take advantage of some
>> interesting C++11/C++14 features.
> It does take advantage of some C++11 features (e.g. variadic
> functions) where possible. I would be very open to integrating more
> C++11/14 features where appropriate.
>
>> It is not clear which operations are synchronous, which ones could block,
>> ...
> Could you be more specific about what operations you are referring to?
All the functions of the core classes must be explicit about
thread-safety (synchronization).
I suspect that almost all the algorithms block, and no parallelism
(pipelining) is possible. Could you confirm?
> I know there is already some documentation regarding this for the
> command_queue class as well as for the copy()/copy_async() functions.
> I'll work on adding more.
Great.
>
>> In order to get the native OpenCL handle, I suggest naming the function
>> get_native_handle instead of get.
>>
>> 2. What is your evaluation of the implementation?
>>
>> I have not looked at the implementation.
>>
>> 3. What is your evaluation of the documentation?
>>
>> The documentation lacks some core tutorials. I was unable to get an idea
>> of what a context is, what can be done with a command_queue, what a
>> kernel or a pipe is... Does it mean that the user must know OpenCL?
> The low-level API is a wrapper around OpenCL and so having some
> background knowledge there helps. I'll work on adding more
> documentation for these classes and how they fit together.
Thanks.
>
>> The reference documentation should follow the C++ standard way, including
>> pre-conditions, effects, synchronization, and throws clauses.
> Fully agree, I'll continue to work on this.
Great.
>
>> I would appreciate a description of the algorithms used and how the parallelism
>> is used. Is this completely managed by OpenCL, or does the library do something
>> else?
> I'm not sure if I fully understand what you're asking here but I'll
> try to answer it.
>
> OpenCL provides an API for executing code on compute devices (e.g.
> GPUs), Boost.Compute provides algorithms which, when called, use the
> OpenCL API to create programs, compile them for the device, and
> execute them.
>
> Let me know if I can better explain this.
I have taken a look at the OpenCL documentation and I understand it better.
Would it be possible to have rough pseudo-code of the program that
each of the algorithms executes on the device?

>
>> 4. What is your evaluation of the potential usefulness of the library?
>>
>> This is essentially an optimization library. It should be quite useful, but
>> the performance results don't prove it.
> Could you clarify what you mean by "the performance results don't
> prove it"? Boost.Compute provides a number of benchmark programs which
> can be used to measure and evaluate the performance of the library on
> a particular system. Results from this are also published on the
> "Performance" page [1] in the documentation.
What I meant is that the performance is often worse except when the size
of the container is large enough. The scope of the library could be to be
better only for these big containers, but I'm missing an interface that
chooses between the Compute and the STL implementations.
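
Something like this hypothetical adaptive wrapper is what I have in mind
(a sketch; the names and the cutoff are invented):

#include <algorithm>
#include <iterator>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/algorithm/sort.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/command_queue.hpp>

// hypothetical: fall back to the STL below an empirically chosen cutoff
template <class Iterator>
void adaptive_sort(Iterator first, Iterator last,
                   boost::compute::command_queue &queue)
{
    typedef typename std::iterator_traits<Iterator>::value_type T;

    if (std::distance(first, last) < 100000) {      // host wins for small n
        std::sort(first, last);
    } else {                                        // device wins for large n
        boost::compute::vector<T> d(first, last, queue);
        boost::compute::sort(d.begin(), d.end(), queue);
        boost::compute::copy(d.begin(), d.end(), first, queue);
    }
}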
>
>>
>> 8. Do you think the library should be accepted as a Boost library?
>>
>> As I have not taken much time, I'm unable to say that this library should
>> not be included in Boost. So I vote to include it conditionally, subject to:
>> * I would need performance results that prove the library is better than
>> using standard STL containers in most cases.
> Boost.Compute will never be universally better than the STL, and it
> was never intended to be. CPUs and GPUs make fundamentally different
> trade-offs which lead to very different performance characteristics.
> GPUs support performing computation on large amounts of data with very
> high throughput (in some cases 10x-100x faster than a CPU). The
> down-side to this is a larger overhead to get the GPU running as well
> as the cost of transferring data to device memory. Boost.Compute and
> the STL are different tools which each suit their own different
> use-cases. I don't think Boost.Compute should be required to supplant
> the STL in order to be an acceptable library.
I suggest you make the goal and scope of the library more evident in
the introduction of the library. Please replace my requirement with
this one.
>
>> * The documentation should be completed with more tutorials and
>> better reference documentation.
> Fully agree, I'll definitely continue to work on improving the documentation.
>
> Thanks for the review. Let me know if I can explain anything more clearly.
>
>
I will come back if I find some time to read the documentation carefully.

Good luck,
Vicente
