Discussion:
C behavior on malloc and free
r***@gmail.com
2016-04-22 20:42:41 UTC
What does the C standard say about this kind of occurrence?

char* x = malloc(10);
free(x + 1);

Will the free() algorithm look at the exact pointer address, or a pointer
within the range of allocated memory?

Best regards,
Rick C. Hodgin
BartC
2016-04-22 21:05:11 UTC
Post by r***@gmail.com
What does the C standard say about this kind of occurrence?
char* x = malloc(10);
free(x + 1);
Will the free() algorithm look at the exact pointer address, or a pointer
within the range of allocated memory?
I've no idea but wouldn't expect it to be mandatory for an
implementation to support it.

It would be grossly inefficient.

Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.

There might be ways of making that more efficient, but it's a lot easier
to insist that p points to the start of an allocation.
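A minimal sketch of the kind of header scheme described here (the layout and magic value are hypothetical; real allocators differ):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocation header stored just before the user pointer. */
typedef struct {
    size_t size;     /* usable size of the block */
    uint32_t magic;  /* sanity value so a bad pointer can be caught */
} hdr_t;

#define HDR_MAGIC 0xA110C8EDu

void *my_malloc(size_t n) {
    hdr_t *h = malloc(sizeof(hdr_t) + n);
    if (!h) return NULL;
    h->size = n;
    h->magic = HDR_MAGIC;
    return h + 1;               /* user data starts right after the header */
}

void my_free(void *p) {
    if (!p) return;
    hdr_t *h = (hdr_t *)p - 1;  /* O(1): just step back over the header */
    assert(h->magic == HDR_MAGIC);  /* fails for a mid-block pointer */
    free(h);
}
```

Given x + 1 instead of x, the header read lands inside the user's own data, which is why the exact-pointer contract keeps free() both fast and fragile.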
--
Bartc
BGB
2016-04-22 23:55:14 UTC
Post by BartC
Post by r***@gmail.com
What does the C standard say about this kind of occurrence?
char* x = malloc(10);
free(x + 1);
Will the free() algorithm look at the exact pointer address, or a pointer
within the range of allocated memory?
I've no idea but wouldn't expect it to be mandatory for an
implementation to support it.
It would be grossly inefficient.
Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.
yes, that approach would be grossly inefficient, but this is more a case
of "don't do it that way". though, in any case, even a fast lookup is
still slower than just assuming a pointer points at the start of the
data area for an object.
Post by BartC
There might be ways of making that more efficient, but it's a lot easier
to insist that p points to the start of an allocation.
yep. pretty much.


going OT:

for other reasons, I have written memory allocators where lookups of
memory objects from arbitrary pointers (which may or may not be valid,
and may point into the middle of an object) aren't particularly uncommon.

this does have an impact on the design of the allocator (for various
reasons, traditional block-slicing free-list allocators are no longer
the ideal solution).


for smaller allocations, my preferred strategy is mapping the object to
a series of fixed size cells, with a bitmap tracking which cells are in
use (and the type of each cell). this results in a constant space
overhead, and allows reasonably quick lookups.

for larger allocations, and for looking up regions of the heap (the
actual memory you get from mmap or VirtualAlloc or similar in the
background), a typical strategy in my case is to keep either a sorted
array or a binary tree structure (a balanced binary tree is more
complicated, but has overall better properties for this).

this means lookup time is O(log n) for the number of heap regions (or
large allocations), with a secondary O(1) average case to find
individual cells (for smaller objects).
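a rough sketch of the cell-bitmap idea (the nibble encoding, region size, and names here are illustrative, not the actual MM's):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical region layout: fixed 16-byte cells with a nibble of
 * metadata per cell, matching the constant 4 bits/cell overhead. */
#define CELL_SIZE 16
#define NUM_CELLS 1024
enum { CELL_FREE = 0, CELL_START = 1, CELL_CONT = 2 };

typedef struct {
    unsigned char data[NUM_CELLS * CELL_SIZE];
    unsigned char meta[NUM_CELLS / 2];        /* 4 bits per cell */
} region_t;

static int get_meta(const region_t *r, int i) {
    return (i & 1) ? (r->meta[i >> 1] >> 4) : (r->meta[i >> 1] & 0x0F);
}

static void set_meta(region_t *r, int i, int v) {
    if (i & 1) r->meta[i >> 1] = (unsigned char)((r->meta[i >> 1] & 0x0F) | (v << 4));
    else       r->meta[i >> 1] = (unsigned char)((r->meta[i >> 1] & 0xF0) | v);
}

/* Map an arbitrary interior pointer to the base of its object:
 * O(1) to find the cell, then a backwards scan to the start cell. */
void *object_base(region_t *r, void *p) {
    ptrdiff_t off = (unsigned char *)p - r->data;
    if (off < 0 || off >= NUM_CELLS * CELL_SIZE) return NULL;  /* not ours */
    int i = (int)(off / CELL_SIZE);
    while (i > 0 && get_meta(r, i) == CELL_CONT) i--;
    return (get_meta(r, i) == CELL_START) ? r->data + i * CELL_SIZE : NULL;
}
```

the backwards scan makes a base lookup cost proportional to the object's size in cells, which is one reason larger objects want larger cells (or a different scheme entirely).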


some values which I am using in a newer MM in this style:
cell-size for small objects: 16B
cell-size for medium objects: 256B
object-size cutoff for small objects: 4KiB
object-size cutoff for medium objects: 256KiB (*1)
small and medium object heap region: 16MB (*2)

constant memory overhead:
constant 4 bits/cell;
for 16B cells: 3.125%
for 256B cells: 0.195%


1: this value has more to do with VirtualAlloc using a 64KiB logical
page size. having this cutoff at 64KiB is closer to optimal, and is a
better choice with the 4KiB pages typical of mmap, but results in
considerable overhead on Windows for objects in the 64KiB-256KiB range.

while it is possible to use a free-list allocator, objects in this size
range tend to be infrequent enough that just using raw allocation is
easier (and a VA fallback with a larger cutoff is moderately faster than
falling back to MSVCRT's malloc, with otherwise fairly similar average
memory use).

the reasons for these cutoff values have to do with constant overhead
and the relative cost of bitmap operations, where as objects get bigger
a few things happen:
the space used by the cell bitmap becomes more significant relative to
the overhead of simply using a larger cell size;
the cost of the bitmap operations during allocation/freeing becomes
more significant.


2: the ideal value for this depends a fair bit on the total size of the
heap, and the relative distribution of objects. IME, the small and
medium heaps tend to comprise the vast majority of memory objects, and
also a majority of the total memory use (around 60-70% of the memory pie).

with generally around 2-4GB of heap space, 16MB is a pretty good size
for heap regions: log2(3072/16) ~= 7.6.

smaller means that larger amounts of time are spent walking the tree to
find the correct memory region, whereas much larger reduces the amount
of usable address space in 32-bit processes (for not much gain).

if targeting hardware where the memory space is smaller, a smaller size
for heap-regions may make sense. note also that this is not a sensible
design for microcontroller applications.


or such...
Kaz Kylheku
2016-04-23 02:29:27 UTC
Post by BGB
Post by BartC
Post by r***@gmail.com
What does the C standard say about this kind of occurrence?
char* x = malloc(10);
free(x + 1);
Will the free() algorithm look at the exact pointer address, or a pointer
within the range of allocated memory?
I've no idea but wouldn't expect it to be mandatory for an
implementation to support it.
It would be grossly inefficient.
Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.
yes, that approach would be grossly inefficient, but this is more a case
of "don't do it that way". though, in any case, even a fast lookup is
still slower than just assuming a pointer points at the start of the
data area for an object.
The Obvious:

Suppose that all malloc objects are laid out on, say, 64 byte
boundaries.

Then you can easily support free((char *) x + k), for 0 <= k < 64,
with just an extra bit operation to align the pointer down.
BGB
2016-04-23 05:02:40 UTC
Post by Kaz Kylheku
Post by BGB
Post by BartC
Post by r***@gmail.com
What does the C standard say about this kind of occurrence?
char* x = malloc(10);
free(x + 1);
Will the free() algorithm look at the exact pointer address, or a pointer
within the range of allocated memory?
I've no idea but wouldn't expect it to be mandatory for an
implementation to support it.
It would be grossly inefficient.
Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.
yes, that approach would be grossly inefficient, but this is more a case
of "don't do it that way". though, in any case, even a fast lookup is
still slower than just assuming a pointer points at the start of the
data area for an object.
Suppose that all malloc objects are laid out on, say, 64 byte
boundaries.
Then you can easily support free((char *) x + k), for 0 <= k < 64,
with just an extra bit operation to align the pointer down.
though, nevermind that 64B alignment would be terrible for memory
footprint if a majority of one's objects are fairly small (most smaller
than 64B).

this is partly why I was using a 16 byte allocation granularity, though
many objects still have a bit of overhead due to the use of 16-byte
object headers (which can make slabs favorable even with the 16B
granularity).

example: one allocates a cons cell with 2 pointers, which ends up taking
32 bytes to represent on-heap.

whereas, a slab can store each at exactly 8 or 16 bytes, and thus reduce
per-alloc overhead (at the cost that typically the slab will implicitly
involve the use of intra-object pointers and similar).


FWIW: trying to allocate small objects like this tends to work terribly
with MSVCRT's malloc, which wastes huge amounts of memory and becomes
rather slow when faced with millions of small memory allocs and frees
(in addition to it providing no real means to do run-time/dynamic
type-checking or similar).
BartC
2016-04-23 09:07:46 UTC
Post by Kaz Kylheku
Post by BGB
Post by BartC
It would be grossly inefficient.
Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.
yes, that approach would be grossly inefficient, but this is more a case
of "don't do it that way". though, in any case, even a fast lookup is
still slower than just assuming a pointer points at the start of the
data area for an object.
Suppose that all malloc objects are laid out on, say, 64 byte
boundaries.
Then you can easily support free((char *) x + k), for 0 <= k < 64,
with just an extra bit operation to align the pointer down.
Yes, if the low bits are always 0 (the low 6 bits in your example when K
is 64), then it is possible to do something like that. But in general it
won't work.

One feature of C is that a program doesn't need to keep track of the
sizes of allocated objects. So if p points somewhere along an allocated
string or array, and it doesn't know where the start of the object is,
then neither it nor free() will know if it is within K bytes of the start.

(Of course, in the OP's example, free(p+1) can be trivially rewritten as
free(p). I assume he meant free(q) where q is (p+n), of which neither p
nor n are known at that point.)

If certain allocations are known to be not bigger than K bytes, then
this scheme might work, but then the program doesn't really need special
support of free(): it can do its own masking of the low bits, perhaps
via a freek(q) wrapper function.
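A sketch of such a freek() wrapper (the name comes from the text above; the body is a guess, and it is only safe on top of an allocator that genuinely returns K-aligned blocks, e.g. C11 aligned_alloc):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define K 64  /* alignment granularity: a program-wide assumption, not a C guarantee */

/* Hypothetical wrapper: mask the low bits so any pointer within the
 * first K bytes of a K-aligned block frees the block itself. */
void freek(void *q) {
    free((void *)((uintptr_t)q & ~(uintptr_t)(K - 1)));
}
```

note that calling this on a pointer from plain malloc() would be undefined behaviour, since malloc() makes no 64-byte alignment promise; that is exactly the portability problem with baking in any particular K.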

But it needs to be able to function with *any* compile-time value of K
if portability is needed. A program that works with K=32 also needs to
work when K=8.
--
Bartc
BGB
2016-04-23 12:33:49 UTC
Post by BartC
Post by Kaz Kylheku
Post by BGB
Post by BartC
It would be grossly inefficient.
Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.
yes, that approach would be grossly inefficient, but this is more a case
of "don't do it that way". though, in any case, even a fast lookup is
still slower than just assuming a pointer points at the start of the
data area for an object.
Suppose that all malloc objects are laid out on, say, 64 byte
boundaries.
Then you can easily support free((char *) x + k), for 0 <= k < 64,
with just an extra bit operation to align the pointer down.
Yes, if the low bits are always 0 (the low 6 bits in your example when K
is 64), then it is possible to do something like that. But in general it
won't work.
One feature of C is that a program doesn't need to keep track of the
sizes of allocated objects. So if p points somewhere along an allocated
string or array, and it doesn't know where the start of the object is,
then neither it nor free() will know if it is within K bytes of the start.
(Of course, in the OP's example, free(p+1) can be trivially rewritten as
free(p). I assume he meant free(q) where q is (p+n), of which neither p
nor n are known at that point.)
If certain allocations are known to be not bigger than K bytes, then
this scheme might work, but then the program doesn't really need special
support of free(): it can do its own masking of the low bits, perhaps
via a freek(q) wrapper function.
But it needs to be able to function with *any* compile-time value of K
if portability is needed. A program that works with K=32 also needs to
work when K=8.
yes, pretty much.

for example: one can allocate a chunk of memory, divide it up into
pieces, and fetch a pointer to the last piece.

if they ask for information about the last piece, with a function that
accepts arbitrary pointers, then ideally it should be able to figure out
that it is within the original allocation and do something sensible with it.


now, should a free function accept arbitrary pointers (if it has the
means to do so)? I am leaning towards no. the reason is that conceptually
the alloc and free are paired, so it makes sense that they both use the
same pointer.

chances are, if a pointer internal to an object is used to try to free
it, then something has gone wrong, and it may be better to invoke
debugging features or similar rather than just quietly accept it.


granted, other useful MM features:
* getting the base of a memory allocation;
* getting the size of a memory allocation;
* assigning and retrieving a type ID for the allocation;
* potentially supporting some user flags bits;
* ability to track where the allocation function was called from;
** particularly useful for "heap forensics" type tasks;
* being able to deal well with lots of small allocations;
** don't underestimate the small allocs, there may be *lots* of them
* deal well with apps using up the entire 32-bit address space.
* ...

a partial result of this is that a fair number of non-trivial codebases
end up writing their own memory allocators to sidestep the common
deficiencies of the standard "malloc" interface.

sometimes, C libraries offer extension features for some of this for
malloc, but relying on this is even less portable than just writing
one's own allocator (since they otherwise end up with a program tied to
a specific C library).
s***@casperkitty.com
2016-04-23 16:17:01 UTC
Post by BGB
for smaller allocations, my preferred strategy is mapping the object to
a series of fixed size cells, with a bitmap tracking which cells are in
use (and the type of each cell). this results in a constant space
overhead, and allows reasonably quick lookups.
for larger allocations, and for looking up regions of the heap (the
actual memory you get from mmap or VirtualAlloc or similar in the
background), a typical strategy in my case is to keep either a sorted
array or a binary tree structure (a balanced binary tree is more
complicated, but has overall better properties for this).
Another option, especially for systems with virtual addressing, is to have
different regions of address space dedicated to different sizes of blocks.
This approach makes it easy to find the base address of a block given an
address within it, and can avoid conventional fragmentation. The biggest
problems arise if the amount of physical memory allocated to holding some
size of block differs significantly from the amount of storage that's
actually needed for blocks of that size, but for some usage patterns
that's not really a problem.
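A sketch of how base recovery falls out of that layout (the window size, size classes, and names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: each 256MB window of address space holds blocks
 * of one fixed size, so an interior pointer maps to its block base by
 * rounding down within the window. */
#define WINDOW_BITS 28                      /* 256MB per size class */

/* block size in each window, indexed by high address bits (demo: 4 classes) */
static const uintptr_t class_size[4] = { 16, 64, 256, 4096 };

uintptr_t block_base(uintptr_t addr) {
    uintptr_t window = addr >> WINDOW_BITS;           /* which size class */
    uintptr_t bsize  = class_size[window & 3];
    uintptr_t start  = window << WINDOW_BITS;
    return start + ((addr - start) / bsize) * bsize;  /* round down to block */
}
```

No per-block header or search is needed; the address alone encodes the block size, which is the appeal of the scheme.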
BGB
2016-04-23 19:39:38 UTC
Post by s***@casperkitty.com
Post by BGB
for smaller allocations, my preferred strategy is mapping the object to
a series of fixed size cells, with a bitmap tracking which cells are in
use (and the type of each cell). this results in a constant space
overhead, and allows reasonably quick lookups.
for larger allocations, and for looking up regions of the heap (the
actual memory you get from mmap or VirtualAlloc or similar in the
background), a typical strategy in my case is to keep either a sorted
array or a binary tree structure (a balanced binary tree is more
complicated, but has overall better properties for this).
Another option, especially for systems with virtual addressing, is to have
different regions of address space dedicated to different sizes of blocks.
This approach makes it easy to find the base address of a block given an
address within it, and can avoid conventional fragmentation. The biggest
problems arise if the amount of physical memory allocated to holding some
size of block differs significantly from the amount of storage that's
actually needed for blocks of that size, but for some usage patterns
that's not really a problem.
this has its own set of drawbacks though.


the binary tree strategy gives an O(log2 n) lookup for regions, which as
noted can reach the target within 7 or 8 loop steps (with a 3GB heap).

likewise, 12 loop steps could cover an approximately 64GB heap.


one strategy that works for 32-bit targets: given the small and finite
size of the address space, it is possible to use a lookup table to
reduce the average case for memory-region lookups to O(1) by just
feeding the high-order bits of the address through the LUT.
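a sketch of that LUT (struct contents and names are illustrative; with 16MB regions, 4GB / 16MB = 256 slots, indexed by the top 8 address bits):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit lookup: the top 8 bits of an address index a
 * 256-entry table mapping it straight to its 16MB heap region. */
typedef struct region { uint32_t base; } region_t;

static region_t *region_lut[256];   /* 4GB / 16MB = 256 slots */

static region_t *find_region(uint32_t addr) {
    return region_lut[addr >> 24];  /* O(1), no tree walk */
}
```

on a 64-bit target the same table would need far too many entries, which is why the tree remains the general-purpose answer there.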

this strategy fails hard though on 64-bit targets, as the address space
is too big to effectively represent using a lookup table. similarly, the
overall savings from using a lookup here tend to be fairly modest.


IME: I don't seem to be having any serious issues with fragmentation,
but haven't really looked into the matter all that closely.


for smaller embedded targets, or different use-cases, assumptions will
need to be different.
r***@gmail.com
2016-04-23 16:35:49 UTC
Post by BartC
Post by r***@gmail.com
What does the C standard say about this kind of occurrence?
char* x = malloc(10);
free(x + 1);
Will the free() algorithm look at the exact pointer address, or a pointer
within the range of allocated memory?
I've no idea but wouldn't expect it to be mandatory for an
implementation to support it.
It would be grossly inefficient.
Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.
There might be ways of making that more efficient, but it's a lot easier
to insist that p points to the start of an allocation.
Right now the free() algorithm would be looking for an exact value,
so why not alter that search algorithm to look for a value that's >=
the pointer value, but less than the next one? That would give you
a possible location for the pointer, and then it's just a matter of
comparing the range of start + length.
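A sketch of that kind of search over a sorted table of allocations (layout hypothetical; this per-free lookup is the cost being debated here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sorted table of live allocations (start address + length). */
typedef struct { uintptr_t start; size_t len; } block_t;

/* Binary search: find the block whose [start, start+len) range contains p,
 * or -1 if p is not inside any tracked block. */
int find_block(const block_t *tab, int n, uintptr_t p) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (p < tab[mid].start)                      hi = mid - 1;
        else if (p >= tab[mid].start + tab[mid].len) lo = mid + 1;
        else return mid;              /* start <= p < start + len */
    }
    return -1;
}
```

This is O(log n) per free rather than O(1), and the table must be kept sorted across every malloc and free, which is the overhead objected to below.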

It makes sense to me to store both start and length for allocated
blocks. CAlive also intends to add a managed memory pointer, which
is one additional layer of indirection, allowing for relocation as
needed at the sacrifice of a little performance.

Best regards,
Rick C. Hodgin
BartC
2016-04-23 17:49:18 UTC
Post by r***@gmail.com
Post by BartC
It would be grossly inefficient.
Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.
There might be ways of making that more efficient, but it's a lot easier
to insist that p points to the start of an allocation.
Right now the free() algorithm would be looking for an exact value,
It doesn't need to look for it; it's given the exact value.
Post by r***@gmail.com
so why not alter that search algorithm to look for a value that's >=
the pointer value, but less than the next one?
Look where? In a global list of SORTED allocations? And whereabouts in
the list would it start; from the beginning?

That would then need extra effort to keep sorted too, as blocks are
freed and allocated.
Post by r***@gmail.com
That would give you
a possible location for the pointer, and then it's just a matter of
comparing the range of start + length.
Free() is something that should just take a handful of lines to execute.
But you're going to make it execute some algorithm? You do seem to like
making life difficult!
Post by r***@gmail.com
It makes sense to me to store both start and length for allocated
blocks.
(The allocators I write (usually on top of malloc/free or the OS
equivalent) don't maintain a list of allocated blocks, nor of the
(requested) allocated sizes.

Both these bits of information are known to the program and simply
passed to the deallocator function. There is no overhead other than
over-allocation as block sizes are rounded up.

However, that information /is/ stored for deallocated blocks.)
--
Bartc
r***@gmail.com
2016-04-23 18:42:33 UTC
Post by BartC
Post by r***@gmail.com
Post by BartC
It would be grossly inefficient.
Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.
There might be ways of making that more efficient, but it's a lot easier
to insist that p points to the start of an allocation.
Right now the free() algorithm would be looking for an exact value,
It doesn't need to look for it; it's given the exact value.
True.
Post by BartC
Post by r***@gmail.com
so why not alter that search algorithm to look for a value that's >=
the pointer value, but less than the next one?
Look where? In a global list of SORTED allocations? And whereabouts in
the list would it start; from the beginning?
When you describe it that way, you make it sound like a bad idea. :-)
Post by BartC
That would then need extra effort to keep sorted too, as blocks are
freed and allocated.
A three-tier, multi-way linked list would allow the containing block
to be narrowed down in a few operations.
Post by BartC
Post by r***@gmail.com
That would give you
a possible location for the pointer, and then it's just a matter of
comparing the range of start + length.
Free() is something that should just take a handful of lines to execute.
But you're going to make it execute some algorithm? You do seem to like
making life difficult!
I do not. But, I do look at things from the point of view of developers
and their needs. I think that the idea of imposing machine-needs upon
developers is, at times, an unnecessary consideration.

Suppose I were to implement a methodology which allocates and stores
memory as you indicate above (global list, sorted, search from the
beginning), given how much faster computers are today than they were
in the past, how fast would the code be that uses that allocation
scheme compared to the other scheme used on older, slower equipment?

My point is that in most cases things are pretty fast. So, there's
no reason to limit a developer's abilities in those cases when the
speed of the allowance doesn't greatly affect performance. In those
cases where speed is required, then use a fast algorithm.

It might be worth:

malloc(), free();
malloc2(), free2();

Where the second pair allows for the type of algorithm which permits
non-source pointers to be used.
Post by BartC
Post by r***@gmail.com
It makes sense to me to store both start and length for allocated
blocks.
(The allocators I write (usually on top of malloc/free or the OS
equivalent) don't maintain a list of allocated blocks, nor of the
(requested) allocated sizes.
Both these bits of information are known to the program and simply
passed to the deallocator function. There is no overhead other than
over-allocation as block sizes are rounded up.
However, that information /is/ stored for deallocated blocks.)
Makes sense. Sounds like a good algorithm.

Best regards,
Rick C. Hodgin
s***@casperkitty.com
2016-04-23 19:42:03 UTC
Post by r***@gmail.com
My point is that in most cases things are pretty fast. So, there's
no reason to limit a developer's abilities in those cases when the
speed of the allowance doesn't greatly affect performance. In those
cases where speed is required, then use a fast algorithm.
Compared with the effort of keeping track of what allocations are owned
by whom, and determining when allocations can safely be released, keeping
a pointer to the base address of the allocations one owns is a very small
burden. Having free(p) free an entire allocation containing p when p is
a pointer to one item inside it seems rather dangerous. A more useful
facility which would be more akin to the way some memory managers work
would be to have a "release" method which includes a size, and saying that
code may call "release" on any suitably-aligned address within an allocation
provided that it refrains from using the indicated storage for any purpose
whatsoever [the alignment requirement would ensure that free chunks are
always large enough to hold information about themselves].
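A sketch of such a "release" interface (names and free-list layout are hypothetical; the alignment check is what guarantees each released chunk can describe itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN 16  /* illustrative minimum chunk alignment and size */

/* Free list threaded through the released chunks themselves: the
 * alignment rule guarantees each chunk can hold this node. */
typedef struct chunk { struct chunk *next; size_t size; } chunk_t;
static chunk_t *free_list;

/* Release any suitably aligned sub-range of an allocation; the caller
 * promises never to touch [p, p+size) again. */
int release(void *p, size_t size) {
    if (((uintptr_t)p % ALIGN) || (size % ALIGN) || size < sizeof(chunk_t))
        return -1;            /* reject ranges that can't self-describe */
    chunk_t *c = p;
    c->size = size;
    c->next = free_list;
    free_list = c;
    return 0;
}
```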
r***@gmail.com
2016-04-23 19:50:39 UTC
Post by s***@casperkitty.com
Post by r***@gmail.com
My point is that in most cases things are pretty fast. So, there's
no reason to limit a developer's abilities in those cases when the
speed of the allowance doesn't greatly affect performance. In those
cases where speed is required, then use a fast algorithm.
Compared with the effort of keeping track of what allocations are owned
by whom, and determining when allocations can safely be released, keeping
a pointer to the base address of the allocations one owns is a very small
burden. Having free(p) free an entire allocation containing p when p is
a pointer to one item inside it seems rather dangerous. A more useful
facility which would be more akin to the way some memory managers work
would be to have a "release" method which includes a size, and saying that
code may call "release" on any suitably-aligned address within an allocation
provided that it refrains from using the indicated storage for any purpose
whatsoever [the alignment requirement would ensure that free chunks are
always large enough to hold information about themselves].
I was thinking that each entry would store its base pointer and length,
and then free() called with any pointer within an allocated block would
free the whole block.

We'll see though. I'll do some testing. :-)

Best regards,
Rick C. Hodgin
Ian Collins
2016-04-23 22:39:43 UTC
Post by r***@gmail.com
Post by BartC
Free() is something that should just take a handful of lines to execute.
But you're going to make it execute some algorithm? You do seem to like
making life difficult!
I do not. But, I do look at things from the point of view of developers
and their needs. I think that the idea of imposing machine-needs upon
developers is, at times, an unnecessary consideration.
What possible benefit would there be for developers? When, except for
in a contrived example, would I want to allocate something and free
something else?

Memory is just one of many resources that get allocated and freed, why
should it be treated any differently?
--
Ian Collins
r***@gmail.com
2016-04-25 00:25:57 UTC
Post by Ian Collins
What possible benefit would there be for developers? When,
except for in a contrived example, would I want to allocate
something and free something else?
In one of my interpreters, I use a database which stores data in
this type of form:

field=value
name="Rick"
age=46

It is by design so it's readily editable manually by the users who
have the system. As such, I have to be able to parse things which
may have some errors, whitespaces, or quotes in them, which can
be removed using:

char* field = malloc(length);

trim_whitespaces(&field, &length);
trim_quotes(&field, &length);

And the actual implementation wraps all of that into a "struct SDatum"
which has char* and int length built in, along with trim() functions
designed expressly for this purpose.

The trim() functions could increment char* and decrement length for
each character to trim on the left, and just decrement length for the
ones on the right.
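A possible implementation of the trim_whitespaces() above (the signature is from the post; the body is my guess at it). Note that afterwards *field no longer points at the start of the allocation, so the caller must keep the original malloc() pointer around to free it:

```c
#include <assert.h>
#include <ctype.h>

/* Trim in the style described: advance the pointer past leading
 * whitespace and shrink the length from both ends, without copying. */
void trim_whitespaces(char **field, int *length) {
    while (*length > 0 && isspace((unsigned char)**field)) {
        (*field)++;     /* left trim: move the pointer forward... */
        (*length)--;    /* ...and shrink the length to match */
    }
    while (*length > 0 && isspace((unsigned char)(*field)[*length - 1]))
        (*length)--;    /* right trim: only the length shrinks */
}
```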
Post by Ian Collins
Memory is just one of may resources that get allocated and freed,
why should it be treated any differently?
Because it could be useful to handle it differently in certain cases.

I'm not opposed to malloc() and free() working on exact pointers,
and malloc2(), and free2() working on compound pointer+length
ranges.

Best regards,
Rick C. Hodgin
BartC
2016-04-25 10:06:25 UTC
Post by r***@gmail.com
Post by Ian Collins
What possible benefit would there be for developers? When,
except for in a contrived example, would I want to allocate
something and free something else?
In one of my interpreters, I use a database which stores data in
field=value
name="Rick"
age=46
It is by design so it's readily editable manually by the users who
have the system. As such, I have to be able to parse things which
may have some errors, whitespaces, or quotes in them, which can
char* field = malloc(length);
trim_whitespaces(&field, &length);
trim_quotes(&field, &length);
And the actual implementation wraps all of that into a "struct SDatum"
which has char* and int length built in, along with trim() functions
designed expressly for this purpose.
The trim() functions could increment char* and decrement length for
each character to trim on the left, and just decrement length for the
ones on the right.
Then SDatum should include a pointer to the original allocation, if the
string could be 'sliced' in-place - reduced to a substring.

Or the trim functions can simply allocate a new string, and delete the old.

Or the trim functions can shift the resulting string to the left so that
it still starts at the beginning of the allocation.
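That third option is a one-liner (sketch; name and signature are mine):

```c
#include <string.h>

/* Slide the surviving bytes back to the start of the allocation so the
 * original malloc() pointer remains the one passed to free(). */
void shift_to_base(char *base, const char *p, int len) {
    memmove(base, p, (size_t)len);  /* memmove: source and dest may overlap */
    base[len] = '\0';
}
```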

Or, to get more sophisticated (but you say this is inside an
interpreter), trim could create a slice into the original string,
complete with reference counts to stop the original being freed until
the trimmed copy relinquishes the slice.

These are all things I might look at before thinking of messing about
with a proven and efficient memory allocator.
--
Bartc
r***@gmail.com
2016-04-25 10:32:02 UTC
[snip]
All those things you wrote are the workarounds, and I use one of them.
As I was coding the workaround, the malloc2() / free2() solution occurred
to me as another viable alternative.

I also discovered Microsoft's memory allocator will free the block by
passing a pointer within base <= ptr <= base + length.
These are all things I might look at before thinking of messing
about with a proven and efficient memory allocator.
It's not about efficiency, but rather utility. I don't want to CHANGE the
existing algorithm, but rather give a new option to allow for this feature.

Best regards,
Rick C. Hodgin
Dag-Erling Smørgrav
2016-04-29 12:23:46 UTC
Post by r***@gmail.com
I also discovered Microsoft's memory allocator will free the block by
passing a pointer within base <= ptr <= base + length.
I hope the second <= is a typo...

DES
--
Dag-Erling Smørgrav - ***@des.no
r***@gmail.com
2016-04-29 12:25:46 UTC
Post by Dag-Erling Smørgrav
Post by r***@gmail.com
I also discovered Microsoft's memory allocator will free the block by
passing a pointer within base <= ptr <= base + length.
I hope the second <= is a typo...
I caught that after I posted it. I wondered if anybody else would. Yes.
It's supposed to be "base <= ptr < base + length".

Good catch. :-)

Best regards,
Rick C. Hodgin
BartC
2016-04-29 13:31:18 UTC
Post by r***@gmail.com
Post by Dag-Erling Smørgrav
Post by r***@gmail.com
I also discovered Microsoft's memory allocator will free the block by
passing a pointer within base <= ptr <= base + length.
I hope the second <= is a typo...
I caught that after I posted it. I wondered if anybody else would. Yes.
It's supposed to be "base <= ptr < base + length".
Which allocator is this: is it malloc() inside MSVCRT.DLL? Because that
one crashes if passed a pointer within the allocation rather than to the
start?

Or do you mean something like HeapFree()? Those require a heap handle
as well as the pointer, and probably have access to VM tables to make it
easier to do what you want.
--
Bartc
r***@gmail.com
2016-04-29 13:45:26 UTC
Post by BartC
Post by r***@gmail.com
Post by Dag-Erling Smørgrav
Post by r***@gmail.com
I also discovered Microsoft's memory allocator will free the block by
passing a pointer within base <= ptr <= base + length.
I hope the second <= is a typo...
I caught that after I posted it. I wondered if anybody else would. Yes.
It's supposed to be "base <= ptr < base + length".
Which allocator is this: is it malloc() inside MSVCRT.DLL? Because that
one crashes if passed a pointer within the allocation rather than to the
start?
Yes. However, when I tested it just now it crashed too.

I discovered it the other day by accident by doing the thing I am
proposing in this thread, just to see what would happen. It worked,
but I thought it might be a nonstandard Microsoft extension, hence
my post.

But you're right. When I tried it just now in a simple 10 line project,
it failed. Perhaps I had the correct pointer in my code, and just
thought it was offset.

My apologies if so.
Post by BartC
Or do you mean something like HeapFree()? Those require a heap handle
as well as the pointer, and probably have access to VM tables to make it
easier to do what you want.
Nope. Just regular malloc() and free(), but it was compiled using a .cpp
file extension. Only difference.

Best regards,
Rick C. Hodgin
BGB
2016-04-25 15:57:41 UTC
Post by BartC
Post by r***@gmail.com
Post by Ian Collins
What possible benefit would there be for developers? When,
except for in a contrived example, would I want to allocate
something and free something else?
In one of my interpreters, I use a database which stores data in
field=value
name="Rick"
age=46
It is by design so it's readily editable manually by the users who
have the system. As such, I have to be able to parse things which
may have some errors, whitespaces, or quotes in them, which can
char* field = malloc(length);
trim_whitespaces(&field, &length);
trim_quotes(&field, &length);
And the actual implementation wraps all of that into a "struct SDatum"
which has char* and int length built in, along with trim() functions
designed expressly for this purpose.
The trim() functions could increment char* and decrement length for
each character to trim on the left, and just decrement length for the
ones on the right.
Then SDatum should include a pointer to the original allocation, if the
string could be 'sliced' in-place - reduced to a substring.
Or the trim functions can simply allocate a new string, and delete the old.
Or the trim functions can shift the resulting string to the left so that
it still starts at the beginning of the allocation.
Or, to get more sophisticated (but you say this is inside an
interpreter), trim could create a slice into the original string,
complete with reference counts to stop the original being freed until
the trimmed copy relinquishes the slice.
There are all things I might look at before thinking of messing about
with a proven and efficient memory allocator.
in a new VM I am working on, strings and arrays may be "offset".

in this case, the array reference consists of a heap-index (*) to the
array, as well as an offset relative to the base of the array.

this allows both ability to walk arrays similar to C pointers, but also
to reasonably efficiently do bounds checking and similar.

internally, the offsets are represented as bytes, and bounds-checks are
done in terms of bytes, as it is possible at the language level to cast
arrays to different types (also like pointers), while still enforcing a
baseline level of type-safety (the VM will know when an array is
accessed out-of-type, and can employ "sanity checks" as needed).

originally I had it element-aligned, but switched to byte-aligned as
this allows casting an int[] array to byte[], which would not be
possible otherwise (it would only allow casting to types with a larger
alignment).

this also allows for misaligned array access, but I haven't really
decided on policies for this yet.


*: because I need to fit both an array and a reasonably large offset
into a 64-bit reference on a 64-bit target, the array is encoded as an
index identifying the heap object (at the cost of currently imposing
limits on heap and array sizes, but I may later add an overflow case or
similar).
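
roughly, the heap-index + byte-offset idea can be sketched in C like
this (a minimal illustration only -- the struct layout and function
names here are assumptions, not the VM's actual internals):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* illustrative array object: payload plus its size in bytes */
typedef struct { uint8_t data[64]; uint32_t size_bytes; } VmArray;

/* illustrative reference: which heap object, plus a byte offset,
   mirroring the heap-index + offset encoding described above */
typedef struct { VmArray *obj; uint32_t offset; } ArrayRef;

/* bounds checks are done in bytes, so the same reference can be
   accessed as byte[] or (when aligned) as int[]; returns 0 on a
   bounds violation, 1 on success */
int vm_load_i32(ArrayRef r, int32_t *out) {
    if (r.offset + 4 > r.obj->size_bytes) return 0;
    memcpy(out, r.obj->data + r.offset, 4);
    return 1;
}

int vm_load_u8(ArrayRef r, uint8_t *out) {
    if (r.offset + 1 > r.obj->size_bytes) return 0;
    *out = r.obj->data[r.offset];
    return 1;
}
```

since the check is in bytes, casting an int[] reference to byte[] just
reuses the same offset and size, which is the point of the byte-aligned
representation described above.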

strings are generally represented as offsets into string-table objects,
which store a series of strings end-to-end. a downside of this though,
is that this scheme doesn't allow efficiently bounds-checking individual
strings.

while it could be possible to store string lengths, there isn't a good
way to find the start of the string if one has an offset into the body
of the string. the tradeoff is either doing slower string checks, or not
bother and potentially allow operations on strings to jump to other
strings (rather than trap and throw an exception or similar).

another possible option would be allocating strings individually, but
this would be rather expensive for the common case of short strings.

some things still TBD...
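
a rough C sketch of the end-to-end string-table idea (names and layout
are illustrative assumptions, not the actual VM):

```c
#include <assert.h>
#include <string.h>

/* strings are stored end-to-end in one buffer; a "string reference"
   is just a byte offset into that buffer */
typedef struct {
    char   buf[4096];
    size_t used;
} StrTab;

/* append a NUL-terminated string; returns its offset, or -1 if full */
long strtab_add(StrTab *t, const char *s) {
    size_t n = strlen(s) + 1;   /* keep the terminator */
    if (t->used + n > sizeof t->buf) return -1;
    memcpy(t->buf + t->used, s, n);
    long off = (long)t->used;
    t->used += n;
    return off;
}

const char *strtab_get(const StrTab *t, long off) {
    return t->buf + off;
}
```

note that an offset into the middle of a string silently yields a
suffix, and nothing distinguishes a bad offset that lands inside a
neighboring string -- exactly the per-string bounds-checking problem
described above.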
s***@casperkitty.com
2016-04-29 14:55:14 UTC
Post by BGB
in a new VM I am working on, strings and arrays may be "offset".
in this case, the array reference consists of a heap-index (*) to the
array, as well as an offset relative to the base of the array.
Such designs often have a significant speed penalty compared with designs
that use direct pointers, but can allow reasonably effective sandboxing
without having to guard the storage in which user code is storing its
"pointers", provided that user code should be allowed to access the
entirety of any object to which it might need to be given a pointer.

The classic Macintosh used an approach somewhat like that for its memory
management--a lot of code would allocate handles, which from the client's
point of view could simply be used as double-indirect pointers, with the
caveat that client code wishing to use the content of the handle must do
one of two things:

1. Lock the handle prior to use and unlock it after; the system maintained
for each handle a lock count, so there was no harm in locking a handle
that was already locked, but unlocking operations needed to be balanced
with locking ones and there was no safe way to recover if things got
unbalanced.

2. Code needed to refetch the master pointer associated with a handle any
time certain system functions may have been called directly or indirectly.
There would be no problem with:

char *st;
st = *(char**)myHandle; // Assuming it's known to be 10 or more bytes
for (int i=0; i<10; i++)
st[i] = i+'0';

but code which wanted to do something more sophisticated with each
character in **st would have to do something like either [I forget the
API method names]

char *st;
LockHandle(myHandle);
st = *(char**)myHandle; // Assuming it's known to be 10 or more bytes
for (int i=0; i<10; i++)
show_char(myGrafPort, i*20, 10, st[i]);
UnlockHandle(myHandle);

or

char *st;
for (int i=0; i<10; i++)
{
st = *(char**)myHandle; // Assuming it's known to be 10 or more bytes
show_char(myGrafPort, i*20, 10, st[i]);
}

The first workaround would be unsuitable for use in any kind of system where
multiple simultaneously-running threads can create heap objects, but can be
much more efficient than the second if cooperative multitasking would
suffice. A slight variation, used by PalmOS, is to have the LockHandle
function return a pointer to the storage allocated thereby.
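
The lock-count scheme can be sketched in portable C as follows. The
names mimic, but are not, the actual Toolbox API, and the compaction
step is merely simulated here:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One master-pointer entry per handle; the payload may be moved
   (e.g. by a compactor) only while lock_count == 0. */
typedef struct {
    void *master;     /* current address of the payload */
    int   lock_count; /* nested locks must be balanced */
} HandleRec;

typedef HandleRec *Handle;

Handle NewHandle(size_t size) {
    Handle h = malloc(sizeof *h);
    h->master = malloc(size);
    h->lock_count = 0;
    return h;
}

void LockHandle(Handle h)   { h->lock_count++; }
void UnlockHandle(Handle h) { assert(h->lock_count > 0); h->lock_count--; }

/* Simulates the memory manager relocating an unlocked block;
   clients must refetch the master pointer after any call that
   might trigger this. */
void CompactMove(Handle h, size_t size) {
    if (h->lock_count == 0) {
        void *n = malloc(size);
        memcpy(n, h->master, size);
        free(h->master);
        h->master = n;
    }
}
```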
BGB
2016-04-29 23:16:36 UTC
Post by s***@casperkitty.com
Post by BGB
in a new VM I am working on, strings and arrays may be "offset".
in this case, the array reference consists of a heap-index (*) to the
array, as well as an offset relative to the base of the array.
Such designs often have a significant speed penalty compared with designs
that use direct pointers, but can allow reasonably effective sandboxing
without having to guard the storage in which user code is storing its
"pointers", provided that user code should be allowed to access the
entirety of any object to which it might need to be given a pointer.
it was done this way mostly because I wanted bounds-checked access, and
had limited space in the references.

I could have done checking via reverse lookups, but this is even more
expensive. likewise, heap-allocated box-pointers are more expensive than
a packed heap-index and offset.

on a 32-bit target, a pointer to the array-object can be used instead of
an index (the index is 32 bits, so could hold a raw 32-bit pointer).

for the small heap, this encodes a 31-bit cell index (allowing up to
32GB for the small-object heap).
for the medium and large object heaps, it is a 30 bit object index
(256GB for medium heap, TB for large heap).


the cost shouldn't be *that* high, at least beyond what is already
implied by this VM running stuff in an interpreter.

later, I may consider direct-pointers as an alternative for cases where
bounds can be statically determined to be in range.


this VM uses 64-bit tagged references for object references.
for most normal objects, the layout is:
Bits 0-47: Object Address
Bits 48-59: Object Type Tag (0=untagged pointer)
Bits 60-63: Reference Type Tag (0=object pointer)

another pointer type (15 or F), allows pointers with a 60 bit address.

there are also 62-bit fixlong and flonum types, as well as narrower
(32-bit) fixint and fixfloat types. there is also a 56-bit rotlong,
which uses rotation and fill patterns to try to avoid the boxlong
fallback.

however, this VM is statically typed and primarily uses untagged
representations for most values.

object references are generally always tagged though, mostly because
keeping the object tag in the reference allows faster access than
fetching it from the memory manager.

some of these tag-cases are likewise used for arrays:
Bits 0-31: Array Heap ID (or Pointer)
Bits 32-59: Array Base Offset
Bits 60-63: Reference Type Tag (12=Tag Array)

the result is currently a 256MB array size limit. possibly later, for
arrays larger than this, would either need to add an overflow case, or
possibly split larger arrays into a number of smaller parts (this
reference structure would then only refer to a part of an array at a
time in this case).

currently, this is unaddressed, as it is currently fairly unlikely that
script code will need to use arrays this large (and, if I did, would
probably then also need to address the current use of 32-bit integers
for array index/offset operations, which ATM causes a 2GB limit, and
probably add some operators which accept 'long' arguments).
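
packing and unpacking the array-reference layout above might look
roughly like this in C (a sketch of the quoted bit layout only; the
helper names are invented):

```c
#include <assert.h>
#include <stdint.h>

#define TAG_ARRAY 12u

/* pack a heap ID and a byte offset into one 64-bit reference:
   bits 0-31 heap ID, bits 32-59 offset, bits 60-63 tag.
   the offset is limited to 28 bits, hence the 256MB array limit. */
uint64_t mk_array_ref(uint32_t heap_id, uint32_t offset) {
    assert(offset < (1u << 28));
    return (uint64_t)heap_id
         | ((uint64_t)offset << 32)
         | ((uint64_t)TAG_ARRAY << 60);
}

uint32_t ref_heap_id(uint64_t r) { return (uint32_t)r; }
uint32_t ref_offset(uint64_t r)  { return (uint32_t)((r >> 32) & 0x0FFFFFFFu); }
uint32_t ref_tag(uint64_t r)     { return (uint32_t)(r >> 60); }
```

walking an array then just means adding to the offset field and
re-packing, with the bounds check done against the heap object that
the ID resolves to.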
Post by s***@casperkitty.com
The classic Macintosh used an approach somewhat like that for its memory
management--a lot of code would allocate handles, which from the client's
point of view could simply be used as double-indirect pointers, with the
caveat that client code wishing to use the content of the handle must do
1. Lock the handle prior to use and unlock it after; the system maintained
for each handle a lock count, so there was no harm in locking a handle
that was already locked, but unlocking operations needed to be balanced
with locking ones and there was no safe way to recover if things got
unbalanced.
2. Code needed to refetch the master pointer associated with a handle any
time certain system functions may have been called directly or indirectly.
char *st;
st = *(char**)myHandle; // Assuming it's known to be 10 or more bytes
for (int i=0; i<10; i++)
st[i] = i+'0';
but code which wanted to do something more sophisticated with each
character in **st would have to do something like either [I forget the
API method names]
char *st;
LockHandle(myHandle);
st = *(char**)myHandle; // Assuming it's known to be 10 or more bytes
for (int i=0; i<10; i++)
show_char(myGrafPort, i*20, 10, st[i]);
UnlockHandle(myHandle);
or
char *st;
for (int i=0; i<10; i++)
{
st = *(char**)myHandle; // Assuming it's known to be 10 or more bytes
show_char(myGrafPort, i*20, 10, st[i]);
}
The first workaround would be unsuitable for use in any kind of system where
multiple simultaneously-running threads can create heap objects, but can be
much more efficient than the second if cooperative multitasking would
suffice. A slight variation, used by PalmOS, is to have the LockHandle
function return a pointer to the storage allocated thereby.
in this VM, there is no locking or unlocking of memory in this way.

there are some mutexes (spinlocks or CriticalSection objects) and some
thread-local state, but this is about it.
s***@casperkitty.com
2016-04-30 02:24:12 UTC
Post by BGB
Post by s***@casperkitty.com
Such designs often have a significant speed penalty compared with designs
that use direct pointers, but can allow reasonably effective sandboxing
without having to guard the storage in which user code is storing its
"pointers", provided that user code should be allowed to access the
entirety of any object to which it might need to be given a pointer.
it was done this way mostly because I wanted bounds-checked access, and
had limited space in the references.
I could have done checking via reverse lookups, but this is even more
expensive. likewise, heap-allocated box-pointers are more expensive than
a packed heap-index and offset.
I wasn't trying to suggest that the approach wasn't a good one. The costs
are significant, but for some combinations of semantic requirements there
may not be a cheaper alternative.
Post by BGB
on a 32-bit target, a pointer to the array-object can be used instead of
an index (the index is 32 bits, so could hold a raw 32-bit pointer).
Even on a larger target, a scaled pointer could be used. The biggest issue
with a pointer or scaled pointer is that there's no efficient way to ensure
that an arbitrary bit pattern can't be interpreted as a pointer to arbitrary
memory. It's not possible for a compliant C compiler to completely prevent
arbitrary bit patterns from being interpreted as pointers, but using handles
will make it possible to ensure that such pointers can only identify areas of
memory the program was supposed to be able to access.
Post by BGB
for the small heap, this encodes a 31-bit cell index (allowing up to
32GB for the small-object heap).
for the medium and large object heaps, it is a 30 bit object index
(256GB for medium heap, TB for large heap).
An alternative would be to use pointers with different scale factors for each
heap.
Post by BGB
the cost shouldn't be *that* high, at least beyond what is already
implied by this VM running stuff in an interpreter.
It's significant, but sometimes it's worthwhile.
Post by BGB
Post by s***@casperkitty.com
The classic Macintosh used an approach somewhat like that for its memory
management--a lot of code would allocate handles....
in this VM, there is no locking or unlocking of memory in this way.
there are some mutexes (spinlocks or CriticalSection objects) and some
thread-local state, but this is about it.
Handle locking isn't used for critical sections, but instead is a way to
avoid having to do repeated double-indirect lookups. The cost savings
from avoiding repeated double indirection is nowhere nearly as significant
on a modern processor as it was on a Motorola 68000 or ColdFire processor,
but I still find the approach interesting.
BGB
2016-04-30 15:54:15 UTC
Post by s***@casperkitty.com
Post by BGB
Post by s***@casperkitty.com
Such designs often have a significant speed penalty compared with designs
that use direct pointers, but can allow reasonably effective sandboxing
without having to guard the storage in which user code is storing its
"pointers", provided that user code should be allowed to access the
entirety of any object to which it might need to be given a pointer.
it was done this way mostly because I wanted bounds-checked access, and
had limited space in the references.
I could have done checking via reverse lookups, but this is even more
expensive. likewise, heap-allocated box-pointers are more expensive than
a packed heap-index and offset.
I wasn't trying to suggest that the approach wasn't a good one. The costs
are significant, but for some combinations of semantic requirements there
may not be a cheaper alternative.
yeah.

for this VM I wanted array operations to be bounds-checked.
this along with the ability to walk arrays and cast them to other types,
similar to C pointers (this VM is running a custom language, rather than
C, but to some extent the language resembles a C/Java hybrid). some
other VM's of mine have run C though.


unlike C, given that out-of-type access can be detected, it is possible
to sanity-check any pointers or object references coming from memory
which was accessed out-of-type (such as verifying that each is valid and
represents an object the client code has the needed access rights to).

similarly, constraints can be imposed on what can hold references to
what, and in other cases checks would be used on access.


( decided to leave out a big chunk about security and memory access
rights and similar )

(WRT access rights checking) this is probably more of a future thing
though at this point with this VM. more immediate priority is getting
everything working well.
Post by s***@casperkitty.com
Post by BGB
on a 32-bit target, a pointer to the array-object can be used instead of
an index (the index is 32 bits, so could hold a raw 32-bit pointer).
Even on a larger target, a scaled pointer could be used. The biggest issue
with a pointer or scaled pointer is that there's no efficient way to ensure
that an arbitrary bit pattern can't be interpreted as a pointer to arbitrary
memory. It's not possible for a compliant C compiler to completely prevent
arbitrary bit patterns from being interpreted as pointers, but using handles
will make it possible to ensure that such pointers can only identify areas of
memory the program was supposed to be able to access.
yep. doing C in a VM with full access checks is a little harder, as it
is typically necessary to validate pointers much more frequently.

an old x86 interpreter/emulator of mine at one point had used a trick
for optimizing memory-access operations, where it maintained side-info
with the registers to cache pointers to the backing memory (to limit
trips through the MMU logic and similar, *).

so, it is a similar sort of idea.


*: if you load something into a register, subsequent accesses can
"remember", so a full trip isn't necessary on each access, only when
the value is loaded from somewhere "unknown" or when the "simple
case" doesn't check out (ex: is outside the bounds of the span
previously returned by the MMU).

likewise, for the most part only the MMU logic needed to care what was
in the segment registers (for 32-bit x86, it is possible to almost
entirely ignore segment registers, and use a partially separate backend
for 16-bit operation).


FWIW: this was also one of my first interpreters to use call-threaded
code (and split apart decoding from executing instructions, partly
initially because the x86 decoding step was reasonably expensive).

my newer interpreters are still generally faster (things got refined a
little more). this includes some newer emulators as well (generally for
micro-controllers and similar).
Post by s***@casperkitty.com
Post by BGB
for the small heap, this encodes a 31-bit cell index (allowing up to
32GB for the small-object heap).
for the medium and large object heaps, it is a 30 bit object index
(256GB for medium heap, TB for large heap).
An alternative would be to use pointers with different scale factors for each
heap.
the heap indices essentially were scaled.

for example, for the small-object heap:
Bit 0: 0
Bit 1-20: Cell Index (multiple of 16 bytes)
Bit 21-31: Identifies the heap region (multiple of 16MB)

unpacking the reference basically consists of fetching the region base
and adding in the cell index.

similar applied to the medium heap, just with a multiple of 256 bytes
(and 65536 cells per region vs 1M cells per region).

the large object heap held an ID number for the memory object (which
gives the index of the respective AVL node).


in all these cases, decoding the index to a pointer is O(1).
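
for illustration, the small-heap decode could look something like this
in C (the region table and names are assumptions, not the actual
implementation):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* base address of each 16MB small-object region; 11 region bits
   gives 2048 regions, i.e. the 32GB small-heap limit above */
uint8_t *region_base[2048];

/* decode the quoted layout: bit 0 = 0, bits 1-20 = cell index
   (16-byte cells), bits 21-31 = region number.
   O(1): one table fetch plus an add. */
void *decode_small_ref(uint32_t ref) {
    uint32_t cell   = (ref >> 1) & 0xFFFFFu;  /* 20-bit cell index */
    uint32_t region = (ref >> 21) & 0x7FFu;   /* 11-bit region id  */
    return region_base[region] + (size_t)cell * 16;
}
```

20 cell bits times 16-byte cells is exactly the 16MB region size, so
the two fields tile the address space with no wasted encodings.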
Post by s***@casperkitty.com
Post by BGB
the cost shouldn't be *that* high, at least beyond what is already
implied by this VM running stuff in an interpreter.
It's significant, but sometimes it's worthwhile.
in a past VM which worked a vaguely similar way, in its JIT compiled
form code could approach similar performance to that of natively
compiled C (with a C-like coding style), despite some of its design
handicaps in this area.

part of this is basically having the VM/JIT be smart enough to realize
where it can skip using more costly representations.


for an interpreter, there are much bigger issues to be concerned about,
and even if done well, there will still be a ~10x to 15x slowdown vs native.
Post by s***@casperkitty.com
Post by BGB
Post by s***@casperkitty.com
The classic Macintosh used an approach somewhat like that for its memory
management--a lot of code would allocate handles....
in this VM, there is no locking or unlocking of memory in this way.
there are some mutexes (spinlocks or CriticalSection objects) and some
thread-local state, but this is about it.
Handle locking isn't used for critical sections, but instead is a way to
avoid having to do repeated double-indirect lookups. The cost savings
from avoiding repeated double indirection is nowhere nearly as significant
on a modern processor as it was on a Motorola 68000 or ColdFire processor,
but I still find the approach interesting.
ok. a similar cache-as-pointer approach is typically used here.

there is no need to "lock" memory in any sense though, as the memory
does not move around, nor is there any form of exclusivity towards its
access.
s***@casperkitty.com
2016-04-30 19:30:15 UTC
Post by BGB
yep. doing C in a VM with full access checks is a little harder, as it
is typically necessary to validate pointers much more frequently.
It is difficult to sandbox C; it would be much easier to sandbox a language
that was very close to C, but added some restrictions and possibly some new
types as well. A secure language should not allow the construction of
possibly-invalid pointers. While there are some applications where it is
necessary to construct pointers from numbers in ways that cannot possibly
be validated, many other applications would have no need for such a thing
given a type-agnostic means of copying objects that may include pointers.
The relationship between C's numeric sizes and common machine word sizes
makes things a bit awkward, but if one had a system with 31-bit "char",
"short", and "int" types, 62-bit "long", and 93-bit "long long" and added
a restriction that storing into the storage occupied by a pointer the
result of any integer operator would invalidate the pointer, one could then
represent each pointer as a 31-bit object ID whose upper word was set and
a displacement whose upper word was clear. Loading the first word of a
pointer into a "char" would yield one whose upper bit was invisibly set,
but the only way code could ever get its hands on such a value would be
to load it from a valid pointer. Dangling pointers could be avoided by
ensuring that before any object-id gets reused, memory would be searched
for no-longer-valid object ids, and any such ids found would be invalidated.
Post by BGB
the heap indices essentially were scaled.
Bit 0: 0
Bit 1-20: Cell Index (multiple of 16 bytes)
Bit 21-31: Identifies the heap region (multiple of 16MB)
unpacking the reference basically consists of fetching the region base
and adding in the cell index.
I've long thought the 8086 segmentation design was under-appreciated (though
I admit I was among those under-appreciating it in its day, I have yet to
see any better approach for accessing 1MB of addressing space on a 16-bit
machine). The 80286 design was a step back in terms of usability, and
except for emulation purposes the 80386 machine followed the 80286 approach
while ignoring what was good about the 8086 approach. What you describe is
like what I wish the 80386 could have supported at a hardware level, though
I would probably have had a smaller number of heap regions so as to allow
descriptors for all of them to be kept on-chip.
Post by BGB
there is no need to "lock" memory in any sense though, as the memory
does not move around, nor is there any form of exclusivity towards its
access.
Allowing objects to move around can really help ease problems with memory
fragmentation, especially in resource-constrained systems. The cost of
supporting relocatable memory on the Macintosh was pretty reasonable given
the extent to which it could ease fragmentation.
BGB
2016-04-30 23:20:58 UTC
Post by s***@casperkitty.com
Post by BGB
yep. doing C in a VM with full access checks is a little harder, as it
is typically necessary to validate pointers much more frequently.
It is difficult to sandbox C; it would be much easier to sandbox a language
that was very close to C, but added some restrictions and possibly some new
types as well. A secure language should not allow the construction of
possibly-invalid pointers. While there are some applications where it is
necessary to construct pointers from numbers in ways that cannot possibly
be validated, many other applications would have no need for such a thing
given a type-agnostic means of copying objects that may include pointers.
The relationship between C's numeric sizes and common machine word sizes
makes things a bit awkward, but if one had a system with 31-bit "char",
"short", and "int" types, 62-bit "long", and 93-bit "long long" and added
a restriction that storing into the storage occupied by a pointer the
result of any integer operator would invalidate the pointer, one could then
represent each pointer as a 31-bit object ID whose upper word was set and
a displacement whose upper word was clear. Loading the first word of a
pointer into a "char" would yield one whose upper bit was invisibly set,
but the only way code could ever get its hands on such a value would be
to load it from a valid pointer. Dangling pointers could be avoided by
ensuring that before any object-id gets reused, memory would be searched
for no-longer-valid object ids, and any such ids found would be invalidated.
the language I am working on is partly intended as a compromise between
some of the languages I have used.

I am primarily a C programmer, so C still weighs fairly heavily in the
language design (though, it isn't really meant to try to "replace" C, as
it is intended for different use cases).

I also use Java and C# to some extent, but find them at times to be a
bit dissatisfying (stuff that is fairly straightforward in C may be
rather annoying in C#, and a horrible mess in Java).

similarly, C++ also weighs in, though the language doesn't really aim to
recreate C++ (it is a somewhat different beast).


general:
syntax:
mixture of Java, C, and ActionScript;
tries to be fairly conservative here
I see little point in making the syntax needlessly weird
granted, some things depend on one's perspective / ...
parser works more like C# than like in C / C++ or Java
parsing depends solely on syntax, not on context
type-system:
statically-typed, mostly Java-like;
variant types will exist as a special case (*1).
a lot of C-like features (arrays may be offset and type-cast);
will support structs, which are mostly C#-like;
struct pointers may point into arrays;
like C#, they will not support inheritance or virtual methods.
object system:
mostly Java-like, single inheritance with interfaces
methods implicitly virtual (unless final or not overloaded)
will probably support default methods.
near-term, will probably not support generics
scope:
semantically, scoping model is C# like.
packages and images work similarly to namespaces and assemblies
compilation process is also similar to C#
modules link horizontally and declaration order is irrelevant
like C and C++, functions and variables may be declared globally
argument variables may be passed by reference (like C# and C++)
memory management:
explicit, manual, and region-based (*2)
in many cases, lifetime is made explicit in declarations and usage.
essentially 'automatic duration' and use of RAII like patterns
other cases are handled by regions or by manual new/delete.


*1: as noted, this is where the 62 bit fixlong and flonum types come
into play. the normal long and double types are 64-bit. smaller integer
types are represented at full precision as variant.

these exist mostly because sometimes one really does need it.
'variant' will in many ways function similarly to 'object' in C#, and is
used in basically a similar way.

core types are otherwise fixed size, with an attempt to nail down
numerical semantics (arithmetic expressions should ideally have well
defined results). some semantics here differ a bit from C (different
type promotion rules, minor differences in operator precedence rules, ...).


*2: traditional garbage collection can have rather undesirable effects
on performance and overall responsiveness, particularly in real-time
use-cases (this is one of the major intended application areas).

implicitly, the language design will assume that the VM provides for
memory and object management, but mostly leaves it to user code that
memory gets freed.

new/delete will be used, but objects will generally exist within
regions. specialized regions may be created, and when destroyed any
objects remaining within the region are also destroyed (as an
alternative to tracking them down and deleting them individually).

I had considered also supporting reference-counting, but ATM I have
reservations on this.
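
a minimal region (arena) sketch matching the description above -- bump
allocation inside a region, with everything still inside released at
once when the region is destroyed (names and the 16-byte alignment are
illustrative, not the planned implementation):

```c
#include <assert.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct Region {
    char  *buf;
    size_t used, cap;
} Region;

Region *region_create(size_t cap) {
    Region *r = malloc(sizeof *r);
    r->buf = malloc(cap);
    r->used = 0;
    r->cap = cap;
    return r;
}

/* bump-pointer allocation, 16-byte aligned; NULL when the region
   can't satisfy the request */
void *region_alloc(Region *r, size_t n) {
    n = (n + 15) & ~(size_t)15;
    if (r->used + n > r->cap) return NULL;
    void *p = r->buf + r->used;
    r->used += n;
    return p;
}

/* destroying the region frees all objects remaining within it,
   instead of tracking them down and deleting them individually */
void region_destroy(Region *r) {
    free(r->buf);
    free(r);
}
```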
Post by s***@casperkitty.com
Post by BGB
the heap indices essentially were scaled.
Bit 0: 0
Bit 1-20: Cell Index (multiple of 16 bytes)
Bit 21-31: Identifies the heap region (multiple of 16MB)
unpacking the reference basically consists of fetching the region base
and adding in the cell index.
I've long thought the 8086 segmentation design was under-appreciated (though
I admit I was among those under-appreciating it in its day, I have yet to
see any better approach for accessing 1MB of addressing space on a 16-bit
machine). The 80286 design was a step back in terms of usability, and
except for emulation purposes the 80386 machine followed the 80286 approach
while ignoring what was good about the 8086 approach. What you describe is
like what I wish the 80386 could have supported at a hardware level, though
I would probably have had a smaller number of heap regions so as to allow
descriptors for all of them to be kept on-chip.
the size of the heap depends a fair bit on how much memory is in use.
this part is meant to be fairly transparent to client code, and ideally
client code will not be munging around too much with references.

how much the VM lets code get away with on this front is likely to
depend on that code's level of "trust".
Post by s***@casperkitty.com
Post by BGB
there is no need to "lock" memory in any sense though, as the memory
does not move around, nor is there any form of exclusivity towards its
access.
Allowing objects to move around can really help ease problems with memory
fragmentation, especially in resource-constrained systems. The cost of
supporting relocatable memory on the Macintosh was pretty reasonable given
the extent to which it could ease fragmentation.
ok, makes sense. could be worth considering, though I am uncertain ATM.
it would have to be counter-balanced some with the cost of memory copying.

if working with a lot of memory, memory access patterns can easily
become a big factor for performance, and the delays caused by "memcpy()"
style operations can potentially become fairly significant (particularly
in real-time).

though, this is likely to depend a lot on the hardware and use-case.

main targets presently are x86, x86-64, and ARM.
s***@casperkitty.com
2016-05-01 09:03:21 UTC
Post by BGB
*1: as noted, this is where the 62 bit fixlong and flonum types come
into play. the normal long and double types are 64-bit. smaller integer
types are represented at full precision as variant.
I'd recommend using a different allocation of bit values: 61-bit references,
61-bit integers, and 63.5-bit floating-point values. Doing that will allow
all floating-point computations to yield precisely the same values as they
would with IEEE-64 except when the results exceed the range +/- 1.34E+154,
which of course they usually won't.
BGB
2016-05-01 15:00:42 UTC
Post by s***@casperkitty.com
Post by BGB
*1: as noted, this is where the 62 bit fixlong and flonum types come
into play. the normal long and double types are 64-bit. smaller integer
types are represented at full precision as variant.
I'd recommend using a different allocation of bit values: 61-bit references,
61-bit integers, and 63.5-bit floating-point values. Doing that will allow
all floating-point computations to yield precisely the same values as they
would with IEEE-64 except when the results exceed the range +/- 1.34E+154,
which of course they usually won't.
most normal double values have full precision anyways, except when cast
to variant. if a variable is declared as double, it uses a raw double.


the impact on the precision of shaving 2 bits off for doubles seems
pretty minor though (and in any case it is a serious improvement over
the 28 bit flonum type used for a long time by its predecessor). it
seems fairly unlikely most code would actually notice the difference.

however, IME, large integer values come up fairly often, so basically
want as close as I can get to the full 64 bits here.

boxed types exist as a fallback, which allow full precision but are more
expensive (as they can't be shoved into a reference and need to be
heap-allocated). similar is true of int128 and float128, which are
always boxed.
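
as a rough sketch (not the actual VM code), the 2-bit shave can be done
on the raw bit pattern, rounding to nearest on the dropped bits. this
ignores NaN/Inf edge cases and where the 2 freed-up tag bits actually go:

```c
#include <stdint.h>
#include <string.h>

/* hypothetical sketch: squeeze a double into a 62-bit payload by
   rounding away the low 2 mantissa bits (names are illustrative). */
static uint64_t pack_flonum(double d) {
    uint64_t u;
    memcpy(&u, &d, 8);      /* reinterpret the bits */
    return (u + 2) >> 2;    /* +2 rounds to nearest on the 2 dropped bits */
}

static double unpack_flonum(uint64_t v) {
    uint64_t u = v << 2;    /* restore bit positions; low 2 bits are gone */
    double d;
    memcpy(&d, &u, 8);
    return d;
}
```

values whose low 2 mantissa bits are already zero (like 1.5) round-trip
exactly; everything else is off by at most 2 ulp.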


double f, g, h;
variant v;
...
h=f*f+g; //full double precision

v=[f, g, h]; //loses 2 bits from each
h=(double)va[2]; //gets an 'h' with 2 fewer bits.

however:
v=[f, g, h]D; //keeps full precision
h=(double)va[2]; //no loss

likewise:
v=(variant)h; //loses 2 bits
h=(double)v; //lost 2 bits


the casts are a formality here, but the compiler may issue a warning if
no casts are used.


but, with variant, you can do:
if(v is double)
...
if(v is variant[])
...
...


side note (syntax):
[ ... ] declares an inline array, with no suffix element type is
inferred (typically depends on LHS or destination).
[ ... ]X declares an array, where the suffix indicates the element type.
[ ... ]:T also declares an array with an explicit type.
ex: "[1, 2, 3]:double"

note that partially as a holdover from ES/AS, it uses [...] for arrays,
and {...} for objects.

side note:
struct FooVec2 {
double x, y;
}

FooVec2 v2;
v2={x: 3, y: 4}:FooVec2;

though, a traditional constructor could also be used:
struct FooVec2 {
double x, y;
FooVec2(double x, double y)
{ this.x=x; this.y=y; }
}

FooVec2 v2;
v2=new FooVec2(3, 4);
s***@casperkitty.com
2016-05-01 16:41:06 UTC
Permalink
Post by BGB
the impact on the precision of shaving 2 bits off for doubles seems
pretty minor though (and in any case it is a serious improvement over
the 28 bit flonum type used for a long time by its predecessor). it
seems fairly unlikely most code would actually notice the difference.
Actually, from a numerical perspective shaving off one bit is often worse
than shaving off many more. Shaving off two bits isn't as bad, but it's
still often worse than shaving more.

Perfect rounding will yield a result that's always within 1/2 of a unit
in the last place (ulp) of the mathematically perfect result. If perfect
rounding is followed up by rounding off the last bit, however, the result
may be off by as much as 3/4 of a unit in the *new* last place. Rounding off
the last two bits may yield a result which is within 5/8 of a unit in the
new last place. Not as bad as 3/4, but still significantly worse than 1/2.
While some people mumble about the fact that systems which use extended
precision for intermediate computations may sometimes be off by 1025/2048
ulp, that's close enough to 1/2 that the advantages of using a higher-
precision temporary type *in languages that support such types properly*
would generally outweigh the disadvantages if the lack of proper language
support hadn't led hardware designers to effectively abandon such types.

If you do want to lop off more than half a bit, I'd suggest working with
scaled values so that you can give up bits off the exponent rather than
the mantissa. Doing so will force numbers into denormal range sooner than
would otherwise be required, but will avoid double-rounding issues for
values outside that range. If you can't do that, I'd suggest lopping off
four bits rather than two, so the worst-case rounding error would be 17/32
ulp rather than 5/8.
r***@gmail.com
2016-05-01 16:50:51 UTC
Permalink
Post by s***@casperkitty.com
Post by BGB
the impact on the precision of shaving 2 bits off for doubles seems
pretty minor though (and in any case it is a serious improvement over
the 28 bit flonum type used for a long time by its predecessor). it
seems fairly unlikely most code would actually notice the difference.
Actually, from a numerical perspective shaving off one bit is often worse
than shaving off many more. Shaving off two bits isn't as bad, but it's
still often worse than shaving more.
Perfect rounding will yield a result that's always within 1/2 of a unit
in the last place (ulp) of the mathematically perfect result. If perfect
rounding is followed up by rounding off the last bit, however, the result
may be off by as much as 3/4 of a unit in the *new* last place. Rounding off
the last two bits may yield a result which is within 5/8 of a unit in the
new last place. Not as bad as 3/4, but still significantly worse than 1/2.
While some people mumble about the fact that systems which use extended
precision for intermediate computations may sometimes be off by 1025/2048
ulp, that's close enough to 1/2 that the advantages of using a higher-
precision temporary type *in languages that support such types properly*
would generally outweigh the disadvantages if the lack of proper language
support hadn't led hardware designers to effectively abandon such types.
If you do want to lop off more than half a bit, I'd suggest working with
scaled values so that you can give up bits off the exponent rather than
the mantissa. Doing so will force numbers into denormal range sooner than
would otherwise be required, but will avoid double-rounding issues for
values outside that range. If you can't do that, I'd suggest lopping off
four bits rather than two, so the worst-case rounding error would be 17/32
ulp rather than 5/8.
What do you think about the idea of computing ranges, SuperCat? Of
using unrounded values, and assuming them to be the floor value, and
then to use the next higher bit pattern to represent the ceil value,
and then computing the ranges on calculations, which then would yield
the upper and lower floor results, which can then be fed into subsequent
calculations.

By doing this, you are guaranteed to get a final set of values that
contains the absolute maximum and minimum values in the computation,
with the opportunity to then take (x+y)/2 to derive your "average" or
"suitable" answer.

No rounding at any point, but with the expense of the growing number
of computations depending on how big the expression is.

It may be a desirable alternative to N-bit computing, as the workload
then is a fixed and known quantity that is expressible, and can be
conducted in massive parallel on a vector FPU quite easily.
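
As a sketch of the idea (illustrative only; a real implementation would
also direct the hardware rounding mode, e.g. with fesetround, to make the
bounds rigorous), each value becomes a [lo, hi] pair and operations
propagate both ends:

```c
/* Interval arithmetic sketch: carry lower and upper bounds through
   each operation, then take the midpoint as the "suitable" answer.
   Type and function names here are illustrative, not from the post. */
typedef struct { double lo, hi; } Interval;

static Interval iv_add(Interval a, Interval b) {
    /* for addition the bounds simply add endpoint-wise */
    Interval r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

static double iv_mid(Interval a) {
    return (a.lo + a.hi) / 2;   /* the (x+y)/2 "average" answer */
}
```

Multiplication and division need min/max over the four endpoint products,
which is where the growing-workload cost mentioned above comes in.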

Best regards,
Rick C. Hodgin
BGB
2016-05-01 19:27:18 UTC
Permalink
Post by s***@casperkitty.com
Post by BGB
the impact on the precision of shaving 2 bits off for doubles seems
pretty minor though (and in any case it is a serious improvement over
the 28 bit flonum type used for a long time by its predecessor). it
seems fairly unlikely most code would actually notice the difference.
Actually, from a numerical perspective shaving off one bit is often worse
than shaving off many more. Shaving off two bits isn't as bad, but it's
still often worse than shaving more.
I doubt this; I would need to run tests.

in any case, IME shaving a few bits off of double tends to be
considerably more accurate than float-28, which was solidly "pretty bad"
(the effects of truncation were often fairly readily visible).
Post by s***@casperkitty.com
Perfect rounding will yield a result that's always within 1/2 of a unit
in the last place (ulp) of the mathematically perfect result. If perfect
rounding is followed up by rounding off the last bit, however, the result
may be off by as much as 3/4 of a unit in the *new* last place. Rounding off
the last two bits may yield a result which is within 5/8 of a unit in the
new last place. Not as bad as 3/4, but still significantly worse than 1/2.
While some people mumble about the fact that systems which use extended
precision for intermediate computations may sometimes be off by 1025/2048
ulp, that's close enough to 1/2 that the advantages of using a higher-
precision temporary type *in languages that support such types properly*
would generally outweigh the disadvantages if the lack of proper language
support hadn't led hardware designers to effectively abandon such types.
chopping off 2 bits typically doesn't have enough of an effect on the
result to show up in printed output, and in this case is also
trivially avoided if one really wants to keep full precision.

for most use cases, a slight loss in precision here is fairly immaterial.
Post by s***@casperkitty.com
If you do want to lop off more than half a bit, I'd suggest working with
scaled values so that you can give up bits off the exponent rather than
the mantissa. Doing so will force numbers into denormal range sooner than
would otherwise be required, but will avoid double-rounding issues for
values outside that range. If you can't do that, I'd suggest lopping off
four bits rather than two, so the worst-case rounding error would be 17/32
ulp rather than 5/8.
if you mean by using integer multiplies, this is basically a no-go.

for some targets, large integer multiplies (and especially divides) are
to be avoided if possible, as in some cases they can be pretty
expensive, and divides can be borderline absurd.


shifts are cheaper here.

actually, this is a reason I simply chop off 2 bits.
previously, I had used a scheme where E +/- 256 had full precision, at
the cost of 4 bits outside this range.

however, the mask-checks and branches involved made the wrap/unwrap
operations fairly costly. the current scheme is a bit cheaper overall.
BGB
2016-05-05 00:45:46 UTC
Permalink
Post by BGB
Post by s***@casperkitty.com
Post by BGB
the impact on the precision of shaving 2 bits off for doubles seems
pretty minor though (and in any case it is a serious improvement over
the 28 bit flonum type used for a long time by its predecessor). it
seems fairly unlikely most code would actually notice the difference.
Actually, from a numerical perspective shaving off one bit is often worse
than shaving off many more. Shaving off two bits isn't as bad, but it's
still often worse than shaving more.
I doubt this, I would need to run tests.
in any case, IME shaving a few bits off of double tends to be
considerably more accurate than float-28, which was solidly "pretty bad"
(the effects of truncation were often fairly readily visible).
weird... got around to testing it, and saw these results.

in my tests (mostly involving running an integrator over randomized
value ranges with simple calculations over trigonometric functions and
quantizing all the doubles used by the integrator), on average the most
accurate result was shaving 3 bits with a bias of 3.

best results (shaved bits and rounding bias):
3 bits, bias=3 (1st place)
4 bits, bias=7 (2nd place)
2 bits, bias=2 (3rd place)

all 3 of these had fairly similar error values.

shaving more bits generally seems to get worse results.

truncation seems to universally give significantly worse results than
rounding (where 2 bits did the best for truncation).

with default "%f" formatting, usually only truncation showed results
with differing low-order digits.
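
the quantization being benchmarked looks roughly like this (a sketch,
parameterized the same way as the results above; bias=0 gives pure
truncation, bias=2^bits-1 rounds up, intermediate values bias the
rounding partway):

```c
#include <stdint.h>
#include <string.h>

/* shave `bits` low mantissa bits off a double, adding `bias` before
   clearing them (illustrative sketch, not the actual test harness). */
static double quantize(double d, int bits, uint64_t bias) {
    uint64_t u, mask = ((uint64_t)1 << bits) - 1;
    memcpy(&u, &d, 8);
    u = (u + bias) & ~mask;   /* round via the bias, then clear bits */
    memcpy(&d, &u, 8);
    return d;
}
```

with bits=3, bias=3 the result stays within a few ulp of the input,
which is why the difference rarely shows up in "%f" output.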

BGB
2016-04-23 20:03:42 UTC
Permalink
Post by BartC
Post by r***@gmail.com
Post by BartC
It would be grossly inefficient.
Instead of free(p) looking, for example, a few bytes before p where it
might have stored size and other info, it now has to search a global
table of possibly millions of allocations.
There might be ways of making that more efficient, but it's a lot easier
to insist that p points to the start of an allocation.
Right now the free() algorithm would be looking for an exact value,
It doesn't need to look for it; it's given the exact value.
Post by r***@gmail.com
so why not alter that search algorithm to look for a value that's >=
the pointer value, but less than the next one?
Look where? In a global list of SORTED allocations? And whereabouts in
the list would it start; from the beginning?
That would then need extra effort to keep sorted too, as blocks are
freed and allocated.
one can implement it as a big AVL tree kept sorted by addresses. it
actually works pretty ok.

as regions/blocks are allocated or freed, the tree will periodically
re-balance itself.


while a sorted array is simpler, the cost of inserts/deletes tends to
be higher (O(n) per operation, so O(n^2) worst case if built that way),
and there may also be the problem of needing to expand this array
(unless it is initially allocated at its largest possible size).

fully generalizing this typically requires resorting to an AVL tree or a
B-Tree or similar, so one may almost as well just use the tree up-front
and gain the O(log n) inserts/deletes.
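
as a minimal stand-in for the address-ordered lookup (flat sorted array
plus binary search instead of the tree, but the query is the same: find
the greatest block start <= p, then range-check):

```c
#include <stddef.h>
#include <stdint.h>

/* sketch, not the poster's code: blocks sorted by start address. */
typedef struct { uintptr_t start; size_t len; } Block;

/* return the index of the block containing p, or -1 if none. */
static int find_block(const Block *b, int n, uintptr_t p) {
    int lo = 0, hi = n - 1, best = -1;
    while (lo <= hi) {               /* greatest start <= p */
        int mid = (lo + hi) / 2;
        if (b[mid].start <= p) { best = mid; lo = mid + 1; }
        else                   { hi = mid - 1; }
    }
    if (best >= 0 && p < b[best].start + b[best].len)
        return best;                 /* p lies inside block `best` */
    return -1;
}
```

the AVL tree does the same query in O(log n) while also keeping the
inserts/deletes at O(log n).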
Post by BartC
Post by r***@gmail.com
That would give you
a possible location for the pointer, and then it's just a matter of
comparing the range of start + length.
Free() is something that should just take a handful of lines to execute.
But you're going to make it execute some algorithm? You do seem to like
making life difficult!
ideally, yes.

though, if done well, the average cost of memory allocation and freeing
can still be kept fairly small.

for example, memory objects can be cached in free-lists, and it is
pretty quick to allocate/free in cases where this involves simply adding
or removing the object from the appropriate sized list.
Post by BartC
Post by r***@gmail.com
It makes sense to me to store both start and length for allocated
blocks.
(The allocators I write (usually on top of malloc/free or the OS
equivalent) don't maintain a list of allocated blocks, nor of the
(requested) allocated sizes.
Both these bits of information are known to the program and simply
passed to the deallocator function. There is no overhead other than
over-allocation as block sizes are rounded up.
However, that information /is/ stored for deallocated blocks.)
FWIW, it may make sense to keep track of both sets of blocks.


with trees, this may mean a tree for allocated regions/blocks, as well
as a tree for free regions/blocks.


for free memory, it may make sense to have the tree sorted by size
rather than address, as one more often has reason to search the tree for
a given size of free memory than search for a specific address.

one nasty hack is to have joint trees where the nodes/leaves can be
walked either in terms of size or address, but I have only ever done
this once (generally too much of a pain to be worthwhile, vs either
multiple AVL trees, or just using bitmap searching...).
BartC
2016-04-23 20:37:32 UTC
Permalink
Post by BGB
Post by BartC
Free() is something that should just take a handful of lines to execute.
But you're going to make it execute some algorithm? You do seem to like
making life difficult!
ideally, yes.
though, if done well, the average cost of memory allocation and freeing
can still be kept fairly small.
for example, memory objects can be cached in free-lists, and it is
pretty quick to allocate/free in cases where this involves simply adding
or removing the object from the appropriate sized list.
The allocator I most use is for use with byte-code interpreters which
have enough trouble maintaining performance without needing to worry
about memory allocation too! It has to be simple - and quick.

Big allocations (>1KB) just used malloc (although sizes are pre-rounded
up before submitting to malloc, so that the program knows the guaranteed
capacity).

Smaller ones, in the most recent scheme, use power-of-two sizes from 16
to 1024 bytes, with a dedicated free-list for each size.
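
A sketch of that size-class mapping (my wording, not BartC's code):
sizes round up to the next power of two from 16 to 1024, each with its
own free-list, and anything larger falls through to malloc:

```c
#include <stddef.h>

/* map a request size to a free-list index: 16,32,...,1024 -> 0..6,
   or -1 meaning "too big, use malloc directly" (illustrative names). */
static int bucket_for(size_t sz) {
    size_t cap = 16;
    int i = 0;
    while (cap < sz) { cap <<= 1; i++; }  /* next power of two >= sz */
    return (sz <= 1024) ? i : -1;
}
```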
Post by BGB
Post by BartC
However, that information /is/ stored for deallocated blocks.)
FWIW, it may make sense to keep track of both sets of blocks.
Except that I use memory /within/ the block for the free-list to link
blocks. That memory is not available when the block is in use.
Post by BGB
one nasty hack is to have joint trees where the nodes/leaves can be
walked either in terms of size or address, but I have only ever done
this once (generally too much of a pain to be worthwhile, vs either
multiple AVL trees, or just using bitmap searching...).
Decades ago I tried using more complicated schemes, such as bit-maps to
try and avoid fragmentation by combining smaller blocks into larger ones
(then, block sizes were more variable too).

But I then discovered that it worked perfectly fine without any of that!
And this was when memory was still tight too, not because there was so
much spare memory as there is now.

I suppose I could end up with lots of freed 64-byte blocks that I can't
re-use because now I mostly need 32- or 128-byte ones, but really it's
never seemed to cause any problems.
--
Bartc
BGB
2016-04-23 22:40:54 UTC
Permalink
Post by BartC
Post by BGB
Post by BartC
Free() is something that should just take a handful of lines to execute.
But you're going to make it execute some algorithm? You do seem to like
making life difficult!
ideally, yes.
though, if done well, the average cost of memory allocation and freeing
can still be kept fairly small.
for example, memory objects can be cached in free-lists, and it is
pretty quick to allocate/free in cases where this involves simply adding
or removing the object from the appropriate sized list.
The allocator I most use is for use with byte-code interpreters which
have enough trouble maintaining performance without needing to worry
about memory allocation too! It has to be simple - and quick.
I do likewise (using it for an interpreter, and also a compiler), and
similarly have a use for reasonably fast allocators.

though, I don't personally take huge issue with spending maybe a few
kLOC on a memory allocator.


granted, the allocator generally needs to function as a primary
allocator within a codebase, which in my case has included things like
doing Minecraft-style worlds in a 3D engine (*).

needing to load and unload voxel chunks and rebuild vertex arrays and
similar as the camera moves around often puts a bit of load on the
memory allocator (this style of engine isn't particularly lightweight on
the memory usage front).


*: actually, I am working on my second such engine. I decided for
various personal reasons to do a ground-up reimplementation of pretty
much my entire 3D engine codebase (my prior engine being an
approximately 1 MLOC codebase).

though, it has been a pretty slow moving project thus far (~ 6 months,
still arguably at pretty early levels of development).
Post by BartC
Big allocations (>1KB) just used malloc (although sizes are pre-rounded
up before submitting to malloc, so that the program knows the guaranteed
capacity).
it works, but generally malloc hasn't really been suitable for my needs,
and by the time it can be made so, one already has most of the hard
parts of a full-featured allocator (for using mmap or VirtualAlloc as
the backend), so may as well just do this.


FWIW, for mapping sizes to free-lists, general strategy has been
< 4064: i=(sz+31)>>4;
<261856: i=256+((sz+287)>>8);
otherwise: use LObj allocator.
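
as a function, that mapping would look roughly like this (sketch; the
+31 and +287 presumably round up and leave room for the 16B header):

```c
#include <stddef.h>

/* size-to-freelist mapping as described above: 16B-granular lists
   below 4064, 256B-granular lists below 261856, -1 for large objects. */
static int freelist_index(size_t sz) {
    if (sz < 4064)
        return (int)((sz + 31) >> 4);           /* 16B cells */
    if (sz < 261856)
        return (int)(256 + ((sz + 287) >> 8));  /* 256B cells */
    return -1;                                  /* LObj allocator */
}
```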

for freeing, if it is a small object, it just adds it back into the
correct free-list for that size.

alloc (tried in sequence):
first check correct-sized free-list;
check directly bigger free lists;
fall back to scanning for a run of free cells;
try expanding heap if below the low-threshold and try again;
flush all the free lists and try again;
see about trying to expand the appropriate heap if possible;
return NULL (out of space).

free lists are maintained per-thread to avoid needing to lock the MM in
the simple case.
Post by BartC
Smaller ones, in the most recent scheme, use power-of-two sizes from 16
to 1024 bytes, with a dedicated free-list for each size.
in my case, objects under 256 bytes tend to be a significant part of the
total memory use. power-of-2 subdivision has been used for objects under
64B before, given there isn't really a much more space-effective way to
do this (16-byte cells are generally most effective IME for objects in
the 256B to 4kB range).

though, smaller cells are counter-productive on the performance front.


the reason for another level of 256B cells was that a prior MM had
fallen back to using a malloc-backed large-object system for everything
over 6kB, but frequent allocation/deallocation of arrays led to a fair
bit of overhead within MSVCRT (as well as with the insert/delete
operations).
Post by BartC
Post by BGB
Post by BartC
However, that information /is/ stored for deallocated blocks.)
FWIW, it may make sense to keep track of both sets of blocks.
Except that I use memory /within/ the block for the free-list to link
blocks. That memory is not available when the block is in use.
in my case, typically object headers are used.

small objects have a 16B header which holds:
the logical object type;
the size of the object;
the source file and line number;
a reference count and flags;
a header check value.

many of these are bit-packed table indices to save space.

for small objects, because of the bitmap, no internal pointers are needed.
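
one plausible layout for that 16-byte header (field widths here are my
guesses for illustration, not the actual packing):

```c
#include <stdint.h>

/* illustrative 16-byte small-object header matching the fields listed
   above; the real header bit-packs several of these as table indices. */
typedef struct {
    uint32_t type;       /* logical object type                      */
    uint32_t size;       /* size of the object in bytes              */
    uint16_t src_file;   /* index into a table of source file names  */
    uint16_t src_line;   /* allocation-site line number              */
    uint16_t rc_flags;   /* reference count and flags                */
    uint16_t check;      /* header check value (corruption detect)   */
} ObjHeader;             /* 16 bytes, no padding on common ABIs      */
```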

large objects and memory regions have an additional header which is used
for the AVL tree structure and similar (this one does contain pointers).
Post by BartC
Post by BGB
one nasty hack is to have joint trees where the nodes/leaves can be
walked either in terms of size or address, but I have only ever done
this once (generally too much of a pain to be worthwhile, vs either
multiple AVL trees, or just using bitmap searching...).
Decades ago I tried using more complicated schemes, such as bit-maps to
try and avoid fragmentation by combining smaller blocks into larger ones
(then, block sizes were more variable too).
But I then discovered that it worked perfectly fine without any of that!
And this was when memory was still tight too, not because there was so
much spare memory as there is now.
I suppose I could end up with lots of freed 64-byte blocks that I can't
re-use because now I mostly need 32- or 128-byte ones, but really it's
never seemed to cause any problems.
allocation bitmaps are one of my simpler schemes, or at least, simpler
than AVL trees.

but, as noted, I am left currently using a bi-level scheme, with 16B and
256B cells (an allocation being a span of adjacent cells).

as is, heap sizes aren't really big enough to justify adding additional
cell-allocator levels.
s***@casperkitty.com
2016-04-22 21:09:51 UTC
Permalink
Post by r***@gmail.com
What does the C standard say about this kind of occurrence?
char* x = malloc(10);
free(x + 1);
The Standard imposes no requirements if "free" is passed anything other than
a null pointer or a pointer that has been returned from "malloc", "calloc",
or "realloc" and has not been passed to "free" or "realloc" [even if "realloc"
returns a pointer with the same bits as the one passed in, it is from the
point of view of the Standard a new pointer which may [and should] at some
point be passed to "free" or "realloc"].

Further, unlike things like left-shifting a negative integer which will
behave predictably on all non-obtuse two's-complement implementations, the
consequences of passing an invalid pointer to "free" are generally not
predictable on any implementation outside of certain deliberately-constructed
scenarios.
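
To illustrate the portable pattern (my example, not from the Standard):
compute and use offset pointers freely, but retain and free exactly the
pointer malloc returned:

```c
#include <stdlib.h>
#include <string.h>

/* keep the original pointer; pass only it (or NULL) to free. */
static int demo(void) {
    char *x = malloc(10);
    if (!x) return -1;
    char *p = x + 1;      /* fine to derive and use interior pointers */
    memset(p, 0, 9);
    free(x);              /* must be x itself, never p */
    return 0;
}
```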
r***@gmail.com
2016-04-22 21:36:37 UTC
Permalink
Post by s***@casperkitty.com
Post by r***@gmail.com
What does the C standard say about this kind of occurrence?
char* x = malloc(10);
free(x + 1);
The Standard imposes no requirements if "free" is passed anything other than
a null pointer or a pointer that has been returned from "malloc", "calloc",
or "realloc" and has not been passed to "free" or "realloc" [even if "realloc"
returns a pointer with the same bits as the one passed in, it is from the
point of view of the Standard a new pointer which may [and should] at some
point be passed to "free" or "realloc"].
Further, unlike things like left-shifting a negative integer which will
behave predictably on all non-obtuse two's-complement implementations, the
consequences of passing an invalid pointer to "free" are generally not
predictable on any implementation outside of certain deliberately-constructed
scenarios.
Thank you, Supercat. And thank you, BartC.

Best regards,
Rick C. Hodgin