move everything from t_dictionary to another t_dictionary (to keep same t_dictionary*)

diablodale's icon

How can I efficiently move all content of one t_dictionary to another t_dictionary?
Or swap the t_dictionary* that is stored in the registration system?
Think of it as...pseudocode...

t_dictionary* my_registered_dictionary = ...;
t_dictionary* new_unregistered_dictionary = dictobj_dictionaryfromstring(/* json */);

// keep the same name in the registration system
my_registered_dictionary = std::move(new_unregistered_dictionary);

No copy, no clone, just an efficient move of the dict contents (which are likely a bunch of pointers).

Below seems inefficient to me

// chews through unique symbols when not reused, and thrashes the registration db
object_free(my_registered_dictionary);
t_symbol* name = NULL;  // NULL -> dictobj_register generates a new unique name
my_registered_dictionary = dictobj_register(new_unregistered_dictionary, &name);

--or---

// allocates double memory, and not sure copyunique is even correct
dictionary_clear(my_registered_dictionary);
dictionary_copyunique(my_registered_dictionary, new_unregistered_dictionary);
object_free(new_unregistered_dictionary);

I see in the headers some mystery APIs...are these or others any help?

// appeared with no docs in the Max 7 headers (Max 6 doesn't appear to have these)
t_max_err dictionary_clone_to_existing(const t_dictionary* d, t_dictionary* dc);
t_max_err dictionary_copy_to_existing(const t_dictionary* d, t_dictionary* dc);
t_max_err dictionary_merge_to_existing(const t_dictionary* d, t_dictionary* dc);

Joshua Kit Clayton's icon

Hi Diablodale,

There isn't currently any API to support what you're asking for. The best approach, IMO, would be the free-and-re-register approach. All of the other API calls you point to copy contents from the source to the destination dictionary, similar to what your previous code does.

The unique symbol hash uses a pool that is cycled and reused after N iterations to prevent boundless growth, and I wouldn't worry much about inefficiency there.

That said, I have also found myself wanting exactly the behavior you describe, moving the underlying dictionary data (a linklist and a hashtab) to an existing dictionary object without the copy, so we might support this in the API at some point.
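
For concreteness, the free-and-re-register pattern might look something like this rough sketch (the helper name and parameters are placeholders, not SDK code):

#include "ext.h"
#include "ext_obex.h"
#include "ext_dictobj.h"

// sketch only: swap in `fresh` (a newly built, unregistered t_dictionary*)
// for the currently registered dictionary held in *registered
static void swap_by_reregistering(t_dictionary** registered, t_dictionary* fresh)
{
    if (*registered)
        object_free(*registered);           // unregisters and frees the old dictionary,
                                            // returning its unique name to the reuse pool
    t_symbol* name = NULL;                  // NULL asks dictobj_register for a new unique name
    *registered = dictobj_register(fresh, &name);
    // `name` now holds the freshly generated unique symbol
}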

Best,

Joshua

diablodale's icon

Got it. I implemented your suggestion. In testing, I am not seeing reuse of the unique symbols. dictobj_register() continuously created 4500+ new unique names. That seems excessive...how big is this pool?

Repro

My test pseudocode...

t_dictionary* dict = nullptr;

while (true) {
  std::string json = makejsonstring();
  t_dictionary* newdict = nullptr;
  t_symbol* name = nullptr;  // nullptr -> dictobj_register generates a new unique name
  dictobj_dictionaryfromstring(&newdict, json.data(), true);
  newdict = dictobj_register(newdict, &name);
  if (dict)
    object_free(dict);       // frees the previous dictionary, releasing its unique name
  dict = newdict;
  debugprint(dictobj_namefromptr(dict));
}

Result

4500+ unique names. I stopped the test when I saw no reuse.

The names...

start with u

then 3 digits that are randomish and reused. For example, I found 8 occurrences of u830

then 2 zeros

then 4 digits that started at ~1800 and then forever incremented by 1. Not reused.

e.g.

print: dictionary u694001830
...
print: dictionary u356002275
...
print: dictionary u872003363
...
print: dictionary u724004196
...
print: dictionary u372005879
...
print: dictionary u130006421

I saw no reuse up to uxxx006421 when I stopped the test.

Expected

Reuse of any of the previous 4500 unique names. 4499 of them are not in use. In addition, these names are in the separate dictionary namespace which makes collisions with jit names impossible.

diablodale's icon

Running the test longer finally shows wraparound of the last 5 digits.

print: dictionary u768010860
print: dictionary u552010861
print: dictionary u717010862
print: dictionary u230010863
print: dictionary u959000217  <-- wraparound...however...
print: dictionary u878000252
print: dictionary u023000335
print: dictionary u263000333

However, since the first 3 digits after the u are randomish and reused, the wraparound of the trailing digits applies only within a given grouping of the first 3.

That means the wraparound of the ~10,000 trailing values applies to only one of the 1,000 possible 3-digit prefixes (000-999). So by math and guessing at the observed behavior... it will generate approximately 10,000 x 1,000 = 10,000,000 unique names before we see reuse.

Joshua Kit Clayton's icon

Hi Dale,

The unique symbol pool must grow to 10k symbols before reuse. That pool of 10k symbols includes the randomized prefix, which exists to give better symbol table hashing characteristics.

Because of this, your assumption of 10,000 x 1,000 is incorrect. Please let us know if you find otherwise. You could perform a test by hashing a symbol in a hashtab and seeing at what point you find a symbol is already hashed rather than looking at the numbers in the unique symbol string.

e.g. something like the following untested code

t_hashtab* myhash = hashtab_new(0);
long count = 0;

...

t_object* p = 0;
hashtab_lookup(myhash, uniquesym, &p);
if (p) {
  object_post(x, "I've been reused at count = %ld", count);
} else {
  hashtab_storeflags(myhash, uniquesym, (t_object*)1, OBJ_FLAG_DATA);
  count++;
}
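
Wired into your earlier repro loop, it might look something like this (still untested; the dictionary contents don't matter for counting names, so dictionary_new() stands in for the JSON parse):

#include "ext.h"
#include "ext_obex.h"
#include "ext_hashtab.h"
#include "ext_dictobj.h"

// untested sketch: count how many unique names dictobj_register() hands out before one repeats
void count_until_reuse(void)
{
    t_hashtab* seen = hashtab_new(0);
    t_dictionary* dict = NULL;
    long count = 0;

    while (true) {
        t_symbol* name = NULL;                      // NULL -> request a new unique name
        t_dictionary* newdict = dictionary_new();   // contents don't matter for this test
        newdict = dictobj_register(newdict, &name);

        if (dict)
            object_free(dict);                      // returns the old name to the reuse pool
        dict = newdict;

        t_object* hit = NULL;
        hashtab_lookup(seen, name, &hit);
        if (hit) {
            object_post(NULL, "first reuse of %s after %ld unique symbols", name->s_name, count);
            break;
        }
        hashtab_storeflags(seen, name, (t_object*)1, OBJ_FLAG_DATA);
        count++;
    }

    object_free(dict);
    object_free(seen);
}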

Hope this helps!

-Joshua

diablodale's icon

I think I'm not being clear enough... I'm investigating two things simultaneously

  1. the reuse/size of the unique id "table"

  2. the number of symbols created to support unique ids

From what you write above...

  1. the unique "table" of 10k numbers is separate from the registration hash table that holds the lookup of "uDDD00DDDDD" to an object pointer. The APIs that request a unique id will draw from the potential 10k digits, and then prepend a random 3 digits to the front, so that later the lookup using that id will better hash into the registration table. Did I understand that right?

  2. There will still be up to 10,000 x 1,000 = 10,000,000 unique names. Names meaning Max symbols. When the above pulls from the 10k and prepends the 1k possible prefixes, that creates up to 10M unique symbols. That's a lot of little pointers and string blobs scattered through memory. Likely a driving force for the new Max 8 string support. Did I understand this right?

Joshua Kit Clayton's icon

Hi Dale,

Sorry I'm not being clear. The 10k unique symbols available in the pool prior to reuse include the random prefix. There is no separate pool of linearly increasing integers plus a randomized prefix. There is only a pool of unique names of the form uDDDDDDDDD (u plus 9 digits) that must contain 10k symbols before being reused. There could be more than 10k unique symbols, but only if they are in use above this 10k reserve pool.

#define UNIQUE_SYMBOL_POOL_THRESH 10000

t_symbol* symbol_unique(void)
{
    char str[256];
    t_symbol* s = NULL;
    long generate = 1;

    if (_unique_symbol_pool) {
        systhread_mutex_newlock(&_unique_symbol_lock, 0);
        if (dictionary_getentrycount(_unique_symbol_pool) > UNIQUE_SYMBOL_POOL_THRESH) {
            if ((s = dictionary_getorderedkey(_unique_symbol_pool, 0))) {
                dictionary_chuckentry(_unique_symbol_pool, s);
                generate = 0;
                if (g_max_debug) {
                    object_post(NULL, "symbol_unique: reusing %s", s->s_name);
                }
            }
        }
        systhread_mutex_unlock(_unique_symbol_lock);
    }

    if (generate) {
        sprintf(str, "u%03d%06d", (int)rerand() % 1000, (int)++_object_uid); // randomize the first 3 digits for better hashing
        s = gensym(str);
    }

    return s;
}

And in object_unregister, the relevant code is:

t_max_err object_unregister(void* x)
{
    ...
    if (symbol_is_unique(s)) {
        systhread_mutex_newlock(&_unique_symbol_lock, 0);
        if (!_unique_symbol_pool) {
            _unique_symbol_pool = dictionary_new_hashsize(1381);
        }
        if (!dictionary_hasentry(_unique_symbol_pool, s)) { // not yet in the pool, add it
            dictionary_appendsym(_unique_symbol_pool, s, s);
        }
        else {
            if (g_max_debug) {
                object_post(NULL, "object_unregister: unique sym %s is already in the unique pool", s->s_name);
            }
        }
        systhread_mutex_unlock(_unique_symbol_lock);
        if (g_max_debug) {
            object_post(NULL, "object_unregister: slating unique sym %s for reuse", s->s_name);
        }
    }
    ...
}

I hope this clarifies what is happening.

The unique symbol pool implementation should not hugely bloat your memory or cause significant performance issues. If you find otherwise please let us know, but in real world use cases this code barely registers on any performance or memory traces that I've encountered.

diablodale's icon

Got it, thanks. Testing in Max 7 aligns with the behavior you describe and your code snippets. I see the rollover, with the same set of uDDDDDDDDD names repeated.

That's a curious approach to me... to not reuse until 10k. I would think to reuse symbols as much as possible so the symbol char[] and the hash buckets (to which a reused symbol hashes) are all more likely to be in the CPU cache. Something more like if (dictionary_getentrycount(_unique_symbol_pool) != 0) ...

For knowledge/learning... was there some measured behavior, now ~20 years ago, that benefited from not reusing until 10k?

Joshua Kit Clayton's icon

The rationale is to prevent reuse that is temporally close to possible pathological user operations, like sending dictionary <uid> -> deferlow and having it actually reference a totally different logical dictionary that happened to get the same uid reused (rather than generating an error that the dictionary cannot be found).

10k unique symbols is roughly 300k of memory, which isn't crazy. The CPU cache is going to be most important where there are tight loops referencing the same region of memory for many sequential operations like reading from an audio buffer or jitter matrix. Reuse of unique symbol names for these objects any sooner should not have any significant performance benefits.

You could implement your own unique symbol ID pool with a smaller set of symbols, but I don't think you will see any significant performance improvement. If you do take the time to do such an experiment and find demonstrably improved results for a real world use case, I'm always happy to look at improving things where it makes sense. This particular circumstance, however, isn't where I would intuitively spend my effort; instead I generally recommend profiling real world use cases to find the areas of maximum possible improvement, and then optimizing those code paths where possible.

diablodale's icon

Of course...even that pathological example won't be 100% fixed, since it is possible for wraparound to occur if there are (10k - 1) unique names still being used. But that's 1/10,000 in an already rare scenario. 😉 I can imagine visual debugging benefits. When a patch author is visually looking at those "uDDD..." strings and sees the same name across consecutive bangs...it might lead them down a path of confusion/pain.

I'm sensitive to this because this is new behavior compared to jitter. Often in jitter objects, the output jit_matrix unique name stays the same on its outlets. For example, in jit.grab.maxhelp, the outlet from jit.grab continuously sends the same message with the same unique symbol, jit_matrix u092000534, yet the visual frames from my webcam are different. Same name pointing to different data. My sensor plugins/extensions are the same...my jit_matrix symbol names stay the same yet the data within them changes. I use the low-level JIT_MATRIX_DATA_REFERENCE | JIT_MATRIX_DATA_FLAGS_USE approach.
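
Roughly, that pattern looks something like this sketch (dimensions, the outlet, and the sensor buffer are placeholders):

#include "jit.common.h"

// rough sketch: one registered matrix whose name never changes; each frame only
// the data pointer is swapped to the latest sensor buffer
static void* make_sensor_matrix(t_symbol** name_out)
{
    t_jit_matrix_info info;
    jit_matrix_info_default(&info);
    info.type = _jit_sym_char;
    info.planecount = 4;
    info.dimcount = 2;
    info.dim[0] = 640;                              // placeholder sensor resolution
    info.dim[1] = 480;
    info.flags = JIT_MATRIX_DATA_REFERENCE | JIT_MATRIX_DATA_FLAGS_USE;

    void* matrix = jit_object_new(_jit_sym_jit_matrix, &info);
    *name_out = jit_symbol_unique();                // generated once, reused for every frame
    return jit_object_register(matrix, *name_out);
}

// per frame: point the matrix at the new sensor buffer, then send the *same* name downstream
static void send_frame(void* matrix, t_symbol* name, void* outlet, char* sensor_pixels)
{
    t_atom a;
    jit_object_method(matrix, _jit_sym_data, sensor_pixels);
    jit_atom_setsym(&a, name);
    outlet_anything(outlet, _jit_sym_jit_matrix, 1, &a);
}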

With dictionaries, that all changes. There is no API support for a swap that keeps the same name, so I will now create ~30 unique symbols/second/sensor...which means roughly every 6 minutes I will cycle through the entire pool of 10k unique names. Some of my advanced commercial customers run multiple sensors, so those names will cycle even faster.

Joshua Kit Clayton's icon

If you want name consistency across your sensor frames, you can either use dictionary_clone_to_existing (which makes a data copy; this is typically what we do in scenarios like this, and the pattern I would recommend you follow unless you have significant measurable performance reasons to optimize further), or look into re-registering with the same name as the previously freed sensor frame dictionary.

Here's what we do internally for dictobj_parse, a simple example that seems relevant.

void dictobj_parse(t_dictobj* x, t_symbol* msg, long argc, t_atom* argv)
{
    t_dictionary* d = NULL;
    t_max_err err;

    err = dictobj_dictionaryfromatoms_extended(&d, msg, argc, argv);
    if (!err) {
        dictionary_clone_to_existing(d, x->m_dictionary);
        dictobj_modify(x, x->m_dictionary);
        object_free(d);
    }
}
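
And the second option, re-registering under the previously freed dictionary's name, could look something like this rough, untested sketch (the helper name and parameters are just placeholders):

#include "ext.h"
#include "ext_obex.h"
#include "ext_dictobj.h"

// sketch only: keep the same registered name while swapping in a new dictionary
static void swap_keeping_name(t_dictionary** registered, t_dictionary* fresh)
{
    t_symbol* name = dictobj_namefromptr(*registered);  // remember the current unique name
    object_free(*registered);                            // unregister and free the old dictionary
    *registered = dictobj_register(fresh, &name);        // re-register the new one under that name
}

Note there is a brief window between the free and the re-register during which a lookup of that name will fail, which may matter if other threads resolve the name concurrently.
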
diablodale's icon

I chose to build my own UniqueSymbolPool class that my Dictionary class inherits. It pools thread-safely while aggressively reusing its base64 unique symbols. https://github.com/diablodale/maxcpp/blob/eda13bffdde0287f2d749f5283636b1e49488647/maxcpp/max.UniqueSymbolPool.h
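
The gist of it (a simplified sketch with a placeholder naming scheme, not the actual code in the repo) is a per-instance pool that reuses a released symbol immediately:

#include <mutex>
#include <string>
#include <vector>
#include "ext.h"

// simplified sketch of an aggressively reusing unique-symbol pool
class UniqueSymbolPoolSketch {
public:
    t_symbol* acquire() {
        std::lock_guard<std::mutex> guard(lock_);
        if (!free_.empty()) {                        // reuse a released name immediately
            t_symbol* s = free_.back();
            free_.pop_back();
            return s;
        }
        std::string name = "upool_" + std::to_string(next_++);  // placeholder naming scheme
        return gensym(name.c_str());
    }
    void release(t_symbol* s) {                      // caller guarantees nothing still references s
        std::lock_guard<std::mutex> guard(lock_);
        free_.push_back(s);
    }
private:
    std::mutex lock_;
    std::vector<t_symbol*> free_;
    unsigned long next_ = 0;
};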

Some feedback: the xxxobject_register() APIs are problematic, as they carry the fatal risk of freeing an object instead of registering it. This occurs when the desired name is already registered to another object. No one anywhere wants their object unpredictably freed by that API based on non-thread-safe knowledge. You can't pre-check for a desired name before object_register() because another thread could register that name between the check and the object_register(). A global critical section is janky for such a need. These register APIs have always been a pain for my use cases. I encourage Max to add a new API that never frees/deletes an object. Perhaps something like t_max_err object_try_register(t_symbol* namespace, t_symbol* s, void** x), so that registration fails (and does not free/delete the object) if the name is already in use.

Joshua Kit Clayton's icon

I hear you. I like the idea of object_try_register!