On 04/12/2017 4:27 PM, Bruce Elliott wrote:
...
> It sounds like the benefits for memory efficiency increase with larger
> array sizes, but then there might be a performance hit, since the
> methods being called to process all the array elements might be more
> costly than operations on fundamental data types.
For a categorical, memory efficiency occurs only if the memory for the
variable itself is more than the roughly 64 bytes/element minimum on
average; testing just illustrated that storing additional information
such as the category names does, as one would expect, require
additional memory. In the example I used above, if the names are also
one character per field, then the memory is the same. But building
from the shorter vector 1:3 with the names {'One','Two','Three'} for
convenience adds 16 additional bytes over the 250 minimum, for 266
total.
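If you want to see the effect directly, something along these lines
will show it; the sizes and names here are just illustrative and the
exact byte counts will vary by release:

  x  = randi(3, 250, 1);                            % raw double data
  c1 = categorical(x);                              % default names '1','2','3'
  c2 = categorical(x, 1:3, {'One','Two','Three'});  % longer names
  whos x c1 c2                                      % compare the Bytes column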
> I'm starting to think that Yair's suggestion of sticking with
> fundamental types and writing the code to do the bookkeeping is probably
> the most efficient way to work, albeit at the cost of more complex code.
> This is the point when I tend to start defining classes to keep the
> low-level processing encapsulated where other users won't have to see it.
Indeed, and here's the quandary and my beef with TMW as to what Matlab
seems to be turning into... all these new features and the infatuation
with object-based classes/methods have led to many nice features such
as the table/categorical variables under discussion here, but at the
cost of overall performance and tremendous code "bloat". The new
graphics engine is a prime example as well.
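That said, the encapsulation route the quoted text mentions needn't be
heavyweight; a bare-bones sketch of the "fundamental types plus
bookkeeping" idea might look like the following (class and member
names made up for illustration):

  % in a file CodedVar.m -- integer codes plus a name lookup, nothing more
  classdef CodedVar
    properties
      codes   % uint8 codes, one per element
      names   % cellstr of category names; codes index into it
    end
    methods
      function obj = CodedVar(codes, names)
        obj.codes = uint8(codes(:));
        obj.names = names;
      end
      function s = name(obj, i)
        % map element i back to its category name
        s = obj.names{obj.codes(i)};
      end
    end
  end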
There is now a proliferation of new, improved methods/procedures that
often duplicate existing capability besides adding the new... there
were the previous *dataset* class and an earlier incarnation of
categorical/ordinal variables in the Statistics Toolbox that are now
deprecated and eventually will (one hopes) be removed, but those who
made extensive use of them before the replacements arrived on the
scene in the base product will have work to make that transition.
That transition didn't take but a few releases; I've not tried to look
exactly, but it wasn't very long.
Personally, I'd have much preferred TMW had just held off, built only
one version, and spent the other effort on really making what they did
implement "lean and mean" instead of rushing to market. I understand
this might be seen internally as a loss of marketing opportunity given
the pressure they must be under from the alternatives, many of which
are open source so that the entry cost isn't there, albeit the
support/effort costs may make up for that, even though those generally
aren't budgeted line items.
> My biggest concern is that I'm getting tempted to start defining unique
> indices - or better, keys - for each record in a table. That keeps
> sounding like a database, which I don't really want to take on.
'Tis a quandary, granted... one can only try to read the tea leaves on
a given project as best one can as to which way to go. Yair's point
that it's a tradeoff between what's easy to code vis-a-vis what
performance is going to be mandatory for the longer term isn't easy to
settle a priori. I guess I'm glad I'm retired from the technical
consulting gig... :)
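FWIW, on the keys thought: before taking on a real database, table row
names can serve as a poor-man's unique key; a minimal sketch (variable
and key names illustrative only):

  T = table([1;2;3], {'a';'b';'c'}, ...
            'VariableNames', {'x','s'}, ...
            'RowNames', {'rec001','rec002','rec003'});
  T('rec002', :)        % look up a record by its key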
--
...