So, I tried to do a profile run to see if I could get any clue to what causes the slowness of the 6M62 version of Counterfeit Monkey.
On the first run I tried a test, mom-beats, which turned out to be extremely slow. The second time I just walked around a little at the start of the game, opened the barrier and talked to the barker.
Apparently it spends a lot of time in something called FlexSize, which along with RT__ChLDB, RT__ChSTB and @glk_select seems to account for most of the slowdown.
Does this tell us anything? Is there anything to do about it? Could I run the test in a more helpful way?
The commands used were:
./glulxe --profile profile-raw --profcalls output.ulx
python profile-analyze.py profile-raw gameinfo.dbg --glk dispatch_dump.xml
EDIT: Perhaps I should mention that I had to comment out the Ultra Undo extension in the source code in order to make the profile run work.
A lot of those are Inform 6’s strict-mode safety checks. If you can compile without them, you’ll get a much more realistic profile. Running a release build rather than a test build should do it.
Right. Here is a non-test build run:
Still the same top two time-consumers. The others seem almost harmless in comparison.
The @glk_select call is time waiting for you to type something. You can eliminate that from the profile by piping a command script into glulxe.
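For example, something along these lines should work (script.txt is a hypothetical plain-text file with one game command per line; this assumes a cheapglk-style build of glulxe that reads commands from standard input):

```
./glulxe --profile profile-raw --profcalls output.ulx < script.txt
```

With input coming from the script rather than the keyboard, almost no wall-clock time accumulates in @glk_select, so the remaining entries reflect actual computation.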
When I did some profiling (gist.github.com/erkyrath/9422a0 … 3dfbb42f17) I found it more useful to sort by time spent including child calls. It is true in your run that FlexSize eats a lot of time, but this is a low-level routine used in string manipulation. You really want to know the rule function that’s calling it so often; that will undoubtedly be several levels higher in the stack.
glk_select is the function that blocks the process waiting on user input. The greater percentage of time you spend here, the better you’re doing.
The next worst offender is:
[ FlexSize txb bsize n; ! Size of an individual block, including header
if (txb == 0) return 0;
for (bsize=1: n<txb->BLK_HEADER_N: bsize=bsize*2) n++;
return bsize;
];
This is computing a power of 2 from txb->BLK_HEADER_N. It seems like we could use a glulx shift opcode to compute this as we did earlier with the word hashing.
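To see the equivalence, here is a minimal Python sketch; the names loop_size and shifted_size are illustrative, not from the template code:

```python
def loop_size(header_n):
    """Mimic FlexSize's doubling loop: bsize starts at 1 and
    doubles once per step until n reaches header_n."""
    bsize, n = 1, 0
    while n < header_n:
        bsize *= 2
        n += 1
    return bsize

def shifted_size(header_n):
    # The same value in a single shift, which is what a Glulx
    # @shiftl opcode could compute in one instruction.
    return 1 << header_n

# The two agree for any plausible block-header exponent.
for h in range(16):
    assert loop_size(h) == shifted_size(h)
```

So the doubling loop is just computing 2^BLK_HEADER_N one bit at a time; a shift collapses it to constant time.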
Yes, that could be optimized, but it’s pettifogging until you know what is doing all that string manipulation.
True. I was thinking more from an I7 perspective than a Counterfeit Monkey perspective.
Is there any way to investigate this using the profiler? Any way to see what is calling FlexSize, other than putting lots of print statements in the intermediate I6 code?
Zarf suggested sorting including child calls. That would put whatever is calling FlexSize higher in the list, since it would include FlexSize’s time plus its own time and that of any other functions that it called.
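The difference between the two sort orders can be sketched like this; the call-tree shape, function name SomeRuleFn, and timings are invented for illustration:

```python
# Toy call tree: (function, self_time_ms, child calls).
# A rule function spends little time itself but triggers
# string manipulation that bottoms out in FlexSize.
tree = ("SomeRuleFn", 5,
        [("BlkValueWrite", 10,
          [("FlexSize", 70, [])])])

def inclusive(node):
    # Self time plus the time of everything called beneath it.
    name, self_ms, kids = node
    return self_ms + sum(inclusive(k) for k in kids)

def rows(node):
    name, self_ms, kids = node
    yield (name, self_ms, inclusive(node))
    for k in kids:
        yield from rows(k)

by_self = sorted(rows(tree), key=lambda r: r[1], reverse=True)
by_incl = sorted(rows(tree), key=lambda r: r[2], reverse=True)
# Sorted by self time, FlexSize tops the list; sorted by
# inclusive time, the caller SomeRuleFn comes out first.
```

That is why the inclusive sort surfaces the high-level culprit rather than the low-level routine it keeps calling.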
Right, thanks! Guess I should have read the profile-analyze.py documentation before asking.
From looking at the template sources and the profiles, I could see that most of the calls to FlexSize come from HeapLargestFreeBlock, FlexRecutInternal, and BlkValueWrite. In addition to optimising FlexSize, it would be worth inlining the calls to it in those functions, along with whatever other callers a longer profile would reveal.
Back in the days of 6G60 I started an extension called High Performance Indexed Text. The idea was to inline the calls to BlkValueRead and BlkValueWrite. That was a major pain! I might restart it, but this time just do FlexSize, at least for now.
Thinking about how to inline these calls… it would be a whole lot easier if I6 had shift operators. No chance of getting them into I6 soon, is there, Zarf?
Inform’s management of the Flex system is pretty slow on the whole. It’s necessary to support the Z-Machine, but for Glulx games would it be possible to skip it entirely and just use @malloc?
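For reference, a rough, untested I6 sketch of what direct heap use might look like on a Glulx 3.1+ interpreter that supports the @malloc/@mfree opcodes (DirectBufferDemo is an invented name, not anything in the template layer):

```
! Untested sketch: grab and release a 256-byte buffer directly,
! bypassing the Flex heap. @malloc returns 0 on failure.
[ DirectBufferDemo addr;
    @malloc 256 addr;
    if (addr == 0) return 0;   ! allocation failed
    addr->0 = 65;              ! use it like any byte array
    @mfree addr;
    return 1;
];
```

Whether this actually beats the Flex heap would need measuring, as discussed below.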
That’s interesting. I’ll check that out.
In the meantime, the 6M62 code is now on par with the speed of the 6G60 version, thanks to the elimination of a lot of visibility tests. If anyone was discouraged by the earlier slowness, feel free to have another look.
Inform’s operator system is complicated and I don’t want to touch it. Using shift opcodes makes for ugly code, but it works and lets you do what you need.
That’s what I figured. The common shift operators are used for action invoking, so it would need some nasty overloading.
This is certainly worth examining: if it is possible, just using the memory opcodes directly should be a lot faster.
I’m actually not certain of that. The internal implementation of @malloc is naive (it doesn’t call libc malloc). You would have to do a lot of testing to see what’s faster.
Is the Play Online version of Counterfeit Monkey on IFDB the sped-up one? I’m playing it now, and examining is fast, but each directional command takes several seconds to process.
No, we haven’t implemented any major performance improvements yet.
Would there be any way to speed things up by letting players turn off pathfinding and other user-friendly features? Pathfinding is nice, but I’d rather type 15 fast commands than 1 slow one.
I’m pretty sure the pathfinding is a one-time cost, so it would be other things that are making it so slow.