So, I tried to do a profile run to see if I could get any clue to what causes the slowness of the 6M62 version of Counterfeit Monkey.
On the first run I tried a test, mom-beats, which turned out to be extremely slow. The second time I just walked around a little at the start of the game, opened the barrier and talked to the barker.
Apparently it spends a lot of time in something called FlexSize, which along with RT__ChLDB, RT__ChSTB and @glk_select seems to account for most of the slowdown.
Does this tell us anything? Is there anything to do about it? Could I run the test in a more helpful way?
The commands used were:
./glulxe --profile profile-raw --profcalls output.ulx
python profile-analyze.py profile-raw gameinfo.dbg --glk dispatch_dump.xml
EDIT: Perhaps I should mention that I had to comment out the Ultra Undo extension in the source code in order to make the profile run work.
A lot of those are Inform 6’s strict-mode safety checks. If you can compile without them, you’ll get a much more realistic profile. Running a release build rather than a test build should do it.
Right. Here is a non-test build run:
Still the same top two time-consumers. The others seem almost harmless in comparison.
The @glk_select call is time waiting for you to type something. You can eliminate that from the profile by piping a command script into glulxe.
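For example, something along these lines should work (script.txt is a hypothetical plain-text file with one game command per line; this assumes a cheapglk-style build of glulxe that reads commands from standard input):

```
./glulxe --profile profile-raw --profcalls output.ulx < script.txt
```

With input coming from the script rather than the keyboard, almost no wall-clock time accumulates in @glk_select, so the remaining entries reflect actual computation.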
When I did some profiling (gist.github.com/erkyrath/9422a0 … 3dfbb42f17) I found it more useful to sort by time spent including child calls. It is true in your run that FlexSize eats a lot of time, but this is a low-level routine used in string manipulation. You really want to know the rule function that’s calling it so often; that will undoubtedly be several levels higher in the stack.
glk_select is the function that blocks the process waiting on user input. The greater percentage of time you spend here, the better you’re doing.
The next worst offender is:
[ FlexSize txb bsize n; ! Size of an individual block, including header
if (txb == 0) return 0;
for (bsize=1: n<txb->BLK_HEADER_N: bsize=bsize*2) n++;
return bsize;
];
This is computing a power of 2 from txb->BLK_HEADER_N. It seems like we could use a glulx shift opcode to compute this as we did earlier with the word hashing.
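To see the equivalence, here is a minimal Python sketch; the names loop_size and shifted_size are illustrative, not from the template code:

```python
def loop_size(header_n):
    """Mimic FlexSize's doubling loop: bsize starts at 1 and
    doubles once per step until n reaches header_n."""
    bsize, n = 1, 0
    while n < header_n:
        bsize *= 2
        n += 1
    return bsize

def shifted_size(header_n):
    # The same value in a single shift, which is what a Glulx
    # @shiftl opcode could compute in one instruction.
    return 1 << header_n

# The two agree for any plausible block-header exponent.
for h in range(16):
    assert loop_size(h) == shifted_size(h)
```

So the doubling loop is just computing 2^BLK_HEADER_N one bit at a time; a shift collapses it to constant time.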
Yes, that could be optimized, but it’s pettifogging until you know what is doing all that string manipulation.
True. I was thinking more from an I7 perspective than a Counterfeit Monkey perspective.
Is there any way to investigate this using the profiler? Any way to see what is calling FlexSize, other than putting lots of print statements in the intermediate I6 code?
Zarf suggested sorting including child calls. That would put whatever is calling FlexSize higher in the list, since it would include FlexSize’s time plus its own time and that of any other functions that it called.
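The difference between the two sort orders can be sketched like this; the call-tree shape, function name SomeRuleFn, and timings are invented for illustration:

```python
# Toy call tree: (function, self_time_ms, child calls).
# A rule function spends little time itself but triggers
# string manipulation that bottoms out in FlexSize.
tree = ("SomeRuleFn", 5,
        [("BlkValueWrite", 10,
          [("FlexSize", 70, [])])])

def inclusive(node):
    # Self time plus the time of everything called beneath it.
    name, self_ms, kids = node
    return self_ms + sum(inclusive(k) for k in kids)

def rows(node):
    name, self_ms, kids = node
    yield (name, self_ms, inclusive(node))
    for k in kids:
        yield from rows(k)

by_self = sorted(rows(tree), key=lambda r: r[1], reverse=True)
by_incl = sorted(rows(tree), key=lambda r: r[2], reverse=True)
# Sorted by self time, FlexSize tops the list; sorted by
# inclusive time, the caller SomeRuleFn comes out first.
```

That is why the inclusive sort surfaces the high-level culprit rather than the low-level routine it keeps calling.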
Right, thanks! Guess I should have read the profile-analyze.py documentation before asking.
From looking at the template sources and the profiles, I could see that most of the calls to FlexSize come from HeapLargestFreeBlock, FlexRecutInternal, and BlkValueWrite. In addition to optimising FlexSize, it would be worth inlining the calls to it in those functions, along with whatever other callers a longer profile would reveal.
Back in the days of 6G60 I started an extension called High Performance Indexed Text. The idea was to inline the calls to BlkValueRead and BlkValueWrite. That was a major pain! I might restart it, but this time just do FlexSize, at least for now.
Thinking about how to inline these calls… it would be a whole lot easier if I6 had shift operators. No chance of getting them into I6 soon, is there, Zarf?
Inform’s management of the Flex system is pretty slow on the whole. It’s necessary to support the Z-Machine, but for Glulx games would it be possible to skip it entirely and just use @malloc?
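For reference, a rough, untested I6 sketch of what direct heap use might look like on a Glulx 3.1+ interpreter that supports the @malloc/@mfree opcodes (DirectBufferDemo is an invented name, not anything in the template layer):

```
! Untested sketch: grab and release a 256-byte buffer directly,
! bypassing the Flex heap. @malloc returns 0 on failure.
[ DirectBufferDemo addr;
    @malloc 256 addr;
    if (addr == 0) return 0;   ! allocation failed
    addr->0 = 65;              ! use it like any byte array
    @mfree addr;
    return 1;
];
```

Whether this actually beats the Flex heap would need measuring, as discussed below.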
That’s interesting. I’ll check that out.
In the meantime, the 6M62 code is now on par with the speed of the 6G60 version, thanks to the elimination of a lot of visibility tests. If anyone was discouraged by the earlier slowness, feel free to have another look.
Inform’s operator system is complicated and I don’t want to touch it. Using shift opcodes makes for ugly code, but it works and lets you do what you need.
That’s what I figured. The common shift operators are used for action invoking, so it would need some nasty overloading.
This is certainly worth examining: if it is possible, just using the memory opcodes directly should be a lot faster.
I’m actually not certain of that. The internal implementation of @malloc is naive (it doesn’t call libc malloc). You would have to do a lot of testing to see what’s faster.
Is the Play Online version of Counterfeit Monkey on IFDB the sped-up one? I’m playing it now, and examining is fast, but each directional command takes several seconds to process.
No, we haven’t implemented any major performance improvements yet.
Would there be any way to speed things up by letting players turn off pathfinding and other user-friendly features? Pathfinding is nice, but I’d rather type 15 fast commands than 1 slow one.
I’m pretty sure the pathfinding is a one-time cost, so it would be other things that are making it so slow.