Page 14 of 24

Re: Lets actually try Hybrid Emulation

Posted: Mon May 17, 2021 4:17 pm
by foft
Is anyone here an arm assembly whiz?

http://www.64kib.com/qemu_slow_stuck_fragment.log

It logs IN: for the 68k code to translate, OUT: for the arm version, then it logs which one it's running. It also logs 68k addresses on entry.

Re: Lets actually try Hybrid Emulation

Posted: Mon May 17, 2021 6:54 pm
by foft
So in theory qemu is executing these jit instructions...

However when I debug it with gdb, I don't seem to get code at these addresses. In case there is an offset (it maps it as both read/execute and read/write at two addresses) I thought I'd do this a few times on all the executing threads:
display/i $pc
stepi

Unfortunately I can't find the code it's running! Plus, the stack doesn't show properly (and yes, I did try rebuilding qemu with -g and -O0). Ufff!

Re: Lets actually try Hybrid Emulation

Posted: Mon May 17, 2021 7:07 pm
by foft
So, still no performance fix...

However at least the irqs are better! Core/kernel module for the irq fix:
http://www.64kib.com/minimig_irq_corev2.tar.gz

(no qemu change needed, so still v9 - http://www.64kib.com/qemu_system_testv9.tar.xz)

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 10:00 am
by Caldor
Good to hear there is still progress :)

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 11:22 am
by LamerDeluxe
Feels like it is getting really close to a performance breakthrough now. Really interesting topic to follow, thanks for the frequent updates!

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 11:31 am
by foft
The performance issue is related to qemu write handling somehow.

On every write it calls 'notdirty_write' (see accel/tcg/cputlb.c), which does a glib tree lookup and a bunch of other messing around..

Which I can see if I set this trace event...
trace-event memory_notdirty* on

memory_notdirty_write_access 0x40025768 ram_addr 0x245768 size 4
memory_notdirty_write_access 0x40025768 ram_addr 0x245768 size 4
memory_notdirty_write_access 0x40025768 ram_addr 0x245768 size 4
memory_notdirty_write_access 0x40025768 ram_addr 0x245768 size 4
memory_notdirty_write_access 0x40025768 ram_addr 0x245768 size 4
memory_notdirty_write_access 0x40025768 ram_addr 0x245768 size 4

Now I just need to work out why this happens and ... how to stop it!

Note that this 'mmu' stuff is used even without the 68040 mmu; it's how qemu handles memory in system mode, I think.

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 11:56 am
by foft
It seems to take this path rather a lot... All the time?

/* Handle anything that isn't just a straight memory access. */
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 6:18 pm
by foft
So... this seems to happen if the stack shares an mmu page with code. I'm referring to the qemu mmu here, not the emulated 68k mmu.

When starting my tests from newcli this is the case and presumably other times.

Are there any amiga programs to force the stack location? They might be worth a try.

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 6:40 pm
by robinsonb5
foft wrote: Tue May 18, 2021 6:18 pm So... this seems to happen if the stack shares an mmu page with code. I'm referring to the qemu mmu here, not the emulated 68k mmu.

When starting my tests from newcli this is the case and presumably other times.

Are there any amiga programs to force the stack location? They might be worth a try.
Globally, not that I'm aware of - but here's an example of how, as a programmer, you can change the stack of your own task:
http://blackfiveservices.co.uk/amiga-c/ ... cktest.lha

If you change the MEMF_ANY to MEMF_REVERSE the new stack will be allocated from the opposite end of the free memory pool.

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 6:45 pm
by foft
Perhaps we can patch this to add an extra 8k to each stack, then grow downwards from there, so code and stack never share a page.
http://aminet.net/package/util/boot/StackAttack2

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 6:58 pm
by robinsonb5
foft wrote: Tue May 18, 2021 6:45 pm Perhaps we can patch this to add an extra 8k to each stack, then grow downwards from there, so code and stack never share a page.
http://aminet.net/package/util/boot/StackAttack2
Even unmodified it makes the stack significantly larger if there's loads of available memory - so it might already help. If you have more than 128 meg free the stack will be 128k - so that alone should be enough to make sure there's no page clash, yes?

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 7:03 pm
by foft
Since it grows down, any code placed right after the stack allocation will clash, however large the stack is?

Still I might install it then put an 8k variable on the stack at the start of the dhrystones program.

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 7:08 pm
by robinsonb5
Or just allocate a new stack in a completely different part of RAM?

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 7:33 pm
by foft
I just installed StackAttack2 'as-is'. With my simple loop it now worked every time (about 5-6 tries). I tried (real) dhrystones and get about 55000. That still isn't quite the 400000, so I wonder if something else is going on there; anyway, it's much, much better than what I got before.

Anyway this seems worth improving. Of course the issue remains for other programs with data near code, so if it's possible to cut the overhead in qemu that'd be good. The same tlb entry is often hit, so caching the last lookup might save a tree walk, for instance.

Re: Lets actually try Hybrid Emulation

Posted: Tue May 18, 2021 9:05 pm
by foft
What is the memory map for rtg? I’d like to get that working properly with this.

Re: Lets actually try Hybrid Emulation

Posted: Wed May 19, 2021 7:21 am
by foft
foft wrote: Tue May 18, 2021 9:05 pm What is the memory map for rtg? I’d like to get that working properly with this.
Is it just in Z3 fast ram? So if I fix the caching and start using the 'correct' DDR3 area again rather than malloc'ed ram will RTG work again?

Re: Lets actually try Hybrid Emulation

Posted: Wed May 19, 2021 9:14 am
by robinsonb5
foft wrote: Wed May 19, 2021 7:21 am Is it just in Z3 fast ram? So if I fix the caching and start using the 'correct' DDR3 area again rather than malloc'ed ram will RTG work again?
Without actually checking, I think so, yes.
(MiSTer's RTG was based loosely on the rather hackish and area-constrained RTG solution I put together for TC64 and MiST. On those platforms the RTG region is certainly just an allocated chunk of Fast RAM - I believe, though I haven't checked, that it's the same on MiSTer.)

Re: Lets actually try Hybrid Emulation

Posted: Wed May 19, 2021 9:42 am
by foft
Right, so to get that working again I need to work out how to enable the caching.

in u-boot we pass these kernel options:
mem=511M memmap=513M$511M
i.e. use 511M of RAM and reserve the 513M starting at the 511M mark.

When I started, I was mmapping the 384MB fast ram from the 513MB region. Even though I didn't use O_SYNC on the open, it seemed uncached.

Perhaps we can pass mem=1024M and tell the kernel to reserve this some other way.

Alternatively I guess I need to find the kernel api to set cache properties on a page. Then I can set up caching for the chip ram region too.

Also I guess qemu does something sensible to flush the pages when the host cpu cache settings are changed, though I have no idea.

Some light background reading: https://elinux.org/Tims_Notes_on_ARM_memory_allocation

Re: Lets actually try Hybrid Emulation

Posted: Wed May 19, 2021 11:01 am
by foft
Remembered there was a solution for this on gp2x.

mmuhack kernel module here (by squidge, modified by notaz)
https://notaz.gp2x.de/dev.php

Perhaps this can be adapted.

Re: Lets actually try Hybrid Emulation

Posted: Wed May 19, 2021 11:42 am
by Grabulosaure
RTG on MiSTer is just some DDRAM area directly fetched by the scaler.
The framebuffer address is the rtg_base[31:0] signal = 0x27000000

But, it is mapped in the 68K memory map @ 0x2000000, in an unused memory area outside ZIII space.
(RTG doesn't need to allocate any fastram memory)

For this hybrid version, mapping RTG into ZIII memory could be simpler though.

Re: Lets actually try Hybrid Emulation

Posted: Wed May 19, 2021 1:14 pm
by foft
The control regs are in that region too?

Re: Lets actually try Hybrid Emulation

Posted: Wed May 19, 2021 1:48 pm
by robinsonb5
foft wrote: Wed May 19, 2021 1:14 pm The control regs are in that region too?
No, they're at 0xb80100 (Following the not-yet-implemented Akiko CD control regs. MiST and TC64 have the ChunkyToPlanar reg but not the rest of Akiko as yet.)

Re: Lets actually try Hybrid Emulation

Posted: Wed May 19, 2021 8:37 pm
by ByteMavericks
Robinsonb5, I should pick this up separately, but is it possible to port the blitter (etc) from fpgaarcade for performance?

Re: Lets actually try Hybrid Emulation

Posted: Wed May 19, 2021 8:49 pm
by robinsonb5
ByteMavericks wrote: Wed May 19, 2021 8:37 pm Robinsonb5, I should pick this up separately, but is it possible to port the blitter (etc) from fpgaarcade for performance?
Probably - though the current implementation doesn't resemble the fpgaarcade one at all. I just used their driver as a skeleton for mine, and then mine was adapted for MiSTer.
It might also make more sense to subcontract blitter duty to the ARM, since it has more direct access to the DDR than the FPGA does?

Re: Lets actually try Hybrid Emulation

Posted: Fri May 21, 2021 5:56 pm
by foft
So I took a look at the mmuhack. It was ARMv6-based and the Cortex-A9 is ARMv7.

I found this, which dumps the armv7 page tables. (I had to add it to the kernel module since it needs supervisor mode)
https://github.com/yifanlu/ARMv7_MMU_Dumper

Anyway I ended up using the kernel api. It's easy to add mmap to fops (for debugfs you need to use debugfs_create_file_unsafe or it ignores it), then in there you can use something as simple as this:
static int minimig_mmap_cached(struct file *filp, struct vm_area_struct *vma)
{
    vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
    printk("mmap of %lx into %lx, with cached\n",
           vma->vm_pgoff << PAGE_SHIFT, vma->vm_start);

    return io_remap_pfn_range(vma,
                              vma->vm_start,
                              vma->vm_pgoff,
                              vma->vm_end - vma->vm_start,
                              vma->vm_page_prot);
}


Similarly for pgprot_writecombine, pgprot_uncached, pgprot_dmacoherent. I don't really know what they all map to on the armv7 hardware. Also, for some reason pgprot_cached was not defined...

Anyway it works, I can map the DDR ram reserved for Z3 fastram and it runs at the same speed as using malloc.

I also tried a couple more things:
i) cached chipram mapping: led to a corrupted screen. I wonder if qemu does anything to host caching with CACR etc.
ii) mapped 16MB from 0x27000000 (phys) to 0x2000000 (amiga mem space) for rtg. When I select rtg modes I lose monitor sync. However, I double-checked my rtg setup with the original core, and in that case I just get a black screen (but keep sync). So I have a setup problem AND another problem, I think. I had used the adf from here (viewtopic.php?p=12186#p12186).

I'm also wondering why disk access speed is 50% of what it is with TG68. I'm postponing looking at that until I figure out how to do proper chipram caching.

Re: Lets actually try Hybrid Emulation

Posted: Fri May 21, 2021 6:19 pm
by foft
qemu seems to have some cache flushing logic in the core. However I don't see any cache action taken on cacr changes or CINV/CPUSH. I guess the latter aren't used much on the Amiga anyway, since they're 68040+ only.

Re: Lets actually try Hybrid Emulation

Posted: Fri May 21, 2021 6:33 pm
by ByteMavericks
I’ve automated downloading and installing the extensions for rtg (and networking): https://github.com/ByteMavericks/MinimigMiSTer

Re: Lets actually try Hybrid Emulation

Posted: Fri May 21, 2021 8:23 pm
by robinsonb5
Remember that the CPU isn't the only thing that writes to chip RAM - you're going to need some kind of bus snooping if you want to cache chip RAM.

Re: Lets actually try Hybrid Emulation

Posted: Fri May 21, 2021 9:12 pm
by foft
robinsonb5 wrote: Fri May 21, 2021 8:23 pm Remember that the CPU isn't the only thing that writes to chip RAM - you're going to need some kind of bus snooping if you want to cache chip RAM.
Wouldn’t this be the case on the original hardware? So the cpu is unaware that chip ram has changed and the software has to handle it?

I guess there is a cache inhibit signal for hardware regs.

Re: Lets actually try Hybrid Emulation

Posted: Fri May 21, 2021 9:47 pm
by robinsonb5
foft wrote: Fri May 21, 2021 9:12 pm Wouldn’t this be the case on the original hardware? So the cpu is unaware that chip ram has changed and the software has to handle it?

I guess there is a cache inhibit signal for hardware regs.
There is - and I believe the system uses a combination of function code and address to disallow the data cache (if present - 68030+ only, of course) for chip RAM and the hardware regs.
I'm not sure of the exact mechanism; I do know that bus snooping was necessary in order to enable full caching on Chip RAM for the turbo mode on MiST's Minimig.