[ipxe-devel] strange behavior (regression?) with SeaBIOS + iPXE + WDSNBP.COM

Fri Nov 15 08:45:04 UTC 2019

Hi Michael & Lists,

I'd like to ask for ideas about the following problem.

(1) There is a functional iPXE + WDS setup, with iPXE built as a
traditional BIOS PCI option ROM, using CONFIG=qemu. Accordingly the
platform is qemu, with SeaBIOS, and the NIC is virtio-net-pci.

I don't know anything about the particulars of the WDS setup at this
point, only that the boot loader program it exposes is WDSNBP.COM.

(2) The setup works fine when iPXE is built at commit 4e85b2708fa0
("[virtio] Use host-specified MTU when available", 2017-01-23).

(3) When iPXE is built at commit 133f4c47baef ("[build] Handle
R_X86_64_PLT32 from binutils 2.31", 2018-09-17), the setup breaks.

The symptom is that iPXE fetches WDSNBP.COM just fine, but WDSNBP.COM,
rather than doing whatever it does otherwise, keeps PXE-booting itself
(3+ times), and finally aborts.

Consider the following log output (my understanding is that all this is
logged by WDSNBP.COM):

> Downloaded WDSNBP...
>
> Press F12 for network service boot
> Architecture: x64
> WDSNBP started using DHCP Referral.
> Contacting Server: ... (Gateway: ...)
> Contacting Server: ...
> TFTP Download: boot\x86\wdsnbp.com

This block repeats approx. 3 times, after which the following is
displayed:

> Windows Deployment Services: PXE Boot Aborted.
> Could not boot image: Error 0x7f8d8101 (http://ipxe.org/7f8d8101)
> No more network devices
>
> No bootable device

My understanding is that the first line from this last block is printed
by WDSNBP.COM, the second line by iPXE (in pxe_start_nbp()), the third
line also by iPXE, and the last one by SeaBIOS.

This seems to indicate that WDSNBP.COM exits with an error code, and
pxe_start_nbp() logs it as "Error 0x7f8d8101".

(4) Now, after a bit of searching the web, I've found the following
articles, which indicate that the WDS (= server side) setup is
incorrect:

(4a) "disable NetBios over TCPIP, on the WDS server"

  https://techthoughts.info/pxe-booting-wds-dhcp-scope-vs-ip-helpers/#comment-4307
  https://social.technet.microsoft.com/Forums/ie/en-US/f3883e8b-1039-477d-999d-73d9a6973fc4/wds-pxe-boot-tftp-download-loop-4-times-f12

(4b) "cover all combinations of forward and backwards slashes in
ReadFilter, on the WDS server"

   http://ipxe.org/appnote/chainload_wds#tftp_loops

However: the regression appears to be a function of *only* the git
commit at which we build iPXE. It is deterministic enough that we could
bisect the commit range 4e85b2708fa0..133f4c47baef. (Hence we have not
captured the network traffic yet, nor have we investigated the WDS
server config.)

The "culprit" commit is ea29122a70c6 ("[http] Include error messages for
4xx and 5xx response codes", 2017-12-28).

(5) Which makes no sense to me, unfortunately. :(

Commit ea29122a70c6 adds the "http_errors" array to the code. According
to

  src/include/ipxe/tables.h

and the build artifact

  src/bin/1af41000.rom.tmp.map

this new array is placed in a new section called

  .textdata.tbl.errortab.01

Trying to retro-fit those facts to the symptom encountered, I came up
with the idea that *maybe* the new array (or section) causes a memory
allocation failure in WDSNBP.COM -- due to the increased memory
footprint of iPXE -- which then leads to the misbehavior of WDSNBP.COM.

After all, WDSNBP.COM is a 16-bit real-mode program:

  https://support.microsoft.com/en-us/help/4468601/pxe-boot-in-configuration-manager

so it could be susceptible to the size & fragmentation of the RAM that
is under 640KB.

(6) Unfortunately, this "low RAM exhaustion" idea doesn't seem to hold
water. There are at least two counter-arguments:

(6a) if I revert commit ea29122a70c6 on top of commit 133f4c47baef, then
the issue does *not* go away.

(The issue also does not go away if I remove the "netdev_errors" array,
also on top of commit 133f4c47baef -- that's a larger array.)

(... In theory anyway, this might not necessarily disprove the memory
exhaustion idea. What if the iPXE footprint grows so much over the
ea29122a70c6..133f4c47baef range, for independent reasons, that
reverting ea29122a70c6 at the end cannot compensate for that increase?)

(6b) I added "DEBUG=pxe_call:1" to the "make" command, and compared the
debug messages printed by pxe_start_nbp(), between 4e85b2708fa0 and
133f4c47baef. Alas, the debug messages are identical:

> PXE NBP starting with netdev net0, code 9c6c:0802, data 9cf0:2ce0

which to me suggests that there is no change in the amount of memory
that is made available to WDSNBP.COM -- its code and data continue to
start at 0x9_CEC2 and 0x9_FBE0, respectively.
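
For reference, the linear addresses quoted above follow from the usual
real-mode rule (linear = segment * 16 + offset). A minimal sketch in C;
the helper name real_mode_linear() is hypothetical and is not taken from
the iPXE sources:

```c
#include <stdint.h>

/* Hypothetical helper (not part of iPXE): convert a real-mode
 * segment:offset pair into a linear address. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    /* Shifting the segment left by 4 bits multiplies it by 16. */
    return ((uint32_t)segment << 4) + offset;
}
```

With the values from the debug message, real_mode_linear(0x9c6c, 0x0802)
gives 0x9CEC2 and real_mode_linear(0x9cf0, 0x2ce0) gives 0x9FBE0, both
just below the 640KB (0xA0000) boundary.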

Any hints as to what could be going wrong?

Thanks!
Laszlo