Accepting request 1075155 from home:NMorey:branches:science:HPC

- Update to 1.17.1
  - Core
    - hmem_cuda Add const to param to remove warning
    - Fix typos in fi_ext.h
    - ofi_epoll: Remove unused hot_index struct member
  - EFA
    - Print local/peer addresses for RX write errors
    - Unit test to verify no copy with shm for small host message
    - Avoid unnecessary copy when sending data from shm
    - Compare pci bus id in hints
    - Fix double free in rxr endpoint init
  - Hooks
    - dmabuf_peer_mem: Handle IPC handle caching in L0
  - OPX
    - Exclude from build if missing needed defines
    - Move some logs to optimized builds
    - Fix build warnings for unused return code from posix_memalign
    - Add reliability sanity check to detect when send buffer is illegally altered
    - SDMA Completion workaround for driver cache invalidation race condition
    - Fix replay payload pointer increment
    - Handle completion counter across multiple writes in SDMA
    - Cleanup pointers after free()
    - Modify domain creation to handle soft cache errors
    - Two biband performance improvements
    - Fixes based on Coverity Scan related to auto progress patch
    - Changed poll many argument to rx_caps instead of caps
    - Resynch with server configured for Multi-Engines (DAOS CART Self Tests)
    - Remove import_monitor as ENOSYS case
    - Address memory leaks reported on OFIWG issues page
    - Remove unused fields
    - Fix unwanted print statement case
    - Add replays over SDMA
    - Implement basic TID Cache
    - Revert work_pending check change
    - Fix use_immediate_blocks
    - Restore state after replay packet is NULL
    - Fix memory leak from early arrival packets.
    - Fix segfault in SHM operations from uninitialized value in atomic path.
    - Prevent SDMA work entries from being reused with outstanding
      replays pointing to bounce buf.
    - Set runtime as default for OPX_AV
    - Fix RTS replay immediate data
    - Fix errors caught by the upstream libfabric Coverity Scan
    - Support multiple HFI devices
    - Support OFI_PORT and Contiguous endpoint addresses
    - Update man pages
  - Util
    - util_cq: Remove annoying WARNING message for FI_AFFINITY

OBS-URL: https://build.opensuse.org/request/show/1075155
OBS-URL: https://build.opensuse.org/package/show/science:HPC/libfabric?expand=0&rev=83
Nicolas Morey 2023-03-29 08:24:52 +00:00 committed by Git OBS Bridge
parent 1b73b978dd
commit 6d086ca72a
8 changed files with 372 additions and 37 deletions


@@ -8,7 +8,7 @@
<param name="versionformat">@PARENT_TAG@.@TAG_OFFSET@.%h</param>
<param name="versionrewrite-pattern">v(.*)</param>
<param name="versionrewrite-replacement">\1</param>
-<param name="revision">619d9b3c4082dcf872c611ef18458ced067c29d7</param>
+<param name="revision">1528ac2d6a1b94d51a677ca7e2422683551c24dc</param>
</service>
<service name="recompress" mode="disabled">
<param name="file">libfabric*.tar</param>
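
The `versionrewrite-pattern` / `versionrewrite-replacement` pair in the hunk above is what turns a tag-based version string into the `Version` plus `git_ver` suffix used by the spec files. The following is a minimal sketch of that rewrite, not the service's actual implementation. It assumes the pattern is applied as a plain search-and-replace, that the upstream tag is named `v1.17.1`, and that the tag offset is 0; `rewrite_version` is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find the first match of `pattern` in `in` and
 * replace it with the text of capture group 1 (the "\1" replacement).
 * If nothing matches, the input is copied through unchanged.
 * Returns 0 on success, -1 if the pattern does not compile.
 */
static int rewrite_version(const char *pattern, const char *in,
                           char *out, size_t outlen)
{
    regex_t re;
    regmatch_t m[2];
    int rc;

    assert(pattern && in && out && outlen > 0);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, in, 2, m, 0);
    regfree(&re);

    if (rc != 0 || m[1].rm_so < 0) {
        /* No match or no capture group: keep the version as-is. */
        snprintf(out, outlen, "%s", in);
        return 0;
    }

    /* text before the match + capture group 1 + text after the match */
    snprintf(out, outlen, "%.*s%.*s%s",
             (int)m[0].rm_so, in,
             (int)(m[1].rm_eo - m[1].rm_so), in + m[1].rm_so,
             in + m[0].rm_eo);
    return 0;
}
```

Under those assumptions, `@PARENT_TAG@.@TAG_OFFSET@.%h` expands to `v1.17.1.0.1528ac2d6a1b`, and the rewrite yields `1.17.1.0.1528ac2d6a1b`, which matches `Version: 1.17.1` followed by `%define git_ver .0.1528ac2d6a1b` in the spec hunks.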


@@ -1,3 +1,184 @@
-------------------------------------------------------------------
Mon Mar 20 09:03:29 UTC 2023 - Nicolas Morey <nicolas.morey@suse.com>
- Update to 1.17.1
- Core
- hmem_cuda Add const to param to remove warning
- Fix typos in fi_ext.h
- ofi_epoll: Remove unused hot_index struct member
- EFA
- Print local/peer addresses for RX write errors
- Unit test to verify no copy with shm for small host message
- Avoid unnecessary copy when sending data from shm
- Compare pci bus id in hints
- Fix double free in rxr endpoint init
- Hooks
- dmabuf_peer_mem: Handle IPC handle caching in L0
- OPX
- Exclude from build if missing needed defines
- Move some logs to optimized builds
- Fix build warnings for unused return code from posix_memalign
- Add reliability sanity check to detect when send buffer is illegally altered
- SDMA Completion workaround for driver cache invalidation race condition
- Fix replay payload pointer increment
- Handle completion counter across multiple writes in SDMA
- Cleanup pointers after free()
- Modify domain creation to handle soft cache errors
- Two biband performance improvements
- Fixes based on Coverity Scan related to auto progress patch
- Changed poll many argument to rx_caps instead of caps
- Resynch with server configured for Multi-Engines (DAOS CART Self Tests)
- Remove import_monitor as ENOSYS case
- Address memory leaks reported on OFIWG issues page
- Remove unused fields
- Fix unwanted print statement case
- Add replays over SDMA
- Implement basic TID Cache
- Revert work_pending check change
- Fix use_immediate_blocks
- Restore state after replay packet is NULL
- Fix memory leak from early arrival packets.
- Fix segfault in SHM operations from uninitialized value in atomic path.
- Prevent SDMA work entries from being reused with outstanding
replays pointing to bounce buf.
- Set runtime as default for OPX_AV
- Fix RTS replay immediate data
- Fix errors caught by the upstream libfabric Coverity Scan
- Support multiple HFI devices
- Support OFI_PORT and Contiguous endpoint addresses
- Update man pages
- Util
- util_cq: Remove annoying WARNING message for FI_AFFINITY
-------------------------------------------------------------------
Mon Dec 19 08:39:57 UTC 2022 - Nicolas Morey <nicolas.morey@suse.com>
- Update to 1.17.0
- Core
- Add IFF_RUNNING check to indicate iface is up and running
- General code cleanups
- Add abstraction for common io_uring operations
- Support ROCR get_base_addr
- Add a 'flags' parameter to fi_barrier()
- Introduce new calls for opening domain and endpoint with flags
- Add ability to re-sort the fi_info list
- Allowing layering of rxm over net provider
- General cleanup of provider filtering functions
- Add io_uring operations to be used by sockapi
- Modify internal handling of async socket operations
- Sockets operations are moved to a common sockapi abstraction
- Add support for Ze host register/unregister
- Add new offload provider type
- Rename fi_prov_context and simplify its use
- Convert interface prefix string checks to exact checks
- EFA
- Code cleanups and various bug fixes
- Improved debug logging and warnings and assertions
- Do not ignore hints->domain_attr->name
- Fix the calculation of REQ header size for a packet entry
- Fix default value for host memory's max_medium_msg_size
- Add tracepoints to send/recv/read ops
- Simplified emulated read protocol
- Set use_device_rdma according to efa device id
- Fix shm initialization path on error
- Fix Implementation of FI_EFA_INTER_MIN_READ_MESSAGE_SIZE
- Do not enable rdma_read if rxr_env.use_device_rdma is false
- Remove de-allocated CUDA memory region during registration
- Fix the error handling path of efa_mr_reg_impl()
- Fix rxr_ep unit tests involving ibv_cq_ex
- Add check of rdma-read capability for synapseai
- Report correct default for runt_size parameter
- Toggle cuda sync memops via environment variable.
- Net
- Continued fork of tcp provider, will eventually merge changes back
- Fix inject support
- Fix memory leak in peek/claim path
- General code cleanups and bug fixes from initial fork
- Allow looking ahead in tcp stream to handle out-of-order messages
- Add message tracing ability
- Fetch correct ep when posting to a loopback connection
- Release lock in case of error in rdm_close
- Fix error path in xnet_enable_rdm
- Add missing progress lock in srx cleanup
- Code restructuring and enhancements with longer term goal of supporting io_uring
- Disable the progress thread in most situations
- Rename DL from libxnet-fi to libnet-fi
- Add missing initialization calls for DL provider
- Add support for FI_PEEK, FI_CLAIM, and FI_DISCARD
- Include source address with CQ entry
- Fix support for FI_MULTI_RECV
- OPX
- Bug fixes and general code cleanup
- Fix progress checks and default domain
- Allow atomic fetch ops to use SDMA for sufficiently large counts
- Cleaned up FI_LOG_LEVEL=warn output
- Reset default progress to FI_PROGRESS_MANUAL
- Fixed GCC 10 build error with Auto Progress
- Add support for FI_PROGRESS_AUTO
- Use max allowed packet size in SDMA path when expected TID is turned off
- Expected receive (TID) rendezvous
- RMA Read/Write operations over SDMA
- Remove origin_rs from cts and dput packet header.
- Fix for hang - unable to match inbound packets with receive
context->src_addr (DAOS CART tests)
- Use single IOV for bounce buffer in SDMA requests.
- Check for FI_MULTI_RECV with bitwise OR instead of AND
- Fix for intermittent intra-node deadlock hang (DAOS CART tests)
- Fix to RPC transport error failure (DAOS CART tests)
- Fix for context->buf set to NULL
- Fix bad asserts
- Ensure atomicity of atomic ops
- fi_opx_cq_poll_inline count and head check fix
- Fix intermittent intra-node hang causing RPC timeouts (DAOS CART tests)
- Temporarily reduce SDMA queue ring size for possible driver bug workaround
- Fix alignment issue and asserts
- Enable more parallel SDMA operations
- PSM3
- Synced to IEFS 11.4.0.0.198
- Tech Preview Ubuntu 22.04 Support
- Tech Preview Intel DSA Support
- Improved Intel GPU Support
- Various performance improvements
- Various bug fixes
- RxM
- Always use rendezvous protocol for ZE device memory send
- Code cleanup
- Add option to free resources on AV removal
- SHM
- Fix user_id support
- Write tx err comp to correct cq
- Fix index when setting FI_ADDR_USER_ID
- Remove extraneous ofi_cirque_next() call
- Add support for FI_AV_USER_ID
- Fix multi_recv messaging
- General code restructuring for maintainability
- Implement shared completion queues
- Decouple error processing from cq completion path to avoid switch
- Fix incorrect op passed into recv cancel operation
- Enhanced SHM implementation with DSA offload
- Use multiple SAR buffers per copy operation
- Fix ZE IPC race condition on startup
- TCP
- Minor updates in preparation for io_uring support (via net provider)
- Util
- Add option to free resources on AV removal
- Add 'flags' parameter to new fi_barrier2() call
- Add debugging in ofi_mr_map_verify
- Rename internal bitmask struct to include ofi prefix
- Verbs
- Add option to disable dmabuf support
- FI_SOCKADDR includes support of FI_SOCKADDR_IB
- Fabtests
- shared: Expand hmem support
- fi_loopback: Add support for tagged messages
- fi_mr_test: add support of hmem
- fi_rdm_atomic: Fix hmem support
- fi_rdm_tagged_peek: Read messages in order, code cleanup and fixes
- fi_multinode: Add performance and runtime control options, cleanups
- benchmarks: Add data verification to some bw tests
- fi_multi_recv: Fix possible crash in cleanup
- Drop prov-net-fix-error-path-in-xnet_enable_rdm.patch which was merged upstream.
-------------------------------------------------------------------
Tue Nov 8 11:46:56 UTC 2022 - Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>


@@ -1,7 +1,7 @@
#
# spec file for package fabtests
#
-# Copyright (c) 2022 SUSE LLC
+# Copyright (c) 2023 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
@@ -16,10 +16,10 @@
#
-%define git_ver .0.619d9b3c4082
+%define git_ver .0.1528ac2d6a1b
Name: fabtests
-Version: 1.16.1
+Version: 1.17.1
Release: 0
Summary: Test suite for libfabric API
License: BSD-2-Clause OR GPL-2.0-only


@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bab5c443ec19580c94e5ebd6543cd02d094e4b3930ba350156240bc48c97402c
size 2944448


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:170fcbbf7075ab6d167ae1b3da115cb19029dfa962d4609782ea40f7ce5a9fd1
size 3036923


@@ -1,3 +1,184 @@
-------------------------------------------------------------------
Mon Mar 20 09:03:29 UTC 2023 - Nicolas Morey <nicolas.morey@suse.com>
- Update to 1.17.1
- Core
- hmem_cuda Add const to param to remove warning
- Fix typos in fi_ext.h
- ofi_epoll: Remove unused hot_index struct member
- EFA
- Print local/peer addresses for RX write errors
- Unit test to verify no copy with shm for small host message
- Avoid unnecessary copy when sending data from shm
- Compare pci bus id in hints
- Fix double free in rxr endpoint init
- Hooks
- dmabuf_peer_mem: Handle IPC handle caching in L0
- OPX
- Exclude from build if missing needed defines
- Move some logs to optimized builds
- Fix build warnings for unused return code from posix_memalign
- Add reliability sanity check to detect when send buffer is illegally altered
- SDMA Completion workaround for driver cache invalidation race condition
- Fix replay payload pointer increment
- Handle completion counter across multiple writes in SDMA
- Cleanup pointers after free()
- Modify domain creation to handle soft cache errors
- Two biband performance improvements
- Fixes based on Coverity Scan related to auto progress patch
- Changed poll many argument to rx_caps instead of caps
- Resynch with server configured for Multi-Engines (DAOS CART Self Tests)
- Remove import_monitor as ENOSYS case
- Address memory leaks reported on OFIWG issues page
- Remove unused fields
- Fix unwanted print statement case
- Add replays over SDMA
- Implement basic TID Cache
- Revert work_pending check change
- Fix use_immediate_blocks
- Restore state after replay packet is NULL
- Fix memory leak from early arrival packets.
- Fix segfault in SHM operations from uninitialized value in atomic path.
- Prevent SDMA work entries from being reused with outstanding
replays pointing to bounce buf.
- Set runtime as default for OPX_AV
- Fix RTS replay immediate data
- Fix errors caught by the upstream libfabric Coverity Scan
- Support multiple HFI devices
- Support OFI_PORT and Contiguous endpoint addresses
- Update man pages
- Util
- util_cq: Remove annoying WARNING message for FI_AFFINITY
-------------------------------------------------------------------
Mon Dec 19 08:39:57 UTC 2022 - Nicolas Morey <nicolas.morey@suse.com>
- Update to 1.17.0
- Core
- Add IFF_RUNNING check to indicate iface is up and running
- General code cleanups
- Add abstraction for common io_uring operations
- Support ROCR get_base_addr
- Add a 'flags' parameter to fi_barrier()
- Introduce new calls for opening domain and endpoint with flags
- Add ability to re-sort the fi_info list
- Allowing layering of rxm over net provider
- General cleanup of provider filtering functions
- Add io_uring operations to be used by sockapi
- Modify internal handling of async socket operations
- Sockets operations are moved to a common sockapi abstraction
- Add support for Ze host register/unregister
- Add new offload provider type
- Rename fi_prov_context and simplify its use
- Convert interface prefix string checks to exact checks
- EFA
- Code cleanups and various bug fixes
- Improved debug logging and warnings and assertions
- Do not ignore hints->domain_attr->name
- Fix the calculation of REQ header size for a packet entry
- Fix default value for host memory's max_medium_msg_size
- Add tracepoints to send/recv/read ops
- Simplified emulated read protocol
- Set use_device_rdma according to efa device id
- Fix shm initialization path on error
- Fix Implementation of FI_EFA_INTER_MIN_READ_MESSAGE_SIZE
- Do not enable rdma_read if rxr_env.use_device_rdma is false
- Remove de-allocated CUDA memory region during registration
- Fix the error handling path of efa_mr_reg_impl()
- Fix rxr_ep unit tests involving ibv_cq_ex
- Add check of rdma-read capability for synapseai
- Report correct default for runt_size parameter
- Toggle cuda sync memops via environment variable.
- Net
- Continued fork of tcp provider, will eventually merge changes back
- Fix inject support
- Fix memory leak in peek/claim path
- General code cleanups and bug fixes from initial fork
- Allow looking ahead in tcp stream to handle out-of-order messages
- Add message tracing ability
- Fetch correct ep when posting to a loopback connection
- Release lock in case of error in rdm_close
- Fix error path in xnet_enable_rdm
- Add missing progress lock in srx cleanup
- Code restructuring and enhancements with longer term goal of supporting io_uring
- Disable the progress thread in most situations
- Rename DL from libxnet-fi to libnet-fi
- Add missing initialization calls for DL provider
- Add support for FI_PEEK, FI_CLAIM, and FI_DISCARD
- Include source address with CQ entry
- Fix support for FI_MULTI_RECV
- OPX
- Bug fixes and general code cleanup
- Fix progress checks and default domain
- Allow atomic fetch ops to use SDMA for sufficiently large counts
- Cleaned up FI_LOG_LEVEL=warn output
- Reset default progress to FI_PROGRESS_MANUAL
- Fixed GCC 10 build error with Auto Progress
- Add support for FI_PROGRESS_AUTO
- Use max allowed packet size in SDMA path when expected TID is turned off
- Expected receive (TID) rendezvous
- RMA Read/Write operations over SDMA
- Remove origin_rs from cts and dput packet header.
- Fix for hang - unable to match inbound packets with receive
context->src_addr (DAOS CART tests)
- Use single IOV for bounce buffer in SDMA requests.
- Check for FI_MULTI_RECV with bitwise OR instead of AND
- Fix for intermittent intra-node deadlock hang (DAOS CART tests)
- Fix to RPC transport error failure (DAOS CART tests)
- Fix for context->buf set to NULL
- Fix bad asserts
- Ensure atomicity of atomic ops
- fi_opx_cq_poll_inline count and head check fix
- Fix intermittent intra-node hang causing RPC timeouts (DAOS CART tests)
- Temporarily reduce SDMA queue ring size for possible driver bug workaround
- Fix alignment issue and asserts
- Enable more parallel SDMA operations
- PSM3
- Synced to IEFS 11.4.0.0.198
- Tech Preview Ubuntu 22.04 Support
- Tech Preview Intel DSA Support
- Improved Intel GPU Support
- Various performance improvements
- Various bug fixes
- RxM
- Always use rendezvous protocol for ZE device memory send
- Code cleanup
- Add option to free resources on AV removal
- SHM
- Fix user_id support
- Write tx err comp to correct cq
- Fix index when setting FI_ADDR_USER_ID
- Remove extraneous ofi_cirque_next() call
- Add support for FI_AV_USER_ID
- Fix multi_recv messaging
- General code restructuring for maintainability
- Implement shared completion queues
- Decouple error processing from cq completion path to avoid switch
- Fix incorrect op passed into recv cancel operation
- Enhanced SHM implementation with DSA offload
- Use multiple SAR buffers per copy operation
- Fix ZE IPC race condition on startup
- TCP
- Minor updates in preparation for io_uring support (via net provider)
- Util
- Add option to free resources on AV removal
- Add 'flags' parameter to new fi_barrier2() call
- Add debugging in ofi_mr_map_verify
- Rename internal bitmask struct to include ofi prefix
- Verbs
- Add option to disable dmabuf support
- FI_SOCKADDR includes support of FI_SOCKADDR_IB
- Fabtests
- shared: Expand hmem support
- fi_loopback: Add support for tagged messages
- fi_mr_test: add support of hmem
- fi_rdm_atomic: Fix hmem support
- fi_rdm_tagged_peek: Read messages in order, code cleanup and fixes
- fi_multinode: Add performance and runtime control options, cleanups
- benchmarks: Add data verification to some bw tests
- fi_multi_recv: Fix possible crash in cleanup
- Drop prov-net-fix-error-path-in-xnet_enable_rdm.patch which was merged upstream.
-------------------------------------------------------------------
Tue Nov 8 11:46:56 UTC 2022 - Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>


@@ -1,7 +1,7 @@
#
# spec file for package libfabric
#
-# Copyright (c) 2022 SUSE LLC
+# Copyright (c) 2023 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
@@ -17,10 +17,10 @@
#
-%define git_ver .0.619d9b3c4082
+%define git_ver .0.1528ac2d6a1b
Name: libfabric
-Version: 1.16.1
+Version: 1.17.1
Release: 0
Summary: User-space RDMA Fabric Interfaces
License: BSD-2-Clause OR GPL-2.0-only
@@ -28,7 +28,6 @@ Group: Development/Libraries/C and C++
Source: %{name}-%{version}%{git_ver}.tar.bz2
Source1: baselibs.conf
Patch0: libfabric-libtool.patch
-Patch1: prov-net-fix-error-path-in-xnet_enable_rdm.patch
URL: http://www.github.com/ofiwg/libfabric
BuildRequires: autoconf
BuildRequires: automake
@@ -71,7 +70,6 @@ services, such as RDMA. This package contains the development files.
%prep
%setup -q -n %{name}-%{version}%{git_ver}
%patch0 -p1
-%patch1
%build
rm -f config/libtool.m4


@@ -1,25 +0,0 @@
commit b775a752b3b4017f39e542ef4f32576d2b018f05
Author: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>
Date: Tue Nov 8 12:40:43 2022 +0100
prov/net: fix error path in xnet_enable_rdm
If xnet_listen fails (happens 100% of the time on a system with no
network interface but lo), the progress lock is not released which
causes a deadlock when fi_close is called later on the endpoint.
Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>
diff --git prov/net/src/xnet_rdm.c prov/net/src/xnet_rdm.c
index 77a236b51903..b5f77f068bf3 100644
--- prov/net/src/xnet_rdm.c
+++ prov/net/src/xnet_rdm.c
@@ -711,7 +711,7 @@ static int xnet_enable_rdm(struct xnet_rdm *rdm)
ret = xnet_listen(rdm->pep, progress);
if (ret)
- return ret;
+ goto unlock;
/* TODO: Move updating the src_addr to pep_listen(). */
len = sizeof(rdm->addr);
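
The description of the dropped patch above boils down to a common C error-path rule: once a lock is held, every failure must leave through the label that releases it. The sketch below illustrates that rule in isolation. It is not libfabric code: `struct demo_ep`, `demo_listen`, `demo_enable` and `demo_close` are hypothetical stand-ins, and a plain pthread mutex plays the role of the progress lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the endpoint; not the real struct xnet_rdm. */
struct demo_ep {
    pthread_mutex_t lock;    /* plays the role of the progress lock */
    int listen_should_fail;  /* simulates a host with no interface but lo */
    int enabled;
};

/* Hypothetical stand-in for xnet_listen(): fails on request. */
static int demo_listen(struct demo_ep *ep)
{
    return ep->listen_should_fail ? -EADDRNOTAVAIL : 0;
}

/* Enable path with a single exit that always releases the lock. */
static int demo_enable(struct demo_ep *ep)
{
    int ret;

    assert(ep != NULL);
    pthread_mutex_lock(&ep->lock);

    ret = demo_listen(ep);
    if (ret)
        goto unlock;  /* a bare "return ret" here would leave the lock held */

    ep->enabled = 1;
unlock:
    pthread_mutex_unlock(&ep->lock);
    return ret;
}

/*
 * Close path: needs the same lock. trylock is used so that a leaked lock
 * shows up as -EBUSY in this demo instead of hanging the caller.
 */
static int demo_close(struct demo_ep *ep)
{
    int rc;

    assert(ep != NULL);
    rc = pthread_mutex_trylock(&ep->lock);
    if (rc)
        return -rc;
    ep->enabled = 0;
    pthread_mutex_unlock(&ep->lock);
    return 0;
}
```

With the `goto unlock` in place, a failed `demo_enable()` followed by `demo_close()` succeeds. Replacing the `goto` with `return ret` makes `demo_close()` report `-EBUSY` in this sketch, which corresponds to the deadlock in `fi_close` that the patch description mentions for a blocking lock.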