Improve RSample I/O for distributed RDataFrame usage #22749
Merged
vepadulano merged 3 commits on Jul 3, 2026
An issue occurs during deserialization of distributed tasks for an RDataFrame created via FromSpec. The issue is only visible if the application runs standalone.
Since root-project@74fbf73, initialization of cppyy is deferred to a later point than before; it used to be initialized at `import ROOT` time. In one specific scenario of distributed RDataFrame usage, this led to a combination of behaviours that ended in a crash.

When running a distributed RDataFrame application using `FromSpec`, the distributed tasks need to serialize the information about the various samples. This is done in the Python class `SerializableRSample` in Ranges.py. At serialization time, i.e. in `__getstate__`, the class was storing a payload with a few objects, among which were collections of strings (for names of files and datasets) with type `std::vector<std::string>`.

During the deserialization of a distributed task, the following would happen:

1. The ROOT module is deserialized, i.e. the `__reduce__` method of the ROOT facade is called.
2. The `__setstate__` method of the `SerializableRSample` class is called.
3. Inside of it, the members of the payload are accessed, thus their types must be deserialized, including `std::vector<std::string>`.
4. cppyy functions are invoked in order to gather information about those types.
5. cppyy has not been initialized yet at this point, hence the crash.

This commit simplifies the implementation of `SerializableRSample` so that it no longer requires cppyy for I/O.
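The general idea of keeping the pickled payload free of C++ types can be illustrated with a minimal sketch. This is not the code from this PR: the class `PlainSample` and its attribute names are hypothetical stand-ins, and the example deliberately avoids importing ROOT. It only shows that a state made of built-in Python types (`str`, `list`) round-trips through `pickle` without any type lookup beyond the Python standard library.

```python
import pickle


class PlainSample:
    """Hypothetical stand-in for a sample description (not ROOT's SerializableRSample)."""

    def __init__(self, name, trees, files):
        self.name = name
        # Store plain Python lists of str instead of std::vector<std::string>.
        self.trees = list(trees)
        self.files = list(files)

    def __getstate__(self):
        # Called when pickling: the payload contains only built-in types.
        return {"name": self.name, "trees": list(self.trees), "files": list(self.files)}

    def __setstate__(self, state):
        # Called when unpickling: nothing here needs information about C++ types.
        self.name = state["name"]
        self.trees = state["trees"]
        self.files = state["files"]


# Round trip, as would happen when a task is shipped to a worker.
sample = PlainSample("sample_a", ["Events"], ["a.root", "b.root"])
restored = pickle.loads(pickle.dumps(sample))
```

With a payload like this, the order in which the worker initializes any C++ bindings no longer matters for unpickling the sample description itself; conversion to C++ containers, if needed, can happen later when the bindings are known to be ready.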
Test Results: 23 files, 23 suites, 3d 15h 42m 18s ⏱️. For more details on these failures, see this check. Results for commit 6f003db.
Member
Author
This confirms that the test introduced in this PR fails without the fixes: https://github.com/root-project/root/actions/runs/28660728005/job/85000413196?pr=22750#step:8:13348
hageboeck
approved these changes
Jul 3, 2026
hageboeck
left a comment
Member
LGTM, but I think I saw a usage of too many lists that may be simplified further. Sorry if I overlooked other uses of these lists.