Query Types¶
ServiceX queries can be expressed using a number of query languages. The queries are translated to actual code in the ServiceX codegens. Not all query languages support all potential input data formats, so once you have determined what input data you need to manipulate, you can decide what query language to express your query in.
This table sumarizes the query types supported by ServiceX and the data formats they can be used with.
FuncADL |
Uproot-Raw |
Python* |
|
---|---|---|---|
Flat ROOT TTrees |
✅ |
✅ |
✅ |
CMS NanoAOD |
✅ |
✅ |
✅ |
ATLAS PHYSLITE |
✅ |
✅† |
✅† |
ATLAS PHYS |
✅ |
❌ |
❌ |
Parquet |
❌ |
❌ |
✅ |
Note
* Python queries may not be available on all ServiceX deployments.
† The subset of the PHYSLITE data format readable (and _writable_) by uproot is supported.
Warning
Note that although ServiceX supports multiple output formats, not all features of the query languages may be supported by all output formats. See the cautions in the detailed discussions of the query languages.
A brief introduction to the query languages¶
FuncADL is an Analysis Description Language inspired by functional languages and C#’s LINQ. Sophisticated filtering and computation of new values can be expressed by chaining a series of simple functions. Because FuncADL is written independently of the underlying data libraries, it can run on many data formats.
Uproot-Raw passes user requests to the
.arrays()
function inuproot
. In particular, the branches of the inputTTrees
can be filtered, cuts can be specified to select events, and additional expressions can be computed. Additional non-TTree
objects can be copied from the inputs to the outputs.Python executes a user-specified Python function and returns the results (assumed to be provided as an Awkward Array). The function has access to the
uproot
andawkward
libraries.
Uproot-Raw Query Type¶
The Uproot-Raw query language can be thought of as essentially calling the TTree.arrays() function of uproot
with the possibility to specify several of the arguments, and returning the result. Multiple queries can be bundled into one request. It is also possible to copy objects from the input file.
Let’s look at the structure of an Uproot-Raw query.
query = [
{
'treename': 'reco',
'filter_name': ['/mu.*/', 'runNumber', 'lbn', 'jet_pt_*'],
'cut':'(count_nonzero(jet_pt_NOSYS>40e3, axis=1)>=4)'
},
{
'copy_histograms': ['CutBookkeeper*', '/cflow.*/', 'metadata', 'listOfSystematics']
}
]
This is a list of dictionaries, which is the standard form of an Uproot-Raw query. Each dictionary reflects a separate subquery of the request; it is possible to have just a single dictionary, for a single query.
Each dictionary either has a treename
key (indicating that it is a query on a tree) or a copy_histograms
key (indicating that ROOT objects are to be copied from the old file to the new one - certain objects other than histograms, such as TGraph
s, can also be copied).
Query dictionaries: these dictionaries contain a
treename
key, which specifies the tree(s) to be queried. The keys are:treename
: either a string, a list of strings, or a dictionary. This selects the names of the tree(s) to which the query will be applied. In the case that a dictionary is passed, the keys will be used to choose the input trees, and the values will be used as the name of the tree that results from the query - this allows the user to run multiple queries on the same tree, saving the output to a different tree each time.expressions
,cut
,filter_name
,aliases
: have the same meaning as for TTree.arrays() inuproot
, except that functions aren’t permitted (but globs and regular expressions, which are special kinds of strings, are).
Other keys will be ignored.
Most queries will probably use
filter_names
, which selects specific branches, andcut
, which selects specific rows. Theexpressions
argument permits new values to be computed from the branches in the tree, andaliases
can be used to introduce shorthand to make these expressions cleaner.The Uproot-Raw language extends the default
uproot
expression language by adding many functions from Awkward Array (the example above usesawkward.count_nonzero
). This permits very powerful expressions for cuts and expression evaluation.Copy dictionaries: these dictionaries contain the
copy_histograms
key, which specifies the object(s) to be copied. The one key is:copy_histograms
: this is either a string of a list of strings. These strings can be anything thatuproot.ReadOnlyDirectory.items()
accepts in itsfilter_name
argument: this can be the exact name of the object (e.g.metadata
above), a glob (e.g.CutBookkeeper*
), or a regular expression (e.g./cflow.*/
). Because of this flexibility, at most one copy dictionary should be needed per request.
Warning
Uproot-Raw supports the parquet output format, but is subject to its limitations. It cannot copy ROOT objects to parquet output. “Multiple output trees” are supported by concatenating the different trees together, with the additional column “treename” which indicates the name that the tree would have had in a ROOT file; these trees must have exactly the same structure (no added or missing columns between the different trees).
FuncADL Query Type¶
FuncADL queries are based on functional programming concepts and allow the user to specify complex queries in a compact form. The query is written in a functional style, with a series of functions that are applied to the data in sequence. The query is written in a string or as typed python objects. Depending on the source file format, the query is translated into C++ EventLoop code, or uproot python code.
An example that fetches the \(p_T, \eta\) and EM fraction of jets from an ATLAS PHYSLITE file is as follows:
from func_adl_servicex_xaodr22 import FuncADLQueryPHYSLITE, cpp_float
query = FuncADLQueryPHYSLITE()
jets_per_event = query.Select(lambda e: e.Jets('AnalysisJets'))
jet_info_per_event = jets_per_event.Select(
lambda jets: {
'pt': jets.Select(lambda j: j.pt()),
'eta': jets.Select(lambda j: j.eta()),
'emf': jets.Select(lambda j: j.getAttribute[cpp_float]('EMFrac')) # type: ignore
}
)
FuncADL is based on the concept of sequences. The events in a dataset are a sequence of events. The jets in an event are a sequence of jets.
The Select
call applies a function that transforms the input sequence, element-by-element, into an output sequence. In the above example,
the first Select
call is used to transform the sequence of events into a sequence of a sequence of jets (e.g. a sequence of jets representing
the jets in an event - a 2D array, if you will). The lambda function passed to the Select
call
is applied to each event in the input sequence, and the result is a sequence of jets for each event.
The dictionary defines the columns of the output file (e.g. the leaves in a TTree
). In each case, the three lambda
functions are applied
to each jet, transforming the sequence of jets into a sequence of \(p_T\) values, a sequence of \(\eta\) values, and a sequence of EM fractions.
Full documentation on the func-adl query language can be found at this JupyterBook.
Python Function Query Type¶
This query type is the most flexible for extracting data from an uproot compatible dataset. The user provides a Python function that is called for each root file in the dataset. The function is expected to return a dictionary of awkward arrays that will be stored in the output file.
The function must be named run_query
and must accept a single argument, the path to the uproot file.
It can return either an awkward array or a dictionary of awkward arrays, where the keys are the tree names
for each array. If a single awkward array is returned, it is stored in the tree named servicex.
def run_query(input_filenames=None):
import uproot # type: ignore
with uproot.open({input_filenames: "reco"}) as o:
br = o.arrays("el_pt_NOSYS")
return br