Using ServiceX

A transformation request is a specifically formatted request sent to ServiceX. It includes information on what input dataset is to be used, what preselection is to be applied (including computation of new columns, if any), and what columns should be returned to the user.

Selecting endpoints

Each request requires two endpoints, one corresponding to the service itself, and one for the output of the request. The current available endpoints are shown below.

Endpoint Type Location Experiment Input
rc1-xaod-servicex.uc.ssl-hep.org ServiceX SSL-RIVER ATLAS xAOD files
rc1-xaod-minio.uc.ssl-hep.org MinIO SSL-RIVER ATLAS xAOD files
rc1-uproot-servicex.uc.ssl-hep.org ServiceX SSL-RIVER ATLAS Flat ntuples
rc1-uproot-minio.uc.ssl-hep.org MinIO SSL-RIVER ATLAS Flat ntuples

Creating a request via func_adl

In order to use func_adl directly we start with a Qastle query as our input. The following is a query designed to extract the transverse momenta of a jet collection in some xAOD-formatted dataset:

my_query = "(call ResultTTree" \
    "(call Select" \
        "(call SelectMany" \
            "(call EventDataset (list 'localds:bogus'))" \
            "(lambda (list e) (call (attr e 'Jets') 'AntiKt4EMTopoJets'))" \
        ") (lambda (list j) (/ (call (attr j 'pt')) 1000.0))" \
    ") (list 'JetPt') 'analysis' 'junk.root')"

Given this input, we can produce output containing the transverse momenta of all jets in an ATLAS xAOD file. We start by specifying the structure of the ServiceX request:

import servicex
dataset = ‘mc15_13TeV:mc15_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.merge.DAOD_STDM3.e3601_s2576_s2132_r6630_r6264_p2363_tid05630052_00’
sx_endpoint = 'http://rc1-xaod-servicex.uc.ssl-hep.org'
minio_endpoint = 'rc1-xaod-minio.uc.ssl-hep.org'
ds = servicex.ServiceXDataset(
    dataset,
    servicex.ServiceXAdaptor(sx_endpoint, username='mweinberg', password='XXXXXXXXX'),
    servicex.MinioAdaptor(minio_endpoint)
    )

Once we have this, we can call ServiceX to output the results of our query in a convenient format:

r = servicex.get_data_pandas_df(my_query)
print(r)

After about 1--2 minutes, this prints a data frame with a single column for the transverse momenta.

A badly formatted query, or a problem with the file in the backend, will cause an exception to be thrown. Note that there are also tools like the one here that are capable of turning a text file of requested columns (e.g. here) into a complete Qastle query.

Using helper functions to construct a query

For all but the simplest single-column requests, creating a Qastle query as input can be quite cumbersome. func_adl provides additional libraries to construct queries.

Simple single-variable query

For example, we can perform the same request using the func_adl_xAOD library:

import func_adl_xAOD
f_ds = func_adl_xAOD.ServiceXDatasetSource(ds)
r = f_ds \
    .SelectMany('lambda e: e.Jets("AntiKt4EMTopoJets")') \
    .Select('lambda j: j.pt() / 1000.0') \
    .AsPandasDF('JetPt') \
    .value()
print(r)

Note that the Select() function transforms the input dataset by allowing you to select only objects matching the selection criteria (in this case only the pT attribute of the jet collection). Meanwhile the function SelectMany() shifts the hierarchy by returning a list of lists (in this case a list of events, each containing a separate list of jets). AsPandasDF() formats the output as a Pandas dataframe, and value() is responsible for executing the query.

Multi-variable query

As a more realistic example, we can construct a request for the four-momenta of the Electron and Muon collection. In this case let's output the results as a set of AwkwardArrays:

r = f_ds \
    .Select('lambda e: (e.Electrons("Electrons"), e.Muons("Muons"))') \
    .Select('lambda ls: (ls[0].Select(lambda e: e.pt()), \
                         ls[0].Select(lambda e: e.eta()), \
                         ls[0].Select(lambda e: e.phi()), \
                         ls[0].Select(lambda e: e.e()), \
                         ls[1].Select(lambda m: m.pt()), \
                         ls[1].Select(lambda m: m.eta()), \
                         ls[1].Select(lambda m: m.phi()), \
                         ls[1].Select(lambda m: m.e()))') \
    .AsAwkwardArray(('ElePt', 'EleEta', 'ElePhi', 'EleE', 'MuPt', 'MuEta', 'MuPhi', 'MuE')) \
    .value()

Because the output is an AwkwardArray, which can handle the variable-size set of objects for each event, it is no longer necessary to use the SelectMany() function as above.

Query with applied filter

Next, let's consider the case where we wish to return information only for those jets with a pT passing some threshold cut. This can be done via the Where() function:

r = f_ds \
    .SelectMany('lambda e: e.Jets("AntiKt4EMTopoJets")') \
    .Where('lambda j: j.pt() / 1000.0 > 30.0') \
    .Select('lambda j: j.eta()') \
    .AsPandasDF('JetPt') \
    .value()

which returns a dataframe with the eta values of all jets whose pT is above 30 GeV.

Complex query with filtering and a computed variable

Finally, let's take a complicated query where we ask for a computed variable (for simplicity we'll use a nonsense variable like eta * phi) from the Electrons collection, but only for those events with at least two jets with pT > 30 GeV. This can be done via:

r = f_ds \
    .Where('lambda e: e.Jets("AntiKt4EMTopoJets") \
        .Where('lambda j: j.pt() / 1000.0 > 30.0').Count() >= 1') \
    .Select('lambda e: e.Electrons("Electrons")') \
    .Select('lambda e: e.Select(lambda ele: ele.eta() * ele.phi())') \
    .AsAwkwardArray('EleMyVar') \
    .value()

Note the nested Select() used to construct the computed variable; this ensures the variable is only computed for electrons in the list of filtered events.

Choosing the output

There are currently three choices for formatting the output of a ServiceX request: AsPandasDF returns the output as a Pandas dataframe, AsROOTTree returns the output as a flat TTree, and AsAwkwardArray returns the output as an Awkward array suitable for use with uproot.