Metadata-Version: 2.1
Name: dataframe-expressions
Version: 1.1.0b3
Summary: Library to help with accumulating expressions
Home-page: https://github.com/gordonwatts/dataframe_expressions
Author: G. Watts (IRIS-HEP/UW Seattle)
Author-email: gwatts@uw.edu
Maintainer: Gordon Watts (IRIS-HEP/UW Seattle)
Maintainer-email: gwatts@uw.edu
License: TBD
Description: # dataframe_expressions
        
        Simple accumulation of expressions for dataframe operations.
        
        ## Expression Samples
        
        You start with a top level data frame:
        
        ```python
        from dataframe_expressions import DataFrame
        d = DataFrame()
        ```
        
        Now you can mask it with simple operations:
        
        ```python
        d1 = d[d.x > 10]
        ```
        
        The operators `<`, `>`, `<=`, `>=`, `==`, and `!=` are all supported. You can also combine logical expressions, but watch operator precedence: `&` binds more tightly than the comparison operators, so wrap each comparison in parentheses:
        
        ```python
        d1 = d[(d.x > 10) & (d.x < 20)]
        ```
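        
        To see why the parentheses matter, the standard `ast` module can show how Python itself parses the two spellings. This is a plain-Python sketch that only parses source strings; it does not use `dataframe_expressions`, and the variable names are made up for illustration:
        
        ```python
        import ast
        
        # Without parentheses, `&` binds tighter than the comparisons, so Python
        # reads this as the chained comparison `d.x > (10 & d.x) < 20`.
        unparenthesized = ast.parse("d.x > 10 & d.x < 20", mode="eval").body
        
        # With parentheses, the two comparisons are combined by a single `&`.
        parenthesized = ast.parse("(d.x > 10) & (d.x < 20)", mode="eval").body
        
        print(type(unparenthesized).__name__)  # Compare
        print(type(parenthesized).__name__)    # BinOp
        ```
        
        Only the parenthesized form has the shape "comparison `&` comparison" that a mask needs.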
        
        Of course, chaining is also allowed:
        
        ```python
        d1 = d[d.x > 10]
        d2 = d1[d1.x < 20]
        ```
        
        And `d2` will be identical to `d1` from the previous example. You can also reverse the order of the operands, for example:
        
        ```python
        d1 = d[10 < d.x]
        ```
        
        The system will actually render the mask expression as `d.x > 10` (as per math and python rules).
        
        The basic 4 binary math operators work as well:
        
        ```python
        d1 = d.x/1000.0
        ```
        
        They also work as expected if reversed, in case you were worried about that (e.g. `1000.0/d.x`).
        
        Extension functions are supported:
        
        ```python
        d1 = d.x.count()
        ```
        
        And, much the same way, `numpy` functions are supported:
        
        ```python
        import numpy as np
        d1 = np.sin(d.x)
        ```
        
        as well as some python functions:
        
        ```python
        d1 = abs(d.x)
        ```
        
        Internally, `np.sin(d.x)` is rendered as `d.x.sin()`. The `numpy` functions are translated directly into calls like this - it is up to whatever backend you have to actually implement them. For the complete list of `numpy` functions, see the [`numpy` math page](https://numpy.org/doc/stable/reference/routines.math.html).
        
        Finally, other `numpy` functions, the `array_functions`, are also translated. For example:
        
        ```python
        h = np.histogram(d.x, bins=50, range=(-0.5,10))
        ```
        
        creates a `DataFrame` which makes a call to the `np_histogram` function. A backend can then implement that function.
        
        One of the most useful extra expressions in a functional language is the `if-then-else` expression. In python this is `a if a > b else b`. Unfortunately, due to the way the python interpreter works, we can't use this directly with `DataFrame`s. Instead, use the 3-argument form of `np.where`: `np.where(<test>, <test-true-result>, <test-false-result>)`. The nice thing about `dataframe_expressions` is that the true and false results are not calculated unless they are needed (unlike true `numpy`). See the [`numpy.where` documentation](https://numpy.org/doc/stable/reference/generated/numpy.where.html) for further details. Support, of course, is dependent on the backend.
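        
        The restriction comes from Python itself: a conditional expression asks the test for its truth value right away (via `__bool__`), before any data exists. The sketch below illustrates this with a hypothetical `Expr` class and `where` helper. Neither is part of `dataframe_expressions`, and the library's own behavior may differ in detail:
        
        ```python
        class Expr:
            """Hypothetical stand-in for an expression-recording object (not the library's class)."""
        
            def __init__(self, text: str):
                self.text = text
        
            def __gt__(self, other) -> "Expr":
                # Record the comparison instead of evaluating it.
                return Expr(f"({self.text} > {_text(other)})")
        
            def __bool__(self) -> bool:
                # `x if cond else y` calls bool(cond) immediately; there is no data yet.
                raise TypeError("cannot take the truth value of a recorded expression")
        
        
        def _text(value) -> str:
            return value.text if isinstance(value, Expr) else repr(value)
        
        
        def where(test: Expr, if_true, if_false) -> Expr:
            """Record a three-argument conditional, in the spirit of `np.where`."""
            return Expr(f"where({test.text}, {_text(if_true)}, {_text(if_false)})")
        
        
        a, b = Expr("a"), Expr("b")
        
        try:
            result = a if a > b else b
        except TypeError as e:
            print(f"if/else failed: {e}")
        
        print(where(a > b, a, b).text)  # where((a > b), a, b)
        ```
        
        A function call such as `where` never needs the truth value of its test, so all three arguments can be recorded and handed to a backend later.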
        
        ## Lambda functions and captured variables
        
        It is possible to use lambdas that capture variables, allowing combinations of objects. For example:
        
        ```python
        d.jets.map(lambda j: d.eles.map(lambda e: j.DeltaR(e)))
        ```
        
        This would produce a stream of `DataFrame`s for each jet with each electron. It is up to the backend how a function like `map` is used (and of course `DeltaR`). Further, the backend must run the parsing as arguments can be arbitrary, so `dataframe_expressions` can't figure out the meaning on its own. The function `map` here, for example, has no special meaning in this library.
        
        ## Backend Functions
        
        Sometimes the backend defines some functions which are directly callable. For example, `DeltaR`, which might take several parameters. With some hints, these are encoded as direct function calls in the final `ast`:
        
        ```python
        from dataframe_expressions import user_func
        
        @user_func
        def calc_it(pt1: float) -> float:
            assert False, 'Should never be called'
        
        calced = calc_it(d.jets.pt)
        ```
        
        In this case, `calced` would be expected to be a column of jet `pt`'s that were all put together.
        
        ## Filter Functions
        
        If a filter gets to be too complex (the code between a `[` and a `]`), then it might be simpler to put it in a separate function.
        
        ```python
        def good_jet(j):
            return (j.pt > 30) & (abs(j.eta) < 2.4)
        
        good_jets_pt = df.jets[good_jet].pt
        ```
        
        ## Adding computed expressions to the Data Model
        
        There are two ways to define _new columns_ in the data model. In both cases the idea is that a new computation expression can replace the old one. The first method looks more `pandas` like, and the second one looks more like a regular expression substitution. The second method is quite general, powerful, and thus quite likely to take your foot off. Not sure it will survive the prototype.
        
        ### Adding a new computed expression column
        
        This is the most common way to add a new expression to the data model: one provides a lambda function that is computed during rendering by `dataframe_expressions`:
        
        ```python
        df.jets['ptgev'] = lambda j: j.pt / 1000.0
        ```
        
        By default the argument is everything that precedes the brackets - in this case `df.jets`. All the rules about capturing variables apply here, so it is possible to add a set of tracks near the jet, for example, using this (as long as it is implemented by the backend). For example:
        
        ```python
        def near(tks, j):
            return tks[tks.map(lambda t: DeltaR(t, j) < 0.4)]
        
        df.jets['tracks'] = lambda j: near(df.tracks, j)
        
        # This will now get you the number of tracks near each jet:
        df.jets.tracks.Count()
        ```
        
        The above assumes a lot of backend implementation: `DeltaR`, `map`, `Count`, along with the detector data model that has jets and tracks, but hopefully gives one an idea of the power available.
        
        ### Replacing the contents of a column
        
        It is possible to graft one part of the data model into another part of the data model, when necessary. It can be done with the above lambda expression as well, but this is a short cut:
        
        ```python
        df.jets['mcs'] = df.mcs[df.mcs.pdgId == 11]
        
        how_many_mcs = df.jets.mcs.Count()
        ```
        
        Though that would have the same number for every jet.
        
        Because of the way rendering works, the following also does what you expect:
        
        ```python
        df.jets['ptgev'] = df.jets.pt/1000.0
        
        jetpt_in_gev = df.jets.ptgev
        ```
        
        This is because in the current `dataframe_expressions` model, every single appearance of a common expression, like `df.jets`, corresponds to the same set of jets. In short, implied iterators are common here. In this prototype it isn't obvious this should be here.
        
        All of this will work even through a filter, as you might expect:
        
        ```python
        df.jets['ptgev'] = df.jets.pt / 1000.0
        
        jetpt_in_gev = df.jets[df.jets.ptgev > 30].ptgev
        ```
        
        The prototype implementation is particularly fragile - but that is due to poor design rather than a technical limitation.
        
        You can also refer to a leaf using a simple syntax. For example, `df.jets["ptgev"]` and `df.jets.ptgev` are the same on the right hand side of an expression. `df.xxx` and `df["xxx"]` are equivalent in all circumstances.
        
        ### Adding to the data model using objects
        
        Another way to do this is to build an object. For example, let's say you want to make it easy to do 3-vector operations. You might write something like this:
        
        ```python
        class vec(DataFrame):
            def __init__(self, df: DataFrame):
                DataFrame.__init__(self, df)
        
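            # Writing `self.x` inside the property would re-enter the property and
            # recurse forever. This assumes `DataFrame` resolves leaves via `__getattr__`.
            @property
            def x(self) -> DataFrame:
                return self.__getattr__('x')
            @property
            def y(self) -> DataFrame:
                return self.__getattr__('y')
            @property
            def z(self) -> DataFrame:
                return self.__getattr__('z')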
        
            @property
            def xy(self) -> DataFrame:
                import numpy as np
                return np.sqrt(self.x*self.x + self.y*self.y)
        ```
        
        Now you can write `v.xy` and you have the `L_xy` distance from the origin. It is also possible to implement vector operations. This library doesn't help you with that, but it isn't difficult.
        
        You can add the class decorator `exclusive_class` if you only want the supplied properties to be available (so `v.zz` would cause an error).
        
        The extra work to support this is almost trivial - see test cases, even one with vector addition, in the file `test_object.py` for further examples.
        
        ### Adding to the data model using an Alias
        
        This is a simple feature which allows you to invent shorthand for more complex expressions. This makes it easy to use. Further, the backend never knows about these shorthand scripts - they are just substituted in on the fly as the DAG is built. For example, in the ATLAS experiment, to access jet pT in GeV I always need to divide by 1000. So:
        
        ```python
        define_alias('', 'pt', lambda o: o.pt / 1000.0)
        ```
        
        Now if one enters `d.jets.pt`, the backend will see it as if one had typed `d.jets.pt/1000.0`. The same can be done for collections. For example:
        
        ```python
        define_alias('.', 'eles', lambda e: e.Electrons("Electrons"))
        ```
        
        And when one enters `d.eles.pt` the backend will see `d.Electrons("Electrons").pt / 1000.0`.
        
        The aliases can reference each other (though no recursion is allowed), so fairly complex expressions can be built up. This library's alias resolution is quite simple (it is a prototype). Matching is possible. For example, if the first argument is a `.`, then only references directly off the dataframe are translated. This feature could be used to define a _personality_ module for an analysis for an experiment.
        
        ## Usage with a backend
        
        While the above shows you what the library can track, it says nothing about how you use it. The following steps are necessary.
        
        1. Subclass `dataframe_expressions.DataFrame` in your library to create a "source" dataframe. For example, it could refer to a file, or a network endpoint that supplies data. Make sure you initialize the `DataFrame` subclass by calling its `__init__` method; there is no need to pass any arguments. For this discussion let's call this `MyDF`.
        
        1. Users build expressions as you would expect: `df = MyDF(...)`, and `df1 = df.jets[df.jets.pt > 10]`.
        
        1. Users trigger rendering of the expression in your library in some way that makes sense, `get_data(df1)` for example, where you must supply the `get_data` method.
        
        1. When you get control with the `DataFrame` expression the user wants rendered, you can now do the following to render it:
        
        ```python
        from dataframe_expressions import render
        expression, context = render(df1)
        ```
        
        `expression` is an `ast.AST` that describes what is being looked at. If the expression is `df.jets.pt` then the ast is a chain of python `ast.Attribute` nodes, and the bottom one will be a special `ast_DataFrame` object that contains a member `DataFrame` which points to your original sub-classed `MyDF`. You can tell it is the _special_ `DataFrame` because it will have no children.
        
        If there are filters, there is another special ast object you need to be able to process: `ast_Filter`. For example, `df[df.met > 50].jets.pt` will have an expression starting with two `ast.Attribute` nodes for the `jets.pt` attributes, followed by an `ast_Filter` node. The `ast_Filter` object has two members. The first, `filter`, points to an expression that is the filter; it should evaluate to true or false. The second points to the `DataFrame` it is filtering, in this case `MyDF`. Whenever a phrase is repeated, like `df` in `df[df.met > 50].jets.pt` or `df.jets` in `df.jets[df.jets.count() == 2]`, the repeats will point to the same `ast_DataFrame` object, so you can use that when walking the tree to recognize common sub-expressions.
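        
        The shape of such a tree can be explored with the standard `ast` module alone. The sketch below parses the source text of the example rather than calling `render`, so the filter shows up as a plain `ast.Subscript` and the root as an `ast.Name`; in a rendered expression those positions would hold the `ast_Filter` and `ast_DataFrame` nodes described above. The helper name `split_attribute_chain` is made up for this illustration:
        
        ```python
        import ast
        from typing import List, Tuple
        
        
        def split_attribute_chain(node: ast.AST) -> Tuple[List[str], ast.AST]:
            """Peel `ast.Attribute` nodes off the top of an expression.
        
            Returns the attribute names in source order and the node underneath them.
            """
            names: List[str] = []
            while isinstance(node, ast.Attribute):
                names.append(node.attr)
                node = node.value
            names.reverse()
            return names, node
        
        
        tree = ast.parse("df[df.met > 50].jets.pt", mode="eval").body
        names, base = split_attribute_chain(tree)
        print(names)                # ['jets', 'pt']
        print(type(base).__name__)  # Subscript
        ```
        
        A backend would typically do the same kind of walk, most likely with an `ast.NodeVisitor`, dispatching on the special node types once the attribute chain runs out.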
        
        There is one last trick: `lambda` functions. `dataframe_expressions` can't evaluate the lambda functions without knowing more about the user's intent: so evaluating them must be triggered by your library. The lambda functions are represented by an `ast_Callable` object. When you do encounter them, you can render them into the same `ast.AST` like form by calling `render_callable` and passing the context along with the `ast_Callable` and any arguments to pass to the `lambda`.
        
        To see how this works, see packages like `hep_tables` and `hl_tables`.
        
        ## Helpers
        
        The `dumps` function will dump a dataframe to a string. For the most part, the string will be correct python (lambda functions and other function routines are the only exception). This is useful for including in error text or in logging in libraries that make use of this library.
        
        The `dataframe_expressions` library makes use of the python `logging` library to dump expressions it is asked to render at the debug level. If you want to turn on just messages from this library the following code will dump debug level messages to stdout:
        
        ```python
        import logging
        ch = logging.StreamHandler()
        logging.getLogger('dataframe_expressions').setLevel(logging.DEBUG)
        logging.getLogger('dataframe_expressions').addHandler(ch)
        ```
        
        ## Technology Choices
        
        Not sure these are the right thing, but...
        
        - Using the python `ast` module to record expressions. Mostly because it is already complete and there are nice visitor objects that make walking it easy. Down side is that python does change the ast every few versions.
        
        - An attribute on DataFrame refers to some data. A method call, however, does not refer to data. So, you can say `d.pt` to get at the pt, but if you said `d.pt()` that would be "bad". The reason for this is so that we can add functions that do things in a fluent way. For example, `d.jets.count()` to count the number of jets. Or `d.jets[d.jets.pt > 100].count()` or similar. Really, the back end can interpret this, but the front-end semantics sort-of make this assumption.
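        
        As a rough illustration of both points, the sketch below uses an `ast.NodeVisitor` to separate method calls from plain attribute access in one of the expressions above. It parses source text with the standard library only and is not how `dataframe_expressions` records expressions; the class name `MethodCallFinder` is an assumption made for this example:
        
        ```python
        import ast
        from typing import List
        
        
        class MethodCallFinder(ast.NodeVisitor):
            """Collect names invoked as methods, e.g. `count` in `d.jets.count()`."""
        
            def __init__(self) -> None:
                self.methods: List[str] = []
        
            def visit_Call(self, node: ast.Call) -> None:
                # A method call is an `ast.Call` whose function is an attribute lookup.
                if isinstance(node.func, ast.Attribute):
                    self.methods.append(node.func.attr)
                # Keep walking so calls nested in arguments or filters are found too.
                self.generic_visit(node)
        
        
        finder = MethodCallFinder()
        finder.visit(ast.parse("d.jets[d.jets.pt > 100].count()", mode="eval"))
        print(finder.methods)  # ['count']
        ```
        
        Here `jets` and `pt` are attribute accesses that refer to data, while `count` is the only name that is called.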
        
        ## Architecture Questions
        
        This isn't an exhaustive list. Just a list of some choices I had to make to get this off the ground.
        
        - Should there be a `Column` and `Dataset`?
          - Yes - turns out we have rediscovered why there is a Mask and a column distinction in numpy. So the Column object is really a Mask object. This is bad naming, but hopefully for this prototype that won't make much of a difference. So we should definitely think a bit about why a Mask has to be treated differently from a `DataFrame` - it isn't intuitively obvious until you get into the code.
          - No - since things can return "bool" values and we don't know it because we have no type system, they are identical to a column, except we assume they are a df: `df[df.hasProdVtx & df.hasDecayVtx]`, for example.
          - We should get rid of the concept of a parent, dynamic, and replace it with ast_DataFrame - we have it in here already - so why not just stick to that rather than having both it and `p`.
        
        - Should we allow for "&" and "|" as logical operators, redefining what they mean in python? numpy defines several logical operators which should translate, but those aren't implemented yet.
        
        - I currently have a parent as "p" in the expression, but then we have a dataframe ast and column ast - which makes it not needed. Why not just convert to using the same thing to refer to a df in an ast?
          - Internally, the "parent" dataframe is represented as `p` - which means nothing can ever have a `p` object on it or all hell is likely to break loose. A very good argument for not doing it this way.
        
        - For typing I do not know how to forward declare so I can use `Column` and `DataFrame` inside my method definitions. Static type checkers should pick this up for now by simple logic.
        
        - Using `BitAnd` and `BitOr` for `and` and `or` - but should I use the logical `And` and `Or` here to make it clear in the AST what we are talking about?
        
        - What does `d1[d[d.x > 0].jets.pt > 20].pt` mean? Is this where we are hitting the limit of things? I'd say it means nothing and should create an error. Something like `d1[(d[d.x > 0].jets.pt > 20).count()].pt` works, however. Actually, even the above: what does that mean? Isn't the right way to do that `d1[(d[d.x > 0].jets[d.jets.pt>0].count())]` or similar? Ugh. Ok - the thing to do for now is be strict, and we can add things which make life easier later.
        
        - Sometimes functions are defined in places they make no sense. For example, the `abs` (or any `numpy` function) is defined always, even if your `DataFrame` represents a collection of jets. A reason to have `columns` and `collections` as different objects to help the user, and help editors guess possibilities.
        
        - There should be no concept of `parent` in a `DataFrame`. The expression should be everything, and point to any referenced objects. This will be especially true if multiple root `DataFrame`s are ever to be used.
        
        - Is it important to define new columns using the '=' sign? e.g. `df.jets.ptgev = df.jets.pt/1000.0`?
        
        - The rule that every expression that is the same implies the same implied iterator. That means the current code can't do 2 jets, for example. There are several ways to "fix" this, however, the biggest question: is this reasonable?
        
        - The ability to have an `exclusive_object` is implemented at runtime - perhaps we can come up with a scheme where we just define objects and they "fit" in correctly? Thus editors, etc., would be able to tag this as a problem.
        
Platform: Any
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Software Development
Classifier: Topic :: Utilities
Requires-Python: >=3.6, <3.8
Description-Content-Type: text/markdown
Provides-Extra: complete
Provides-Extra: test
