# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['sparkql', 'sparkql.fields']

package_data = \
{'': ['*']}

install_requires = \
['pyspark>=3.0,<4.0']

entry_points = \
{'console_scripts': ['debug-auto-git-tag = tasks:debug_auto_git_tag',
                     'find-releasable-changes = tasks:find_releasable_changes',
                     'lint = tasks:lint',
                     'prepare-release = tasks:prepare_release',
                     'reformat = tasks:reformat',
                     'test = tasks:test',
                     'typecheck = tasks:typecheck',
                     'verify-all = tasks:verify_all']}

setup_kwargs = {
    'name': 'sparkql',
    'version': '0.7.0',
    'description': 'sparkql: Apache Spark SQL DataFrame schema management for sensible humans',
    'long_description': '# sparkql ✨\n\n[![PyPI version](https://badge.fury.io/py/sparkql.svg)](https://badge.fury.io/py/sparkql)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![CI](https://github.com/mattjw/sparkql/workflows/CI/badge.svg)](https://github.com/mattjw/sparkql/actions)\n[![codecov](https://codecov.io/gh/mattjw/sparkql/branch/master/graph/badge.svg)](https://codecov.io/gh/mattjw/sparkql)\n\nPython Spark SQL DataFrame schema management for sensible humans.\n\n> _Don\'t sweat it... sparkql it ✨_\n\n## Why use sparkql\n\n`sparkql` takes the pain out of working with DataFrame schemas in PySpark.\nIt makes schema definition more Pythonic. And it\'s\nparticularly useful when you\'re dealing with structured data.\n\nIn plain old PySpark, you might find that you write schemas\n[like this](https://github.com/mattjw/sparkql/tree/master/examples/conferences_comparison/plain_schema.py):\n\n```python\nCITY_SCHEMA = StructType()\nCITY_NAME_FIELD = "name"\nCITY_SCHEMA.add(StructField(CITY_NAME_FIELD, StringType(), False))\nCITY_LAT_FIELD = "latitude"\nCITY_SCHEMA.add(StructField(CITY_LAT_FIELD, FloatType()))\nCITY_LONG_FIELD = "longitude"\nCITY_SCHEMA.add(StructField(CITY_LONG_FIELD, FloatType()))\n\nCONFERENCE_SCHEMA = StructType()\nCONF_NAME_FIELD = "name"\nCONFERENCE_SCHEMA.add(StructField(CONF_NAME_FIELD, StringType(), False))\nCONF_CITY_FIELD = "city"\nCONFERENCE_SCHEMA.add(StructField(CONF_CITY_FIELD, CITY_SCHEMA))\n```\n\nAnd then plain old PySpark makes you deal with nested fields like this:\n\n```python\ndframe.withColumn("city_name", dframe[CONF_CITY_FIELD][CITY_NAME_FIELD])\n```\n\nInstead, with `sparkql`, schemas become a lot\n[more literate](https://github.com/mattjw/sparkql/tree/master/examples/conferences_comparison/sparkql_schema.py):\n\n```python\nclass City(Struct):\n    name = String(nullable=False)\n    latitude = Float()\n    longitude = Float()\n\nclass Conference(Struct):\n    name = 
String(nullable=False)\n    city = City()\n```\n\nAs does dealing with nested fields:\n\n```python\ndframe.withColumn("city_name", Conference.city.name.COL)\n```\n\nHere\'s a summary of `sparkql`\'s features.\n\n- ORM-like class-based Spark schema definitions.\n- Automated field naming: The attribute name of a field as it appears\n  in its `Struct` is (by default) used as its field name. This name can\n  be optionally overridden.\n- Programmatically reference nested fields in your structs with the\n  `PATH` and `COL` special properties. Avoid hand-constructing strings\n  (or `Column`s) to reference your nested fields.\n- Validate that a DataFrame matches a `sparkql` schema.\n- Reuse and build composite schemas with `inheritance`, `includes`, and\n  `implements`.\n- Get a human-readable Spark schema representation with `pretty_schema`.\n- Create an instance of a schema as a dictionary, with validation of\n  the input values.\n\nRead on for documentation on these features.\n\n## Defining a schema\n\nEach Spark atomic type has a counterpart `sparkql` field:\n\n| PySpark type | `sparkql` field |\n|---|---|\n| `ByteType` | `Byte` |\n| `IntegerType` | `Integer` |\n| `LongType` | `Long` |\n| `ShortType` | `Short` |\n| `DecimalType` | `Decimal` |\n| `DoubleType` | `Double` |\n| `FloatType` | `Float` |\n| `StringType` | `String` |\n| `BinaryType` | `Binary` |\n| `BooleanType` | `Boolean` |\n| `DateType` | `Date` |\n| `TimestampType` | `Timestamp` |\n\n`Array` (counterpart to `ArrayType` in PySpark) allows the definition\nof arrays of objects. 
By creating a subclass of `Struct`, we can\ndefine a custom class that will be converted to a `StructType`.\n\nFor\n[example](https://github.com/mattjw/sparkql/tree/master/examples/arrays/arrays.py),\ngiven the `sparkql` schema definition:\n\n```python\nfrom sparkql import Struct, String, Array\n\nclass Article(Struct):\n    title = String(nullable=False)\n    tags = Array(String(), nullable=False)\n    comments = Array(String(nullable=False))\n```\n\nThen we can build the equivalent PySpark schema (a `StructType`)\nwith:\n\n```python\nfrom sparkql import schema\n\npyspark_struct = schema(Article)\n```\n\nPretty printing the schema with the expression\n`sparkql.pretty_schema(pyspark_struct)` will give the following:\n\n```text\nStructType([\n    StructField(\'title\', StringType(), False),\n    StructField(\'tags\',\n        ArrayType(StringType(), True),\n        False),\n    StructField(\'comments\',\n        ArrayType(StringType(), False),\n        True)])\n```\n\n## Features\n\nMany examples of how to use `sparkql` can be found in\n[`examples`](https://github.com/mattjw/sparkql/tree/master/examples).\n\n### Automated field naming\n\nBy default, field names are inferred from the attribute name in the\nstruct in which they are declared.\n\nFor example, given the struct\n\n```python\nclass Geolocation(Struct):\n    latitude = Float()\n    longitude = Float()\n```\n\nthe concrete name of the `Geolocation.latitude` field is `latitude`.\n\nNames can also be overridden by explicitly specifying the field name as an\nargument to the field\n\n```python\nclass Geolocation(Struct):\n    latitude = Float(name="lat")\n    longitude = Float(name="lon")\n```\n\nwhich would mean the concrete name of the `Geolocation.latitude` field\nis `lat`.\n\n### Field paths and nested objects\n\nReferencing fields in nested data can be a chore. 
`sparkql` simplifies this\nwith path referencing.\n\n[For example](https://github.com/mattjw/sparkql/tree/master/examples/nested_objects/sparkql_example.py), if we have a\nschema with nested objects:\n\n```python\nclass Address(Struct):\n    post_code = String()\n    city = String()\n\n\nclass User(Struct):\n    username = String(nullable=False)\n    address = Address()\n\n\nclass Comment(Struct):\n    message = String()\n    author = User(nullable=False)\n\n\nclass Article(Struct):\n    title = String(nullable=False)\n    author = User(nullable=False)\n    comments = Array(Comment())\n```\n\nWe can use the special `PATH` property to turn a path into a\nSpark-understandable string:\n\n```python\nauthor_city_str = Article.author.address.city.PATH\n"author.address.city"\n```\n\n`COL` is a counterpart to `PATH` that returns a Spark `Column`\nobject for the path, allowing it to be used in all places where Spark\nrequires a column.\n\nFunction equivalents `path_str`, `path_col`, and `name` are also available.\nThis table demonstrates the equivalence of the property styles and the function\nstyles:\n\n| Property style | Function style | Result (both styles are equivalent) |\n| --- | --- | --- |\n| `Article.author.address.city.PATH` | `sparkql.path_str(Article.author.address.city)` | `"author.address.city"` |\n| `Article.author.address.city.COL` | `sparkql.path_col(Article.author.address.city)` | `Column` pointing to `author.address.city` |\n| `Article.author.address.city.NAME` | `sparkql.name(Article.author.address.city)` | `"city"` |\n\nFor paths that include an array, two approaches are provided:\n\n```python\ncomment_usernames_str = Article.comments.e.author.username.PATH\n"comments.author.username"\n\ncomment_usernames_str = Article.comments.author.username.PATH\n"comments.author.username"\n```\n\nBoth give the same result. However, the former (`e`) is more\ntype-oriented. The `e` attribute corresponds to the array\'s element\nfield. 
Although this looks strange at first, it has the advantage of\nbeing inspectable by IDEs and other tools, allowing goodness such as\nIDE auto-completion, automated refactoring, and identifying errors\nbefore runtime.\n\n### DataFrame validation\n\nStruct method `validate_data_frame` will verify whether a given DataFrame\'s\nschema matches the Struct.\n[For example](https://github.com/mattjw/sparkql/tree/master/examples/validation/test_validation.py),\nif we have our `Article`\nstruct and a DataFrame we want to ensure adheres to the `Article`\nschema:\n\n```python\ndframe = spark_session.createDataFrame([{"title": "abc"}])\n\nclass Article(Struct):\n    title = String()\n    body = String()\n```\n\nThen we can validate with:\n\n```python\nvalidation_result = Article.validate_data_frame(dframe)\n```\n\n`validation_result.is_valid` indicates whether the DataFrame is valid\n(`False` in this case), and `validation_result.report` is a\nhuman-readable string describing the differences:\n\n```text\nStruct schema...\n\nStructType([\n    StructField(\'title\', StringType(), True),\n    StructField(\'body\', StringType(), True)])\n\nDataFrame schema...\n\nStructType([\n    StructField(\'title\', StringType(), True)])\n\nDiff of struct -> data frame...\n\n  StructType([\n-     StructField(\'title\', StringType(), True)])\n+     StructField(\'title\', StringType(), True),\n+     StructField(\'body\', StringType(), True)])\n```\n\nFor convenience,\n\n```python\nArticle.validate_data_frame(dframe).raise_on_invalid()\n```\n\nwill raise an `InvalidDataFrameError` (see `sparkql.exceptions`) if the  \nDataFrame is not valid.\n\n### Creating an instance of a schema\n\n`sparkql` simplifies the process of creating an instance of a struct.\nYou might need to do this, for example, when creating test data, or\nwhen creating an object (a dict or a row) to return from a UDF.\n\nUse `Struct.make_dict(...)` to instantiate a struct as a dictionary.\nThis has the advantage that the input values 
will be correctly\nvalidated, and it will convert schema property names into their\nunderlying field names.\n\nFor\n[example](https://github.com/mattjw/sparkql/tree/master/examples/struct_instantiation/instantiate_as_dict.py),\ngiven some simple Structs:\n\n```python\nclass User(Struct):\n    id = Integer(name="user_id", nullable=False)\n    username = String()\n\nclass Article(Struct):\n    id = Integer(name="article_id", nullable=False)\n    title = String()\n    author = User()\n    text = String(name="body")\n```\n\nHere are a few examples of creating dicts from `Article`:\n\n```python\nArticle.make_dict(\n    id=1001,\n    title="The article title",\n    author=User.make_dict(\n        id=440,\n        username="user"\n    ),\n    text="Lorem ipsum article text lorem ipsum."\n)\n\n# generates...\n{\n    "article_id": 1001,\n    "author": {\n        "user_id": 440,\n        "username": "user"},\n    "body": "Lorem ipsum article text lorem ipsum.",\n    "title": "The article title"\n}\n```\n\n```python\nArticle.make_dict(\n    id=1002\n)\n\n# generates...\n{\n    "article_id": 1002,\n    "author": None,\n    "body": None,\n    "title": None\n}\n```\n\nSee\n[this example](https://github.com/mattjw/sparkql/tree/master/examples/conferences_extended/conferences.py)\nfor an extended example of using `make_dict`.\n\n### Composite schemas\n\nIt is sometimes useful to be able to re-use the fields of one struct\nin another struct. 
`sparkql` provides a few features to enable this:\n\n- _inheritance_: A subclass inherits the fields of a base struct class.\n- _includes_: Incorporate fields from another struct.\n- _implements_: Enforce that a struct must implement the fields of\n  another struct.\n\nSee the following examples for a better explanation.\n\n#### Using inheritance\n\nFor [example](https://github.com/mattjw/sparkql/tree/master/examples/composite_schemas/inheritance.py), the following:\n\n```python\nclass BaseEvent(Struct):\n    correlation_id = String(nullable=False)\n    event_time = Timestamp(nullable=False)\n\nclass RegistrationEvent(BaseEvent):\n    user_id = String(nullable=False)\n```\n\nwill produce the following `RegistrationEvent` schema:\n\n```text\nStructType([\n    StructField(\'correlation_id\', StringType(), False),\n    StructField(\'event_time\', TimestampType(), False),\n    StructField(\'user_id\', StringType(), False)])\n```\n\n#### Using an `includes` declaration\n\nFor [example](https://github.com/mattjw/sparkql/tree/master/examples/composite_schemas/includes.py), the following:\n\n```python\nclass EventMetadata(Struct):\n    correlation_id = String(nullable=False)\n    event_time = Timestamp(nullable=False)\n\nclass RegistrationEvent(Struct):\n    class Meta:\n        includes = [EventMetadata]\n    user_id = String(nullable=False)\n```\n\nwill produce the `RegistrationEvent` schema:\n\n```text\nStructType([\n    StructField(\'user_id\', StringType(), False),\n    StructField(\'correlation_id\', StringType(), False),\n    StructField(\'event_time\', TimestampType(), False)])\n```\n\n#### Using an `implements` declaration\n\n`implements` is similar to `includes`, but does not automatically\nincorporate the fields of specified structs. 
Instead, it is up to\nthe implementor to ensure that the required fields are declared in\nthe struct.\n\nFailing to implement a field from an `implements` struct will result in\na `StructImplementationError`.\n\n[For example](https://github.com/mattjw/sparkql/tree/master/examples/composite_schemas/implements.py):\n\n```python\nclass LogEntryMetadata(Struct):\n    logged_at = Timestamp(nullable=False)\n\nclass PageViewLogEntry(Struct):\n    class Meta:\n        implements = [LogEntryMetadata]\n    page_id = String(nullable=False)\n\n# the above class declaration will fail with the following StructImplementationError:\n#   Struct \'PageViewLogEntry\' does not implement field \'logged_at\' required by struct \'LogEntryMetadata\'\n```\n\n\n### Prettified Spark schema strings\n\nSpark\'s stringified schema representation isn\'t very user-friendly, particularly for large schemas:\n\n\n```text\nStructType([StructField(\'name\', StringType(), False), StructField(\'city\', StructType([StructField(\'name\', StringType(), False), StructField(\'latitude\', FloatType(), True), StructField(\'longitude\', FloatType(), True)]), True)])\n```\n\nThe function `pretty_schema` will return something more useful:\n\n```text\nStructType([\n    StructField(\'name\', StringType(), False),\n    StructField(\'city\',\n        StructType([\n            StructField(\'name\', StringType(), False),\n            StructField(\'latitude\', FloatType(), True),\n            StructField(\'longitude\', FloatType(), True)]),\n        True)])\n```\n\n### Merge two Spark `StructType` types\n\nIt can be useful to build a composite schema from two `StructType`s. 
sparkql provides a\n`merge_schemas` function to do this.\n\n[For example](https://github.com/mattjw/sparkql/tree/master/examples/merge_struct_types/merge_struct_types.py):\n\n```python\nschema_a = StructType([\n    StructField("message", StringType()),\n    StructField("author", ArrayType(\n        StructType([\n            StructField("name", StringType())\n        ])\n    ))\n])\n\nschema_b = StructType([\n    StructField("author", ArrayType(\n        StructType([\n            StructField("address", StringType())\n        ])\n    ))\n])\n\nmerged_schema = merge_schemas(schema_a, schema_b) \n```\n\nresults in a `merged_schema` that looks like:\n\n```text\nStructType([\n    StructField(\'message\', StringType(), True),\n    StructField(\'author\',\n        ArrayType(StructType([\n            StructField(\'name\', StringType(), True),\n            StructField(\'address\', StringType(), True)]), True),\n        True)])\n```\n\n## Contributing\n\nContributions are very welcome. Developers who\'d like to contribute to\nthis project should refer to [CONTRIBUTING.md](./CONTRIBUTING.md).\n',
    'author': 'Matt J Williams',
    'author_email': 'mattjw@mattjw.net',
    'maintainer': None,
    'maintainer_email': None,
    'url': 'https://github.com/mattjw/sparkql',
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'entry_points': entry_points,
    'python_requires': '>=3.7.2,<4.0.0',
}


setup(**setup_kwargs)
