Use Walkthrough
Install and import the package.
pip install metaform
import metaform
Basic Usage
Let's say we have some data:
data = {
'hello': 1.0,
'world': 2,
'how': ['is', {'are': {'you': 'doing'}}]
}
We can get a template for defining a schema with metaform.template:
metaform.template(data)
{'*': '', 'hello': {'*': ''}, 'how': [{'*': '', 'are': {'you': {'*': ''}}}], 'world': {'*': ''}}
This provides an opportunity to specify metadata for each key and the object itself. For example:
schema = {
'*': 'greeting',
'hello': {'*': 'length'},
'world': {'*': 'atoms'},
'how': [
{'*': 'method',
'are': {
'*': 'yup',
'you': {'*': 'me'}}
}
]}
metaform.normalize(data, schema)
{'atoms': 2, 'length': 1.0, 'method': ['is', {'yup': {'me': 'doing'}}]}
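To make the convention concrete: each '*' entry names the key that contains it, and a one-element list carries the schema for every item of a list. The sketch below reproduces the renaming above in plain Python. It is not metaform's implementation; rename_keys and new_name are hypothetical helpers written only for illustration, and they ignore converters.

```python
def new_name(key, sub):
    """Return the name stored under '*' in a sub-schema, or the original key."""
    # For a list schema, the name lives in the schema of its first item.
    spec = sub[0] if isinstance(sub, list) and sub else sub
    if isinstance(spec, dict):
        return spec.get('*') or key  # an empty name means "keep the key"
    return key


def rename_keys(data, schema):
    """Recursively rename dict keys according to a '*'-style schema (illustration only)."""
    if isinstance(schema, list):
        item_schema = schema[0] if schema else {}
        if isinstance(data, list):
            return [rename_keys(item, item_schema) for item in data]
        return data
    if isinstance(data, dict) and isinstance(schema, dict):
        out = {}
        for key, value in data.items():
            sub = schema.get(key)
            if sub is None:
                out[key] = value  # keys absent from the schema pass through unchanged
            else:
                out[new_name(key, sub)] = rename_keys(value, sub)
        return out
    return data  # scalars are returned as-is


data = {
    'hello': 1.0,
    'world': 2,
    'how': ['is', {'are': {'you': 'doing'}}],
}
schema = {
    '*': 'greeting',
    'hello': {'*': 'length'},
    'world': {'*': 'atoms'},
    'how': [{'*': 'method', 'are': {'*': 'yup', 'you': {'*': 'me'}}}],
}
renamed = rename_keys(data, schema)
print(renamed)
```

Note that the top-level '*' ('greeting') describes the object itself and does not appear in the result, which matches the output shown above.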
We recommend saving the schemas you create for normalization in data analytics and driver projects in a .schema dot-folder, as JSON or YAML files.
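One possible way to follow this recommendation with the standard library only is sketched here. The helper names save_schema and load_schema, the file name greeting.json, and the root parameter are assumptions made for illustration; metaform itself may offer or expect something different.

```python
import json
import tempfile
from pathlib import Path


def save_schema(schema, name, root='.'):
    """Write a schema to <root>/.schema/<name>.json and return the path."""
    folder = Path(root) / '.schema'
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f'{name}.json'
    path.write_text(json.dumps(schema, indent=2, ensure_ascii=False), encoding='utf-8')
    return path


def load_schema(name, root='.'):
    """Read a schema back from <root>/.schema/<name>.json."""
    path = Path(root) / '.schema' / f'{name}.json'
    return json.loads(path.read_text(encoding='utf-8'))


schema = {
    '*': 'greeting',
    'hello': {'*': 'length'},
    'world': {'*': 'atoms'},
    'how': [{'*': 'method', 'are': {'*': 'yup', 'you': {'*': 'me'}}}],
}

# A temporary directory stands in for the project root in this example.
with tempfile.TemporaryDirectory() as project:
    saved_path = save_schema(schema, 'greeting', root=project)
    restored = load_schema('greeting', root=project)
```

Because a schema consists only of dicts, lists and strings, it survives the JSON round trip unchanged.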
So we have access to all keys, and can specify what to do with their values:
schema = {
'*': 'greeting',
'hello': {'*': 'length|lambda x: x+5.'},
'world': {'*': 'atoms|lambda x: str(x)+"ABC"'},
'how': [
{'*': 'method',
'are': {
'*': 'yup',
'you': {'*': 'me|lambda x: "-".join(list(x))'}}
}
]}
metaform.normalize(data, schema)
{'atoms': '2ABC', 'length': 6.0, 'method': ['is', {'yup': {'me': 'd-o-i-n-g'}}]}
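The 'name|expression' strings above pack two things into one value: the new key name before the bar, and a Python expression after it. Here is a minimal sketch of how such a string could be split and turned into a callable. How metaform parses it internally is an assumption here, and parse_spec is a hypothetical helper.

```python
def parse_spec(spec):
    """Split 'name|expression' into (name, callable or None)."""
    name, sep, expr = spec.partition('|')  # split on the first bar only
    # eval executes arbitrary code: only use it on schemas you wrote or trust.
    func = eval(expr) if sep and expr.strip() else None
    return name, func


# A name plus a converter.
name, func = parse_spec('length|lambda x: x+5.')
converted = func(1.0)

# An empty name with a converter: keep the key, convert only the value.
keep_name, to_bool = parse_spec('|lambda x: x and True or False')

# A plain name without a converter.
plain = parse_spec('greeting')
```

Since the expression is evaluated as code, schemas loaded from untrusted sources deserve the same caution as any other executable input.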
Now suppose we want to define a more complex function that would be inconvenient to write as a lambda:
from metaform import converters
def some_func(x):
a = 123
b = 345
return (b-a)*x
converters.func = some_func
schema = {
'*': 'greeting',
'hello': {'*': 'length|converters.func'},
'world': {'*': 'atoms|lambda x: str(x)+"ABC"'},
'how': [
{'*': 'method',
'are': {
'*': 'yup',
'you': {'*': 'me|lambda x: "-".join(list(x))'}}
}
]}
metaform.normalize(data, schema)
{'atoms': '2ABC', 'length': 222.0, 'method': ['is', {'yup': {'me': 'd-o-i-n-g'}}]}
We just renamed the keys and normalized the values! What else could we want?
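A reference such as 'length|converters.func' is a dotted name rather than a lambda, so it can in principle be resolved by attribute lookup instead of evaluating code. The sketch below shows that idea under stated assumptions: the SimpleNamespace object is only a stand-in for metaform.converters, and resolve is a hypothetical helper, not part of the library.

```python
from types import SimpleNamespace

# Stand-in for the metaform.converters module (an assumption for this sketch).
converters = SimpleNamespace()


def some_func(x):
    a = 123
    b = 345
    return (b - a) * x


converters.func = some_func


def resolve(dotted, namespace):
    """Resolve a name like 'converters.func' using attribute access only."""
    root, *attrs = dotted.split('.')
    obj = namespace[root]  # raises KeyError for unknown roots
    for attr in attrs:
        obj = getattr(obj, attr)  # raises AttributeError for unknown attributes
    return obj


func = resolve('converters.func', {'converters': converters})
result = func(1.0)  # (345 - 123) * 1.0
```

Restricting lookups to a known namespace keeps named converters explicit and avoids running arbitrary expressions.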
Normalizing Data
Suppose we have similar data from different sources. For example, topics and comments are not so different after all, because if a comment becomes large enough, it can stand as a topic of its own.
import pandas
import requests
topics = requests.get('https://api.infty.xyz/topics/?format=json').json()['results']
comments = requests.get('https://api.infty.xyz/comments/?format=json').json()['results']
Let's define schemas for them, so that the key names and types match:
topics_schema = [{
'id': {'*': 'topic-id'},
'type': {'*': '|lambda x: {0: "NEED", 1: "GOAL", 2: "IDEA", 3: "PLAN", 4: "STEP", 5: "TASK"}.get(x)'},
'owner': {'username': {'*': ''}, 'id': {'*': 'user-id'}},
'blockchain': {'*': '|lambda x: x and True or False'},
}]
normal_topics = metaform.normalize(topics, topics_schema)
topics_df = pandas.json_normalize(normal_topics)  # pandas >= 1.0; older versions: pandas.io.json.json_normalize
topics_df.dtypes
blockchain          bool
body                object
categories          object
categories_names    object
children            object
comment_count       int64
created_date        object
data                object
declared            float64
editors             object
funds               float64
is_draft            bool
languages           object
matched             float64
owner.user-id       int64
owner.username      object
parents             object
title               object
topic-id            int64
type                object
updated_date        object
url                 object
dtype: object
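The dotted column names such as owner.user-id appear because json_normalize flattens nested dictionaries and joins the key path with a '.' separator. A rough plain-Python sketch of that flattening step follows; flatten is a hypothetical helper, the sample record is made up from values seen in this walkthrough, and pandas' own implementation handles more cases.

```python
def flatten(record, parent='', sep='.'):
    """Flatten nested dicts into one level, joining key paths with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f'{parent}{sep}{key}' if parent else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # lists and scalars are kept as they are
    return flat


row = flatten({
    'topic-id': 894,
    'owner': {'user-id': 147, 'username': 'Mindey@FE706DAF'},
    'languages': ['en'],
})
print(row)
```

Lists are left intact, which is why columns such as languages keep the object dtype.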
comments_schema = [{
'id': {'*': 'comment-id'},
'topic': {'*': 'topic-url'},
'text': {'*': 'body'},
'owner': {'username': {'*': ''}, 'id': {'*': 'user-id'}},
'blockchain': {'*': '|lambda x: x and True or False'},
}]
normal_comments = metaform.normalize(comments, comments_schema)
comments_df = pandas.json_normalize(normal_comments)  # pandas >= 1.0; older versions: pandas.io.json.json_normalize
comments_df.dtypes
assumed_hours       object
blockchain          bool
body                object
claimed_hours       object
comment-id          int64
created_date        object
donated             float64
languages           object
matched             float64
owner.user-id       int64
owner.username      object
parent              object
remains             float64
topic-url           object
updated_date        object
url                 object
dtype: object
df = pandas.concat([topics_df, comments_df], sort=False)
df.head()
| | blockchain | body | categories | categories_names | children | comment_count | created_date | data | declared | editors | ... | type | updated_date | url | assumed_hours | claimed_hours | comment-id | donated | parent | remains | topic-url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | .:en\nAdd the **trade.Exchange** model, to ena... | [] | [] | [] | 1.0 | 2019-09-21T09:15:48.194279 | | 0.15 | [] | ... | TASK | 2019-09-21T09:34:00.686125 | https://api.infty.xyz/topics/894/?format=json | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | False | .:en\nIt would make sense, especially useful i... | [] | [] | [] | 0.0 | 2019-09-18T14:15:57.579981 | | 0.00 | [] | ... | TASK | 2019-09-18T14:15:57.580044 | https://api.infty.xyz/topics/893/?format=json | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | True | .:lt\nInfinity yra labiau kūrybai skirtas proj... | [] | [] | [] | 0.0 | 2019-09-18T11:02:16.678286 | | 0.00 | [] | ... | TASK | 2019-09-18T11:07:45.004434 | https://api.infty.xyz/topics/892/?format=json | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | True | .:lt\nKadangi turime įmonių duomenų bazę, tai ... | [] | [] | [https://api.infty.xyz/topics/892/?format=json] | 0.0 | 2019-09-18T10:59:47.173797 | | 0.00 | [] | ... | TASK | 2019-09-18T12:48:06.209215 | https://api.infty.xyz/topics/891/?format=json | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | True | .:en\nEach goal that we set, is essentially ec... | [] | [] | [] | 1.0 | 2019-09-18T01:47:23.604488 | | 0.00 | [] | ... | GOAL | 2019-09-21T10:22:13.226363 | https://api.infty.xyz/topics/890/?format=json | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
But that leaves us with a potential alignment problem when keys representing the same things appear at different levels of the hierarchy in different sources.
Aligning Data
Suppose we want to pick out the matching keys from different levels of the hierarchies, and put them at the top.
To make things harder, let's move the user reference somewhere deeper in one of the sources, and remove the original:
abnormal_comments = [dict(comment,**{"some": {"place": {"deep": comment["owner"]}}, "owner": None}) for comment in normal_comments]
abnormal_comments[0]
{'assumed_hours': '0.00000000', 'blockchain': True, 'body': '.:en\nhttps://wiki.mindey.com/shared/shots/b51de15b96a58b76fbeb3a1ef.png\n{0.15}', 'claimed_hours': '0.15000000', 'comment-id': 791, 'created_date': '2019-09-21T10:05:34.228102', 'donated': 0.0, 'languages': ['en'], 'matched': 0.15, 'owner': None, 'parent': None, 'remains': 0.0, 'some': {'place': {'deep': {'user-id': 147, 'username': 'Mindey@FE706DAF'}}}, 'topic-url': 'https://api.infty.xyz/topics/894/?format=json', 'updated_date': '2019-09-21T10:05:54.924341', 'url': 'https://api.infty.xyz/comments/791/?format=json'}
metaform.align([normal_topics[:1], abnormal_comments[:1]])
<generator object align at 0x7f5207473d58>
list(_)
[{0: 'en', 'blockchain': True, 'body': '.:en\nAdd the trade.Exchange model, to enable atomic exchange of assets between identities, identities being users.User, and assets being things registered as meta.Instances, which may be created at the time of operation, if necessary to identify some divisible quantity, like liters of water, or amounts of money .\n\nEach Exchange would involve equivalent exchange of hour-money.\n\nSo, an Exchange would credit one account, and debit another account.', 'created_date': '2019-09-21T09:15:48.194279', 'matched': 0.15, 'updated_date': '2019-09-21T09:34:00.686125', 'url': 'https://api.infty.xyz/topics/894/?format=json', 'user-id': 147, 'username': 'Mindey@FE706DAF'}, {0: 'en', 'blockchain': True, 'body': '.:en\nhttps://wiki.mindey.com/shared/shots/b51de15b96a58b76fbeb3a1ef.png\n{0.15}', 'created_date': '2019-09-21T10:05:34.228102', 'matched': 0.15, 'updated_date': '2019-09-21T10:05:54.924341', 'url': 'https://api.infty.xyz/comments/791/?format=json', 'user-id': 147, 'username': 'Mindey@FE706DAF'}]
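Judging from the output above, alignment keeps the leaf keys that every record shares, regardless of how deeply they are nested (the 0 key comes from the first element of languages). The following is a minimal sketch of that idea in plain Python. It is an assumption about the behavior, not metaform's implementation: leaves and align_sketch are hypothetical helpers, None values are skipped by choice, and the sample records are shortened stand-ins for the real ones.

```python
def leaves(obj, key=None):
    """Yield (key, value) for each scalar leaf; key is the innermost dict key or list index."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from leaves(v, k)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from leaves(v, i)
    elif obj is not None:
        yield key, obj


def align_sketch(sources):
    """Flatten every record to its leaves and keep only the keys common to all records."""
    # If a leaf key occurs more than once in a record, the last occurrence wins.
    records = [dict(leaves(record)) for source in sources for record in source]
    if not records:
        return []
    common = set.intersection(*(set(record) for record in records))
    return [{k: v for k, v in record.items() if k in common} for record in records]


topic = {
    'url': 'https://api.infty.xyz/topics/894/?format=json',
    'languages': ['en'],
    'owner': {'user-id': 147, 'username': 'Mindey@FE706DAF'},
    'type': 'TASK',
}
comment = {
    'url': 'https://api.infty.xyz/comments/791/?format=json',
    'languages': ['en'],
    'owner': None,
    'some': {'place': {'deep': {'user-id': 147, 'username': 'Mindey@FE706DAF'}}},
}
aligned = align_sketch([[topic], [comment]])
```

Both resulting records carry user-id and username at the top level, even though the comment stored them three levels deep, while type is dropped because only the topic has it.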