Creating fixture data in MongoDB using Python

Creating fixture data in MongoDB is a recurring challenge that we face while developing our web applications.
Often, the web applications that we work on are built on Meteor.js (and other Node.js stacks).
While Meteor is an extremely good framework for developing web applications, bootstrapping data into MongoDB through Node is a fairly painful and laborious exercise.
Both the cognitive load and the amount of programming required are considerable.
A lot of the data comes from other sources, is generated as fake data, or is scraped, and is then fed into the Node scripts.
However, in the interest of efficiency, we believe that using Python, which is a more data-analysis-oriented language, makes things much easier and much more productive.
So here is a demonstration of using Python to bootstrap one of our applications.
Before we do that, one key requirement of our applications is that the documents in the collections we populate must be referenced using string IDs (ObjectIds break our applications in certain places :-( ).
Bootstrapping the data with Node.js, besides being laborious, also generated ObjectIds. I'm sure there are other ways of overcoming this, but a quick and more efficient fix for us was to use Python.
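To make the string ID requirement concrete, here is a minimal sketch of generating Meteor-style string IDs with only the standard library. The alphabet and the 17-character length are assumptions based on how Meteor's `Random.id()` is commonly described (and they match IDs such as `wTMBsH8p9CEGxtPHf`), so check them against your Meteor version. `meteor_style_id` is a hypothetical helper name, not part of Meteor or pymongo.

```python
import secrets

# Assumption: this alphabet and the default length of 17 mirror Meteor's
# Random.id(); verify against the Meteor version you run.
METEOR_ID_ALPHABET = "23456789ABCDEFGHJKLMNPQRSTWXYZabcdefghijkmnopqrstuvwxyz"


def meteor_style_id(length=17):
    """Return a random string ID drawn from METEOR_ID_ALPHABET."""
    return "".join(secrets.choice(METEOR_ID_ALPHABET) for _ in range(length))


# Each call yields a new random string that can be used as a document _id.
print(meteor_style_id())
```

Any unique string works as an `_id` as far as MongoDB is concerned, so this is only worth doing if you want fixture IDs to look like the ones Meteor creates itself.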
The section below demonstrates the ease of bootstrapping data for your Node.js application using Python, while keeping the document IDs as strings.
It's pretty straightforward:
- A couple of lines to read the data that you want to import
- Manipulating the data into a format that is compliant with your MongoDB collection structure
- Creating the IDs as strings
- Finally, populating the data into MongoDB
Pretty sweet, straightforward and extremely productive.
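The four steps above can be sketched in a few lines using only the standard library. This is an illustration under stated assumptions, not our actual script: `SAMPLE_CSV`, the `price` column, `build_documents` and `load` are hypothetical, and `uuid4().hex` stands in for any string ID generator.

```python
import csv
import io
import uuid

# Hypothetical sample rows standing in for a real listings CSV.
SAMPLE_CSV = """address.street1,address.city,address.state,address.postalCode,price
12 Oak St,Austin,TX,78701,499000
9 Elm Ave,Dallas,TX,75201,325000
"""


def build_documents(csv_text):
    """Steps 1-3: read CSV text and shape each row into a Mongo-style document."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        documents.append({
            # Step 2: fold the flat 'address.*' columns into one nested object.
            "address": {
                "street1": row["address.street1"],
                "city": row["address.city"],
                "state": row["address.state"],
                "postalCode": int(row["address.postalCode"]),
            },
            "price": int(row["price"]),
            # Step 3: a string _id (uuid4 is a stdlib stand-in here).
            "_id": uuid.uuid4().hex,
        })
    return documents


def load(collection, documents):
    """Step 4: insert into any object that exposes an insert_many() method."""
    collection.insert_many(documents)


docs = build_documents(SAMPLE_CSV)
print(docs[0]["address"], type(docs[0]["_id"]).__name__)
```

With pymongo installed, passing a real collection object such as `db.homes` to `load` would perform the insert, since pymongo collections provide `insert_many`.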
Here is the code, quickly put together after exporting from a Jupyter notebook:
#!/usr/bin/env python
# import standard libraries
import sys, json
import pandas as pd
from pymongo import MongoClient
import numpy as np
import platform
from pprint import pprint
from os.path import expanduser
import datetime
from os.path import join, dirname
from dotenv import load_dotenv
import os
# load settings from a '.env' file, explicitly providing its path
# credit: https://github.com/theskumar/python-dotenv
from pathlib import Path  # python3 only
cwd = os.getcwd()
print(cwd)
env_path = Path(cwd) / '.env'
print(env_path)
load_dotenv(dotenv_path=env_path, verbose=True)
AWSAccessKeyId = os.getenv("AWSAccessKeyId")
AWSSecretAccessKey = os.getenv("AWSSecretAccessKey")
AWSRegion = os.getenv("AWSRegion")
AWSBucket = os.getenv("AWSBucket")
# check that the env variables were loaded correctly
print(AWSAccessKeyId)
if platform.system() == 'Darwin':
    home = expanduser("~")
    f_open_listings = home+"/Dropbox/pandora/My-Projects/repos/mypad-mini-projects/map-points-with-google-maps/sample-data/open-listings/consolidated-ol-props.csv"
# connect to the database to be populated
conn = MongoClient("127.0.0.1", 2602)
db = conn.get_database('meteor')
df_listings = pd.read_csv(f_open_listings)
# was running into error: InvalidDocument: cannot encode object: 499000, of type: <class 'numpy.int64'>
# credit: https://stackoverflow.com/questions/30098263/inserting-a-document-with-pymongo-invaliddocument-cannot-encode-object3
def correct_encoding(dictionary):
    """Correct the encoding of python dictionaries so they can be encoded to mongodb

    inputs
    -------
    dictionary : dictionary instance to add as document

    output
    -------
    new : new dictionary with (hopefully) corrected encodings"""
    new = {}
    for key1, val1 in dictionary.items():
        # Nested dictionaries
        if isinstance(val1, dict):
            val1 = correct_encoding(val1)
        if isinstance(val1, np.bool_):
            val1 = bool(val1)
        if isinstance(val1, np.int64):
            val1 = int(val1)
        if isinstance(val1, np.float64):
            val1 = float(val1)
        new[key1] = val1
    return new
# Let's also do file uploads!
# https://stackoverflow.com/questions/15085864/how-to-upload-a-file-to-directory-in-s3-bucket-using-boto
import boto3
import requests
from urllib.parse import urlparse
from io import BytesIO
import contextlib
import mimetypes
from slugify import slugify
import pathlib
session = boto3.Session(
    aws_access_key_id=AWSAccessKeyId,
    aws_secret_access_key=AWSSecretAccessKey
)
s3 = session.resource('s3')
# credit: https://stackoverflow.com/a/28210720/644081
from bson.objectid import ObjectId
import urllib3
for listing in df_listings.head(1000).to_dict('records'):
    # fold the flat 'address.*' columns into a nested address object
    address = {
        'street1': listing['address.street1'],
        'street2': '',
        'city': listing['address.city'],
        'state': listing['address.state'],
        'postalCode': int(listing['address.postalCode']),
    }
    listing['address'] = address
    listing['status'] = ''
    # remove the old keys
    del listing['address.street1']
    del listing['address.street2']
    del listing['address.city']
    del listing['address.state']
    del listing['address.postalCode']
    listing['createdAt'] = datetime.datetime.utcnow()
    listing['updatedAt'] = datetime.datetime.utcnow()
    listing['createdBy'] = 'wTMBsH8p9CEGxtPHf'
    listing['updatedBy'] = 'wTMBsH8p9CEGxtPHf'
    # parse the ISO-style timestamp strings into datetime objects
    listing['listingDate'] = datetime.datetime.strptime(listing['listingDate'], "%Y-%m-%dT%H:%M:%S.%fZ")
    listing['closingDate'] = datetime.datetime.strptime(listing['closingDate'], "%Y-%m-%dT%H:%M:%S.%fZ")
    # fill in the ':width' and ':height' placeholders in the photo URL
    listing['photo'] = listing['photo'].replace(":width", str(int(listing['width'])))
    listing['photo'] = listing['photo'].replace(":height", str(int(listing['height'])))
    listing = correct_encoding(listing)
    img_url = listing['photo']
    print(img_url)
    # name the S3 object after the slugified street address, keeping the image extension
    a = urlparse(img_url)
    img_ext = pathlib.PurePosixPath(a.path).suffix
    img_key = slugify(listing['address']['street1'])+img_ext
    bucket_name_to_upload_image_to = AWSBucket
    # certificate verification is turned off below, so silence the resulting warning
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    with contextlib.closing(requests.get(img_url, stream=True, verify=False)) as response:
        fp = BytesIO(response.content)
        mimetype, _ = mimetypes.guess_type(img_key)
        if mimetype is None:
            raise Exception("Failed to guess mimetype")
        s3.Bucket(bucket_name_to_upload_image_to).upload_fileobj(fp, "images/homePhotos/"+img_key, ExtraArgs={"ContentType": mimetype, "ContentDisposition":"inline; filename="+img_key})
    listing['picture_url'] = "images/homePhotos/"+img_key
    listing['image'] = "images/homePhotos/"+img_key
    del listing['photo']
    # store the _id as a string rather than an ObjectId
    listing['_id'] = str(ObjectId())
    # print(listing)
    db.homes.insert_one(listing)
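
The image handling in the loop above mixes key naming, MIME type detection and the upload itself. As a rough sketch, the naming and MIME type part can be pulled into a small function that runs without any network or AWS access. `simple_slug` is a hypothetical, much simpler stand-in for `slugify`, and `s3_key_for` and the example URL are likewise made up for illustration.

```python
import mimetypes
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def simple_slug(text):
    """Hypothetical stand-in for slugify(): lowercase, non-alphanumeric runs become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def s3_key_for(street, img_url, prefix="images/homePhotos/"):
    """Return (key, content_type) for an image URL, or raise if the type is unknown."""
    # Take the extension from the URL path only, ignoring any query string.
    ext = PurePosixPath(urlparse(img_url).path).suffix
    key = prefix + simple_slug(street) + ext
    content_type, _ = mimetypes.guess_type(key)
    if content_type is None:
        raise ValueError("Could not guess a MIME type for " + key)
    return key, content_type


print(s3_key_for("12 Oak St.", "https://example.com/photos/800x600/abc.jpg?size=large"))
```

Keeping this step separate makes it easy to check the generated keys before anything is uploaded. Note that two listings with the same street address would map to the same key, so one image would overwrite the other.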