As part of the analytics projects we work on, we often need to extract data from a variety of sources.

Nowadays the data is typically exposed in a standard manner through a provider, and the pattern can usually be worked out by browsing through a few data sources.

The usual way to identify this data is to proxy the app or website through Charles, copy the requests as curl commands, and transfer them into a bash script that does the rest of the extraction magic!

Below is a useful pattern from one of our recent projects.

Data Provider's Pattern

Here is a sample request captured in Charles (cookies and similar details removed for simplicity, and the real host replaced with provider.com to maintain privacy):

curl -H 'Host: www.provider.com' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Origin: https://www.provider.com' -H 'X-Requested-With: XMLHttpRequest' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Sec-Fetch-Site: same-origin' -H 'Sec-Fetch-Mode: cors' -H 'Referer: https://www.provider.com/admin/data_source.html' -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' -H 'Cookie: XXXX' --data-binary 'action=GetObjectDetails&id='$m --compressed 'https://www.provider.com/admin/provider.php' > "$m.json"

Inspection

Inspecting this provider showed that the key parameter is the id, which takes a serial number denoted here by $m.

In effect, issuing the curl request to https://www.provider.com/admin/provider.php with the data payload 'action=GetObjectDetails&id='$m returns the data for that id, which is then written to the local drive as $m.json.
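To make the substitution concrete, here is a minimal sketch of how the shell assembles the payload and the output file name for a single id. The value 10 is just a sample id chosen for illustration, and the variable names payload and outfile are not part of the original request.

```shell
#!/bin/bash
# Sample id, picked only for illustration
m=10

# The single-quoted literal and the expansion of $m are joined into one word
payload='action=GetObjectDetails&id='"$m"

# The response for this id is saved under the same number
outfile="$m.json"

echo "$payload"   # prints action=GetObjectDetails&id=10
echo "$outfile"   # prints 10.json
```

Quoting "$m" is not strictly needed for plain numbers, but it keeps the command safe if an id ever contains spaces or shell metacharacters.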

Shell Script - Magic!

Generating the numbers can be done in many ways. I simply use a plugin (Text Pastry) in Sublime Text to generate a set of numbers separated by spaces.
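As one alternative to an editor plugin, the same list can probably be produced directly in bash. The sketch below assumes the ids form a contiguous range from 10 to 20, and the names first, last and ids_seq are made up for this example.

```shell
#!/bin/bash
# Brace expansion builds the range when the bounds are known up front
ids=({10..20})

# seq does the same when the bounds live in variables
first=10
last=20
ids_seq=($(seq "$first" "$last"))

echo "${ids[@]}"
echo "${#ids_seq[@]} ids generated with seq"
```

Brace expansion does not expand variables, which is why the seq form is shown for the case where the range is decided at run time.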

The rest works like magic as shown below!

#!/bin/bash

# This script is used to get GetObjectDetails from Provider

# generated using text pastry in sublime text
ids=(10 11 12 13 14 15 16 17 18 19 20)

# Change to the data directory once, and stop if it does not exist
cd ~/my_project/data || exit 1

# credit: https://stackabuse.com/array-loops-in-bash/
for m in "${ids[@]}"; do

  # MODIFICATION REQUIRED after getting a sample curl from Charles!
  # Credit: https://unix.stackexchange.com/a/386179
  # Copy curl from Charles!

  echo "downloading for: $m"

  curl -H 'Host: www.provider.com' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Origin: https://www.provider.com' -H 'X-Requested-With: XMLHttpRequest' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Sec-Fetch-Site: same-origin' -H 'Sec-Fetch-Mode: cors' -H 'Referer: https://www.provider.com/admin/data_source.html' -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' -H 'Cookie: XXXX' --data-binary "action=GetObjectDetails&id=$m" --compressed 'https://www.provider.com/admin/provider.php' > "$m.json"
done
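
Because the loop redirects curl's output straight into a file, a request that fails can still leave an empty file behind. A small follow-up check, sketched below, lists the ids whose file is missing or empty so they can be retried. The helper name missing_ids is hypothetical, and the check only catches missing or empty files, not an error message returned by the provider in place of the data.

```shell
#!/bin/bash
# Hypothetical helper: print every id whose <id>.json is missing or empty.
# Usage: missing_ids <data_dir> <id> [<id> ...]
missing_ids() {
  local dir=$1
  shift
  local id
  for id in "$@"; do
    # -s is true only when the file exists and has a size greater than zero
    [ -s "$dir/$id.json" ] || echo "$id"
  done
}

# Example call, assuming the ids array and data directory from the script:
# missing_ids ~/my_project/data "${ids[@]}"
```

The printed ids can be pasted back into the ids array to rerun the download for just those entries.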