Welcome to hackathonCLT 2019!
Download the kickoff deck for more information on the business problem, the hackathonCLT agenda, the data, and more! Find it in the data folder: HackathonCLT 2019 kickoff deck
Table of Contents:
- IMPORTANT INFORMATION
- DEVPOST
- SLACK
- Getting Started
- Machines
- HDFS
- Spark
- pySpark
- Anaconda Python
- Tresata Software
- YARN Resource Manager
- PuTTY
- SAMBA
- FTP
- Elasticsearch
- Data Dictionary
- Data
- Helpful Data Links
- Please spread out across all the boxes. There are 4 servers available for login, so make sure you do not all log in to the same box.
- Use tmux once you log in to the server. Otherwise your session could get terminated and you could lose your work. Just type "tmux" once you log in. To re-attach to a tmux session, type "tmux attach". https://tmux.github.io/
- Screen is an alternative to tmux. https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/
Please make sure you register on DEVPOST to submit your code and presentations prior to your scheduled shortlisting (no later than 8:45 AM). One DEVPOST submission per team. This is required for judging!
Please make sure you connect with your fellow hackers on SLACK. You can also ask any questions here on any of the available channels.
You can obtain a username and login information from command central (in the kids university downstairs).
There are 4 servers available.
ssh into a server where you can access the data:
$ ssh <username>@hack02.northstate.net OR
$ ssh <username>@hack03.northstate.net OR
$ ssh <username>@hack04.northstate.net OR
$ ssh <username>@hack05.northstate.net
and enter the password you were given.
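One way to spread logins across the four servers is to pick a host at random. The sketch below only builds the ssh command line from the hostnames listed above; the username "hacker123" is a placeholder, and the helper names are made up for illustration.

```python
import random

# Hostnames taken from the ssh commands above.
SERVERS = [
    "hack02.northstate.net",
    "hack03.northstate.net",
    "hack04.northstate.net",
    "hack05.northstate.net",
]


def pick_server(rng=random):
    """Return one of the four login servers, chosen uniformly at random."""
    return rng.choice(SERVERS)


def ssh_command(username, rng=random):
    """Build the ssh command line for a randomly chosen server."""
    return "ssh {0}@{1}".format(username, pick_server(rng))


if __name__ == "__main__":
    # "hacker123" is a placeholder; use the username you were given.
    print(ssh_command("hacker123"))
```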
We made Hive, Spark, pySpark, R and Anaconda Python command-line interfaces available.
We have a Hadoop cluster with one master and four workers. The workers have 32 cores, 7 X 1TB data drives, and 128GB of RAM each. You will have ssh access to the workers.
Please spread yourselves out across the machines.
The /home directory on every machine is limited to 170G, and shared between everyone logged in to the server. If you need disk space to work please use one of the following directories on a 1TB data disk:
/data/0/work
/data/1/work
/data/2/work
/data/3/work
/data/4/work
/data/5/work
/data/6/work
Do not work inside these directories directly, but instead create a subdirectory. For example:
mkdir /data/3/work/hacker123
# in case you want privacy
chmod og-rwx /data/3/work/hacker123
cd /data/3/work/hacker123
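The same steps can be scripted. This is a minimal Python sketch of the mkdir and chmod og-rwx commands above; the function name is made up, and a temporary directory stands in for /data/3/work so the example runs on any machine.

```python
import os
import stat
import tempfile


def make_private_workdir(data_disk_work, name):
    """Create <data_disk_work>/<name> and restrict it to the owner.

    Mirrors `mkdir` followed by `chmod og-rwx` from the shell example.
    """
    path = os.path.join(data_disk_work, name)
    if not os.path.isdir(path):
        os.makedirs(path)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Clear every group and other permission bit, keep the owner's bits.
    os.chmod(path, mode & ~(stat.S_IRWXG | stat.S_IRWXO))
    return path


if __name__ == "__main__":
    # A temporary directory stands in for /data/3/work (assumption for the demo).
    base = tempfile.mkdtemp()
    print(make_private_workdir(base, "hacker123"))
```

On the servers you would pass one of the /data/N/work directories as the first argument.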
To access your HDFS location, use the hadoop fs commands (for reference, see http://www.folkstalk.com/2013/09/hadoop-fs-shell-command-example-tutorial.html). For example, to take a look at your home directory on HDFS, use
$ hadoop fs -ls
or
$ hadoop fs -ls /user/username
You can find the data on HDFS in the /data/hackathon folder:
/data/hackathon/health_outcomes
/data/hackathon/medlink_partners_services
/data/hackathon/social_determinants_of_health
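If you prefer to list HDFS directories from a Python script, the hadoop fs -ls command shown above can be wrapped with subprocess. This is a sketch, not part of the provided tooling: the helper names are made up, and it assumes the hadoop client is on your PATH (it returns None when it is not).

```python
import subprocess


def hadoop_ls_command(path=None):
    """Build the argument list for `hadoop fs -ls [path]`."""
    cmd = ["hadoop", "fs", "-ls"]
    if path:
        cmd.append(path)
    return cmd


def hadoop_ls(path=None):
    """Run the listing and return its text, or None if hadoop is not installed."""
    try:
        output = subprocess.check_output(hadoop_ls_command(path))
    except OSError:
        # No hadoop client on this machine.
        return None
    return output.decode("utf-8")


if __name__ == "__main__":
    # Prints the command only; call hadoop_ls(...) on a server to run it.
    print(" ".join(hadoop_ls_command("/data/hackathon")))
```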
Spark-shell can be found at /opt/tresata/spark-2.4.1-tres-alpha1-provided/bin
Now give the Spark-shell a test:
$ /opt/tresata/spark-2.4.1-tres-alpha1-provided/bin/spark-shell --executor-cores 1 --executor-memory 1G
Read in the data and run a simple query that counts the rows for each NPA value:
val df = spark.sqlContext.read.parquet("/data/hackathon/social_determinants_of_health/mecklenburg_quality_of_life/education/parq/qol-education.parq")
df.groupBy("NPA").count().collect()
Note that for your "production" run on the dataset you might want to increase resources used on the cluster:
--executor-memory 4G --executor-cores 4
Keep in mind that a spark-shell holds these resources on the cluster even when it is idle, so please do not leave a spark-shell with "production" resources open and unused.
pySpark can be found at /opt/tresata/spark-2.4.1-tres-alpha1-provided/bin
You can also do the same query using a python version of the Spark shell.
$ /opt/tresata/spark-2.4.1-tres-alpha1-provided/bin/pyspark --executor-cores 1 --executor-memory 1G
Read in the data and run a simple query that counts the rows for each NPA value:
df = sqlContext.read.parquet("/data/hackathon/social_determinants_of_health/mecklenburg_quality_of_life/education/parq/qol-education.parq")
df.groupBy("NPA").count().collect()
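To make clear what the groupBy("NPA").count() query returns, here is the same aggregation in plain Python on a few made-up rows. The NPA values and the "value" field are invented for illustration and do not come from the dataset.

```python
from collections import Counter

# Made-up rows standing in for the parquet data; only the "NPA" key matters.
rows = [
    {"NPA": 2, "value": 10.0},
    {"NPA": 3, "value": 12.5},
    {"NPA": 2, "value": 11.0},
    {"NPA": 2, "value": 9.5},
]


def count_by(records, column):
    """Count how many records share each value of `column`."""
    return Counter(record[column] for record in records)


counts = count_by(rows, "NPA")
print(sorted(counts.items()))  # [(2, 3), (3, 1)]
print(len(counts))             # 2 distinct NPA values
```

The query yields one row per distinct NPA with its row count, so the number of result rows is the number of unique NPA values.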
Note that for your "production" run on the dataset you might want to increase resources used on the cluster:
--executor-memory 4G --executor-cores 4
Keep in mind that a pyspark shell holds these resources on the cluster even when it is idle, so please do not leave a pyspark shell (interpreter) with "production" resources open and unused.
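To keep the test and "production" settings straight, you could build the launch command from a small helper. This sketch only assembles the command line from the path and flags shown above; the function name and its defaults are assumptions for illustration.

```python
# Directory taken from the instructions above.
SPARK_BIN = "/opt/tresata/spark-2.4.1-tres-alpha1-provided/bin"


def shell_command(shell="pyspark", executor_cores=1, executor_memory="1G"):
    """Return the argument list that launches pyspark or spark-shell."""
    if shell not in ("pyspark", "spark-shell"):
        raise ValueError("shell must be 'pyspark' or 'spark-shell'")
    return [
        "{0}/{1}".format(SPARK_BIN, shell),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
    ]


if __name__ == "__main__":
    # Small test session, then the larger "production" settings.
    print(" ".join(shell_command()))
    print(" ".join(shell_command(executor_cores=4, executor_memory="4G")))
```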
Anaconda is a completely free Python distribution from Continuum Analytics. It includes more than 400 of the most popular Python packages for science, math, engineering, and data analysis. See the packages included with Anaconda.
Anaconda can be found here:
/usr/local/lib/anaconda
Getting familiar with conda: https://conda.io/projects/conda/en/latest/commands.html
An example of how to start anaconda python:
ssh hacker001@hack02.northstate.net
hacker001@hack02.northstate.net's password:
Last login: Fri Mar 22 10:06:35 2019 from rrcs-24-172-30-38.midsouth.biz.rr.com
[hacker001@hack02 ~]$ source /usr/local/lib/anaconda/bin/activate
(base) [hacker001@hack02 ~]$ python
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 19:04:19)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
TREK is a data inventory engine built specifically to catalog, profile and report data ontology, quality and format attributes for all data in Hadoop. TREK rapidly profiles and inventories "as-is" data stored in Hadoop across all rows and columns to create an informed view of all valuable enterprise data feeds stored in a single Hadoop cluster.
TREK can be accessed via http://hack01.northstate.net:5603
Log in with user tresata and password admin. After logging in, click on the menu icon next to the name "tresata" in the top right corner and select TREK. You will see a list of datasets (please refer to Data below) under "Project Name". Once you select a dataset in TREK, click on "Partitions" and select a partition (typically there is only one). You should now see a summary of all the columns in the dataset. Click on a column to get the statistics for that column. Keep in mind that we have only loaded a sample table for each data source into TREK to give you an idea of what each data source looks like (schema, population of non-null values, number of unique values, top values, predicted patterns, etc.).
Please DO NOT delete any dataset in TREK since all teams share the same TREK instance.
This is where you can track jobs that run on the hadoop cluster:
http://hack01.northstate.net:8088/
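If you would rather check jobs from a script than from the web page, the ResourceManager normally also serves a REST API on the same port. The sketch below assumes that the standard /ws/v1/cluster/apps endpoint is enabled on this cluster, which the instructions above do not confirm. It only builds the URL and parses a response; the sample JSON is made up.

```python
import json

# Address taken from the link above.
RM_URL = "http://hack01.northstate.net:8088"


def apps_url(state="RUNNING"):
    """URL that lists applications in the given state (assumed endpoint)."""
    return "{0}/ws/v1/cluster/apps?states={1}".format(RM_URL, state)


def summarize_apps(payload):
    """Return (id, user, name, state) tuples from a cluster-apps JSON document."""
    document = json.loads(payload)
    # "apps" is null when nothing matches, so fall back to empty containers.
    apps = (document.get("apps") or {}).get("app") or []
    return [(a.get("id"), a.get("user"), a.get("name"), a.get("state")) for a in apps]


# Made-up response body for illustration only.
sample = json.dumps({"apps": {"app": [
    {"id": "application_1_0001", "user": "hacker123",
     "name": "PySparkShell", "state": "RUNNING"},
]}})

print(apps_url())
print(summarize_apps(sample))
```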
Here is the link to download PuTTY for remote access to the data files. This is useful if you have a Windows computer. The download link is:
You can also use Samba to connect to the servers and download the data locally. The samba share is:
smb://hack01.northstate.net/data
On Windows this is:
\\hack01.northstate.net\data
Instructions for how to use Samba on Apple devices can be found here. Help for connecting to a Samba share on a Windows device can be found here.
The data is also accessible via FTP on hack01.northstate.net. In a web browser:
ftp://hack01.northstate.net/hackathon
Elasticsearch (5 nodes) is available on port 9200 on all servers (hack01.northstate.net, hack02.northstate.net, hack03.northstate.net, hack04.northstate.net, hack05.northstate.net). No security is enabled, so you can create indices if you need to, but please do not delete or modify other people's indices.
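Because every team shares the same Elasticsearch cluster, prefixing index names with your team name is one way to avoid touching other people's indices. The sketch below only builds URLs and index names; the team prefix convention and the helper names are assumptions, not an event rule.

```python
# Hosts and port taken from the paragraph above.
ES_HOSTS = [
    "hack01.northstate.net",
    "hack02.northstate.net",
    "hack03.northstate.net",
    "hack04.northstate.net",
    "hack05.northstate.net",
]
ES_PORT = 9200


def es_url(host, path=""):
    """Build an HTTP URL for an Elasticsearch endpoint on the given host."""
    return "http://{0}:{1}/{2}".format(host, ES_PORT, path.lstrip("/"))


def team_index(team, name):
    """Index name prefixed with the team name (hypothetical convention).

    Lowercased because Elasticsearch index names must be lowercase.
    """
    return "{0}-{1}".format(team, name).lower()


if __name__ == "__main__":
    # Cluster health endpoint, then the URL of a team-scoped index.
    print(es_url(ES_HOSTS[0], "_cluster/health"))
    print(es_url(ES_HOSTS[1], team_index("TeamRocket", "tweets")))
```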
The data dictionary can be found here.
HDFS: You can find the data on HDFS in the /data/hackathon/ directory. We have provided the csv/bsv and parquet versions of the files. Please see the directory structure below. bsv: bar ("|") delimited. csv: comma (",") delimited.
LOCAL: The same files are also copied to each server in the /srv/data/ directory. The structure is the same, except that the parquet files are not included on the servers.
├── health_outcomes
│   ├── 500-cities
│   │   ├── bsv
│   │   └── parq
│   └── aids-vu
│       ├── csv
│       └── parq
├── medlink_partners_services
│   ├── camino_clinic
│   │   ├── csv
│   │   └── parq
│   ├── care_ring
│   │   ├── client_encounters
│   │   │   ├── csv
│   │   │   └── parq
│   │   ├── nurse_family_partnership
│   │   │   ├── csv
│   │   │   └── parq
│   │   └── physicians_reachout
│   │       ├── csv
│   │       └── parq
│   ├── charlotte_center_for_legal_advocacy
│   │   ├── csv
│   │   └── parq
│   ├── charlotte_community_health_clinic
│   │   ├── csv
│   │   └── parq
│   ├── meck_county_public_health_department
│   │   ├── csv
│   │   └── parq
│   └── nc_medassist
│       ├── csv
│       └── parq
├── social_determinants_of_health
│   ├── census
│   │   ├── csv
│   │   └── parq
│   ├── charlotte_housing_authority
│   │   ├── csv
│   │   ├── parq
│   │   └── README.txt
│   ├── consumer_financial_protection_bureau
│   │   ├── consumer-complaints
│   │   │   ├── csv
│   │   │   └── parq
│   │   ├── financial-well-being-survey
│   │   │   └── csv
│   │   ├── home-mortgage-disclosure-act
│   │   │   ├── csv
│   │   │   └── parq
│   │   └── national-mortgage-rates
│   │       ├── csv
│   │       └── parq
│   ├── epa
│   │   └── clean-air-carolinas
│   │       ├── csv
│   │       └── parq
│   ├── geo-mappings
│   │   ├── csv
│   │   └── parq
│   ├── housing_urban_development
│   │   ├── affh
│   │   │   ├── mecklenburg
│   │   │   │   ├── csv
│   │   │   │   └── parq
│   │   │   ├── national
│   │   │   │   ├── csv
│   │   │   │   └── parq
│   │   │   └── raw-affh-data
│   │   │       ├── csv
│   │   │       └── parq
│   │   └── fair-market-value
│   │       ├── bsv
│   │       └── parq
│   ├── mecklenburg_public_services_geojson
│   │   └── demographics
│   │       └── mecklenburg-public-services-geojson
│   ├── mecklenburg_quality_of_life
│   │   ├── crime
│   │   │   ├── csv
│   │   │   ├── data_dictionary
│   │   │   ├── npa_mapping_geojson
│   │   │   └── parq
│   │   ├── demographics
│   │   │   ├── csv
│   │   │   ├── data_dictionary
│   │   │   ├── npa_mapping_geojson
│   │   │   └── parq
│   │   ├── economics
│   │   │   ├── csv
│   │   │   ├── data_dictionary
│   │   │   ├── npa_mapping_geojson
│   │   │   └── parq
│   │   ├── education
│   │   │   ├── csv
│   │   │   ├── data_dictionary
│   │   │   ├── npa_mapping_geojson
│   │   │   └── parq
│   │   ├── health
│   │   │   ├── csv
│   │   │   ├── data_dictionary
│   │   │   ├── npa_mapping_geojson
│   │   │   └── parq
│   │   ├── housing
│   │   │   ├── csv
│   │   │   ├── data_dictionary
│   │   │   ├── npa_mapping_geojson
│   │   │   └── parq
│   │   └── transportation
│   │       ├── csv
│   │       ├── data_dictionary
│   │       └── parq
│   ├── nc_board_of_education
│   │   ├── graduation-counts-including-summer-school
│   │   │   ├── csv
│   │   │   └── parq
│   │   ├── nc-student-counts-grade-race-sex
│   │   │   ├── csv
│   │   │   └── parq
│   │   ├── principles-report-nc
│   │   │   ├── csv
│   │   │   └── parq
│   │   └── school-to-geocode-mapping
│   │       ├── csv
│   │       └── parq
│   └── zillow
│       ├── csv
│       └── parq
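The csv and bsv directories differ only in the delimiter, so one reader can handle both. This is a minimal Python 3 sketch using the standard csv module; it assumes the files have a header row, and the sample columns are made up rather than taken from any dataset.

```python
import csv
import io

# Delimiters as described above: csv is comma delimited, bsv is bar delimited.
DELIMITERS = {"csv": ",", "bsv": "|"}


def read_delimited(handle, kind):
    """Read an open text file of the given kind into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=DELIMITERS[kind]))


# Made-up bar-delimited content standing in for a real file.
sample_bsv = io.StringIO("zip|value\n28202|1\n28203|2\n")
rows = read_delimited(sample_bsv, "bsv")
print(rows[0]["zip"])  # 28202
```

For a real file, pass a handle opened with open(path, newline="") from the matching csv or bsv directory under /srv/data/.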
Here are the sources of the datasets under Social Determinants of Health:
Census
Environmental Protection Agency (EPA)
Environmental Protection Agency
Health
Healthcare Cost & Utilization Project
Medicare Provider Utilization & Payment Data
Housing and Urban Development
National Housing Preservation Database
Mecklenburg Quality of Life
NC Board of Education
Zillow
You may use social media data in addition to the above datasets.
Google Trends
Google Trends may be used to check for popular search terms and related queries or keywords. This is a fantastic tool to check interest in keywords by checking relative popularity of a query's search volume over time.
You may use more than one search term to find comparison stats for keywords, interest by region, subregion, country, metro area, or city for a range of time periods. Additionally, you may use the related searches, related queries, top questions on a keyword, latest stories and insights, and interactive map features. Data may be exported locally for your use.
Some useful tips when using Google Trends data:
* Use "related queries" to find new keyword ideas and expand your data search
* Trends eliminates repeated searches from the same person over a short period of time to give you a better picture of the search. It only shows data for popular terms, and low volumes appear as 0.
* Google Trends shows relative popularity of a search query instead of absolute search volume data to make comparisons between terms easier. This means that each data point is divided by the total searches of the geography and the time range that it represents, so be careful when comparing very small cities or regions to much larger geographic areas.
* Top searches are terms that are most frequently searched with the term you entered in the same search session, within the chosen category, country, or region.
* Rising searches are terms that were searched for with the keyword you entered (or overall searches, if no keyword was entered), which had the most significant growth in volume in the requested time period.
* For each rising search term, you see a percentage of the term’s growth compared to the previous time period. If you see “Breakout” instead of a percentage, it means that the search term grew by more than 5000%.
* Identify seasonal trends (for example, a spike in popularity for "flu shots" during flu season) to correctly understand your trend graphs
* Use the maps to see exactly which cities and subregions best help answer your question
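The tips about relative popularity and rising searches come down to simple arithmetic. The sketch below is an illustration with made-up numbers, not Google's actual implementation: it divides each count by the total searches for its geography and rescales so the largest share is 100 (an assumption about the scaling), then labels growth above 5000% as "Breakout".

```python
def relative_interest(term_counts, total_counts):
    """Share of all searches per data point, rescaled so the peak is 100."""
    shares = [t / float(n) for t, n in zip(term_counts, total_counts)]
    peak = max(shares)
    return [int(round(100.0 * s / peak)) for s in shares]


def growth_label(previous, current):
    """Percentage growth over the previous period, or "Breakout" above 5000%."""
    pct = 100.0 * (current - previous) / previous
    return "Breakout" if pct > 5000 else "{0:+.0f}%".format(pct)


# Made-up numbers: a small city with 50 of 1,000 searches and a large
# city with 500 of 100,000 searches for the same term.
print(relative_interest([50, 500], [1000, 100000]))  # [100, 10]
print(growth_label(100, 250))                        # +150%
print(growth_label(10, 600))                         # Breakout
```

The small city scores higher even though it has ten times fewer searches for the term, which is why comparisons between very small and very large areas need care.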
You may also use Twitter data from various Twitter pages or hashtags: