Great thanks to Darshil Parmar for putting together such an easy-to-follow project! I made adjustments to the Kafka consumer and the Glue crawler parts to fit my use case.
https://www.youtube.com/watch?v=KerNf0NANMo
The EC2 instance should run Linux and use the t2.micro (free-tier) instance type to make life easier
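If you prefer to launch the instance from code instead of the console, a minimal boto3 sketch looks like this (the AMI ID, key pair name, and region are placeholders, not values from the original setup):
import boto3

# Placeholder values: swap in your own AMI ID, key pair, and region
ec2 = boto3.resource("ec2", region_name="us-east-1")
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",  # an Amazon Linux AMI in your region
    InstanceType="t2.micro",          # free-tier eligible
    KeyName="my-key-pair",            # existing key pair for SSH access
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)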
-
Install Kafka
wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.13-3.3.1.tgz
tar -xvf kafka_2.13-3.3.1.tgz
-
Install Java (Kafka runs on JVM)
sudo yum install java-1.8.0-amazon-corretto.x86_64
-
Start Zookeeper
./kafka_2.13-3.3.1/bin/zookeeper-server-start.sh ./kafka_2.13-3.3.1/config/zookeeper.properties
-
Limit the Kafka heap size so it fits in the t2.micro's 1 GB of RAM
export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
-
Start Kafka
./kafka_2.13-3.3.1/bin/kafka-server-start.sh ./kafka_2.13-3.3.1/config/server.properties
-
Make Kafka reachable through the public IP
vim ./kafka_2.13-3.3.1/config/server.properties
# Change this part:
# Listener name, hostname and port the broker will advertise to clients.
# If not set, it uses the value for "listeners".
advertised.listeners=PLAINTEXT://ec2-54-237-239-178.compute-1.amazonaws.com:9092
-
Restart ZooKeeper and Kafka
-
Allow inbound traffic in the AWS console
(Instance Details → Security → Inbound Rules → {security group name} (e.g. launch-wizard-1) → Edit inbound rules)
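The same rule can also be added with boto3 if you prefer scripting it; this is only a sketch, and the security group ID and CIDR below are placeholders:
import boto3

ec2 = boto3.client("ec2")
# Open the Kafka port (9092) to your own IP only; sg-xxxxxxxx and the CIDR are placeholders
ec2.authorize_security_group_ingress(
    GroupId="sg-xxxxxxxx",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 9092,
        "ToPort": 9092,
        "IpRanges": [{"CidrIp": "203.0.113.10/32"}],
    }],
)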
-
Create a topic
cd kafka_2.13-3.3.1
bin/kafka-topics.sh --create --topic demo_test --bootstrap-server 54.237.239.178:9092 --replication-factor 1 --partitions 1
-
Start a console producer (in one session)
./kafka_2.13-3.3.1/bin/kafka-console-producer.sh --topic demo_test --bootstrap-server ec2-44-206-255-156.compute-1.amazonaws.com:9092
-
Start a console consumer (in another session)
./kafka_2.13-3.3.1/bin/kafka-console-consumer.sh --topic demo_test --bootstrap-server ec2-44-206-255-156.compute-1.amazonaws.com:9092
from json import dumps
from time import sleep
import random

from kafka import KafkaProducer

# Define Producer
producer = KafkaProducer(bootstrap_servers=["ec2-44-206-255-156.compute-1.amazonaws.com:9092"],
                         value_serializer=lambda x: dumps(x).encode('utf-8'))

# df: the stock market dataset loaded into a pandas DataFrame beforehand
while True:
    # Wait for a random time between 0 and 3 seconds
    sleep(random.uniform(0, 3))
    # Sample a random row as a dict to simulate streaming data
    generated_data = df.sample(1).to_dict(orient='records')[0]
    # Send it to the Kafka topic
    producer.send("demo_test", value=generated_data)
    # Flush the buffered messages out to the broker
    producer.flush()
============On a separate File============
from json import loads
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'demo_test',
    bootstrap_servers=["ec2-44-206-255-156.compute-1.amazonaws.com:9092"],
    value_deserializer=lambda x: loads(x.decode('utf-8')))

for c in consumer:
    print(c.value)
-
IAM → Users → Add user
-
Check Access Key - Programmatic access
-
Attach existing policies directly → add AmazonS3FullAccess
-
Create User
-
IAM → Users → {user_name} → Access Keys → Create Access Keys → Command Line Interface (CLI)
-
Download .csv: keep the secret key & the access key
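For reference, the same user setup can be scripted with boto3; this is a sketch (the user name is a placeholder, and the returned secret should be stored carefully):
import boto3

iam = boto3.client("iam")
user_name = "kafka-s3-writer"  # placeholder user name

iam.create_user(UserName=user_name)
# Attach the managed S3 policy used in the console steps above
iam.attach_user_policy(
    UserName=user_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)
# Create the programmatic access keys; keep the secret key safe
keys = iam.create_access_key(UserName=user_name)["AccessKey"]
print(keys["AccessKeyId"], keys["SecretAccessKey"])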
-
Setting up the AWS CLI allows the s3fs library to access the S3 directories as a local directory
-
aws configure
→ add the keys from the previous step
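To confirm the credentials work before writing the consumer-to-S3 code below, a quick s3fs check (the bucket name is the one used later in this post; create it in the S3 console first):
from s3fs import S3FileSystem

s3 = S3FileSystem()
# Should list the bucket contents without a permissions error
print(s3.ls("kafka-stock-data-trial"))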
from datetime import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs import S3FileSystem

# Create S3 filesystem object (it picks up the credentials set via `aws configure`)
s3 = S3FileSystem()

# Initialize a list to hold the incoming messages
tmp_data = []

# `consumer` is the KafkaConsumer defined above
for count, message in enumerate(consumer):
    # Append the message payload to the list
    tmp_data.append(message.value)
    # Every 10 messages, save the accumulated data to a parquet file
    if count != 0 and count % 10 == 0:
        # Create a pandas DataFrame from the accumulated data
        df = pd.DataFrame(tmp_data)
        # Convert the DataFrame to a PyArrow table
        table = pa.Table.from_pandas(df)
        # Record the current time to use in the file name
        current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S%Z")
        with s3.open(f"s3://kafka-stock-data-trial/stock_market_data_{current_timestamp}.parquet", "wb") as file:
            print(table)
            pq.write_table(table, file)
        # Reset the list
        tmp_data = []
parquet tables look like this:
and the files are saved as below:
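To spot-check one of the saved files, pandas can read it straight from S3 (a sketch; the file name below is a placeholder, and s3fs handles the s3:// path):
import pandas as pd

# Placeholder key: use one of the file names listed in your bucket
check_df = pd.read_parquet("s3://kafka-stock-data-trial/stock_market_data_20230101_120000.parquet")
print(check_df.head())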
-
Glue → Data Catalog → Crawlers → Create Crawler
-
Enter Crawler Name → Next
-
Select Not yet for "Is your data already mapped to Glue tables?"
→ Add a data source
→ select your bucket for S3 Path
→ Add an S3 data source
→ Next
-
Choose IAM role → (if none) click Create new IAM role
→ Next
-
Add database → select the created database → click Advanced options
-
Crawler Schedule: On demand
-
Create Crawler
-
Click Run Crawler and wait until the crawler finishes
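The crawler can also be started and polled from code; here is a boto3 sketch (the crawler name is a placeholder for whatever you entered in the console):
import time
import boto3

glue = boto3.client("glue")
crawler_name = "stock-market-crawler"  # placeholder: the name entered in the console

glue.start_crawler(Name=crawler_name)
# Poll until the crawler returns to the READY state
while True:
    state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
    print(state)
    if state == "READY":
        break
    time.sleep(30)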