Part B: Protein Analysis and Visualization

Pick any protein (from any organism) of your interest that has a 3D structure and answer the following questions.

Briefly describe the protein you selected and why you selected it.

The selected protein is Musashi-2 (MSI2), an RNA-binding protein that regulates mRNA translation and stability. It was chosen because it plays a crucial role in post-transcriptional regulation and is involved in various diseases, including cancer. Additionally, its three-dimensional structure has been resolved, allowing for detailed analysis.

Identify the amino acid sequence of your protein.

MSI2 protein sequence

MEANGSQGTSGSANDSQHDPGKMFIGGLSWQTSPDSLRDYFSKFGEIRECMVMRDPTTKRSRGFGFVTFADPASVDKVLGQPHHELDSKTIDPKVAFPRRAQPKMVTRTKKIFVGGLSANTVVEDVKQYFEQFGKVEDAMLMFDKTTNRHRGFGFVTFENEDVVEKVCEIHFHEINNKMVECKKAQPKEVMFPPGTRGRARGLPYTMDAFMLGMGMLGYPNFVATYGRGYPGFAPSYGYQFPGFPAAAYGPVAAAAVAAARGSGSNPARPGGFPGANSPGPVADLYGPASQDSGVGNYISAASPQPGSGFGHGIAGPLIATAFTNGYH

Length: The sequence contains 328 amino acids.

protein_seq = "MEANGSQGTSGSANDSQHDPGKMFIGGLSWQTSPDSLRDYFSKFGEIRECMVMRDPTTKRSRGFGFVTFADPASVDKVLGQPHHELDSKTIDPKVAFPRRAQPKMVTRTKKIFVGGLSANTVVEDVKQYFEQFGKVEDAMLMFDKTTNRHRGFGFVTFENEDVVEKVCEIHFHEINNKMVECKKAQPKEVMFPPGTRGRARGLPYTMDAFMLGMGMLGYPNFVATYGRGYPGFAPSYGYQFPGFPAAAYGPVAAAAVAAARGSGSNPARPGGFPGANSPGPVADLYGPASQDSGVGNYISAASPQPGSGFGHGIAGPLIATAFTNGYH"

amino_acid_letters = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'X']

# Report the sequence length
print("Length of the amino acid sequence: {}".format(len(protein_seq)))

# Count how often each letter occurs in the sequence
count_json = {}
for letter in amino_acid_letters:
    count_json[letter] = 0
    for chaa in protein_seq:
        if chaa == letter:
            count_json[letter] += 1

# Sort the counts from most to least frequent
sorted_amino_acid_dict = dict(sorted(count_json.items(), key=lambda item: item[1], reverse=True))
print(sorted_amino_acid_dict)

Most frequent amino acid:

The most frequent amino acid is: G with a frequency of 42

protein_seq = "MEANGSQGTSGSANDSQHDPGKMFIGGLSWQTSPDSLRDYFSKFGEIRECMVMRDPTTKRSRGFGFVTFADPASVDKVLGQPHHELDSKTIDPKVAFPRRAQPKMVTRTKKIFVGGLSANTVVEDVKQYFEQFGKVEDAMLMFDKTTNRHRGFGFVTFENEDVVEKVCEIHFHEINNKMVECKKAQPKEVMFPPGTRGRARGLPYTMDAFMLGMGMLGYPNFVATYGRGYPGFAPSYGYQFPGFPAAAYGPVAAAAVAAARGSGSNPARPGGFPGANSPGPVADLYGPASQDSGVGNYISAASPQPGSGFGHGIAGPLIATAFTNGYH"

amino_acid_letters = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', "X"]

# Convert the sequence to uppercase
protein_seq = protein_seq.upper()

# Count the frequency of each amino acid
count_json = {}
for letter in amino_acid_letters:
    count_json[letter] = 0
    for chaa in protein_seq:
        if chaa == letter:
            count_json[letter] += 1

# Sort the dictionary by frequency
sorted_amino_acid_dict = dict(sorted(count_json.items(), key=lambda item: item[1], reverse=True))

# Report the most frequent amino acid
most_frequent_amino_acid = list(sorted_amino_acid_dict.keys())[0]
most_frequent_count = sorted_amino_acid_dict[most_frequent_amino_acid]

print(f"The most frequent amino acid is: {most_frequent_amino_acid} with a frequency of {most_frequent_count}")
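The counting loops above can also be written with `collections.Counter` from the standard library. The following is a minimal sketch of that alternative; the function name `most_frequent_residue` and the toy sequence are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter


def most_frequent_residue(sequence):
    """Return (residue, count) for the most common letter in a protein sequence."""
    counts = Counter(sequence.upper())
    residue, count = counts.most_common(1)[0]
    return residue, count


# Toy sequence used only for illustration (this is not MSI2)
demo_seq = "GGAGKLG"
print(most_frequent_residue(demo_seq))  # ('G', 4)
```

Passing `protein_seq` instead of `demo_seq` should give the same kind of answer as the loop-based version, in a single pass over the sequence.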

Number of homologous sequences: 5 homologs of MSI2

The BLAST search identified Musashi-2 homologs with high identity and coverage in several species. Musashi homolog 2 from Mus musculus (mouse, P0C7Z9) and from Rattus norvegicus (rat, Q6IF68) each show 100% identity and 100% coverage, as does Musashi-2 from Homo sapiens (human, Q86T22). The homologs from Bos taurus (cow, Q3MIB5) and Danio rerio (zebrafish, Q5R3J4) show 95% identity and 100% coverage. These homologs are the counterparts of MSI2 in other species and can be useful for comparative research on its function and structure.
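To make the identity percentages concrete, here is a minimal sketch of how percent identity can be computed from two already aligned sequences. It uses one common convention (identical columns divided by alignment length, with gaps never counting as matches); the function name and the toy sequences are assumptions for illustration, and the values reported above come from BLAST, not from this code.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        # A gap in either sequence never counts as a match
        if a != gap and a == b:
            matches += 1
    return 100.0 * matches / len(aligned_a)


# Toy alignment (not real MSI2 data): 5 identical columns out of 6
print(round(percent_identity("MKT-AG", "MKTLAG"), 1))  # 83.3
```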

Protein family: MSI2 belongs to the Musashi protein family, which is characterized by RNA recognition motifs (RRMs) and functions in post-transcriptional regulation.

Identify the structure page of your protein in RCSB
- PDB Code: 6DBP
- Date of structure resolution: The structure was published approximately 5 years and 4 months ago
- Structure quality: The resolution is 2.1 Å, indicating a high-quality structure.
- Other molecules in the structure: In addition to the protein, the structure includes a phosphate molecule (PO₄).
- Structural classification: MSI2-RRM1 belongs to the RNA-binding protein superfamily, according to the SCOP database
  
Multiple sequence alignment: A multiple alignment of 100 protein sequences was performed with Clustal Omega, a progressive alignment tool optimized for large data sets. The Clustal2 coloring scheme was applied to highlight conserved and variable residues in the alignment.
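Conservation in a multiple alignment can also be summarized numerically. The sketch below computes, for each column, the fraction of sequences that share the most common residue. The function name `column_conservation` and the three toy sequences are assumptions for illustration; this is not the scoring that Clustal Omega or the coloring scheme uses.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Per-column fraction of sequences that carry the most common non-gap residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != gap]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(alignment))
    return scores


# Toy alignment (not MSI2 data): first column fully conserved, the others 2 of 3
print(column_conservation(["MKV", "MKL", "MRL"]))
```

Columns with a score close to 1.0 correspond to the strongly colored, conserved positions in the alignment view.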

Open the structure of your protein in any 3D molecule visualization software:

Code used in python

# Install the required libraries
!pip install -q py3Dmol

# Import the required libraries
import py3Dmol
import os
from IPython.display import Image, display

# PDB code for MSI2
pdb_code = "6DBP"
pdb_filename = f"{pdb_code}.pdb"

# Download the PDB file for MSI2 (if it has not been downloaded yet)
if not os.path.exists(pdb_filename):
    !wget https://files.rcsb.org/download/{pdb_code}.pdb -O {pdb_filename}

# Create a view with py3Dmol
view = py3Dmol.view(query=f'pdb:{pdb_code}')
view.setStyle({'cartoon': {'color': 'spectrum'}})
view.show()

# Protein surface view
view_surface = py3Dmol.view(width=800, height=600)
view_surface.addModel(open(pdb_filename, 'r').read(), 'pdb')
view_surface.setStyle({'cartoon': {'color': 'spectrum'}})
view_surface.addSurface(py3Dmol.VDW, {'opacity': 0.4, 'color': 'gray'})
view_surface.zoomTo()
view_surface.show()

# Water molecule (HOH) view
view_water = py3Dmol.view(width=800, height=600)
view_water.addModel(open(pdb_filename, 'r').read(), 'pdb')
view_water.setStyle({"cartoon": {"color": "spectrum"}})
view_water.addStyle({"resn": "HOH"}, {"sphere": {"color": "red", "radius": 0.5}})
view_water.zoomTo()
view_water.show()

# Confirm that the views were generated
print("Visualizations generated: surface and water")

Visualize the protein as "cartoon", "ribbon" and "ball and stick".

[Figure: cartoon representation]

[Figure: ribbon representation]

[Figure: ball-and-stick representation]
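The three representations can be expressed as style dictionaries for py3Dmol. The sketch below only builds those dictionaries, so it runs without py3Dmol installed. The option names (`ribbon`, `radius`, `scale`) follow the 3Dmol.js style specification as I understand it and should be treated as assumptions to check against the installed version; `REPRESENTATIONS` and `style_for` are hypothetical names.

```python
# Style specifications in the dictionary form that py3Dmol forwards to 3Dmol.js.
# Option names are assumptions based on the 3Dmol.js style documentation.
REPRESENTATIONS = {
    # Cartoon: helices and strands drawn with secondary-structure shapes
    "cartoon": {"cartoon": {"color": "spectrum"}},
    # Ribbon: cartoon with constant width, ignoring secondary structure
    "ribbon": {"cartoon": {"ribbon": True, "color": "spectrum"}},
    # Ball and stick: thin bonds plus small spheres on every atom
    "ball_and_stick": {"stick": {"radius": 0.15}, "sphere": {"scale": 0.25}},
}


def style_for(name):
    """Return the style dictionary for a named representation."""
    if name not in REPRESENTATIONS:
        raise ValueError(f"unknown representation: {name}")
    return REPRESENTATIONS[name]


print(sorted(REPRESENTATIONS))
```

With a py3Dmol view that already holds the 6DBP model, calling `view.setStyle(style_for("ribbon"))` followed by `view.show()` would switch the representation.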

Color the protein by secondary structure. Does it have more helices or sheets?

# Install the required library
!pip install -q py3Dmol

# Import libraries
import py3Dmol
import os
import urllib.request

# PDB code for MSI2
pdb_code = "6DBP"
pdb_filename = f"{pdb_code}.pdb"

# Download the PDB file if it does not exist
if not os.path.exists(pdb_filename):
    urllib.request.urlretrieve(f"https://files.rcsb.org/download/{pdb_code}.pdb", pdb_filename)

# Count secondary-structure records in the PDB file.
# Each HELIX record is one helix; each SHEET record is one β-strand of a sheet.
helices = 0
strands = 0

with open(pdb_filename, 'r') as pdb_file:
    for line in pdb_file:
        if line.startswith("HELIX"):
            helices += 1
        elif line.startswith("SHEET"):
            strands += 1

# Create the py3Dmol view
view = py3Dmol.view(width=800, height=600)
view.addModel(open(pdb_filename, 'r').read(), 'pdb')

# Color by secondary structure: coil white, helices red, strands yellow
view.setStyle({'cartoon': {'color': 'white'}})
view.setStyle({'ss': 'h'}, {'cartoon': {'color': 'red'}})
view.setStyle({'ss': 's'}, {'cartoon': {'color': 'yellow'}})

# Adjust the view and display it
view.zoomTo()
view.show()

# Report the counts
print(f"The {pdb_code} protein has:")
print(f"🔴 {helices} α-helices (HELIX records)")
print(f"🟡 {strands} β-strands (SHEET records)")

# Compare helices and strands
if helices > strands:
    print("🟥 It has more α-helices than β-strands.")
elif strands > helices:
    print("🟨 It has more β-strands than α-helices.")
else:
    print("⚖️ It has the same number of α-helices and β-strands.")

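The 6DBP file contains: 🔴 6 α-helices (HELIX records). 🟡 14 β-strands (SHEET records). 🟨 It has more β-strands than α-helices. Each SHEET record describes a single strand rather than a whole sheet, and the counts cover every chain in the file.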
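In the PDB format each SHEET record describes one strand, and the sheet it belongs to is given by the sheet identifier in columns 12 to 14. The sketch below separates helices, strands, and distinct sheets on that basis. The function name and the demo records are assumptions made up for illustration; they are not taken from 6DBP.

```python
def summarize_secondary_structure(pdb_lines):
    """Count helices, strands, and distinct sheets from PDB HELIX/SHEET records."""
    helices = 0
    strands = 0
    sheet_ids = set()
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "HELIX":
            helices += 1
        elif record == "SHEET":
            strands += 1  # one SHEET record = one strand
            sheet_ids.add(line[11:14].strip())  # sheet identifier, columns 12-14
    return {"helices": helices, "strands": strands, "sheets": len(sheet_ids)}


# Made-up records in PDB fixed-column layout (illustration only)
demo_records = [
    "HELIX    1 AA1 SER A   34  PHE A   42  1                                   9",
    "SHEET    1 AA1 2 LYS A  22  GLY A  27  0",
    "SHEET    2 AA1 2 PHE A  64  PHE A  69 -1",
    "SHEET    1 AA2 2 ILE A  80  VAL A  84  0",
]
print(summarize_secondary_structure(demo_records))
```

To apply it to the downloaded structure, pass the open file object, for example `summarize_secondary_structure(open("6DBP.pdb"))`.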

Color the protein by residue type. What can you tell about the distribution of hydrophobic vs hydrophilic residues?

# Install the required library
!pip install -q py3Dmol

# Import libraries
import py3Dmol
import os
import urllib.request

# PDB code for MSI2
pdb_code = "6DBP"
pdb_filename = f"{pdb_code}.pdb"

# Download the PDB file if it does not exist
if not os.path.exists(pdb_filename):
    urllib.request.urlretrieve(f"https://files.rcsb.org/download/{pdb_code}.pdb", pdb_filename)

# Create the py3Dmol view
view = py3Dmol.view(width=800, height=600)
view.addModel(open(pdb_filename, 'r').read(), 'pdb')

# Define residue classes (glycine is not assigned to any class)
hydrophobic_residues = ["ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"]
hydrophilic_residues = ["SER", "THR", "CYS", "TYR", "ASN", "GLN"]  # polar, uncharged
negative_residues = ["ASP", "GLU"]
positive_residues = ["LYS", "ARG", "HIS"]

# Apply colors by residue class
view.setStyle({'cartoon': {'color': 'white'}})  # Neutral base color
view.addStyle({'resn': hydrophobic_residues}, {'cartoon': {'color': 'green'}})  # Hydrophobic
view.addStyle({'resn': hydrophilic_residues}, {'cartoon': {'color': 'blue'}})  # Hydrophilic
view.addStyle({'resn': negative_residues}, {'cartoon': {'color': 'red'}})  # Negative
view.addStyle({'resn': positive_residues}, {'cartoon': {'color': 'yellow'}})  # Positive

# Adjust the view and display it
view.zoomTo()
view.show()

# Count ATOM records per residue class.
# Each ATOM record is one atom, so these are atom counts, not residue counts.
hydrophobic_count = 0
hydrophilic_count = 0

with open(pdb_filename, 'r') as pdb_file:
    for line in pdb_file:
        if line.startswith("ATOM"):
            resn = line[17:20].strip()  # Residue name (columns 18-20)
            if resn in hydrophobic_residues:
                hydrophobic_count += 1
            elif resn in hydrophilic_residues:
                hydrophilic_count += 1

# Report the counts
print(f"The {pdb_code} file contains:")
print(f"🟢 {hydrophobic_count} atoms in hydrophobic residues")
print(f"🔵 {hydrophilic_count} atoms in hydrophilic residues")

# Comparison
if hydrophobic_count > hydrophilic_count:
    print("🔬 More atoms belong to hydrophobic residues, which is consistent with a hydrophobic core.")
elif hydrophilic_count > hydrophobic_count:
    print("💧 More atoms belong to hydrophilic residues, which is consistent with a water-exposed, soluble surface.")
else:
    print("⚖️ Atoms are evenly split between hydrophobic and hydrophilic residues.")

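The 6DBP file contains: 🟢 559 atoms in hydrophobic residues. 🔵 244 atoms in polar uncharged residues. The script counts ATOM records, so these are atom counts rather than residue counts, and glycine and the charged residues are not included in either total. 🔬 The excess of hydrophobic atoms is consistent with a hydrophobic core, although the counts alone do not show where those residues sit in the fold.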
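Because the script above adds one per ATOM record, a per-residue count needs each residue to be counted once. The sketch below does that by keying on chain, residue number, and insertion code from the PDB fixed columns. The function name, the `demo_atom_line` helper, and the toy records are assumptions for illustration, and no 6DBP numbers are implied.

```python
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"}
POLAR = {"SER", "THR", "CYS", "TYR", "ASN", "GLN"}


def count_residues_by_class(pdb_lines):
    """Count each residue once, using (chain, residue number, insertion code) as key."""
    seen = set()
    counts = {"hydrophobic": 0, "polar": 0, "other": 0}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        resn = line[17:20].strip()                        # residue name, columns 18-20
        residue_key = (line[21:22], line[22:26], line[26:27])  # chain, resSeq, iCode
        if residue_key in seen:
            continue  # further atoms of a residue already counted
        seen.add(residue_key)
        if resn in HYDROPHOBIC:
            counts["hydrophobic"] += 1
        elif resn in POLAR:
            counts["polar"] += 1
        else:
            counts["other"] += 1  # charged residues, glycine, anything else
    return counts


def demo_atom_line(serial, atom, resn, chain, resseq):
    """Hypothetical helper: build a minimal fixed-column ATOM line for the demo."""
    return f"ATOM  {serial:>5} {atom:<4} {resn:>3} {chain}{resseq:>4} "


# Toy records: LEU has two atoms but is counted as one residue
demo_lines = [
    demo_atom_line(1, "N", "LEU", "A", 1),
    demo_atom_line(2, "CA", "LEU", "A", 1),
    demo_atom_line(3, "N", "SER", "A", 2),
    demo_atom_line(4, "N", "GLY", "A", 3),
]
print(count_residues_by_class(demo_lines))
```

Running the function on the lines of the downloaded PDB file would give residue-level totals that can be compared directly with the sequence length of the crystallized domain.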

Visualize the surface of the protein. Does it have any "holes" (aka binding pockets)?

The panel shows the surface of the protein. Several cavities are observed along the structure, which could be potential binding sites. These cavities have the typical characteristics of binding pockets, suggesting that the protein could interact with other molecules at these sites. Further analysis is recommended to confirm their functional relevance.